DETAILED ACTION
This final office action is responsive to the amendment filed on November 5, 2025. Claims 1-20 are pending. Claims 1, 10, and 18 are independent.
Claim rejections under 35 USC §103 are maintained. See sections Claim Rejections – 35 USC §103 and Response to Arguments below.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-6, 8-15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Ye et al. (More Effective Distributed Deep Learning Using Staleness Based Parameter Updating), hereinafter Ye, in view of Ho et al. (More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server), hereinafter Ho, in view of Dube et al. (US20190303787), hereinafter Dube.
Regarding claim 1, Ye teaches a computer-implemented method:
detecting, by the parameter server, gradient sets from the plurality of worker machines, each worker machine generating a gradient set in a current iteration of a training data set by performing forward compute and backward compute operations on a portion of the training data set, and each gradient set of the gradient sets comprising a plurality of gradients; (Ye, introduction, paragraph 3: “The training data set is divided and assigned to the workers, and each worker is in charge of processing a data-subset.” – The training data being divided and assigned to the workers shows that each worker machine is using a portion of the training data set. Section 2.1, paragraph 3: “At the end of each iteration, a worker sends the gradients newly calculated to the parameter servers.” – The worker machine sending the gradient sets to the parameter server is analogous to the parameter server detecting the gradient sets of the worker machines. And section 3.2, paragraph 2: “Thirdly, the worker uses the samples and labels to perform forward propagation and gets the output value. In this process, each worker is calculating independently and doesn’t communicate with each other. Then, the worker use the Loss function to calculate the error between the network’s output value and the sample’s label, carries on the back propagation, and calculates the gradient ΔW_l of the parameter by layers.” – The worker machines are calculating the gradients by performing forward and backward propagation.)
detecting, by the parameter server, a lagging gradient set from a lagging worker machine, the lagging gradient set generated by the lagging worker machine in a prior iteration of the training data set; (Ye, Section 3.2, paragraph 2: “Finally, the worker sends the gradients ΔW_l and the parameter version number θ_l to the parameter server, waiting for the parameters to be updated.” – The gradients being sent with the parameter version number indicates that there are some lagging gradient sets being generated and detected during a prior iteration of the training data.)
generating, by the parameter server and subsequent to receiving gradient sets in a current iteration from a threshold number of the plurality of worker machines, (Ye, section 4.3, paragraph 1: “We fix the N in the N-soft protocol to 4, so parameters are updated as soon as the parameter servers receive four gradients from any four workers.” – Since the parameter server aggregates the gradients before updating the parameters (Ye, section 2.1, paragraph 3, quoted below), fixing N to 4 before updating the parameters indicates that N could be set to any threshold number and the gradient aggregation would occur before the parameters are updated.) aggregated gradients by performing weighted gradient aggregation based on the gradient sets from the current iteration and the weighted lagging gradient set, wherein the gradient aggregation combines the gradient sets from the current iteration with the lagging gradient set to generate the aggregated gradients; and (Ye, section 2.1, paragraph 3: “A parameter server collects gradients from workers and aggregates them and calculates new parameters based on distributed training protocols.” – Since there is a staleness measure associated with the gradient sets, which is analogous to the weighting of lagging gradients, when the parameter server aggregates the gradients it would include both gradients based on the most recent parameters on the parameter server and stale gradients based on older parameters on the worker machines.)
updating, by the parameter server, parameters of the neural network model based on the aggregated gradients to improve training convergence in the distributed neural network system. (Ye, Section 2.1, paragraph 3: “A parameter server collects gradients from workers and aggregates them and calculates new parameters based on distributed training protocols.” – Calculating new parameters is analogous to updating the parameters of the neural network model based on the aggregated gradients. And section 3.1, paragraph 2: “By forcing synchronization, the impact caused by the loss of partial gradients can be weakened, and the training convergence rate is improved.” – Forcing synchronization once a threshold number of gradient sets is received and aggregated at the parameter server is analogous to the method proposed here, which in turn improves convergence.)
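For illustration only, the threshold-based weighted aggregation mapped above can be sketched as follows. This is a hypothetical sketch, not code from the cited references or the application; the function name, the unit weight on current gradients, and the learning rate are all illustrative assumptions.

```python
def aggregate_and_update(params, current_grads, lagging, lr=0.1):
    """Sketch of a parameter-server update combining current and lagging gradients.

    current_grads: gradient lists from workers that finished the current iteration
                   (assumed to have reached the threshold count already).
    lagging: (gradient list, staleness weight) pairs from prior iterations.
    """
    n = len(params)
    agg = [0.0] * n
    total_weight = 0.0
    for g in current_grads:            # current-iteration gradients carry full weight 1
        for i in range(n):
            agg[i] += g[i]
        total_weight += 1.0
    for g, w in lagging:               # lagging gradients are down-weighted by staleness
        for i in range(n):
            agg[i] += w * g[i]
        total_weight += w
    avg = [a / total_weight for a in agg]          # average the aggregated gradients
    return [p - lr * a for p, a in zip(params, avg)]  # gradient-descent parameter update
```

With one parameter, two current gradients of 1.0, and one lagging gradient of 1.0 weighted 0.5, the weighted average is 1.0 and the parameter moves from 1.0 to 0.9.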
Ye does not explicitly teach:
maintaining, by a parameter server, synchronization of distributed neural network training across a distributed neural network system, the distributed neural network system including a plurality of worker machines operating in parallel;
determining, by the parameter server, a lagging gradient set weight for the lagging gradient set by calculating a weight value based on a numerical iteration difference between a first index of the current iteration and a second index of the prior iteration;
However, Ho teaches:
maintaining, by a parameter server, synchronization of distributed neural network training across a distributed neural network system, the distributed neural network system including a plurality of worker machines operating in parallel; (Ho, page 2, paragraph 2: “In this paper, we explore a path to such a system using the idea of a parameter server [22, 2], which we define as the combination of a shared key-value store that provides a centralized storage model (which may be implemented in a distributed fashion) with a synchronization model for reading/updating model values.” – The parameter server providing a centralized storage model indicates that the parameter server is maintaining the system. And page 2, paragraph 3: “Towards this end, we propose a parameter server using a Stale Synchronous Parallel (SSP) model of computation, for distributed ML algorithms that are parallelized into many computational workers (technically, threads) spread over many machines.” – Indicates the synchronization of distributed neural network training across a distributed neural network system includes a plurality of worker machines operating in parallel.)
Ho is considered analogous to the claimed invention as it is in the same field of endeavor, machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have modified Ye, which already teaches a distributed method for training a neural network but does not explicitly teach that the method is a synchronized distributed neural network including a plurality of worker machines operating in parallel, to include the teachings of Ho which does teach that the method is a synchronized distributed neural network including a plurality of worker machines operating in parallel since it “yields more convergence progress per minute, i.e. faster convergence.” (Ho, page 5, paragraph 1)
Ye and Ho do not explicitly teach:
determining, by the parameter server, a lagging gradient set weight for the lagging gradient set by calculating a weight value based on a numerical iteration difference between a first index of the current iteration and a second index of the prior iteration;
However, Dube teaches:
determining, by the parameter server, a lagging gradient set weight for the lagging gradient set by calculating a weight value based on a numerical iteration difference between a first index of the current iteration and a second index of the prior iteration; (Dube, paragraph 0033: “Moreover, the learning rate η is dependent upon the iteration, which may be expressed in terms of the time (e.g. η_t) or in terms of the jth iteration (e.g. η_j), which is to say, the learning rate is particular to the computed gradient.” and paragraph 0034: “Accordingly, the learning rate is inversely proportional to the true staleness, which is defined as the squared norm of the difference of the stale parameter and the current parameter.”)
Dube is considered analogous to the claimed invention as it is in the same field of endeavor, machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have modified Ye and Ho, which already teach the method of aggregating the gradient sets and the lagging gradient sets to train the neural network but do not explicitly teach weighting the lagging gradient sets according to the differences between iterations, to include the teachings of Dube which does teach weighting the lagging gradient sets according to the differences between iterations since “more stale gradients may have less of an impact on parameter updating while less stale gradients may have more of an impact on parameter updating.” (Dube, paragraph 0014)
Regarding claim 2, Ye, Ho, and Dube teach the method of claim 1, as cited above.
Ye further teaches:
averaging the aggregated gradients to generate an averaged gradient set; and updating a plurality of weights of the neural network model using the averaged gradient set. (Ye, Section 2.2, paragraph 2: “The parameter server averages the gradients over λ workers and updates the parameters according to Eq. 1, where α is the learning rate.” – Wherein the parameters of the neural network include the weights of the model.)
Regarding claim 3, Ye, Ho, and Dube teach the method of claim 2, as cited above.
Ye and Ho do not explicitly teach:
determining a granularity level for the lagging gradient set weight to control usage of the lagging gradient set and improve testing accuracy within a given model training time; and
performing the gradient aggregation using the plurality of granularity level to tune the weighted gradient sets and aggregation based on an importance of the lagging gradient set weight during the current iteration.
However, Dube further teaches:
determining a granularity level for the lagging gradient set weight to control usage of the lagging gradient set and improve testing accuracy within a given model training time; and performing the gradient aggregation using the plurality of granularity level to tune the weighted gradient sets and aggregation based on an importance of the lagging gradient set weight during the current iteration. (Dube, paragraph 0035: “Thus, according to exemplary embodiments of the present invention, the central parameter server may calculate the learning rate η_t or η_j, for example as described above (Step S34), and then the central parameter server calculates the updated parameters W_{t+1} according to the processed gradient ∇F and the calculated learning rate, for example as described above (Step S35).” And paragraph 0028: “In this way, the present approach may have improved speed, improved efficiency, and improved efficacy.” – The learning rate is analogous to the gradient set weight, and the parameter server calculating different learning rates is analogous to determining the granularity level for the gradient set weight, which is then used to calculate the parameters, or gradients, for W_{t+1}, where t+1 is the current iteration. Using the approach described, there is an improvement of efficacy, which indicates improved accuracy.)
Regarding claim 4, Ye, Ho, and Dube teach the method of claim 3, as cited above.
Ye and Ho do not explicitly teach:
the lagging gradient set weight is determined based on an index of the current iteration and an index of the prior iteration.
However, Dube further teaches:
the lagging gradient set weight is determined based on an index of the current iteration and an index of the prior iteration. (Dube, paragraph 0033: “Moreover, the learning rate η is dependent upon the iteration, which may be expressed in terms of the time (e.g. η_t) or in terms of the jth iteration (e.g. η_j), which is to say, the learning rate is particular to the computed gradient. As the learning rate may be computed so as to be inversely related to the extent to which parameter vectors have changed, for example, the following equation may be used: η_j = min( C / ‖w_j − w_τ‖₂² , η_max ). Here C is a predetermined constant and τ_j ≤ j.” – The denominator of the equation gives a difference between the gradient set based on the index of the current iteration, j, and the index of a prior iteration, τ.)
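For illustration only, Dube’s staleness-dependent learning rate quoted above can be sketched as follows. This is a hypothetical sketch; the function name and the values of C and η_max are illustrative assumptions, not taken from Dube.

```python
def dube_learning_rate(w_current, w_stale, C=1.0, eta_max=0.5):
    """Sketch of a learning rate inversely proportional to the 'true staleness',
    i.e. the squared L2 norm of the difference between the current and stale
    parameter vectors, capped at eta_max."""
    staleness = sum((a - b) ** 2 for a, b in zip(w_current, w_stale))
    if staleness == 0.0:
        return eta_max            # no parameter drift: use the maximum rate
    return min(C / staleness, eta_max)
```

For example, with C = 1.0 and η_max = 0.5, a squared parameter drift of 4 gives a rate of 0.25, while identical parameter vectors give the capped rate of 0.5.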
Regarding claim 5, Ye, Ho, and Dube teach the method of claim 4, as cited above.
Ye and Ho do not explicitly teach:
the lagging gradient set weight is 1/(1+Δ), where Δ is a difference between the index of the current iteration and the index of the prior iteration.
However, Dube teaches:
the lagging gradient set weight is 1/(1+Δ), where Δ is a difference between the index of the current iteration and the index of the prior iteration. (Dube, paragraph 0034: “Accordingly, the learning rate is inversely proportional to the true staleness, which is defined as the squared norm of the difference of the stale parameter and the current parameter.” – Dube presents the gradient set weight as being inversely proportional to the staleness, which is the difference between the current iteration and the iteration from which the gradients come (a prior iteration). Any such formula that is inversely proportional to the difference between the index of the current iteration and the index of the iteration from which the gradient comes would be suitable to accomplish this weighting.)
Regarding claim 6, Ye, Ho, and Dube teach the method of claim 4, as cited above.
Ye and Ho do not explicitly teach:
the lagging gradient set weight is 1/a^Δ, where Δ is a difference between the index of the current iteration and the index of the prior iteration and a is an integer greater than 1.
However, Dube further teaches:
the lagging gradient set weight is 1/a^Δ, where Δ is a difference between the index of the current iteration and the index of the prior iteration and a is an integer greater than 1. (Dube, paragraph 0034: “Accordingly, the learning rate is inversely proportional to the true staleness, which is defined as the squared norm of the difference of the stale parameter and the current parameter.” – Dube presents the gradient set weight as being inversely proportional to the staleness, which is the difference between the current iteration and the iteration from which the gradients come (a prior iteration). Any such formula that is inversely proportional to the difference between the index of the current iteration and the index of the iteration from which the gradient comes would be suitable to accomplish this weighting.)
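For illustration only, the two claimed staleness weights can be computed as follows. This is a hypothetical sketch; claim 6’s formula is read here as 1/a^Δ (an exponent, an assumption since the superscript appears to have been lost in extraction), and the choice a = 2 is illustrative.

```python
def weight_reciprocal(delta):
    """Claim 5 style weight: 1/(1 + Δ), where Δ is the iteration-index difference."""
    return 1.0 / (1.0 + delta)

def weight_exponential(delta, a=2):
    """Claim 6 style weight: 1/a**Δ, with a an integer greater than 1."""
    return 1.0 / (a ** delta)
```

Both formulas equal 1 for a current-iteration gradient (Δ = 0) and shrink as the lag grows, e.g. Δ = 3 gives 0.25 under claim 5 and 0.125 under claim 6 with a = 2.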
Regarding claim 8, Ye, Ho, and Dube teach the method of claim 1, as cited above.
Ye further teaches:
receiving the plurality of gradient sets from the plurality of worker machines via corresponding push operations, subsequent to completion of the forward compute and the backward compute operations at each worker machine of the plurality of worker machines during the current iteration. (Ye, section 3.2, paragraph 2: “Thirdly, the worker uses the samples and labels to perform forward propagation and gets the output value. In this process, each worker is calculating independently and doesn’t communicate with each other. Then, the worker use the Loss function to calculate the error between the network’s output value and the sample’s label, carries on the back propagation, and calculates the gradient ΔW_l of the parameter by layers. Finally, the worker sends the gradients ΔW_l and the parameter version number θ_l to the parameter server, waiting for the parameters to be updated.” – Sending the gradients to the parameter server would be a push operation. The forward and backward propagation are the forward compute and backward compute operations.)
Regarding claim 10, claim 10 has all the same limitations as claim 1, which are taught by Ye, Ho, and Dube; see claim 1 above.
Ye further teaches:
a memory storing instructions; and one or more processors in communication with the memory, the one or more processors executing the instructions to: (Ye, section 4.1, paragraph 1: “Experiments are conducted at the partition of CPU pool of Tianhe-2. Each node of this partition contains two 12-core IntelXeon E5-2692v2 processors with a clock frequency of 2.2 GHz. A single CPU has a theoretical double-precision floating point peak performance of 211.2Gflops/s, and the computing node peak performance up to 3.432Tflops/s. The memory capacity of each node is 64 GB and nodes are connected through Intel’s Ivy Bridge micro-architecture built-in PCI-E 2.0, which a single lane bandwidth of 10 Gbps, providing a powerful speed support for cross-node data communications.”)
Regarding claim 11, Ye, Ho, and Dube teach the system of claim 10, as cited above. Claim 11 additionally has the same limitations of claim 2 which are taught by Ye, Ho, and Dube – see claim 2 above.
Regarding claim 12, Ye, Ho, and Dube teach the system of claim 10, as cited above. Claim 12 additionally has the same limitations of claim 3 which are taught by Ye, Ho, and Dube – see claim 3 above.
Regarding claim 13, Ye, Ho, and Dube teach the system of claim 12, as cited above. Claim 13 additionally has the same limitations of claim 4 which are taught by Ye, Ho, and Dube – see claim 4 above.
Regarding claim 14, Ye, Ho, and Dube teach the system of claim 13, as cited above. Claim 14 additionally has the same limitations of claim 5 which are taught by Ye, Ho, and Dube – see claim 5 above.
Regarding claim 15, Ye, Ho, and Dube teach the system of claim 13, as cited above. Claim 15 additionally has the same limitations of claim 6 which are taught by Ye, Ho, and Dube – see claim 6 above.
Regarding claim 17, Ye, Ho, and Dube teach the system of claim 10, as cited above. Claim 17 additionally has the same limitations of claim 8 which are taught by Ye, Ho, and Dube – see claim 8 above.
Regarding claim 18, claim 18 has all the same limitations as claim 1, which are taught by Ye, Ho, and Dube; see claim 1 above.
Ye additionally teaches:
A non-transitory computer-readable medium storing computer instructions for training a neural network model, wherein the instructions when executed by one or more processors, cause the one or more processors to perform steps of: (Ye, section 4.1, paragraph 1: “Experiments are conducted at the partition of CPU pool of Tianhe-2.”)
Regarding claim 19, Ye, Ho, and Dube teach the non-transitory computer-readable medium of claim 18, as cited above. Claim 19 additionally has the same limitations of claim 2 which are taught by Ye, Ho, and Dube – see claim 2 above.
Regarding claim 20, Ye, Ho, and Dube teach the non-transitory computer-readable medium of claim 18, as cited above. Claim 20 additionally has the same limitations of claim 3 which are taught by Ye, Ho, and Dube – see claim 3 above.
Claims 7 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Ye in view of Ho in view of Dube in view of Gupta et al. (Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study), hereinafter Gupta.
Regarding claim 7, Ye, Ho, and Dube teach the method of claim 1, as cited above.
Ye and Ho do not explicitly teach:
performing the gradient aggregation when a number of worker machines of the plurality of worker machines from which gradient sets of the plurality of gradient sets are received reaches a threshold number of worker machines, the threshold number of worker machines configured by a deep learning training architecture function management module; and
storing the gradient sets from the current iteration and weighted lagging gradient sets for use in subsequent iterations.
Dube further teaches:
storing the gradient sets from the current iteration and weighted lagging gradient sets for use in subsequent iterations. (Dube, paragraph 0021: “As discussed above, the state of the parameter server at the time of job assignment may be stored for later use in assessing staleness either by the central processing server or the model learner.” – the state of the parameter server at the time of the job is analogous to storing the gradient sets from the current iteration and weighted lagging gradient sets as the parameter server has received and aggregated the gradient sets, as taught in claim 1, and therefore the state of the parameter server would include this information. Assessing the staleness by the central processing server or the model learner is indicative that the information is being used in subsequent iterations.)
Ye, Ho, and Dube do not explicitly teach:
performing the gradient aggregation when a number of worker machines of the plurality of worker machines from which gradient sets of the plurality of gradient sets are received reaches a threshold number of worker machines, the threshold number of worker machines configured by a deep learning training architecture function management module; and
However, Gupta teaches:
performing the gradient aggregation when a number of worker machines of the plurality of worker machines from which gradient sets of the plurality of gradient sets are received reaches a threshold number of worker machines, the threshold number of worker machines configured by a deep learning training architecture function management module; and (Gupta, page 3, paragraph 2: “n-softsync protocol: Each learner pulls the weights from the parameter server, calculates the gradients and pushes the gradients to the parameter server. The parameter server updates the weights after collecting at least c = ⌊λ/n⌋ gradients. The splitting parameter n can vary from 1 to λ. The n-softsync weight update rule is given by:

c = ⌊λ/n⌋
∇θ^(k)(i) = (1/c) Σ_{l=1}^{c} ∇θ_l^(k), l_j ∈ {L_1, ..., L_λ}
θ^(k)(i+1) = θ^(k)(i) − α ∇θ^(k)(i)”

– The n-softsync protocol is analogous to the deep learning training architecture function management module. While c is analogous to the threshold number of workers, the update rule is configured by the protocol.)
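For illustration only, the n-softsync update rule quoted above can be sketched as follows. This is a hypothetical sketch; λ, n, and the learning rate α are illustrative choices, and the queue-based structure is an assumption about how arriving gradients would be buffered.

```python
def n_softsync_update(theta, gradient_queue, lam, n, alpha=0.01):
    """Sketch of n-softsync: update parameters once c = floor(lam / n)
    gradients have been collected from any workers.

    gradient_queue: list of gradient lists pushed by workers, oldest first.
    Returns (possibly updated) parameters and the remaining queue."""
    c = lam // n                          # threshold number of gradients
    if len(gradient_queue) < c:
        return theta, gradient_queue      # below threshold: no update yet
    batch, rest = gradient_queue[:c], gradient_queue[c:]
    # average the c collected gradients, then take a gradient-descent step
    avg = [sum(g[i] for g in batch) / c for i in range(len(theta))]
    theta_new = [t - alpha * a for t, a in zip(theta, avg)]
    return theta_new, rest
```

With λ = 4 and n = 2, the threshold is c = 2, so the update fires as soon as two gradients have arrived, leaving any later arrivals queued for the next step.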
Gupta is considered analogous to the claimed invention as it is in the same field of endeavor, machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to have modified Ye, Ho, and Dube, which already teach receiving gradient sets from a threshold number of workers and then aggregating the gradients but do not explicitly teach that the threshold number of workers is configured by a deep learning training architecture, to include the teachings of Gupta, which does teach that the threshold number of workers is configured by a deep learning training architecture, in order to “effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy.” (Gupta, abstract)
Regarding claim 16, Ye, Ho, and Dube teach the system of claim 10, as cited above. Claim 16 additionally has the same limitations of claim 7 which are taught by Ye, Ho, Dube, and Gupta – see claim 7 above.
Response to Arguments
Applicant's arguments filed November 5, 2025 have been fully considered but they are not persuasive.
In response to Applicant’s argument at the bottom of page 9 of Remarks that “Ho’s SSP model is different from Applicant’s claimed distributed training approach,” Applicant contrasts Ho’s teachings with the “maintaining, by a parameter server, synchronization…” and “generating, by the parameter server and subsequent to receiving…” limitations. First, examiner notes that Ho was not relied upon to teach the generating limitation; this was taught by Ye as the primary reference. For the maintaining limitation, Ho teaches a parameter server that maintains the distributed neural network training across a plurality of machines working in parallel. Ho also discloses forcing synchronization when the staleness threshold is reached (see, e.g., section 2). Applicant alleges that Ho’s teaching differs from the cited features of the claim but provides no explanation of how it differs.
In response to Applicant’s argument on page 10 of Remarks that Ye’s teachings are different from Applicant’s approach: in the “generating” limitation that Ye is relied upon to teach, the aggregated gradients are generated once the parameter server receives a threshold number of gradients from the worker machines. Similarly, the N-soft protocol of Ye has a fixed N which, once the parameter server receives the gradients from N machines, is used as a threshold to generate the aggregated gradients. Additionally, the staleness measure associated with the gradients shows that the weighted gradient aggregation includes the gradients from the current iteration, from the N workers, as well as the older gradients that are assigned the staleness measure.
In response to Applicant’s argument on page 11 that Dube does not cure the failings of Ye and Ho: Dube is relied upon to teach the lagging gradient set weight based on a numerical iteration difference between a first index of the current iteration and a second index of the prior iteration. Applicant states that “Dube compares parameters from different time points (job assignment time versus current time).” However, Dube states that the learning rate “may be expressed in terms of time or in terms of the jth iteration,” thus teaching a difference between iterations as “the difference of the stale parameter and the current parameter” using the iterations of each.
Applicant states that their use of the numerical iteration index involves “straightforward arithmetic between iteration counters.” However, examiner notes that the claim recites “calculating a weight value based on a numerical iteration difference.” The weight value being based on the numerical iteration difference between the iterations is taught by Dube’s “difference of the stale parameter and current parameter,” regardless of the further calculations done for the learning rate.
Applicant states that their “weight is applied during gradient aggregation to combine multiple gradients from different workers.” Examiner notes that Dube was relied upon to teach determining the lagging gradient set weight, while Ye already teaches the gradient aggregation from the generating limitation. Applicant argues on page 13 that Dube does not teach “Applicant’s synchronous, threshold-based aggregation protocol.” This argument attacks Dube in isolation; the teachings of Ye, Ho, and Dube together teach the threshold-based aggregation protocol that uses the learning rate of Dube to weight the lagging gradients prior to the aggregation.
Applicant’s further characterization of the cited references as being asynchronous, and thus different from the claimed synchronization, is unpersuasive. The claimed invention, while stating that it is maintaining synchronization, has a commonality of features with the prior art of record. Namely, the prior art has the aspect of coordination between a parameter server and worker machines, where the parameter server receives gradients from worker machines, with a threshold number of the gradients being gradients from the current iteration, while the gradients from the other worker machines are incorporated into the updated gradients by weighting them according to their staleness (or lag). Applicant appears to be using the plain meaning of “synchronous” and “asynchronous” and stating that the claimed invention is different on that basis. However, the synchronous training of the claimed invention is not conventional synchronous training, as Applicant pointed out in the Remarks dated June 18, 2025, but rather a modified synchronous training that does not wait for all worker machines to complete the iteration before forcing an update to the gradients on the parameter server, as described in paragraph 0083 of the specification. This is similar to what is being done in Ye (section 2.1, paragraph 3 and section 4.3, paragraph 1), which is used to teach the generating limitation.
Therefore, claim rejections under 35 USC §103 are maintained. See section Claim Rejections – 35 USC §103 above.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JACQUELINE MEYER whose telephone number is (703)756-5676. The examiner can normally be reached M-F 8:00 am - 4:30 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tamara Kyle can be reached at 571-272-4241. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.C.M./Examiner, Art Unit 2144
/TAMARA T KYLE/Supervisory Patent Examiner, Art Unit 2144