Last updated: May 29, 2026

Application No. 18/758,347

FAULT IDENTIFICATION AND RECOVERY FOR DISTRIBUTED TRAINING

Non-Final OA §102

Filed

Jun 28, 2024

Examiner

LIN, KATHERINE Y

Art Unit

2113

Tech Center

2100 — Computer Architecture & Software

Assignee

Lemon Inc.

OA Round

2 (Non-Final)

Interview Optional

The +7.0% interview lift is below the 15.0% threshold. A written response is recommended.

Based on 353 resolved cases, 2023–2026

Examiner Intelligence

LIN, KATHERINE Y

Grants 91% — above average

Career Allowance Rate

322 granted / 353 resolved

+36.2% vs TC avg

+7.0% (moderate)

Interview Lift

Typical timeline

2y 3m

Avg Prosecution

18 currently pending

Career history

384

Total Applications

across all art units

Statute-Specific Performance

§101: 19.5% (-20.5% vs TC avg)

§103: 49.1% (+9.1% vs TC avg)

§102: 18.9% (-21.1% vs TC avg)

§112: 4.4% (-35.6% vs TC avg)

Tech Center average is an estimate. Based on career data from 353 resolved cases.

Office Action

§102

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claim(s) 1-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Jiang et al. (MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs).

Jiang discloses: 
1. A method comprising: 
obtaining, during a distributed training task performed across a plurality of computing nodes, at least one heartbeat message from the plurality of computing nodes, each computing node including multiple graphics processing unit (GPU) workers; (p 5: 3.4; p 6: 4.1)

detecting, based on the at least one heartbeat message, an abnormal status of the distributed training task; (p 6: 4.1)

commanding the plurality of computing nodes to run at least one self-check diagnostics test; (p 6: 4.1)

identifying, based on results of the at least one self-check diagnostics test, at least one faulty node from the plurality of computing nodes; and (p 6: 4.1)

replacing the at least one faulty node with an equivalent number of healthy computing nodes that have passed the at least one self-check diagnostics test. (p 6: 4.1)
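
For readers mapping claim 1 to the cited passages, the recited steps (heartbeat collection, abnormality detection, self-check, faulty-node identification, replacement) can be sketched as below. This is an illustrative sketch only, not code from Jiang or from the application: the `Heartbeat` fields, the 30-second timeout, and the `self_check` callback are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Heartbeat:
    node_id: str
    timestamp: float         # seconds, as reported by the node
    has_error: bool = False  # e.g. an error keyword seen in the training log


def detect_abnormal(heartbeats: Dict[str, Heartbeat], now: float,
                    timeout_s: float = 30.0) -> bool:
    """Treat the task as abnormal if any node reports an error or goes silent.

    The timeout value is an assumption made for this sketch.
    """
    return any(hb.has_error or (now - hb.timestamp) > timeout_s
               for hb in heartbeats.values())


def recover(nodes: List[str], spares: List[str],
            self_check: Callable[[str], bool]) -> List[str]:
    """Run the self-check on every node, then swap each faulty node for a
    spare that has itself passed the same self-check."""
    faulty: Set[str] = {n for n in nodes if not self_check(n)}
    healthy_spares = [s for s in spares if self_check(s)]
    if len(healthy_spares) < len(faulty):
        raise RuntimeError("not enough healthy spare nodes")
    replacements = iter(healthy_spares)
    # Keep the node count unchanged: one healthy spare per evicted node.
    return [next(replacements) if n in faulty else n for n in nodes]
```

The sketch keeps the node count constant, which corresponds to the "equivalent number" language in the last step of the claim.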

2. The method of claim 1, wherein the at least one heartbeat message includes at least one of: 
output and error logs of a training process running on a corresponding computing node; and (p 6: fig 5; p 7: 4.2)
a Remote Direct Memory Access (RDMA) traffic metric indicating network utilization and efficiency among the plurality of computing nodes. (p 7: 4.2)

3. The method of claim 1, wherein detecting the abnormal status of the distributed training task comprises: 
performing first monitoring to assess an overall health status and to rule out common configuration impacts on the distributed training task; and (p 7: 4.2)
performing second monitoring to determine whether there is network congestion among the plurality of computing nodes and whether a data transfer speed of data parallelism and pipe parallelism has reached its physical limit.  (p 7: 4.2)

4. The method of claim 1, wherein the at least one self-check diagnostics test comprises at least one of: 
a first test to diagnose potential bottlenecks associated with RDMA network interface cards (RNICs) in an intra-host network of a computing node; or (p 7: 4.3)
a second test to identify potential faults in GPU communication within a single computing node and among the plurality of computing nodes. (p 7: 4.3)

5. The method of claim 1, further comprising: suspending, upon detection of the abnormal status of the distributed training task, the distributed training task across the plurality of computing nodes. (p 6: 4.1)

6. The method of claim 1, wherein replacing the at least one faulty node with an equivalent number of healthy computing nodes that have passed the at least one self-check diagnostics test comprises: 
evicting the at least one faulty node from the distributed training task; and (p 6: 4.1)
loading model weights and optimizer states from the most recent checkpoint into the healthy computing nodes. (p 7: 4.4)

7. The method of claim 6, further comprising: 
at a checkpoint, cause each GPU worker of a computing node to write its on-chip states including the model weights and the optimizer states into a memory of the computing node; and (p 7: 4.4)
cause the computing node to asynchronously transfer the on-chip states from the memory to a distributed file system.  (p 7: 4.4)
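
The two-stage checkpoint recited in claim 7 (a fast write into node memory, then an asynchronous transfer to a distributed file system) can be illustrated with the following sketch. It is a simplified stand-in under stated assumptions: the class name `TwoStageCheckpointer` is hypothetical, a local directory plays the role of the distributed file system, a Python dict plays the role of host memory, and states are JSON-serializable dictionaries.

```python
import copy
import json
import threading
from pathlib import Path
from typing import Dict, List


class TwoStageCheckpointer:
    def __init__(self, remote_dir: Path) -> None:
        self.remote_dir = remote_dir            # stand-in for a distributed file system
        self.host_buffer: Dict[int, dict] = {}  # stand-in for node (host) memory
        self._pending: List[threading.Thread] = []

    def snapshot(self, worker_rank: int, state: dict) -> None:
        """Stage 1: blocking copy of a worker's state into host memory."""
        self.host_buffer[worker_rank] = copy.deepcopy(state)

    def flush_async(self, step: int) -> None:
        """Stage 2: persist the staged states on a background thread so the
        caller can return to training immediately."""
        staged = dict(self.host_buffer)

        def _write() -> None:
            self.remote_dir.mkdir(parents=True, exist_ok=True)
            for rank, state in staged.items():
                path = self.remote_dir / f"step{step}_rank{rank}.json"
                path.write_text(json.dumps(state))

        thread = threading.Thread(target=_write)
        thread.start()
        self._pending.append(thread)

    def wait(self) -> None:
        """Block until all background writes have finished."""
        for thread in self._pending:
            thread.join()
        self._pending.clear()
```

Because `snapshot` takes a deep copy, later updates to the live state do not alter what the background thread eventually writes.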

8. The method of claim 7, wherein loading model weights and optimizer states from the most recent checkpoint into the healthy computing nodes comprises: 
for a group of GPU workers that share a same state partition of the distributed file system, designating a single GPU worker in the group to read the shared state partition from the distributed file system; and (p 7: 4.4)
causing the single GPU worker to broadcast the shared state partition to all other GPU workers in the group. (p 7: 4.4)
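
The read-once-then-broadcast recovery path in claim 8 can be sketched as follows. This is a pure-Python simulation written for illustration, not an implementation from the cited reference: `load_with_broadcast`, the `groups` mapping, and the `read_partition` callback are hypothetical, and an in-process assignment stands in for a real collective broadcast among GPU workers.

```python
from typing import Callable, Dict, List


def load_with_broadcast(groups: Dict[str, List[int]],
                        read_partition: Callable[[str], bytes]) -> Dict[int, bytes]:
    """For each group of workers sharing a state partition, one designated
    worker reads from storage and the rest receive a copy from it."""
    loaded: Dict[int, bytes] = {}
    for partition, ranks in groups.items():
        reader = min(ranks)               # assumed rule for picking the reader
        data = read_partition(partition)  # the only file-system read per group
        loaded[reader] = data
        for rank in ranks:
            if rank != reader:
                loaded[rank] = data       # stand-in for a collective broadcast
    return loaded
```

With two partitions shared by four workers each, the file system is read twice rather than eight times, which is the effect the claim language describes.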

9. The method of claim 1, further comprising: 
collecting data regarding execution time of a code segment on a set of GPU workers; and (fig 7; p 8: 5.1; p 9: 5.2)
identifying a computing node, by visualizing the collected data, that includes a GPU worker with slower performance as a faulty node.  (fig 7; p 8: 5.1; p 9: 5.2)

10. The method of claim 9, wherein visualizing the collected data comprises: 
generating a heat map that shows time consumption differences between the set of GPU workers. (p 8: 5.1)
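
The timing collection and heat-map comparison in claims 9 and 10 can be illustrated numerically. The sketch below is an assumption-laden example, not the method of the cited reference: it normalizes each worker's per-step time by the per-step median across workers (the heat-map cell values) and flags workers whose median cell exceeds a hypothetical threshold of 1.1.

```python
from statistics import median
from typing import Dict, List


def slowdown_matrix(times: Dict[int, List[float]]) -> Dict[int, List[float]]:
    """Heat-map cells: each worker's time for a step divided by the median
    time across all workers for that step."""
    ranks = sorted(times)
    n_steps = len(times[ranks[0]])
    step_medians = [median(times[r][s] for r in ranks) for s in range(n_steps)]
    return {r: [times[r][s] / step_medians[s] for s in range(n_steps)]
            for r in ranks}


def find_stragglers(times: Dict[int, List[float]],
                    threshold: float = 1.1) -> List[int]:
    """Flag workers that are consistently slower than their peers.

    The threshold is an assumption chosen for this sketch.
    """
    cells = slowdown_matrix(times)
    return [r for r, row in cells.items() if median(row) > threshold]
```

A plotting library could render the matrix as a color grid; the numeric form is shown here so the comparison logic is visible without any display dependency.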

11. The method of claim 9, wherein visualizing the collected data comprises: 
generating an event timeline on the set of GPU workers in a trace format.  (p 8: 5.1)

12. The method of claim 9, wherein identifying at least one GPU worker with slower performance comprises: 
displaying a logical topology of the GPU workers with respect to at least one of data parallelism, pipeline parallelism, or tensor parallelism.  (p 9: 5.2)

Claim(s) 13-19 is/are rejected as being the device implemented by the method of claim(s) 1-4, 6-8, and is/are rejected on the same grounds.

Claim(s) 20 is/are rejected as being the medium implemented by the method of claim(s) 1, and is/are rejected on the same grounds.

Response to Remarks
The amendments overcome the objections/rejections to the claim(s) under informalities and 112(b).

Conclusion
Applicant's submission of an information disclosure statement under 37 CFR 1.97(c) with the timing fee set forth in 37 CFR 1.17(p) on 7-7-2025 prompted the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 609.04(b). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). 
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KATHERINE LIN whose telephone number is (571)431-0706. The examiner can normally be reached Monday-Friday; 8 a.m. - 5 p.m. EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bryce Bonzo, can be reached at (571) 272-3655. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/KATHERINE LIN/Primary Examiner, Art Unit 2113


Prosecution Timeline

Jun 28, 2024

Application Filed

Jul 09, 2025

Non-Final Rejection mailed — §102

Oct 09, 2025

Response Filed

Jan 16, 2026

Final Rejection mailed — §102

Mar 16, 2026

Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/479,497

Patent 12625760

System and method for machine-to-machine re-imaging

2y 7m to grant Granted May 12, 2026

18/357,603

Patent 12619486

Mechanism of Enabling Fault Handling with PCIe Re-timer

2y 9m to grant Granted May 05, 2026

18/914,267

Patent 12613772

MEMORY DEVICE AND OPERATING METHOD THEREOF

1y 6m to grant Granted Apr 28, 2026

18/850,469

Patent 12608292

MANAGEMENT METHOD AND APPARATUS AND ATE TEST SYSTEM

1y 6m to grant Granted Apr 21, 2026

18/237,204

Patent 12596953

QUANTUM ERROR CORRECTION USING NEURAL NETWORKS

2y 7m to grant Granted Apr 07, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

2-3

Expected OA Rounds

91%

Grant Probability

98%

With Interview (+7.0%)

2y 3m (~4m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 353 resolved cases by this examiner. Grant probability derived from career allowance rate.
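
The displayed projections can be reproduced from the figures on this page. The sketch below assumes the interview lift is applied as additive percentage points, which is consistent with the 91% and 98% values shown but is not stated by the source; the function names are hypothetical.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage."""
    return 100.0 * granted / resolved


def with_interview(base_pct: float, lift_pct: float) -> float:
    """Assumed additive lift in percentage points, capped at 100%."""
    return min(100.0, base_pct + lift_pct)


base = allowance_rate(322, 353)          # 322 granted / 353 resolved
print(round(base))                       # 91
print(round(with_interview(base, 7.0)))  # 98
```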