Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Jiang et al. (MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs).
Jiang discloses:
1. A method comprising:
obtaining, during a distributed training task performed across a plurality of computing nodes, at least one heartbeat message from the plurality of computing nodes, each computing node including multiple graphics processing unit (GPU) workers; (p 5: 3.4; p 6: 4.1)
detecting, based on the at least one heartbeat message, an abnormal status of the distributed training task; (p 6: 4.1)
commanding the plurality of computing nodes to run at least one self-check diagnostics test; (p 6: 4.1)
identifying, based on results of the at least one self-check diagnostics test, at least one faulty node from the plurality of computing nodes; and (p 6: 4.1)
replacing the at least one faulty node with an equivalent number of healthy computing nodes that have passed the at least one self-check diagnostics test. (p 6: 4.1)
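For illustration only, the sequence of steps recited in claim 1 can be sketched as follows. The sketch is not taken from Jiang or from the application: the names `Heartbeat`, `detect_abnormal`, `recover`, the 30-second timeout, and the `self_check` callback are all hypothetical, and the self-check is reduced to a pass/fail function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Heartbeat:
    """Hypothetical heartbeat record reported by one computing node."""
    node_id: str
    timestamp: float      # time the heartbeat was sent, in seconds
    error_log: str = ""   # non-empty when the training process logged an error


def detect_abnormal(nodes: List[str], heartbeats: Dict[str, Heartbeat],
                    now: float, timeout: float = 30.0) -> bool:
    """Return True if any node is silent, stale, or reporting an error."""
    for node in nodes:
        hb = heartbeats.get(node)
        if hb is None or now - hb.timestamp > timeout or hb.error_log:
            return True
    return False


def recover(nodes: List[str], heartbeats: Dict[str, Heartbeat], now: float,
            self_check: Callable[[str], bool],
            spares: List[str]) -> Tuple[List[str], List[str]]:
    """Run self-checks on an abnormal task and swap out the failing nodes."""
    if not detect_abnormal(nodes, heartbeats, now):
        return list(nodes), []
    # Every node runs the diagnostics; nodes that fail are treated as faulty.
    faulty = [n for n in nodes if not self_check(n)]
    # Only spares that pass the same diagnostics may be used as replacements.
    healthy_spares = [s for s in spares if self_check(s)]
    if len(healthy_spares) < len(faulty):
        raise RuntimeError("not enough healthy spare nodes")
    replacements = iter(healthy_spares)
    new_nodes = [next(replacements) if n in faulty else n for n in nodes]
    return new_nodes, faulty
```

In this sketch the number of replacement nodes always equals the number of evicted nodes, which mirrors the "equivalent number" language of the final step.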
2. The method of claim 1, wherein the at least one heartbeat message includes at least one of:
output and error logs of a training process running on a corresponding computing node; and (p 6: fig 5; p 7: 4.2)
a Remote Direct Memory Access (RDMA) traffic metric indicating network utilization and efficiency among the plurality of computing nodes. (p 7: 4.2)
3. The method of claim 1, wherein detecting the abnormal status of the distributed training task comprises:
performing first monitoring to assess an overall health status and to rule out common configuration impacts on the distributed training task; and (p 7: 4.2)
performing second monitoring to determine whether there is network congestion among the plurality of computing nodes and whether a data transfer speed of data parallelism and pipe parallelism has reached its physical limit. (p 7: 4.2)
4. The method of claim 1, wherein the at least one self-check diagnostics test comprises at least one of:
a first test to diagnose potential bottlenecks associated with RDMA network interface cards (RNICs) in an intra-host network of a computing node; or (p 7: 4.3)
a second test to identify potential faults in GPU communication within a single computing node and among the plurality of computing nodes. (p 7: 4.3)
5. The method of claim 1, further comprising: suspending, upon detection of the abnormal status of the distributed training task, the distributed training task across the plurality of computing nodes. (p 6: 4.1)
6. The method of claim 1, wherein replacing the at least one faulty node with an equivalent number of healthy computing nodes that have passed the at least one self-check diagnostics test comprises:
evicting the at least one faulty node from the distributed training task; and (p 6: 4.1)
loading model weights and optimizer states from the most recent checkpoint into the healthy computing nodes. (p 7: 4.4)
7. The method of claim 6, further comprising:
at a checkpoint, causing each GPU worker of a computing node to write its on-chip states including the model weights and the optimizer states into a memory of the computing node; and (p 7: 4.4)
causing the computing node to asynchronously transfer the on-chip states from the memory to a distributed file system. (p 7: 4.4)
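As a minimal sketch of the two-stage checkpointing recited in claim 7, the code below first copies worker state into host memory (a blocking step) and then writes it out on a background thread. It is an assumption-laden illustration rather than the implementation of Jiang or of the application: `save_checkpoint` is a hypothetical name, worker state is modeled as a plain dictionary, and an ordinary local directory stands in for the distributed file system.

```python
import copy
import json
import os
import threading
from typing import Dict


def save_checkpoint(step: int, gpu_states: Dict[str, dict],
                    host_memory: Dict[int, dict],
                    dfs_dir: str) -> threading.Thread:
    """Snapshot state to host memory, then flush it in the background."""
    # Stage 1 (blocking): copy each worker's weights and optimizer state
    # into host memory so that training can resume immediately afterwards.
    host_memory[step] = copy.deepcopy(gpu_states)

    # Stage 2 (asynchronous): move the snapshot from host memory to storage.
    def _flush() -> None:
        path = os.path.join(dfs_dir, f"ckpt_{step}.json")
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(host_memory[step], f)
        # Rename last so a reader never sees a partially written file.
        os.replace(tmp_path, path)

    thread = threading.Thread(target=_flush)
    thread.start()
    return thread
```

The caller keeps the returned thread only to wait for completion; later changes to `gpu_states` do not affect the snapshot being written.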
8. The method of claim 7, wherein loading model weights and optimizer states from the most recent checkpoint into the healthy computing nodes comprises:
for a group of GPU workers that share a same state partition of the distributed file system, designating a single GPU worker in the group to read the shared state partition from the distributed file system; and (p 7: 4.4)
causing the single GPU worker to broadcast the shared state partition to all other GPU workers in the group. (p 7: 4.4)
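The read-once-then-broadcast recovery recited in claim 8 can be sketched as follows. This is a single-process simulation under stated assumptions, not the method of Jiang or of the application: `load_with_broadcast` and `read_partition` are hypothetical names, the designated reader is arbitrarily chosen as the lowest-ranked worker in each group, and the "broadcast" is an in-memory assignment rather than a collective communication call.

```python
from collections import defaultdict
from typing import Callable, Dict, Tuple


def load_with_broadcast(
    worker_partition: Dict[int, str],
    read_partition: Callable[[str], dict],
) -> Tuple[Dict[int, dict], Dict[str, int]]:
    """Load each shared partition once and hand it to every group member."""
    # Group workers by the state partition they share.
    groups = defaultdict(list)
    for worker, partition in worker_partition.items():
        groups[partition].append(worker)

    loaded: Dict[int, dict] = {}
    readers: Dict[str, int] = {}
    for partition, workers in groups.items():
        reader = min(workers)               # designated reader (assumption)
        readers[partition] = reader
        state = read_partition(partition)   # the only storage read per group
        for worker in workers:              # simulated broadcast
            loaded[worker] = state
    return loaded, readers
```

With this arrangement the number of storage reads equals the number of distinct partitions rather than the number of workers.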
9. The method of claim 1, further comprising:
collecting data regarding execution time of a code segment on a set of GPU workers; and (fig 7; p 8: 5.1; p 9: 5.2)
identifying a computing node, by visualizing the collected data, that includes a GPU worker with slower performance as a faulty node. (fig 7; p 8: 5.1; p 9: 5.2)
10. The method of claim 9, wherein visualizing the collected data comprises:
generating a heat map that shows time consumption differences between the set of GPU workers. (p 8: 5.1)
11. The method of claim 9, wherein visualizing the collected data comprises:
generating an event timeline on the set of GPU workers in a trace format. (p 8: 5.1)
12. The method of claim 9, wherein identifying at least one GPU worker with slower performance comprises:
displaying a logical topology of the GPU workers with respect to at least one of data parallelism, pipeline parallelism, or tensor parallelism. (p 9: 5.2)
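For illustration, the straggler identification of claims 9-12 can be approximated numerically. The sketch below substitutes a median-based threshold for the heat map and timeline visualizations recited in the claims, so it is not the visualization technique of Jiang or of the application; `find_slow_nodes`, the 1.2 threshold factor, and the assumption that worker ranks are assigned to nodes contiguously are all hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple


def find_slow_nodes(durations: Dict[int, List[float]],
                    workers_per_node: int,
                    threshold: float = 1.2) -> Tuple[List[int], List[int]]:
    """Flag workers whose mean segment time exceeds threshold x the median."""
    # Mean execution time of the profiled code segment, per worker rank.
    mean_time = {rank: sum(t) / len(t) for rank, t in durations.items()}
    baseline = median(mean_time.values())
    slow_workers = sorted(
        rank for rank, m in mean_time.items() if m > threshold * baseline
    )
    # Map each slow worker to its node, assuming contiguous rank assignment.
    slow_nodes = sorted({rank // workers_per_node for rank in slow_workers})
    return slow_workers, slow_nodes
```

The per-worker means computed here are the same values a heat map would color, one cell per worker.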
Claim(s) 13-19 is/are rejected as being the device implemented by the method of claim(s) 1-4, 6-8, and is/are rejected on the same grounds.
Claim(s) 20 is/are rejected as being the medium implemented by the method of claim(s) 1, and is/are rejected on the same grounds.
Response to Remarks
The amendments overcome the objections to the claim(s) for informalities and the rejections of the claim(s) under 35 U.S.C. 112(b).
Conclusion
Applicant's submission of an information disclosure statement under 37 CFR 1.97(c) with the timing fee set forth in 37 CFR 1.17(p) on 7-7-2025 prompted the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 609.04(b). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KATHERINE LIN whose telephone number is (571)431-0706. The examiner can normally be reached Monday-Friday; 8 a.m. - 5 p.m. EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bryce Bonzo, can be reached at (571) 272-3655. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KATHERINE LIN/Primary Examiner, Art Unit 2113