DETAILED ACTION
Status of Claims
This action is in reply to the application filed on 10/11/2023.
Claims 1-20 are currently pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-3, 5-9, and 12-19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Rajbhandari et al. (“ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, 11/2021, SC ’21, ACM conference publication¹).
Claims 1 and 12:
Rajbhandari discloses the limitations as shown in the following rejections:
A system comprising: a first storage medium (CPU memory); and a processing circuit (CPU with offload engine) in communication with the first storage medium (pg. 7, Fig. 4; pg. 9, § 6.3; pg. 14).
the processing circuit being configured to: identify first data associated with a task (training) associated with a machine learning program (model); determine a first attribute (e.g. current pass and/or current operation of training) associated with the first data; provide the first data to a first device (GPU) based on determining the first attribute (parameters/gradients needed for current pass); identify second data associated with the task associated with the machine learning program; determine a second attribute (e.g. activation checkpoint type, optimizer state) associated with the second data; and provide the second data to a second device (e.g. NVMe SSD, CPU) different from the first device (see at least pg. 6, § 5.1.1 and 5.1.2; pg. 7, Fig. 4, description; pg. 7-8, § 5.3; pg. 9, § 7.1, para. 4), disclosing a system for parallel training of machine learning models using a heterogeneous memory architecture, in which model and residual state data of the training process are partitioned and a determination is made as to which portions should be transferred to GPU (first device) memory and which should be transferred to CPU memory or NVMe solid-state drive (SSD) (second device) based on the type of data (attributes) and the stage of the training process; for example, parameters (first data) needed for the current forward pass are stored at the GPU while activations (second data) are offloaded to CPU and/or NVMe storage. Exemplary quotations:
“ZeRO-Infinity is designed with a powerful offload mechanism called the infinity offload engine which can offload all of the partitioned model states to CPU or NVMe memory, or keep them on the GPU…In addition to model states, ZeRO-Infinity can offload activation memory to CPU memory, when necessary” (pg. 6).
“Communication for the backward pass of the first layer is depicted. Partitioned parameters are moved from slow memory to GPU and then collected to form the full layer. After gradients are computed, they are aggregated, re-partitioned, and then offloaded to slow memory…Activation checkpoints…are moved to CPU memory during forward pass, and moved back to GPU one layer at a time before the backward pass on the corresponding layer” (pg. 7, Fig. 4).
“at the end of the submodule’s forward/backward pass, when the parameters belonging to the submodule is no longer needed. Once again, ZeRO-Infinity injects post forward/backward hooks into the submodules to partition the parameters and optionally offload them to CPU or NVMe.”
“First we offload optimizer states and gradients to the fastest memory with enough capacity, since it gives the largest memory savings with the least communication overhead. Next, between parameters and activation checkpoints, if only one needs to be offloaded to CPU memory, we empirically chose to offload the one that gives better performance. When both need to be offloaded, activation checkpoints are offloaded to CPU and parameters are offloaded to the fastest memory with enough capacity” (pg. 9).
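For clarity of record, the Examiner notes that the placement policy quoted above can be rendered, for purposes of illustration only, as the following sketch; the tier names, data-structure shapes, and capacity accounting are Examiner assumptions and are not code taken from the reference:

```python
# Illustrative sketch (not from Rajbhandari) of the offload placement
# heuristic described at pg. 9: each state type is assigned to the
# fastest memory tier with enough remaining capacity.

TIERS = ["gpu", "cpu", "nvme"]  # ordered fastest to slowest

def fastest_fit(size_bytes, free):
    """Return the fastest tier with enough free capacity, updating `free`."""
    for tier in TIERS:
        if free[tier] >= size_bytes:
            free[tier] -= size_bytes
            return tier
    raise MemoryError("no tier has enough capacity")

def place_states(sizes, free):
    placement = {}
    # Optimizer states and gradients are placed first in the fastest
    # fitting tier: they give the largest memory savings with the least
    # communication overhead.
    for kind in ("optimizer_states", "gradients"):
        placement[kind] = fastest_fit(sizes[kind], free)
    # When both parameters and activation checkpoints must be offloaded,
    # activation checkpoints go to CPU and parameters to the fastest
    # fitting tier; otherwise both remain on the GPU.
    if sizes["parameters"] + sizes["activations"] > free["gpu"]:
        placement["activations"] = "cpu"
        free["cpu"] -= sizes["activations"]
        placement["parameters"] = fastest_fit(sizes["parameters"], free)
    else:
        placement["parameters"] = "gpu"
        placement["activations"] = "gpu"
        free["gpu"] -= sizes["parameters"] + sizes["activations"]
    return placement
```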
wherein the machine learning program is configured to generate a result based on an input (see at least pg. 3, § 2, first para.; pg. 6, § 5.1.3; pg. 7, Fig. 4, description; pg. 7, col. 2).
Claims 2, 3, 13, and 14:
Rajbhandari discloses the limitations as shown in the rejections above. Rajbhandari further discloses wherein the first device includes at least one of a graphics processing unit, tensor processing unit, or co-processor…wherein the second device includes a solid state drive (NVMe) (see at least pg. 1, Abstract; pg. 6, § 5.1.1 and 5.1.2; pg. 9, § 7.1, para. 4; pg. 14: “Relevant hardware details…2x NVME SSD Samsung 960GB”).
Claims 5, 6, 15, and 16:
Rajbhandari discloses the limitations as shown in the rejections above. Rajbhandari further discloses wherein the first data is stored in the first storage medium and includes at least one of an optimizer state, gradient, or weight computed for the machine learning program…wherein the second data is stored in the first storage medium and includes at least one of an activation, input batch, or checkpoint data of the machine learning program (see at least pg. 6, § 5.1.1 and 5.1.2; pg. 7, Fig. 4, description; pg. 7-8, § 5.3; pg. 9, § 7.1, para. 4).
Claims 7, 8, 17, and 18:
Rajbhandari discloses the limitations as shown in the rejections above. Rajbhandari further discloses wherein the first attribute includes a state (e.g. current pass and/or current operation of training) of a computing logic for training the machine learning program…wherein the second attribute includes a data type (e.g. activation or checkpoint data type) (see at least pg. 6, § 5.1.1 and 5.1.2; pg. 7, Fig. 4, description; pg. 7-8, § 5.3; pg. 9, § 7.1, para. 4).
Claims 9 and 19:
Rajbhandari discloses the limitations as shown in the rejections above. Rajbhandari further discloses wherein the first device is configured to perform a computation (operation/module) based on the first data, generate third data (e.g. gradients) based on the computation, and transfer the third data, wherein the processing circuit is configured to: receive the third data; and store the third data in the first storage medium (CPU memory) (see at least pg. 7, § 5.2.2; pg. 7-8, § 5.3; pg. 7, Fig. 4, description: “After gradients are computed, they are aggregated, re-partitioned, and then offloaded to slow memory.”).
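For clarity of record, the gradient round trip mapped to the claim above (first device computes third data from first data; processing circuit receives it and stores it in the first storage medium) can be sketched as follows; all names and values are Examiner assumptions for illustration only, not code from the reference:

```python
# Illustrative sketch (not from Rajbhandari) of the claimed data flow:
# the GPU (first device) computes gradients (third data) from parameters
# (first data); the host (processing circuit) receives the gradients and
# stores them in CPU memory (first storage medium).

cpu_memory = {}  # stands in for the first storage medium

def gpu_backward(params, upstream_grad):
    """First device: compute third data (gradients) from the first data."""
    return [upstream_grad * p for p in params]

def host_receive_and_store(name, grads):
    """Processing circuit: receive the third data and store it."""
    cpu_memory[name] = grads

grads = gpu_backward([1.0, 2.0, 3.0], upstream_grad=0.5)
host_receive_and_store("layer0.grads", grads)
```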
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Rajbhandari in view of Wang et al. (“Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems”, 2022).
Claim 4:
Rajbhandari discloses the limitations as shown in the rejections above. Rajbhandari does not specifically disclose that the processing circuit is configured to operate using a cache coherent protocol.
Wang, however, discloses an analogous heterogeneous system for parallel training of ML models comprising GPUs and disaggregated memory devices coordinated by a host CPU (processing circuit) that is configured to operate using a cache coherent protocol (see at least pg. 129; pg. 126, Abstract: “we propose COARSE, a disaggregated memory extension for distributed DL training. COARSE is built on modern cache-coherent interconnect (CCI) protocols…to allow low-latency and parallel access to training data and model parameters shared among worker GPUs.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Rajbhandari to employ CCI protocols as taught by Wang because:
“CCI is beneficial to data parallel training in two aspects: (i) in data parallel training, cross-device communication is a major overhead; this communication can take advantage of CCI low-latency memory access to improve the parameter synchronization performance. (ii) the parameter synchronization operations block the training procedure and take up GPU computing resources; while using CCI, GPU can work with memory device processors coherently to offload these synchronization operations to memory device processors, thus improving the GPU utilization and reducing the communication overhead.” (Wang pg. 129)
Claims 10, 11, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Rajbhandari in view of Chang et al. (US 2019/0311257 A1).
Claims 10, 11, and 20:
Rajbhandari discloses the limitations as shown in the rejections above. Rajbhandari does not specifically disclose the processing circuit being configured to: receive a signal indicative of a state of the first device; and update a parameter associated with the task…wherein the state of the first device includes available memory of the first device, and the parameter includes a batch size of training data for training the machine learning program.
Chang, however, discloses an analogous heterogeneous system for parallel training of ML models by a plurality of GPU work servers (first device), including an advisor server (processing circuit) that is configured to receive a signal indicative of a state of the first device; and update a parameter associated with the task…wherein the state of the first device includes available memory of the first device, and the parameter includes a batch size of training data for training the machine learning program (see at least ¶0006, 0039-0042, 0051-0052, 0056). Exemplary quotation: “Advisor server 140 may also determine a batch size for each work server…controller 132 may also report throughput parameters (e.g., idle time, duration spent processing the batch, free GPU memory, etc.) to advisor server 140. Advisor server 140 may then update a batch size for work server” (¶0052).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Rajbhandari to dynamically update batch sizes as taught by Chang to prevent model mis-training and to increase the overall speed of the training process (¶0002-0003, 0055-0056).
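For clarity of record, the advisor-style update described by Chang (a worker reports free device memory; the advisor returns an updated batch size) can be sketched as follows; the function name, field names, and scaling rule are Examiner assumptions for illustration only, not taken from Chang:

```python
# Illustrative sketch (not from Chang) of updating a training batch size
# from a reported device state: the new batch size is capped by what fits
# in the reported free memory (with headroom), by an absolute maximum,
# and by at most doubling the current batch size per update.

def update_batch_size(current_batch, free_mem_bytes, bytes_per_sample,
                      min_batch=1, max_batch=1024, headroom=0.8):
    """Return a new batch size that fits within the reported free memory."""
    fits = int((free_mem_bytes * headroom) // bytes_per_sample)
    return max(min_batch, min(max_batch, fits, 2 * current_batch))
```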
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:
Each of the following is directed to parallel training systems with a heterogeneous storage architecture: US 20220327376 A1; US 20240086328 A1; US 20240135162 A1.
US 20240411710 A1 is directed to a tensor processing system based on the CXL cache coherent protocol.
“Efficient Memory Management for GPU-based Deep Learning Systems” is directed to a training system that automatically swaps variables not in use from GPU to CPU memory and swaps them back before the next access.
“FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks” is directed to offloading/prefetching tensors between GPU and SSD during training.
Any inquiry of a general nature or relating to the status of this application or concerning this communication or earlier communications from the Examiner should be directed to Paul Mills whose telephone number is 571-270-5482. The Examiner can normally be reached on Monday-Friday 11:00am-8:00pm. If attempts to reach the examiner by telephone are unsuccessful, the Examiner’s supervisor, April Blair can be reached at 571-270-1014.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/P. M./
Paul Mills
03/04/2026
/APRIL Y BLAIR/Supervisory Patent Examiner, Art Unit 2196
¹ For clarity of record, Examiner notes the reference Rajbhandari cited in the rejection and provided with this Action is a different version than the paper with same title cited in the IDS dated 02/10/2025. See PTO-892 for complete citation.