Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Detailed Action
The following action is in response to the communication(s) received on 12/10/2025.
As of the claims filed 12/10/2025:
Claims 1, 2, 14, 17, and 18 have been amended.
Claims 4, 5, 10-13, 15, and 16 have been canceled.
Claims 21-26 have been added.
Claims 1-3, 6-9, 14, and 17-26 are now pending.
Claims 1, 14, and 21 are independent claims.
Response to Arguments
Applicant’s arguments filed 12/10/2025 have been fully considered, but are not fully persuasive.
Regarding the art rejection under 35 U.S.C. 102:
Applicant asserts that, for claim 1, Anthony does not disclose pausing the healthy processes until the restarted process progresses to a common training iteration. Examiner respectfully submits that Anthony’s hot restart ([p.62 2nd col last ¶]) does indicate that the last saved checkpoint is loaded when restarting the faulty nodes. This restart is achieved through the periodic synchronized checkpointing method, which corresponds to the paused process progression of the healthy nodes.
Applicant further asserts that, for claim 14, Anthony does not teach limiting the progression of healthy workers, what happens to the previous checkpoints, or the constraints on worker progression that are linked to a quantity of checkpoint states stored locally at agents (p.13 ¶2). Examiner respectfully disagrees. The amended limitation merely requires each agent to store a maximum quantity of two or more most recent checkpoint states, which is taught by Anthony ([p.64 1st col last ¶]). Neither the details regarding the healthy nodes nor the constraints on worker progression linked to the quantity of checkpoint states are explicitly recited in claim 17; thus, the details regarding the healthy nodes, the utilization of the checkpoints, and the constraints on worker progression cannot be read into the claims.
The cancellation of claim 13 has rendered the 35 U.S.C. 103 rejection moot.
Regarding the new claims:
Applicant asserts that Claim 21 is allowable for the same reasons given for claims 1 and 14. Examiner respectfully submits that the claim has been rejected for similar reasons, as set forth above.
Claims 22-26 are rejected at least by virtue of their dependence from a rejected parent claim.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Such claim limitation(s) is/are:
Claim 14: “a scheduler configured to…”
Claims 17, 18, 19, 20: “the scheduler is further configured to…”
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-3, 6-9, 14, and 17-26 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Anthony et al., “Evaluating Multi-Level Checkpointing for Distributed Deep Neural Network Training” (hereinafter Anthony).
Regarding Claim 1, Anthony teaches:
A method for training a machine-learning model, comprising:
assigning a plurality of nodes for training the machine-learning model, each node of the plurality of nodes including one or more agents comprising at least an agent processing unit and a local memory, each agent configured to manage one or more workers via a local network, each worker including at least a worker processing unit; (Anthony [p.61 2nd col 1st] Data parallelism is the most popular approach used in distributed DNN training. It first duplicates the DNN to each compute element (CU) before partitioning the training data across all CUs… We use CU as an abstraction of a single hardware element such as a CPU, GPU, TPU, etc… Figure 1 depicts an example of data parallel DNN training with 4 GPUs on a single node.
[p.62 2nd col last ¶] Figure 5 depicts a hot restart example using the partner redundancy scheme. Initially, as shown in Figure 5(a), six nodes are allocated for a job. Nodes P0, P1, P2, P3 are actively used while nodes S0, S1 are idle spare nodes.
) (Note: the CU corresponds to the one or more agents; training a DNN requires local memory within the node.)
distributing shards of a training data set for parallel processing by worker processing units at different nodes, each worker processing unit configured to: (Anthony [p.61 1st col last ¶] Data parallelism is the most popular approach used in distributed DNN training. It first duplicates the DNN to each compute element (CU) before partitioning the training data across all CUs.)
iteratively train on minibatches of a distributed shard, (Anthony [p.61 1st col last ¶] Upon receiving a data partition, each CU performs a forward pass on the data. Then the updated parameters are synchronized by averaging the gradients among all processes. Most implementations of data parallelism realize gradient averaging via an MPI Allreduce operation, which performs an element-wise summation and sends the final result to every process. The number of samples in a data partition sent to each CU at each global training step is known as the batch size.) (Note: performing the forward pass and sending the final result to every process at each global training step corresponds to iteratively training on minibatches of a distributed shard.)
report checkpoint states for storage in the local memory at a respective agent, each checkpoint state indicating updated parameters for the machine-learning model based on one or more training iterations, (Anthony [p.62 1st col last ¶] Let us first examine the operation flow of the root check-pointing scheme in detail. As shown in Figure 4(a), at the last distributed DL training step of the current epoch, the root process saves a snapshot of the DNN to the parallel file system. All other processes idle until the checkpoint is completely saved. After the checkpoint is saved by the root process, all processes start the next epoch.
)
and based on recognizing a worker processing unit failing: initializing and restarting the failed worker processing unit based at least on a checkpoint state stored in local memory, (Anthony [p.62 2nd col 3rd ¶] An often-overlooked yet crucial function for any check-pointing tool is an effective restart. SCR-Exa supports two restart methods. A cold restart will attempt to restart the application from a checkpointing cache within the same allocation and using the same nodes.
[p.63 1st col 1st ¶] If a node fails, as shown in Figure 5(b), spare node S0 is then activated. SCR-Exa relabels the active nodes following certain rules based on the placement of checkpoint files. The checkpoint file of the highest rank then is copied onto the newly activated node. At this point, every active node has a copy of the checkpoint file that matches its rank index in its local storage. Each rank of the application can now restart from the last checkpoint and carry on execution, as shown in Figure 5(c).
)
and limiting progression of all healthy worker processing units through training iterations by pausing the healthy worker processing units until the initialized and restarted worker processing unit progresses one or more training iterations to a common training iteration previously completed by the healthy worker processing units. (Anthony [p.62 1st col last ¶] Let us first examine the operation flow of the root check-pointing scheme in detail. As shown in Figure 4(a), at the last distributed DL training step of the current epoch, the root process saves a snapshot of the DNN to the parallel file system. All other processes idle until the checkpoint is completely saved. After the checkpoint is saved by the root process, all processes start the next epoch.
) (Note: idling at the last distributed DL training step corresponds to limiting progression within one training iteration and thus a common training iteration.)
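For illustration only (not part of the examined record), the pause-until-catch-up behavior discussed above, healthy workers idling until a restarted worker replays iterations from its checkpoint up to the common training iteration, can be sketched as follows; all names here are hypothetical and threads merely stand in for worker processing units:

```python
import threading

COMMON_ITERATION = 5            # iteration already completed by healthy workers
resume = threading.Event()      # healthy workers pause on this after a failure

def restarted_worker(checkpoint_iteration):
    """Replays training iterations from its stored checkpoint until it
    reaches the common iteration, then releases the paused healthy workers."""
    iteration = checkpoint_iteration
    while iteration < COMMON_ITERATION:
        iteration += 1          # train one iteration (stub)
    resume.set()                # caught up to the common training iteration

def healthy_worker(results, rank):
    resume.wait()               # paused: no progression until catch-up completes
    results[rank] = COMMON_ITERATION + 1  # all workers then proceed together

results = {}
threads = [threading.Thread(target=healthy_worker, args=(results, r)) for r in range(3)]
for t in threads:
    t.start()
restarted_worker(checkpoint_iteration=2)
for t in threads:
    t.join()
```

In this sketch the healthy workers perform no training iterations while the event is unset, which is the functional effect of the claimed limiting of progression.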
Regarding Claim 2, Anthony teaches all the limitations of Claim 1, as set forth above. Anthony further teaches:
The method of claim 1, further comprising: further limiting progression of the worker processor units through the training iterations such that all worker processor units are maintained within a threshold number of training iterations. (Anthony [p.62 1st col last ¶] Let us first examine the operation flow of the root check-pointing scheme in detail. As shown in Figure 4(a), at the last distributed DL training step of the current epoch, the root process saves a snapshot of the DNN to the parallel file system. All other processes idle until the checkpoint is completely saved. After the checkpoint is saved by the root process, all processes start the next epoch.
) (Note: idling at the last distributed DL training step corresponds to limiting progression within one training iteration.)
Regarding Claim 3, Anthony teaches all the limitations of Claim 2, as set forth above. Anthony further teaches:
The method of claim 2, wherein each agent is configured to store the threshold number of most recent checkpoint states for each associated worker processing unit. (Anthony [p.64 1st col last ¶] For both SCR-Exa and root checkpointing, we saved a checkpoint every epoch of DL training. We configured SCR-Exa to flush a checkpoint from node local storage to the parallel file system once every 10th checkpoint. The remaining checkpoints are saved to the local node's storage only.) (Note: the remaining checkpoints correspond to the threshold number of most recent checkpoint states.)
Regarding Claim 6, Anthony teaches all the limitations of Claim 1, as set forth above. Anthony further teaches:
The method of claim 1, further comprising: based at least on an indication of an agent failing, reassigning and initializing the agent and associated worker processing units based at least on one or more checkpoint states stored at an agent associated with a different worker processing unit in a respective peer group. (Anthony [p.63 1st col 1st ¶] If a node fails, as shown in Figure 5(b), spare node S0 is then activated. SCR-Exa relabels the active nodes following certain rules based on the placement of checkpoint files. The checkpoint file of the highest rank then is copied onto the newly activated node. At this point, every active node has a copy of the checkpoint file that matches its rank index in its local storage. Each rank of the application can now restart from the last checkpoint and carry on execution, as shown in Figure 5(c).
) (Note: activating and copying to the spare node corresponds to reassigning and initializing a different worker processing unit.)
Regarding Claim 7, Anthony teaches all the limitations of Claim 6, as set forth above. Anthony further teaches:
The method of claim 6, further comprising: based at least on an indication of a node failing, reassigning and initializing each agent and each associated worker processing unit of the node based at least on one or more checkpoint states stored at a different agent associated with a worker processing unit in a peer group for each worker processing unit of the failed node. (Anthony [p.63 1st col 1st ¶] If a node fails, as shown in Figure 5(b), spare node S0 is then activated. SCR-Exa relabels the active nodes following certain rules based on the placement of checkpoint files. The checkpoint file of the highest rank then is copied onto the newly activated node. At this point, every active node has a copy of the checkpoint file that matches its rank index in its local storage. Each rank of the application can now restart from the last checkpoint and carry on execution, as shown in Figure 5(c).
) (Note: since each node includes an agent and associated worker, copying the checkpoint file onto the newly activated node corresponds to reassigning and initializing each failed agent and associated worker.)
Regarding Claim 8, Anthony teaches all the limitations of Claim 1, as set forth above. Anthony further teaches:
The method of claim 1, wherein each agent is further configured to upload checkpoint states from local memory to a global backup at a lower frequency than the checkpoint states are recorded at the respective agent. (Anthony [p.64 1st col last ¶] For both SCR-Exa and root checkpointing, we saved a checkpoint every epoch of DL training. We configured SCR-Exa to flush a checkpoint from node local storage to the parallel file system once every 10th checkpoint. The remaining checkpoints are saved to the local node's storage only. Root checkpointing saves the DNN in the PFS every epoch.) (Note: the parallel file system corresponds to the global backup; the saved checkpoint at node local storage corresponds to the checkpoint states recorded at the respective agent.)
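For illustration only, the cited configuration, every checkpoint saved to node-local storage but only every 10th flushed to the parallel file system, reduces to a simple modulo test on the checkpoint index. The names below are hypothetical:

```python
def save_checkpoint(epoch, state, local_store, global_backup, flush_every=10):
    """Record every checkpoint in agent-local storage; upload ("flush") it
    to the global backup only every `flush_every`-th checkpoint, i.e. at a
    lower frequency than local recording."""
    local_store[epoch] = state
    if epoch % flush_every == 0:      # infrequent flush, off the critical path
        global_backup[epoch] = state

local_store, global_backup = {}, {}
for epoch in range(1, 21):
    save_checkpoint(epoch, {"epoch": epoch}, local_store, global_backup)
```

After 20 epochs, the local store holds all 20 checkpoint states while the global backup holds only those from epochs 10 and 20, mirroring the lower upload frequency.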
Regarding Claim 9, Anthony teaches all the limitations of Claim 1, as set forth above. Anthony further teaches:
The method of claim 1, wherein each worker is configured to report checkpoint states based at least on a most recent training iteration to a respective agent following progression of the worker to a subsequent training iteration. (Anthony [p.62 1st col last ¶] As shown in Figure 4(b), at the last distributed DL training step of the current epoch, each process saves a snapshot of its local copy of the DNN to its own node-local storage. Every process starts the next epoch as soon as it finishes its own checkpoint saving. Flushing a checkpoint from node-local storage to the parallel filesystem occurs infrequently. Therefore, checkpointing can be moved off the critical execution path of the DL application completely.) (Note: node-local storage corresponds to a respective agent)
Regarding Claim 14, Anthony teaches:
A computing system for training a machine-learning model, comprising: a plurality of agents comprising at least an agent processing unit and a local memory, each agent configured to manage one or more workers via a local network, each worker including at least one worker processing unit; (Anthony [p.61 2nd col 1st] Figure 1 depicts an example of data parallel DNN training with 4 GPUs on a single node.
[p.62 2nd col last ¶] Figure 5 depicts a hot restart example using the partner redundancy scheme. Initially, as shown in Figure 5(a), six nodes are allocated for a job. Nodes P0, P1, P2, P3 are actively used while nodes S0, S1 are idle spare nodes.
)
a plurality of worker processing units configured to: iteratively train on minibatches of a distributed shard, (Anthony [p.61 1st col last ¶] Upon receiving a data partition, each CU performs a forward pass on the data.)
and report checkpoint states for storage in the local memory at a respective agent, the checkpoint state indicating updated parameters for the machine-learning model based on one or more training iterations; and (Anthony [p.62 1st col last ¶] Let us first examine the operation flow of the root check-pointing scheme in detail. As shown in Figure 4(a), at the last distributed DL training step of the current epoch, the root process saves a snapshot of the DNN to the parallel file system. All other processes idle until the checkpoint is completely saved. After the checkpoint is saved by the root process, all processes start the next epoch.
)
a scheduler configured to: assign a plurality of nodes of one or more pods for training the machine-learning model, each pod including one or more agents in which each agent distributes shards of a training data set for processing by a worker processing unit; (Anthony [p.63 1st col 1st ¶] If a node fails, as shown in Figure 5(b), spare node S0 is then activated. SCR-Exa relabels the active nodes following certain rules based on the placement of checkpoint files. The checkpoint file of the highest rank then is copied onto the newly activated node. At this point, every active node has a copy of the checkpoint file that matches its rank index in its local storage. Each rank of the application can now restart from the last checkpoint and carry on execution, as shown in Figure 5(c).
)
(Anthony [p.61 2nd col 1st] Figure 1 depicts an example of data parallel DNN training with 4 GPUs on a single node.
[p.62 2nd col last ¶] Figure 5 depicts a hot restart example using the partner redundancy scheme. Initially, as shown in Figure 5(a), six nodes are allocated for a job. Nodes P0, P1, P2, P3 are actively used while nodes S0, S1 are idle spare nodes.
[p.61 1st col last ¶] Data parallelism is the most popular approach used in distributed DNN training. It first duplicates the DNN to each compute element (CU) before partitioning the training data across all CUs. [p.61 1st col last ¶] Upon receiving a data partition, each CU performs a forward pass on the data.)
limit progression of the worker processing units through training iterations such that the worker processing units to which the shards are distributed are maintained within a threshold number of training iterations to each other. (Anthony [p.62 1st col last ¶] Let us first examine the operation flow of the root check-pointing scheme in detail. As shown in Figure 4(a), at the last distributed DL training step of the current epoch, the root process saves a snapshot of the DNN to the parallel file system. All other processes idle until the checkpoint is completely saved. After the checkpoint is saved by the root process, all processes start the next epoch.
[p.64 1st col last ¶] For both SCR-Exa and root checkpointing, we saved a checkpoint every epoch of DL training. We configured SCR-Exa to flush a checkpoint from node local storage to the parallel file system once every 10th checkpoint. The remaining checkpoints are saved to the local node's storage only.) (Note: the remaining checkpoints correspond to the threshold number of most recent checkpoint states.)
and based at least on an indication of an agent failing, reassign and initialize the agent and associated worker processing units based at least on one or more checkpoint states stored at an agent associated with a different worker processing unit in a respective peer group; (Anthony [p.63 1st col 1st ¶] If a node fails, as shown in Figure 5(b), spare node S0 is then activated. SCR-Exa relabels the active nodes following certain rules based on the placement of checkpoint files. The checkpoint file of the highest rank then is copied onto the newly activated node. At this point, every active node has a copy of the checkpoint file that matches its rank index in its local storage. Each rank of the application can now restart from the last checkpoint and carry on execution, as shown in Figure 5(c).
)
wherein each agent is configured to store a maximum quantity of two or more most recent checkpoint states in the local memory for each associated worker processing unit that corresponds to the threshold number in which an oldest stored checkpoint state is deleted when a new checkpoint state is recorded upon reaching the maximum quantity. (Anthony [p.64 1st col last ¶] For both SCR-Exa and root checkpointing, we saved a checkpoint every epoch of DL training. We configured SCR-Exa to flush a checkpoint from node local storage to the parallel file system once every 10th checkpoint. The remaining checkpoints are saved to the local node’s storage only. Root checkpointing saves the DNN in the PFS every epoch.) (Note: the remaining checkpoints correspond to the threshold number of most recent checkpoint states.)
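For illustration only, the rotation recited in the wherein clause, a bounded store of the two or more most recent checkpoint states in which the oldest state is deleted when a new one is recorded at the maximum quantity, is the behavior of a fixed-capacity buffer. All names here are hypothetical:

```python
from collections import deque

class AgentCheckpointStore:
    """Keeps at most `max_states` most recent checkpoint states per worker.

    When a worker's buffer is full, recording a new state evicts the
    oldest one, matching "an oldest stored checkpoint state is deleted
    when a new checkpoint state is recorded upon reaching the maximum
    quantity."
    """

    def __init__(self, max_states=3):
        self.max_states = max_states
        self.states = {}  # worker_id -> deque of (iteration, params)

    def record(self, worker_id, iteration, params):
        buf = self.states.setdefault(worker_id, deque(maxlen=self.max_states))
        buf.append((iteration, params))  # oldest entry evicted when full

    def latest(self, worker_id):
        return self.states[worker_id][-1]
```

With `max_states=2`, recording states for iterations 1, 2, and 3 leaves only the states for iterations 2 and 3 in local memory.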
Regarding Claim 17, Anthony teaches all the limitations of Claim 14, as set forth above. Anthony further teaches:
The computing system of claim 14, wherein the scheduler is further configured to: synchronize the initialized and reassigned worker processing units with all healthy worker processing units, such that all worker processing units restart from a common training iteration based at least on a respective stored checkpoint state. (Anthony [p.63 1st col 1st ¶] If a node fails, as shown in Figure 5(b), spare node S0 is then activated. SCR-Exa relabels the active nodes following certain rules based on the placement of checkpoint files. The checkpoint file of the highest rank then is copied onto the newly activated node. At this point, every active node has a copy of the checkpoint file that matches its rank index in its local storage. Each rank of the application can now restart from the last checkpoint and carry on execution, as shown in Figure 5(c).
)
Regarding Claim 18, Anthony teaches all the limitations of Claim 14, as set forth above. Anthony further teaches:
The computing system of claim 14, wherein the scheduler is further configured to: limit progression of all healthy worker processing units until the initialized and reassigned worker processing units have progressed to a common iteration based at least on a respective stored checkpoint state. (Anthony [p.62 1st col last ¶] Let us first examine the operation flow of the root check-pointing scheme in detail. As shown in Figure 4(a), at the last distributed DL training step of the current epoch, the root process saves a snapshot of the DNN to the parallel file system. All other processes idle until the checkpoint is completely saved. After the checkpoint is saved by the root process, all processes start the next epoch.
)
Regarding Claim 19, Anthony teaches all the limitations of Claim 14, as set forth above. Anthony further teaches:
The computing system of claim 14, wherein the scheduler is further configured to: assign one or more nodes for migration to a different local network, each node including one or more pods; and initialize the migrated nodes based at least on one or more checkpoint states stored at each agent of the node. (Anthony [p.63 1st col 1st ¶] If a node fails, as shown in Figure 5(b), spare node S0 is then activated. SCR-Exa relabels the active nodes following certain rules based on the placement of checkpoint files. The checkpoint file of the highest rank then is copied onto the newly activated node. At this point, every active node has a copy of the checkpoint file that matches its rank index in its local storage. Each rank of the application can now restart from the last checkpoint and carry on execution, as shown in Figure 5(c).
)
Regarding Claim 20, Anthony teaches all the limitations of Claim 19, as set forth above. Anthony further teaches:
The computing system of claim 19, wherein the scheduler is further configured to: responsive to initializing the migrated nodes, send a re-start request to each agent processing unit for each node of the plurality of nodes; receive responses from at least some agent processing units, each response indicating an identification of each checkpoint state stored for each associated worker processing unit; based on the received responses, determine a common checkpoint state at which to re- start training; and send a request to all agent processing units to re-start training their respective worker processing units at the determined common checkpoint states. (Anthony [p.63 1st col 1st ¶] If a node fails, as shown in Figure 5(b), spare node S0 is then activated. SCR-Exa relabels the active nodes following certain rules based on the placement of checkpoint files. The checkpoint file of the highest rank then is copied onto the newly activated node. At this point, every active node has a copy of the checkpoint file that matches its rank index in its local storage. Each rank of the application can now restart from the last checkpoint and carry on execution, as shown in Figure 5(c).
)
Regarding Claim 21, Anthony teaches:
A method for training a machine-learning model, comprising:
assigning a plurality of nodes for training the machine-learning model, each node of the plurality of nodes including one or more agents comprising at least an agent processing unit and a local memory, each agent configured to manage one or more workers via a local network, each worker including at least a worker processing unit; (Anthony [p.61 2nd col 1st] Data parallelism is the most popular approach used in distributed DNN training. It first duplicates the DNN to each compute element (CU) before partitioning the training data across all CUs… We use CU as an abstraction of a single hardware element such as a CPU, GPU, TPU, etc… Figure 1 depicts an example of data parallel DNN training with 4 GPUs on a single node.
[p.62 2nd col last ¶] Figure 5 depicts a hot restart example using the partner redundancy scheme. Initially, as shown in Figure 5(a), six nodes are allocated for a job. Nodes P0, P1, P2, P3 are actively used while nodes S0, S1 are idle spare nodes.
) (Note: the CU corresponds to the one or more agents; training a DNN requires local memory within the node.)
and distributing shards of a training data set for parallel processing by worker processing units at different nodes, each worker processing unit configured to: (Anthony [p.61 1st col last ¶] Data parallelism is the most popular approach used in distributed DNN training. It first duplicates the DNN to each compute element (CU) before partitioning the training data across all CUs.)
iteratively train on minibatches of a distributed shard, (Anthony [p.61 1st col last ¶] Upon receiving a data partition, each CU performs a forward pass on the data. Then the updated parameters are synchronized by averaging the gradients among all processes. Most implementations of data parallelism realize gradient averaging via an MPI Allreduce operation, which performs an element-wise summation and sends the final result to every process. The number of samples in a data partition sent to each CU at each global training step is known as the batch size.) (Note: performing the forward pass and sending the final result to every process at each global training step corresponds to iteratively training on minibatches of a distributed shard.)
report checkpoint states for storage in the local memory at a respective agent, each checkpoint state indicating updated parameters for the machine-learning model based on one or more training iterations; (Anthony [p.62 1st col last ¶] Let us first examine the operation flow of the root check-pointing scheme in detail. As shown in Figure 4(a), at the last distributed DL training step of the current epoch, the root process saves a snapshot of the DNN to the parallel file system. All other processes idle until the checkpoint is completely saved. After the checkpoint is saved by the root process, all processes start the next epoch.
)
and based on recognizing a worker processing unit failing: initializing and restarting the failed worker processing unit based at least on a checkpoint state stored in local memory, (Anthony [p.62 2nd col 3rd ¶] An often-overlooked yet crucial function for any check-pointing tool is an effective restart. SCR-Exa supports two restart methods. A cold restart will attempt to restart the application from a checkpointing cache within the same allocation and using the same nodes.
[p.63 1st col 1st ¶] If a node fails, as shown in Figure 5(b), spare node S0 is then activated. SCR-Exa relabels the active nodes following certain rules based on the placement of checkpoint files. The checkpoint file of the highest rank then is copied onto the newly activated node. At this point, every active node has a copy of the checkpoint file that matches its rank index in its local storage. Each rank of the application can now restart from the last checkpoint and carry on execution, as shown in Figure 5(c).
)
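The spare-node recovery quoted above (a spare node is activated on failure, active nodes are relabeled, and the highest-rank checkpoint file is copied onto the newly activated node so that every active node holds the checkpoint matching its rank) can be sketched as one plausible simplification. The relabeling rule and all identifiers below are illustrative assumptions, since the exact placement rules are not reproduced in the quoted passage.

```python
# Hypothetical simplification of spare-node restart: surviving nodes are
# relabeled to contiguous ranks in their original order, the spare takes the
# highest rank, and each node ends up holding the checkpoint file matching
# its (new) rank, so every rank can restart from the last checkpoint.
def restart_with_spare(ranks, failed, spare, checkpoint_files):
    """ranks: node -> rank; checkpoint_files: rank -> checkpoint contents."""
    n = len(ranks)
    surviving = sorted((r, node) for node, r in ranks.items() if node != failed)
    new_ranks = {node: i for i, (_, node) in enumerate(surviving)}
    new_ranks[spare] = n - 1          # spare node takes the highest rank
    # each active node obtains the checkpoint file matching its rank index
    local = {node: checkpoint_files[r] for node, r in new_ranks.items()}
    return new_ranks, local

new_ranks, local = restart_with_spare(
    {"N0": 0, "N1": 1, "N2": 2}, failed="N1", spare="S0",
    checkpoint_files={0: "ckpt-0", 1: "ckpt-1", 2: "ckpt-2"},
)
```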
and limiting progression of all healthy worker processing units through training iterations by rolling back the healthy worker processing units to a common training iteration previously completed by the healthy worker processing units at which the failed worker processing unit is initialized and restarted. (Anthony [p.62 2nd col 3rd ¶] An often-overlooked yet crucial function for any checkpointing tool is an effective restart. SCR-Exa supports two restart methods. A cold restart will attempt to restart the application from a checkpointing cache within the same allocation and using the same nodes.) (Note: restarting the application from the last synchronized checkpoint within the same allocation corresponds to rolling back the healthy worker processing units to a common training iteration, namely the iteration of the last checkpoint, at which the failed worker processing unit is initialized and restarted.)
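The rollback mapped above can be illustrated with a short sketch: on a failure, every healthy worker is rolled back to the most recent training iteration for which all workers hold a stored checkpoint, and the failed worker restarts from that same iteration. The data and function names are hypothetical.

```python
# Hypothetical helper: given each worker's set of checkpointed iterations,
# find the latest iteration shared by all workers; rolling every worker
# back to that iteration yields the common restart point.
def common_rollback_iteration(worker_ckpt_iters):
    """worker_ckpt_iters: worker -> set of iterations with stored checkpoints."""
    shared = set.intersection(*worker_ckpt_iters.values())
    return max(shared)                 # latest iteration all workers share

# w2 failed after checkpointing iteration 20, so 20 is the common iteration
# even though w0 and w1 have progressed to iteration 30.
ckpts = {"w0": {10, 20, 30}, "w1": {10, 20, 30}, "w2": {10, 20}}
rollback_to = common_rollback_iteration(ckpts)
```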
Claims 22, 24, and 25, dependent on Claim 21, recite substantially the same limitations as Claims 6, 8, and 9, respectively, and thus are rejected for the reasons set forth with respect to those claims.
Claim 23, dependent on Claim 22, recites substantially the same limitations as Claim 7, and thus is rejected for the reasons set forth with respect to that claim.
Regarding Claim 26, Anthony teaches the claimed limitations of Claim 21, and the rejection of Claim 21 is incorporated herein. Anthony further teaches:
each agent is configured to store a maximum quantity of two or more most recent checkpoint states in the local memory for each associated worker processing unit that corresponds to the threshold number in which an oldest stored checkpoint state is deleted when a new checkpoint state is recorded upon reaching the maximum quantity. (Anthony [p.64 1st col last ¶] For both SCR-Exa and root checkpointing, we saved a checkpoint every epoch of DL training. We configured SCR-Exa to flush a checkpoint from node local storage to the parallel file system once every 10th checkpoint. The remaining checkpoints are saved to the local node’s storage only. Root checkpointing saves the DNN in the PFS every epoch.) (Note: the remaining checkpoints correspond to the threshold number of most recent checkpoint states.)
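The retention policy mapped above (each agent stores at most a threshold number of most-recent checkpoint states, deleting the oldest stored state when a new one is recorded at capacity) matches the drop-oldest behavior of a bounded queue. The sketch below is illustrative only; the `Agent` class and its names are hypothetical, and `collections.deque` with `maxlen` is used because it evicts the oldest element automatically.

```python
# Hypothetical agent keeping at most K most-recent checkpoint states per
# worker: appending at capacity silently discards the oldest stored state.
from collections import deque

class Agent:
    def __init__(self, max_states=2):          # K = threshold number (>= 2)
        self.states = deque(maxlen=max_states)

    def record(self, checkpoint):
        self.states.append(checkpoint)         # oldest evicted at capacity

agent = Agent(max_states=2)
for step, params in enumerate(["p0", "p1", "p2"]):
    agent.record((step, params))
# only the two most recent checkpoint states remain in local memory
```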
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOSEP HAN whose telephone number is (703)756-1346. The examiner can normally be reached Mon-Fri 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.H./Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122