DETAILED ACTION
This action is responsive to the Application filed on 10/20/2025. Claims 1-21 are pending in the case. Claims 1, 8, and 18 are independent claims. Claims 1, 3, 4, 8 and 15 are amended.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/20/2025 has been entered.
Response to Arguments
Applicant’s arguments (10/20/2025) with respect to claim(s) 1-21 have been considered but are not persuasive.
With respect to the rejection under prior art:
Applicant appears to argue that the claims require transferring the neural network and data from the first memory device to the second memory device based on a determination described in the claims.
Examiner notes that the cited art, Park, describes transferring neural network information (weights and data) from a parameter server to each worker. Parameter information from a worker is sent to and received from the parameter server, which subsequently processes it and forwards it to another worker. It is understood that Park does not explicitly teach transferring the neural network from the first memory device to the second memory device as claimed.
As argued by applicant, the remaining previously cited art does not overcome the asserted deficiencies.
The rejection is updated accordingly in view of Park, Narayanan, Mailthody, and Zhu.
Claim Rejections - 35 U.S.C. § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. §§ 102 and 103 (or as subject to pre-AIA 35 U.S.C. §§ 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2 and 5-21 are rejected under 35 U.S.C. § 103 as being unpatentable over Park et al., "HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism", in view of Narayanan et al., "PipeDream: Generalized Pipeline Parallelism for DNN Training".
Regarding claim 1
Park teaches, determining characteristics of a first memory device of a memory system, wherein the first memory device includes a first type of memory; writing a neural network to the first memory device; determining characteristics of a second memory device of the memory system, wherein the second memory device includes a second type of memory… (Section 3 pg 5 “To train DNN models based on pipelined model parallelism in virtual workers, the resource allocator first assigns k GPUs to each virtual worker… the resource allocation policy must consider several factors such as the performance of individual GPUs…Then, for the given DNN model and allocated k GPUs, the model partitioner divides the model into k partitions for the virtual worker …” Section 8.1 pg 10 “In our experiments, we use four nodes with two Intel Xeon Octa-core E5-2620 v4 processors (2.10 GHz) connected via InfiniBand (56 Gbps). Each node has 64 GB memory… Each node is configured with a different type of GPU as shown in Table 1” pg 2 multiple memory devices are assigned. GPUs include first and second memory types. These GPUs have different memory properties, such as size, bandwidth, and clock speed, and as such are considered first and second memory types as claimed. Allocating model partitions amounts to writing the neural network.) based on a determination that determining one or more second weights for the hidden layer of the neural network cannot be performed more efficiently on the second memory device: and based on a determination that determining one or more second weights for the hidden layer of the neural network can be performed more efficiently on the second memory device: (pg 5 “We refer to our system as HetPipe as it is heterogeneous, in GPUs, across and, possibly, within virtual workers and makes use of pipelining in virtual workers for resource efficiency….
Then, for the given DNN model and allocated k GPUs, the model partitioner divides the model into k partitions for the virtual worker such that the performance of the pipeline executed in the virtual worker can be maximized” pg 9 Section 7 “Recall that the goal of our partitioning algorithm is to minimize the maximum execution time of the partitions within the bounds of satisfying the memory requirement.” The model and subsequent training operations are performed based on determinations of the partitioner according to efficient performance. Training, as understood by one of ordinary skill in the art, is the determining or updating of weights in layers of the neural network. The system architecture in figure 2 pg 5 shows the network having multiple hidden layers. Further, DNNs, or deep neural networks, are characterized by inclusion of multiple hidden layers, each containing multiple weights.) using data corresponding to the neural network written to the first memory device, to determine one or more first weights for a hidden layer of the neural network;… using data corresponding to the neural network written to the first memory device to determine one or more second weights for the hidden layer of the neural network;… using the data corresponding to the neural network written to the second memory device to determine the one or more second weights for the hidden layer of the neural network. (pg 5 Section 3 “Then, for the given DNN model and allocated k GPUs….Each virtual worker has a local copy of the global weights and periodically synchronizes the weights with the parameter server” pg 7 “the virtual worker updates the local version of the weights, wlocal as wlocal = wlocal + up, where up is the updates computed by processing minibatch p.” updating weights periodically (i.e., for a first and second set of weights) amounts to determining new weights of hidden layers.)
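For context only, the local-update and periodic-synchronization scheme quoted above from Park (wlocal = wlocal + up, with periodic pushes to and pulls from a parameter server) can be sketched as follows. This is an illustrative sketch, not code from Park; the class and variable names are hypothetical.

```python
import numpy as np

class ParameterServer:
    """Holds the global weights and aggregates pushed updates."""

    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float).copy()

    def push(self, updates):
        # Aggregated updates from a virtual worker are applied globally.
        self.weights += updates

    def pull(self):
        return self.weights.copy()

class VirtualWorker:
    """Keeps a local copy of the global weights, applies minibatch
    updates locally (w_local = w_local + u_p), and periodically
    synchronizes with the parameter server."""

    def __init__(self, global_weights, sync_period):
        self.w_local = np.asarray(global_weights, dtype=float).copy()
        self.sync_period = sync_period
        self.pending = np.zeros_like(self.w_local)
        self.done = 0  # minibatches processed since start

    def process_minibatch(self, update):
        # w_local = w_local + u_p for minibatch p.
        self.w_local += update
        self.pending += update
        self.done += 1

    def maybe_sync(self, server):
        # Periodic synchronization: push aggregated updates, pull globals.
        if self.done % self.sync_period == 0:
            server.push(self.pending)
            self.w_local = server.pull()
            self.pending = np.zeros_like(self.w_local)
```

In this reading, each periodic pull produces the "first" and then "second" sets of weights for the hidden layers.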
Park does not explicitly teach, transferring the neural network and the data corresponding to the neural network from the first memory device to the second memory device;
Narayanan when addressing profiling the neural network operation on a single device teaches, transferring the neural network and the data corresponding to the neural network from the first memory device to the second memory device; (pg 2 “PipeDream automatically determines how to partition the operators of the DNN based on a short profiling run performed on a single GPU, balancing computational load among the different stages while minimizing communication for the target platform” pg 5 “PipeDream records three quantities for each layer l, using a short (few minutes) profiling run of 1000 minibatches on a single GPU… Our partitioning algorithm takes the output of the profiling step, and computes: 1) a partitioning of layers into stages, 2) the replication factor (number of workers) for each stage, and 3) optimal number of in-flight minibatches to keep the training pipeline busy… PipeDream’s optimizer solves dynamic programming problems progressively from the lowest to the highest level. Intuitively, this process finds the optimal partitioning within a server and then uses these partitions to split a model optimally across servers” based on the neural network and its data on a first profiling run on a first device the network is split or transferred from the first device to 2nd devices on the same or different servers.)
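The profile-then-partition flow quoted above from Narayanan can be illustrated with a much-simplified sketch. The dynamic program below splits profiled per-layer compute times into k contiguous stages so as to minimize the slowest stage; it is a hypothetical stand-in for PipeDream's actual optimizer (which also models communication, replication, and memory), and the function name is illustrative.

```python
def partition_layers(layer_times, k):
    """Split profiled per-layer times into k contiguous stages,
    minimizing the bottleneck (maximum) stage time.

    Returns (bottleneck_time, list of (start, end) stage boundaries).
    Assumes k <= len(layer_times). Simplified relative to PipeDream:
    communication cost and memory limits are ignored.
    """
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    INF = float("inf")
    # dp[s][i] = minimal bottleneck covering the first i layers with s stages.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for s in range(1, k + 1):
        for i in range(1, n + 1):
            for j in range(s - 1, i):
                stage_time = prefix[i] - prefix[j]
                cand = max(dp[s - 1][j], stage_time)
                if cand < dp[s][i]:
                    dp[s][i] = cand
                    cut[s][i] = j

    # Recover the stage boundaries from the cut table.
    bounds, i = [], n
    for s in range(k, 0, -1):
        j = cut[s][i]
        bounds.append((j, i))
        i = j
    return dp[k][n], list(reversed(bounds))
```

For example, four layers with times [1, 2, 3, 4] split into two stages as layers 0-2 and layer 3, giving a bottleneck of 6.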
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the multiple-GPU training system described by Park to automatically partition the model across workers to improve training speed, as described by Narayanan. One would have been motivated to make this combination because “Pipeline-parallel DNN training helps reduce the communication overheads that can bottleneck intra-batch parallelism” (Narayanan Section 6, Conclusion).
Regarding claim 2
Park/Narayanan teaches claim 1
Further Park teaches, receiving, by the first memory device, training data corresponding to the neural network; providing the training data to an input layer of the neural network; and writing, within the first memory device or the second memory device, or both, data associated with an output of the neural network. (Section 2.1 “Data parallelism (DP) utilizes multiple workers to speed up training of a DNN model. It divides the training dataset into subsets and assigns each worker a different subset. Each worker has a replica of the DNN model and processes each minibatch in the subset, thereby computing the weight updates.” Pg 5 Section 3 “The system that we propose in this paper leverages both pipelined model parallelism (PMP) and data parallelism (DP) to enable training of such large DNN models” pg 9 Section 7 “we first derive the amount of input data for each layer in the forward and backward pass from the model graph” training data is received. Deep neural networks are a type of model known by a PHOSITA to process training data via an input layer.)
Regarding claim 5
Park/Narayanan teaches claim 1
Further Park teaches, performing, prior to writing the data corresponding to the neural network to the second memory device, an operation to select particular vectors from the data corresponding to the neural network; and writing the particular vectors from the data corresponding to the neural network to the second memory device. (Section 2.1 “Data parallelism (DP) utilizes multiple workers to speed up training of a DNN model. It divides the training dataset into subsets and assigns each worker a different subset.” Assigning a subset amounts to selecting particular vectors of training data)
Regarding claim 6
Park/Narayanan teaches claim 1
Further Park teaches, wherein the first memory device or the second memory device has a higher data processing bandwidth than the other of the first memory device or the second memory device. (pg 10 Section 8.1 “Each node is configured with a different type of GPU as shown in Table 1.” Table 1 [reproduced as media_image1.png, greyscale] shows the various bandwidths.)
Regarding claim 7
Park/Narayanan teaches claim 1
Further Park teaches, storing a copy of the data corresponding to a first state of the neural network in the first memory device or the second memory device, or both; (pg 5 Section 3 “As any typical DP, multiple virtual workers must periodically synchronize the global parameters via parameter servers or AllReduce communication; Each virtual worker has a local copy of the global weights and periodically synchronizes the weights with the parameter server” the local copy is the copy stored on both the first and second memory device.) determining that the first state of the neural network has been updated to a second state of the neural network; and deleting the copy of the data corresponding to the first state of the neural network in response to determining that the first state of the neural network has been updated to the second state. ( pg 7 “Let local staleness be the maximum number of missing updates from the most recent minibatches that is allowed for a minibatch to proceed in a virtual worker…For example, as shown in Figure 1 where slocal = 3, at the end of clock 0, the virtual worker pushes the aggregated updates of wave 0, which is composed of minibatches from 1 to 4, and at the end of clock 1, the aggregated updates of wave 1, which is composed of minibatches from 5 to 8, and so on.” pg 8 “When a virtual worker pulls the global weights at the end of clock c to maintain this distance, it may need to wait for other virtual workers to push their updates upon completion of wave c−D.” pulling new weights amounts to determining that the first weights or states need to be updated or synchronized after at the end of the clock. Synchronization and updating is understood to mean that old data is replaced with new data, i.e the old data is deleted at some point.)
Regarding claim 8
Park teaches, An apparatus, comprising: a memory system; a first memory device of the memory system; a second memory device of the memory system coupled to the first memory device of the memory system; and a processing device coupled to the first memory device and the second memory device, the processing device to: (pg 5 “Figure 2 shows the architecture of the proposed cluster system composed of H nodes. Each node comprises a homogeneous set of GPUs, but the GPUs (and memory capacity) of the nodes themselves can be heterogeneous”) determine characteristics of the first memory device, wherein the first memory device includes a first type of memory; determine that the characteristics of the first memory device are conducive for a training operation for a neural network and write the neural network to the first memory device; determine characteristics of the second memory device, wherein the second memory device includes a second type of memory (Section 3 pg 5 “To train DNN models based on pipelined model parallelism in virtual workers, the resource allocator first assigns k GPUs to each virtual worker… the resource allocation policy must consider several factors such as the performance of individual GPUs …Then, for the given DNN model and allocated k GPUs, the model partitioner divides the model into k partitions for the virtual worker … such that the performance of the pipeline executed in the virtual worker can be maximized” Section 8.1 pg 10 “In our experiments, we use four nodes with two Intel Xeon Octa-core E5-2620 v4 processors (2.10 GHz) connected via InfiniBand (56 Gbps). Each node has 64 GB memory… Each node is configured with a different type of GPU as shown in Table 1” pg 2 multiple memory devices are assigned. GPUs include first and second memory types, as noted in the rejection of claim 1. 
Allocating model partitions for training based on several device characteristic factors amounts to determining characteristics are conducive for the training to maximize performance) based on a determination that determining one or more second weights for the hidden layer of the neural network cannot be performed more efficiently on the second memory device: and based on a determination that determining one or more second weights for the hidden layer of the neural network can be performed more efficiently on the second memory device (pg 5 “We refer to our system as HetPipe as it is heterogeneous, in GPUs, across and, possibly, within virtual workers and makes use of pipelining in virtual workers for resource efficiency…. Then, for the given DNN model and allocated k GPUs, the model partitioner divides the model into k partitions for the virtual worker such that the performance of the pipeline executed in the virtual worker can be maximized” pg 9 Section 7 “Recall that the goal of our partitioning algorithm is to minimize the maximum execution time of the partitions within the bounds of satisfying the memory requirement.” The model and subsequent training operations are performed based on determinations of the partitioner according to efficient performance.) determine, on the second memory device, one or more first weights for a hidden layer of the neural network, determine, on the first memory device, one or more second weights for the hidden layer of the neural network; determine, on the second memory device, the one or more second weights for the hidden layer of the neural network.
(pg 5 Section 3 “Then, for the given DNN model and allocated k GPUs….Each virtual worker has a local copy of the global weights and periodically synchronizes the weights with the parameter server” pg 7 “the virtual worker updates the local version of the weights, wlocal as wlocal = wlocal + up, where up is the updates computed by processing minibatch p.” updating weights periodically (i.e for a first and second set of weights) amounts to determining new weights of hidden layers.)
Park does not explicitly teach, transfer the neural network from the first memory device to the second memory device subsequent to the determination of the one or more first weights for the hidden layer of the neural network
Narayanan when addressing profiling the neural network operation on a single device teaches, transfer the neural network from the first memory device to the second memory device subsequent to the determination of the one or more first weights for the hidden layer of the neural network (pg 2 “PipeDream automatically determines how to partition the operators of the DNN based on a short profiling run performed on a single GPU, balancing computational load among the different stages while minimizing communication for the target platform” pg 5 “PipeDream records three quantities for each layer l, using a short (few minutes) profiling run of 1000 minibatches on a single GPU… Our partitioning algorithm takes the output of the profiling step, and computes: 1) a partitioning of layers into stages, 2) the replication factor (number of workers) for each stage, and 3) optimal number of in-flight minibatches to keep the training pipeline busy… PipeDream’s optimizer solves dynamic programming problems progressively from the lowest to the highest level. Intuitively, this process finds the optimal partitioning within a server and then uses these partitions to split a model optimally across servers” based on the neural network and its data on a first profiling run on a first device the network is split or transferred from the first device to 2nd devices on the same or different servers. The partitioning decision is made subsequent to prior partitioning decisions of first weights in a prior stage.)
Park and Narayanan are combined for the reasons set forth in the rejection of claim 1
Regarding claim 9
Park/Narayanan teaches claim 8
Park teaches, the first memory device has a first bandwidth associated therewith; and the second memory device has a second bandwidth associated therewith, the second bandwidth being greater than the first bandwidth. (pg 10 Section 8.1 “Each node is configured with a different type of GPU as shown in Table 1.” Table 1 [reproduced as media_image1.png, greyscale] shows the various bandwidths, which include a device with greater bandwidth.)
Regarding claim 10
Park/Narayanan teaches claim 8
Park teaches, determine the one or more first weights for the hidden layer of the neural network as part of performance of a first level of training the neural network; determine the one or more second weights for the hidden layer of the neural network as part of performance of a second level of training the neural network. ( pg 12 “From the results, we can see that the performance of both Horovod and HetPipe increases when additional whimpy GPUs are used for training. With additional GPUs, HetPipe can increase the total number of concurrent minibatches processed, having up to 2.3 times speedup. This scenario can be thought of as an answer to when new, higher end nodes are purchased, but one does not know what to do with existing nodes” the memory devices perform at different levels; low-end, low-level devices are utilized, i.e., perform, in tandem with higher-end devices of a higher level.)
Regarding claim 11
Park/Narayanan teaches claim 8
Park teaches, wherein the first memory device comprises a processing unit resident thereon, and wherein the processing unit is to cause performance of an operation to pre-process data corresponding with the neural network prior to the data corresponding to the neural network being written to the second memory device. ( pg 5 figure 1 caption “Figure 1: Pipeline execution of minibatches where Mp,k indicates the execution of a minibatch p in partition k, which is executed in GPUk and the yellow and green colors indicate the forward and backward passes, respectively” [figure reproduced as media_image2.png, greyscale] As shown in the figure, pre-processing of initial iterations is performed in wave zero before parameter updates, or “data corresponding to the neural network”, are written to the GPUs for the second wave.)
Regarding claim 12
Park/Narayanan teaches claim 8
Park teaches, wherein the processing device is to: write data corresponding to the neural network to the first memory device subsequent to the determination of the second weights for the hidden layer of the neural network and determine, by the first memory device, one or more third weights for the hidden layer of the neural network. (pg 5 Section 3 “Then, for the given DNN model and allocated k GPUs….Each virtual worker has a local copy of the global weights and periodically synchronizes the weights with the parameter server” pg 7 “the virtual worker updates the local version of the weights, wlocal as wlocal = wlocal + up, where up is the updates computed by processing minibatch p.” updating weights periodically (i.e., for first, second, and third sets of weights) amounts to an operation on at least the first device including determining new weights of hidden layers.)
Regarding claim 13
Park/Narayanan teaches claim 8
Park teaches, write a copy of data corresponding to a first data state associated with the neural network to the first memory device or the second memory device, or both; (pg 5 Section 3 “As any typical DP, multiple virtual workers must periodically synchronize the global parameters via parameter servers or AllReduce communication; Each virtual worker has a local copy of the global weights and periodically synchronizes the weights with the parameter server” the local copy is the copy stored on both the first and second memory device.) determine that the first data state associated with the neural network written to the first memory device or the second memory device, or both, has been updated to a second data state associated with the neural network; and delete the copy of the data corresponding to the first data state in response to determining that the first data state has been updated to the second data state. ( pg 7 “Let local staleness be the maximum number of missing updates from the most recent minibatches that is allowed for a minibatch to proceed in a virtual worker…For example, as shown in Figure 1 where slocal = 3, at the end of clock 0, the virtual worker pushes the aggregated updates of wave 0, which is composed of minibatches from 1 to 4, and at the end of clock 1, the aggregated updates of wave 1, which is composed of minibatches from 5 to 8, and so on.” pg 8 “When a virtual worker pulls the global weights at the end of clock c to maintain this distance, it may need to wait for other virtual workers to push their updates upon completion of wave c−D.” pulling new weights amounts to determining that the first weights or states need to be updated or synchronized at the end of the clock. Synchronization and updating is understood to mean that old data is replaced with new data, i.e., the old data is deleted at some point.)
Regarding claim 14
Park/Narayanan teaches claim 13
Park teaches, determine that an error involving the neural network has occurred; retrieve a copy of data corresponding to the second data state from the first memory device or the second memory device, or both; and perform an operation to recover the neural network using the copy of the data corresponding to the second data state. (pg 9 “The set Et and the noisy weight parameter w˜t are defined similarly and the difference between wt and w˜t is [equation reproduced as media_image3.png] where Rt is the index set of missing updates in the reference weight parameter but not in noisy weight parameter…the parameter learned from the synchronized update is R[W]… [equation reproduced as media_image4.png] Thus, when we bound the regret of the two functions, we can bound the error of the noisy updates incurred by the distributed pipeline staleness gradient descent”. The error R involving the neural network is computed during the synchronization. The weights are updated or recovered on the devices in the system based on the copy of data.)
Regarding claim 15
Park teaches, control circuitry comprising a processing device and a memory resource configured to operate as a cache for the processing device; and a memory system comprising a plurality of memory devices coupled to the control circuitry, wherein the control circuitry is to:
(pg 5 “Figure 2 shows the architecture of the proposed cluster system composed of H nodes. Each node comprises a homogeneous set of GPUs, but the GPUs (and memory capacity) of the nodes themselves can be heterogeneous” pg 8 Section 8.1 “In our experiments, we use four nodes with two Intel Xeon Octa-core E5-2620 v4 processors (2.10 GHz) connected via InfiniBand (56 Gbps). Each node has 64 GB memory and 4 homogeneous GPUs. Each node is configured with a different type of GPU as shown in Table 1.” The GPU memory operates as a memory resource with associated cache memory) determine characteristics of a first memory device among the plurality of memory devices, wherein the first memory device includes a first type of memory write data corresponding to a neural network to the first memory device; determine characteristics of a second memory device among the plurality of memory devices, wherein the second memory device includes a second type of memory (Section 3 pg 5 “To train DNN models based on pipelined model parallelism in virtual workers, the resource allocator first assigns k GPUs to each virtual worker… the resource allocation policy must consider several factors such as the performance of individual GPUs …Then, for the given DNN model and allocated k GPUs, the model partitioner divides the model into k partitions for the virtual worker … such that the performance of the pipeline executed in the virtual worker can be maximized” Section 8.1 pg 10 “In our experiments, we use four nodes with two Intel Xeon Octa-core E5-2620 v4 processors (2.10 GHz) connected via InfiniBand (56 Gbps). Each node has 64 GB memory… Each node is configured with a different type of GPU as shown in Table 1” pg 2 multiple memory devices are assigned. GPUs include first and second media or memory types.
Allocating model partitions for training based on several device characteristic factors amounts to determining characteristics are conducive for the training to maximize performance) based on a determination that determining one or more second weights for the hidden layer of the neural network cannot be performed more efficiently on the second memory device: based on a determination that determining one or more second weights for the hidden layer of the neural network can be performed more efficiently on the second memory device: (pg 5 “We refer to our system as HetPipe as it is heterogeneous, in GPUs, across and, possibly, within virtual workers and makes use of pipelining in virtual workers for resource efficiency…. Then, for the given DNN model and allocated k GPUs, the model partitioner divides the model into k partitions for the virtual worker such that the performance of the pipeline executed in the virtual worker can be maximized” pg 9 Section 7 “Recall that the goal of our partitioning algorithm is to minimize the maximum execution time of the partitions within the bounds of satisfying the memory requirement.” The model and subsequent training operations, or determining of weights of the hidden layers, are performed based on determinations of the partitioner according to efficient performance.) cause, while the neural network is stored in the first memory device, a determination of one or more first weights for a hidden layer of the neural network to be performed; perform, on the first memory device and using data corresponding to the neural network written to the first memory device, a determination of one or more second weights for the hidden layer of the neural network; and write the data corresponding to the neural network to a second memory device; and cause, while the neural network is stored in the second memory device, a determination of the one or more second weights for the hidden layer of the neural network to be performed.
(pg 5 Section 3 “Then, for the given DNN model and allocated k GPUs….Each virtual worker has a local copy of the global weights and periodically synchronizes the weights with the parameter server” pg 7 “the virtual worker updates the local version of the weights, wlocal as wlocal = wlocal + up, where up is the updates computed by processing minibatch p.” updating weights periodically (i.e for a first and second set of weights) amounts to performing a training operation on at least the first device including determining new weights of hidden layers.)
Park does not explicitly teach, transfer the neural network and the data corresponding to the neural network from the first memory device to the second memory device;
Narayanan when addressing profiling the neural network operation on a single device teaches, transfer the neural network and the data corresponding to the neural network from the first memory device to the second memory device; (pg 2 “PipeDream automatically determines how to partition the operators of the DNN based on a short profiling run performed on a single GPU, balancing computational load among the different stages while minimizing communication for the target platform” pg 5 “PipeDream records three quantities for each layer l, using a short (few minutes) profiling run of 1000 minibatches on a single GPU… Our partitioning algorithm takes the output of the profiling step, and computes: 1) a partitioning of layers into stages, 2) the replication factor (number of workers) for each stage, and 3) optimal number of in-flight minibatches to keep the training pipeline busy… PipeDream’s optimizer solves dynamic programming problems progressively from the lowest to the highest level. Intuitively, this process finds the optimal partitioning within a server and then uses these partitions to split a model optimally across servers” based on the neural network and its data on a first profiling run on a first device the network is split or transferred from the first device to 2nd devices on the same or different servers.)
Park and Narayanan are combined for the reasons set forth in the rejection of claim 1
Regarding claim 16
Park/Narayanan teaches claim 15
Park teaches, wherein the control circuitry is to: write the data corresponding to the neural network to the first memory device based on a determination that at least one characteristic of the first memory device meets a first set of criterion; and write the data corresponding to the neural network to the second memory device based on a determination that at least one characteristic of the second memory device meets a second set of criterion.(pg 7-8 “Recall that the goal of our partitioning algorithm is to minimize the maximum execution time of the partitions within the bounds of satisfying the memory requirement. To obtain a performance model to predict the execution time of each layer of a model in a heterogeneous GPU, we first profile the DNN model on each of the different types of GPUs in a cluster, where we measure the computation time of each layer of the model. For GPU memory usage, we measure the usage of each layer…To find the best partitions of a DNN model, we make use of CPLEX, which is an optimizer…The algorithm will return partitions for a model with a certain batch size only if it finds partitions that meet the memory requirement for the given GPUs.” Partitions of the neural network, or data, is allocated to the different GPUs based on the devices meeting memory and speed (i.e. first and second criteria) to optimize execution time.)
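The criteria-based placement reading set forth above (partitions assigned only to devices meeting memory and speed criteria) can be illustrated with a hypothetical sketch; the function and field names are illustrative and stand in for the memory-requirement check in Park's partitioning algorithm.

```python
def assign_partition(partition_mem_gb, devices):
    """Assign a model partition to a device whose characteristics meet
    the criteria: the device must have enough memory capacity, and among
    eligible devices the one with the highest bandwidth is preferred.

    devices: list of dicts with 'name', 'mem_gb', and 'bandwidth' keys
    (hypothetical characteristics, e.g. profiled per GPU type).
    Returns the chosen device name, or None if no device qualifies,
    mirroring the algorithm returning partitions only when the memory
    requirement is met.
    """
    eligible = [d for d in devices if d["mem_gb"] >= partition_mem_gb]
    if not eligible:
        return None
    return max(eligible, key=lambda d: d["bandwidth"])["name"]
```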
Regarding claim 17
Park/Narayanan teaches claim 15
Park teaches, write a copy of data corresponding to a first data state associated with the neural network to the first memory device or the second memory device, or both; (pg 5 Section 3 “As any typical DP, multiple virtual workers must periodically synchronize the global parameters via parameter servers or AllReduce communication; Each virtual worker has a local copy of the global weights and periodically synchronizes the weights with the parameter server” The local copy is the copy stored on both the first and second memory device.) determine that the first data state associated with the neural network written to the first memory device or the second memory device, or both, has been updated to a second data state associated with the neural network; delete the copy of the data corresponding to the first data state in response to determining that the first data state has been updated to the second data state (pg 7 “Let local staleness be the maximum number of missing updates from the most recent minibatches that is allowed for a minibatch to proceed in a virtual worker…For example, as shown in Figure 1 where slocal = 3, at the end of clock 0, the virtual worker pushes the aggregated updates of wave 0, which is composed of minibatches from 1 to 4, and at the end of clock 1, the aggregated updates of wave 1, which is composed of minibatches from 5 to 8, and so on.” pg 8 “When a virtual worker pulls the global weights at the end of clock c to maintain this distance, it may need to wait for other virtual workers to push their updates upon completion of wave c−D.” Pulling new weights amounts to determining that the first weights or states need to be updated or synchronized at the end of the clock. Synchronization and updating is understood to mean that old data is replaced with new data, i.e., the old data is deleted at some point.)
determine that an error involving the neural network has occurred; retrieve a copy of data corresponding to the second data state from the first memory device or the second memory device, or both; and perform an operation to recover the neural network using the copy of the data corresponding to the second data state. (pg 9 “The set Et and the noisy weight parameter w̃t are defined similarly and the difference between wt and w̃t is
[Equation image: media_image3.png]
where Rt is the index set of missing updates in the reference weight parameter but not in noisy weight parameter…the parameter learned from the synchronized update is R[W]…
[Equation image: media_image4.png]
Thus, when we bound the regret of the two functions, we can bound the error of the noisy updates incurred by the distributed pipeline staleness gradient descent”. The error R involving the neural network is computed during the synchronization. The weights are updated or recovered on the devices in the system based on the copy of data.)
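The copy/update/delete and error-recovery flow addressed for claim 17 can be sketched, for illustration only, with the following minimal model. This is my own toy construction, not code from Park: a copy of the current data state is retained, the stale copy is deleted when the state advances, and the retained copy is used to recover after an error.

```python
# Illustrative sketch only (my own minimal model, not Park's code) of the
# claimed flow: retain a copy of the current data state, delete the stale
# copy when the state is updated, and recover from the retained copy.

class StateStore:
    def __init__(self):
        self.copies = {}                       # state_id -> copied data

    def write_copy(self, state_id, data):
        self.copies[state_id] = dict(data)     # keep a copy of this state

    def update(self, old_id, new_id, data):
        self.write_copy(new_id, data)          # first state updated to second
        self.copies.pop(old_id, None)          # delete the stale copy

    def recover(self, state_id):
        # on detecting an error, retrieve the retained copy to restore from
        return dict(self.copies[state_id])

store = StateStore()
store.write_copy("state_1", {"w": [0.1, 0.2]})
store.update("state_1", "state_2", {"w": [0.15, 0.18]})
restored = store.recover("state_2")
```

After the update, only the second data state remains in the store, and recovery proceeds from that copy, consistent with the claimed delete-then-recover sequence.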
Regarding claim 18
Park/Narayanan teaches claim 15
Park teaches, wherein the first memory device has a first bandwidth associated therewith and the second memory device has a second bandwidth associated therewith, the first bandwidth being lower than the second bandwidth. (pg 10 Section 8.1 “Each node is configured with a different type of GPU as shown in Table 1.” Table 1 is provided, which shows the various bandwidths, including a device with a greater bandwidth than another.
[Table image: media_image1.png])
Regarding claim 19
Park/Narayanan teaches claim 15
Park teaches, the first memory device has a first capacity associated therewith and the second memory device has a second capacity associated therewith, the first capacity being greater than the second capacity. (pg 10 Section 8.1 “Each node is configured with a different type of GPU as shown in Table 1.” Table 1 is provided, which shows the various memory capacities; the second device has a greater capacity than the first.
[Table image: media_image1.png])
Regarding claim 20
Park/Narayanan teaches claim 15
Park teaches, wherein the first memory device has a first latency associated therewith and the second memory device has a second latency associated therewith, the first latency being greater than the second latency. (pg 10 Section 8.1 “Each node is configured with a different type of GPU as shown in Table 1.” Table 1 is provided, which shows the various clock speeds, i.e., latencies.
[Table image: media_image1.png])
Regarding claim 21
Park/Narayanan teaches claim 15
Park teaches, subsequent to writing the data corresponding to the neural network to the second memory device, write observed data to the first memory device; and execute the neural network on the second memory device using the observed data written to the first memory device. (pg 5 figure 1 caption “Figure 1: Pipeline execution of minibatches where Mp,k indicates the execution of a minibatch p in partition k, which is executed in GPUk and the yellow and green colors indicate the forward and backward passes, respectively”
[Figure image: media_image2.png]
As shown in the figure, data is first written to a first GPU, G1; the subsequent activations and weight updates are used by the other GPUs.)
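For illustration only, the write-then-execute flow addressed for claim 21 can be sketched with the following toy model. The names and values are hypothetical, not from the claims or Park: observed input is written to the first memory device, and the network resident on the second device executes against that input.

```python
# Minimal sketch (hypothetical names and values, not from the claims or
# Park): observed data is written to the first device; the network on the
# second device executes using that observed data.

first_device = {}                           # holds observed (input) data
second_device = {"weights": [2.0, -1.0]}    # holds the transferred network

def write_observed(sample):
    first_device["observed"] = sample

def execute():
    # a dot product stands in for executing the neural network
    x = first_device["observed"]
    w = second_device["weights"]
    return sum(a * b for a, b in zip(x, w))

write_observed([3.0, 1.0])
result = execute()    # 3*2 + 1*(-1) = 5.0
```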
Claim(s) 3 is rejected under 35 U.S.C. § 103 as being unpatentable over Park/Narayanan, further in view of Mailthody et al. “DeepStore: In-Storage Acceleration for Intelligent Queries”
Regarding claim 3
Park/Narayanan teaches claim 1
Narayanan teaches, and wherein the second memory device is a volatile memory device (pg 10 Section 6 “Our evaluation was conducted on a GPU farm with 8 hosts. Each host had 4 Nvidia 2080TI GPUs, 20 CPU cores… and 64 GB RAM. Each GPU had 11 GB physical memory” RAM is volatile memory. Examiner notes that while not explicitly taught, one of ordinary skill in the art would understand that the described GPUs include at least nominal amounts of non-volatile memory such as EEPROM. Further, it is noted that the claim does not require the memory device to consist of only a single type of memory. Rather, the BRI requires only that the device include at least the named type of memory to justify the label in the claim.)
Nevertheless, Park/Narayanan does not explicitly teach, wherein the first memory device is a non-volatile memory device.
Mailthody, however, when addressing SSD (i.e., NAND) memory used with GPUs for neural network applications, teaches, wherein the first memory device is a non-volatile memory device (pg 3 “to achieve high-performance intelligent queries against massive datasets, a common approach is to use GPUs in-conjunction with SSDs for fast data retrieval and parallel similarity comparison. In this paper, we conduct the first characterization study of typical intelligent query workloads. We use two recent generations of high-end NVIDIA GPUs to run the query and a high-end NVMe SSD for data storage” Section 2.2 “SSDs have a large number of dense NAND flash memory elements organized into multiple levels”. The GPUs combined with SSD storage together form a computing device that is properly called a non-volatile memory device because it includes non-volatile NAND memory.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the multiple GPU training system described by Park to include non-volatile memory for faster, more efficient computation as described by Mailthody. One would have been motivated to make this combination because, as noted by Mailthody: “our characterization with various intelligent-query workloads developed with deep neural networks (DNNs), shows that the storage I/O bandwidth is still the major bottleneck… DeepStore exploits SSD parallelisms with design space exploration for achieving the maximal energy efficiency for in-storage accelerators… improves the query performance by up to 17.7×” (Mailthody, abstract, pg 1).
Claim(s) 4 is rejected under 35 U.S.C. § 103 as being unpatentable over Park/Narayanan/Mailthody, further in view of Zhu et al. “Performance Evaluation and Optimization of HBM-Enabled GPU for Data-Intensive Applications”
Regarding claim 4
Park/Narayanan/Mailthody teaches claim 3
Mailthody teaches, wherein the non-volatile memory device is a NAND memory device (pg 3 “a common approach is to use GPUs in-conjunction with SSDs … We use two recent generations of high-end NVIDIA GPUs to run the query and a high-end NVMe SSD for data storage” Section 2.2 “SSDs have a large number of dense NAND flash memory elements organized into multiple levels”. As noted above, the GPUs combined with SSD storage together form a computing device that is properly called a non-volatile memory device because it includes non-volatile NAND memory.)
Park/Narayanan/Mailthody does not explicitly teach, and wherein the volatile memory device is a 3D stacked SDRAM memory device.
Zhu, however, when addressing HBM memory for GPUs in data-intensive applications such as neural network training, teaches, and wherein the volatile memory device is a 3D stacked SDRAM memory device (abstract “a new memory technology called high-bandwidth memory (HBM) based on 3-D die-stacking technology… In this paper, we implement two representative data-intensive applications, convolutional neural network (CNN) and breadth-first search (BFS) on an HBM enabled GPU to evaluate the improvement brought by the adoption of the HBM” pg 1 “HBM is a new type of stacked DRAM memory that vertically integrates multiple memory dies” The GPU used for the neural network workload includes 3-D stacked RAM. As noted previously, RAM is volatile memory. One of ordinary skill in the art would understand that HBM is a 3D-stacked synchronous dynamic random-access memory.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the multiple GPU training system described by Park to include the volatile 3-D stacked memory described by Zhu. One would have been motivated to make this combination because, as noted by Zhu, “Experimental results demonstrate that our pipelined CNN training achieves a 1.63× speedup on an HBM-enabled GPU compared with the best high performance GPU in market”.
Conclusion
Prior art not relied upon: Liu et al. “Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach” describes neural network training with heterogeneous memory systems as claimed.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached M-F 7:30-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.R.G./
Examiner, Art Unit 2122
/KAKALI CHAKI/ Supervisory Patent Examiner, Art Unit 2122