Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This office action is in response to the preliminary amendment filed on 07/31/2023. By this amendment, Claims 1-12 and 14-19 are amended. Therefore, Claims 1-20 are pending for examination.
Priority
Acknowledgment is made of applicant’s claims for priority from foreign application nos. EP22188051.1, EP22386054.5, and EP22188053.7, filed 08/01/2022; GB2214192.3, filed 09/28/2022; and provisional application no. 63394053, filed 08/01/2022.
Claim Objections
Claims 9-12 and 15-18 are objected to because of the following informalities:
Regarding Claim 9, Line 7 states “the at least subset of second machine learning data”. Since the claim previously recites “the at least a second subset of machine learning data,” the limitation is being interpreted as such for examination.
Regarding Claim 10, Line 4 states “the dispatch circuitry configured to cause each”. It should state “the dispatch circuitry is configured to cause each,” and is being interpreted as such for examination.
Regarding Claim 15, Lines 3-6 state “the snoop filter circuitry configured to store, in association with each entry, for coherent traffic coherency state, and for non-coherent traffic to store, an indication of whether that entry is to be broadcast by the broadcast circuitry.” It should state “the snoop filter circuitry configured to store, in association with each entry for coherent traffic, coherency state; and for non-coherent traffic, to store an indication of whether that entry is to be broadcast by the broadcast circuitry,” and is being interpreted as such for examination.
Regarding Claim 17, Line 9 states “configured to process the job list generate by the”. It should state “configured to process the job list generated by the,” and is being interpreted as such for examination. Furthermore, Lines 11-13 state “the job manager circuitry is configured to determine available plurality of processing circuits and using the job list dispatch tasks to at least a subset of the plurality of processor circuits.” It should state “the job manager circuitry is configured to: determine availability of the plurality of processing circuits, and use the job list to dispatch tasks to at least a subset of the plurality of processor circuits,” and is being interpreted as such for examination.
Regarding Claim 18, Lines 6-9 state “and the available storage circuitry is associated with each of the processing circuits; and the number of at least a subset of the plurality of processing circuits processing the layer.” It should state “and the available storage circuitry is associated with: each of the processing circuits, and the number of at least a subset of the plurality of processing circuits processing the layer,” and is being interpreted as such for examination.
Any claims not specifically addressed above are objected to due to their dependence upon a claim objected to above.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 8 and 18 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding Claim 8, it recites the limitation “the local-storage circuitry” on line 5. There is insufficient antecedent basis for this limitation in the claim, as there is no “local-storage circuitry” previously defined in the claim, nor in Claim 1, on which it depends. For examination, this is being interpreted as the same “local-storage circuitry” in the at least a subset of the plurality of processor circuits, as defined in Claim 3, but it should be noted that Claim 8 does not depend on this claim and therefore still requires correction. Additionally, it recites the limitation “the first machine learning data” on line 6. There is insufficient antecedent basis for this limitation in the claim, as there is no “first machine learning data” previously defined in the claim, nor in Claim 1, on which it depends. For examination, this is being interpreted as “the at least a first subset of machine learning data”, previously mentioned in Claim 1.
Regarding Claim 18, it recites the limitation “associated with each of the processing circuits” on Line 6. There is insufficient antecedent basis for this limitation in the claim, as there is no “processing circuits” previously defined in the claim, nor in Claim 17, on which it depends, and it is unclear to what this term refers. For examination, this is being interpreted as “associated with each of the processing circuits of the at least subset of the plurality of processing circuits”. Additionally, on Lines 2-4, it recites “map broadcast mode and kernel broadcast mode” and “in dependence on the size of the kernel and feature map associated with the layer”, but there is no “map broadcast mode,” “kernel broadcast mode,” “size of the kernel” or “feature map” recited previously in the claim, nor in Claim 17. For examination, these are being interpreted as the same “map broadcast mode,” “kernel broadcast mode,” “kernel,” and “feature map” mentioned in Claim 12, but it should be noted that Claim 18 does not depend on this claim and therefore still requires correction.
Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-14 and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Sridharan et al. (US 20180322386 A1) in view of Das et al. (US 11373266 B2), hereinafter referred to as Sridharan and Das, respectively.
Regarding Claim 1, Sridharan discloses A processor comprising: a plurality of processor circuits to perform a machine learning process ([0186] The parallel processing platforms used for machine learning can be divided into training platforms and deployment platforms. Training platforms are generally highly parallel and include optimizations to accelerate multi-GPU single node training and multi-node, multi-GPU training. Exemplary parallel processors suited for training include the highly-parallel general-purpose graphics processing unit 700 of FIG. 7 and the multi-GPU computing system 800 of FIG. 8. Please note the parallel processors suited for training of machine learning with multi-node training correspond to Applicant’s processor comprising a plurality of processor circuits to perform a machine learning process.);
and broadcast circuitry configured to broadcast data to at least a subset of the plurality of processor circuits ([0196] model parallelism can be implemented in which the model or set of weights is split across multiple nodes. Generally, model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), Please note that the input data for model parallelism being split and sent for computation on different nodes corresponds to Applicant’s broadcast circuitry configured to broadcast data to at least a subset of the plurality of processor circuits, as there must inherently be a mechanism for the transmission of the split sections of the model to the nodes, corresponding to the subset of the plurality of processor circuits, for computation.);
and broadcast the at least a first subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits ([0196] model parallelism can be implemented in which the model or set of weights is split across multiple nodes. Generally, model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), Please note that the input data for model parallelism being split and sent for computation on different nodes corresponds to Applicant’s broadcasting the at least a first subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits.).
Sridharan does not explicitly disclose wherein the processor is configured to: obtain at least a first subset of machine learning data from memory to storage circuitry;
However, Das discloses wherein the processor is configured to: obtain at least a first subset of machine learning data from memory to storage circuitry (Col. 31, Lines 8-16- For example, each layer of a neural network can be trained by a different processing node of the distributed system. The benefits of model parallelism include the ability to scale to particularly large models. Splitting the computations associated with different layers of the neural network enables the training of very large neural networks in which the weights of all layers would not fit into the memory of a single computational node. Please note that splitting the computations associated with different layers of the neural network to be trained by a different processing node of the distributed system, where it would not fit into the memory of a single node, corresponds to Applicant’s obtaining a first subset of machine learning data from memory to storage circuitry, as the model is obtained from one central memory location and split into subsets to be stored in the storage circuitry of the nodes.);
Sridharan and Das are both considered to be analogous to the claimed invention because they are in the same field of distributed machine learning utilizing computer processors. Therefore, it would have been obvious to someone of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Sridharan to incorporate the teachings of Das, such that the system, whose broadcast circuitry broadcasts a first subset of machine learning data from storage circuitry to a subset of the plurality of processor circuits, also obtains the first subset of machine learning data from memory to storage circuitry, allowing for improved resource and memory management due to the nature of the distributed network, as described in Das.
Regarding Claim 2, Sridharan-Das as described in Claim 1, Sridharan further discloses fetch circuitry, the fetch circuitry configured to fetch or stream machine learning data from memory to storage circuitry ([0099] In one embodiment, the accelerator integration circuit 436 includes a fetch unit 491 to fetch commands, instructions, work descriptors, etc., that define operations to be performed. In one implementation, a cache 438 stores commands and data for efficient access by the graphics processing engines 431-432, N. Please note that the fetch unit 491 that fetches commands defining operations to be performed and stores them in a cache for efficient access correspond to Applicant’s fetch circuitry configured to fetch machine learning data from memory to storage circuitry.).
Regarding Claim 3, Sridharan-Das as described in Claim 1, Sridharan further discloses the broadcast circuitry is configured to broadcast first machine learning data from storage circuitry to local-storage circuitry in the at least a subset of the plurality of processor circuits ([0149] The GPGPU 700 receives commands from the host processor and uses a global scheduler 704 to distribute execution threads associated with those commands to a set of compute clusters 706A-706H. The compute clusters 706A-706H share a cache memory 708. The cache memory 708 can serve as a higher-level cache for cache memories within the compute clusters 706A-706H. Please note that distributing execution threads associated with commands dispatched by a host processor to a set of compute clusters 706A-706H, each having cache memories, corresponds to Applicant’s broadcast circuitry configured to broadcast first machine learning data, i.e., received commands for training deep neural networks, from storage circuitry to local-storage circuitry in the at least a subset of the plurality of processor circuits, i.e., their respective caches.).
Regarding Claim 5, Sridharan-Das as described in Claim 1, Sridharan further discloses the storage circuitry is a cache ([0074] In one embodiment, the graphics multiprocessor 234 includes an internal cache memory to perform load and store operations. Please note the internal cache memory of the graphics multiprocessor 234 corresponds to Applicant’s storage circuitry being a cache.).
Regarding Claim 6, Sridharan-Das as described in Claim 1, Sridharan further discloses the storage circuitry is configured to store, in association with each entry, an indication of whether that entry is to be broadcast by the broadcast circuitry ([0149] the host interface can also be a vendor specific communications interface or communications fabric. The GPGPU 700 receives commands from the host processor and uses a global scheduler 704 to distribute execution threads associated with those commands to a set of compute clusters 706A-706H. Please note that distributing the execution threads associated with received commands to compute clusters corresponds to Applicant’s storage circuitry being configured to store an indication of whether each entry is to be broadcast by the broadcast circuitry in association with each entry. This is because the commands indicate which associated execution threads are to be sent to the compute clusters, i.e., there is inherently an indication of whether the thread for each part of distributing learning is to be broadcast to the compute clusters.).
Regarding Claim 7, Sridharan-Das as described in Claim 1, Sridharan further discloses the broadcast circuitry is configured to broadcast the first machine learning data to at most a subset of the plurality of processor circuits ([0196] model parallelism can be implemented in which the model or set of weights is split across multiple nodes. Generally, model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension). Please note that the input data for model parallelism being split and sent for computation on different nodes corresponds to Applicant’s broadcasting the first subset of machine learning data to at most a subset of the plurality of processor circuits, as the first subset of machine learning data is sent to a particular node, i.e., at most a subset of the plurality of processor circuits.);
and the broadcast circuitry is configured to broadcast third machine learning data, different to the first machine learning data, to a further subset of the plurality of processor circuits ([0196] model parallelism can be implemented in which the model or set of weights is split across multiple nodes. Generally, model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension). Please note that the input data for model parallelism being split and sent for computation on different nodes corresponds to Applicant’s broadcast circuitry being configured to broadcast third machine learning data different from the first to a further subset of the plurality of processor circuits, as, for example, in an instance in which the model is split into 3 sections, it would be able to be distributed in such a manner that there would be third machine learning data distinct from the first being sent to a further subset of the plurality of processor circuits, i.e., a further distinct node.);
the subset of the plurality of processor circuits and the further subset of the plurality of processor circuits are mutually exclusive ([0196] model parallelism can be implemented in which the model or set of weights is split across multiple nodes. Generally, model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), Please note that the input data for model parallelism being split and sent for computation on different nodes corresponds to Applicant’s subset of the plurality of processor circuits and the further subset of the plurality of processor circuits being mutually exclusive, as, for example, in an instance in which the model is split into 3 sections, it would be able to be distributed in such a manner that each portion of computation is performed on a different, i.e., mutually exclusive, node.);
and the first machine learning data and the third machine learning data relate to different layers of a neural network ([0196] model parallelism can be implemented in which the model or set of weights is split across multiple nodes. Generally, model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), Please note that the input data for model parallelism being split and sent for computation on different nodes corresponds to Applicant’s first and third machine learning data relating to different layers of a neural network, as it is known to one of ordinary skill in the art that models are comprised of layers, and therefore if they are split into different sections, each section relates to different layers.).
Regarding Claim 8, Sridharan-Das as described in Claim 1, Sridharan further discloses the processor is a tile-based graphics processor ([0347] graphics processor 3610 includes […] a tiling unit 3618 to accelerate tiling operations for tile-based rendering. Please note that the graphics processor 3610 including a tiling unit 3618 corresponds to Applicant’s processor being a tile-based graphics processor.);
and the processor circuits are shader cores ([0347] graphics processor 3610 includes an inter-core task manager 3605, which acts as a thread dispatcher to dispatch execution threads to one or more shader cores 3615A-3615N. Please note the graphics processor including an inter-core task manager 3605 to dispatch execution threads to shader cores 3615A-3615N corresponds to Applicant’s processor circuits being shader cores.);
and the storage circuitry is a cache ([0346] Graphics processor 3610 includes the one or more […] caches 3625A-3625B. Please note the graphics processor 3610 including caches corresponds to Applicant’s storage circuitry being a cache.);
and the local-storage circuitry is a tile buffer in a shader core, wherein the broadcast circuitry is configured to broadcast the first machine learning data from the cache to tile buffers in at least a subset of the plurality of shader cores ([0075] The MMU 245 includes a set of page table entries (PTEs) used to map a virtual address to a physical address of a tile and optionally a cache line index. The MMU 245 may include address translation lookaside buffers (TLB) or caches that may reside within the graphics multiprocessor 234 or the L1 cache or processing cluster 214. Please note that the TLBs of the graphics multiprocessor 234 that is included in the MMU 245 that maps virtual addresses to physical addresses of tiles and a cache line index, with shader cores included in graphics processors as well, corresponds to Applicant’s local-storage circuitry being a tile buffer in a shader core and the broadcast circuitry being configured to broadcast the first machine learning data from the cache to tile buffers in a subset of the plurality of shader cores. ).
Regarding Claim 9, Sridharan-Das as described in Claim 1, Sridharan further discloses obtain at least a second subset of machine learning data from memory to storage circuitry ([0192] Distributed machine learning can be implemented using a variety of parallelism patterns, such as data parallelism, model parallelism, or a hybrid of data and model parallelism, as illustrated in FIG. 12. As described with respect to FIG. 12, data parallelism uses the same model for each compute node, with each node processing different portions of the data.; [0196] model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), as shown in FIG. 14B; [0197] As shown in FIG. 14C, hybrid parallelism can be performed in which a partitioning is performed across activations and weights to minimize skewed matrices. For a layer of a neural network, the input data 1402, weight data 1404, and/or activation data 1406 is partitioned and distributed across multiple compute nodes (e.g., Node 0-Node 3). Please note that distributed machine learning implemented with model parallelism, where the input data is split, corresponds to Applicant’s obtaining at least a second subset of machine learning data from memory to storage circuitry. );
and transfer the at least a second subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits ([0196] model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), as shown in FIG. 14B. Please note that the different nodes performing different portions of the model’s computation corresponds to Applicant’s second subset of machine learning data being transferred to at least a subset of the plurality of processor circuits.);
wherein the at least subset of second machine learning data is different for each of the at least a subset of the processor circuits ([0196] model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), as shown in FIG. 14B. Please note that the different nodes performing different portions of the model’s computation corresponds to Applicant’s subset of second machine learning data being different for each of the subset of processor circuits, as they each process different portions of computation.).
Regarding Claim 10, Sridharan-Das as described in Claim 9, Sridharan further discloses dispatch circuitry to cause processor circuits to process machine learning data, wherein the dispatch circuitry configured to cause each of the at least a subset of the processor circuits to process its first machine learning data with the second machine learning data ([0196] model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), as shown in FIG. 14B. Please note that the different nodes performing different portions of the model’s computation simultaneously corresponds to Applicant’s dispatch circuitry to cause processor circuits to process machine learning data, where it is configured to cause each of the subset of processor circuits to process its first machine learning data with the second machine learning data, i.e., process different portions of the model input data simultaneously.).
Regarding Claim 11, Sridharan-Das as described in Claim 9, Das further discloses either the first machine learning data is a kernel and the second machine learning data is a feature map, or the first machine learning data is the feature map and the second machine learning data is the kernel (Col. 45, Lines 44-62-For each layer of a CNN, forward propagation can be on each quadrant by convolving input feature map data with a set of N×M kernels to generate output feature maps. […] to perform distributed training for a CNN using feature data partitioned along the X and Y dimension, a data transfer of weight and/or feature map data may be required before the convolution can be performed on the various halo regions 1707A-1707D. Please note that using a set of kernels convolved with an input feature map data for forward propagation corresponds to Applicant’s first machine learning data being a kernel and the second machine learning data being a feature map. Furthermore, as the Claim states “either” one of the possible embodiments, this is interpreted as meeting the requirements of the claim.).
Regarding Claim 12, Sridharan-Das as described in Claim 9, Das further discloses the apparatus is configured to operate in a kernel broadcast mode in which the first machine learning data is a kernel and the second machine learning data is a feature map (Col. 45, Lines 44-62-For each layer of a CNN, forward propagation can be on each quadrant by convolving input feature map data with a set of N×M kernels to generate output feature maps. […] to perform distributed training for a CNN using feature data partitioned along the X and Y dimension, a data transfer of weight and/or feature map data may be required before the convolution can be performed on the various halo regions 1707A-1707D. Please note that using a set of kernels convolved with an input feature map data to perform forward propagation in distributed training corresponds to Applicant’s operating in a kernel broadcast mode in which the first machine learning data is a kernel and the second machine learning data is a feature map.);
and the apparatus is configured to operate in a map broadcast mode in which the first machine learning data is the feature map and the second machine learning data is the kernel (Col. 26, Lines In convolutional network terminology, the first function to the convolution can be referred to as the input, while the second function can be referred to as the convolution kernel.; Col. 43, Lines 15-19- A new distribution can be created using a user specified number of nodes and/or groups. Neural network layers can be defined along with input feature map, output feature map, and weight data; Col. 45, Lines 25-39- Each region of the split feature map can be associated with a separate compute node. […]. For various types of neural networks, such as, for example, a CNN, each quadrant will include a halo region 1707 that defines a region of remote data dependency. Please note that the distribution to nodes with output feature maps, utilizing a CNN, wherein the CNN utilizes an input feature map and a convolution kernel applied as a second function, corresponds to Applicant’s map broadcast mode where the first machine learning data is the feature map, i.e., the input, and the second machine learning data is the kernel that is convolved with it.).
Regarding Claim 13, Sridharan-Das as described in Claim 12, Sridharan further discloses dynamically change between the map broadcast mode and the kernel broadcast mode ([0058] The scheduling can be handled dynamically by the scheduler 210, or can be assisted in part by compiler logic during compilation of program logic configured for execution by the processing cluster array 212. In one embodiment, different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs or for performing different types of computations. Please note that different clusters 214A-214N of the processing cluster array 212 being allocated for processing different types of computations corresponds to Applicant’s dynamically changing between the map and kernel broadcast mode, as it would change the type of machine learning operations to be performed, i.e., the mode.).
Regarding Claim 14, Sridharan-Das as described in Claim 12, Sridharan further discloses dynamically change between the map broadcast mode and the kernel broadcast mode in dependence on a layer of neural network to which the kernel and the feature map relate ([0178] In model parallelism 1202, different computational nodes in a distributed system can perform training computations for different parts of a single network. For example, each layer of a neural network can be trained by a different processing node of the distributed system. […] computation in one or more layers of a neural network model can be split across multiple compute nodes across feature map dimension to reduce size of per node model parameters.; [0219] In one embodiment the MLSL API enables the use different types of parallelization for different layers of the same neural network. The choice of parallelism can be made automatically by the MLSL based on layer properties. Please note that each layer of the neural network being trained by a different processing node, where different types of parallelization may be used for different layers of the same neural network, with the choice dependent on layer properties, corresponds to Applicant’s dynamically changing between the map and kernel broadcast mode in dependence on a layer of neural network to which the kernel and feature map relate.).
Regarding Claim 17, Sridharan-Das as described in Claim 1, Sridharan further discloses a host processor configured to execute a driver ([0147] The compute framework 606 can abstract the underlying instructions provided to the GPGPU driver 608. Please note that the compute framework 606 utilizing the GPGPU driver 608 by abstracting instructions provided to it corresponds to Applicant’s host processor executing a driver.);
and job manager circuitry configured to dispatch tasks to at least a subset of the plurality of processor circuits, wherein the driver is configured to analyse layer processing of a neural network and to generate a job list to schedule processing of a neural network to at least a subset of the processing circuits ([0147] The machine learning framework 604 can process input data received from the machine learning application 602 and generate the appropriate input to a compute framework 606. The compute framework 606 can abstract the underlying instructions provided to the GPGPU driver 608 to enable the machine learning framework 604 to take advantage of hardware acceleration via the GPGPU hardware 610 without requiring the machine learning framework 604 to have intimate knowledge of the architecture of the GPGPU hardware 610. Please note that the machine learning framework 604 processing input data received from the machine learning application 602 and generating the appropriate input to a compute framework 606 corresponds to Applicant’s job manager circuitry configured to dispatch tasks to at least a subset of the plurality of processor circuits. Furthermore, as the GPGPU driver 608 is utilized by the compute framework 606 to enable the machine learning framework 604 to take advantage of hardware acceleration, this corresponds to the driver being configured to analyze layer processing of a neural network, i.e., allow the machine learning framework to process input data from the machine learning application 602, known in the art to include layer processing. Since it then provides input to a compute framework to carry out processing of the machine learning data, this corresponds to generating a job list to schedule processing of a neural network to at least a subset of the processing circuits. Since a neural network is known in the art to be a variant of machine learning, this system being applied to machine learning corresponds to Applicant’s neural network.);
and the job manager circuitry is configured to process the job list generate by the driver, wherein the job manager circuitry is configured to determine available plurality of processing circuits and using the job list dispatch tasks to at least a subset of the plurality of processor circuits ([0111] In one embodiment, the process elements 483 are stored in response to GPU invocations 481 from applications 480 executed on the processor 407. A process element 483 contains the process state for the corresponding application 480. A work descriptor (WD) 484 contained in the process element 483 can be a single job requested by an application or may contain a pointer to a queue of jobs. In the latter case, the WD 484 is a pointer to the job request queue in the application's address space 482.; [0112] The graphics acceleration module 446 and/or the individual graphics processing engines 431-432, N can be shared by all or a subset of the processes in the system. Embodiments of the invention include an infrastructure for setting up the process state and sending a WD 484 to a graphics acceleration module 446 to start a job in a virtualized environment. Please note that the process elements 483 being stored in response to GPU invocations 481, which would be obvious to include the machine learning input data processed with the aid of the GPGPU driver 608, and each containing a job requested by an application corresponds to Applicant’s job manager circuitry being configured to process the job list generated by the driver. Furthermore, the individual graphics processing engines N that are used by a subset of the processes in the system that operating using a WD 484 to start a job at a graphics acceleration module corresponds to Applicant’s job manager circuitry being configured to determine an available plurality of processing circuits, inherently required for the individual graphics processing engines, and using the job list to dispatch tasks, i.e., those described by the WD, to at least a subset of the plurality of processor circuits.).
Regarding Claim 18, Sridharan-Das as described in Claim 17, Sridharan further discloses the driver is configured to select between map broadcast mode and kernel broadcast mode to minimise memory accesses to the at least subset of the plurality of processing circuits in dependence on the size of the kernel and feature map associated with the layer ([0181] the parallel processors and GPGPUs described herein can each implement various techniques to reduce the overhead of distributed training, including techniques to enable high bandwidth GPU-to-GPU data transfer and accelerated remote data synchronization.; [0219] In one embodiment the MLSL API enables the use different types of parallelization for different layers of the same neural network. The choice of parallelism can be made automatically by the MLSL based on layer properties, such as the number of learnable parameters and the number of activations. Please note that the parallel processors implementing techniques to reduce the overhead of distributed training, where different types of parallelization may be used for different layers and are chosen based on layer properties such as the number of activations and learnable parameters, corresponds to Applicant’s driver being configured to select between map and kernel broadcast modes to minimize memory accesses, i.e., reduce overhead, in the at least subset of the plurality of processing circuits in dependence on the size of the kernel and feature map associated with the layer.);
and the available storage circuitry is associated with each of the processing circuits; and the number of at least a subset of the plurality of processing circuits processing the layer ([0209] In one embodiment the communication module 1517 can adaptively adjust or assign processing resources to attempt to fully saturate available network resources to attempt to minimize the latency impact of communication within the distributed system. For example, should the communication module 1517 determine that the high-performance communication fabric 1521 is not fully saturated with data, additional processors or processor cores can be assigned to perform network tasks if overall throughput of the distributed compute system would be increased. Please note that assigning processing resources to fully saturate available network resources corresponds to Applicant’s available storage circuitry being associated with each of the processing circuits and the number of at least a subset of the plurality of processing circuits processing the layer, as the network tasks including the distributed machine learning for layers of the neural network would be assigned with an awareness of the number of the subset of the plurality of processing circuits processing the layer, as well as available storage circuitry required to carry out the operation, known in the art to be included as network resources.).
Regarding Claim 19, Sridharan discloses A data processing method of transferring data to a plurality of processing circuits ([0186] The parallel processing platforms used for machine learning can be divided into training platforms and deployment platforms. Training platforms are generally highly parallel and include optimizations to accelerate multi-GPU single node training and multi-node, multi-GPU training. Exemplary parallel processors suited for training include the highly-parallel general-purpose graphics processing unit 700 of FIG. 7 and the multi-GPU computing system 800 of FIG. 8. Please note the method of training, utilizing parallel processors suited for training of machine learning, with multi-node training corresponds to Applicant’s data processing method of transferring data to a plurality of processing circuits, as data is necessarily transferred to the processors in order to carry out the machine learning operations.), the data processing method comprising:
and broadcasting the at least a first subset of machine learning data from storage to at least a subset of the plurality of processor circuits ([0196] model parallelism can be implemented in which the model or set of weights is split across multiple nodes. Generally, model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), Please note that the input data for model parallelism being split and sent for computation on different nodes corresponds to Applicant’s broadcasting the at least a first subset of machine learning data from storage to at least a subset of the plurality of processor circuits.).
Sridharan does not explicitly disclose fetching at least a first subset of machine learning data from memory to storage;
However, Das discloses fetching at least a first subset of machine learning data from memory to storage (Col. 31, Lines 8-16- For example, each layer of a neural network can be trained by a different processing node of the distributed system. The benefits of model parallelism include the ability to scale to particularly large models. Splitting the computations associated with different layers of the neural network enables the training of very large neural networks in which the weights of all layers would not fit into the memory of a single computational node. Please note that splitting the computations associated with different layers of the neural network to be trained by a different processing node of the distributed system, where it would not fit into the memory of a single node, corresponds to Applicant’s fetching a first subset of machine learning data from memory to storage, as the model is fetched from one central memory location and split into subsets to be stored in the storage of the nodes.);
Sridharan and Das are both considered to be analogous to the claimed invention because they are in the same field of distributed machine learning utilizing computer processors. Therefore, it would have been obvious to someone of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Sridharan to incorporate the teachings of Das, such that the data processing method of transferring data to a plurality of processing circuits, which broadcasts a first subset of machine learning data from storage to a subset of the plurality of processor circuits, also fetches the first subset of machine learning data from memory to storage, allowing for improved resource and memory management due to the nature of the distributed network, as described in Das.
Regarding Claim 20, Sridharan discloses A non-transitory computer-readable medium to store computer-readable code for fabrication of an apparatus comprising ([0349] Embodiments may be provided, for example, as a computer program product which may include one or more machine-readable media having stored thereon machine-executable instructions […] A machine-readable medium may include […] non-transitory machine-readable media suitable for storing machine-executable instructions. Please note that non-transitory machine-readable media storing machine-executable instructions corresponds to Applicant’s non-transitory computer-readable medium storing computer-readable code for fabrication of an apparatus.):
a plurality of processor circuits to perform a machine learning process ([0151] The graphics multiprocessors of the compute cluster multiple types of integer and floating point logic units that can perform computational operations at a range of precisions including suited for machine learning computations. Please note that the graphics multiprocessors that can perform computational operations for machine learning computations corresponds to Applicant’s plurality of processor circuits to perform a machine learning process.);
and broadcast circuitry configured to broadcast data to at least a subset of the plurality of processor circuits ([0196] model parallelism can be implemented in which the model or set of weights is split across multiple nodes. Generally, model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), Please note that the input data for model parallelism being split and sent for computation on different nodes corresponds to Applicant’s broadcast circuitry configured to broadcast data to at least a subset of the plurality of processor circuits, as there must inherently be a mechanism for the transmission of the split sections of the model to the nodes, corresponding to the subset of the plurality of processor circuits, for computation.);
and broadcast the at least a first subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits ([0196] model parallelism can be implemented in which the model or set of weights is split across multiple nodes. Generally, model parallelism performs different portions of a model's computation are performed simultaneous on different nodes for the same batch of examples. For model parallelism, the input data is also split (e.g., along the channel dimension), Please note that the input data for model parallelism being split and sent for computation on different nodes corresponds to Applicant’s broadcasting the at least a first subset of machine learning data from storage circuitry to at least a subset of the plurality of processor circuits.).
Sridharan does not explicitly disclose wherein the processor is configured to: obtain at least a first subset of machine learning data from memory to storage circuitry;
However, Das discloses wherein the processor is configured to: obtain at least a first subset of machine learning data from memory to storage circuitry (Col. 31, Lines 8-16- For example, each layer of a neural network can be trained by a different processing node of the distributed system. The benefits of model parallelism include the ability to scale to particularly large models. Splitting the computations associated with different layers of the neural network enables the training of very large neural networks in which the weights of all layers would not fit into the memory of a single computational node. Please note that splitting the computations associated with different layers of the neural network to be trained by a different processing node of the distributed system, where it would not fit into the memory of a single node, corresponds to Applicant’s obtaining a first subset of machine learning data from memory to storage circuitry, as the model is obtained from one central memory location and split into subsets to be stored in the storage circuitry of the nodes.);
Sridharan and Das are both considered to be analogous to the claimed invention because they are in the same field of distributed machine learning utilizing computer processors. Therefore, it would have been obvious to someone of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Sridharan to incorporate the teachings of Das, such that the system, whose broadcast circuitry broadcasts a first subset of machine learning data from storage circuitry to a subset of the plurality of processor circuits, also obtains the first subset of machine learning data from memory to storage circuitry, allowing for improved resource and memory management due to the nature of the distributed network, as described in Das.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Sridharan et al. (US 20180322386 A1) in view of Das et al. (US 11373266 B2) as applied to Claim 1 above, and further in view of Appu et al. (US 20180308203 A1), hereinafter referred to as Sridharan, Das, and Appu, respectively.
Regarding Claim 4, Sridharan-Das as described in Claim 1 does not explicitly disclose the fetch circuitry configured to fetch or stream first compressed machine learning data from memory;
and decompression circuitry configured to decompress first compressed machine learning data to generate first decompressed machine learning data;
and the fetch circuitry configured to fetch first decompressed machine learning data from decompression circuitry to storage circuitry.
However, Appu discloses the fetch circuitry configured to fetch or stream first compressed machine learning data from memory ([0169] an original or regular model of data may be compressed by applying an artefact by compression/expansion logic 709, where this compressed model and the corresponding artefact are communicated from one autonomous machine 600 to another autonomous machine 740 over one or more communication medium(s) 725 (e.g., cloud, Internet, etc.). The artefact may then be received at autonomous machine. Please note that the autonomous machine receiving the compressed model and artefact corresponds to Applicant’s fetch circuitry configured to fetch first compressed machine learning data from memory.);
and decompression circuitry configured to decompress first compressed machine learning data to generate first decompressed machine learning data ([0169] an original or regular model of data may be compressed by applying an artefact by compression/expansion logic 709, where this compressed model and the corresponding artefact are communicated from one autonomous machine 600 to another autonomous machine 740 over one or more communication medium(s) 725 (e.g., cloud, Internet, etc.). […] The two are then separated and autonomous machine 740 can now use the model in its original and uncompressed model. Please note that the expansion logic 709 to obtain the uncompressed model from the compressed model corresponds to Applicant’s decompression circuitry configured to decompress first compressed machine learning data to generate first decompressed machine learning data.);
and the fetch circuitry configured to fetch first decompressed machine learning data from decompression circuitry to storage circuitry ([0169] The two are then separated and autonomous machine 740 can now use the model in its original and uncompressed model. Please note that the autonomous machine 740 using the uncompressed model corresponds to Applicant’s fetch circuitry fetching first decompressed machine learning data from decompression circuitry to storage circuitry, as it must necessarily fetch and store the now decompressed model in order to be able to use it.).
Sridharan-Das and Appu are both considered to be analogous to the claimed invention because they are in the same field of distributed machine learning. Therefore, it would have been obvious to someone of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Sridharan-Das to incorporate the teachings of Appu, modifying the system as described in Claim 1 so that the fetch circuitry fetches compressed machine learning data from memory, decompression circuitry decompresses it to generate decompressed machine learning data, and the fetch circuitry fetches the decompressed data to storage circuitry, allowing for improved transmission time for the larger amounts of data used by models, as described in Appu.
Claims 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Sridharan et al. (US 20180322386 A1) in view of Das et al. (US 11373266 B2) as applied to Claim 1 above, and further in view of Bernat (US 20210149803 A1), hereinafter referred to as Sridharan, Das, and Bernat, respectively.
Regarding Claim 15, Sridharan-Das as described in Claim 1 does not explicitly disclose snoop filter circuitry, the snoop filter circuitry configured to store, in association with each entry, for coherent traffic coherency state, and for non-coherent traffic to store, an indication of whether that entry is to be broadcast by the broadcast circuitry.
However, Bernat discloses snoop filter circuitry, the snoop filter circuitry configured to store, in association with each entry, for coherent traffic coherency state ([0074] using example snoop filters to monitor coherent traffic and/or keep track of coherency states of local memories),
and for non-coherent traffic to store, an indication of whether that entry is to be broadcast by the broadcast circuitry ([0074] In some examples, if the example snoop filters 628 determine that an example local memory 408, 428 (FIG. 4) is not accessing a copy of a shared memory region, the example snoop filters 628 communicate to the example remote CA 452 to refrain from sending a snoop request to the corresponding example local memory 408, 428 (FIG. 4). ).
Sridharan-Das and Bernat are both considered to be analogous to the claimed invention because they are in the same field of managing communication between nodes performing operations utilizing shared memory, i.e., a cache. Therefore, it would have been obvious to someone of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Sridharan-Das to incorporate the teachings of Bernat to modify the system as described in Claim 1 to have snoop filter circuitry to store a coherency state associated with each entry and an indication of whether an entry is to be broadcast by broadcast circuitry for non-coherent traffic, allowing the ability to keep coherency between memory regions across different devices, as described in Bernat.
Regarding Claim 16, Sridharan-Das-Bernat as described in Claim 15, Bernat further discloses snoop filter circuitry configured to store a snoop filter entry, the snoop filter entry to store at least one of a broadcast flag, a broadcast destination, or a broadcast address, for a non-coherent entry ([0074] In some examples, if the example snoop filters 628 determine that an example local memory 408, 428 (FIG. 4) is not accessing a copy of a shared memory region, the example snoop filters 628 communicate to the example remote CA 452 to refrain from sending a snoop request to the corresponding example local memory 408, 428 (FIG. 4). Please note that the snoop filters 628 communicating to the example remote CA 452 to refrain from sending a snoop request in the instance that an example local memory is not accessing a copy of a shared memory region, i.e., a non-coherent entry, corresponds to Applicant’s snoop filter circuitry configured to store a snoop filter entry to store a broadcast flag, i.e., a flag to refrain from sending a snoop request. As Applicant states “at least one of” the items stored by the snoop filter entry, this is interpreted as fulfilling the requirement of the claim.).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Ray et al. (US 20190206090 A1) discloses distributed learning in which a network is split across multiple nodes, each node having a subset of the data, as well as shader cores and compression of machine learning models (see [0205-0207, 0231, 0270-0273]).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARAZ T AKBARI whose telephone number is (571)272-4166. The examiner can normally be reached Monday-Thursday 9:30am-7:30pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, April Blair, can be reached at (571)270-1014. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/FARAZ T AKBARI/ Examiner, Art Unit 2196
/APRIL Y BLAIR/ Supervisory Patent Examiner, Art Unit 2196