DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/18/2025 has been entered.
Claims 1-20 are presented for examination.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), and further in view of MCCORMACK (US 20140025891 A1).
As to claim 1, Li teaches a computer-implemented method for launching compute tasks on a processing unit (When a user submits a job to COSMIC, the user needs to inform COSMIC the total amount of memory the process needs on a Xeon Phi device through an environment variable COSMIC_MEMORY. COSMIC only launches a submitted job if there is one Xeon Phi device with enough free memory to meet the memory requirement of the job, para[0113], ln 6-12);
the method comprising: launching a first group of threads (scheduler tries to schedule a thread group to run (to become active), para[0011], ln 25-30);
wherein one or more resources included in a free pool are acquired by the first group of threads (The method uses user-specified thread-to-core affinity and the number of threads in a thread group to determine the number of processing cores for the thread group, para[0012] / Next, the affinity setting is discussed. COSMIC selects the cores that are used by an offload, and affinitizes threads to these cores using programmer directives. The core selection procedure for an offload is discussed next. COSMIC's core selection algorithm scans one or more lists of free physical cores to select cores until it finds enough cores for a given offload region, para[0108], ln 1-9);
and during execution of the first group of threads, changing an allocation of the one or more resources acquired by the first group of threads (If more processing cores are needed, then our method picks cores that are both idle and not currently assigned to any thread group, para[0014], ln 5-11).
Li does not teach, during execution of the first group of threads, changing an allocation of the one or more resources acquired by the first group of threads wherein, after changing the allocation, an amount of the one or more first resources acquired by the first group of threads is different than an amount of one or more second resources acquired by a second group of threads. However, Kupferschmidt teaches during execution of the first group of threads, changing an allocation of the one or more resources acquired by the first group of threads wherein, after changing the allocation, an amount of the one or more first resources acquired by the first group of threads is different than an amount of one or more second resources acquired by a second group of threads (a processor receives a set of tasks to be executed. At 906, the processor identifies a thread type for the set of tasks it has recently received. At 908, the processor determines if there is an appropriate thread pool set corresponding to the thread type for the set of tasks. If there is not an appropriate thread pool set, the processor, at 910, reallocates system resources to establish an appropriate thread pool set. Reallocating system resources may include increasing or decreasing the number of thread pool sets, increasing or decreasing the size of each set, or increasing or decreasing the work flow to each set and is discussed further with reference to FIGS. 10A-10D. Finally, at 912, the processor executes the set of tasks and receives another set of tasks to be executed 904, para[0127] / FIGS. 10A-10D illustrate several possible ways a processor may reallocate system resources. FIG. 10A depicts an example initial allocation of 12 processing cores between three thread pool sets. Element 1010 represents a thread pool set utilizing 6 processing cores which may be responsible for displaying real time ray tracing, para[0128], ln 1-6 / FIGS. 10B-10D show ways in which a processor may reallocate system resources.
In FIG. 10B, the number of thread pool sets is increased with the addition of a thread pool set 1040. Assuming there are a finite number of processing cores in a system, the thread pool set 1040 may have been established by decreasing the number of processing cores (and consequently the number of threads) available to the thread pool set 1030. In FIG. 10C, the system resources are reallocated to increase the number of processing cores (and consequently the number of threads) available to thread pool set 1020, para[0129], ln 1-12).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li with Kupferschmidt to incorporate the feature of, during execution of the first group of threads, changing an allocation of the one or more resources acquired by the first group of threads, because this optimizes the performance of threaded applications; to that end, the industry has developed computers whose processors have multiple processing resources that may be used to simultaneously process data from multiple threads of execution.
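For illustration only (this sketch is not part of any cited reference's disclosure, and the pool names and core counts are hypothetical), the kind of reallocation Kupferschmidt describes in FIGS. 10A-10D, growing one thread pool set at the expense of another while the total core count stays fixed, can be sketched as:

```python
# Hypothetical sketch of reallocating a fixed number of processing cores
# among thread pool sets, in the manner of Kupferschmidt's FIGS. 10A-10D.
# Pool names and core counts are illustrative, not from the reference.

TOTAL_CORES = 12

def reallocate(pools, pool_name, new_size):
    """Resize one pool, taking cores from (or returning cores to) the
    largest other pool so the system total stays constant."""
    delta = new_size - pools[pool_name]
    donor = max((p for p in pools if p != pool_name), key=lambda p: pools[p])
    if pools[donor] - delta < 0:
        raise ValueError("not enough cores available to reallocate")
    pools[donor] -= delta
    pools[pool_name] = new_size
    assert sum(pools.values()) == TOTAL_CORES  # finite cores, as in FIG. 10B
    return pools

# Initial allocation loosely mirroring FIG. 10A (6 cores for ray tracing):
pools = {"ray_tracing": 6, "physics": 4, "io": 2}
reallocate(pools, "ray_tracing", 8)  # grow one set at another's expense
```

Shrinking a pool works the same way with a smaller new_size, with the displaced cores returning to the donor pool.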
Vembu teaches a first group of threads included in a producing cooperative thread array that are executing a first program to generate data for the producing cooperative thread array, and a second group of threads included in a consuming cooperative thread array that are executing a second program to consume the data generated by the producing cooperative thread array (A graphics processing cluster array as described herein is capable of executing potentially thousands of concurrent threads within multiple thread groups. In some instances thread groups [cooperative thread array] can be arranged as an array of cooperating threads that concurrently execute the same program on an input data set to produce an output data set. Threads having the same thread group ID can cooperate by sharing data with each other in a manner that depends on thread ID. For instance, data can be produced by one thread in a thread group and consumed by another thread in the thread group. Additionally, synchronization instructions can be inserted into program code to ensure that data to be consumed by a consumer thread has been produced by a producer thread before the consumer thread attempts to access the data. In instances where threads share access to common resources, it may be beneficial to execute all threads of the thread group within a single shader processor, para[0037] / the processing cluster array 212 [cooperative thread array] can include up to “N” processing clusters (e.g., cluster 214A, cluster 214B, through cluster 214N). Each cluster 214A-214N of the processing cluster array 212 is capable of executing a large number (e.g., thousands) of concurrent threads, where each thread is an instance of a program. In one embodiment, different clusters 214A-214N can be allocated for processing different types of programs or for performing different types of computations.
The scheduler 210 can allocate work to the clusters 214A-214N of the processing cluster array 212 using various scheduling and/or work distribution algorithms, para[0048], ln 2-6 to para[0049], ln 1-7 / the term “thread” refers to an instance of a particular program executing on a particular set of input data, para[0061], ln 6-10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li and Kupferschmidt with Vembu to incorporate the above feature because this allows groups of threads to attempt to execute program instructions synchronously together as often as possible to increase processing efficiency.
Vembu teaches a first group of threads included in a producing cooperative thread array that are executing a first program to generate data for the producing cooperative thread array, and a second group of threads included in a consuming cooperative thread array that are executing a second program to consume the data generated by the producing cooperative thread array, as described above, and McCormack teaches wherein a first thread included in the first group of threads is executing a first program to generate data, and wherein a second thread included in the second group of threads is executing a second program to consume the data (plurality of related thread groups may be active (in different phases of execution) at the same time within an SM 310. This collection of thread groups is referred to herein as a "cooperative thread array" ("CTA") or "thread array." The size of a particular CTA is equal to m*k, where k is the number of concurrently executing threads in a thread group, para[0049], ln 1-8 / For example, when a first thread in a CTA produces data for consumption by another thread in the CTA, the first thread writes the data (i.e., performs a STORE operation) and the second thread reads the data (i.e., performs a LOAD operation). Before the second thread reads data from a location in shared memory that was written by the first thread, para[0059], ln 9-18 / a read by a first thread in a CTA is considered "performed" at the CTA level with respect to other threads in a CTA at a point in time when the issuing of a write to the same address by one of the other threads in the CTA cannot affect the value returned to the first thread. In another example, a write by the first thread in a CTA is considered "performed" at the CTA level at a point in time ...
threads that are not in the same CTA may or may not see the result of the write by the first thread, para[0060], ln 18-30 / data sharing among threads of a CTA is determined by the CTA program; thus, it is to be understood that in a particular application that uses CTAs, para[0056], ln 27-33).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, and Vembu with McCormack to incorporate the above feature because this supports execution of multiple threads that access memory and maintains coherence between a first cache and a second cache when different portions of multiple parallel threads access both caches.
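As a purely illustrative sketch (ordinary CPU threads and a threading.Event here stand in for CTA threads and the synchronization instructions VEMBU describes; none of the names below come from the cited references), the produce-before-consume ordering discussed above looks like:

```python
# Illustrative sketch: one thread produces data into shared storage
# (analogous to a STORE by a first CTA thread), and a second thread waits
# on a synchronization primitive before reading it (analogous to a LOAD),
# mirroring the produce/consume ordering described in VEMBU and McCormack.
import threading

shared = {}                    # stands in for shared memory
ready = threading.Event()      # stands in for a synchronization instruction

def producer():
    shared["data"] = [x * x for x in range(4)]  # generate data (STORE)
    ready.set()                # signal that the data has been produced

def consumer(out):
    ready.wait()               # block until the producer has written
    out.append(sum(shared["data"]))             # consume data (LOAD)

results = []
threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer, args=(results,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results now holds the consumed value: 0 + 1 + 4 + 9 = 14
```

Without the wait on the event, the consumer could read before the producer has written, which is exactly the hazard the synchronization instructions in the cited references guard against.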
As to claim 2, Li teaches comprising launching a second group of threads, wherein one or more resources included in the free pool are acquired by the second group of threads (para[0011], ln 25-30 / para[0034], ln 1-6 / para[0108], ln 1-9 / para[0012]).
As to claim 13, Li teaches during execution of the first group of threads: changing a number of threads included in the first group of threads; and changing an allocation of the one or more resources acquired by the first group of threads (para[0043]).
Claims 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), in view of MCCORMACK (US 20140025891 A1), and further in view of Tormasov (US 7665090 B1).
As to claim 15, Tormasov teaches during execution of the first group of threads: deallocating a first resource included in the one or more resources acquired by the first group of threads from the first group of threads; and launching a second group of threads, wherein the first resource is allocated to the second group of threads (State 216 shows a higher utilization of the resources by process 106a, the same utilization of the resource by process 106b, and a termination of process 106c, which potentially permits an allocation of more resources to the two remaining processes than their otherwise-assigned limits would normally allow. The diagram at the bottom illustrates the same state 216, where the resources utilized by both processes 106a and 106b are combined, showing the utilized resources 112 and the unutilized resources 114. In the case of state 216, with one of the processes terminated, the total resource consumption by that group of processes is therefore less, which can permit reallocation of resources to other groups of processes. In the case of state 212, the situation is the opposite--an additional process has been initiated, and may require an increase in the resource allocation to the group of processes, col 6, ln 40-55).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, VEMBU, and MCCORMACK with Tormasov to incorporate the feature of, during execution of the first group of threads: deallocating a first resource included in the one or more resources acquired by the first group of threads from the first group of threads; and launching a second group of threads, wherein the first resource is allocated to the second group of threads, because this reduces a resource allocation to a particular process based on a history of consumption of the resource by that process and a history of consumption of other resources by that process.
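For illustration only (the resource names below are hypothetical and this sketch is not drawn from Tormasov's disclosure), the claimed sequence, deallocating a resource from a running first group so a newly launched second group can acquire it, can be sketched as:

```python
# Illustrative sketch: when a resource is deallocated from a running first
# group, it returns to a free pool from which a newly launched second
# group can acquire it. Resource names and counts are hypothetical.
free_pool = {"core0", "core1", "core2", "core3"}

def acquire(n):
    """Take n resources out of the free pool for a newly launched group."""
    return {free_pool.pop() for _ in range(n)}

def release(held, resource):
    """Deallocate one resource from a group, returning it to the free pool."""
    held.discard(resource)
    free_pool.add(resource)

group1 = acquire(3)              # first group launches with 3 resources
victim = next(iter(group1))
release(group1, victim)          # deallocate one resource mid-execution
group2 = acquire(2)              # second group launches and acquires it
```

After the release, the deallocated resource sits in the free pool alongside any never-acquired resources, so the second group's acquisition necessarily picks it up here.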
As to claim 16, Tormasov teaches executing a dynamic condition check to generate a result; determining that the result indicates that the first group of threads executes a first branch included in a plurality of branches; determining that first resources for executing the first branch are different from the one or more resources acquired by the first group of threads; and changing an allocation of the one or more resources acquired by the first group of threads based on the resources for executing the first branch (FIG. 3 illustrates the use of a group resource scheduler 302 for managing resource allocation to a group of processes 104. The scheduler 302 uses a storage 304, which may be, for example, a database for managing the resources of each group of processes and of each process within a particular group 104. The scheduler 302 is responsible for keeping track of the resource utilization on a group basis of each group of processes 104 and of each process 106 within the group 104, as well as the amount of under-utilized resources by each process, col 6, ln 55-67 to col 7, ln 1-10), for the same reason as to claim 15 above.
As to claim 17, it is rejected for the same reason as to claim 1 above. Additionally, Tormasov teaches a processor that executes one or more threads; and a resource allocator that is coupled to a resource set (A manager, or scheduler, of resource allocation restrictions allocates resources to each process such that total resource allocation restrictions to a group to which that process belongs remains constant, col 2, ln 18-25 / The computer system 102 includes one or more processors, such as processor 604. The processor 604 is connected to a communication infrastructure 606, such as a bus or network, col 11, ln 3-10).
As to claim 18, it is rejected for the same reason as to claim 2 above.
Claims 3 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), in view of MCCORMACK (US 20140025891 A1), and further in view of Manchale (US 20200341815 A1).
As to claim 3, Manchale teaches the one or more resources acquired by the first group of threads are different in size from the one or more resources acquired by the second group of threads (Different process groups may have different amounts of CPU resources per process, para[0017], ln 14-17).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, VEMBU, and MCCORMACK with Manchale to incorporate the feature of the one or more resources acquired by the first group of threads being different in size from the one or more resources acquired by the second group of threads, because this controls allocation of execution resources for database connection processes.
As to claim 19, it is rejected for the same reason as to claim 3 above.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), in view of MCCORMACK (US 20140025891 A1), and further in view of Takano (US 20140089935 A1).
As to claim 4, Takano teaches the first group of threads and a second group of threads are included in a first thread array (a group of thread arrays each being a group of threads each representing a process unit, para[0035], ln 3-6).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, VEMBU, and MCCORMACK with Takano to incorporate the feature of the first group of threads and a second group of threads being included in a first thread array, because this optimizes a computer program to be executed by a computer equipped with a processing unit having a plurality of processor cores.
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), in view of MCCORMACK (US 20140025891 A1), and further in view of PANTALEONI (US 20140173606 A1).
As to claim 5, Pantaleoni teaches the first group of threads executes a first function, and the second group of threads executes a second function that is different from the first function (a different thread group executes each stage of the algorithm. Each thread group performs a different set of alignment operations related to a different stage of the alignment algorithm for a group of short reads. In operation, the output from a first thread group that performed a particular set of alignment operations on the short read is pushed to a work queue associated with a second thread group. When the second thread group has available processing cycles, the second thread group retrieves the output from the work queue for performing a different set of alignment operations on the output, para[0080], ln 3-15).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, VEMBU, and MCCORMACK with Pantaleoni to incorporate the feature of the first group of threads executing a first function and the second group of threads executing a second function that is different from the first function, because this performs all the operations related to each stage of the alignment algorithm on every short read in the group of short reads.
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), in view of MCCORMACK (US 20140025891 A1), and further in view of LEI (CN 108777798 A).
As to claim 6, LEI teaches the first group of threads executes a first program that includes mathematical functions, and the second group of threads executes a second program that includes data transfer functions (instruction unit 254 can assign instructions to a thread group (e.g., thread bundles (warp)), each thread of the thread group being assigned to a GPGPU core 262 in a different execution unit, Sec: "In one embodiment, the instruction cache 252", ln 3-7 / logical instructions in the form of 0001xxxxb. The flow control instruction group 2244 (e.g., call, jump (jmp)) comprises instructions using the 0010xxxxb form (e.g., 0x20). The hybrid instruction group 2246 comprises a mixture of instructions, including synchronization instructions (e.g., wait, send) using the 0011xxxxb form (e.g., 0x30). The parallel mathematical instruction group 2248 includes component-wise arithmetic instructions (e.g., add, multiply (mul)) employing the 0100xxxxb form (e.g., 0x40), performing arithmetic operations in parallel across the data channels. The vector mathematics group 2250 adopts the 0101xxxxb form (e.g., 0x50) of arithmetic instructions (e.g., dp4), performing vector mathematical operations such as computing the dot product of vector operands, Sec: "In some embodiments, based on the operation code 2212", ln 6-24).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, VEMBU, and MCCORMACK with LEI to incorporate the feature of the first group of threads executing a first program that includes mathematical functions and the second group of threads executing a second program that includes data transfer functions, because this performs complex scheduling and work distribution operations so as to realize fast context switching (rapid preemption) of threads executing on the processing array 212.
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), in view of MCCORMACK (US 20140025891 A1), and further in view of Kruglick (US 20110088038 A1).
As to claim 7, Kruglick teaches transitioning a state of the one or more resources acquired by the first group of threads from a free state to a warp owned state (At block 480 (Support Splitting of Affinity Groups Between Processor Cores), splitting of affinity groups may be supported. Splitting of an affinity group may support transitioning a group of processes 140 from a single core within a multicore processor 120 onto two or more cores within the multicore processor 120, para[0045], ln 1-10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, VEMBU, and MCCORMACK with Kruglick to incorporate the feature of transitioning a state of the one or more resources acquired by the first group of threads from a free state to a warp owned state, because graph theory techniques and interprocess communication analysis techniques may be applied to evaluate efficient assignment of processes to processor cores.
Claims 8 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), in view of MCCORMACK (US 20140025891 A1), and further in view of Saville (US 9507637 B1).
As to claim 8, Saville teaches during execution of the first group of threads: deallocating a first resource included in the one or more resources acquired by the first group of threads; and transitioning a state of the first resource from a warp owned state to a thread array owned state (determining whether storage is to be deallocated for the allocated thread-specific memory, and in response to determining that storage is to be deallocated for the allocated thread-specific memory: deallocating storage for the allocated thread-specific memory, removing the association between the allocated thread-specific memory and the executable thread, and providing an indication that execution of the executable thread is suspended, col 2, ln 5-10 / the operating system can examine the stack pointer and/or the allocation bit to determine whether thread-specific memory is to be deallocated prior to setting the execution state of the thread to a sleep, deinitialized, and/or dead state. The deallocation of stacks and other processor storage for sleeping threads can significantly reduce memory required by devices that allocate a large number of threads. For example, suppose a computing device has allocated 16K threads on average during operation with each thread allocated 4 Kbytes of processor storage, col 5, ln 50-62 / Scenario 200 begins with an application making two calls to a GenThreads function, shown as pseudo-code on lines 0001-0014 of Table 2 below, to create two arrays of threads: TT, which is an array of temporary-thread-specific memory threads, and NTT, which is an array of non-temporary-thread-specific memory threads, col 12, ln 5-10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, VEMBU, and MCCORMACK with Saville to incorporate the feature of, during execution of the first group of threads: deallocating a first resource included in the one or more resources acquired by the first group of threads; and transitioning a state of the first resource from a warp owned state to a thread array owned state, because this supports execution of software written in programming languages that frequently allocate and deallocate blocks of memory.
As to claim 20, it is rejected for the same reason as to claim 8 above.
Claims 9, 10, and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), in view of MCCORMACK (US 20140025891 A1), in view of Saville (US 9507637 B1), and further in view of Kruglick (US 20110088038 A1).
As to claim 9, Kruglick teaches during execution of the first group of threads: allocating the first resource to a second group of threads; and transitioning a state of the first resource from the thread array owned state to the warp owned state (At block 480 (Support Splitting of Affinity Groups Between Processor Cores), splitting of affinity groups may be supported. Splitting of an affinity group may support transitioning a group of processes 140 from a single core within a multicore processor 120 onto two or more cores within the multicore processor 120, para[0045], ln 1-10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, VEMBU, and MCCORMACK with Kruglick to incorporate the feature of, during execution of the first group of threads: allocating the first resource to a second group of threads; and transitioning a state of the first resource from the thread array owned state to the warp owned state, because graph theory techniques and interprocess communication analysis techniques may be applied to evaluate efficient assignment of processes to processor cores.
As to claim 10, Saville teaches wherein the first group of threads and the second group of threads are included in a first thread array (col 12, ln 5-17) for the same reason as to claim 8 above.
As to claim 11, Kruglick teaches the first group of threads passes a value to the second group of threads via the first resource (para[0020], ln 1-10) for the same reason as to claim 9 above.
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), and further in view of MCCORMACK (US 20140025891 A1).
As to claim 12, VEMBU teaches determining that the first group of threads has completed execution; and transitioning a state of the one or more resources acquired by the first group of threads from a warp owned state to a free state (The logic 1500 can fill any idle execution resources within a graphics multiprocessor having completed thread sub-groups by launching additional thread sub-groups. The additional thread sub-groups can be associated with a different thread group than any currently executing thread groups. In one embodiment, threads within a thread group can have cross sub-group dependencies, para[0150], ln 1-10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, and MCCORMACK with VEMBU to incorporate the feature of determining that the first group of threads has completed execution; and transitioning a state of the one or more resources acquired by the first group of threads from a warp owned state to a free state, because this allows groups of parallel threads to attempt to execute program instructions synchronously together as often as possible to increase processing efficiency.
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20140208331 A1) in view of Kupferschmidt (US 20090183167 A1), in view of VEMBU (US 20180308195 A1), in view of MCCORMACK (US 20140025891 A1), and further in view of DA (CN 107515785 A).
As to claim 14, DA teaches the free pool includes at least one of a set of registers or a portion of a shared memory (during resource allocation, memory resources allocated from the shared memory resource pool serve as the idle resources of the task, placed in the free resource pool of the task; when the task needs to use memory resources, it can preferentially acquire the memory resources to be allocated from the free resource pool of the task, sec: "free resource pool of the one task", ln 6-15).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Li, Kupferschmidt, VEMBU, and MCCORMACK with DA to incorporate the feature of the free pool including at least one of a set of registers or a portion of a shared memory, because this reduces the use of exclusive memory locks.
Response to Arguments
A. Applicant's amendment filed on 12/08/2025 has been considered, but the arguments are not persuasive:
Applicant argued in substance that:
(1) “does not disclose any techniques that include a first thread included in the first group of threads is executing a first program to generate data for the producing cooperative thread array, and a consuming cooperative thread array, wherein a second thread included in the second group of threads is executing a second program to consume the data generated by the producing cooperative thread array”
B. Examiner respectfully disagrees with Applicant's remarks:
As to point (1), VEMBU teaches (A graphics processing cluster array as described herein is capable of executing potentially thousands of concurrent threads within multiple thread groups. In some instances thread groups [cooperative thread array] can be arranged as an array of cooperating threads that concurrently execute the same program on an input data set to produce an output data set. Threads having the same thread group ID can cooperate by sharing data with each other in a manner that depends on thread ID. For instance, data can be produced by one thread in a thread group and consumed by another thread in the thread group. Additionally, synchronization instructions can be inserted into program code to ensure that data to be consumed by a consumer thread has been produced by a producer thread before the consumer thread attempts to access the data. In instances where threads share access to common resources, it may be beneficial to execute all threads of the thread group within a single shader processor, para[0037] / the processing cluster array 212 [cooperative thread array] can include up to “N” processing clusters (e.g., cluster 214A, cluster 214B, through cluster 214N). Each cluster 214A-214N of the processing cluster array 212 is capable of executing a large number (e.g., thousands) of concurrent threads, where each thread is an instance of a program. In one embodiment, different clusters 214A-214N can be allocated for processing different types of programs or for performing different types of computations. The scheduler 210 can allocate work to the clusters 214A-214N of the processing cluster array 212 using various scheduling and/or work distribution algorithms, para[0048], ln 2-6 to para[0049], ln 1-7 / the term “thread” refers to an instance of a particular program executing on a particular set of input data, para[0061], ln 6-10).
MCCORMACK teaches wherein a first thread included in the first group of threads is executing a first program to generate data, and wherein a second thread included in the second group of threads is executing a second program to consume the data ("a plurality of related thread groups may be active (in different phases of execution) at the same time within an SM 310. This collection of thread groups is referred to herein as a 'cooperative thread array' ('CTA') or 'thread array.' The size of a particular CTA is equal to m*k, where k is the number of concurrently executing threads in a thread group", para[0049], ln 1-8 / "For example, when a first thread in a CTA produces data for consumption by another thread in the CTA, the first thread writes the data (i.e., performs a STORE operation) and the second thread reads the data (i.e., performs a LOAD operation). Before the second thread reads data from a location in shared memory that was written by the first thread", para[0059], ln 9-18 / "a read by a first thread in a CTA is considered 'performed' at the CTA level with respect to other threads in a CTA at a point in time when the issuing of a write to the same address by one of the other threads in the CTA cannot affect the value returned to the first thread. In another example, a write by the first thread in a CTA is considered 'performed' at the CTA level at a point in time …… threads that are not in the same CTA may or may not see the result of the write by the first thread", para[0060], ln 18-30 / "of data sharing among threads of a CTA is determined by the CTA program; thus, it is to be understood that in a particular application that uses CTAs", para[0056], ln 27-33).
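The producer-consumer relationship among threads described in the cited passages can be illustrated by a minimal sketch. This is an illustration only, not taken from any cited reference: it uses ordinary host threads and a barrier in place of a CTA's shared memory and synchronization instructions, and the names `shared`, `producer`, and `consumer` are hypothetical.

```python
import threading

shared = {}                      # stands in for CTA shared memory
barrier = threading.Barrier(2)   # stands in for a synchronization instruction
results = []

def producer():
    # first thread writes the data (i.e., performs a STORE operation)
    shared["value"] = 42
    barrier.wait()  # data is produced before any consumer may read it

def consumer():
    barrier.wait()  # wait until the producer thread has performed its write
    # second thread reads the data (i.e., performs a LOAD operation)
    results.append(shared["value"])

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results[0])  # prints 42
```

The barrier plays the role of the synchronization instruction: it guarantees that the consumer's read cannot begin until the producer's write has completed.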
Conclusion
US 20140025891 A1 teaches, for example, that when a first thread in a CTA produces data for consumption by another thread in the CTA, the first thread writes the data (i.e., performs a STORE operation) and the second thread reads the data (i.e., performs a LOAD operation).
US 7412532 B2 teaches that the producer threads produce work requests. To execute the work requests, the system has worker threads 604 (also called consumers, since they consume the requests put onto the queue).
US 20170300228 A1 teaches that producers are scheduled to run on an N number of threads, while consumers are scheduled to run on an M number of threads. The number of producers and the number of threads N may be different because a single producer can run on multiple threads. Similarly, the number of consumers and the number of threads M may be different because a single consumer can run on multiple threads and consume nodes produced by multiple producers.
US 20140049549 A1 teaches that a plurality of related thread groups may be active (in different phases of execution) at the same time within an SM 310. This collection of thread groups is referred to herein as a "cooperative thread array" ("CTA") or "thread array." The size of a particular CTA is equal to m*k, where k is the number of concurrently executing threads in a thread group and is typically an integer multiple of the number of parallel processing engines within the SM 310, and m is the number of thread groups simultaneously active within the SM 310.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to LECHI TRUONG whose telephone number is (571)272-3767. The examiner can normally be reached 10-8 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kevin Young, can be reached on (571)270-3180. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LECHI TRUONG/Primary Examiner, Art Unit 2194