DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending in this application.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
As per claim 1:
Lines 4-6 recite “barrier circuitry to synchronize execution of the plurality of hardware threads, wherein the barrier circuitry is configured to provide a plurality of re-useable named barriers,” but it is unclear how the plurality of hardware threads are synchronized if there are a plurality of re-usable named barriers (i.e., which barrier of the plurality of re-usable named barriers synchronizes the plurality of hardware threads?).
As per claim 11:
Lines 7-8 recite “remaining workgroups executed at the multiple graphics cores,” but it is unclear which workgroups are meant by “remaining”.
Line 11 recites “notifying workgroups execute at the multiple graphics cores,” but it is unclear whether these workgroups include “a workgroup executed at a graphics core of the multiple graphics core” and the “remaining workgroups executed at the multiple graphics cores”.
As per claim 15:
Lines 10-11 recite “cluster barrier circuitry to synchronize the first barrier circuitry with the second barrier circuitry,” but it is unclear how the first and second barrier circuitry can be synchronized when the first and second barrier circuitry synchronize different sets of hardware threads.
Claims 2-10, 12-14, and 16-20 depend from claims 1, 11, and 15, respectively, and fail to cure the deficiencies of the claims from which they depend; accordingly, they are rejected for the same reasons.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless— (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1 and 10 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Giroux et al. (US 11803380 B2, hereinafter Giroux).
As per claim 1, Giroux teaches the invention substantially as claimed including a graphics processor comprising: a cache memory (Col. 8 lines 57-66 a GPU(s) 110 with local graphics memory(ies) 114…the local graphics memory 114 may be organized into a different hierarchies such as level 3 cache, level 2 cache, level 1 cache); and a graphics core coupled with the cache memory, the graphics core including execution resources to execute an instruction via a plurality of hardware threads and barrier circuitry to synchronize execution of the plurality of hardware threads, wherein the barrier circuitry is configured to provide a plurality of re-usable named barriers (Fig. 2B; Col. 8 lines 59-66 the synchronization barrier could be stored within local graphics memory 114 and be shared by all streaming multiprocessors (SMs) 204 (see FIG. 2B) within the GPU 110… the local graphics memory 114 may be organized into a different hierarchies such as level 3 cache, level 2 cache, level 1 cache; Col. 2 lines 35-36 In modern GPU architectures, many execution threads execute concurrently; Col. 2 lines 42-56 a program can potentially use a first synchronization barrier to block a first group of threads and a second, different synchronization barrier to block an additional group of threads (or sometimes the same synchronization barrier is reused; claim 20 A synchronization barrier comprising: a counter providing a synchronization barrier count, wherein the counter resides in memory; and circuitry operatively connected to the counter that advances the synchronization barrier count in response to completion of software initiated operations performed by execution threads and advances the synchronization barrier count in response to completion of operations performed by hardware initiated operators).
As per claim 10, Giroux teaches the graphics processor as in claim 1, further comprising a graphics core cluster including the graphics core, the graphics core is a first graphics core, and the graphics core cluster includes cluster barrier circuitry to synchronize the plurality of hardware threads of the first graphics core with a plurality of hardware threads of a second graphics core (Col. 9 lines 7-15 Such memory-implemented synchronization barriers can therefore be used to synchronize between threads running on a common core 206, between different warps running on different cores, different processes running on the same or different GPUs 200, the same or different SOCs, etc. Thus, a particular barrier is no longer limited to synchronizing threads processed in parallel on a particular processor, but may also be used to synchronize many more threads or other executions across any number of different cores, GPUs; Col. 8 lines 59-61 the synchronization barrier could be stored within local graphics memory 114 and be shared by all streaming multiprocessors (SMs) 204 (see FIG. 2B) within the GPU 110;).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 2-5 are rejected under 35 U.S.C. 103 as being unpatentable over Giroux, as applied to claim 1 above, in view of Salapura et al. (US 20120179896 A1, hereinafter Salapura).
As per claim 2, Giroux teaches the graphics processor as in claim 1, the barrier circuitry configured to: initialize a first named barrier in response to an initialization request, wherein the first named barrier is initialized into an inactive state (Figs. 5A, 5B; Col. 13 lines 47-48 During initialization, the system will initially set up the synchronization barrier instance; Col. 14 lines 25-27 the now-initialized barrier is stored to the specified location in memory (2110); Col. 14 lines 52-59 calling the _arrive function 2200 will have the effect (based on hardware acceleration associated with the implementation of barriers) of reducing the number of threads the barrier's thread counter indicates to thereby “register” with the barrier that the thread has arrived at the synchronization point and thus that the barrier is no longer waiting on this particular thread to reach its defined synchronization point.); adjust a count of barrier participants in response to arrival of subsequent hardware threads of the plurality of hardware threads (Col. 12 lines 9-14 an arrival counter 1904 that indicates how many threads/processes have arrived (or in one implementation, the remaining arrivals needed for the barrier to clear). In one example implementation, each time a thread or other process arrives at the barrier, the arrival counter 1904 is incremented.); notify barrier participants of arrival of an expected number of participant threads after arrival of the expected number of barrier participants (Col. 15 lines 51-52 Wait until all expected arrives have occurred for the barrier; Col. 11 lines 61-66 the phase counter 1902 can be a single bit state flag (i.e., a one-bit counter) that is flipped (incremented) each time the barrier is reset. Such a state flag can be used into indicate to threads that the state of the barrier has changed to the next processing phase—meaning that the threads do not (or no longer need to) block on the barrier; Col. 
12 lines 8-17 an additional field comprises an arrival counter 1904 that indicates how many threads/processes have arrived (or in one implementation, the remaining arrivals needed for the barrier to clear). In one example implementation, each time a thread or other process arrives at the barrier, the arrival counter 1904 is incremented. When the arrival counter 1904 increments to a predetermined known value), this indicates that all threads and processes have arrived, and the barrier can change state and move to the next processing phase.).
Giroux fails to teach transition the first named barrier into an active state in response to receipt of a signal that a hardware thread of the plurality of hardware threads has arrived at the first named barrier.
However, Salapura teaches transition the first named barrier into an active state in response to receipt of a signal that a hardware thread of the plurality of hardware threads has arrived at the first named barrier ([0050] Once the thread completes the execution of instructions for a given phase of the program and reaches the barrier, it issues a wait_wakeup or wait_interrupt instructions, to check if other processors reached the barrier. This command causes the barrier signal to start propagating out to other threads participating in the barrier.).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Giroux with the teachings of Salapura to promote efficiency (see Salapura [0048] wait_wakeup--to indicate that the thread enters the barrier and goes into sleep status until the wakeup signal on this processor becomes active. This may be chosen to be implemented efficiently with hardware or in software as per the goals of the target architecture.).
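For illustration only, the arrive-counter and phase-flag mechanism cited from Giroux above (arrival counter 1904, one-bit phase counter 1902) can be sketched as follows. This is a minimal model written by the examiner for clarity; the class and method names are illustrative and do not appear in Giroux or in the claims.

```python
class ArriveWaitBarrier:
    """Minimal sketch of a memory-backed arrive/wait barrier: a counter of
    expected arrivals plus a one-bit phase flag that flips each time the
    barrier resets (cf. Giroux's arrival counter 1904 and phase counter
    1902). Illustrative only; not code from any cited reference."""

    def __init__(self, expected_count):
        self.expected = expected_count
        self.arrived = 0
        self.phase = 0  # one-bit phase flag

    def arrive(self):
        """Register a thread's arrival at its synchronization point."""
        self.arrived += 1
        if self.arrived == self.expected:
            # All expected arrivals received: reset the counter and flip
            # the phase so waiting threads observe the next phase.
            self.arrived = 0
            self.phase ^= 1

    def cleared(self, observed_phase):
        """A waiting thread may proceed once the phase it observed when it
        began waiting has advanced (the barrier has reset)."""
        return self.phase != observed_phase
```

In this model, a thread records the phase before arriving and then polls `cleared()`; the flip of the one-bit phase flag corresponds to Giroux's indication that "the threads do not (or no longer need to) block on the barrier."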
As per claim 3, Giroux and Salapura teach the graphics processor as in claim 2. Giroux teaches the barrier circuitry configured to transition into the inactive state after notification of the arrival of the expected number of barrier participants (Col. 14 lines 18-25 The _create function 2100 takes as parameters “Ptr” (the direct or indirect memory location where the barrier data structure is to be stored in memory) and “ExpectedCount” (the expected total number of the collection of threads and DMA operations that are synchronized by this particular barrier). In the example shown, the counters in the data structure are initialized (2102). The barrier counter is set to ExpectedCount; Col. 12 line 63-Col. 13 line 3 to reset the synchronization primitive 1900 if the copy engine's operation causes the arrival counter decoder 1956 (which functions as a comparator that compares the count of the counter with a predetermined value and resets the counter and the phase indicator based on results of the comparison) to determine that no more threads or copy engine operations are awaited before the synchronization barrier can reset).
As per claim 4, Giroux and Salapura teach the graphics processor as in claim 2. Giroux teaches the barrier circuitry configured to determine the number of barrier participants for the first named barrier based on the signal (Col. 14 lines 24-25 The barrier counter is set to ExpectedCount (increment on arrive) (2108); Col. 14 lines 52-67 line calling the _arrive function 2200 will have the effect (based on hardware acceleration associated with the implementation of barriers) of reducing the number of threads the barrier's thread counter indicates to thereby “register” with the barrier that the thread has arrived at the synchronization point and thus that the barrier is no longer waiting on this particular thread to reach its defined synchronization point. In the example non-limiting embodiment, the _arrive function 2200 call can be placed anywhere in a thread, and it is the position of the _arrive function 2200 call that defines the synchronization point within the thread. The developer and/or an optimizing compiler take care to ensure that the number of threads containing an _arrive function call 2200 (+DMA or other appropriate hardware calls) matches the Expected Number of Arrivals programmed into the barrier.).
As per claim 5, Giroux and Salapura teach the graphics processor as in claim 4. Giroux teaches wherein the barrier participants include first hardware threads of the plurality of hardware threads and the barrier circuitry is configured to synchronize second hardware threads of the plurality of hardware threads via a second named barrier while the first named barrier is in the active state (Col. 2 lines 42-56 a program can potentially use a first synchronization barrier to block a first group of threads and a second, different synchronization barrier to block an additional group of threads (or sometimes the same synchronization barrier is reused; Col. 17 line 46 A split barrier means multiple split barriers can be overlapping).
Claims 6-9 are rejected under 35 U.S.C. 103 as being unpatentable over Giroux and Salapura, as applied to claim 4 above, in view of Guo et al. (US 20230289242 A1, hereinafter Guo).
As per claim 6, Giroux and Salapura teach the graphics processor as in claim 4.
Giroux and Salapura fail to teach wherein the barrier participants include producers, consumers, and producer/consumers.
However, Guo teaches wherein the barrier participants include producers, consumers, and producer/consumers ([0076] improved synchronization between producer and consumer threads in the CGA; Hardware barriers for synchronizing execution across all (or any) threads in a CGA).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Giroux and Salapura with the teachings of Guo to promote efficiency (see Guo [0051] the transaction barrier is a fully software-managed resource with a clear software exposure story, a thread compatible programming model and efficient producer-consumer communication pattern support.).
As per claim 7, Giroux, Salapura, and Guo teach the graphics processor as in claim 6. Guo teaches wherein the signal that the hardware thread of the plurality of hardware threads has arrived indicates a number of producers and a number of consumers ([0092] Each transaction arrive( ) called by the producer threads, in addition to updating the arrive count “Acnt”, also updates the transaction count in the BarDatRdy transaction barrier to indicate how much data (in this example, how many buffers in DataBuf) it will be storing; [0059] The transaction barrier places only minimal requirements on the relative ordering between different arrival events. The expectation can be set by any participating thread before its arrival on the barrier, before or after the associated data transactions. For example, both producer and consumer threads can program the expectation; [0084] The illustrated producer-consumer interaction involves the two producers utilizing an arrive-wait barrier shown in the figure as “BarBufAvail” to determine the availability of space in a buffer “DataBuf” in which to store data produced by the producers, and the two consumers utilizing a transaction barrier shown in the figure as “BarDatRdy” to determine the availability of data in DataBuf to consume.).
As per claim 8, Giroux, Salapura, and Guo teach the graphics processor as in claim 7. Guo teaches the barrier circuitry configured to: save a thread identifier in response to arrival of a consumer; and notify the consumer of the arrival of the expected number of participant threads via the thread identifier (Fig. 6A; [0083] FIG. 6A illustrates a transaction barrier use by two producer CTAs (“Producer0” and “Producer1”) and two consumer CTAs (“Consumer0” and “Consumer1”); [0089] Consumer0 and Consumer1 each reads respective buffers of DataBuf and updates the BarBufAvail barrier to indicate that some of the data in DataBuf has been consumed, by each incrementing the arrive count on BarBufAvail as a result of calling a thread arrive( ) function. When the arrive count on the BarBufAvail indicates that all the expected arrives have been received (i.e., Acnt=0), the BarBufAvail barrier is cleared, and BarBufAvail barrier moves to the next phase—phase 1; [0094] When, in phase 0, the BarDatRdy transaction barrier arrive count is 0 and the transaction count is 0 (Acnt=0, Tcnt=0) indicating that the expected number of thread arrives have arrived and the expected number of transactions have completed, the BarDatRdy barrier is cleared and moves into phase 1 reinitializing the barrier to P=0, Acnt=−2, and Tcnt=0; [0095] The consumer threads then wait on BarDatRdy in phase 1 for data to be available again.).
As per claim 9, Giroux, Salapura, and Guo teach the graphics processor as in claim 7. Guo teaches wherein the graphics core includes direct memory access circuitry configured to perform an asynchronous load from memory via the cache memory and to perform the asynchronous load includes to configure the direct memory access circuitry as a producer associated with the first named barrier (abstract The asynchronous transactions may include transactions resulting from, for example, hardware data movement units such as direct memory units; [0115] In example non-limiting embodiments, the processor architecture may be modified to provide additional circuitry that ties hardware-based processes such as DMA; [0203] In the context of this disclosure, an SM or “streaming multiprocessor” means a processor architected as described in U.S. Pat. No. 7,447,873 to Nordquist including improvements thereto and advancements thereof, and as implemented for example in many generations of NVIDIA GPUs; [0101] A barrier cache 704 attached to the datapath circuit 702 is configured to store cached versions of a plurality of transaction barriers 500 that are stored in shared memory 722; [0152] a direct memory access (“DMA”) by the hardware engine to the instance of the primitive stored in shared memory; [0093] Thus, it can be seen that the transaction arrive( ) of Producer0 updated Tcnt to −2 indicating that it will be storing two buffers, and that each subsequent store operation incremented Tcnt by 1 so that Tcnt=2 at which point Producer1's transaction arrive( ) decremented Tcnt by the number of buffers written by Producer1. This sequence illustrates that the respective threads can call transaction arrive( ) before or after the corresponding store operations. The arrive( ) operations are called by the threads, and the store operations are transactions that may be performed by hardware units, such as, for example, a DMA copy engine, the TMA unit, etc; [0083] FIG. 
6A illustrates a transaction barrier use by two producer CTAs (“Producer0” and “Producer1”)).
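For illustration only, the producer/consumer transaction barrier cited from Guo above (arrive count Acnt, transaction count Tcnt, clearing when both reach zero) can be sketched as follows. The names are illustrative; this is not code from Guo and not the claimed circuitry.

```python
class TransactionBarrier:
    """Sketch of a transaction barrier in the style described for Guo:
    clears only when both the expected thread arrivals (Acnt) and the
    expected data transactions (Tcnt) have completed. Counters count up
    toward zero, mirroring the negative initial values in Guo's example
    (e.g., Acnt = -2 for two expected arrivals)."""

    def __init__(self, expected_arrives):
        self.expected = expected_arrives
        self.acnt = -expected_arrives  # thread-arrival count
        self.tcnt = 0                  # net pending transactions
        self.phase = 0

    def arrive(self, pending_tx=0):
        """A producer's transaction arrive(): registers the arrival and, in
        the same update, how many transactions (e.g., buffer stores) it
        will issue, as in Guo [0092]."""
        self.acnt += 1
        self.tcnt -= pending_tx
        self._maybe_clear()

    def complete_tx(self):
        """A hardware unit (e.g., a DMA copy engine) reports completion of
        one transaction."""
        self.tcnt += 1
        self._maybe_clear()

    def _maybe_clear(self):
        if self.acnt == 0 and self.tcnt == 0:
            # Expected arrives and transactions complete: clear the
            # barrier, advance the phase, and reinitialize for re-use.
            self.phase += 1
            self.acnt = -self.expected
```

A producer thread that will store two buffers calls `arrive(pending_tx=2)`; the barrier does not clear until the two `complete_tx()` reports and all other arrivals are received, matching Guo's condition Acnt=0, Tcnt=0.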
Claims 11-14 are rejected under 35 U.S.C. 103 as being unpatentable over Giroux in view of Guo.
As per claim 11, Giroux teaches a method comprising: initializing a re-usable named barrier for use by multiple graphics cores of a graphics processor (Col. 2 lines 42-56 a program can potentially use a first synchronization barrier to block a first group of threads and a second, different synchronization barrier to block an additional group of threads (or sometimes the same synchronization barrier is reused; Col. 9 lines 7-15 Such memory-implemented synchronization barriers can therefore be used to synchronize between threads running on a common core 206, between different warps running on different cores, different processes running on the same or different GPUs 200, the same or different SOCs, etc. Thus, a particular barrier is no longer limited to synchronizing threads processed in parallel on a particular processor, but may also be used to synchronize many more threads or other executions across any number of different cores, GPUs; Col. 8 lines 59-61 the synchronization barrier could be stored within local graphics memory 114 and be shared by all streaming multiprocessors (SMs) 204 (see FIG. 2B) within the GPU 110); receiving notification of barrier arrival from a workgroup executed at a graphics core of the multiple graphics cores; transitioning into a barrier arriving state in response to receiving the notification (Fig. 5B; Col. 8 lines 59-61 the synchronization barrier could be stored within local graphics memory 114 and be shared by all streaming multiprocessors (SMs) 204 (see FIG. 2B) within the GPU 110; Col. 9 lines 7-11 Such memory-implemented synchronization barriers can therefore be used to synchronize between threads running on a common core 206, between different warps running on different cores, different processes running on the same or different GPUs; Col. 
14 lines 52-59 calling the _arrive function 2200 will have the effect (based on hardware acceleration associated with the implementation of barriers) of reducing the number of threads the barrier's thread counter indicates to thereby “register” with the barrier that the thread has arrived at the synchronization point and thus that the barrier is no longer waiting on this particular thread to reach its defined synchronization point; Col. 13 lines 62-63 an _arrive function 2200 (see FIG. 5B) is used by a thread to indicate its arrival at an arrive-wait-barrier; Col. 2 lines 35-37 many execution threads execute concurrently, and many warps each comprising many threads also execute concurrently.); while in the barrier arriving state, receiving notification of barrier arrival from remaining workgroups executed at the multiple graphics cores; after receiving notification of barrier arrival from the remaining workgroups executed at the multiple graphics cores, transitioning (Col. 12 lines 8-17 in FIG. 3, an additional field comprises an arrival counter 1904 that indicates how many threads/processes have arrived (or in one implementation, the remaining arrivals needed for the barrier to clear). In one example implementation, each time a thread or other process arrives at the barrier, the arrival counter 1904 is incremented. When the arrival counter 1904 increments to a predetermined known value), this indicates that all threads and processes have arrived, and the barrier can change state and move to the next processing phase; Col. 11 lines 61-66 the phase counter 1902 can be a single bit state flag (i.e., a one-bit counter) that is flipped (incremented) each time the barrier is reset. 
Such a state flag can be used into indicate to threads that the state of the barrier has changed to the next processing phase—meaning that the threads do not (or no longer need to) block on the barrier;); and notifying workgroups executed at the multiple graphics cores of completion of a synchronization phase (Col. 15 lines 51-52 Wait until all expected arrives have occurred for the barrier; Col. 11 lines 61-66 the phase counter 1902 can be a single bit state flag (i.e., a one-bit counter) that is flipped (incremented) each time the barrier is reset. Such a state flag can be used into indicate to threads that the state of the barrier has changed to the next processing phase—meaning that the threads do not (or no longer need to) block on the barrier; Col. 12 lines 8-17 an additional field comprises an arrival counter 1904 that indicates how many threads/processes have arrived (or in one implementation, the remaining arrivals needed for the barrier to clear). In one example implementation, each time a thread or other process arrives at the barrier, the arrival counter 1904 is incremented. When the arrival counter 1904 increments to a predetermined known value), this indicates that all threads and processes have arrived, and the barrier can change state and move to the next processing phase).
Giroux fails to teach after receiving notification of barrier arrival from the remaining workgroups executed at the multiple graphics cores, transitioning into a notifying state; and while in the notifying state, notifying workgroups executed at the multiple graphics cores of completion of a synchronization phase.
However, Guo teaches after receiving notification of barrier arrival from the remaining workgroups executed at the multiple graphics cores, transitioning into a notifying state; and while in the notifying state, notifying workgroups executed at the multiple graphics cores of completion of a synchronization phase ([0020] One way for different execution processes to coordinate their states with one another is by using barrier synchronization. Barrier synchronization typically involves each process in a collection of parallel-executing processes waiting at a barrier until all other processes in the collection catch up. No process can proceed beyond the barrier until all processes reach the barrier;[0114] The datapath circuit also can reset the synchronization primitive 500 if the modifying causes the arrival counter decoder 806 (which functions as a comparator that compares the counts of the arrive counter 814 and transaction counter 810 with predetermined values (e.g., 0) and resets the counters and the phase indicator based on results of the comparison) to determine that no more threads or transactions are awaited before the transaction barrier can reset; [0107] When a transaction barrier 500 in the cache 704 clears, the datapath circuit 702 notifies all threads waiting (i.e. blocked) on that transaction barrier 500 in the try-wait buffer 706. These threads may be executing on the same SM, or on different SMs, or on both the same SM and different SMs. Upon being notified, the waiting threads may each reissue the thread wait( ) and, the barrier having already been cleared, proceed in its execution. In some embodiments, the waiting threads may, when notified, proceed to continuing with their respective executions without reissuing the thread wait( ); [0079] The predetermined clearing condition for the barrier, in an embodiment, is an expected number of arrives and an expected number of transaction completions having being reached at the barrier.).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Giroux with the teachings of Guo to reduce overhead (see Guo Abstract A new wait mechanism reduces software overhead associated with waiting on a barrier.).
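For illustration only, the sequence of states recited in claim 11 (initialized/idle, barrier arriving, notifying) can be modeled as the following state machine. The state names come from the claim language; the transition logic is an illustrative assumption, not code from the application or the references.

```python
# Illustrative state machine for a re-usable named barrier shared by
# workgroups on multiple cores; not taken from any cited reference.
IDLE, ARRIVING, NOTIFYING = "idle", "arriving", "notifying"

class NamedBarrier:
    def __init__(self, name, workgroup_count):
        self.name = name
        self.expected = workgroup_count
        self.arrived = set()
        self.state = IDLE

    def notify_arrival(self, workgroup_id):
        """Receive notification of barrier arrival from a workgroup."""
        if self.state == IDLE:
            self.state = ARRIVING  # first arrival starts the phase
        self.arrived.add(workgroup_id)
        if len(self.arrived) == self.expected:
            self.state = NOTIFYING  # remaining workgroups have arrived

    def notify_completion(self):
        """Notify the workgroups of completion of the synchronization
        phase, then return to idle so the named barrier can be re-used."""
        assert self.state == NOTIFYING
        notified = sorted(self.arrived)
        self.arrived.clear()
        self.state = IDLE
        return notified
```

Returning to the idle state after notification corresponds to the transition recited in dependent claim 12, and re-initializing with a different workgroup count corresponds to the re-use recited in claim 13.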
As per claim 12, Giroux and Guo teach the method as in claim 11. Giroux teaches further comprising transitioning the named barrier into an idle state after notifying the workgroups executed at the multiple graphics cores of completion of the synchronization phase (Col. 12 line 64-Col. 13 line 3 if the copy engine's operation causes the arrival counter decoder 1956 (which functions as a comparator that compares the count of the counter with a predetermined value and resets the counter and the phase indicator based on results of the comparison) to determine that no more threads or copy engine operations are awaited before the synchronization barrier can reset; Col. 11 lines 61-66 the phase counter 1902 can be a single bit state flag (i.e., a one-bit counter) that is flipped (incremented) each time the barrier is reset. Such a state flag can be used into indicate to threads that the state of the barrier has changed to the next processing phase—meaning that the threads do not (or no longer need to) block on the barrier; Col. 9 lines 7-11 Such memory-implemented synchronization barriers can therefore be used to synchronize between threads running on a common core 206, between different warps running on different cores, different processes running on the same or different GPUs; claim 6 a memory access circuit that resets the counter and changes the phase indicator in response to the counter indicating that all threads in the collection of threads and the at least one copy operation have reached a synchronization point and all operations in said collection of threads have complete).
As per claim 13, Giroux and Guo teach the method as in claim 12. Giroux teaches further comprising: initializing the re-usable named barrier for a first number of barrier participants; completing a first synchronization phase for the first number of barrier participants; initializing the re-usable named barrier for a second number of barrier participants; and completing a second synchronization phase for the second number of barrier participants (Col. 2 lines 42-56 a program can potentially use a first synchronization barrier to block a first group of threads and a second, different synchronization barrier to block an additional group of threads (or sometimes the same synchronization barrier is reused;).
As per claim 14, Giroux and Guo teach the method as in claim 13. Giroux teaches wherein initializing the re-usable named barrier for the first number of barrier participants includes specifying a first workgroup mask to identify a first plurality of workgroups executed by the multiple graphics cores of the graphics processor and initializing the re-usable named barrier for the second number of barrier participants includes specifying a second workgroup mask to identify a second plurality of workgroups executed by the multiple graphics cores of the graphics processor (Col. 2 lines 42-56 a program can potentially use a first synchronization barrier to block a first group of threads and a second, different synchronization barrier to block an additional group of threads (or sometimes the same synchronization barrier is reused; Col. 2 lines 35-44 In modern GPU architectures, many execution threads execute concurrently, and many warps each comprising many threads also execute concurrently. When threads in a warp need to perform more complicated communications or collective operations, the developer can use for example NVIDIA's CUDA “_syncwarp” primitive to synchronize threads. The _syncwarp primitive initializes hardware mechanisms that cause an executing thread to wait before resuming execution until all threads specified in a mask have called the primitive with the same mask; Col. 8 lines 59-61 the synchronization barrier could be stored within local graphics memory 114 and be shared by all streaming multiprocessors (SMs) 204 (see FIG. 2B) within the GPU 110).
Claims 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Giroux in view of Itou (US 20140026138 A1).
As per claim 15, Giroux teaches a data processing system comprising: a memory device; and a graphics processor coupled with the memory device (Col. 8 lines 56-58 For example, referring to the FIG. 2A example system showing a CPU(s) 101, a GPU(s) 110 with local graphics memory(ies) 114), the graphics processor including a cache memory and a graphics core cluster coupled with the cache memory (Col. 8 lines 59-66 the synchronization barrier could be stored within local graphics memory 114 and be shared by all streaming multiprocessors (SMs) 204 (see FIG. 2B) within the GPU 110, or it could be stored in main memory 115 and shared between the CPU 101 and the GPU 110. In more detail, the local graphics memory 114 may be organized into a different hierarchies such as level 3 cache, level 2 cache, level 1 cache), the graphics core cluster including: a first graphics core including first barrier circuitry, the first barrier circuitry to synchronize hardware threads of the first graphics core; a second graphics core including second barrier circuitry, the second barrier circuitry to synchronize hardware threads of the second graphics core (Col. 2 lines 52-55 a program can potentially use a first synchronization barrier to block a first group of threads and a second, different synchronization barrier to block an additional group of threads; Col. 9 lines 7-11 Such memory-implemented synchronization barriers can therefore be used to synchronize between threads running on a common core 206, between different warps running on different cores, different processes running on the same or different GPUs).
Giroux fails to teach cluster barrier circuitry to synchronize the first barrier circuitry with the second barrier circuitry.
However, Itou teaches cluster barrier circuitry to synchronize the first barrier circuitry with the second barrier circuitry ([0426] With the second setting example, synchronization among barriers of two hierarchies referred to in the first embodiment can be performed; [0080] The barrier synchronization mechanism 531 controls the barrier synchronization, and is implemented, for example, with a hardware circuit and a processor. [0081] The barrier synchronization mechanism 531 includes a bottom unit 541 and a top unit 551.).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Giroux with the teachings of Itou to reduce processing time (see Itou [0113] With the barrier synchronization mechanism according to the first embodiment, each barrier synchronization mechanism processes synchronization completion of hardware threads within a barrier bank to which the local barrier synchronization mechanism belongs, and processes a notification of barrier synchronization completion of other barrier bank synchronization mechanisms. As a result, a load is lightened compared with a conventional centralized barrier synchronization management mechanism, and thereby the length of processing time can be reduced.).
As per claim 16, Giroux and Itou teach the data processing system as in claim 15. Giroux teaches the first graphics core including direct memory access circuitry configured to perform an asynchronous load of data from the memory device via the cache memory and transmit the data to the second graphics core (Col. 3 lines 11-12 Copy engines (such as direct memory access (DMA) units); Col. 4 lines 1-3 Asynchronous copy hardware from the same streaming multiprocessor (SM) is able to participate as a virtual or “moral” thread; Col. 9 lines 44-46 The copy engine 210 may also copy data for a thread from locations in system memory and store the data into GPU memory 214; Col. 8 lines 63-66 the local graphics memory 114 may be organized into a different hierarchies such as level 3 cache, level 2 cache, level 1 cache; Col. 4 lines 12-14 We also make it possible to introduce more asynchronous copy operations to the streaming multiprocessors (SMs) of a GPU).
As per claim 17, Giroux and Itou teach the data processing system as in claim 16. Giroux teaches the first barrier circuitry additionally configured to synchronize the direct memory access circuitry with the hardware threads of the first graphics core (Col. 10 lines 41-43 the same synchronization barrier can synchronize the 100 DMA operations and the 100 threads; Col. 9 lines 8-9 synchronize between threads running on a common core 206; Col. 8 lines 59-61 the synchronization barrier could be stored within local graphics memory 114 and be shared by all streaming multiprocessors (SMs) 204 (see FIG. 2B) within the GPU 110).
Claims 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Giroux and Itou, as applied to claim 17 above, in view of Guo.
As per claim 18, Giroux and Itou teach the data processing system as in claim 17.
Giroux and Itou fail to teach the first barrier circuitry configured to notify the cluster barrier circuitry of completion of the asynchronous load of the data.
However, Guo teaches the first barrier circuitry configured to notify the cluster barrier circuitry of completion of the asynchronous load of the data (clm 4 The synchronization barrier unit according to claim 3, further comprising one or more coalescing circuits configured to coalesce thread arrive operations from a remote processor, asynchronous transaction completion operations from a remote hardware unit, thread arrive operations from a local processor and/or asynchronous transaction completion operations from a local hardware unit; [0158] For example, the TMA unit or other units that perform asynchronous transactions can update the barrier by a remote arrive operation. The TMA unit receives commands and does memory copies, and as part of the memory copy or as part of the memory copy completion it can update the appropriate transaction barrier(s) by causing the synchronization unit to update the barrier; [0051] The transaction barrier can extend or replace the shared-memory backed arrive-wait barrier in certain implementations, sharing its advantages over the conventional hardware named barrier; [0052] In addition to the thread-arrival tracking capability of the arrive-wait barrier, the transaction barrier introduces the new capability of transaction-arrival tracking and allows user threads to synchronize against both other user threads and asynchronous hardware transactions, which may be generated during, or as a result of, execution of asynchronous data movement features.).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Giroux and Itou with the teachings of Guo to minimize resource utilization (see Guo [0064] The synchronization unit may be deeply integrated with asynchronous data exchange features designs, providing aggressive caching and coalescing to minimize the bandwidth cost from the synchronization traffic.).
As per claim 19, Giroux, Itou, and Guo teach the data processing system as in claim 18. Guo teaches the cluster barrier circuitry configured to notify the second barrier circuitry of completion of the asynchronous load of the data (clm 4 The synchronization barrier unit according to claim 3, further comprising one or more coalescing circuits configured to coalesce thread arrive operations from a remote processor, asynchronous transaction completion operations from a remote hardware unit, thread arrive operations from a local processor and/or asynchronous transaction completion operations from a local hardware unit; [0158] For example, the TMA unit or other units that perform asynchronous transactions can update the barrier by a remote arrive operation. The TMA unit receives commands and does memory copies, and as part of the memory copy or as part of the memory copy completion it can update the appropriate transaction barrier(s) by causing the synchronization unit to update the barrier; [0051] The transaction barrier can extend or replace the shared-memory backed arrive-wait barrier in certain implementations, sharing its advantages over the conventional hardware named barrier; [0052] In addition to the thread-arrival tracking capability of the arrive-wait barrier, the transaction barrier introduces the new capability of transaction-arrival tracking and allows user threads to synchronize against both other user threads and asynchronous hardware transactions, which may be generated during, or as a result of, execution of asynchronous data movement features.).
As per claim 20, Giroux, Itou, and Guo teach the data processing system as in claim 19. Guo teaches the second barrier circuitry is configured to notify the hardware threads of the second graphics core of completion of the asynchronous load ([0052] In addition to the thread-arrival tracking capability of the arrive-wait barrier, the transaction barrier introduces the new capability of transaction-arrival tracking and allows user threads to synchronize against both other user threads and asynchronous hardware transactions, which may be generated during, or as a result of, execution of asynchronous data movement features; clm 4 The synchronization barrier unit according to claim 3, further comprising one or more coalescing circuits configured to coalesce thread arrive operations from a remote processor, asynchronous transaction completion operations from a remote hardware unit, thread arrive operations from a local processor and/or asynchronous transaction completion operations from a local hardware unit; [0066] The illustrated GPU shows how some GPU implementations may enable plural partitions that operate as micro GPUs such as the shown micro GPU0 and micro GPU1; [0023] In modern GPU architectures, many execution threads execute concurrently, and many warps each comprising many threads also execute concurrently).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HSING CHUN LIN whose telephone number is (571)272-8522. The examiner can normally be reached Mon - Fri 9AM-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee Li can be reached at (571) 272-4169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/H.L./Examiner, Art Unit 2195
/Aimee Li/Supervisory Patent Examiner, Art Unit 2195