DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the claims filed 12/30/2022.
Claims 1-20 are pending.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1, 9, and 15 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1 and 6 of copending Application No. 18/148,997 (hereafter ‘997) in view of Palmer et al., Pub. No. US 20230289215 A1 (hereafter Palmer), as exemplified in the table below.
Instant Application
18/148,997
1. An apparatus comprising:
one or more processors including at least a graphics processing unit (GPU), the GPU
including one or more clusters of cores and a memory; wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared local memory, and gateway circuitry;
wherein the GPU is to: initiate broadcast of a data element from a producer core to one or more consumer cores,
and synchronize the broadcast of the data element utilizing the gateway circuitry of the producer core and the one or more consumer cores; and wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element.
1. An apparatus comprising:
one or more processors including at least a first processor, the first processor including one or more clusters of cores and a memory;
Palmer ¶ [0125] states “a CGA could require the CTAs it references to all run on the same portion (GPC and/or μGPU) of a GPU, on the same GPU, on the same cluster of GPUs, etc.” Examiner’s Note: the environment includes one or more GPUs.
(1. continued) wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared memory, and broadcast circuitry;
7. The apparatus of claim 6, wherein each core of the first core cluster includes gateway circuitry, the gateway circuitry of the cores providing synchronization for the broadcast of the data element.
(1. continued) and wherein a first core in a first cluster of cores is to: request a data element, determine whether any additional cores in the first cluster require the data element, and upon determining that one or more additional cores in the first cluster require the data element, broadcast the data element to the one or more additional cores
Palmer ¶ [0319] states “source SM can communicate with any target SM”
7. The apparatus of claim 6, wherein each core of the first core cluster includes gateway circuitry, the gateway circuitry of the cores providing synchronization for the broadcast of the data element.
Palmer ¶ [0274] states “Barriers are useful for example to synchronize all of the CTAs in a CGA for any reason”. ¶ [0239] states “enable CTAs that are concurrently running on SMs to read from, write to, and do atomic accesses to memory allocated to other CTAs running on other SMs”. Examiner’s Note: barriers are multi-core because they synchronize CTAs executing on multiple different SMs.
9. One or more non-transitory computer-readable storage mediums having stored thereon executable computer program instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising:
initiating broadcast of a data element from a producer core of a cluster of cores of a graphics processing unit (GPU) to one or more consumer cores of the cluster, each core including one or more processing resources, shared local memory,
and gateway circuitry; and synchronizing the broadcast of the data element utilizing gateway circuitry of the producer core and the one or more consumer cores; wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element.
9. One or more non-transitory computer-readable storage mediums having stored thereon executable computer program instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising:
requesting a data element by a first core in a first cluster of cores of a graphics processor, each core of the first cluster of cores including one or more processing resources, shared memory, and broadcast circuitry; determining whether any additional cores in the first cluster require the data element; and upon determining that one or more additional cores in the first cluster require the data element, broadcasting the data element to the one or more additional cores via interconnects between broadcast circuitry of the cores of the first core cluster.
Palmer ¶ [0319] states “source SM can communicate with any target SM”
14. The one or more non-transitory computer-readable storage mediums of claim 13, wherein the data element is directed to the SLM portion, and wherein the executable computer program instructions further include instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: providing, by the cores of the first cluster of cores, synchronization for the broadcast of the data element.
Palmer ¶ [0319] states “source SM can communicate with any target SM”
Palmer ¶ [0274] states “Barriers are useful for example to synchronize all of the CTAs in a CGA for any reason”. ¶ [0239] states “enable CTAs that are concurrently running on SMs to read from, write to, and do atomic accesses to memory allocated to other CTAs running on other SMs”. Examiner’s Note: barriers are multi-core because they synchronize CTAs executing on multiple different SMs.
15. A method comprising:
initiating broadcast of a data element from a first core of a cluster of cores of a graphics processing unit (GPU) to one or more other cores of the cluster, each core including one or more processing resources, shared local memory,
and gateway circuitry; and synchronizing the broadcast of the data element utilizing gateway circuitry of the first core and the one or more other cores; wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element.
15. A method comprising:
requesting a data element by a first core in a first cluster of cores of a graphics processor, each core of the first cluster of cores including one or more processing resources, shared memory, and broadcast circuitry;
determining whether any additional cores in the first cluster require the data element; and upon determining that one or more additional cores in the first cluster require the data element, broadcasting the data element to the one or more additional cores via interconnects between broadcast circuitry of the cores of the first core cluster.
20. wherein the data element is directed to the SLM portion, and further comprising: providing, by the cores of the first cluster of cores, synchronization for the broadcast of the data element.
Palmer ¶ [0274] states “Barriers are useful for example to synchronize all of the CTAs in a CGA for any reason”. ¶ [0239] states “enable CTAs that are concurrently running on SMs to read from, write to, and do atomic accesses to memory allocated to other CTAs running on other SMs”. Examiner’s Note: barriers are multi-core because they synchronize CTAs executing on multiple different SMs.
Although the claims at issue are not identical, they are not patentably distinct from each other because the instant application and ‘997 overlap in scope. The instant application claims that each core has “shared local memory”, while ‘997 recites that each core has “shared memory”. This difference in language does not create a patentable distinction because the “shared memory” is in each core, which by definition makes it local to the core. Therefore, the claims of the instant application and ‘997 are not patentably distinct despite minor differences in language.
Further, the system including one or more GPUs, the elements of a producer and consumer core, and the multi-core barrier of Palmer would have been obvious to combine with the gateway and cluster of cores of the instant application. One of ordinary skill in the art would have been motivated to combine these teachings so that “CGAs guarantee all their CTAs execute concurrently”, which enables “other hardware optimizations are possible such as: Multicasting data returned from memory to multiple SMs (CTAs) to save interconnect bandwidth. Direct SM-to-SM communication for lower latency data sharing and improved synchronization between producer and consumer threads in the CGA. Hardware barriers for synchronizing execution across all (or any) threads in a CGA” (¶¶ [0116]–[0119]). One of ordinary skill in the art would recognize the benefits of improved interconnect bandwidth, lower latency data sharing, and improved synchronization.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 7 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. The term “neighboring” in claim 7 is a relative term which renders the claim indefinite. The term “neighboring” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. For the purposes of examination, the term “neighboring” will be interpreted to mean any other core within the cluster of cores.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 5, 6, 8-10, 13-16, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Valerio et al., Pub. No. US 20210263785 A1 (hereafter Valerio), in view of Parle et al., Pub. No. US 20240289132 A1 (hereafter Parle).
With regard to claim 1, Valerio teaches an apparatus comprising: one or more processors including at least a graphics processing unit (GPU), the GPU including one or more clusters of cores and a memory (¶ [0225] states “computing device 1900 may include any number and type of hardware and/or software components, such as (without limitation) GPU 1914”. ¶ [0243] states “GPU 1914 is divided into slices, where each slice includes a plurality of slices”. See FIG. 3B and FIG. 20. Examiner’s Note: in FIG. 20, the slice 2000 is the cluster of cores. Sub-Slice 2005A-C are the cores. In FIG. 3B, memory 326A-D is within the GPU);
wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared local memory, and gateway circuitry (¶ [0244] states “each sub-slice 2005 includes processing resources such as execution units (EUs) 2110, a shared local memory (SLM) 2020 and a gateway 2050”. See FIG. 21);
wherein the GPU is to: initiate broadcast of a data element from a producer core to one or more consumer cores, and synchronize the broadcast of the data element utilizing the gateway circuitry of the producer core and the one or more consumer cores (¶ [0248] states “barrier synchronization mechanism 2130 enables fine-grained synchronization such that M producer threads (e.g., data generating threads) and N consumer threads (e.g., data consuming threads) may be implemented”. ¶ [0250] states “barrier synchronization mechanism 2130 causes each thread of a named barrier to open and acquire (or assigned) a handle, which enables gateway 2150 to register physical thread identifiers (or IDs) as part of the named barrier”. Examiner’s Note: there are producer and consumer threads that are synchronized using the gateway);
Although Valerio teaches the synchronization of threads using a gateway, Valerio does not explicitly teach the broadcast of data between multiple cores or the synchronization of a broadcast between multiple cores.
However, in an analogous art, Parle teaches wherein the GPU is to: initiate broadcast of a data element from a producer core to one or more consumer cores, and synchronize the broadcast of the data element utilizing the gateway circuitry of the producer core and the one or more consumer cores (¶ [0053] states “SM 204 generates a multicast request packet 224. More specifically, a thread in the CTA executing on SM 204 generates the multicast request packet 224. The SM 204 may be referred to as the “multicast source SM” or “multicast requesting SM” because the thread that generates the multicast request packet is on SM 204”. ¶ [0054] states “The thread that generates the multicast request packet 224, may be referred to as the “leader thread”. ¶ [0056] states “The multicast request packet 224 is transmitted on the request crossbar 208 to the L2 Request Coalescer (LRC) 212.” ¶ [0059] states “An LRC multicast response packet 230 that comprises the requested data received from the L2s slice 220 and information regarding the multiple receivers for the requested data is generated”. ¶ [0060] states “the result data carried in packet 230 is duplicated to two packets 232 and 234 for receiving SMs 204 and 206 as identified in the list of receivers, respectively, at a separation point which is a point in the crossbar at which the common path from an input port to the receiver crossbar 210 separates to a first path to SM 204 and a second path to SM 206”. ¶ [0062] states “a synchronization technique may be utilized by the sender SM 204 and receiver SMs in order to detect completion of the transaction or an errored transaction”. See ¶ [0053] – [0064] for further details. Examiner’s Note: the SM is the core).
and wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received. The load instruction may specify the barrier address”. ¶ [0084] states “the new load operation include the global memory address to read from (e.g. source data address), destination (receiver) CTAs/SMs, destination shared memory address, and the synchronization entity” and “The synchronization entity may be represented by a barrier ID to indicate completion”. Examiner’s Note: the barrier is used by the source SM and receiving SMs, so it is a multi-core barrier).
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the multicast of data and synchronization of SMs of Parle with the GPU architecture and gateway synchronization of Valerio. A person having ordinary skill in the art would have been motivated to make this combination “to reduce the bandwidth and power required to move the same amount of data and better scale” (¶ [0047]). Specifically, the combination improves L2 bandwidth (¶ [0040] states “L2 cache to SM bandwidth (referred to also as “L2 bandwidth”) improvements”). One of ordinary skill in the art would recognize the efficiency improvement of fetching L2 data once and then broadcasting the result, as opposed to fetching the same data from L2 multiple times.
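Purely as an illustrative, non-record analogy (not a characterization of any claim, reference, or actual GPU implementation), the producer-signal/consumer-wait pattern described in the cited paragraphs can be sketched with host-side Python threading primitives; all names below are hypothetical:

```python
# Hypothetical host-side analogy of a producer "broadcasting" a data
# element to consumers and synchronizing the broadcast via a shared
# barrier, mirroring the producer-signal / consumer-wait pattern the
# cited paragraphs describe. Not code from any cited reference.
import threading

NUM_CONSUMERS = 3
broadcast_slot = {}                             # stands in for shared local memory
barrier = threading.Barrier(NUM_CONSUMERS + 1)  # one "multi-core" barrier
received = []
received_lock = threading.Lock()

def producer():
    broadcast_slot["data"] = 42   # producer writes the data element
    barrier.wait()                # signal: data is available to consumers

def consumer():
    barrier.wait()                # wait until the producer signals
    with received_lock:
        received.append(broadcast_slot["data"])

threads = [threading.Thread(target=producer)]
threads += [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(received)  # every consumer observed the single broadcast value
```

The barrier guarantees every consumer reads the element only after the producer has written it, which is the synchronization role attributed to the gateway/barrier circuitry in the mapped claim language.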
With regard to claim 2, Valerio and Parle teach the apparatus of claim 1. Valerio additionally teaches wherein synchronizing the broadcast of the data element further includes: generating a producer broadcast instruction associated with the multi-core barrier (¶ [0250] states “producer and consumer threads use the same named barrier to signal to (or wait for) wait for each other. In one embodiment, a producer thread first signals the availability of a resource using the named barrier”. Examiner’s Note: the producer signaling the barrier is the producer broadcast instruction);
and generating a consumer broadcast instruction from each of the one or more consumer cores (¶ [0250] states “producer and consumer threads use the same named barrier to signal to (or wait for) wait for each other … the consumer thread waits for the signal from producer”. Examiner’s Note: the consumer waiting on the barrier is the consumer broadcast instruction).
Parle additionally teaches generating a producer broadcast instruction associated with the multi-core barrier (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received”. Examiner’s Note: there are multiple SMs, so the barrier is a multi-core barrier);
and generating a consumer broadcast instruction from each of the one or more consumer cores (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received”. Examiner’s Note: there are multiple SMs, or cores, that have threads that can wait. The receiving SMs wait on the barrier by calling the wait instruction).
With regard to claim 5, Valerio and Parle teach the apparatus of claim 1. Valerio additionally teaches and close a barrier ID for the multi-core barrier (¶ [0252] states “Once a named barrier is closed, gateway counters 2157 and EU 2110 notification registers for this named barrier are reset so that the next workgroup can use the barrier”. See FIG. 23. ¶ [0026] states “FIG. 23 illustrates one embodiment of pseudo code to implement a convolution kernel flow using named barriers”. ¶ [0266] states “the first named barrier is closed upon completion of execution of the first set of execution threads”. Examiner’s Note: the pseudo code in FIG. 23 includes both signal and wait instructions. This means that there are instructions for producers and consumers in FIG. 23. At the end of FIG. 23, the barriers are closed. The consumer thread closes the named barrier).
Parle additionally teaches wherein synchronizing the broadcast of the data element further includes: upon consumer threads of the one or more consumer cores having each received the data element, the one or more consumer cores to: provide confirmation to the producer core that receipt of the data element is complete (¶ [0108] states “an Ack is generated by the destination SM and sent to the source SM ID (the source SM ID information may be included in the received response packet) to signal completion of the operation at the destination SM.”),
and close a barrier ID for the multi-core barrier (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received. The load instruction may specify the barrier address (e.g. as a shared memory offset)”. Examiner’s Note: the barrier is used by multiple SMs, or cores).
With regard to claim 6, Valerio and Parle teach the apparatus of claim 5. Valerio additionally teaches and close a barrier ID for the multi-core barrier (¶ [0252] states “Once a named barrier is closed, gateway counters 2157 and EU 2110 notification registers for this named barrier are reset so that the next workgroup can use the barrier”. See FIG. 23. ¶ [0026] states “FIG. 23 illustrates one embodiment of pseudo code to implement a convolution kernel flow using named barriers”. ¶ [0266] states “the first named barrier is closed upon completion of execution of the first set of execution threads”. Examiner’s Note: the pseudo code in FIG. 23 includes both signal and wait instructions. This means that there are instructions for producers and consumers in FIG. 23. At the end of FIG. 23, the barriers are closed. The producer thread closes the named barrier).
Parle additionally teaches wherein synchronizing the broadcast of the data element further includes: upon the producer core receiving the confirmation from the one or more consumer cores, the producer core to: cease broadcast of the data element to the one or more consumer cores (¶ [0109] states “The multicast sender SM keeps track of all outstanding transactions in counters”. ¶ [0114] states “The sender SM keeps track of total outstanding requests, not per receiver SM”. ¶ [0061] states “Each of the one or more receivers of multicast result data sends an ack message to the requester of the multicast data”. ¶ [0108] states “an Ack is generated by the destination SM and sent to the source SM ID (the source SM ID information may be included in the received response packet) to signal completion of the operation at the destination SM”. Examiner’s Note: the sender SM keeps track of outstanding requests. When the sender SM receives an acknowledgement of a completed operation from the destination SM, the broadcast is complete and the transaction is no longer outstanding. When the broadcast is completed, it has ended),
and close the barrier ID for the multi-core barrier (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received. The load instruction may specify the barrier address (e.g. as a shared memory offset)”. Examiner’s Note: the barrier is used by multiple SMs, or cores).
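As a purely illustrative, non-record analogy of the Ack/outstanding-counter completion scheme described in the paragraphs cited for claim 6 (not code from any reference; all names are hypothetical), a minimal sketch might look like:

```python
# Hypothetical sketch: a sender tracks outstanding broadcast
# transactions in a counter, and each receiver acknowledges receipt,
# mirroring the Ack/counter completion tracking the cited paragraphs
# describe. Not code from any cited reference.
class Sender:
    def __init__(self):
        self.outstanding = 0          # outstanding broadcast transactions

    def broadcast(self, receivers, data):
        self.outstanding += len(receivers)
        for r in receivers:
            r.receive(data, self)     # deliver the element to each receiver

    def ack(self):
        self.outstanding -= 1         # one receiver signaled completion

    def broadcast_complete(self):
        return self.outstanding == 0  # all receivers have acknowledged


class Receiver:
    def __init__(self):
        self.data = None

    def receive(self, data, sender):
        self.data = data
        sender.ack()                  # send completion Ack back to the source


sender = Sender()
receivers = [Receiver(), Receiver()]
sender.broadcast(receivers, "element")
print(sender.broadcast_complete())  # True once every receiver has acked
```

Once the counter returns to zero, the sender knows the broadcast has completed, which corresponds to the point at which the broadcast ceases and the barrier ID can be closed in the mapped claim language.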
With regard to claim 8, Valerio and Parle teach the apparatus of claim 1. Additionally, Parle teaches wherein the producer core and the one or more consumer cores are to receive the data element at the shared local memory of the respective core (¶¶ [0096] and [0099] state ““Metadata” transported from source SM to destination SMs may include … Data SMEM (shared memory) Offset”. ¶ [0096] also states “Destination SMs may not have metadata that describe how the received data is to be processed (e.g., such as, received data should be written in image-to-column format, etc.), unlike source SMs”. ¶ [0104] states “The shared memory offset represents the offset, from the shared memory base address for the SM, at which the result data should be written”. Examiner’s Note: both source and destination SMs use the shared memory offset to write received data).
With regard to claim 9, Valerio teaches one or more non-transitory computer-readable storage mediums having stored thereon executable computer program instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising (¶ [0163] states “One or more aspects of at least one embodiment may be implemented by representative code stored on a machine-readable medium which represents and/or defines logic within an integrated circuit such as a processor. For example, the machine-readable medium may include instructions which represent various logic within the processor”):
initiating broadcast of a data element from a producer core of a cluster of cores of a graphics processing unit (GPU) to one or more consumer cores of the cluster, each core including one or more processing resources, shared local memory, and gateway circuitry (¶ [0225] states “computing device 1900 may include any number and type of hardware and/or software components, such as (without limitation) GPU 1914”. ¶ [0244] states “each sub-slice 2005 includes processing resources such as execution units (EUs) 2110, a shared local memory (SLM) 2020 and a gateway 2050”. See FIG. 21);
and synchronizing the broadcast of the data element utilizing gateway circuitry of the producer core and the one or more consumer cores (¶ [0248] states “barrier synchronization mechanism 2130 enables fine-grained synchronization such that M producer threads (e.g., data generating threads) and N consumer threads (e.g., data consuming threads) may be implemented”. ¶ [0250] states “barrier synchronization mechanism 2130 causes each thread of a named barrier to open and acquire (or assigned) a handle, which enables gateway 2150 to register physical thread identifiers (or IDs) as part of the named barrier”. Examiner’s Note: there are producer and consumer threads that are synchronized using the gateway);
Although Valerio teaches the synchronization of threads using a gateway, Valerio does not explicitly teach the broadcast of data between multiple cores or the synchronization of a broadcast between multiple cores.
However, in an analogous art, Parle teaches initiating broadcast of a data element from a producer core of a cluster of cores of a graphics processing unit (GPU) to one or more consumer cores of the cluster, each core including one or more processing resources, shared local memory, and gateway circuitry (¶ [0053] states “SM 204 generates a multicast request packet 224. More specifically, a thread in the CTA executing on SM 204 generates the multicast request packet 224. The SM 204 may be referred to as the “multicast source SM” or “multicast requesting SM” because the thread that generates the multicast request packet is on SM 204”. ¶ [0054] states “The thread that generates the multicast request packet 224, may be referred to as the “leader thread”. ¶ [0056] states “The multicast request packet 224 is transmitted on the request crossbar 208 to the L2 Request Coalescer (LRC) 212.” ¶ [0059] states “An LRC multicast response packet 230 that comprises the requested data received from the L2s slice 220 and information regarding the multiple receivers for the requested data is generated”. ¶ [0060] states “the result data carried in packet 230 is duplicated to two packets 232 and 234 for receiving SMs 204 and 206 as identified in the list of receivers, respectively, at a separation point which is a point in the crossbar at which the common path from an input port to the receiver crossbar 210 separates to a first path to SM 204 and a second path to SM 206”. ¶ [0062] states “a synchronization technique may be utilized by the sender SM 204 and receiver SMs in order to detect completion of the transaction or an errored transaction”. See ¶ [0053] – [0064] for further details. Examiner’s Note: the SM is the core);
and synchronizing the broadcast of the data element utilizing gateway circuitry of the producer core and the one or more consumer cores (¶ [0062] states “a synchronization technique may be utilized by the sender SM 204 and receiver SMs in order to detect completion of the transaction or an errored transaction”);
wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received. The load instruction may specify the barrier address (e.g. as a shared memory offset)”. ¶ [0084] states “the new load operation include the global memory address to read from (e.g. source data address), destination (receiver) CTAs/SMs, destination shared memory address, and the synchronization entity” and “The synchronization entity may be represented by a barrier ID to indicate completion”. Examiner’s Note: the barrier is used by the source SM and receiving SMs, so it is a multi-core barrier).
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the multicast of data and synchronization of SMs of Parle with the GPU architecture and gateway synchronization of Valerio. A person having ordinary skill in the art would have been motivated to make this combination “to reduce the bandwidth and power required to move the same amount of data and better scale” (¶ [0047]). Specifically, the combination improves L2 bandwidth (¶ [0040] states “L2 cache to SM bandwidth (referred to also as “L2 bandwidth”) improvements”). One of ordinary skill in the art would recognize the efficiency improvement of fetching L2 data once and then broadcasting the result, as opposed to fetching the same data from L2 multiple times.
With regard to claim 10, Valerio and Parle teach the one or more non-transitory computer-readable storage mediums of claim 9. Valerio additionally teaches wherein synchronizing the broadcast of the data element further includes: generating a producer broadcast instruction associated with the multi-core barrier (¶ [0250] states “producer and consumer threads use the same named barrier to signal to (or wait for) wait for each other. In one embodiment, a producer thread first signals the availability of a resource using the named barrier”. Examiner’s Note: the producer signaling the barrier is the producer broadcast instruction);
and generating a consumer broadcast instruction from each of the one or more consumer cores (¶ [0250] states “producer and consumer threads use the same named barrier to signal to (or wait for) wait for each other … the consumer thread waits for the signal from producer”. Examiner’s Note: the consumer waiting on the barrier is the consumer broadcast instruction).
Parle additionally teaches generating a producer broadcast instruction associated with the multi-core barrier (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received”. Examiner’s Note: there are multiple SMs, so the barrier is a multi-core barrier).
and generating a consumer broadcast instruction from each of the one or more consumer cores (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received”. Examiner’s Note: there are multiple SMs, or cores, that have threads that can wait. The receiving SMs wait on the barrier by calling the wait instruction).
With regard to claim 13, Valerio and Parle teach the one or more non-transitory computer-readable storage mediums of claim 9. Valerio additionally teaches and close a barrier ID for the multi-core barrier (¶ [0252] states “Once a named barrier is closed, gateway counters 2157 and EU 2110 notification registers for this named barrier are reset so that the next workgroup can use the barrier”. See FIG. 23. ¶ [0026] states “FIG. 23 illustrates one embodiment of pseudo code to implement a convolution kernel flow using named barriers”. ¶ [0266] states “the first named barrier is closed upon completion of execution of the first set of execution threads”. Examiner’s Note: the pseudo code in FIG. 23 includes both signal and wait instructions. This means that there are instructions for producers and consumers in FIG. 23. At the end of FIG. 23, the barriers are closed. The consumer thread closes the named barrier).
Parle additionally teaches wherein synchronizing the broadcast of the data element further includes: upon consumer threads of the one or more consumer cores having each received the data element, the one or more consumer cores to: provide confirmation to the producer core that receipt of the data element is complete (¶ [0108] states “an Ack is generated by the destination SM and sent to the source SM ID (the source SM ID information may be included in the received response packet) to signal completion of the operation at the destination SM.”),
and close a barrier ID for the multi-core barrier (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received. The load instruction may specify the barrier address (e.g. as a shared memory offset)”. Examiner’s Note: the barrier is used by multiple SMs, or cores).
With regard to claim 14, Valerio and Parle teach the one or more non-transitory computer-readable storage mediums of claim 13. Valerio additionally teaches and close a barrier ID for the multi-core barrier (¶ [0252] states “Once a named barrier is closed, gateway counters 2157 and EU 2110 notification registers for this named barrier are reset so that the next workgroup can use the barrier”. See FIG. 23. ¶ [0026] states “FIG. 23 illustrates one embodiment of pseudo code to implement a convolution kernel flow using named barriers”. ¶ [0266] states “the first named barrier is closed upon completion of execution of the first set of execution threads”. Examiner’s Note: the pseudo code in FIG. 23 includes both signal and wait instructions. This means that there are instructions for producers and consumers in FIG. 23. At the end of FIG. 23, the barriers are closed. The producer thread closes the named barrier).
Parle additionally teaches wherein synchronizing the broadcast of the data element further includes: upon the producer core receiving the confirmation from the one or more consumer cores, the producer core to: cease broadcast of the data element to the one or more consumer cores (¶ [0109] states “The multicast sender SM keeps track of all outstanding transactions in counters”. ¶ [0114] states “The sender SM keeps track of total outstanding requests, not per receiver SM”. ¶ [0061] states “Each of the one or more receivers of multicast result data sends an ack message to the requester of the multicast data”. ¶ [0108] states “an Ack is generated by the destination SM and sent to the source SM ID (the source SM ID information may be included in the received response packet) to signal completion of the operation at the destination SM”. Examiner’s Note: the sender SM keeps track of outstanding requests. When the sender SM receives an acknowledgement of a completed operation from the destination SM, the broadcast is complete and the transaction is no longer outstanding. When the broadcast is completed, it has ended),
and close the barrier ID for the multi-core barrier (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received. The load instruction may specify the barrier address (e.g. as a shared memory offset)”. Examiner’s Note: the barrier is used by multiple SMs, or cores).
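For illustration only, the acknowledgment scheme cited above for claim 14 (the sender SM tracks total outstanding transactions, not per-receiver state, and each destination SM returns an Ack; Parle ¶¶ [0108]–[0114]) may be modeled with a single counter. All names in this sketch are the examiner's own and do not appear in the cited references:

```python
class MulticastSender:
    """Toy sender that tracks the total number of outstanding multicast
    transactions in one counter, not per receiver SM (cf. Parle,
    paragraph [0114])."""

    def __init__(self):
        self.outstanding = 0

    def multicast(self, n_receivers):
        # One request fans out to n receivers; expect one Ack from each.
        self.outstanding += n_receivers

    def on_ack(self):
        # A destination SM signaled completion of the operation.
        self.outstanding -= 1

    def broadcast_complete(self):
        # No outstanding transactions remain: the broadcast has ended.
        return self.outstanding == 0

sender = MulticastSender()
sender.multicast(n_receivers=3)
for _ in range(3):              # each receiver sends an Ack to the source
    sender.on_ack()
print(sender.broadcast_complete())  # True
```

When the counter returns to zero, the transaction is no longer outstanding and the broadcast has ceased, consistent with the examiner's note in the claim 14 mapping above.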
With regard to claim 15, Valerio teaches a method comprising (¶ [0258] states “Examples may include subject matter such as a method”):
initiating broadcast of a data element from a first core of a cluster of cores of a graphics processing unit (GPU) to one or more other cores of the cluster, each core including one or more processing resources, shared local memory, and gateway circuitry (¶ [0225] states “computing device 1900 may include any number and type of hardware and/or software components, such as (without limitation) GPU 1914”. ¶ [0244] states “each sub-slice 2005 includes processing resources such as execution units (EUs) 2110, a shared local memory (SLM) 2020 and a gateway 2050”. See FIG. 21);
and synchronizing the broadcast of the data element utilizing gateway circuitry of the first core and the one or more other cores (¶ [0248] states “barrier synchronization mechanism 2130 enables fine-grained synchronization such that M producer threads (e.g., data generating threads) and N consumer threads (e.g., data consuming threads) may be implemented”. ¶ [0250] states “barrier synchronization mechanism 2130 causes each thread of a named barrier to open and acquire (or assigned) a handle, which enables gateway 2150 to register physical thread identifiers (or IDs) as part of the named barrier”. Examiner’s Note: there are producer and consumer threads that are synchronized using the gateway);
Although Valerio teaches the synchronization of threads using a gateway, Valerio does not explicitly teach the broadcast of data between multiple cores or the synchronization of such a broadcast between multiple cores.
However, Parle teaches initiating broadcast of a data element from a first core of a cluster of cores of a graphics processing unit (GPU) to one or more other cores of the cluster, each core including one or more processing resources, shared local memory, and gateway circuitry (¶ [0053] states “SM 204 generates a multicast request packet 224. More specifically, a thread in the CTA executing on SM 204 generates the multicast request packet 224. The SM 204 may be referred to as the “multicast source SM” or “multicast requesting SM” because the thread that generates the multicast request packet is on SM 204”. ¶ [0054] states “The thread that generates the multicast request packet 224, may be referred to as the “leader thread”. ¶ [0056] states “The multicast request packet 224 is transmitted on the request crossbar 208 to the L2 Request Coalescer (LRC) 212.” ¶ [0059] states “An LRC multicast response packet 230 that comprises the requested data received from the L2s slice 220 and information regarding the multiple receivers for the requested data is generated”. ¶ [0060] states “the result data carried in packet 230 is duplicated to two packets 232 and 234 for receiving SMs 204 and 206 as identified in the list of receivers, respectively, at a separation point which is a point in the crossbar at which the common path from an input port to the receiver crossbar 210 separates to a first path to SM 204 and a second path to SM 206”. ¶ [0062] states “a synchronization technique may be utilized by the sender SM 204 and receiver SMs in order to detect completion of the transaction or an errored transaction”. See ¶ [0053] – [0064] for further details. Examiner’s Note: the SM is the core);
and synchronizing the broadcast of the data element utilizing gateway circuitry of the first core and the one or more other cores (¶ [0062] states “a synchronization technique may be utilized by the sender SM 204 and receiver SMs in order to detect completion of the transaction or an errored transaction”);
wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received. The load instruction may specify the barrier address (e.g. as a shared memory offset)”. ¶ [0084] states “the new load operation include the global memory address to read from (e.g. source data address), destination (receiver) CTAs/SMs, destination shared memory address, and the synchronization entity” and “The synchronization entity may be represented by a barrier ID to indicate completion”. Examiner’s Note: the barrier is used by the source SM and receiving SMs, so it is a multi-core barrier).
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the multicast of data and synchronization of SMs of Parle with the GPU architecture and gateway synchronization of Valerio. A person having ordinary skill in the art would have been motivated to make this combination “to reduce the bandwidth and power required to move the same amount of data and better scale” (¶ [0047]). Specifically, the combination improves L2 bandwidth (¶ [0040] states “L2 cache to SM bandwidth (referred to also as “L2 bandwidth”) improvements”). One of ordinary skill in the art would recognize the efficiency improvement of fetching L2 data once and then broadcasting the result as opposed to fetching L2 data multiple times for the same data.
With regard to claim 16, Valerio and Parle teach the method of claim 15. Valerio additionally teaches wherein synchronizing the broadcast of the data element further includes: generating a first broadcast instruction from the first core associated with the multi-core barrier (¶ [0250] states “producer and consumer threads use the same named barrier to signal to (or wait for) wait for each other. In one embodiment, a producer thread first signals the availability of a resource using the named barrier”. Examiner’s Note: the producer signaling the barrier is the producer broadcast instruction);
and generating a second broadcast instruction from each of the one or more other cores (¶ [0250] states “producer and consumer threads use the same named barrier to signal to (or wait for) wait for each other … the consumer thread waits for the signal from producer”. Examiner’s Note: the consumer waiting on the barrier is the consumer broadcast instruction. The consumer threads are considered the “other threads”).
Parle additionally teaches wherein synchronizing the broadcast of the data element further includes: generating a first broadcast instruction from the first core associated with the multi-core barrier (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received”. Examiner’s Note: there are multiple SMs, so the barrier is a multi-core barrier).
and generating a second broadcast instruction from each of the one or more other cores (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received”. Examiner’s Note: there are multiple SMs, or cores, that have threads that can wait. The receiving SMs wait on the barrier by calling the wait instruction. The wait instruction is the second broadcast instruction).
With regard to claim 19, Valerio and Parle teach the method of claim 15. Valerio additionally teaches and close a barrier ID for the multi-core barrier (¶ [0252] states “Once a named barrier is closed, gateway counters 2157 and EU 2110 notification registers for this named barrier are reset so that the next workgroup can use the barrier”. See FIG. 23. ¶ [0026] states “FIG. 23 illustrates one embodiment of pseudo code to implement a convolution kernel flow using named barriers”. ¶ [0266] states “the first named barrier is closed upon completion of execution of the first set of execution threads”. Examiner’s Note: the pseudo code in FIG. 23 includes both signal and wait instructions. This means that there are instructions for producers and consumers in FIG. 23. At the end of FIG. 23, the barriers are closed. The consumer thread closes the named barrier).
Parle additionally teaches wherein synchronizing the broadcast of the data element further includes: upon threads of the one or more other cores having each received the data element, the one or more other cores to: provide confirmation to the first core that receipt of the data element is complete (¶ [0108] states “an Ack is generated by the destination SM and sent to the source SM ID (the source SM ID information may be included in the received response packet) to signal completion of the operation at the destination SM”),
and close a barrier ID for the multi-core barrier (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received. The load instruction may specify the barrier address (e.g. as a shared memory offset)”. Examiner’s Note: the barrier is used by multiple SMs, or cores).
With regard to claim 20, Valerio and Parle teach the method of claim 19. Valerio additionally teaches and close a barrier ID for the multi-core barrier (¶ [0252] states “Once a named barrier is closed, gateway counters 2157 and EU 2110 notification registers for this named barrier are reset so that the next workgroup can use the barrier”. See FIG. 23. ¶ [0026] states “FIG. 23 illustrates one embodiment of pseudo code to implement a convolution kernel flow using named barriers”. ¶ [0266] states “the first named barrier is closed upon completion of execution of the first set of execution threads”. Examiner’s Note: the pseudo code in FIG. 23 includes both signal and wait instructions. This means that there are instructions for producers and consumers in FIG. 23. At the end of FIG. 23, the barriers are closed. The producer thread closes the named barrier).
Parle additionally teaches wherein synchronizing the broadcast of the data element further includes: upon the first core receiving the confirmation from the one or more other cores, the first core to: cease broadcast of the data element to the one or more other cores (¶ [0109] states “The multicast sender SM keeps track of all outstanding transactions in counters”. ¶ [0114] states “The sender SM keeps track of total outstanding requests, not per receiver SM”. ¶ [0061] states “Each of the one or more receivers of multicast result data sends an ack message to the requester of the multicast data”. ¶ [0108] states “an Ack is generated by the destination SM and sent to the source SM ID (the source SM ID information may be included in the received response packet) to signal completion of the operation at the destination SM”. Examiner’s Note: the sender SM keeps track of outstanding requests. When the sender SM receives an acknowledgement of a completed operation from the destination SM, the broadcast is complete and the transaction is no longer outstanding. When the broadcast is completed, it has ended),
and close the barrier ID for the multi-core barrier (¶ [0116] states “In the programming model for the barriers, the source SM issues a load for a defined number of bytes, and the receiving SMs each wait on a barrier for that number of bytes to be received. The load instruction may specify the barrier address (e.g. as a shared memory offset)”. Examiner’s Note: the barrier is used by multiple SMs, or cores).
Claim(s) 3, 4, 11, 12, 17, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Valerio in view of Parle, and further in view of Jiang et al., US Patent Application Publication No. 2021/0334127 A1 (hereafter Jiang).
With regard to claim 3, Valerio and Parle teach the apparatus of claim 2. Valerio and Parle do not explicitly teach the consumer broadcast instruction including a count of local threads and a total count of consumer threads to receive the data element.
However, in an analogous art, Jiang teaches wherein the consumer broadcast instruction for each consumer compute core of the one or more consumer cores further includes: a count of local consumer threads of the consumer core to receive the data element; and a total count of consumer threads to receive the data element (¶ [0132] states “Once all threads have sent the barrier instruction, a write-back message is broadcast to all threads to indicate that the barrier operation is complete for all requesting threads”. ¶ [0134] states “In one embodiment, for each barrier identifier, a barrier counter for the identifier and reduction state data can be stored within the reduction state registers. In one embodiment the registers are configured to support up to an 8-bit barrier counter to support up to 256 threads in a thread group”. See TABLE-US-00003 which includes a field called “barrier count”. ¶ [0170] states “enable synchronization between the multiple threads via a merged write, barrier, and read operation”. Examiner’s Note: the barrier count corresponds to the number of local threads to receive data and/or to the total number of consumer threads to receive the element).
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the barrier message payload including a barrier counter of Jiang with the synchronized multicast of data of Valerio and Parle. Additionally, the barrier message payload shows that multiple fields of data, including various counts, can be included in a barrier message. A person having ordinary skill in the art would have been motivated to make this combination for the purpose of reducing I/O and improving performance (¶ [0115] states “The write and read operation of the reduce phase can introduce a large amount of I/O into the processing operation, which can significantly reduce the performance of the reduce operation. To resolve this issue, embodiments described herein provide a system and method to eliminate the read and write operation in the reduce phase by merging the read and write operations into the barrier function”).
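For illustration only, the barrier counter mechanism cited from Jiang above (each requesting thread sends a barrier instruction that bumps a counter; once all threads in the thread group have arrived, a write-back message is broadcast to all of them; ¶¶ [0132], [0134]) may be sketched as the following toy model. All names in this sketch are the examiner's own and do not appear in the cited references:

```python
class NamedBarrier:
    """Toy model of Jiang's barrier counter (paragraphs [0132] and [0134]):
    each requesting thread increments the counter, and when the count
    reaches the thread-group total, the write-back is broadcast to
    indicate the barrier operation is complete for all requesters."""

    def __init__(self, total_threads):
        # Jiang describes an 8-bit counter supporting up to 256 threads.
        self.total = total_threads
        self.count = 0
        self.write_back_sent = False

    def barrier_instruction(self):
        # A thread has sent its barrier instruction.
        self.count += 1
        if self.count == self.total:
            # All threads arrived: broadcast the write-back message.
            self.write_back_sent = True

nb = NamedBarrier(total_threads=4)
for _ in range(4):
    nb.barrier_instruction()
print(nb.write_back_sent)  # True once all threads have arrived
```

The counter reaching the thread-group total serves as the confirmation of readiness relied upon in the claim 3 and claim 4 mappings.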
With regard to claim 4, Valerio and Parle teach the apparatus of claim 1. Valerio additionally teaches wherein synchronizing the broadcast of the data element further includes: the one or more consumer cores providing notice that the one or more consumer cores are ready to receive the data element (¶ [0250] states “Subsequently, producer and consumer threads use the same named barrier to signal to (or wait for) wait for each other. In one embodiment, a producer thread first signals the availability of a resource using the named barrier, while the consumer thread waits for the signal from producer”. Examiner’s Note: when the consumer thread has sent the wait signal and is waiting, it means that the consumer thread is ready to receive data. In other words, the wait indicates the thread is at the barrier and the thread is synchronized with other threads at the barrier);
Parle additionally teaches and upon confirming readiness of the one or more consumer cores to receive the data element, the producer core to commence broadcast of the data element (¶ [0053] states “SM 204 generates a multicast request packet 224.” ¶ [0082] states “This request first reads global memory, and then it delivers the data that was read to multiple target CTAs (possibly including the same CTA that initiated/executed the request) as specified in the instruction. The data is read from L2 once, returned to the LRC, and then multicast to all the target CTAs simultaneously on the response crossbar”).
Valerio and Parle do not explicitly teach the confirming readiness of the one or more consumer cores.
However, in an analogous art, Jiang teaches and upon confirming readiness of the one or more consumer cores to receive the data element, the producer core to commence broadcast of the data element (¶ [0132] states “Once all threads have sent the barrier instruction, a write-back message is broadcast to all threads to indicate that the barrier operation is complete for all requesting threads”. Examiner’s Note: checking if all threads have sent the barrier instruction is the confirming of readiness of the threads to receive data. Since all the threads have reached the barrier, when the barrier is released all the threads will execute synchronously. This means that they can all receive data, so it means the broadcast can begin).
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the check for all threads sending the barrier instruction of Jiang with the consumer cores providing notice that they are ready to receive data and the broadcasting of data of Valerio and Parle. A person having ordinary skill in the art would have been motivated to make this combination so that the apparatus can “synchronize the set of barrier requester threads in the thread work group” (¶ [0169]). Synchronization is a necessary step before performing certain operations (¶ [0144] states “the operation is not performed until all threads are synchronized”). These operations are part of the read and write instructions that are merged into the barrier operation to realize improvements such as reduced I/O and improved performance (¶ [0115] states “The write and read operation of the reduce phase can introduce a large amount of I/O into the processing operation, which can significantly reduce the performance of the reduce operation. To resolve this issue, embodiments described herein provide a system and method to eliminate the read and write operation in the reduce phase by merging the read and write operations into the barrier function”).
With regard to claim 11, Valerio and Parle teach the one or more non-transitory computer-readable storage mediums of claim 10. Valerio and Parle do not explicitly teach the consumer broadcast instruction including a count of local threads and a total count of consumer threads to receive the data element.
However, in an analogous art, Jiang teaches wherein the consumer broadcast instruction for each consumer compute core of the one or more consumer cores further includes: a count of local consumer threads of the consumer core to receive the data element; and a total count of consumer threads to receive the data element (¶ [0132] states “Once all threads have sent the barrier instruction, a write-back message is broadcast to all threads to indicate that the barrier operation is complete for all requesting threads”. ¶ [0134] states “In one embodiment, for each barrier identifier, a barrier counter for the identifier and reduction state data can be stored within the reduction state registers. In one embodiment the registers are configured to support up to an 8-bit barrier counter to support up to 256 threads in a thread group”. See TABLE-US-00003 which includes a field called “barrier count”. ¶ [0170] states “enable synchronization between the multiple threads via a merged write, barrier, and read operation”. Examiner’s Note: the barrier count corresponds to the number of local threads to receive data and/or to the total number of consumer threads to receive the element).
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the barrier message payload including a barrier counter of Jiang with the synchronized multicast of data of Valerio and Parle. Additionally, the barrier message payload shows that multiple fields of data, including various counts, can be included in a barrier message. A person having ordinary skill in the art would have been motivated to make this combination for the purpose of reducing I/O and improving performance (¶ [0115] states “The write and read operation of the reduce phase can introduce a large amount of I/O into the processing operation, which can significantly reduce the performance of the reduce operation. To resolve this issue, embodiments described herein provide a system and method to eliminate the read and write operation in the reduce phase by merging the read and write operations into the barrier function”).
With regard to claim 12, Valerio and Parle teach the one or more non-transitory computer-readable storage mediums of claim 9. Valerio additionally teaches wherein synchronizing the broadcast of the data element further includes: the one or more consumer cores providing notice that the one or more consumer cores are ready to receive the data element (¶ [0250] states “Subsequently, producer and consumer threads use the same named barrier to signal to (or wait for) wait for each other. In one embodiment, a producer thread first signals the availability of a resource using the named barrier, while the consumer thread waits for the signal from producer”. Examiner’s Note: when the consumer thread has sent the wait signal and is waiting, it means that the consumer thread is ready to receive data. In other words, the wait indicates the thread is at the barrier and the thread is synchronized with other threads at the barrier);
Parle additionally teaches and upon confirming readiness of the one or more consumer cores to receive the data element, the producer core to commence broadcast of the data element (¶ [0053] states “SM 204 generates a multicast request packet 224”. ¶ [0082] states “This request first reads global memory, and then it delivers the data that was read to multiple target CTAs (possibly including the same CTA that initiated/executed the request) as specified in the instruction. The data is read from L2 once, returned to the LRC, and then multicast to all the target CTAs simultaneously on the response crossbar”).
Valerio and Parle do not explicitly teach the confirming readiness of the one or more consumer cores.
However, in an analogous art, Jiang teaches and upon confirming readiness of the one or more consumer cores to receive the data element, the producer core to commence broadcast of the data element (¶ [0132] states “Once all threads have sent the barrier instruction, a write-back message is broadcast to all threads to indicate that the barrier operation is complete for all requesting threads”. Examiner’s Note: checking if all threads have sent the barrier instruction is the confirming of readiness of the threads to receive data. Since all the threads have reached the barrier, when the barrier is released all the threads will execute synchronously. This means that they can all receive data, so it means the broadcast can begin).
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the check for all threads sending the barrier instruction of Jiang with the consumer cores providing notice that they are ready to receive data and the broadcasting of data of Valerio and Parle. A person having ordinary skill in the art would have been motivated to make this combination so that the apparatus can “synchronize the set of barrier requester threads in the thread work group” (¶ [0169]). Synchronization is a necessary step before performing certain operations (¶ [0144] states “the operation is not performed until all threads are synchronized”). These operations are part of the read and write instructions that are merged into the barrier operation to realize improvements such as reduced I/O and improved performance (¶ [0115] states “The write and read operation of the reduce phase can introduce a large amount of I/O into the processing operation, which can significantly reduce the performance of the reduce operation. To resolve this issue, embodiments described herein provide a system and method to eliminate the read and write operation in the reduce phase by merging the read and write operations into the barrier function”).
With regard to claim 17, Valerio and Parle teach the method of claim 16. Valerio and Parle do not explicitly teach the second broadcast instruction including a count of local threads and a total count of threads to receive the data element. However, in an analogous art, Jiang teaches wherein the second broadcast instruction for each core of the one or more other cores further includes: a count of local threads of the core to receive the data element; and a total count of threads to receive the data element (¶ [0132] states “Once all threads have sent the barrier instruction, a write-back message is broadcast to all threads to indicate that the barrier operation is complete for all requesting threads”. ¶ [0134] states “In one embodiment, for each barrier identifier, a barrier counter for the identifier and reduction state data can be stored within the reduction state registers. In one embodiment the registers are configured to support up to an 8-bit barrier counter to support up to 256 threads in a thread group”. See TABLE-US-00003 which includes a field called “barrier count”. ¶ [0170] states “enable synchronization between the multiple threads via a merged write, barrier, and read operation”. Examiner’s Note: the barrier count corresponds to the number of local threads to receive data and/or to the total number of consumer threads to receive the element).
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the barrier message payload including a barrier counter of Jiang with the synchronized multicast of data of Valerio and Parle. Additionally, the barrier message payload shows that multiple fields of data, including various counts, can be included in a barrier message. A person having ordinary skill in the art would have been motivated to make this combination for the purpose of reducing I/O and improving performance (¶ [0115] states “The write and read operation of the reduce phase can introduce a large amount of I/O into the processing operation, which can significantly reduce the performance of the reduce operation. To resolve this issue, embodiments described herein provide a system and method to eliminate the read and write operation in the reduce phase by merging the read and write operations into the barrier function”).
With regard to claim 18, Valerio and Parle teach the method of claim 15. Valerio additionally teaches wherein synchronizing the broadcast of the data element further includes: the one or more other cores providing notice that the one or more other cores are ready to receive the data element (¶ [0250] states “Subsequently, producer and consumer threads use the same named barrier to signal to (or wait for) wait for each other. In one embodiment, a producer thread first signals the availability of a resource using the named barrier, while the consumer thread waits for the signal from producer”. Examiner’s Note: when the consumer thread has sent the wait signal and is waiting, it means that the consumer thread is ready to receive data. In other words, the wait indicates the thread is at the barrier and the thread is synchronized with other threads at the barrier. The cores that are to receive data correspond to the consumer threads);
Parle additionally teaches and upon confirming readiness of the one or more other cores to receive the data element, the first core to commence broadcast of the data element (¶ [0053] states “SM 204 generates a multicast request packet 224”. ¶ [0082] states “This request first reads global memory, and then it delivers the data that was read to multiple target CTAs (possibly including the same CTA that initiated/executed the request) as specified in the instruction. The data is read from L2 once, returned to the LRC, and then multicast to all the target CTAs simultaneously on the response crossbar”. Examiner’s Note: the producer core is the first core).
Valerio and Parle do not explicitly teach confirming the readiness of the one or more consumer cores.
However, in an analogous art, Jiang teaches and upon confirming readiness of the one or more other cores to receive the data element, the first core to commence broadcast of the data element (¶ [0132] states “Once all threads have sent the barrier instruction, a write-back message is broadcast to all threads to indicate that the barrier operation is complete for all requesting threads”. Examiner’s Note: checking whether all threads have sent the barrier instruction is the confirming of readiness of the threads to receive data. Since all the threads have reached the barrier, when the barrier is released all the threads will execute synchronously. This means that they can all receive data, so the broadcast can begin. The consumer threads correspond to the “one or more other cores” that are to receive the data element).
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine Jiang’s check that all threads have sent the barrier instruction with the consumer cores providing notice that they are ready to receive data and the broadcasting of data of Valerio and Parle. A person having ordinary skill in the art would have been motivated to make this combination so that the apparatus can “synchronize the set of barrier requester threads in the thread work group” (¶ [0169]). Synchronization is a necessary step before performing certain operations (¶ [0144] states “the operation is not performed until all threads are synchronized”). These operations are part of the read and write instructions that are merged into the barrier operation to realize improvements such as reduced I/O and improved performance (¶ [0115] states “The write and read operation of the reduce phase can introduce a large amount of I/O into the processing operation, which can significantly reduce the performance of the reduce operation. To resolve this issue, embodiments described herein provide a system and method to eliminate the read and write operation in the reduce phase by merging the read and write operations into the barrier function”).
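For illustration only, the producer/consumer interaction discussed for claim 18 (consumers signal readiness at a barrier, and the producer commences the broadcast only after readiness of every consumer is confirmed) can be sketched in Python's threading API. This is a hypothetical sketch by the examiner under that reading, not code from any cited reference; the names are illustrative assumptions.

```python
import threading
import queue

NUM_CONSUMERS = 3
# One barrier slot per consumer plus one for the producer: the producer's
# wait() returns only once every consumer has also arrived (readiness notice).
ready_barrier = threading.Barrier(NUM_CONSUMERS + 1)
inboxes = [queue.Queue() for _ in range(NUM_CONSUMERS)]
received = []

def consumer(idx):
    ready_barrier.wait()                 # notice: "ready to receive"
    received.append(inboxes[idx].get())  # block until the broadcast arrives

def producer(data):
    ready_barrier.wait()       # returns only after all consumers are ready
    for inbox in inboxes:      # readiness confirmed: commence the broadcast
        inbox.put(data)

threads = [threading.Thread(target=consumer, args=(i,))
           for i in range(NUM_CONSUMERS)]
threads.append(threading.Thread(target=producer, args=("element",)))
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Here the shared barrier plays the role of Valerio's named barrier (consumers waiting at it constitute the readiness notice), and the producer's loop over inboxes plays the role of the multicast delivery in Parle.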
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Valerio in view of Parle and further in view of Palmer et al. Pub. No. US 20230289215 A1 (hereafter Palmer).
With regard to claim 7, Valerio and Parle teach the apparatus of claim 1. Additionally, Valerio teaches wherein the gateway circuitry of each core of a cluster of cores is interconnected with the gateway circuitry of one or more neighboring cores in the cluster (¶ [0244] states “each sub-slice 2005 includes processing resources such as execution units (EUs) 2110, a shared local memory (SLM) 2020 and a gateway 2050”).
Valerio and Parle do not explicitly teach that the gateway circuitry is interconnected with other gateway circuitries.
However, in an analogous art, Palmer teaches wherein the gateway circuitry of each core of a cluster of cores is interconnected with the gateway circuitry of one or more neighboring cores in the cluster (¶ [0219] states “distributed shared memory DSMEM is made up of at least a portion of memory local to/within each SM, and hardware interconnections and other functionality that allow SMs to access each others' local memory”. ¶ [0266] states “the SM can look up the SM_id of that other SM that is running that CTA so it can communicate across the interconnect with that other SM”. See FIG. 16A, 19, and 5A. Examiner’s Note: in light of the relative terminology issue raised in the 112(b) claim rejections, the term “neighboring” is interpreted to mean any other core within the cluster. FIG. 16A and 19 show the interconnect between multiple SMs, or cores. FIG. 5A shows multiple SMs in a graphics processing cluster).
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the interconnections between SMs of Palmer with the gateway circuitry of Valerio, resulting in a system where the gateways of the cores are also interconnected. A person having ordinary skill in the art would have been motivated to make this combination because of the “Direct SM-to-SM communication for lower latency data sharing and improved synchronization between producer and consumer threads in the CGA” (Palmer ¶ [0118]). The interconnects between SMs are what allow the direct SM-to-SM communication. Because a gateway is incorporated into every SM, the gateways of the SMs will also be interconnected. Since gateways help facilitate synchronization (Valerio ¶ [0250]), it would be obvious for gateways to be interconnected to further improve synchronization.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20200402198 A1 teaches SHARED LOCAL MEMORY READ MERGE AND MULTICAST RETURN.
US 20230370304 A1 teaches PROGRAMMABLE MULTICAST PROTOCOL FOR RING-TOPOLOGY BASED ARTIFICIAL INTELLIGENCE SYSTEMS.
US 20230315655 A1 teaches FAST DATA SYNCHRONIZATION IN PROCESSORS AND MEMORY.
US 20230289189 A1 teaches Distributed Shared Memory.
US 20230289242 A1 teaches HARDWARE ACCELERATED SYNCHRONIZATION WITH ASYNCHRONOUS TRANSACTION SUPPORT.
US 20230229599 A1 teaches MULTICAST AND REFLECTIVE MEMORY BEHAVIOR FOR MEMORY MODEL CONSISTENCY.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PETER LI YUAN whose telephone number is (571)272-5737. The examiner can normally be reached Mon-Fri 7:30am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bradley Teets can be reached at 571-272-3338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PETER LI YUAN/Examiner, Art Unit 2197
/BRADLEY A TEETS/Supervisory Patent Examiner, Art Unit 2197