Prosecution Insights
Last updated: April 19, 2026
Application No. 18/148,997

DATA MULTICAST IN COMPUTE CORE CLUSTERS

Non-Final OA: §101, §103
Filed: Dec 30, 2022
Examiner: YUAN, PETER LI
Art Unit: 2197
Tech Center: 2100 — Computer Architecture & Software
Assignee: Intel Corporation
OA Round: 1 (Non-Final)
Grant Probability: Favorable
OA Rounds: 1-2
To Grant: 3y 3m

Examiner Intelligence

Career Allow Rate: 0% (0 granted / 0 resolved; -55.0% vs TC avg)
Interview Lift: +0.0% (minimal; based on resolved cases with interview)
Avg Prosecution (typical timeline): 3y 3m
Career history: 10 total applications across all art units; 10 currently pending

Statute-Specific Performance

§101: 29.0% (-11.0% vs TC avg)
§103: 52.6% (+12.6% vs TC avg)
§112: 13.2% (-26.8% vs TC avg)
Tech Center average is an estimate • Based on career data from 0 resolved cases

Office Action

§101, §103
DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. The Office Action is in response to claims filed 12/30/2022. Claims 1-20 are pending.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention recites a judicial exception (an abstract idea) that has not been integrated into a practical application, and the claims do not recite significantly more than the judicial exception. Examiner has evaluated the claims under the framework provided in the 2019 Patent Eligibility Guidance published in the Federal Register on 01/07/2019 and has provided that analysis below.

Step 1: Claims 1-8 are directed to an apparatus and fall within the statutory class of machines. Claims 9-14 are directed to a computer-readable storage medium and fall within the statutory class of articles of manufacture. Claims 15-20 are directed to a method and fall within the statutory class of processes. Therefore, “Are the claims to a process, machine, manufacture or composition of matter?” Yes.

Step 2A Prong 1: Claims 1, 9, 15: The limitation “determine whether any additional cores in the first cluster require the data element”, as drafted, is a limitation that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. The determining step is a mental process that requires observing the additional cores and then forming a judgement as to whether they require the data element or not. The recited action is understood to be performed by a processor, but it can also be performed entirely in the mind. Therefore, yes, claims 1, 9, and 15 recite a judicial exception. Step 2A Prong 2 will evaluate whether the claims are directed to that judicial exception.

Step 2A Prong 2: Claims 1, 9, and 15: The judicial exception is not integrated into a practical application. Claim 1 recites “one or more processors including at least a first processor, the first processor including one or more clusters of cores and a memory”. Claim 9 also recites “one or more non-transitory computer-readable storage mediums having stored thereon executable computer program instructions”. These additional elements are recitations of generic computing components and functions merely being used as a tool to apply the abstract idea (MPEP § 2106.05(f)). Claims 1, 9, and 15 also recite “wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared memory, and broadcast circuitry”. This additional element is considered field of use/technological environment (MPEP § 2106.05(h)) because it specifies the computer hardware environment. Claims 1, 9, and 15 additionally recite “wherein a first core in a first cluster of cores is to: request a data element” and “upon determining that one or more additional cores in the first cluster require the data element, broadcast the data element to the one or more additional cores via interconnects between the broadcast circuitry of the cores of the first core cluster”. These additional elements are recitations of insignificant extra-solution activity of mere data gathering and transmission (MPEP § 2106.05(g)).
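To make the quoted limitations concrete, the recited flow (a first core requests a data element, determines which other cores in its cluster need it, and broadcasts it over inter-core interconnects) can be sketched as a minimal software model. All names here are hypothetical and the model is illustrative only; the claims recite hardware, and nothing below is from the application text.

```python
# Minimal sketch of the claimed per-cluster multicast flow.
# All identifiers are hypothetical; this models hardware behavior in software.

from dataclasses import dataclass, field

@dataclass
class Core:
    core_id: int
    pending_addrs: set = field(default_factory=set)  # addresses this core's threads need
    received: dict = field(default_factory=dict)     # addr -> data delivered to shared memory

@dataclass
class Cluster:
    cores: list

    def request(self, requester: Core, addr: int, fetch) -> None:
        data = fetch(addr)                 # first core requests the data element
        requester.received[addr] = data
        # "determine whether any additional cores in the first cluster
        # require the data element" -- the step the OA reads as mental
        needy = [c for c in self.cores
                 if c is not requester and addr in c.pending_addrs]
        for core in needy:                 # broadcast via modeled inter-core interconnects
            core.received[addr] = data
            core.pending_addrs.discard(addr)

# usage: core 1 also needs address 0x40, so core 0's fetch is broadcast to it
cluster = Cluster(cores=[Core(0), Core(1, pending_addrs={0x40}), Core(2)])
cluster.request(cluster.cores[0], 0x40, fetch=lambda a: f"line@{hex(a)}")
assert cluster.cores[1].received[0x40] == "line@0x40"
```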
The recited additional elements in claims 1, 9, and 15 do not integrate the judicial exception into a practical application. Therefore, “Do the claims recite additional elements that integrate the judicial exception into a practical application?” No, these additional elements do not integrate the abstract idea into a practical application and they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea. After having evaluated the inquiries set forth in Step 2A Prongs 1 and 2, it has been concluded that claims 1, 9, and 15 not only recite a judicial exception but are directed to the judicial exception, as the judicial exception has not been integrated into a practical application.

Step 2B: Claims 1, 9, 15: The claims do not include additional elements, alone or in combination, that are sufficient to amount to significantly more than the judicial exception. As discussed above, the additional elements only amount to generic computing components being used as a tool to apply the abstract idea, field of use/technological environment, and insignificant extra-solution activity of mere data gathering and transmission. When reevaluating the insignificant extra-solution activity for an inventive concept, the additional elements do not add an inventive concept that is other than what is well understood, routine, and conventional in the field. MPEP § 2106.05(d)(II) lists “Receiving or transmitting data over a network” as a well understood, routine, and conventional computer function. The data gathering/transmission steps of requesting a data element and broadcasting data to cores involve transmitting and receiving data over a network. When reevaluating the other additional elements alone or in combination, the additional elements do not add an inventive concept. Therefore, “Do the claims recite additional elements that amount to significantly more than the judicial exception?” No, these additional elements, alone or in combination, do not amount to significantly more than the judicial exception. Having concluded the analysis within the provided framework, claims 1, 9, and 15 do not recite eligible subject matter under 35 U.S.C. § 101.

With regard to claims 2, 10, and 16, they recite “wherein the first core is further to: direct the data element to one or more threads run by the one or more processing resources of the first cores”. This additional element is an insignificant extra-solution activity of mere data gathering/transmission (MPEP § 2106.05(g)). It does not integrate the judicial exception into a practical application, so the claims fail Step 2A Prong 2. When reevaluating the additional element for an inventive concept that is significantly more, the claims do not add an inventive concept that is other than what is well understood, routine, and conventional in the field. MPEP § 2106.05(d)(II) lists “Receiving or transmitting data over a network” as a well understood, routine, and conventional computer function. Directing data to threads is considered a transfer of data, so the claims fail Step 2B. Therefore, claims 2, 10, and 16 do not recite patent eligible subject matter under 35 U.S.C. § 101.

With regard to claims 3, 11, and 17, they recite “wherein the first core is further to: determine a routing for transmission of the data element to the one or more additional cores”. This limitation is considered to be a mental process because it involves an observation and judgement. Therefore, the claims are directed to a mental process and fail Step 2A Prong 1. The claims do not include any additional elements that integrate the judicial exception into a practical application, so the claims fail Step 2A Prong 2. Additionally, the claims do not include any additional elements that add an inventive concept that is significantly more, so the claims fail Step 2B. Therefore, claims 3, 11, and 17 do not recite patent eligible subject matter under 35 U.S.C. § 101.

With regard to claims 4, 12, and 18, they recite “a second core is to: upon determining that the data element is required by the second core” and “upon determining that the data element to be routed to another core”. These limitations are considered a mental process because they involve an observation and judgement. Therefore, the claims are directed to a mental process and fail Step 2A Prong 1. Additionally, the claims recite “upon receiving the data element from the first core”, “direct the data element to one or more threads of the second core”, and “broadcast the data element from the broadcast circuitry of the second core to broadcast circuitry of a third core”. These additional elements are considered to be insignificant extra-solution activity of mere data gathering/transmission (MPEP § 2106.05(g)). The additional elements do not integrate the judicial exception into a practical application, so the claims fail Step 2A Prong 2. When reevaluating the additional elements for an inventive concept that is significantly more, the claims do not add an inventive concept that is other than what is well understood, routine, and conventional in the field. MPEP § 2106.05(d)(II) lists “Receiving or transmitting data over a network” as a well understood, routine, and conventional computer function. The additional elements are simply the transfer of data from one area to another, so the claims fail Step 2B. Therefore, claims 4, 12, and 18 do not recite patent eligible subject matter under 35 U.S.C. § 101.

With regard to claims 5, 13, and 19, they recite “wherein the shared memory includes a shared local memory (SLM) portion and an L1 cache portion”. This additional element is considered field of use/technological environment (MPEP § 2106.05(h)) because it specifies the hardware environment of the core. The claims also recite “the data element is directed to either the SLM portion or the L1 cache portion”. This additional element is an insignificant extra-solution activity of mere data gathering/transmission (MPEP § 2106.05(g)). These additional elements do not integrate the judicial exception into a practical application, so the claims fail Step 2A Prong 2. When reevaluating the insignificant extra-solution activity for an inventive concept that is significantly more, the claims do not add an inventive concept that is other than what is well understood, routine, and conventional in the field. MPEP § 2106.05(d)(II) lists “Receiving or transmitting data over a network” as a well understood, routine, and conventional computer function. Directing data to different parts of memory is a transfer of data, so the claims fail Step 2B. Therefore, claims 5, 13, and 19 do not recite patent eligible subject matter under 35 U.S.C. § 101.

With regard to claims 6, 14, and 20, they recite “wherein the data element is directed to the SLM portion” and “wherein the cores of the first cluster of cores provide synchronization for the broadcast of the data element”.
These additional elements are insignificant extra-solution activities of mere data gathering/transmission (MPEP § 2106.05(g)). The additional elements do not integrate the judicial exception into a practical application, so the claims fail Step 2A Prong 2. When reevaluating the additional elements for an inventive concept that is significantly more, the claims do not add an inventive concept that is other than what is well understood, routine, and conventional in the field. MPEP § 2106.05(d)(II) lists “Receiving or transmitting data over a network” as a well understood, routine, and conventional computer function. Directing data to memory is a transfer of data, and synchronization involves a transfer of data between at least two areas. The claims fail Step 2B. Therefore, claims 6, 14, and 20 do not recite patent eligible subject matter under 35 U.S.C. § 101.

With regard to claim 7, it recites “wherein each core of the first core cluster includes gateway circuitry”. This additional element is considered field of use/technological environment (MPEP § 2106.05(h)) because it further specifies the hardware environment of the core. Claim 7 also recites “the gateway circuitry of the cores providing synchronization for the broadcast of the data element”. This additional element is an insignificant extra-solution activity of mere data gathering/transmission (MPEP § 2106.05(g)). The additional elements do not integrate the judicial exception into a practical application, so the claim fails Step 2A Prong 2. When reevaluating the additional elements for an inventive concept that is significantly more, the claim does not add an inventive concept that is other than what is well understood, routine, and conventional in the field. MPEP § 2106.05(d)(II) lists “Receiving or transmitting data over a network” as a well understood, routine, and conventional computer function. Synchronization involves a transfer of data between at least two areas. The claim fails Step 2B. Therefore, claim 7 does not recite patent eligible subject matter under 35 U.S.C. § 101.

With regard to claim 8, it recites “wherein the first processor is a graphics processor”. This additional element is considered field of use/technological environment (MPEP § 2106.05(h)) because it further specifies that the first processor is a graphics processor. This additional element does not integrate the judicial exception into a practical application, so the claim fails Step 2A Prong 2. The claim does not include any additional elements that, when reevaluated, amount to an inventive concept that is significantly more, so the claim fails Step 2B. Therefore, claim 8 does not recite patent eligible subject matter under 35 U.S.C. § 101.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 4-10, 12-16, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Valerio et al., US 20210263785 A1 (hereafter Valerio), in view of Parle et al., US 20230289190 A1 (hereafter Parle), and further in view of Ray et al., US 20200402198 A1 (hereafter Ray).

Regarding claim 1, Valerio teaches an apparatus comprising: one or more processors including at least a first processor, the first processor including one or more clusters of cores and a memory (¶ [0225] states “computing device 1900 may include any number and type of hardware and/or software components, such as (without limitation) GPU 1914”. ¶ [0243] states “GPU 1914 is divided into slices, where each slice includes a plurality of slices”. See FIG. 3B and FIG. 20. Examiner’s Note: in FIG. 20, the slice 2000 is the cluster of cores. Sub-Slices 2005A-C are the cores. In FIG. 3B, memory 326A-D is within the GPU); wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared memory, and broadcast circuitry (¶ [0244] states “each sub-slice 2005 includes processing resources such as execution units (EUs) 2110, a shared local memory (SLM) 2020 and a gateway 2050”. See FIG. 21. Examiner’s Note: the sub-slice is the core).

Valerio does not explicitly teach a first core to request data, determine if other cores need data, and broadcast data. However, in an analogous art, Parle teaches wherein a first core in a first cluster of cores is to: request a data element, determine whether any additional cores in the first cluster require the data element (¶ [0053] states “SM 204 generates a multicast request packet 224”. ¶ [0056] states “The multicast request packet 224 is transmitted on the request crossbar 208 to the L2 Request Coalescer (LRC) 212”. ¶ [0057] states “At the LRC 212, multicast-specific information that was included in the multicast request packet 224 is saved in a tracking structure 222, and an L2 request packet 226 is generated”. Examiner’s Note: the multicast request packet is a request for data from the L2 cache), and upon determining that one or more additional cores in the first cluster require the data element, broadcast the data element to the one or more additional cores via interconnects between the broadcast circuitry of the cores of the first core cluster (¶ [0060] states “The LRC multicast response packet 230 travels through the response crossbar 210 as a single packet through the crossbar path that is common to all receivers designated in the packet 230. The result data in the single packet 230 is duplicated into two or more packets as determined to be necessary based upon the list of receivers at one or more points of separation. In the illustrated example, the result data carried in packet 230 is duplicated to two packets 232 and 234 for receiving SMs 204 and 206 as identified in the list of receivers”. Examiner’s Note: the response packet includes the data element being broadcast to one or more cores).

It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the first core requesting data and then broadcasting of Parle with the GPU core cluster environment of Valerio. A person having ordinary skill in the art would have been motivated to make this combination “to reduce the bandwidth and power required to move the same amount of data and better scale” (¶ [0047]). Specifically, it is to improve L2 bandwidth (¶ [0040] states “L2 cache to SM bandwidth (referred to also as ‘L2 bandwidth’) improvements”).
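The fetch-once, fan-out behavior the cited Parle passages describe (a tracking structure records the receiver list, a single L2 fetch is made, and the response is duplicated once per listed receiver) can be sketched as follows. This is a minimal sketch under those assumptions; the class and field names are illustrative, not Parle's.

```python
# Sketch of LRC-style request coalescing: one L2 access, many deliveries.
# Names are assumptions; "tracking" loosely mirrors tracking structure 222.

class L2RequestCoalescer:
    def __init__(self, l2_fetch):
        self.l2_fetch = l2_fetch
        self.tracking = {}  # addr -> list of receiver ids awaiting the data

    def multicast_request(self, addr, receivers):
        # multicast-specific info (the receiver list) is saved before
        # the single L2 request is issued
        self.tracking.setdefault(addr, []).extend(receivers)

    def service(self, addr):
        data = self.l2_fetch(addr)           # one L2 access for the line
        receivers = self.tracking.pop(addr)  # response duplicated per receiver list
        return {sm: data for sm in receivers}

# usage: two SMs share one fetch
lrc = L2RequestCoalescer(l2_fetch=lambda a: f"data@{hex(a)}")
lrc.multicast_request(0x80, receivers=["SM204", "SM206"])
print(lrc.service(0x80))  # one fetch, two deliveries
```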
One of ordinary skill in the art would recognize the efficiency improvement of fetching L2 data once and then broadcasting the result, as opposed to fetching the same L2 data multiple times.

Valerio and Parle do not explicitly teach broadcast circuitry and a determination of whether another core needs a data element. However, in an analogous art, Ray teaches wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared memory, and broadcast circuitry (¶ [0202] states “a memory arbiter 1700 having multicast support for SLM data” and “The memory arbiter 1700 can represent either of the memory arbiters 1404A-1404B of FIG. 14 and provides a memory interface between the compute units 1402A-1402B and the SLM 1406”. ¶ [0188] states “The multiple groups of compute units 1402A-1402B can include but are not limited to a first group of compute units 1402A and a second group of compute units 1402A. The groups of compute units can be, for example, sub-slices of execution units, compute unit clusters, streaming multiprocessors,”. See FIG. 14 and 17. Examiner’s Note: in FIG. 14, there is one memory arbiter per group of compute units. The group of compute units combined with the memory arbiter is interpreted to be a core or streaming multiprocessor. The memory arbiter is the broadcast circuitry) and wherein a first core in a first cluster of cores is to: request a data element, determine whether any additional cores in the first cluster require the data element (¶ [0194] states “when a read request arrives at the SLM and the scoreboard is not full, a scoreboard entry (e.g., 1530, 1540, 1550) is allocated and the request is added to the scoreboard. When a new request is received the new request is compared with the existing scoreboard entries. If the address range of the new entry matches the same address range of an outstanding block read, the new read can be merged with the outstanding read”. Examiner’s Note: the core makes a memory request that specifies an address range. Other cores in the cluster also request data. The scoreboard keeps track of whether any request targets a memory address range that has been previously requested by a different core. When the scoreboard checks whether a same memory address has been previously requested, this is the determination of whether any additional cores require the same data element), and upon determining that one or more additional cores in the first cluster require the data element, broadcast the data element to the one or more additional cores via interconnects between the broadcast circuitry of the cores of the first core cluster (¶ [0199] states “control logic can perform operation 1605 to determine if the first read request and the second read requests are to the same address block”. ¶ [0207] states “operation 1806, which performs a multicast transmission of read return data for the merged read to each compute unit and/or thread”. ¶ [0196] states “If the multiple memory arbiters share a return bus, a single read return can be broadcast over the shared bus and received by the multiple memory arbiters”).

It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the scoreboard for determining whether multiple cores request the same memory, and the request merging and broadcasting, of Ray with the broadcasting process of Parle and the GPU cluster and core environment of Valerio.
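The scoreboard-based read merging the Ray citations describe (a new read matching an outstanding block read is merged rather than issued, and the single read return is multicast to every merged requester) can be sketched as follows; the structure and names here are assumptions, not Ray's.

```python
# Sketch of scoreboard read-merging: duplicate reads to the same address
# range are merged, and one read return is multicast to all requesters.

class Scoreboard:
    def __init__(self):
        self.outstanding = {}  # addr_range -> set of cores awaiting that block

    def read_request(self, core_id, addr_range):
        if addr_range in self.outstanding:
            # new request matches an outstanding block read: merge it
            self.outstanding[addr_range].add(core_id)
            return "merged"
        self.outstanding[addr_range] = {core_id}
        return "issued"  # only the first request actually goes to memory

    def read_return(self, addr_range, data):
        # multicast the single read return to every merged requester
        requesters = self.outstanding.pop(addr_range)
        return {core: data for core in requesters}

# usage: core1's read is merged with core0's outstanding read
sb = Scoreboard()
sb.read_request("core0", (0x100, 0x140))         # issued to memory
sb.read_request("core1", (0x100, 0x140))         # merged, no second access
print(sb.read_return((0x100, 0x140), b"block"))  # delivered to both cores
```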
A person having ordinary skill in the art would have been motivated to make this combination to “reduce the access cost and data transfer cost associated with the multiple requests. With no change in application level software, the optimizations described herein result in higher efficiency operation of the computation units, as well as reducing operational power requirements, resulting in higher performance at the same or lower power levels” (¶ [0025]).

With regard to claim 2, Valerio, Parle, and Ray teach the apparatus of claim 1. Additionally, Ray teaches wherein the first core is further to: direct the data element to one or more threads run by the one or more processing resources of the first core (¶ [0196] states “the SLM can send a single return message to the memory arbiter that includes a bit-mask that identifies all of the merged requests from a compute unit and/or threads executed by the compute unit. The memory arbiter can then transmit a read return for the merged read requests to the compute units and/or threads identified by the merge mask”. Examiner’s Note: the compute units are the processing resources. Threads execute on compute units).

With regard to claim 4, Valerio, Parle, and Ray teach the apparatus of claim 1. Parle additionally teaches wherein, upon receiving the data element from the first core, a second core is to: upon determining that the data element is required by the second core, direct the data element to one or more threads of the second core (¶ [0053] states “SM 204 generates a multicast request packet 224” and “The SM 204 may be referred to as the ‘multicast source SM’ or ‘multicast requesting SM’ because the thread that generates the multicast request packet is on SM 204”. ¶ [0107] states “The response data (requested data) included in the multicast response packet is ultimately received at each of the receiving SMs”. Examiner’s Note: the multicast source SM requests data. The receiving SMs all receive the data). Ray additionally teaches wherein, upon receiving the data element from the first core, a second core is to: upon determining that the data element is required by the second core, direct the data element to one or more threads of the second core (¶ [0194] states “When a new request is received the new request is compared with the existing scoreboard entries. If the address range of the new entry matches the same address range of an outstanding block read, the new read can be merged with the outstanding read”. ¶ [0207] states “operation 1806, which performs a multicast transmission of read return data for the merged read to each compute unit and/or thread indicated to be associated with the merged read by the read return metadata”. ¶ [0196] states “the SLM can send a single return message to the memory arbiter that includes a bit-mask that identifies all of the merged requests from a compute unit and/or threads executed by the compute unit. The memory arbiter can then transmit a read return for the merged read requests to the compute units and/or threads identified by the merge mask”. Examiner’s Note: the second core receives data from the first core because the second core requested the data. The requesting of data by the second core is the determining that the second core requires the data. When each core’s memory arbiter receives data, it passes the requested data to the requesting threads at the core); and upon determining that the data element is to be routed to another core, broadcast the data element from the broadcast circuitry of the second core to broadcast circuitry of a third core (¶ [0188] states “The multiple groups of compute units 1402A-1402B can include but are not limited to a first group of compute units 1402A and a second group of compute units”. ¶ [0189] states “matrix operations can be performed in a systolic manner, in which operations for elements of a matrix are distributed across the groups of compute units to perform operations such as fused matrix multiply and accumulate or systolic dot products. During systolic operations, a systolic broadcast 1403 of intermediate data can be performed between the compute units. In one embodiment, the intermediate data can be shared between compute units via the SLM”. ¶ [0190] states “each group of compute units can receive data from the SLM 1406 via a set of memory arbiters 1404A-1404B”. See FIG. 14. Examiner’s Note: the groups of compute units and their respective memory arbiters form cores. The environment supports a plurality of cores. The memory arbiter is the broadcast circuitry and sits between the groups of compute units and the SLM, as shown in FIG. 14. During systolic operations, the determination to broadcast data to another core is made by the systolic algorithm being performed. The broadcast data goes from the source memory arbiter to the SLM and then to the destination memory arbiter).

With regard to claim 5, Valerio, Parle, and Ray teach the apparatus of claim 1. Parle additionally teaches wherein the shared memory includes a shared local memory (SLM) portion and an L1 cache portion (¶ [0161] states “As shown in FIG. 12, the SM 1140 includes … a shared memory/L1 cache 1270”. See FIG. 12), and the data element is directed to either the SLM portion or the L1 cache portion (¶ [0096], [0099], and [0103] state “‘Metadata’ transported from source SM to destination SMs may include, for example: Data SMEM (shared memory) Offset, and implementation specific result data processing parameters”. ¶ [0104] states “The shared memory offset represents the offset, from the shared memory base address for the SM, at which the result data should be written”. ¶ [0159] states “Data from the L2 cache 1160 may be fetched and stored in each of the L1 caches”).

With regard to claim 6, Valerio, Parle, and Ray teach the apparatus of claim 5. Valerio additionally teaches wherein the cores of the first cluster of cores provide synchronization for the broadcast of the data element (¶ [0244] states “each sub-slice 2005 includes processing resources such as execution units (EUs) 2110, a shared local memory (SLM) 2020 and a gateway 2050”. ¶ [0250] states “barrier synchronization mechanism 2130 causes each thread of a named barrier to open and acquire (or assigned) a handle, which enables gateway 2150 to register physical thread identifiers (or IDs) as part of the named barrier”. See FIG. 20 and 21. Examiner’s Note: the gateways help facilitate barrier synchronization. The sub-slice is the core and the slice is the cluster). Parle additionally teaches wherein the data element is directed to the SLM portion (¶ [0096] and [0099] state “‘Metadata’ transported from source SM to destination SMs may include, for example: Data SMEM (shared memory) Offset”. ¶ [0104] states “The shared memory offset represents the offset, from the shared memory base address for the SM, at which the result data should be written”), and wherein the cores of the first cluster of cores provide synchronization for the broadcast of the data element (¶ [0062] states “a synchronization technique may be utilized by the sender SM 204 and receiver SMs in order to detect completion of the transaction or an errored transaction”. ¶ [0096], [0100], and [0102] state “Therefore, all metadata necessary to handle the responses must be sent with the packet. ‘Metadata’ transported from source SM to destination SMs may include, for example: Barrier address Offset, ACK phase ID (two possible phases, part of the MEMBAR protocol)”. ¶ [0104] states “The barrier address offset can be used by the receiver to identify the barrier”. ¶ [0105] states “The Ack Phase ID field is used by the receiving SM to correctly indicate the phase of the MEMBAR synchronization protocol to which an ack corresponds”. Examiner’s Note: the cores in the cluster use the barrier address offset and ack phase ID to synchronize the broadcast).

With regard to claim 7, Valerio, Parle, and Ray teach the apparatus of claim 6. Valerio additionally teaches wherein each core of the first core cluster includes gateway circuitry, the gateway circuitry of the cores providing synchronization for the broadcast of the data element (¶ [0244] states “each sub-slice 2005 includes processing resources such as execution units (EUs) 2110, a shared local memory (SLM) 2020 and a gateway 2050”. ¶ [0250] states “barrier synchronization mechanism 2130 causes each thread of a named barrier to open and acquire (or assigned) a handle, which enables gateway 2150 to register physical thread identifiers (or IDs) as part of the named barrier”. See FIG. 20 and 21. Examiner’s Note: the gateway helps facilitate barrier synchronization. The sub-slice is the core and the slice is the cluster). Parle additionally teaches wherein each core of the first core cluster includes gateway circuitry, the gateway circuitry of the cores providing synchronization for the broadcast of the data element (¶ [0062] states “a synchronization technique may be utilized by the sender SM 204 and receiver SMs in order to detect completion of the transaction or an errored transaction”. ¶ [0096], [0100], and [0102] state “Therefore, all metadata necessary to handle the responses must be sent with the packet. ‘Metadata’ transported from source SM to destination SMs may include, for example: Barrier address Offset, ACK phase ID (two possible phases, part of the MEMBAR protocol)”. ¶ [0104] states “The barrier address offset can be used by the receiver to identify the barrier”. ¶ [0105] states “The Ack Phase ID field is used by the receiving SM to correctly indicate the phase of the MEMBAR synchronization protocol to which an ack corresponds”. Examiner’s Note: the cores in the cluster use the barrier address offset and ack phase ID to synchronize the broadcast). It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the synchronization technique of Parle with the gateway circuitry in each core of Valerio.
Valerio ¶ [0252] states “a gateway counter associated with the named barrier is incremented when a thread transmits a signal to a named barrier” and Parle ¶ [0119] states “If the MEMBAR waits for all Acks corresponding to current phase to be received and the MEMBAR will not clear until such condition, the next phase is used to handle subsequent multicast operations”. The gateway of Valerio could implement the MEMBAR operation of Parle by using a gateway counter. A person having ordinary skill in the art would have been motivated to make this combination because “a synchronization technique may be utilized by the sender SM 204 and receiver SMs in order to detect completion of the transaction or an errored transaction” (¶ [0062]). One of ordinary skill in the art would recognize that detecting completion or errored transactions is beneficial because it can be used in determining whether to move on to the next instruction, to rebroadcast, or to perform some other suitable action.

With regard to claim 8, Valerio, Parle, and Ray teach the apparatus of claim 1. Valerio additionally teaches wherein the first processor is a graphics processor (¶ [0225] states “computing device 1900 may include any number and type of hardware and/or software components, such as (without limitation) GPU 1914”. See FIG. 19).

With regard to claim 9, Valerio teaches one or more non-transitory computer-readable storage mediums having stored thereon executable computer program instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising (¶ [0241] states “Embodiments may be provided, for example, as a computer program product which may include one or more machine-readable media having stored thereon machine-executable instructions that, when executed by one or more machines such as a computer, network of computers, or other electronic devices, may result in the one or more machines carrying out operations in accordance with embodiments described herein”): requesting a data element by a first core in a first cluster of cores of a graphics processor, each core of the first cluster of cores including one or more processing resources, shared memory, and broadcast circuitry (¶ [0225] states “computing device 1900 may include any number and type of hardware and/or software components, such as (without limitation) GPU 1914”. ¶ [0243] states “GPU 1914 is divided into slices, where each slice includes a plurality of slices”. ¶ [0244] states “each sub-slice 2005 includes processing resources such as execution units (EUs) 2110, a shared local memory (SLM) 2020 and a gateway 2050”. See FIG. 20 and 21. Examiner’s Note: the slice is the cluster, and the sub-slice is the core).

Valerio does not explicitly teach a first core to request data, determine if other cores need data, and broadcast data. However, in an analogous art, Parle teaches requesting a data element by a first core in a first cluster of cores of a graphics processor, each core of the first cluster of cores including one or more processing resources, shared memory, and broadcast circuitry (¶ [0053] states “SM 204 generates a multicast request packet 224”. ¶ [0056] states “The multicast request packet 224 is transmitted on the request crossbar 208 to the L2 Request Coalescer (LRC) 212”. ¶ [0057] states “At the LRC 212, multicast-specific information that was included in the multicast request packet 224 is saved in a tracking structure 222, and an L2 request packet 226 is generated”. Examiner’s Note: the multicast request packet is a request for data from the L2 cache); and upon determining that one or more additional cores in the first cluster require the data element, broadcasting the data element to the one or more additional cores via interconnects between broadcast circuitry of the cores of the first core cluster (¶ [0060] states “The LRC multicast response packet 230 travels through the response crossbar 210 as a single packet through the crossbar path that is common to all receivers designated in the packet 230. The result data in the single packet 230 is duplicated into two or more packets as determined to be necessary based upon the list of receivers at one or more points of separation. In the illustrated example, the result data carried in packet 230 is duplicated to two packets 232 and 234 for receiving SMs 204 and 206 as identified in the list of receivers”. Examiner’s Note: the response packet includes the data element being broadcast to one or more cores). It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the first core requesting data and then broadcasting of Parle with the GPU core cluster environment of Valerio. A person having ordinary skill in the art would have been motivated to make this combination “to reduce the bandwidth and power required to move the same amount of data and better scale” (¶ [0047]). Specifically, it is to improve L2 bandwidth (¶ [0040] states “L2 cache to SM bandwidth (referred to also as ‘L2 bandwidth’) improvements”). One of ordinary skill in the art would recognize the efficiency improvement of fetching L2 data once and then broadcasting the result, as opposed to fetching the same L2 data multiple times.

Valerio and Parle do not explicitly teach broadcast circuitry and a determination of whether another core needs a data element. However, in an analogous art, Ray teaches requesting a data element by a first core in a first cluster of cores of a graphics processor, each core of the first cluster of cores including one or more processing resources, shared memory, and broadcast circuitry (¶ [0202] states “a memory arbiter 1700 having multicast support for SLM data” and “The memory arbiter 1700 can represent either of the memory arbiters 1404A-1404B of FIG. 14 and provides a memory interface between the compute units 1402A-1402B and the SLM 1406”. ¶ [0188] states “The multiple groups of compute units 1402A-1402B can include but are not limited to a first group of compute units 1402A and a second group of compute units 1402A. The groups of compute units can be, for example, sub-slices of execution units, compute unit clusters, streaming multiprocessors,”. See FIG. 14 and 17. Examiner’s Note: in FIG. 14, there is one memory arbiter per group of compute units. The group of compute units combined with the memory arbiter is interpreted to be a core or streaming multiprocessor. The memory arbiter is the broadcast circuitry); determining whether any additional cores in the first cluster require the data element (¶ [0194] states “when a read request arrives at the SLM and the scoreboard is not full, a scoreboard entry (e.g., 1530, 1540, 1550) is allocated and the request is added to the scoreboard. When a new request is received the new request is compared with the existing scoreboard entries. If the address range of the new entry matches the same address range of an outstanding block read, the new read can be merged with the outstanding read”. Examiner’s Note: the core makes a memory request that specifies an address range. Other cores in the cluster also request data. The scoreboard keeps track of whether any request targets a memory address range that has been previously requested by a different core. When the scoreboard checks whether a same memory address has been previously requested, this is the determination of whether any additional cores require the same data element); and upon determining that one or more additional cores in the first cluster require the data element, broadcasting the data element to the one or more additional cores via interconnects between broadcast circuitry of the cores of the first core cluster (¶ [0199] states “control logic can perform operation 1605 to determine if the first read request and the second read requests are to the same address block”. ¶ [0207] states “operation 1806, which performs a multicast transmission of read return data for the merged read to each compute unit and/or thread”. ¶ [0196] states “If the multiple memory arbiters share a return bus, a single read return can be broadcast over the shared bus and received by the multiple memory arbiters”). It would have been obvious to a person having ordinary skill in the art prior to the effective filing date to combine the scoreboard for determining whether multiple cores request the same memory, and the request merging and broadcasting, of Ray with the broadcasting process of Parle and the GPU cluster and core environment of Valerio. A person having ordinary skill in the art would have been motivated to make this combination to “reduce the access cost and data transfer cost associated with the multiple requests. With no change in application level software, the optimizations described herein result in higher efficiency operation of the computation units, as well as reducing operational power requirements, resulting in higher performance at the same or lower power levels” (¶ [0025]).

With regard to claim 10, Valerio, Parle, and Ray teach the one or more non-transitory computer-readable storage mediums of claim 9. Ray additionally teaches wherein the executable computer program instructions further include instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: directing the data element to one or more threads run by the one or more processing resources of the first core (¶ [0196] states “the SLM can send a single return message to the memory arbiter that includes a bit-mask that identifies all of the merged requests from a compute unit and/or threads executed by the compute unit. The memory arbiter can then transmit a read return for the merged read requests to the compute units and/or threads identified by the merge mask”. Examiner’s Note: the compute units are the processing resources. Threads execute on compute units).

With regard to claim 12, Valerio, Parle, and Ray teach the one or more non-transitory computer-readable storage mediums of claim 9. Parle additionally teaches wherein the executable computer program instructions further include instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: receiving the data element from the first core at a second core (¶ [0053] states “SM 204 generates a multicast request packet 224” and “The SM 204 may be referred to as the ‘multicast source SM’ or ‘multicast requesting SM’ because the thread that generates the multicast request packet is on SM 204”. ¶ [0107] states “The response data (requested data) included in the multicast response packet is ultimately received at each of the receiving SMs”. Examiner’s Note: the multicast source SM requests data. The receiving SMs all receive the data). Ray additionally teaches upon determining that the data element is required by the second core, directing the data element to one or more threads of the second core (¶ [0194] states “When a new request is received the new request is compared with the existing scoreboard entries. If the address range of the new entry matches the same address range of an outstanding block read, the new read can be merged with the outstanding read”. ¶ [0207] states “operation 1806, which performs a multicast transmission of read return data for the merged read to each compute unit and/or thread indicated to be associated with the merged read by the read return metadata”. ¶ [0196] states “the SLM can send a single return message to the memory arbiter that includes a bit-mask that identifies all of the merged requests from a compute unit and/or threads executed by the compute unit. The memory arbiter can then transmit a read return for the merged read requests to the compute units and/or threads identified by the merge mask”. Examiner’s Note: the second core receives data from the first core because the second core requested the data. The requesting of data by the second core is the determining that the second core requires the data. When each core’s memory arbiter receives data, it passes the requested data to the requesting threads at the core); and upon determining that the data element is to be routed to another core, broadcasting the data element from the broadcast circuitry of the second core to broadcast circuitry of a third core (¶ [0188] states “The multiple groups of compute units 1402A-1402B can include but are not limited to a first group of compute units 1402A and a second group of compute units”. ¶ [0189] states “matrix operations can be performed in a systolic manner, in which operations for elements of a matrix are distributed across the groups of compute units to perform operations such as fused matrix multiply and accumulate or systolic dot products. During systolic operations, a systolic broadcast 1403 of intermediate data can be performed between the compute units. In one embodiment, the intermediate data can be shared between compute units via the SLM”. ¶ [0190] states “each group of compute units can receive data from the SLM 1406 via a set of memory arbiters 1404A-1404B”. See FIG. 14. Examiner’s Note: the groups of compute units and their respective memory arbiters form cores. The environment supports a plurality of cores. The memory arbiter is the broadcast circuitry and sits between the groups of compute units and the SLM, as shown in FIG. 14. During systolic operations, the determination to broadcast data to another core is made by the systolic algorithm being performed. The broadcast data goes from the source memory arbiter to the SLM and then to the destination memory arbiter).

With regard to claim 13, Valerio, Parle, and Ray teach the one or more non-transitory computer-readable storage mediums of claim 9. Parle additionally teaches wherein the shared memory includes a shared local memory (SLM) portion and an L1 cache portion (¶ [0161] states “As shown in FIG. 12, the SM 1140 includes … a shared memory/L1 cache 1270”. See FIG. 12), and the data element is directed to either the SLM portion or the L1 cache portion (¶ [0096], [0099], and [0103] state “‘Metadata’ transported from source SM to destination SMs may include, for example: Data SMEM (shared memory) Offset, and implementation specific result data processing parameters”. ¶ [0104] states “The shared memory offset represents the offset, from the shared memory base address for the SM, at which the result data should be written”. ¶ [0159] states “Data from the L2 cache 1160 may be fetched and stored in each of the L1 caches”).

With regard to claim 14, Valerio, Parle, and Ray teach the one or more non-transitory computer-readable storage mediums of claim 13. Valerio additionally teaches wherein the executable computer program instructions further include instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: providing, by the cores of the first cluster of cores, synchronization for the broadcast of the data element (¶ [0244] states “each sub-slice 2005 includes processing resources such as execution units (EUs) 2110, a shared local memory (SLM) 2020 and a gateway 2050”. ¶ [0250] states “barrier synchronization mechanism 2130 causes each thread of a named barrier to open and acquire (or assigned) a handle, which enables gateway 2150 to register physical thread identifiers (or IDs) as part of the named barrier”. See FIG. 20 and 21. Examiner’s Note: the gateways help facilitate barrier synchronization. The sub-slice is the core and the slice is the cluster). Parle additionally teaches wherein the data element is directed to the SLM portion (¶ [0096] and [0099] state “‘Metadata’ transported from source SM to destination SMs may include, for example: Data SMEM (shared memory) Offset”. ¶ [0104] states “The shared memory offset represents the offset, from the shared memory base address for the SM, at which the result data should be written”), and wherein the executable computer program instructions further include instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: providing, by the cores of the first cluster of cores, synchronization for the broadcast of the data element (¶ [0062] states “a synchronization technique may be utilized by the sender SM 204 and receiver SMs in order to detect completion of the transaction or an errored transaction”. ¶ [0096], [0100], and [0102] state “Therefore, all metadata necessary to handle the responses must be sent with the packet. ‘Metadata’ transported from source SM to destination SMs may include, for example: Barrier address Offset, ACK phase ID (two possible phases, part of the MEMBAR protocol)”. ¶ [0104] states “The barrier address offset can be used by the receiver to identify the barrier”. ¶ [0105] states “The Ack Phase ID field is used by the receiving SM to correctly indicate the phase of the MEMBAR synchronization protocol to which an ack corresponds”. Examiner’s Note: the cores in the cluster use the barrier address offset and ack phase ID to synchronize the broadcast).
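The synchronization mapped above for claims 6, 7, and 14 (Valerio's gateway counter combined with Parle's two-phase MEMBAR acks) can be sketched as a counter-based broadcast barrier. This is a simplified sketch; every name and structural detail below is an assumption, not taken from either reference.

```python
# Simplified sketch of counter-based barrier synchronization for a
# broadcast, in the spirit of a gateway counter plus two-phase acks.

class BroadcastBarrier:
    def __init__(self, num_receivers):
        self.expected = num_receivers
        self.phase = 0          # ack phase id (two phases alternate)
        self.acks = 0           # gateway-counter analog

    def ack(self, phase):
        if phase != self.phase:
            return              # stale ack from a previous phase; ignore
        self.acks += 1

    def complete(self):
        # broadcast is complete once every receiver has acked this phase
        if self.acks < self.expected:
            return False
        self.acks = 0
        self.phase ^= 1         # flip phase for the next multicast
        return True

# usage: two receivers ack phase 0, so the barrier clears
bar = BroadcastBarrier(num_receivers=2)
bar.ack(phase=0)
bar.ack(phase=0)
assert bar.complete()           # safe to proceed past the barrier
```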
With regard to claim 15, Valerio teaches a method comprising (¶ [0258] states “Examples may include subject matter such as a method”): requesting a data element by a first core in a first cluster of cores of a graphics processor, each core of the first cluster of cores including one or more processing resources, shared memory, and broadcast circuitry (¶ [0225] states “computing device 1900 may include any number and type of hardware and/or software components, such as (without limitation) GPU 1914”. ¶ [0243] states “GPU 1914 is divided into slices, where each slice includes a plurality of slices”. ¶ [0244] states “each sub-slice 2005 includes processing resources such as execution units (EUs) 2110, a shared local memory (SLM) 2020 and a gateway 2050”. See FIG. 20 and 21. Examiner’s Note: the slice is the cluster, and the sub-slice is the core).

Valerio does not explicitly teach a first core to request data, determine if other cores need data, and broadcast data. However, in an analogous art, Parle teaches requesting a data element by a first core in a first cluster of cores of a graphics processor, each core of the first cluster of cores including one or more processing resources, shared memory, and broadcast circuitry (¶ [0053] states “SM 204 generates a multicast request packet 224”. ¶ [0056] states “The multicast request packet 224 is transmitted on the request crossbar 208 to the L2 Request Coalescer (LRC) 212”. ¶ [0057] states “At the LRC 212, multicast-specific information that was included in the multicast request packet 224 is saved in a tracking structure 222, and an L2 request packet 226 is generated”.
Read full office action

Prosecution Timeline

Dec 30, 2022: Application Filed
Mar 13, 2023: Response after Non-Final Action
Mar 23, 2026: Non-Final Rejection — §101, §103 (current)

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: Favorable
Median Time to Grant: 3y 3m
PTA Risk: Low
Based on 0 resolved cases by this examiner. Grant probability derived from career allow rate.
