DETAILED ACTION
Claims 1-20 are pending in this application.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3, 8, 10, 15 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2018/0286005 A1 to Koker et al. and further in view of U.S. Pub. No. 2015/0309845 A1 to Wilson et al.
As to claim 1, Martin teaches a processor comprising:
a plurality of compute units, where each of the compute units comprises circuitry configured to execute instructions (SEs 106/CU 202 paragraph 0078); and
dispatch (WD 102) the individual wavefronts of the first workgroup to two or more separate compute units of the plurality of compute units (SEs 106/CU 202) (“...In an embodiment, WD 102 distributes the work to other components in a graphics pipeline for parallel processing. WD 102 receives patches from a driver that include instructions for rendering primitives on a display screen. The driver receives patches from a graphics application. Once the driver receives patches from the graphics application, it uses a communication interface, such as a communication bus, to transmit patches to a graphics pipeline that begins with WD 102. In an embodiment, WD 102 divides patches into multiple work groups that are processed in parallel using multiple SEs 106...In an embodiment, to transmit work groups to SEs 106, WD 102 passes work groups to IAs 104. In an embodiment, there may be multiple IAs 104 connected to WD 102. IAs 104 divide workgroups into primitive group (also referred to as "prim groups"). IA 104 then passes the prim groups to SEs 106. In an embodiment, each IA 104 is coupled to two SEs 106. IAs 104 may also retrieve data that is manipulated using instructions in the patches, and performs other functions that prepare patches for processing using SEs 106...In another embodiment, WD 102 may distribute prim groups directly to SEs 106. In this embodiment, the functionality of IA 104 may be included in WD 102 or in SE 106. In this case, WD 102 divides a draw call into multiple prim groups and passes a prim group to SE 106 for processing. This configuration allows WD 102 to scale the number of prim groups to the number of SEs 106 that are included in the graphics pipeline...In an embodiment, SEs 106 process prim groups. For example, SEs 106 use multiple compute units to manipulate the data in each prim group so that it is displayed as objects on a display screen...” paragraphs 0024-0027).
Martin is silent with reference to: in response to determining that resources of a single compute unit are insufficient to execute the entire workgroup, maintain state information associated with the workgroup to track progress of its individual wavefronts toward a synchronization point; and
generate and convey a control signal to the two or more compute units responsive to all wavefronts of the workgroup having reached the synchronization point, wherein the control signal is generated based on the state information and enables continued execution of the wavefronts.
Koker teaches in response to determining that resources of a single compute unit are insufficient (saturate the available resources) to execute the entire workgroup maintain state information associated with the workgroup to track progress of its individual wavefronts (parallelism profile) (If the threads of the workgroup can saturate the available resources of one or more compute blocks 620A-620N, the thread dispatcher 710 can distribute the threads for parallel execution across one or more compute blocks) (“…For example and in one embodiment a first workgroup can arrive at the thread dispatcher 710. In various embodiments a workgroup can be any grouping of threads used in SIMT programming, including a thread group, thread block, or thread warp, and embodiments can implement compute block thread scheduling at various levels of granularity. In one embodiment the thread dispatcher 710 can distribute or concentrate threads across or within one or more compute blocks 620A-620N or across all compute blocks…After receiving the first workgroup, the thread dispatcher 710 can analyze the parallelism profile associated with the workgroup to determine how to schedule the threads across available compute blocks 620A-620N. If the threads of the workgroup can saturate the available resources of one or more compute blocks 620A-620N, the thread dispatcher 710 can distribute the threads for parallel execution across one or more compute blocks. However, if the parallelism profile indicated that the workgroup will be executed using a smaller number of threads, the threads of the workgroup can be concentrated to a single compute block (e.g., compute block 620A). In one embodiment, the parallelism profile can cause the thread dispatcher 710 to distribute the threads of a workgroup to a single processing cluster (e.g., processing cluster 614A of FIG. 6 within compute block 620A), or within a subset of compute units (e.g., graphics multiprocessor 234 of FIG. 2C). 
Any compute resources that will go unused during a workload cycle can be power gated by the arbitration and power management layer 716…” paragraphs 0133/0134).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin with the teaching of Koker because the teaching of Koker would improve the system of Martin by providing a technique (load balancing) for distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient.
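For illustration only, the distribute-versus-concentrate dispatch decision described in Koker's cited paragraphs can be sketched as follows. The sketch is hypothetical: the function name, the single-capacity saturation test, and the even split across blocks are assumptions for exposition and are not drawn from the reference.

```python
def dispatch(num_threads, block_capacity, num_blocks):
    """Return a per-block thread assignment in the style of Koker's
    thread dispatcher: concentrate a small workgroup on one compute
    block, but spread a workgroup that would saturate a single block
    across multiple blocks (hypothetical model, for illustration)."""
    if num_threads <= block_capacity:
        # Low parallelism: concentrate on a single compute block so the
        # remaining blocks can be power gated.
        return [num_threads] + [0] * (num_blocks - 1)
    # High parallelism: distribute as evenly as possible across blocks.
    base, extra = divmod(num_threads, num_blocks)
    return [base + (1 if i < extra else 0) for i in range(num_blocks)]
```

For example, a 32-thread workgroup with 64-thread blocks lands entirely on one block, while a 100-thread workgroup is spread across all available blocks.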
Wilson teaches maintain state information associated with the workgroup to track progress of its individual wavefronts (plurality of threads) toward a synchronization point (barrier synchronization) and
generate and convey a control signal to the two or more compute units (multiple cores) responsive to all wavefronts (plurality of threads) of the workgroup having reached the synchronization point (barrier synchronization), wherein the control signal is generated based on the state information and enables continued execution of the wavefronts (“…According to an embodiment of a further aspect there is provided a computer system with multiple cores arranged to implement a synchronization method wherein a group of cores is arranged to allow parallel execution of a plurality of threads using barrier synchronization, wherein each thread in the group waits for all the others at a barrier before progressing; the cores are arranged to allow execution of the threads until a first thread reaches the barrier; the core on which the first thread is executing is arranged to then allow the first thread to enter a polling state, in which it checks for a release condition indicating the end of the barrier; and further cores are arranged to allow execution of subsequent threads which reach the barrier subsequently; to allow the subsequent threads to be moved to the core on which the first thread is executing and then to be powered down as the number of moved threads increases; wherein the core on which the first thread is executing is arranged to allow the first thread to detect the release condition; and to allow the subsequent threads to move back to the further cores, and the further cores are operable to be powered up for use by the threads…For example, according to an embodiment of a still further aspect there is provided a synchronization method in a computer system with multiple cores, wherein a group of processes executes in parallel on a plurality of cores, the group of processes being synchronized using barrier synchronization in which each process in the group waits for all the others at a barrier before progressing; the group of processes executes until a first process 
reaches the barrier; the first process enters a polling state, repeatedly checking for a release condition indicating the end of the barrier; subsequent processes to reach the barrier are moved to a core on which the first process is executing; and other cores are powered down as the number of moved processes increases; and wherein when the first process detects the release condition, the powered down cores are powered up and are available for use by the processes…One thread will reach the barrier first (S110). As in the prior art barrier implementation this thread should increment a shared counter (which records how many of the threads have reached the barrier) from zero to one (S120) and then begin a busy-wait: repeatedly checking whether the value of the shared counter has reached N (S130). When the nth thread (running on core n) reaches the barrier (S140) it also increments the shared counter (from n−1 to n) in step S150 and checks once whether the value has reached N. If not, then rather than carrying out a busy-wait this thread is unpinned from the core on which it is running and moved in step S160 to the core that is running the first thread to reach the barrier, where it is de-scheduled in step S170 (i.e. enters a sleep state). There is now no work scheduled on core n, (in this one thread per core scenario) so this is powered down in order to save energy. Looking now at the last thread to reach the barrier in step S180, the value of the shared counter does equal N following the increment in step S190. n=N and all threads have reached the barrier (i.e. the barrier is ready to end). In this situation the thread reaching the barrier can continue with normal execution in step S200 without being unpinned from its core…” paragraphs 0041/0044/0068).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin and Koker with the teaching of Wilson because the teaching of Wilson would improve the system of Martin by providing a synchronization method for multiple cores, where a group of threads executes in parallel on a plurality of cores and is synchronized using barrier synchronization, in which each thread in the group waits for all the others at a barrier point before progressing (Wilson paragraph 0019).
As to claim 3, Wilson teaches the processor as recited in claim 1, wherein the processor further comprises a scoreboard or other state-tracking circuitry (shared counter), and wherein the processor is further configured to:
allocate an entry in the scoreboard to track wavefronts of the first workgroup (shared counter);
track, in the entry, a number of wavefronts of the first workgroup which have reached a given barrier synchronization point (“…FIG. 4 is a flow chart showing the progress of various threads through a barrier as proposed in an invention embodiment. In the prior art (busy-wait), all threads would take the left-hand route through steps S100, S110, S120, S130 and S200. Each thread would increment a counter and then poll until the counter value reaches the number of threads…” paragraph 0065); and
initiate synchronization among two or more compute units of the plurality of compute units to allow wavefronts of the first workgroup to proceed when the number of wavefronts of the first workgroup which have reached the synchronization point is equal to a total number of wavefronts in the first workgroup (nth thread) (“…One thread will reach the barrier first (S110). As in the prior art barrier implementation this thread should increment a shared counter (which records how many of the threads have reached the barrier) from zero to one (S120) and then begin a busy-wait: repeatedly checking whether the value of the shared counter has reached N (S130). When the nth thread (running on core n) reaches the barrier (S140) it also increments the shared counter (from n−1 to n) in step S150 and checks once whether the value has reached N. If not, then rather than carrying out a busy-wait this thread is unpinned from the core on which it is running and moved in step S160 to the core that is running the first thread to reach the barrier, where it is de-scheduled in step S170 (i.e. enters a sleep state). There is now no work scheduled on core n, (in this one thread per core scenario) so this is powered down in order to save energy. Looking now at the last thread to reach the barrier in step S180, the value of the shared counter does equal N following the increment in step S190. n=N and all threads have reached the barrier (i.e. the barrier is ready to end). In this situation the thread reaching the barrier can continue with normal execution in step S200 without being unpinned from its core…” paragraphs 0041/0044/0068).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin and Koker with the teaching of Wilson because the teaching of Wilson would improve the system of Martin by providing a synchronization method for multiple cores, where a group of threads executes in parallel on a plurality of cores and is synchronized using barrier synchronization, in which each thread in the group waits for all the others at a barrier point before progressing (Wilson paragraph 0019).
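Purely as an illustration of the scoreboard-style tracking recited in claim 3 (the class name and interface are hypothetical, not taken from the record or the references), the allocate/track/release flow can be sketched as:

```python
class Scoreboard:
    """Per-workgroup barrier tracking: allocate an entry, count the
    wavefronts that reach the synchronization point, and signal release
    only when the count equals the workgroup's total wavefront count.
    Hypothetical sketch for illustration only."""

    def __init__(self):
        self.entries = {}  # workgroup id -> [arrived, total]

    def allocate(self, wg_id, total_wavefronts):
        # One scoreboard entry tracks all wavefronts of one workgroup.
        self.entries[wg_id] = [0, total_wavefronts]

    def wavefront_at_barrier(self, wg_id):
        # Record one more arrival; True means the control signal that
        # lets all wavefronts continue would now be generated.
        entry = self.entries[wg_id]
        entry[0] += 1
        return entry[0] == entry[1]
```

In this sketch the boolean return stands in for the claimed control signal conveyed to the compute units once all wavefronts have arrived.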
As to claim 8, see the rejection of claims 1 and 15 above, except for a processor and a memory.
Martin teaches a processor and a memory (“...The embodiments are also directed to computer program products comprising software stored on any computer-usable medium. Such software, when executed in one or more data processing devices, causes a data processing device(s) to operate as described herein or, as noted above, allows for the synthesis and/or manufacture of computing devices (e.g., ASICs, or processors) to perform embodiments described herein. Embodiments employ any computer-usable or -readable medium, known now or in the future. Examples of computer-usable mediums include, but are not limited to, primary storage devices (e.g., any type of random access memory), secondary storage devices (e.g., hard drives, floppy disks, CD ROMS, ZIP disks, tapes, magnetic storage devices, optical storage devices, MEMS, nano-technological storage devices, etc.), and communication mediums (e.g., wired and wireless communications networks, local area networks, wide area networks, intranets, etc.)...” paragraph 0091).
As to claims 10 and 17, see the rejection of claim 3 above.
Claims 2, 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2018/0286005 A1 to Koker et al. and further in view of U.S. Pub. No. 2015/0309845 A1 to Wilson et al. as applied to claims 1, 8 and 15 above, and further in view of U.S. Pub. No. 2018/0082470 A1 to Nijasure et al.
As to claim 2, Martin as modified by Koker and Wilson teaches the processor as recited in claim 1; however, the combination is silent with reference to wherein dividing the first workgroup into individual wavefronts for dispatch from the dispatch unit to separate compute units comprises dispatching at least two wavefronts of the first workgroup to at least two different compute units of the plurality of compute units, wherein the compute units are configured to execute the wavefronts concurrently, optionally with wavefronts of the one or more other workgroups.
Nijasure teaches wherein dividing the first workgroup into individual wavefronts for dispatch from the dispatch unit to separate compute units comprises dispatching at least two wavefronts of the first workgroup to at least two different compute units of the plurality of compute units, wherein the compute units are configured to execute the wavefronts concurrently, optionally with wavefronts of the one or more other workgroups (“...The basic unit of execution in shader engines 132 is a work-item. Each work-item represents a single instantiation of a shader program that is to be executed in parallel in a particular lane. Work-items can be executed simultaneously as a "wavefront" on a single SIMD unit 138. Multiple wavefronts may be included in a "work group," which includes a collection of work-items designated to execute the same program. A work group can be executed by executing each of the wavefronts that make up the work group. The wavefronts may be executed sequentially on a single SIMD unit 138 or partially or fully in parallel on different SIMD units 138. Wavefronts can be thought of as instances of parallel execution of a shader program, where each wavefront includes multiple work-items that execute simultaneously on a single SIMD unit 138 in line with the SIMD paradigm (e.g., one instruction control unit executing the same stream of instructions with multiple data)...A scheduler 136 is configured to perform operations related to scheduling various wavefronts on different shader engines 132 and SIMD units 138. Wavefront bookkeeping 204 inside scheduler 136 stores data for pending wavefronts, which are wavefronts that have launched and are either executing or "asleep" (e.g., waiting to execute or not currently executing for some other reason). 
In addition to identifiers identifying pending wavefronts, wavefront bookkeeping 204 also stores indications of resources used by each wavefront, including registers such as vector registers 206 and/or scalar registers 208, portions of a local data store memory 212 assigned to a wavefront, portions of a memory 210 not local to any particular shader engine 132, or other resources assigned to the wavefront...” paragraphs 0022/0023).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin, Koker and Wilson with the teaching of Nijasure because the teaching of Nijasure would improve the system of Martin, Koker and Wilson by providing a technique for simultaneously executing tasks on compute units to allow for optimal use of computing resources.
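For exposition only, the division of a workgroup into wavefronts that execute on different compute units, as described in Nijasure's cited paragraphs, can be sketched as follows. The wavefront size and the round-robin assignment are illustrative assumptions, not details taken from the reference.

```python
def split_and_dispatch(work_items, wavefront_size, num_compute_units):
    """Divide a workgroup's work-items into wavefronts, then assign the
    wavefronts round-robin to distinct compute units so that at least
    two wavefronts of the same workgroup can execute concurrently on
    different units. Hypothetical sketch for illustration only."""
    wavefronts = [work_items[i:i + wavefront_size]
                  for i in range(0, len(work_items), wavefront_size)]
    # Map compute-unit index -> list of wavefronts assigned to it.
    return {cu: [wf for j, wf in enumerate(wavefronts)
                 if j % num_compute_units == cu]
            for cu in range(num_compute_units)}
```

With 256 work-items, 64-item wavefronts, and two compute units, each unit receives two of the four wavefronts.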
As to claims 9 and 16, see the rejection of claim 2 above.
Claims 4, 11 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2018/0286005 A1 to Koker et al. and further in view of U.S. Pub. No. 2015/0309845 A1 to Wilson et al. as applied to claims 1, 8 and 15 above, and further in view of U.S. Pub. No. 2021/0373899 A1 to Vembu et al.
As to claim 4, Martin as modified by Koker and Wilson teaches the processor as recited in claim 1; however, the combination is silent with reference to wherein the processor is further configured to determine the process for dispatching the individual wavefronts of the first workgroup to compute units of the plurality of compute units.
Vembu teaches wherein the processor is further configured to determine the process (scheduling policy) for dispatching the individual wavefronts (threads/wavefronts) of the first workgroup to separate compute units of the plurality of compute units (plurality of compute units) (“…At processing block 750, scheduler 613 performs thread scheduling based on a scheduling policy that includes both barrier usage and usual multiprocessor load balancing… wherein the graphics processor includes: a plurality of workgroup processors, each workgroup processor including a plurality of compute units for execution of threads in a plurality of wavefronts, each wavefront including a plurality of threads, and a scheduler to schedule a plurality of wavefronts for execution by the plurality of workgroup processors according to a scheduling policy, the scheduling policy being based at least in part on the barrier usage data, wherein the scheduler is to prioritize scheduling of a set of wavefronts of the plurality of wavefronts to a same workgroup processor of the plurality of workgroup processors upon a determination that the barrier usage data indicates a high magnitude of barrier messages in the set of wavefronts…wherein the scheduling policy is further based on load balancing of wavefronts across the plurality of workgroup processors...” paragraph 0147/claims 21/23).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin, Koker and Wilson with the teaching of Vembu because the teaching of Vembu would improve the system of Martin, Koker and Wilson by providing a load-balancing process for distributing a set of tasks over a set of resources to allow for optimal use of computing resources.
As to claims 11 and 18, see the rejection of claim 4 above.
Claims 5, 6, 12, 13, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2018/0286005 A1 to Koker et al. and further in view of U.S. Pub. No. 2015/0309845 A1 to Wilson et al. as applied to claims 1, 8 and 15 above, and further in view of U.S. Pat. No. 9,189,282 B2 to Conte et al.
As to claim 5, Martin as modified by Koker and Wilson teaches the processor as recited in claim 1; however, the combination is silent with reference to wherein the processor is further configured to allocate wavefronts of the first workgroup to the plurality of compute units based on load-ratings for each compute unit and each resource, calculated based on a plurality of monitoring performance counters.
Conte teaches the processor is further configured to allocate wavefronts of the first workgroup to the plurality of compute units based on load-ratings for each compute unit and each resource, calculated based on a plurality of monitoring performance counters (“...FIG. 6a is a schematic illustration of a system for performing methods for multi-core thread mapping in accordance with the present disclosure. As shown in FIG. 6a, a computer system 600 may include a processor 605 configured for performing an example of a method for mapping threads to execution to processor cores. In other examples, various operations or portions of various operations of the method may be performed outside of the processor 605. In operation 602, the method may include executing at least one software application program resulting in at least one thread of execution. In operation 603, the method may include collecting data relating to the performance of the multi-core processor using a performance counter. In operation 604, the method may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...FIG. 6b is a schematic illustration of a computer accessible medium having stored thereon computer executable instructions for performing a procedure for mapping threads of execution to processor cores in a multi-core processing system. As shown in FIG. 6b, a computer accessible medium 600 may have stored thereon computer accessible instructions 605 configured for performing an example procedure for mapping threads to execution to processor cores. In operation 602, the procedure may include executing at least one software application program resulting in at least one thread of execution. In operation 603, the procedure may include collecting data relating to the performance of the multi-core processor using a performance counter. 
In operation 604, the procedure may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...” Col. 5 Ln. 57 – 67, Col. 6 Ln. 1 – 21).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin, Koker and Wilson with the teaching of Conte because the teaching of Conte would improve the system of Martin, Koker and Wilson by providing a technique for mapping threads of execution to processor cores.
As to claim 6, Martin as modified by Koker, Wilson and Conte teaches the processor as recited in claim 5; however, the combination is silent with reference to wherein the processor is further configured to select a first compute unit of the plurality of compute units as a candidate for dispatch responsive to determining the first compute unit has a lowest load-rating among the plurality of compute units for a first resource.
Conte teaches wherein the processor is further configured to select a first compute unit of the plurality of compute units as a candidate for dispatch responsive to determining the first compute unit has a lowest load-rating among the plurality of compute units for a first resource (“...FIG. 6a is a schematic illustration of a system for performing methods for multi-core thread mapping in accordance with the present disclosure. As shown in FIG. 6a, a computer system 600 may include a processor 605 configured for performing an example of a method for mapping threads to execution to processor cores. In other examples, various operations or portions of various operations of the method may be performed outside of the processor 605. In operation 602, the method may include executing at least one software application program resulting in at least one thread of execution. In operation 603, the method may include collecting data relating to the performance of the multi-core processor using a performance counter. In operation 604, the method may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...FIG. 6b is a schematic illustration of a computer accessible medium having stored thereon computer executable instructions for performing a procedure for mapping threads of execution to processor cores in a multi-core processing system. As shown in FIG. 6b, a computer accessible medium 600 may have stored thereon computer accessible instructions 605 configured for performing an example procedure for mapping threads to execution to processor cores. In operation 602, the procedure may include executing at least one software application program resulting in at least one thread of execution. In operation 603, the procedure may include collecting data relating to the performance of the multi-core processor using a performance counter. 
In operation 604, the procedure may include using a core controller to map the thread of execution to a processor core based at least in part on the data collected by the performance counter...” Col. 5 Ln. 57 – 67, Col. 6 Ln. 1 – 21).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin, Koker and Wilson with the teaching of Conte because the teaching of Conte would improve the system of Martin, Koker and Wilson by providing a technique for mapping threads of execution to processor cores.
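Purely for illustration, selecting the compute unit with the lowest load-rating derived from performance counters, as discussed for claims 5 and 6, can be sketched as follows. The simple averaging of counter values into a rating is an assumption for exposition; no specific weighting is taken from the cited references.

```python
def load_rating(counters):
    """Combine per-resource performance-counter utilizations for one
    compute unit into a single load-rating. A plain average is used
    here purely for illustration; real hardware would apply its own
    weighting."""
    return sum(counters.values()) / len(counters)

def pick_compute_unit(per_cu_counters):
    """Return the index of the compute unit with the lowest
    load-rating, making it the candidate for the next dispatch."""
    ratings = [load_rating(c) for c in per_cu_counters]
    return ratings.index(min(ratings))
```

For example, given per-unit counter snapshots, the least-loaded unit is chosen as the dispatch candidate.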
As to claims 12 and 19, see the rejection of claim 5 above.
As to claims 13 and 20, see the rejection of claim 6 above.
Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2014/0152675 A1 to Martin et al. in view of U.S. Pub. No. 2018/0286005 A1 to Koker et al., further in view of U.S. Pub. No. 2015/0309845 A1 to Wilson et al., and further in view of U.S. Pat. No. 9,189,282 B2 to Conte et al. as applied to claim 5 above, and further in view of U.S. Pub. No. 2017/0031719 A1 to Clark et al.
As to claim 7, Martin as modified by Koker, Wilson and Conte teaches the processor as recited in claim 5; however, the combination is silent with reference to wherein the plurality of performance counters track two or more of vector arithmetic logic unit (VALU) execution bandwidth, scalar ALU (SALU) execution bandwidth, local data share (LDS) bandwidth, Load Store Bus bandwidth, Vector Register File (VRF) bandwidth, Scalar Register File (SRF) bandwidth, cache subsystem capacity, cache bandwidth, and translation lookaside buffer (TLB) bandwidth.
Clark teaches wherein the plurality of performance counters track two or more of vector arithmetic logic unit (VALU) execution bandwidth, scalar ALU (SALU) execution bandwidth, local data share (LDS) bandwidth, Load Store Bus bandwidth, Vector Register File (VRF) bandwidth, Scalar Register File (SRF) bandwidth, cache subsystem capacity, cache bandwidth, and translation lookaside buffer (TLB) bandwidth (Cache Bandwidth 450D, Cache Capacity 450E) (“...Performance counter values 450 may include a plurality of values associated with a given guest VM or a given vCPU of a guest VM, depending on the embodiment. A guest VM may include a plurality of performance counter values 450, and the guest VM may include a different set of performance counter values for a plurality of vCPUs in use by the guest VM. In various embodiments, the performance counter values 450 may include one or more of CPU time 450A, instructions retired 450B, floating point operations (FLOPs) 450C, cache bandwidth 450D, cache capacity 450E, memory bandwidth 450F, memory capacity 450G, I/O bandwidth 450H, and/or fixed function IP usage 450J...” paragraph 0052).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Martin, Koker, Wilson and Conte with the teaching of Clark because the teaching of Clark would improve the system of Martin, Koker, Wilson and Conte by providing a set of special-purpose registers built into modern microprocessors to store counts of hardware-related activities within computer systems.
As to claim 14, see the rejection of claim 7 above.
Response to Arguments
Applicant's arguments filed 12/01/25 have been fully considered but they are not persuasive.
Applicant argued in substance that (1) the Koker prior art does not teach or suggest synchronization of a split workgroup by tracking and maintaining workgroup state information, and (2) the Wilson prior art does not teach or suggest barrier synchronization and workgroup state tracking.
The Examiner disagrees.
As to point (1), the Koker prior art was never applied in this rejection for the synchronization of a split workgroup by tracking and maintaining workgroup state information. Koker discloses a work distribution unit in a scheduler configured to distribute threads or other work items (workgroups) to compute units and/or compute clusters. The work distribution unit can dispatch threads to individual/single compute units to gain the best hardware resource utilization via separate buses and access to dedicated resources. Additionally, compute units and/or compute clusters can be grouped to enable efficient hardware power utilization. Within the distribution unit, a thread dispatcher can distribute or concentrate threads across or within one or more compute units. On receiving the workgroup, the thread dispatcher analyzes the parallelism profile associated with the workgroup to determine how to schedule its threads across available compute units. If the threads of the workgroup can saturate the available resources of one or more compute units, the thread dispatcher can distribute the threads for parallel execution across one or more compute units. In essence, if the parallelism profile indicates that the workgroup will be executed using a smaller number of threads, the threads of the workgroup can be concentrated on a single compute unit; otherwise, if a single compute unit (resource) is insufficient for the number of threads in the workgroup, the threads are distributed across more than a single compute unit.
As to point (2), the Wilson prior art discloses a synchronization method in a computer system with multiple cores, where a group of threads executes in parallel on a plurality of cores. The group of threads is synchronized using barrier synchronization, in which each thread in the group waits for all the others at a barrier before progressing. When the first thread reaches the barrier, it increments a shared counter, which records its state, and then begins a busy-wait until the next thread reaches the barrier. The shared counter and the recorded state are updated until the last, or nth, thread reaches the barrier. This recording and updating of the shared counter is functionally equivalent to the claimed monitoring/tracking and recording of the state of threads for barrier synchronization.
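For illustration only, the shared-counter barrier behavior described above can be sketched as follows. The sketch uses a condition variable in place of Wilson's busy-wait, thread-migration, and power-gating details, and all names are hypothetical, not drawn from the reference.

```python
import threading

class CounterBarrier:
    """Shared-counter barrier in the style described above: each
    arriving thread increments a counter recording how many threads
    have reached the barrier; the Nth arrival releases the whole
    group. Hypothetical sketch for illustration only."""

    def __init__(self, n):
        self.n = n          # total number of threads in the group
        self.count = 0      # shared counter of arrivals
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            self.count += 1
            if self.count == self.n:
                # Last thread to arrive: the barrier is ready to end,
                # so release every waiting thread.
                self.cond.notify_all()
            else:
                # Earlier arrivals wait until all N have arrived.
                self.cond.wait_for(lambda: self.count == self.n)
```

No thread proceeds past `wait()` until the shared counter reaches N, mirroring the claimed tracking of progress toward the synchronization point.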
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHARLES E ANYA whose telephone number is (571)272-3757. The examiner can normally be reached Mon-Fri, 9am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KEVIN YOUNG can be reached on 571-270-3180. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CHARLES E ANYA/Primary Examiner, Art Unit 2194