Prosecution Insights
Last updated: April 19, 2026
Application No. 18/540,663

THROTTLING KERNEL SCHEDULING TO MINIMIZE CACHE CONTENTION

Status: Non-Final OA (§103)
Filed: Dec 14, 2023
Examiner: WU, BENJAMIN C
Art Unit: 2195
Tech Center: 2100 — Computer Architecture & Software
Assignee: Advanced Micro Devices, Inc.
OA Round: 1 (Non-Final)
Grant Probability: 87% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 0m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 87% — above average (456 granted / 522 resolved; +32.4% vs TC avg)
Interview Lift: +16.4% — strong (comparing resolved cases with vs. without an interview)
Typical Timeline: 3y 0m average prosecution; 29 applications currently pending
Career History: 551 total applications across all art units
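The headline figures reduce to simple ratios. Below is a quick sanity check of the arithmetic in Python; note that the without-interview rate is back-solved from the reported lift and with-interview rate, so it is an inference rather than a number the page reports:

```python
granted, resolved = 456, 522
career_allow_rate = granted / resolved      # ≈ 0.874 → the 87% shown above

with_interview = 0.99                       # reported with-interview grant rate
without_interview = with_interview - 0.164  # implied by the +16.4% lift (assumption)

print(f"career: {career_allow_rate:.1%}, "
      f"implied without interview: {without_interview:.1%}")
```

Run as-is, this prints a career rate of about 87.4% and an implied without-interview rate of about 82.6%, which is consistent with the blended 87% career figure sitting between the two.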

Statute-Specific Performance

§101: 19.8% (-20.2% vs TC avg)
§103: 48.4% (+8.4% vs TC avg)
§102: 0.8% (-39.2% vs TC avg)
§112: 16.1% (-23.9% vs TC avg)
Tech Center averages are estimates. Based on career data from 522 resolved cases.
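Each delta above can be checked against its statute's rate. A tiny script (values copied from the list, and assuming "vs TC avg" means the examiner's rate minus the Tech Center average) recovers the baseline — notably the same 40.0% for every statute, consistent with the chart having drawn a single average line:

```python
# (examiner rate %, delta vs TC avg %) per statute, copied from above
stats = {"§101": (19.8, -20.2), "§103": (48.4, +8.4),
         "§102": (0.8, -39.2), "§112": (16.1, -23.9)}

for statute, (rate, delta) in stats.items():
    # implied TC average = examiner rate minus the reported delta
    print(f"{statute}: examiner {rate}%, implied TC avg {rate - delta:.1f}%")
```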

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

2. Claims 1–20 are presented for examination in a non-provisional application filed on 12/14/2023.

Drawings

3. The drawings were received on 12/14/2023 (in the filings). These drawings are acceptable.

Examiner's Remarks

4. Examiner refers to and explicitly cites particular pages, sections, figures, paragraphs or columns and lines in the references as applied to Applicant's claims to the extent practicable to streamline prosecution. Although the cited portions of the references are representative of the best teachings in the art and are applied to meet the specific limitations of the claims, other uncited but related teachings of the references may be equally applicable as well. It is respectfully requested that, in preparing responses to the rejections, the Applicant fully consider not only the cited portions of the references, but also the references in their entirety, as potentially teaching, suggesting or rendering obvious all or one or more aspects of the claimed invention.

Abbreviations

5. Where appropriate, the following abbreviations will be used when referencing Applicant's submissions and specific teachings of the reference(s):
i. figure / figures: Fig. / Figs.
ii. column / columns: Col. / Cols.
iii. page / pages: p. / pp.

References Cited

6. (A) Ashbaugh et al., US 2020/0293380 A1 ("Ashbaugh"). (B) Alexander et al., US 2011/0161734 A1 ("Alexander"). (C) Kandula et al., US 2012/0167101 A1 ("Kandula").

Notice re prior art available under both pre-AIA and AIA

7. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

8. Claims 1–5, 8–12, and 15–19 are rejected under 35 U.S.C. 103 as being unpatentable over (A) Ashbaugh in view of (B) Alexander and (C) Kandula. See the "References Cited" section, above, for full citations of the references.

9. Regarding claim 1, (A) Ashbaugh teaches/suggests the invention substantially as claimed, including:

"An integrated circuit comprising: a plurality of compute circuits; and" (Fig. 17 and ¶ 231: processing platform incorporated within a system-on-a-chip (SoC) integrated circuit; Figs. 28–29 and ¶ 340: exemplary integrated circuits and associated graphics processors that may be fabricated using one or more IP cores);

"a scheduler comprising circuitry configured to create a plurality of scheduling groups including at least a first scheduling group … and a second scheduling group … each of the scheduling groups comprising kernels of a plurality of kernels" (Fig. 15 and ¶ 222: scheduling of thread groups to processors. Certain sub-sets of thread assignments, such as sub-set 1510, are cached together. In some embodiments, in contrast with the conventional scheduling of thread groups illustrated in FIG. 14, thread groups are assigned utilizing cache locality. For example, in a particular instance thread groups 0 to 3 are assigned in a manner to follow the cache locality established for thread group assignments; ¶ 224: the bias is a hint regarding cache locality that may be utilized in thread group assignment, such as for KERNELS with regular access patterns; ¶ 52: a scheduler 210, which is configured to distribute commands or other work items to a processing cluster array);

"execute the first scheduling group on one or more of the plurality of compute circuits; execute the second scheduling group on available hardware resources of the plurality of compute circuits …" (¶ 223: a certain bias is provided for synchronous scheduling, the bias (using hint or assertion) being that a group of thread groups/warps are to EXECUTE TOGETHER on a single microprocessor process or cache domain to improve cache locality; ¶ 346: the one or more processors are to schedule a plurality of groups of threads for processing by the plurality of graphics processors, the scheduling of the plurality of groups of threads including the plurality of processors to apply a bias for scheduling the plurality of groups of threads according to a cache locality for the one or more caches; ¶ 220: thread groups 0-5, when being scheduled, will each be scheduled to the first available processor).

Ashbaugh does not teach "a first scheduling group that accesses a first data set and a second scheduling group that accesses a second data set different from the first data set."

(B) Alexander, however, teaches or suggests: "a first scheduling group that accesses a first data set and a second scheduling group that accesses a second data set different from the first data set" (¶ 32: to receive one or more kernels from a compiler and schedule the work (e.g., one or more kernels and/or data sets) for dispatch to/by one or more of the multiple processor cores; ¶ 39: provide work divided into one or more work items 2040-2042, each associated with a kernel (e.g., a kernel of kernels 2010-2014). The kernels 2010-2014 are forwarded to scheduler 1335. In one or more embodiments, scheduler 1350 includes a scheduler that performs the functions of: (1) scheduling (placing) work elements … (2) selectively allocating the work items to selected processor cores; Fig. 3 and ¶ 43: work items can be grouped with a respective work counter and a respective kernel that can be used to process the work items. As illustrated, work groups 3010-3013 can include respective work items 3040-3043 and respective WIR counters 3050-3053. As shown, work groups 3010 and 3011 can include kernel 2010, and work groups 3012 and 3013 can include kernel 2011; ¶ 20: (1) Work Item: a base element of a data set (e.g., a byte, a string, an integer number, a floating point number, a pixel, an array, a data structure, etc.)).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of (B) Alexander with those of (A) Ashbaugh to implement and schedule multiple thread/kernel groups to process different data sets (work items). The motivation or advantage to do so is to provide for balanced and optimal division and distribution of work, i.e., data sets.

Ashbaugh and Alexander do not teach "execute the second scheduling group … based at least in part on a dispatch rate not being satisfied; and delay execution of the second scheduling group on the available hardware resources of the plurality of compute circuits, based at least in part on the dispatch rate condition being satisfied."

(C) Kandula, in the context of Ashbaugh and Alexander's teachings, however teaches or suggests implementing: "execute the second scheduling group … based at least in part on a dispatch rate not being satisfied; and delay execution of the second scheduling group on the available hardware resources of the plurality of compute circuits, based at least in part on the dispatch rate condition being satisfied" (¶ 118: tasks can be ordered so that tasks that will take longer to complete are scheduled before tasks that will take less time to complete. For example, the longest task can be scheduled first and the shortest task can be scheduled last; ¶ 120: proactive scheduler 131 can reduce job completion times by scheduling tasks in a phase in descending order of their data size …. Thus, scheduling tasks with the longest processing time first can approximate the optimal task scheduling). (See Ashbaugh, ¶ 74: instructions are cached in the instruction cache 252 and dispatched for execution by the instruction unit 254. The instruction unit 254 can dispatch instructions as thread groups (e.g., warps), with each thread of the thread group assigned to a different execution unit within GPGPU core 262; Alexander, ¶ 32: to receive one or more kernels from a compiler and schedule the work (e.g., one or more kernels and/or data sets) for dispatch to/by one or more of the multiple processor cores.)

It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of (C) Kandula with those of (A) Ashbaugh and (B) Alexander to schedule and execute thread/kernel groups in the order of their expected (or remaining) processing time or time to complete, i.e., longest job first scheduling. The motivation or advantage to do so is to optimize the scheduling of thread/kernel groups and ensure earlier overall completions across scheduled groups.

10. Regarding claim 2, Kandula teaches or suggests: "wherein the dispatch rate condition being satisfied comprises a duration of execution of the first scheduling group is greater than a duration of execution of the second scheduling group" (¶ 118: tasks can be ordered so that tasks that will take longer to complete are scheduled before tasks that will take less time to complete. For example, the longest task can be scheduled first and the shortest task can be scheduled last; ¶ 120: proactive scheduler 131 can reduce job completion times by scheduling tasks in a phase in descending order of their data size …. Thus, scheduling tasks with the longest processing time first can approximate the optimal task scheduling).
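For claim-mapping purposes, the throttling recited in claims 1–2 reads as a small decision rule. The following is a minimal Python sketch of that reading — the class, field names, and the Kandula-style ordering are illustrative assumptions, not the application's or the references' actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SchedulingGroup:
    name: str
    data_set: str        # claim 1: each group accesses a different data set
    est_duration: float  # estimated execution time (hypothetical field)

def dispatch_rate_condition(first: SchedulingGroup,
                            second: SchedulingGroup) -> bool:
    # Claim 2 reads the condition as satisfied when the first group's
    # duration of execution exceeds the second group's.
    return first.est_duration > second.est_duration

def schedule(groups: list[SchedulingGroup]) -> None:
    # Kandula-style longest-job-first ordering (¶¶ 118, 120): groups
    # expected to run longer are dispatched ahead of shorter ones.
    groups = sorted(groups, key=lambda g: g.est_duration, reverse=True)
    first, second = groups[0], groups[1]
    print(f"execute {first.name} on the compute circuits")
    if dispatch_rate_condition(first, second):
        # Claim 1: delay execution of the second group on the
        # available hardware resources while the condition holds.
        print(f"delay {second.name}")
    else:
        print(f"execute {second.name} on available resources")

schedule([SchedulingGroup("group_A", "dataset_0", 9.0),
          SchedulingGroup("group_B", "dataset_1", 4.0)])
```

Note that Kandula's longest-job-first ordering and the claimed delay-on-condition are not the same mechanism; the sketch places them side by side only to mirror how the rejection combines them, which may itself be a point to argue in response.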
11. Regarding claim 3, Ashbaugh, Alexander, and Kandula, in combination, teach or suggest: "wherein the scheduler is further configured to schedule execution of a third scheduling group of the plurality of scheduling groups that accesses a third data set, based at least in part on a dispatch rate condition for the third scheduling group not being satisfied" (Ashbaugh, ¶¶ 220, 223, and 346, teaching cache-locality-biased scheduling of thread groups; Alexander, ¶¶ 20, 32, 39, and Fig. 3 with ¶ 43, teaching work groups associated with kernels and data sets; Kandula, ¶¶ 118 and 120, teaching the longest job first scheduling method — each as applied in rejecting claim 1 above).

12. Regarding claim 4, Ashbaugh and Kandula, in combination, teach or suggest: "wherein the scheduler is further configured to generate the first duration based on the first data set being stored in a cache" (Ashbaugh, Fig. 15 and ¶ 222: scheduling of thread groups to processors. Certain sub-sets of thread assignments, such as sub-set 1510, are cached together. In some embodiments, in contrast with the conventional scheduling of thread groups illustrated in FIG. 14, thread groups are assigned utilizing cache locality. For example, in a particular instance thread groups 0 to 3 are assigned in a manner to follow the cache locality established for thread group assignments; ¶ 224: the bias is a hint regarding cache locality that may be utilized in thread group assignment, such as for KERNELS with regular access patterns; ¶ 225: the scheduling is required to be executed on the basis of a similar cache domain; ¶ 346: one or more caches for storage of data for the plurality of graphics processors, wherein the one or more processors are to schedule a plurality of groups of threads for processing by the plurality of graphics processors, the scheduling of the plurality of groups of threads including the plurality of processors to apply a bias for scheduling the plurality of groups of threads according to a cache locality for the one or more caches; ¶ 282: one or more data caches (e.g., 2212) are included to cache thread data during thread execution; Kandula, ¶¶ 118 and 120, teaching the longest job first scheduling method, as applied in rejecting claim 1 above; ¶ 123: the time to finish computing output after the input data has been read, and can be estimated from the behavior of earlier tasks in the phase. Proactive scheduler 131 can use interpolation and/or extrapolation from execution times and data sizes of previous tasks).

13. Regarding claim 5, Ashbaugh and Kandula, in combination, teach or suggest: "wherein the scheduler is further configured to generate the first duration based on a difference between a completion time estimate of the first scheduling group with the first data set stored in the cache and an amount of time that has elapsed since the first scheduling group had begun execution" (Ashbaugh, Fig. 15 and ¶¶ 222, 224, 225, 282, and 346, teaching cache-locality-biased scheduling, as applied in rejecting claim 4 above; Kandula, ¶¶ 118 and 120, teaching the longest job first scheduling method, as applied in rejecting claim 1 above; ¶ 117: Block 803 can include estimating the AMOUNT OF TIME REMAINING for a task to complete. For example, proactive scheduler 131 can estimate the amount of time for the task to complete based on the amount of input data that is left to be read by the task, how fast the input data is being read by the task, the amount of data that is left to be output by the task, and/or how fast the data is being output by the task; ¶ 123: the time to finish computing output after the input data has been read, and can be estimated from the behavior of earlier tasks in the phase. Proactive scheduler 131 can use interpolation and/or extrapolation from execution times and data sizes of previous tasks).
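Claims 4–5 tie the "first duration" to a completion-time estimate (with the data set cached) minus elapsed execution time, and the rejection maps the estimation step to Kandula's extrapolation from earlier tasks. Below is a rough Python sketch of that arithmetic; the proportional size-to-time model and all names are assumed purely for illustration:

```python
import time

def estimate_completion_time(data_size: float,
                             history: list[tuple[float, float]]) -> float:
    # Kandula ¶ 123-style extrapolation: scale from the (data size,
    # execution time) pairs of earlier tasks in the phase. A simple
    # proportional model is assumed here for illustration only.
    total_size = sum(size for size, _ in history)
    total_time = sum(t for _, t in history)
    return data_size * (total_time / total_size)

def first_duration(completion_estimate: float, start: float) -> float:
    # Claim 5: the difference between the completion-time estimate
    # (first data set stored in the cache) and the time elapsed since
    # the first scheduling group began execution.
    elapsed = time.monotonic() - start
    return max(0.0, completion_estimate - elapsed)

start = time.monotonic()
est = estimate_completion_time(8.0, [(2.0, 1.0), (4.0, 2.1)])
print(first_duration(est, start))
```

The allowable claims (6, 13, and 20, discussed below) concern the complementary case — a "second duration" generated when the data set is NOT initially cached — which this cached-data sketch deliberately does not model.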
14. Regarding claims 8–12, they are the corresponding method claims reciting similar limitations of commensurate scope as the apparatus (integrated circuit) claims 1–5. Therefore, they are rejected on the same basis as claims 1–5 above.

15. Regarding claims 15–19, they are the corresponding system claims reciting similar limitations of commensurate scope as the apparatus (integrated circuit) claims 1–5. Therefore, they are rejected on the same basis as claims 1–5 above, including the following rationale:

Ashbaugh teaches or suggests "a cache configured to store a copy of data stored in a memory" (Fig. 15 and ¶¶ 222, 224, 225, and 346, teaching cache-locality-biased scheduling, as applied in rejecting claim 4 above; ¶ 282: one or more data caches (e.g., 2212) are included to cache thread data during thread execution);

"a processing circuit comprising: a plurality of chiplets;" (¶ 47: the one or more parallel processor(s) 112, memory hub 105, processor(s) 102, and I/O hub 107 can be integrated into a system on chip (SoC) integrated circuit; Fig. 13 and ¶ 213: FIG. 13 illustrates an exemplary inferencing system on a chip (SOC) 1300 suitable for performing inferencing using a trained model. The SOC 1300 can integrate processing components including a media processor 1302, a vision processor 1304, a GPGPU 1306 and a multi-core processor 1308. The SOC 1300 can additionally include on-chip memory 1305 that can enable a shared on-chip data pool that is accessible by each of the processing components; Fig. 17 and ¶ 231: processing platform incorporated within a system-on-a-chip (SoC) integrated circuit; Fig. 28 and ¶ 341: FIG. 28 is a block diagram illustrating an exemplary system on a chip integrated circuit 2800 that may be fabricated using one or more IP cores, according to an embodiment. Exemplary integrated circuit 2800 includes one or more application processor(s) 2805 (e.g., CPUs), at least one graphics processor 2810, and may additionally include an image processor 2815 and/or a video processor 2820; the Examiner notes that a "chiplet" is a small, modular integrated circuit that can be combined with others to create a more complex system-on-chip (SoC) on a single chip);

"a scheduler comprising circuitry" (Fig. 15 and ¶¶ 222 and 224, as applied in rejecting claim 1 above; ¶ 52: a scheduler 210, which is configured to distribute commands or other work items to a processing cluster array … the scheduler 210 is implemented via firmware logic executing on a microcontroller).

Allowable Subject Matter

16. Claims 6–7, 13–14, and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The following is the Examiner's statement of reasons for allowance: the prior art of record, when viewed individually or in combination, does not expressly teach nor render obvious the features of dependent claims 6, 13, and 20 when viewed as a whole, specific to the limitation of: "wherein the scheduler is further configured to generate the second duration based on a corresponding data set accessed by the given scheduling group being NOT initially stored in the cache."

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN C WU, whose telephone number is (571) 270-5906. The examiner can normally be reached Monday through Friday, 8:30 A.M. to 5:00 P.M. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Aimee J. Li, can be reached at (571) 272-4169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/BENJAMIN C WU/
Primary Examiner, Art Unit 2195
March 14, 2026

Prosecution Timeline

Dec 14, 2023
Application Filed
Mar 14, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602258
INSTANTIATING SOFTWARE DEFINED STORAGE NODES ON EDGE INFORMATION HANDLING SYSTEMS
2y 5m to grant Granted Apr 14, 2026
Patent 12585508
RECONSTRUCTING AND VERIFYING PROPRIETARY CLOUD BASED ON STATE TRANSITION
2y 5m to grant Granted Mar 24, 2026
Patent 12579006
SYSTEMS AND METHODS FOR UNIVERSAL AUTO-SCALING
2y 5m to grant Granted Mar 17, 2026
Patent 12572388
COMPUTING RESOURCE SCHEDULING BASED ON EXPECTED CYCLES
2y 5m to grant Granted Mar 10, 2026
Patent 12566646
Accessing Critical Resource in a Non-Uniform Memory Access (NUMA) System
2y 5m to grant Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

1-2
Expected OA Rounds
87%
Grant Probability
99%
With Interview (+16.4%)
3y 0m
Median Time to Grant
Low
PTA Risk
Based on 522 resolved cases by this examiner. Grant probability derived from career allow rate.
