Prosecution Insights
Last updated: April 19, 2026
Application No. 17/571,220

MULTI-PASS PERFORMANCE PROFILING

Office Action: Non-Final, §103
Filed: Jan 07, 2022
Examiner: ANYA, CHARLES E
Art Unit: 2194
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Nvidia Corporation
OA Round: 5 (Non-Final)
Grant Probability: 82% (Favorable)
OA Rounds: 5-6
To Grant: 3y 2m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 82% (above average; +26.6% vs TC avg), 727 granted / 891 resolved
Interview Lift: +33.5% among resolved cases with interview
Avg Prosecution: 3y 2m (41 applications currently pending)
Career History: 932 total applications across all art units

Statute-Specific Performance

§101: 11.2% (-28.8% vs TC avg)
§103: 61.1% (+21.1% vs TC avg)
§102: 6.8% (-33.2% vs TC avg)
§112: 10.4% (-29.6% vs TC avg)

Based on career data from 891 resolved cases; TC-average comparisons are estimates.

Office Action

§103
DETAILED ACTION

Claims 1-33 are pending in this application.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 9, 17 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2016/0093012 A1 to Rao et al. in view of U.S. Pub. No. 2017/0212563 A1 to Farazmand et al.

As to claim 1, Rao teaches one or more processors comprising: circuitry to, in response to an application programming interface (API) call (At 1301 an application calls) (“…FIG. 13 illustrates a sequence of operations performed across various software layers including the scheduler kernel 1203 in accordance with one embodiment of the invention. At 1301 an application calls the user mode driver using an appropriate set of API calls (e.g., OpenCL API calls in one embodiment). At 1302, the user mode driver allocates GPU resources and submits the parent kernel to the kernel mode driver (KMD) of the OS.
At 1303, the OS/KMD submits the parent kernel for execution by the GPU (e.g., placing instructions and data in the GPU command buffer)…” paragraph 0118): determining that the first portion of a software program is dependent on data generated by a second portion of the software program (if child kernels are present, determined at 1306); and causing the first portion and second portion of the software program to be performed a plurality of times (loops back) in response to the determination (then at 1307, the scheduler kernel select the first/next child kernel for execution) (“…At 1304, the parent kernel is executed on the GPU. As mentioned above, in one embodiment, the parent kernel sets up a scheduler kernel 1203 for scheduling execution of the child kernels (e.g., setting up state information in the GPU needed for the execution of the scheduler kernel 1203). At 1305, the schedule kernel is executed and, if child kernels are present, determined at 1306, then at 1307, the scheduler kernel select the first/next child kernel for execution. As mentioned above, the scheduler kernel 1203 may evaluate dependencies between the child kernels and other relevant information when scheduling the child kernels for execution. Once the scheduler kernel 1203 determines that the child kernel is ready for execution (e.g., because the data on which it depends is available), at 1308 it sets up the execution on the GPU. For example, the scheduler kernel 1203 may program the GPU state as needed so that the GPU can execute the child kernel. At 1309, once the state has been set up on the GPU, the child kernel is executed. The process then loops back through operations 1305-1309 to execute the next child kernel, potentially reordering the child kernels for execution based on detected conditions/dependencies. 
When no more child kernels are available, determined at 1306, the process terminates and returns results of the parent/host kernel execution to the host application at 1310…A set of exemplary program code implemented in one embodiment of the invention is illustrated in FIG. 14. Certain portions are highlighted as follows to identify pertinent aspects of this embodiment. In particular, at 1401, the parent kernel is called on the host after which everything happens in GPU domain. There is no need to return to the host until execution is complete. At 1402 conditional execution of the child kernels is initiated. For example, whether child kernels A1 or A2 is executed is known only when parent kernel is executed…” paragraphs 0119/0130). Rao is silent with reference to in response to an application programming interface (API) call cause one or more performance metrics of a software program to be obtained, and wherein the API causes, for each of the plurality of times, different subset of the one or more performance metrics to be obtained based, at least in part, on the performance of at least the first portion and the second portion. Farazmand teaches in response to an application programming interface (API) call (issue commands) cause one or more performance metrics of a software program to be obtained ( “…This disclosure introduces techniques for using performance counters to measure the processing time for a compute kernel, or portion thereof, as part of a profiling phase. Based on the measured processing times and other information available at compile time or kernel launch time, a system may predict total execution time of the kernel. An example of such other information may be the number of workgroups in a kernel or number of kernels in a virtual frame, where a virtual frame is a virtual construct for converting compute workloads that are theoretically unbounded and non-periodic, into execution units with associated (e.g. 
implied) deadline or performance requirements…” paragraph 0022) and wherein the API causes (issue commands), for each of the plurality of times (a total number of execution cycles for the compute kernel), different subset of the one or more performance metrics to be obtained based, at least in part, on the performance of at least the first portion (compute kernel) and the second portion (total number of workgroups) (“… GPU 12 may predict the execution time of the compute kernel by estimating an average execution clock cycles per workgroup for the compute kernel and by estimating a total number of execution cycles for the compute kernel based on the average execution clock cycles per workgroup for the compute kernel and a total number of workgroups in the kernel…To predict the execution time of the compute kernel, the system may additionally estimate a total number of execution cycles for the compute kernel based on the average execution clock cycles per workgroup for the compute kernel and a total number of workgroups in the kernel…” paragraphs 0042/0094/0097). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Rao with the teaching of Farazmand because the teaching of Farazmand would improve the system of Rao by providing a set of special-purpose registers built into microprocessors to store the counts of hardware-related activities within computer systems. As to claims 9, 17 and 25, see the rejection of claim 1 above.

Claims 1-3, 6, 9-11, 14, 17-19, 22, 25-27 and 29 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2016/0093012 A1 to Rao et al. in view of U.S. Pub. No. 2013/0160016 A1 to Gummaraju et al.

As to claim 1, Rao teaches one or more processors comprising: circuitry to, in response to an application programming interface (API) call (At 1301 an application calls) (“…FIG.
13 illustrates a sequence of operations performed across various software layers including the scheduler kernel 1203 in accordance with one embodiment of the invention. At 1301 an application calls the user mode driver using an appropriate set of API calls (e.g., OpenCL API calls in one embodiment). At 1302, the user mode driver allocates GPU resources and submits the parent kernel to the kernel mode driver (KMD) of the OS. At 1303, the OS/KMD submits the parent kernel for execution by the GPU (e.g., placing instructions and data in the GPU command buffer)…” paragraph 0118): determining that the first portion of a software program is dependent on data generated by a second portion of the software program (if child kernels are present, determined at 1306); and causing the first portion and second portion of the software program to be performed a plurality of times (loops back) in response to the determination (then at 1307, the scheduler kernel select the first/next child kernel for execution) (“…At 1304, the parent kernel is executed on the GPU. As mentioned above, in one embodiment, the parent kernel sets up a scheduler kernel 1203 for scheduling execution of the child kernels (e.g., setting up state information in the GPU needed for the execution of the scheduler kernel 1203). At 1305, the schedule kernel is executed and, if child kernels are present, determined at 1306, then at 1307, the scheduler kernel select the first/next child kernel for execution. As mentioned above, the scheduler kernel 1203 may evaluate dependencies between the child kernels and other relevant information when scheduling the child kernels for execution. Once the scheduler kernel 1203 determines that the child kernel is ready for execution (e.g., because the data on which it depends is available), at 1308 it sets up the execution on the GPU. For example, the scheduler kernel 1203 may program the GPU state as needed so that the GPU can execute the child kernel. 
At 1309, once the state has been set up on the GPU, the child kernel is executed. The process then loops back through operations 1305-1309 to execute the next child kernel, potentially reordering the child kernels for execution based on detected conditions/dependencies. When no more child kernels are available, determined at 1306, the process terminates and returns results of the parent/host kernel execution to the host application at 1310…A set of exemplary program code implemented in one embodiment of the invention is illustrated in FIG. 14. Certain portions are highlighted as follows to identify pertinent aspects of this embodiment. In particular, at 1401, the parent kernel is called on the host after which everything happens in GPU domain. There is no need to return to the host until execution is complete. At 1402 conditional execution of the child kernels is initiated. For example, whether child kernels A1 or A2 is executed is known only when parent kernel is executed…” paragraphs 0119/0130). Rao is silent with reference to in response to an application programming interface (API) call cause one or more performance metrics of a software program to be obtained, and wherein the API causes, for each of the plurality of times, different subset of the one or more performance metrics to be obtained based, at least in part, on the performance of at least the first portion and the second portion. 
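As characterized in the rejection, the claim requires replaying dependent program portions several times while collecting a different subset of performance metrics on each pass. A minimal sketch of that scheme follows, with hypothetical kernel and metric names and an assumed two-counter hardware limit; nothing here comes from the application or the cited references:

```python
# Minimal sketch of multi-pass profiling: a limited number of hardware
# counters forces the profiler to replay the kernels once per subset of
# metrics. All names are hypothetical illustrations of the claim language.

NUM_HW_COUNTERS = 2  # assumed hardware limit per pass

def profile_multi_pass(kernels, metrics, num_counters=NUM_HW_COUNTERS):
    """Replay `kernels` once per counter-sized subset of `metrics`."""
    results = {}
    # Partition the requested metrics into per-pass subsets.
    subsets = [metrics[i:i + num_counters]
               for i in range(0, len(metrics), num_counters)]
    for subset in subsets:          # one replay pass per metric subset
        for kernel in kernels:      # dependent portions run in order
            for metric in subset:
                # Stand-in for reading a hardware counter after execution.
                results[(kernel, metric)] = f"{kernel}:{metric}"
    return subsets, results

passes, data = profile_multi_pass(
    kernels=["producer", "consumer"],  # consumer depends on producer's data
    metrics=["cycles", "dram_reads", "sm_occupancy"])
# Three metrics with two counters -> two passes, each collecting a
# different subset while both kernels are re-executed.
```
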
Gummaraju teaches in response to an application programming interface (API) call cause one or more performance metrics of a software program to be obtained (Performance Counters 211/Statistics may be collected using hardware and/or software performance counters) and wherein the API causes, for each of the plurality of times, different subset of the one or more performance metrics to be obtained based, at least in part, on the performance of at least the first portion and the second portion (compute kernels/operation 304, kernel profiles) (“…Work monitor 203 operates to monitor the compute kernels during runtime. Dynamic information such as executing time, consumed energy, compute unit utilization, bandwidth utilization, number of stall cycles due to memory access latencies, total number of stall cycles, and conditional branch divergence can be obtained by monitoring kernels during execution. The monitoring may be based on hardware, firmware, and/or software based performance counters 211. The number and types of performance counters 211 being monitored may be balanced against the overhead incurred in such monitoring so as not to unduly burden the operation of the computing system. The monitoring may be performed at configured periodic intervals or continuous sampling…At operation 304, kernel profiles are generated for compute kernels that are to be executed in the heterogeneous computing system. A profile may be created for each compute kernel to be executed. The profile may be created during installation of the application or library containing the kernel, on the first invocation of the kernel, or on each invocation of the kernel. Profiles may be stored in persistent storage for use across multiple runs of the application or re-generated on each invocation of the application or kernel. As described before, multiple instances of a compute kernel can be executed in the computing system in parallel on one or more processors. 
A kernel profile may be generated for each compute kernel. The initial kernel profile may be generated based on static characteristics of the kernel code. The respective kernel profiles can be updated according to dynamic performance and/or energy consumption characteristics of the kernel during execution. The generation of kernel profiles is further described in relation to FIG. 4 below…At operation 502, compute kernel execution is monitored. The monitoring may be based upon collecting a configurable set of statistics. The statistics may be collected separately for each of the processors on which the compute kernels execute. Statistics may be aggregated for each processor type. Statistics may be collected using hardware and/or software performance counters. As described below, various metrics are available to measure performance and/or energy consumption. The set of metrics measured and monitored may be configurable…” 0040/0049/0063). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Rao with the teaching of Gummaraju because the teaching of Gummaraju would improve the system of Rao by providing a set of special-purpose registers built into microprocessors to store the counts of hardware-related activities within computer systems.

As to claim 2, Rao teaches the one or more processors of claim 1, wherein each of the one or more portions comprises a kernel (figure 12) to be executed on a graphics processing unit (Graphics Processor Unit (GPU) 1212) (“…An overview of one embodiment of the invention is illustrated in FIG. 12, which shows an operating system (OS) 1210, a graphics driver 1211, and a graphics processor unit (GPU) 1212. In one embodiment, program code 1200 is executed by the OS 1210 in response to API calls generated by an application (not shown).
In one embodiment, the program code 1200 comprises binary and state information generated by a compiler for one or more parent kernels 1204 and child kernels 1201. The driver 1211 may then parse the binary and state information and manage the dynamic execution of the parent and child kernels as described herein (without intervention by the OS/host processor)…In one embodiment, the parent kernel 1204 is executed on the GPU 1212 first and sets up a scheduler kernel 1203 for scheduling execution of the child kernels 1201 (as indicated at 1206 in FIG. 12). For example, the parent kernel 1204, in combination with the driver 1211, sets up the state information in the GPU needed for the execution of the scheduler kernel 1203. In one embodiment, the scheduler kernel 1203 (potentially implemented as a separate OpenCL kernel) is aware of the policies and mechanisms of the driver 1211 so that it can set up the GPU hardware state information correctly. The scheduler kernel 1203 then performs a variety of tasks described below to execute the child kernels 1201…” paragraphs 0113/0114). As to claim 3, Gummaraju teaches the one or more processors of claim 1, wherein the plurality of times the one or more portions are to be performed is based, at least in part, on a number of available hardware counters (Performance Counters 211) (“…Work monitor 203 operates to monitor the compute kernels during runtime. Dynamic information such as executing time, consumed energy, compute unit utilization, bandwidth utilization, number of stall cycles due to memory access latencies, total number of stall cycles, and conditional branch divergence can be obtained by monitoring kernels during execution. The monitoring may be based on hardware, firmware, and/or software based performance counters 211. The number and types of performance counters 211 being monitored may be balanced against the overhead incurred in such monitoring so as not to unduly burden the operation of the computing system. 
The monitoring may be performed at configured periodic intervals or continuous sampling…At operation 304, kernel profiles are generated for compute kernels that are to be executed in the heterogeneous computing system. A profile may be created for each compute kernel to be executed. The profile may be created during installation of the application or library containing the kernel, on the first invocation of the kernel, or on each invocation of the kernel. Profiles may be stored in persistent storage for use across multiple runs of the application or re-generated on each invocation of the application or kernel. As described before, multiple instances of a compute kernel can be executed in the computing system in parallel on one or more processors. A kernel profile may be generated for each compute kernel. The initial kernel profile may be generated based on static characteristics of the kernel code. The respective kernel profiles can be updated according to dynamic performance and/or energy consumption characteristics of the kernel during execution. The generation of kernel profiles is further described in relation to FIG. 4 below…At operation 502, compute kernel execution is monitored. The monitoring may be based upon collecting a configurable set of statistics. The statistics may be collected separately for each of the processors on which the compute kernels execute. Statistics may be aggregated for each processor type. Statistics may be collected using hardware and/or software performance counters. As described below, various metrics are available to measure performance and/or energy consumption. The set of metrics measured and monitored may be configurable…” 0040/0049/0063).
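Claim 3 ties the number of replay passes to the number of available hardware counters. Under the assumption that each pass can service one metric per counter, the pass count reduces to a ceiling division, sketched here with hypothetical numbers:

```python
import math

def passes_required(num_metrics, num_hw_counters):
    # Each pass can service at most `num_hw_counters` metrics, so the
    # replay count is the ceiling of metrics over counters.
    if num_hw_counters <= 0:
        raise ValueError("need at least one hardware counter")
    return math.ceil(num_metrics / num_hw_counters)

# e.g. 10 requested metrics on a device exposing 4 counters -> 3 passes
```
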
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Rao with the teaching of Gummaraju because the teaching of Gummaraju would improve the system of Rao by providing a set of special-purpose registers built into microprocessors to store the counts of hardware-related activities within computer systems.

As to claim 6, Rao teaches the one or more processors of claim 1, wherein the one or more portions comprise at least a first kernel having an interdependency with a second kernel (FIG. 12) (“…An overview of one embodiment of the invention is illustrated in FIG. 12, which shows an operating system (OS) 1210, a graphics driver 1211, and a graphics processor unit (GPU) 1212. In one embodiment, program code 1200 is executed by the OS 1210 in response to API calls generated by an application (not shown). In one embodiment, the program code 1200 comprises binary and state information generated by a compiler for one or more parent kernels 1204 and child kernels 1201. The driver 1211 may then parse the binary and state information and manage the dynamic execution of the parent and child kernels as described herein (without intervention by the OS/host processor)…In one embodiment, the parent kernel 1204 is executed on the GPU 1212 first and sets up a scheduler kernel 1203 for scheduling execution of the child kernels 1201 (as indicated at 1206 in FIG. 12). For example, the parent kernel 1204, in combination with the driver 1211, sets up the state information in the GPU needed for the execution of the scheduler kernel 1203. In one embodiment, the scheduler kernel 1203 (potentially implemented as a separate OpenCL kernel) is aware of the policies and mechanisms of the driver 1211 so that it can set up the GPU hardware state information correctly. The scheduler kernel 1203 then performs a variety of tasks described below to execute the child kernels 1201…” paragraphs 0113/0114).
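Rao's scheduler-kernel loop (operations 1305-1309: check for remaining child kernels, select one whose dependencies are satisfied, set up state, execute, repeat) can be sketched as a ready-queue scheduler. The dependency map and kernel names below are hypothetical illustrations, not Rao's code:

```python
def run_child_kernels(deps):
    """`deps` maps each child kernel to the set of kernels whose data it
    needs. Repeatedly executes any kernel whose dependencies are done,
    mirroring the loop through operations 1305-1309 in Rao."""
    done, order = set(), []
    pending = dict(deps)
    while pending:  # 1306: child kernels still present?
        # 1307: select the next child kernel that is ready to execute.
        ready = [k for k, d in pending.items() if d <= done]
        if not ready:
            raise RuntimeError("cyclic dependency between child kernels")
        kernel = ready[0]
        # 1308/1309: set up GPU state and execute the child kernel.
        order.append(kernel)
        done.add(kernel)
        del pending[kernel]
    return order  # 1310: return results to the host application

# Hypothetical children: B depends on data produced by both A1 and A2.
schedule = run_child_kernels({"A1": set(), "A2": set(), "B": {"A1", "A2"}})
```
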
As to claims 9, 17 and 25, see the rejection of claim 1 above. As to claims 10, 18 and 26, see the rejection of claim 2 above. As to claims 11, 19 and 27, see the rejection of claim 3 above. As to claims 14, 22, and 29, see the rejection of claim 6 above. Claims 4, 12 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2016/0093012 A1 to Rao et al. in view of U.S. Pub. No. 2013/0160016 A1 to Gummaraju et al. as applied to claims 1, 9 and 17 above, and further in view of W.O. No. 2016/122503 A to Gay et al. As to claim 4, Gummaraju teaches the one or more processors of claim 1, wherein the performance metrics are to be generated based, at least in part, on hardware counters (Performance Counters 211) (“…Work monitor 203 operates to monitor the compute kernels during runtime. Dynamic information such as executing time, consumed energy, compute unit utilization, bandwidth utilization, number of stall cycles due to memory access latencies, total number of stall cycles, and conditional branch divergence can be obtained by monitoring kernels during execution. The monitoring may be based on hardware, firmware, and/or software based performance counters 211. The number and types of performance counters 211 being monitored may be balanced against the overhead incurred in such monitoring so as not to unduly burden the operation of the computing system. The monitoring may be performed at configured periodic intervals or continuous sampling…At operation 304, kernel profiles are generated for compute kernels that are to be executed in the heterogeneous computing system. A profile may be created for each compute kernel to be executed. The profile may be created during installation of the application or library containing the kernel, on the first invocation of the kernel, or on each invocation of the kernel. 
Profiles may be stored in persistent storage for use across multiple runs of the application or re-generated on each invocation of the application or kernel. As described before, multiple instances of a compute kernel can be executed in the computing system in parallel on one or more processors. A kernel profile may be generated for each compute kernel. The initial kernel profile may be generated based on static characteristics of the kernel code. The respective kernel profiles can be updated according to dynamic performance and/or energy consumption characteristics of the kernel during execution. The generation of kernel profiles is further described in relation to FIG. 4 below…At operation 502, compute kernel execution is monitored. The monitoring may be based upon collecting a configurable set of statistics. The statistics may be collected separately for each of the processors on which the compute kernels execute. Statistics may be aggregated for each processor type. Statistics may be collected using hardware and/or software performance counters. As described below, various metrics are available to measure performance and/or energy consumption. The set of metrics measured and monitored may be configurable…” 0040/0049/0063). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Rao with the teaching of Gummaraju because the teaching of Gummaraju would improve the system of Rao by providing a set of special-purpose registers built into microprocessors to store the counts of hardware-related activities within computer systems.
Gay teaches wherein a total number of available hardware counters is less than required to generate the performance metrics in one performance of the one or more portions (“…The specification and figures describe a method of collecting hardware performance data…The method further includes, with the processor, executing an interpolation module to interpolate missed samples between a number of captured values of a first event. This method may have a number of advantages, including: (1) allowing users to sample many more hardware performance counter events to trace or profile an application in a single run; (2) providing collection of 3 to 4 times more events than otherwise possible; (3) making it possible with a limited number of hardware counters a processing device manufacturer may provide, to capture and study more than the minimum number of critical events in the processor captures; (4) allowing capture of all the events of interest to a user; (5) not having to deal with yielded misaligned or mismatched sample results between multiple runs; and (6) reducing the time required to capture hardware performance counter events that may be impractically time consuming, among other advantages….” paragraph 0129). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Rao and Gummaraju with the teaching of Gay because the teaching of Gay would improve the system of Rao and Gummaraju by providing a hardware performance data collection system for capturing and studying more than a minimum number of critical events (Gay paragraph 0129). As to claims 12 and 20, see the rejection of claim 4 above.

Claims 5, 13, 21 and 28 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2016/0093012 A1 to Rao et al. in view of U.S. Pub. No. 2013/0160016 A1 to Gummaraju et al. as applied to claims 1, 9, 17 and 25 above, and further in view of U.S. Pat. No.
6,253,338 B1 issued to Smolders et al.

As to claim 5, Rao as modified by Gummaraju teaches the one or more processors of claim 1; however, it is silent with reference to wherein the one or more circuits are to restore contents of memory used to perform the two or more portions, prior to at least one of the plurality of times the one or more portions are performed. Smolders teaches wherein the one or more circuits are to restore contents of memory used to perform the two or more portions (portions of workloads), prior to at least one of the plurality of times (Hardware Counters 74/75) the one or more portions are performed (“…Application Programming Interfaces (API) have also been built to collect counter information for portions of workloads…. The counter level tracing tool 31 then resets the hardware counters 74 and 75 to zero and restores the previous state (registers) information, shown respectively by steps 52 and 62. At that point the counter level tracing tool 31 has completed its operation at 64 with a return from the interrupt…” Col. 1, Ln. 33-35, Col. 5, Ln. 47-51). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Rao and Gummaraju with the teaching of Smolders because the teaching of Smolders would improve the system of Rao and Gummaraju by providing a counter tracing tool for restoring or resetting registers for reuse. As to claims 13, 21 and 28, see the rejection of claim 5 above.

Claims 8, 16, 24, 31 and 32 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2016/0093012 A1 to Rao et al. in view of U.S. Pub. No. 2013/0160016 A1 to Gummaraju et al. as applied to claims 1, 9, 17, and 25 above, and further in view of U.S. Pub. No. 2021/0117202 A1 to Levit-Gurevich et al.
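The restore behavior that the claim 5 analysis above maps to Smolders, snapshotting state and restoring it before each additional profiling pass so that every replay starts from identical memory contents, can be sketched as follows (all structures and names are hypothetical, not from Smolders):

```python
import copy

def replay_with_restore(memory, passes, run_pass):
    """Snapshot `memory` once, then restore it before every pass so each
    replay of the program portions observes identical starting contents."""
    snapshot = copy.deepcopy(memory)
    observations = []
    for i in range(passes):
        memory.clear()
        memory.update(copy.deepcopy(snapshot))  # restore prior contents
        observations.append(run_pass(i, memory))
    return observations

# Hypothetical pass that mutates memory in place; restoration makes the
# mutation invisible to later passes.
mem = {"buf": [1, 2, 3]}
out = replay_with_restore(
    mem, 3, lambda i, m: m["buf"].append(i) or len(m["buf"]))
# Every pass sees buf restored to three elements before appending one.
```
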
As to claim 8, Rao as modified by Gummaraju teaches the one or more processors of claim 1, however it is silent with reference to wherein the API comprises code to identify a range of kernels to be re-executed, the range of kernels comprising the one or more portions. Levit-Gurevich teaches wherein the API comprises code to identify a range of kernels to be re-executed, the range of kernels comprising the one or more portions (Kernels 106/108/706/902/904) (“…In the illustrated example of FIG. 4, the trace emulator 430 emulates and/or otherwise replays the GLITs 112 of FIG. 1 to effectuate analysis of the operation of the GPU 110. For example, the trace emulator 430 may replay execution of the second kernel 108 by the GPU 110 based on data stored in the GLITs 112. In some examples, the trace emulator 430 may replay one or more executions of the second kernel 108 by respective ones of the threads 208 of FIG. 2 based on the data stored in the GLITs 112 that correspond to the respective ones of the threads 208…In this example, the CPU 704 implements an example GLIT replay application 714 to replay an execution of the kernel 706 by the GPU 702 based on the GLIT(s) 712 by simulating the execution of the kernel 706. In some examples, the GLIT replay application 714 may implement the application 120 of FIG. 1. For example, the GLIT replay application 714 can be a software application that instruments emulation routines (e.g., emulation instructions, emulation software routines, etc.) that correspond to a simulation of GPU routines (e.g., GPU instructions, GPU kernel routines, etc.) utilized to execute the kernel 706…” paragraphs 0076/0114/0122). 
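Claim 8, as characterized in the rejection, recites API code that identifies a range of kernels to be re-executed. A small sketch of such a range-selection step, with invented function and launch-stream names (not from Levit-Gurevich or the application):

```python
def select_replay_range(launch_order, start, end):
    """Return the contiguous range of kernels, identified by name, that a
    hypothetical range-replay API would re-execute on each pass."""
    i, j = launch_order.index(start), launch_order.index(end)
    if i > j:
        raise ValueError("range start occurs after range end")
    return launch_order[i:j + 1]

# Hypothetical launch stream; only the bracketed range is replayed.
stream = ["init", "producer", "consumer", "reduce", "teardown"]
replayed = select_replay_range(stream, "producer", "reduce")
```
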
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Rao and Gummaraju with the teaching of Levit-Gurevich because the teaching of Levit-Gurevich would improve the system of Rao and Gummaraju by providing a technique that allows the users to profile and analyze the kernels or shaders correctly. As to claims 16, 24, 31 and 32, see the rejection of claim 8 above.

Claim 33 is rejected under 35 U.S.C. 103 as being unpatentable over U.S. Pub. No. 2016/0093012 A1 to Rao et al. in view of U.S. Pub. No. 2013/0160016 A1 to Gummaraju et al. as applied to claim 25 above, and further in view of U.S. Pub. No. 2018/0033114 A1 to Chen et al.

As to claim 33, Rao as modified by Gummaraju teaches the method of claim 25; however, it is silent with reference to wherein at least some of the two or more portions are performed concurrently. Chen teaches wherein at least some of the two or more portions are performed concurrently (concurrent execution of kernel codes that are programmed in more than one programming framework. The kernel codes programmed in different programming frameworks are referred herein as different types of kernel codes) (“…Embodiments of the invention support concurrent execution of kernel codes that are programmed in more than one programming framework. The kernel codes programmed in different programming frameworks are referred herein as different types of kernel codes. Correspondingly, APIs programmed in different programming frameworks are referred herein as different types of APIs. In one embodiment, the concurrent execution is carried out by a GPU. The GPU may receive commands from a driver module for executing a first kernel code of a first programming framework and a second kernel code of a second programming framework. The commands may include a first set of commands issued by a first API and a second set of commands issued by a second API.
The GPU may concurrently decode the commands with two command decoders. The GPU may assign a first set of shader cores to execute the first kernel code, and assign a second set of shader cores to execute the second kernel code. The numbers of the shader cores in the first set and in the second set may be determined according to the weighs provided by a driver module. The GPU then concurrently executes the first kernel code with the first set of shader cores and the second kernel code with the second set of shader cores according to the decoded commands…” paragraphs 0014). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claim invention to modify the system of Rao and Gummaraju with the teaching of Chen because the teaching of Chen would improve the system of Rao and Gummaraju by providing the ability of a system to handle multiple tasks or processes at the same time, either through interleaving their execution on a single processor or by executing them truly simultaneously on multiple processors, thus allowing for optimal use of computing resources. Allowable Subject Matter Claims 7, 15, 23 and 30 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Reasons for Allowance The following is an examiner’s statement of reasons for allowance: The closest prior art of records, (U.S. Pub. No. 2016/0093012 A1 to Rao et al., U.S. Pat. No. 10,802,807 B1 to Hsu et al. and U.S. Pub. No. 
2013/0160016 A1 to Gummaraju et al.), taken alone or in combination do not specifically disclose or suggest the claimed recitations (claims 7, 15, 23 and 30) of “…wherein the one or more circuits are to cause deferral of deallocation of memory used to execute the two or more portions until the two or more portions have been performed the plurality of times…”, when taken in the context of claims as a whole. Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.” Response to Arguments Applicant’s arguments with respect to claims 1, 9 and 17 have been considered but are moot because the new ground of rejection relied on additional references not applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. As for the Gummaraju prior art, the Examiner disagrees with Applicant’s argument that it does teach or suggest for each of the plurality of times, a different subset of the one or more performance metrics to be obtained. The Gummaraju prior art discloses a method of executing compute kernels on one or more processing units. A Unified kernel scheduler operates to schedule compute kernels on one more types of processors available in heterogeneous computing system using a unified queue. A Work monitor monitors the compute kernels during runtime. Dynamic information such as executing time, consumed energy, compute unit utilization, bandwidth utilization, number of stall cycles due to memory access latencies, total number of stall cycles, and conditional branch divergence are statistically collected or obtained by monitoring compute kernels during execution. The monitoring may be based on hardware, firmware, and/or software based performance counters. 
The monitoring may be performed at configured periodic intervals or by continuous sampling. The execution of compute kernels includes the following in figure 4 (paragraph 0055): “…At operation 404 execution-ready compute kernels are determined…The determination of the readiness for execution of compute kernels may also include considering whether all other application-level dependencies for that kernel have been satisfied…”. In essence, the execution of compute kernels includes executing a dependent kernel or a graph of child kernels. Therefore, statistically collecting or obtaining performance metrics (e.g., dynamic information such as execution time, consumed energy, compute unit utilization, bandwidth utilization, number of stall cycles due to memory access latencies, total number of stall cycles, and conditional branch divergence) includes dynamic information of the dependent or graph of child compute kernels (the claimed subset of performance metrics).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: U.S. Pub. No. 20200098082 A1 to Gutierrez et al., directed to kernels executing on a GPU that allocate memory on demand or free previously allocated memory resources that are no longer needed.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHARLES E ANYA, whose telephone number is (571) 272-3757. The examiner can normally be reached Mon-Fri, 9am-6pm. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KEVIN YOUNG, can be reached at 571-270-3180. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /CHARLES E ANYA/Primary Examiner, Art Unit 2194
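To make the disputed claim-8 limitation concrete: the claimed API exposes code that identifies a range of captured kernels to be re-executed (as in Levit-Gurevich's GLIT replay). The following Python sketch is purely illustrative; `Kernel`, `ReplaySession`, and `set_replay_range` are hypothetical names, not APIs from any cited reference.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    """Stand-in for a captured GPU kernel launch."""
    name: str

    def execute(self):
        # In a real replay tool this would re-run recorded device instructions.
        return f"executed {self.name}"


class ReplaySession:
    """Hypothetical profiling session over an ordered list of captured kernels."""

    def __init__(self, kernels):
        self.kernels = kernels
        self.replay_range = (0, len(kernels))  # default: replay everything

    def set_replay_range(self, start, end):
        # The claim-8 idea: API code identifying the range of kernels
        # (a subset of the captured sequence) to be re-executed.
        self.replay_range = (start, end)

    def replay(self, times=1):
        # Re-execute only the selected range, optionally multiple times
        # (multi-pass profiling re-runs the same portions repeatedly).
        start, end = self.replay_range
        results = []
        for _ in range(times):
            for kernel in self.kernels[start:end]:
                results.append(kernel.execute())
        return results
```

A caller would capture a trace once, then call `set_replay_range(1, 3)` and `replay(times=N)` to re-execute just kernels 1-2 on each of N profiling passes, rather than the whole trace.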
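The Gummaraju-style monitoring the examiner relies on — collecting dynamic information (execution time, energy, utilization, stall cycles, divergence) across a graph of dependent child kernels, with a different subset of counters gathered on each pass — can be sketched as below. `METRIC_SUBSETS`, `profile_kernel_graph`, and `read_counter` are illustrative names invented for this sketch, not terms from the references.

```python
from collections import defaultdict

# Hypothetical grouping of counters: hardware counter slots are limited,
# so each profiling pass reads a different subset.
METRIC_SUBSETS = [
    ("execution_time", "energy"),
    ("compute_utilization", "bandwidth_utilization"),
    ("stall_cycles", "branch_divergence"),
]


def profile_kernel_graph(graph, root, read_counter, passes=3):
    """Run the kernel dependency graph `passes` times, reading a different
    subset of performance counters on each pass (multi-pass profiling).

    `graph` maps a kernel to its dependent child kernels; `read_counter`
    is a callback returning the counter value for (kernel, metric).
    """
    collected = defaultdict(dict)
    for i in range(passes):
        subset = METRIC_SUBSETS[i % len(METRIC_SUBSETS)]
        # Walk from the parent kernel through all dependent child kernels,
        # so the collected metrics cover the whole graph, not just the root.
        stack, seen = [root], set()
        while stack:
            kernel = stack.pop()
            if kernel in seen:
                continue
            seen.add(kernel)
            for metric in subset:
                collected[kernel][metric] = read_counter(kernel, metric)
            stack.extend(graph.get(kernel, []))
    return dict(collected)
```

With three passes, every kernel in the graph ends up with all six metrics even though only two counters are read per pass, which is the "different subset of the one or more performance metrics for each of the plurality of times" pattern at issue.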

Prosecution Timeline

Jan 07, 2022
Application Filed
Jun 29, 2024
Non-Final Rejection — §103
Sep 25, 2024
Interview Requested
Oct 09, 2024
Examiner Interview Summary
Oct 09, 2024
Applicant Interview (Telephonic)
Nov 05, 2024
Response Filed
Feb 05, 2025
Final Rejection — §103
Mar 12, 2025
Interview Requested
Mar 19, 2025
Applicant Interview (Telephonic)
Mar 19, 2025
Examiner Interview Summary
Jun 10, 2025
Request for Continued Examination
Jun 11, 2025
Response after Non-Final Action
Jul 12, 2025
Non-Final Rejection — §103
Jul 26, 2025
Interview Requested
Oct 02, 2025
Applicant Interview (Telephonic)
Oct 02, 2025
Examiner Interview Summary
Oct 15, 2025
Response Filed
Nov 09, 2025
Final Rejection — §103
Dec 11, 2025
Interview Requested
Dec 23, 2025
Applicant Interview (Telephonic)
Dec 23, 2025
Examiner Interview Summary
Mar 11, 2026
Request for Continued Examination
Mar 18, 2026
Response after Non-Final Action
Mar 20, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591471
KNOWLEDGE GRAPH REPRESENTATION OF CHANGES BETWEEN DIFFERENT VERSIONS OF APPLICATION PROGRAMMING INTERFACES
2y 5m to grant Granted Mar 31, 2026
Patent 12591455
PARAMETER-BASED ADAPTIVE SCHEDULING OF JOBS
2y 5m to grant Granted Mar 31, 2026
Patent 12585510
METHOD AND SYSTEM FOR AUTOMATED EVENT MANAGEMENT
2y 5m to grant Granted Mar 24, 2026
Patent 12579014
METHOD AND A SYSTEM FOR PROCESSING USER EVENTS
2y 5m to grant Granted Mar 17, 2026
Patent 12572393
CONTAINER CROSS-CLUSTER CAPACITY SCALING
2y 5m to grant Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

5-6
Expected OA Rounds
82%
Grant Probability
99%
With Interview (+33.5%)
3y 2m
Median Time to Grant
High
PTA Risk
Based on 891 resolved cases by this examiner. Grant probability derived from career allow rate.
