DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending.
Response to Arguments
Regarding 35 U.S.C. 112:
Applicant’s amendments and arguments regarding the rejection of claims 1-20 under 35 U.S.C. 112(b) have been fully considered and are found to be persuasive. The rejections of claims 1-20 under 35 U.S.C. 112(b) are withdrawn. Applicant’s amendments introduce new issues under 35 U.S.C. 112(b) detailed below.
Regarding 35 U.S.C. 101:
Applicant’s amendments and arguments regarding the rejection of claims 1-20 under 35 U.S.C. 101 have been fully considered and are found to be persuasive. The rejections of claims 1-20 under 35 U.S.C. 101 are withdrawn as the amendment reciting processing of the frames integrates the judicial exception of allocation into a practical application.
Regarding Prior Art Rejections:
Applicant’s amendments and arguments regarding the rejection of claims 1-20 under 35 U.S.C. 103 have been fully considered and are found to be not persuasive. The rejections of claims 1-20 under 35 U.S.C. 103 are maintained.
Applicant’s remarks recite:
Ibryam teaches increasing a quantity of threads to process received messages until performance metrics are met. Ibryam teaches adjusting a quantity of threads to process currently received messages. Ibryam does not consider usage of shared resources, but considers performance metrics. Ibryam does not teach determine usage of shared hardware resources by the GPU engines for one or more current frames of the application and dispatch an adjusted number of threads for the GPU engines to process the next frame.
Usage of shared hardware resources, under the broadest reasonable interpretation in light of the specification, includes prior performance of tasks run by the threads (i.e., usage of the threads to execute past tasks). Ibryam utilizes performance metrics as insight into how the thread pool is being utilized to perform work. If the system determines that the performance metrics indicate slow processing, then the pool is too small. The system may respond by increasing the number of threads to encourage faster processing of tasks. Duluk Jr. teaches adjusting a number of threads for the purposes of graphics processing. It would have been obvious to one of ordinary skill in the art to utilize the thread pool feedback system of Ibryam in the frame processing system of Duluk Jr.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 3, 5, 7, 8, 9, 10, 11, 12, and 18 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 3, 5, 7, 8, 9, 10, 11, 12, and 18 recite an inconsistency affecting clarity. The claims recite that the first circuitry is to determine usage of the shared hardware resources, however previously in claim 1, the second circuitry is the circuitry used to determine usage of the shared hardware resources. It is unclear which circuitry is determining usage of the shared hardware resources. Examiner interprets the first and second circuitry to both be able to determine usage of the shared hardware resources utilizing a shared form of resource monitoring or querying.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 11, 13, 15-17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Duluk Jr. et al. US 8429656 B1 (hereinafter “Duluk Jr.”) in view of Ibryam US 20210157649 A1 (hereinafter “Ibryam”).
Regarding claim 1, Duluk Jr. teaches the invention substantially as claimed including:
A graphics processing unit (GPU) (Fig 2 GPU 122) comprising:
first circuitry to receive commands for GPU engines of GPU to render and process frames of an application (Fig 2 Frontend 204; Col 4 Lines 51-53 Front end 204 receives state information (STATE), rendering commands (CMD), and geometry data (GDATA), e.g., from CPU 102 of FIG. 1; Col 5 Lines 28-36 Front end 204 directs the geometry data to data assembler 206. Data assembler 206 formats the geometry data and prepares it for delivery to a geometry module 218 in multithreaded core array 202. Geometry module 218 directs programmable processing engines (not explicitly shown) in multithreaded core array 202 to execute vertex and/or geometry shader programs on the vertex data, with the programs being selected in response to the state information provided by front end 204); and
second circuitry (Fig 2 ROP 214) to: determine usage of shared hardware resources by the GPU engines for one or more frames of the application (Fig 5B; Col 15 Lines 38-49 As shown in FIG. 5B, there are three groupings of execution units and corresponding memory allocations in frame buffer 226. The first group includes vertex shader unit 502 and a corresponding thread specific memory space of size 512 kilobytes, along with a stack space of size 32 kilobytes. The second group includes geometry shader unit 504 and a corresponding thread specific memory space of size 512 kilobytes, along with a stack space of size 32 kilobytes. The third group includes pixel shader unit 506 and a corresponding thread specific memory space of size 1024 kilobytes (shown as approximately 1 megabyte, or "1M"), along with a stack space of size 64 kilobytes; Col 7 Lines 25-32 at certain times, a given processing engine may operate as a vertex shader … as a geometry shader … as a pixel shader, receiving and executing pixel shader program instructions),
adjust a number of threads to dispatch to the shared hardware resources for a next frame for the GPU engines based on the determined usage of the shared hardware resources for the one or more frames (Col 14 Line 57- Col 15 Line 3 Thread count throttling may occur during operation of GPU 122. Continuing with the example discussed above, the application program may decide at a later point in time to re-define the thread-specific memory required per thread, to a value of 4096 kilobytes. This may occur, for example, if the application program is about to take on a routine that is more memory intensive. Here, the driver software re-calculates that given the total amount of thread-specific memory allocation of 2048 kilobytes and the newly defined per thread memory requirement of 4 kilobytes, there is only enough thread-specific memory in frame buffer 226 available for 512 threads. Thus, the driver software may send a command to GPU 122, to limit the total number of threads executed by multithreaded core array 202 to a reduced number of 512 threads), and
dispatch the adjusted number of threads for the GPU engines to process the next frame after the one or more frames (Col 14 Line 57 - Col 15 Line 3 Thread count throttling may occur during operation of GPU 122 … if the application program is about to take on a routine that is more memory intensive).
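For illustration only, the arithmetic in the cited throttling passage can be sketched as follows. The function name `max_threads` is hypothetical and does not appear in Duluk Jr.; the sketch assumes the 2048 kilobyte total allocation and the 4 kilobyte per-thread requirement that the quoted passage uses to arrive at 512 threads.

```python
def max_threads(total_kb: int, per_thread_kb: int) -> int:
    """Largest thread count whose thread-specific memory fits in the allocation.

    Hypothetical helper: illustrates the division described in the cited
    passage, not the driver software of Duluk Jr.
    """
    if per_thread_kb <= 0:
        raise ValueError("per-thread memory must be positive")
    return total_kb // per_thread_kb


# Figures taken from the quoted passage: 2048 KB total, 4 KB per thread.
limit = max_threads(total_kb=2048, per_thread_kb=4)
print(limit)  # 512
```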
While Duluk Jr. teaches adjusting thread counts for GPU execution units based on memory intensity of tasks, it does not explicitly teach adjusting the number of threads based on the determined usage of the shared hardware resources for the one or more frames.
However, Ibryam teaches adjusting the number of threads based on the determined usage of the shared hardware resources for the one or more frames ([0015] a consumer may process messages in parallel with ten threads and may do so in a certain amount of time. The consumer may measure various performance metrics during that time, or at the end of that time, and may increase the quantity of threads for processing received messages (e.g., to 15, to 20, to 30, to 60, etc.) until one or more of the measured performance metrics degrade to a preconfigured threshold (e.g., a certain latency is reached). Upon measuring a degradation to the preconfigured threshold, the consumer may decrease the quantity of threads, for example, to the quantity of threads used prior to the degradation inducing quantity. At some subsequent time, if the performance metrics do not further degrade, the consumer may then attempt to increase the quantity of threads again and measure the responsive performance metrics).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Duluk Jr. with Ibryam because Ibryam’s teaching of optimizing the number of threads available for processing based on performance data gathered from previous usage of the shared hardware resources would have provided Duluk Jr.’s system with the advantage and capability to dynamically adjust the number of threads to improve the performance of task processing (see Ibryam, [0012] Allocating the optimal quantity of threads, however, can be a challenge. If a consumer allocates too few threads, the consumer may process all of the messages, but may be underutilized and could process more messages if more threads were allocated. If a consumer allocates too many threads, the consumer may become oversaturated with messages and stall out because the consumer runs out of memory resources. A consumer may also experience network contention issues if it allocates too many threads).
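As a non-limiting sketch of the feedback behavior relied upon from Ibryam [0015], the loop below raises the thread count until a measured latency degrades to a threshold and then falls back to the last count used before the degradation. All names (`tune_thread_count`, `fake_latency`) and the numeric values are assumptions made for illustration; they are not drawn from Ibryam or from the claims.

```python
from typing import Callable


def tune_thread_count(measure_latency: Callable[[int], float],
                      start: int,
                      step: int,
                      latency_threshold: float,
                      max_threads: int) -> int:
    """Increase the thread count until latency reaches the threshold,
    then return the last count that stayed below it."""
    current = start
    while current + step <= max_threads:
        candidate = current + step
        if measure_latency(candidate) >= latency_threshold:
            # Degradation observed: keep the quantity used before it.
            return current
        current = candidate
    return current


def fake_latency(threads: int) -> float:
    """Hypothetical latency model (ms): flat up to 30 threads, then degraded."""
    return 10.0 if threads <= 30 else 50.0


best = tune_thread_count(fake_latency, start=10, step=5,
                         latency_threshold=40.0, max_threads=60)
print(best)  # 30
```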
Regarding claim 2, Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
Duluk Jr. further teaches wherein the second circuitry to adjust the number of threads to dispatch is to: increase or decrease a limit of hardware threads that one or more of the GPU engines can occupy for a frame (Col 13 Lines 62-64 the number of threads carried out by an execution unit in a GPU may be dynamically throttled; Col 17 lines 18-28 thread count throttling can be turned ON or OFF to dynamically change the number of threads executed by multithread core array 202 … Specifically, multithread core array 202 may support two different modes of operation for thread count throttling--an ON mode and an OFF mode. In the ON mode, multithread core array 202 does not use thread-specific memory in frame buffer 226, but multithread core array 202 is allowed to execute the maximum number of threads that its hardware is capable of supporting; Col 17 lines 31-34 In the OFF mode, multithread core array 202 does use thread-specific memory in frame buffer 226, but multiprocessor core array 202 is only allowed to execute a reduced number of threads; Examiner notes: turning the mode from OFF to ON increases the limit of threads to the maximum number allowed by hardware).
Regarding claim 11, Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
Ibryam further teaches wherein: the first circuitry is to determine the usage of the shared hardware resources after a predetermined number of frames of the application ([0021] the performance metrics tracker 172 may measure the performance metrics each time a predetermined quantity of messages are processed (e.g., 2, 5, 10, 20, 50, 100); [0038] the performance metrics tracker 172 may measure the latency of the consumer 150 … for a predetermined quantity of messages (e.g., 500 messages)).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the frame processing teaching of Duluk Jr. with the performance metric determination of Ibryam, resulting in a system that determines performance metrics of processing resources after a certain number of frames have been processed. This combination would have provided Duluk Jr.’s system with the advantage and capability to have access to recently updated performance metrics detailing the usage of shared GPU resources for optimizing resource allocations in future processing (see Ibryam, [0013] Determining where to set the optimal fixed quantity is difficult, however, and typically requires gradually resetting and increasing the quantity of threads, and testing until an optimal quantity is determined. Additionally, the optimal quantity of threads for the consumer to allocate does not remain consistent. For instance, the optimal quantity of threads may be dependent on downstream systems that the consumer is accessing, which are also utilizing the thread pool. Additionally or alternatively, the optimal quantity may vary across the environments the consumer is in and may vary across time. Accordingly, a fixed quantity of threads may often lead to a consumer becoming oversaturated or being underutilized; also see performance metric tracking frequency in [0021]).
Regarding claim 13, Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
Duluk Jr. further teaches wherein: the shared hardware resources include one or more of: processing unit, memory, a sampler, a texture unit, and a cache (Fig 5B Frame Buffer 520 Thread-specific Memory space).
Regarding claim 15, it is the system of claim 1. Therefore, it is rejected for the same reasons as claim 1.
Duluk Jr. further teaches a memory device coupled with the GPU (Fig 1 System Memory 104 – Memory Bridge 105 – Graphics Subsystem 112; Fig 2 System Memory 104 – GPU 122).
Regarding claim 16, Duluk Jr. and Ibryam teach the system of claim 15.
Duluk Jr. further teaches one or more of: a central processing unit (CPU) (Fig 1 CPU 102; Col 2 Lines 65-66 Computer system 100 includes a central processing unit (CPU) 102) and a display (Fig 1 Display 110; Col 3 Lines 8-9 Visual output is provided on a pixel based display device 110).
Regarding claim 17, it is the system of claim 2. Therefore, it is rejected for the same reasons as claim 2.
Regarding claims 19 and 20, they are the methods of claims 1 and 2 respectively. Therefore, they are rejected for the same reasons as claims 1 and 2 respectively.
Claims 3 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Duluk Jr. et al. US 8429656 B1 in view of Ibryam US 20210157649 A1 in further view of Green US 20070008324 A1 (hereinafter “Green”).
Regarding claim 3, Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
Duluk Jr. further teaches command buffers for the GPU engines for the one or more frames of the application (Col 3 Lines 46-48 CPU 102 writes a stream of commands for GPU 122 to a command buffer; Col 3 Lines 50-57 GPU 122 reads the command stream from the command buffer and executes commands asynchronously with operation of CPU 102. The commands may include conventional rendering commands for generating images as well as general-purpose computation commands that enable applications executing on CPU 102 to leverage the computational power of GPU 122 for data processing that may be unrelated to image generation; Examiner notes: the command buffer for GPU 122 can be split into multiple buffers accepting different types of commands for use of each specialized GPU engine).
Duluk Jr. and Ibryam do not explicitly teach wherein to determine the usage of the shared hardware resources, the first circuitry is to: determine a total execution time for command buffers for the GPU engines for the one or more frames of the application.
However, Green teaches wherein to determine the usage of the shared hardware resources, the first circuitry is to: determine a total execution time for command buffers for the GPU engines for the one or more frames of the application ([0039] Each time a graphics command buffer is run, the time it took to execute is measured).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Duluk Jr. and Ibryam with Green because Green’s teaching of measuring command buffer execution durations would have provided Duluk Jr. and Ibryam’s system with the advantage and capability to monitor the performance of shared graphics resource allocations with the intention to optimize the allocations (see Green, [0031] the graphics proxy process 314 will be able to exercise greater control in the exemplary scenario where application A 316 attempts to take over all the entire graphic adapter resources 326 by allocating, say, a large number of graphics surfaces. In that situation, the graphics proxy process 314 can make sure that application B 318 is not starving for graphics adapter resources 326. This is of course just one typical scenario, and other scenarios are discussed below; [0035] there are at least two scenarios where the graphics adapter 310 and the graphics adapter resources 326 might have to be shared among partitions, such as VCP A 306 and VCP B 308. The first scenario is time sharing of the GPU (Graphics Processing Unit) that is interfaced by the graphics adapter 310).
Regarding claim 18, it is the system of claim 3. Therefore, it is rejected for the same reasons as claim 3.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Duluk Jr. et al. US 8429656 B1 in view of Ibryam US 20210157649 A1 in view of Green US 20070008324 A1 in further view of Basu et al. US 10592279 B2 (hereinafter “Basu”).
Regarding claim 4, Duluk Jr, Ibryam, and Green teach the graphics processing unit of claim 3.
Ibryam further teaches wherein to adjust the number of threads to dispatch, the second circuitry is to: increase a number of hardware threads that a GPU engine of the GPU engines can occupy for a frame if execution time for the GPU engine was greater than execution time for a second GPU engine in the one or more frames ([0015] The consumer may measure various performance metrics during that time, or at the end of that time, and may increase the quantity of threads for processing received messages (e.g., to 15, to 20, to 30, to 60, etc.) until one or more of the measured performance metrics degrade to a preconfigured threshold (e.g., a certain latency is reached); [0024] the predetermined threshold may be a previously measured performance metric, such as a measured performance metric prior to processing the first plurality of messages. For example, the measured performance metric may be the directly preceding measured performance metric prior to processing the first plurality of messages. In such instances, determining whether a respective performance metric meets its predetermined threshold includes comparing the measured respective performance metric after processing the first plurality of messages with a value of the performance metric that was measured prior to processing the first plurality of messages); and
decrease the number of hardware threads that the GPU engine can occupy for a frame if execution time for the GPU engine was less than the execution time for the second GPU engine in the one or more frames ([0036] the consumer 150 may decrease the allocation of threads upon the first time that the performance metrics tracker 172 of the consumer 150 measures a latency above its predetermined threshold).
Duluk Jr., Ibryam, and Green do not explicitly teach if execution time for the GPU engine was greater than execution time for a second GPU engine in the one or more frames and if execution time for the GPU engine was less than the execution time for the second GPU engine in the one or more frames.
However, Basu teaches if execution time for the GPU engine was greater than execution time for a second GPU engine in the one or more frames and if execution time for the GPU engine was less than the execution time for the second GPU engine in the one or more frames (Claim 4 the controller is further configured to determine the task executing on the first processor as the lagging task by: when a predetermined sampling period is completed, comparing an execution time of the task executing on the first processor during the sampling period to execution times of one or more other tasks executing on the at least one of the first processor and the second processor during the sampling period; and identifying the task executing on the first processor as the lagging task when the execution time of the task executing on the first processor is greater than the execution times of the one or more other tasks).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Duluk Jr., Ibryam, and Green with Basu because Basu’s teaching of comparing execution times between two processors would have provided Duluk Jr., Ibryam, and Green’s system with the advantage and capability to determine resources with relatively slow execution times ultimately informing improved resource allocations for reducing wasted processor idle time (see Basu, Col 1 Line 58 – Col 2 Line 2 During processing of a program, when a task lags behind (e.g., takes longer to complete execution) other tasks due to various factors (e.g., complex branching behavior, an irregular memory access pattern, and/or an unpredictable amount of work), one or more of the other tasks can be caused to wait for the lagging task to complete execution, such as tasks which have completed other processing stages (e.g., fetch and decode stages) but cannot execute because they are dependent on data resulting from the execution of the lagging task. One or more of these lagging tasks can bottleneck the execution of a program and, therefore, delay the execution of the program).
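For illustration only, the comparison recited in claim 4 as mapped above can be sketched as follows: the hardware thread limit of one engine is raised when its execution time exceeded that of a second engine and lowered when it was shorter. The function `rebalance`, the engine names, and the step size are hypothetical and are not taken from any cited reference.

```python
from typing import Dict


def rebalance(limits: Dict[str, int],
              exec_time_ms: Dict[str, float],
              first: str,
              second: str,
              step: int = 1,
              floor: int = 1) -> Dict[str, int]:
    """Return new thread limits after comparing two engines' execution times."""
    new_limits = dict(limits)
    if exec_time_ms[first] > exec_time_ms[second]:
        # The first engine lagged: allow it more hardware threads next frame.
        new_limits[first] = limits[first] + step
    elif exec_time_ms[first] < exec_time_ms[second]:
        # The first engine finished early: give back threads, never below floor.
        new_limits[first] = max(floor, limits[first] - step)
    return new_limits


# Hypothetical engines and timings for one frame.
limits = {"render": 8, "compute": 8}
slower = rebalance(limits, {"render": 12.0, "compute": 7.5}, "render", "compute")
faster = rebalance(limits, {"render": 5.0, "compute": 7.5}, "render", "compute")
print(slower["render"], faster["render"])  # 9 7
```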
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Duluk Jr. et al. US 8429656 B1 in view of Ibryam US 20210157649 A1 in further view of Rana et al. US 20230122295 A1 (hereinafter “Rana”).
Regarding claim 5, Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
Duluk Jr. teaches two adjacent command buffers in each of the GPU engine's command queues (Col 3 Lines 46-57 In some embodiments, CPU 102 writes a stream of commands for GPU 122 to a command buffer, … GPU 122 reads the command stream from the command buffer and executes commands asynchronously with operation of CPU 102. The commands may include conventional rendering commands for generating images as well as general-purpose computation commands that enable applications executing on CPU 102 to leverage the computational power of GPU 122 for data processing that may be unrelated to image generation; Fig 4 Issue Logic 424; Col 12 Lines 11-13 issue logic 424 maintains a queue of fetched instructions for each in-flight SIMD group; Fig 6; Col 16 Lines 4-12 the 16 processing engines 402 are also multithreaded in nature. Thus, at a first instant in time, the 16 processing engines 402 may be operating on a SIMD group corresponding to an instruction from a particular program. At a second instant in time, the 16 processing engines 402 may switch context to operate on a different SIMD group corresponding to an instruction from another program. In this manner, the 16 processing engines 402 may switch context between programs, for example, up to 24 (G=24) programs; Examiner notes: the issue logic’s queues of fetched instructions contain adjacent instructions for each GPU processing engine to process the command buffer commands for frame processing).
While Ibryam teaches measuring processor utilization as a percentage, Duluk Jr. and Ibryam do not explicitly teach determining a time gap between two adjacent command buffers in each of the GPU engine's command queues at a synchronization point.
However, Rana teaches wherein the circuitry to determine the usage of shared hardware resources is to: determine a time gap between two adjacent command buffers in each of the GPU engine's command queues at a synchronization point (Fig 7; [0033] an idle duration, as used herein, refers to an amount of time a thread is not engaged in performing some action related to its assigned process, including times when the thread is completely idle (i.e., not doing any work) … Idle duration, as used herein, includes time measurements for things such as … [0036] Idle at a synchronization point).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the processor engine instruction queues of Duluk Jr. and Ibryam with the idle time measurement of Rana resulting in a system that is able to identify the idle time of processors between instruction executions at a synchronization point. Rana’s teaching of tracking thread idle times would have provided Duluk Jr. and Ibryam’s system with the advantage and capability to understand which resources are being wasted or over/under allocated in efforts to optimize resource allocation for best performance (see Rana, [0020-0023] … In general, multithreading is most beneficial when done with a proper number of threads (i.e., considered as an optimal number of threads). Too many threads results in unnecessary overhead that exceeds performance gains (referred to as detrimental multithreading), whereas too few threads leave room for improved performance (referred to as sub-optimal multithreading). The point at which a system reaches peak performance can be referred to as the equilibrium point … ).
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Duluk Jr. et al. US 8429656 B1 in view of Ibryam US 20210157649 A1 in view of Rana et al. US 20230122295 A1 in further view of Glenny et al. US 20180276097 A1 (hereinafter “Glenny”).
Regarding claim 6, Duluk Jr., Ibryam, and Rana teach the graphics processing unit of claim 5.
Rana further teaches wherein to adjust the number of threads to dispatch the second circuitry is to: adjust the number of threads to dispatch to the shared hardware resources for one or more of the GPU engines if a difference between the time gaps exceeds a threshold (Fig 7 Differences between gaps between busy sections seen in Workers 1-4; [0028] If the mechanisms detect detrimental multithreading, the mechanisms project that a lower thread count will be more optimal (i.e., complete the process cycle more quickly) for a subsequent cycle of the process. On the other hand, if the mechanisms detect sub-optimal multithreading, the mechanisms project that a higher thread count will be more optimal (i.e., complete the process cycle more quickly) for a subsequent cycle of the process).
Duluk Jr., Ibryam, and Rana do not explicitly teach adjusting the number of threads to dispatch to the shared hardware resources for one or more of the GPU engines if a difference between the time gaps exceeds a threshold.
However, Glenny teaches adjusting the hardware resources if a difference between the time gaps exceeds a threshold ([0035] when a magnitude of a difference between the determined time and the expected time exceeds a threshold, the first processor 904 can transmit a reset signal to the second processor 904. The reset signal can cause the second processor to reset, reinitialize a program, reinitialize data areas, the like, and/or any combination of the foregoing. The reset signal can cause the second processor to disable, be held in an indefinite reset, the like, and/or any combination of the foregoing).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Duluk Jr., Ibryam, and Rana with Glenny because Glenny’s teaching of differencing a waiting time with an expected waiting time and comparing the difference with an acceptable threshold would have provided Duluk Jr., Ibryam, and Rana’s system with the advantage and capability to optimize thread usage and cost efficiency by comparing the idle times for various processing engines and performing the corrective action of adjusting the thread counts to reduce resource waste/underutilization in the event of a resource being more idle than its peers prior to a synchronization point (see Glenny, [0024-0025] if the supervisor processor receives the signal, but the reception time of the signal is outside of the regular interval, then the supervisor processor can take corrective action on behalf of the monitored processor. Corrective action can include sending a reset signal to the monitored processor, causing the monitored processor to reset, reinitializing a program and/or data areas associated with the monitored processor, the like, and/or any combination of the foregoing. If a problem with the monitored processor persists, corrective action can include disabling the monitored processors, holding the monitored processor in a reset indefinitely, creating a notification for a user, the like, and/or any combination of the foregoing. In this way, the systems and methods according to example aspects of the present disclosure can have a number of beneficial effects and benefits. For instance, example aspects of the present disclosure can have a technical effect of monitoring processor performance without increasing the size, weight, or cost of computing hardware).
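As a non-limiting sketch of the threshold test discussed for claim 6, the check below compares the idle gaps observed for each engine at a synchronization point and reports whether their spread exceeds a threshold, which is the condition under which the combined system would adjust thread counts. The function `gaps_diverge`, the engine names, and the millisecond values are assumptions for illustration and do not appear in Rana or Glenny.

```python
from typing import Dict


def gaps_diverge(gap_ms_by_engine: Dict[str, float], threshold_ms: float) -> bool:
    """True when the largest and smallest idle gaps differ by more than the threshold."""
    gaps = list(gap_ms_by_engine.values())
    if len(gaps) < 2:
        return False  # nothing to compare against
    return max(gaps) - min(gaps) > threshold_ms


# Hypothetical idle gaps measured at one synchronization point.
needs_adjustment = gaps_diverge({"render": 0.4, "compute": 2.9}, threshold_ms=2.0)
print(needs_adjustment)  # True
```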
Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Duluk Jr. et al. US 8429656 B1 in view of Ibryam US 20210157649 A1 in further view of Yeung et al. US 20190258528 A1 (hereinafter “Yeung”).
Regarding claim 7, Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
Duluk Jr. teaches command buffer (Col 3 Lines 46-57 In some embodiments, CPU 102 writes a stream of commands for GPU 122 to a command buffer, which may be in system memory 104, graphics memory 124, or another storage location accessible to both CPU 102 and GPU 122. GPU 122 reads the command stream from the command buffer and executes commands asynchronously with operation of CPU 102. The commands may include conventional rendering commands for generating images as well as general-purpose computation commands that enable applications executing on CPU 102 to leverage the computational power of GPU 122 for data processing that may be unrelated to image generation).
Duluk Jr. and Ibryam do not explicitly teach wherein to determine the usage of the shared hardware resources the first circuitry is to: store timestamps to indicate a beginning and an ending of execution of command buffers for the GPU engines for the one or more frames of the application.
However, Yeung teaches wherein to determine the usage of the shared hardware resources the first circuitry is to: store timestamps to indicate a beginning and an ending of execution of command buffers for the GPU engines for the one or more frames of the application ([0028] The GPU driver 222 submits the workloads to the GPU 224, monitoring and saving the workload processing start and completion times).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Duluk Jr. and Ibryam with Yeung because Yeung’s teaching of utilizing a GPU driver to track and send out workload start and end times would have provided Duluk Jr. and Ibryam’s system with the advantage and capability to optimize the resource allocation based on the execution times of the workloads by providing the execution times of the command buffers in Duluk Jr. and Ibryam (see Yeung, [0003] Each of the one or more workloads are associated with completion deadline information and execution metadata representing execution guidance for the workload. The described technology further generates a processor performance adjustment for each of the one or more workloads using a performance model providing the processor performance adjustment based on the completion deadline information and the execution metadata for each of the one or more workloads).
Regarding claim 8, Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
Duluk Jr. and Ibryam do not explicitly teach receiving, from a driver for the graphics processing unit, timestamps to indicate a beginning and an ending of execution for command buffers for the GPU engines for the one or more frames of the application.
However, Yeung teaches wherein to determine the usage of the shared hardware resources the first circuitry is to: receive, from a driver for the graphics processing unit, timestamps to indicate a beginning and an ending of execution for command buffers for the GPU engines for the one or more frames of the application ([0028] The GPU driver 222 submits the workloads to the GPU 224, monitoring and saving the workload processing start and completion times, returning these times to the operating system power management layer 220. The operating system power management layer 220 can use this information to process each workload with more performance (e.g., “high speed”) or better power efficiency).
Claims 9 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Duluk Jr. et al. US 8429656 B1 in view of Ibryam US 20210157649 A1 in further view of Kalele et al. US 20210304350 A1 (hereinafter “Kalele”).
Regarding claim 9. Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
While Duluk Jr. teaches sending a command to a GPU for limiting the number of threads it may utilize and Ibryam teaches adjusting the quantity of threads of a thread pool, Duluk Jr. and Ibryam do not explicitly teach storing a count to indicate a number of hardware threads occupied by the GPU engines for the one or more frames of the application.
However, Kalele teaches wherein to determine the usage of the shared hardware resources the first circuitry is to: store a count to indicate a number of hardware threads occupied by the GPU engines for the one or more frames of the application ([0026] the system 100 for tuning GPU parameters of a GPU kernel; [0027] the GPU parameters to be tuned include a thread per block parameter…).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Duluk Jr. and Ibryam with Kalele because Kalele’s teaching of the tuning of specific GPU parameters for a GPU kernel including a number of threads available per block would have provided Duluk Jr. and Ibryam’s system with the advantage and capability to keep track of various configuration settings that are able to optimize the performance of the GPU (see Kalele, [0003] Although the GPUs provide high performance, exploiting their complete performance potential is a challenging task, more particularly determining/tuning a set of GPU parameters that have a significant impact on performance of the GPU kernel is a challenging aspect).
Regarding claim 10. Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
Kalele further teaches wherein to determine the usage of the shared hardware resources the first circuitry is to: store a count to indicate a number of the shared hardware resources used by the GPU engines for the one or more frames of the application ([0027] the GPU parameters to be tuned include a … number of streams (n-streams) ... a set of streaming multiprocessors (SM) that are shared among data parallel threads; Examiner notes: the number of streams indicates a separate processing pipeline to some resource of the shared streaming multiprocessors).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Duluk Jr. and Ibryam with Kalele because Kalele’s teaching of storing the GPU parameter: number of streams available to processing resources would have provided Duluk Jr. and Ibryam’s system with the advantage and capability to maintain GPU parameters that may be optimized and tuned for improved utilization of the shared GPU resources (see Kalele, [0007] receiving a plurality of data regarding a GPU application, wherein the plurality of data regarding the GPU applications includes a plurality of GPU parameters to be tuned for optimal functioning of GPU kernels).
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Duluk Jr. et al. US 8429656 B1 in view of Ibryam US 20210157649 A1 in further view of Krishnamoorthy et al. US 10776753 B1 (hereinafter “Krishnamoorthy”).
Regarding claim 12. Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
Duluk Jr. teaches determining the usage of the shared hardware resources (Fig 5B; Col 15 Lines 38-49 As shown in FIG. 5B, there are three groupings of execution units and corresponding memory allocations in frame buffer 226. The first group includes vertex shader unit 502 and a corresponding thread specific memory space of size 512 kilobytes, along with a stack space of size 32 kilobytes. The second group includes geometry shader unit 504 and a corresponding thread specific memory space of size 512 kilobytes, along with a stack space of size 32 kilobytes. The third group includes pixel shader unit 506 and a corresponding thread specific memory space of size 1024 kilobytes (shown as approximately 1 megabyte, or "1M"), along with a stack space of size 64 kilobytes; Col 7 Lines 25-32 at certain times, a given processing engine may operate as a vertex shader … as a geometry shader … as a pixel shader, receiving and executing pixel shader program instructions; Col 14 Lines 39-44 driver software executing on CPU 102 (shown in FIG. 2) may calculate the appropriate value that represents the reduced number of threads. The calculation may take into account (1) a defined total amount of thread-specific memory allocation available and (2) the amount of thread-specific memory required per thread; Col 16 Lines 27-34 The thread-specific memory space allocated in frame buffer 226 may be divided into blocks. One block is assigned to each thread. That is, each thread of execution is allotted a separate block of memory within the thread-specific memory space. Just as an example, each thread may use its dedicated block of memory as a "scratch" area to store intermediate results; Examiner notes: usage of the thread specific memory allocations is determined in order to change the thread limit and map blocks of memory to specific threads) and submissions of command buffers to one or more of the GPU engines (Col 3 Lines 46-57 CPU 102 writes a stream of commands for GPU 122 to a command buffer, which may be in system memory 104, graphics memory 124, or another storage location accessible to both CPU 102 and GPU 122. GPU 122 reads the command stream from the command buffer and executes commands asynchronously with operation of CPU 102. The commands may include conventional rendering commands for generating images as well as general-purpose computation commands that enable applications executing on CPU 102 to leverage the computational power of GPU 122 for data processing that may be unrelated to image generation).
Duluk Jr. and Ibryam do not explicitly teach wherein: the first circuitry is to determine the usage of the shared hardware resources after a predetermined number of submissions of command buffers to one or more of the GPU engines.
However, Krishnamoorthy teaches wherein: the first circuitry is to determine the usage of the shared hardware resources after a predetermined number of submissions of command buffers to one or more of the GPU engines (Col 10 Lines 27-36 the data pipeline manager 112 is configured to determine a number of incomplete tasks based on a task queue comprising all incomplete tasks, and to instantiate new task nodes based on task configuration data if the number of incomplete tasks exceeds a specified threshold value. The specified threshold may indicate that the number of available task nodes is insufficient to process the number of incomplete tasks within a desired period of time. The data pipeline manager 112 may associate the new task processes with a particular job and a task from the task queue).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Duluk Jr. and Ibryam with Krishnamoorthy because Krishnamoorthy’s teaching of utilizing a threshold number of incomplete tasks in a queue to determine shared resource usage would have provided Duluk Jr. and Ibryam’s system with the advantage and capability to adapt to an increasing load of requested tasks for the GPU engines by keeping current with the system utilization in order to best respond once the number of command buffers reaches a certain load (see Krishnamoorthy, Col 1 Line 64 – Col 2 Line 9 The inability to provide such fine-grained control may force inefficient resource allocation by the application service provider and prevents the application service provider from updating target data storage units, per customer, in near real-time. Lack of such fine-grained control also fails the application service provider from ensuring that data per customer is also consistent across source data storage unit and target data storage unit in an efficient manner. Thus, the inability to update target data storage units, on a per customer basis, while ensuring data consistency across source and target data storage units may significantly reduce the value of the service provided by the application service provider).
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Duluk Jr. et al. US 8429656 B1 in view of Ibryam US 20210157649 A1 in further view of Kini et al. US 20140344528 A1 (hereinafter “Kini”).
Regarding claim 14. Duluk Jr. and Ibryam teach the graphics processing unit of claim 1.
Duluk Jr. further teaches wherein: the GPU engines include a render engine (Col 7 Lines 25-32 a given processing engine may operate as a vertex shader, receiving and executing vertex program instructions; at other times the same processing engine may operates as a geometry shader, receiving and executing geometry program instructions; and at still other times the same processing engine may operate as a pixel shader, receiving and executing pixel shader program instructions).
While Duluk Jr. teaches a GPU with clusters, cores, and processing engines that are able to perform general-purpose computing (Col 3 Lines 50-57 GPU 122 reads the command stream from the command buffer and executes commands asynchronously with operation of CPU 102. The commands may include conventional rendering commands for generating images as well as general-purpose computation commands that enable applications executing on CPU 102 to leverage the computational power of GPU 122 for data processing that may be unrelated to image generation), it does not explicitly teach a compute engine and a copy engine.
However, Kini teaches a GPU containing a compute engine and a copy engine (Fig 2 PPU 202-1 Compute Engine 220 Copy Engine 240; [0022] the parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU)).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined the teaching of Duluk Jr. and Ibryam with Kini because Kini’s teaching of a PPU containing a compute engine and copy engine would have provided Duluk Jr. and Ibryam’s system with the advantage and capability to have dedicated partitions of the GPU for the purposes of general computing and memory I/O for improved separation of concerns during parallel processing (see Kini, [0006] Different types of parallel processing subsystem resources operate on different types of work components. For example, compute engines execute computational work components, and copy engines execute memory copies. Parallel processing subsystems are typically configured to receive work components via hardware channels, with each hardware channel dedicated to an appropriate type of work component).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HARRISON LI whose telephone number is (703) 756-1469. The examiner can normally be reached Monday-Friday 9:00am-5:30pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee Li, can be reached on 571-272-4169. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/H.L./
Examiner, Art Unit 2195
/Aimee Li/
Supervisory Patent Examiner, Art Unit 2195