Detailed Action
1. This office action is in response to the communication filed December 30, 2025. Claims 1-24 are currently pending, and claims 1 and 23 are the independent claims.
Notice of Pre-AIA or AIA Status
2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
3. This Final Office Action is in response to the applicant's remarks and arguments filed on December 30, 2025.
Claims 1-3, 6-7, 10, 13, 16, and 23 were amended. No claims have been cancelled. No claims are new. Claims 1-24 remain pending in the application. Claims 4-5, 8-9, 11-12, 14-15, 17-22, and 24, filed on August 12, 2022, are being considered on the merits along with amended claims 1-3, 6-7, 10, 13, 16, and 23.
Response to Arguments
4. Applicant's arguments filed December 30, 2025 have been fully considered but they are not persuasive.
On page 15 of the Remarks, the Applicant respectfully traverses the 102 rejections for at least the reason that Guim Bernat does not partition a plurality of execution units into a “first slice” and “one or more second slices.”
The Examiner acknowledges the Applicant's arguments traversing the 102 rejection and has updated the rejection in section 5 below. The rejection now stands as a 103 rejection based on the combination of Guim Bernat and Vembu.
On pages 15-18 of the Remarks, the Applicant respectfully traverses the 102 rejections of claims 1 and 13, supporting the traversal with citations to both the prior art and the Specification of the present invention.
The Examiner acknowledges the Applicant's arguments traversing the 102 rejections of claims 1 and 13 and has updated the rejections in section 5 below. The rejections now stand as 103 rejections based on the combination of Guim Bernat and Vembu. Detailed explanations and motivations for combining the prior art are provided in each claim rejection in section 5 below.
On pages 18-19 of the Remarks, the Applicant respectfully traverses the remaining claim rejections, arguing that "Guim Bernat fails to teach the foundational architecture of the independent claims."
The Examiner respectfully disagrees with the Applicant. The remaining claims remain rejected, with detailed descriptions of the prior art and motivations for combining the prior art provided in sections 5 and 6 below. As such, the 103 rejections of claims 1-24 are maintained in this Final Office Action.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
5. Claims 1, 4-7, 9, 11-14, and 17-24 are rejected under 35 U.S.C. 103 as being unpatentable over Guim Bernat et al. (U.S. Pub. No. 2018/0267878), hereinafter "Guim Bernat," in view of Vembu et al. (U.S. Patent No. 10,922,085), hereinafter "Vembu."
Regarding independent claim 1, Guim Bernat discloses:
A monitoring apparatus, the monitoring apparatus comprising interface circuitry, machine-readable instructions, and processing circuitry to execute the machine-readable instructions to: (Title “System, Apparatus, and Method for Multi-Kernel Performance Monitoring in a Field Programmable Gate Array”)
obtain a first compute kernel to be monitored; (Fig. 1 Performance Monitor Circuit 126 and [0032] “In the embodiment shown in FIG. 1, performance monitoring circuit 126 includes a fixed plurality or sets of performance monitors 126.sub.1-126.sub.n. Each of these sets of performance monitors may be dynamically allocated to a given kernel.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the performance monitoring circuit is allocated to monitor a kernel.
obtain one or more second compute kernels; ([0031] “In addition, to provide a high level of dynamic programmability for performance monitoring within FPGA 120, an integrated monitoring logic 122 (also referred to herein as “monitoring logic”) is present. In embodiments, integrated monitoring logic 122, which may be implemented as hardware circuitry, software and/or firmware or combinations thereof, is configured to receive information from multiple kernels that may execute in parallel within FPGA 120. Based upon kernel programming, particular performance monitors within a performance monitoring circuit 126 may be dynamically allocated to particular kernels.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the group of kernels are executing in parallel within the FPGA and are not necessarily being monitored like the particular kernel selected to be monitored.
provide instructions, using the interface circuitry, to control circuitry of a computing device comprising a plurality of execution units, to instruct the control circuitry to … ([0020] “To this end, embodiments provide a set of interfaces to extend a processing node architecture (including at least a processor and an FPGA, which may be coupled via a coherent interconnect) to enable kernels to expose specific performance monitors to one or more applications in execution in the processing node.” and [0054] “The execution cluster(s) 560 includes a set of one or more execution units 562 and a set of one or more memory access units 564. The execution units 562 may perform various operations (e.g., shifts, addition, subtraction, multiplication) and on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the multiple execution units can be split up to perform all functions for the compute kernels based on the interface interactions.
… and to instruct the control circuitry to provide counter-status information on a change of a status of at least one hardware counter associated with the first slice that is caused by the execution of the first compute kernel; and ([0026] “Still with reference to host processor 110, a performance monitoring logic 118 is present. This performance monitoring logic may include combinations of hardware circuitry, software and/or firmware. In some embodiments, performance monitoring logic 118 may include control circuitry and a set of performance monitors to monitor performance within host processor 110. These performance monitors may include one or more sets of counters. Such counters may include dedicated counters to count particular events within the processor, such as cache memory misses, instruction execution rates, cycle counters and so forth.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the counter associated with the performance monitoring logic for the kernel counts events such as cache memory misses, instruction execution rates, cycle counters, and more to monitor the status of the kernel’s execution.
determine profiling information on the execution of the first compute kernel by processing the counter-status information generated by the at least one hardware counter in response to execution of the first compute kernel on the first slice. ([0018] “Thus using an embodiment actual execution and interaction between a FPGA and processor can be monitored and analyzed. Still further, with programming by a kernel, kernel-specific events that are custom defined for the kernel can be monitored based on the kernel's registration of performance monitors and metadata for the performance monitors. Embodiments also enable monitored information to be observed and monitored from the processor side via interfaces described herein. In this way, as the processor is the agent that provides the control flow for execution (and therefore can take corrective action), suitable information can be obtained and analyzed to effect corrective action. As an example based on the monitored information, work could be redistributed between the processor and FPGA if it is determined that execution on the FPGA is bottlenecked, or vice versa.” and [0026] “Still with reference to host processor 110, a performance monitoring logic 118 is present. This performance monitoring logic may include combinations of hardware circuitry, software and/or firmware. In some embodiments, performance monitoring logic 118 may include control circuitry and a set of performance monitors to monitor performance within host processor 110. These performance monitors may include one or more sets of counters. 
Such counters may include dedicated counters to count particular events within the processor, such as cache memory misses, instruction execution rates, cycle counters and so forth.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the counter associated with the performance monitoring logic for the kernel counts events such as cache memory misses, instruction execution rates, cycle counters, and more and uses this monitored information to make execution determinations such that the FPGA is bottlenecked and needs to reallocate resources to the kernel.
Guim Bernat does not explicitly disclose:
execute the first compute kernel using a first slice of the plurality of execution units and to execute the one or more second compute kernels concurrently with the first compute kernel using one or more second slices of the plurality of execution units …
However, Vembu discloses:
execute the first compute kernel using a first slice of the plurality of execution units and to execute the one or more second compute kernels concurrently with the first compute kernel using one or more second slices of the plurality of execution units … (Col. 5, Lines 20-32 “The processing cluster array 212 can include up to “N” processing clusters (e.g., cluster 214A, cluster 214B, through cluster 214N). Each cluster 214A-214N of the processing cluster array 212 can execute a large number of concurrent threads. The scheduler 210 can allocate work to the clusters 214A-214N of the processing cluster array 212 using various scheduling and/or work distribution algorithms, which may vary depending on the workload arising for each type of program or computation. The scheduling can be handled dynamically by the scheduler 210, or can be assisted in part by compiler logic during compilation of program logic configured for execution by the processing cluster array 212.” and Col. 33, Lines 22-26 “In some embodiments, graphics processor 2000 includes scalable thread execution resources featuring modular cores 2080A-2080N (sometimes referred to as core slices), each having multiple sub-cores 2050A-550N, 2060A-2060N (sometimes referred to as core sub-slices).” and Col. 33, Lines 56-67 “In some embodiments, thread execution logic 2100 includes a shader processor 2102, a thread dispatcher 2104, instruction cache 2106, a scalable execution unit array including a plurality of execution units 2108A-2108N, a sampler 2110, a data cache 2112, and a data port 2114. In one embodiment the scalable execution unit array can dynamically scale by enabling or disabling one or more execution units (e.g., any of execution unit 2108A, 2108B, 2108C, 2108D, through 2108N-1 and 2108N) based on the computational requirements of a workload. 
In one embodiment the included components are interconnected via an interconnect fabric that links to each of the components.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the multiple execution units can be split up to perform all functions for the compute kernels/clusters concurrently.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to incorporate executing the first compute kernel using a first slice of the plurality of execution units and executing the one or more second compute kernels concurrently with the first compute kernel using one or more second slices of the plurality of execution units, as seen in Vembu's invention, into Guim Bernat's invention, because these modifications allow the use of a known technique to improve similar devices in the same way such that the compute kernels are executed using slices of execution units, allowing multiple compute kernels to execute concurrently.
Regarding claim 4, Guim Bernat discloses the monitoring apparatus according to claim 1, wherein the machine-readable instructions comprise instructions to provide instructions to the control circuitry of the computing device to configure at least one event to be counted by the at least one hardware counter. ([0040] “Thus as further illustrated in FIG. 3, control next passes to block 330, where a performance monitor is updated upon occurrence of a given event during execution of the kernel on the FPGA. For example, assume that a first performance monitor dynamically programmed for the kernel is to count a number of multiplications that occur. When a multiplication event occurs during kernel execution, the corresponding performance monitor may be updated, e.g., by the monitor circuit.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the performance monitor counts each event occurring, e.g., multiplication events.
Regarding claim 5, Guim Bernat discloses the monitoring apparatus according to claim 4, wherein the at least one event comprises at least one of a floating-point unit pipelining event, a systolic pipelining event, a math pipelining event, a data-type specific event, a floating-point data-type specific event, an integer data-type specific event, an instruction-specific event, an extended math instruction-specific event, a jump instruction-specific event and a send instruction-specific event. ([0040] “Thus as further illustrated in FIG. 3, control next passes to block 330, where a performance monitor is updated upon occurrence of a given event during execution of the kernel on the FPGA. For example, assume that a first performance monitor dynamically programmed for the kernel is to count a number of multiplications that occur. When a multiplication event occurs during kernel execution, the corresponding performance monitor may be updated, e.g., by the monitor circuit.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the performance monitor counts each event occurring, e.g., multiplication events.
Regarding claim 6, Guim Bernat discloses the monitoring apparatus according to claim 1, wherein the machine-readable instructions comprise instructions to determine the profiling information on the execution of the first compute kernel with functionality information on hardware functionality being used by the execution of the first compute kernel. ([0032] ‘Note that as used herein, the terms “performance monitor” and “performance counter” are used synonymously to refer to a hardware circuit configured to monitor and store information associated with operation of a circuit. In different cases, these performance monitors or counters may be configured to store count information associated with particular events that occur, execution rates, cache misses, particular operations performed, among a wide variety of other performance monitoring information types.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the counter/monitor tracks predetermined events and stores count information associated with the events to determine information regarding the execution of the kernel.
Regarding claim 7, Guim Bernat discloses the monitoring apparatus according to claim 6, wherein the functionality information on the hardware functionality being used by the execution of the first compute kernel comprises at least one of information on a use of a floating point unit pipeline, information on a use of a systolic pipeline, information on a use of a math pipeline, information on a use of a data-type specific functionality, and information on a use of one or more predefined instructions executed by the execution units of the first slice. ([0030] “Nonetheless, in some embodiments FPGA 120 may include a host fabric interface (HFI) performance monitor logic 125 which may include one or more dedicated performance monitors that can monitor predetermined events within FPGA 120. Examples of these dedicated monitors include monitors to track the data traffic to the FPGA, memory bandwidth used by the FPGA, network traffic, power consumed by the FPGA, as such events can be predetermined since they do not depend on any specifics of a kernel that runs on the FPGA.” and [0032] “Note that as used herein, the terms “performance monitor” and “performance counter” are used synonymously to refer to a hardware circuit configured to monitor and store information associated with operation of a circuit. In different cases, these performance monitors or counters may be configured to store count information associated with particular events that occur, execution rates, cache misses, particular operations performed, among a wide variety of other performance monitoring information types.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the counter/monitor tracks predetermined events and stores count information associated with the events.
Regarding claim 9, Guim Bernat discloses the monitoring apparatus according to claim 1, wherein the machine-readable instructions comprise instructions to provide instructions to the control circuitry of the computing device to execute the first compute kernel and the one or more second compute kernels in a non-serialized manner. ([0058] “It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the multithreaded execution of kernels allows time slicing kernels.
Regarding claim 11, Guim Bernat discloses the monitoring apparatus according to claim 1, wherein the one or more second compute kernels belong to the same computer program or to different computer programs. ([0057-0058] “Note that the core 590 may send kernels including kernel registration information to a FPGA to enable programming of dynamic performance monitors of the FPGA, as described herein. It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the kernels for execution are supported by a core that supports multithreading so the compute kernels coming from the same or different programs are supported by a core that executes them at the same time.
Regarding claim 12, Guim Bernat discloses the monitoring apparatus according to claim 1, wherein the first compute kernel and/or the one or more second compute kernels are related to one or more of compute operations, render operations and media operations. ([0054] “The execution cluster(s) 560 includes a set of one or more execution units 562 and a set of one or more memory access units 564. The execution units 562 may perform various operations (e.g., shifts, addition, subtraction, multiplication) and on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the kernels associated with execution units can perform various operations such as compute operations like shifts, addition, subtraction, multiplication, etc.
Regarding claim 13, Guim Bernat discloses:
A system comprising:
the monitoring apparatus according to claim 1 (see rejection of claim 1 above); and
a computing device comprising a plurality of execution units, machine-readable instructions and control circuitry to execute the machine-readable instructions to, based on instructions of the monitoring apparatus: (Fig. 5B Execution Engine Unit 550 and Execution Units 562 and [0054] “The execution cluster(s) 560 includes a set of one or more execution units 562 and a set of one or more memory access units 564. The execution units 562 may perform various operations (e.g., shifts, addition, subtraction, multiplication) and on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point).”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the Execution Engine Unit 550 contains Execution Units 562 that perform operations for the kernel.
provide information on a change of a status of at least one hardware counter associated with the first slice that is caused by the execution of the first compute kernel to the monitoring apparatus. ([0026] “Still with reference to host processor 110, a performance monitoring logic 118 is present. This performance monitoring logic may include combinations of hardware circuitry, software and/or firmware. In some embodiments, performance monitoring logic 118 may include control circuitry and a set of performance monitors to monitor performance within host processor 110. These performance monitors may include one or more sets of counters. Such counters may include dedicated counters to count particular events within the processor, such as cache memory misses, instruction execution rates, cycle counters and so forth.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the counter associated with the performance monitoring logic for the kernel counts events such as cache memory misses, instruction execution rates, cycle counters, and more to monitor the status of the kernel’s execution.
Guim Bernat does not explicitly disclose:
execute the first compute kernel using the first slice of the plurality of execution units, execute the one or more second compute kernels concurrently with the first compute kernel using one or more second slices of the plurality of execution units …
However, Vembu discloses:
execute the first compute kernel using the first slice of the plurality of execution units, execute the one or more second compute kernels concurrently with the first compute kernel using one or more second slices of the plurality of execution units … (Col. 5, Lines 20-32 “The processing cluster array 212 can include up to “N” processing clusters (e.g., cluster 214A, cluster 214B, through cluster 214N). Each cluster 214A-214N of the processing cluster array 212 can execute a large number of concurrent threads. The scheduler 210 can allocate work to the clusters 214A-214N of the processing cluster array 212 using various scheduling and/or work distribution algorithms, which may vary depending on the workload arising for each type of program or computation. The scheduling can be handled dynamically by the scheduler 210, or can be assisted in part by compiler logic during compilation of program logic configured for execution by the processing cluster array 212.” and Col. 33, Lines 22-26 “In some embodiments, graphics processor 2000 includes scalable thread execution resources featuring modular cores 2080A-2080N (sometimes referred to as core slices), each having multiple sub-cores 2050A-550N, 2060A-2060N (sometimes referred to as core sub-slices).” and Col. 33, Lines 56-67 “In some embodiments, thread execution logic 2100 includes a shader processor 2102, a thread dispatcher 2104, instruction cache 2106, a scalable execution unit array including a plurality of execution units 2108A-2108N, a sampler 2110, a data cache 2112, and a data port 2114. In one embodiment the scalable execution unit array can dynamically scale by enabling or disabling one or more execution units (e.g., any of execution unit 2108A, 2108B, 2108C, 2108D, through 2108N-1 and 2108N) based on the computational requirements of a workload. 
In one embodiment the included components are interconnected via an interconnect fabric that links to each of the components.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the multiple execution units can be split up to perform all functions for the compute kernels/clusters concurrently.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to incorporate executing the first compute kernel using a first slice of the plurality of execution units and executing the one or more second compute kernels concurrently with the first compute kernel using one or more second slices of the plurality of execution units, as seen in Vembu's invention, into Guim Bernat's invention, because these modifications allow the use of a known technique to improve similar devices in the same way such that the compute kernels are executed using slices of execution units, allowing multiple compute kernels to execute concurrently.
Regarding claim 14, it is a system claim having the same limitations as cited in monitoring apparatus claim 4. Thus, claim 14 is also rejected under the same rationale as addressed in the rejection of claim 4 above.
Regarding claim 17, Guim Bernat discloses the system of claim 13, but does not explicitly disclose:
wherein the first slice and the one or more second slices are non-overlapping slices of the plurality of execution units.
However, Vembu discloses:
wherein the first slice and the one or more second slices are non-overlapping slices of the plurality of execution units. (Col. 14, Lines 47-54 “In one embodiment, a virtualized graphics execution environment is presented in which the resources of the graphics processing engines 431-432, N are shared with multiple applications or virtual machines (VMs). The resources may be subdivided into “slices” which are allocated to different VMs and/or applications based on the processing requirements and priorities associated with the VMs and/or applications.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the individual graphics processing engines have resources that are subdivided into slices for processing operations.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the feature wherein the first slice and the one or more second slices are non-overlapping slices of the plurality of execution units, as seen in Vembu's invention, into Guim Bernat's invention, because these modifications allow the use of a known technique to improve similar devices in the same way such that resources already assigned to one task are not reused for another, avoiding errors when selecting slices of execution units.
Regarding claim 18, Guim Bernat discloses the system of claim 13, but does not explicitly disclose:
wherein the first slice and/or the one or more second slices each comprise a fixed number of execution units.
However, Vembu discloses:
wherein the first slice and/or the one or more second slices each comprise a fixed number of execution units. (Col. 34, Lines 5-11 “In some embodiments, each execution unit (e.g. 2108A) is a stand-alone programmable general purpose computational unit that is capable of executing multiple simultaneous hardware threads while processing multiple data elements in parallel for each thread. In various embodiments, the array of execution units 2108A-2108N is scalable to include any number individual execution units.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, each execution unit is a stand-alone programmable computational unit and thus can execute by itself, so a slice can comprise just one execution unit.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the feature wherein the first slice and/or the one or more second slices each comprise a fixed number of execution units, as seen in Vembu's invention, into Guim Bernat's invention, because these modifications allow the use of a known technique to improve similar devices in the same way: with a fixed number of execution units per slice, slice selection is straightforward, and the user always knows how many slices are needed to execute its tasks.
Regarding claim 19, Guim Bernat discloses the system of claim 13, but does not explicitly disclose:
wherein the first slice and/or the one or more second slices each comprise a variable number of execution units, with the machine-readable constructions for the control circuitry comprising instructions to set the number of execution units being part of the respective slices, and to provide information on the execution units being part of the respective slices to the monitoring apparatus.
However, Vembu discloses:
wherein the first slice and/or the one or more second slices each comprise a variable number of execution units, with the machine-readable constructions for the control circuitry comprising instructions to set the number of execution units being part of the respective slices, and to provide information on the execution units being part of the respective slices to the monitoring apparatus. (Col. 32, Lines 29-35 “In some embodiments, graphics core array 1914 is scalable, such that the array includes a variable number of graphics cores, each having a variable number of execution units based on the target power and performance level of GPE 1910. In one embodiment the execution resources are dynamically scalable, such that execution resources may be enabled or disabled as needed.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the graphics core array 1914 contains a variable amount of graphics cores which each contain a variable number of execution units. The number of execution units is set to reach a threshold performance level.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add wherein the first slice and/or the one or more second slices each comprise a variable number of execution units, with the machine-readable constructions for the control circuitry comprising instructions to set the number of execution units being part of the respective slices, and to provide information on the execution units being part of the respective slices to the monitoring apparatus, as seen in Vembu’s invention, into Guim Bernat's invention because these modifications allow the use of a known technique to improve similar devices in the same way, enabling more individualized resource slice preparation: each slice can be set with the ideal amount of resources to handle the specific operations of the kernel execution, rather than wasting unused resources when the number of execution units is preset above the necessary amount.
Regarding claim 20, it is a system claim having the same limitations as cited in monitoring apparatus claim 8. Thus, claim 20 is also rejected under the same rationale as addressed in the rejection of claim 8 above.
Regarding claim 21, it is a system claim having the same limitations as cited in monitoring apparatus claim 9. Thus, claim 21 is also rejected under the same rationale as addressed in the rejection of claim 9 above.
Regarding claim 22, Guim Bernat discloses the system according to claim 13, wherein the computing device is a graphics processing unit. (Fig. 1 and [0023] “With reference to FIG. 1, processing node 100 includes a host processor 110 which in an embodiment may be implemented as a central processing unit (CPU) such as a multicore processor. In a particular example, host processor 110 may take the form of an Intel® XEON® or an Intel® Core™ processor. In other cases, host processor 110 may be an accelerated processing unit (APU) or a graphics processing unit (GPU).”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the processing node 100 includes a host processor 110 that can be a graphics processing unit (GPU).
Regarding claim 23, it is a monitoring method claim having the same limitations as cited in monitoring apparatus claim 1. Thus, claim 23 is also rejected under the same rationale as addressed in the rejection of claim 1 above.
Regarding claim 24, Guim Bernat discloses:
A non-transitory machine-readable storage medium including program code, when executed, to cause a machine to perform the method of claim 23. ([0045] “As illustrated, method 400 begins by sending a kernel (including kernel registration information, as discussed above) to an FPGA (block 410). In some cases, an application that executes on a processor may generate one or more kernels to be sent to program the FPGA for executing, e.g., specialized operations, to offload these operations to the FPGA.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the kernels for execution come from an application offloading program code containing operations to the FPGA.
6. Claims 2-3, 8, 10, and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Guim Bernat et al. (U.S. Pub. No. 2018/0267878) – hereinafter “Guim Bernat”, in view of Vembu et al. (U.S. Patent No. 10,922,085) – hereinafter “Vembu”, and further in view of Sethia et al. (U.S. Pub. No. 2016/0103691) – hereinafter “Sethia”.
Regarding claim 2, Guim Bernat discloses the monitoring apparatus of claim 1, but does not explicitly disclose:
wherein the information on the change of the status of at least one hardware counter comprises information on the change of the status of at least one hardware counter per execution unit of the first slice.
However, Sethia discloses:
wherein the information on the change of the status of at least one hardware counter comprises information on the change of the status of at least one hardware counter per execution unit of the first slice. ([0045] “As new GPU architectures support different kernels on each SM, Equalizer runs on individual SMs to make decisions tailored for each kernel. It monitors the state of threads with four hardware counters that measure the number of active warps (groups of threads), warps waiting for data from memory, warps ready to execute arithmetic pipeline and warps ready to issue to memory pipeline over an execution window.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the individual streaming multiprocessors have four hardware counters that monitor each thread for state changes.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add wherein the information on the change of the status of at least one hardware counter comprises information on the change of the status of at least one hardware counter per execution unit of the first slice, as seen in Sethia's invention, into Guim Bernat's invention because these modifications allow applying a known technique to a known device ready for improvement to yield predictable results: each individual execution unit can track the change of the status of its hardware counters, so that all the execution units working on one kernel remain data-synchronized.
Regarding claim 3, Guim Bernat discloses the monitoring apparatus of claim 2, but does not explicitly disclose:
wherein the machine-readable instructions comprise instructions to aggregate the information on the change of the status of the at least one hardware counter per execution unit of the first slice, and to determine the information on the execution of the first compute kernel based on the aggregate.
However, Sethia discloses:
wherein the machine-readable instructions comprise instructions to aggregate the information on the change of the status of the at least one hardware counter per execution unit of the first slice, and to determine the information on the execution of the first compute kernel based on the aggregate. ([0045] “As new GPU architectures support different kernels on each SM, Equalizer runs on individual SMs to make decisions tailored for each kernel. It monitors the state of threads with four hardware counters that measure the number of active warps (groups of threads), warps waiting for data from memory, warps ready to execute arithmetic pipeline and warps ready to issue to memory pipeline over an execution window.” and [0046] “Firstly, it decides to increase, maintain, or decrease the number of concurrent threads on the SM. Secondly, it also takes a vote among different SMs to determine the overall resource requirement of the kernel based on the above counters. After determining the resource requirements of a kernel, Equalizer can work in either energy efficient or high performance modes.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the Equalizer aggregates the state information from the hardware counters of the streaming multiprocessor to understand the execution status of the kernel.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add wherein the machine-readable instructions comprise instructions to aggregate the information on the change of the status of the at least one hardware counter per execution unit of the first slice, and to determine the information on the execution of the first compute kernel based on the aggregate, as seen in Sethia's invention, into Guim Bernat's invention because these modifications represent an “obvious to try” solution with a reasonable expectation of success: the information from all the hardware counters is aggregated for data synchronization, ensuring proper handling of kernel execution.
Regarding claim 8, Guim Bernat discloses the monitoring apparatus according to claim 1, but does not explicitly disclose:
wherein the machine-readable instructions comprise instructions to provide instructions to the control circuitry of the computing device to execute the first compute kernel and the one or more second compute kernels such, that the at least one hardware counter associated with the first slice is unaffected by the concurrent execution of the one or more second compute kernels.
However, Sethia discloses:
wherein the machine-readable instructions comprise instructions to provide instructions to the control circuitry of the computing device to execute the first compute kernel and the one or more second compute kernels such, that the at least one hardware counter associated with the first slice is unaffected by the concurrent execution of the one or more second compute kernels. ([0045] “To address the limitations mentioned above and exploit the significant opportunities provided by modulating these three parameters, the present disclosure proposes Equalizer, a comprehensive dynamic system which coordinates these three architectural parameters. Based on the resource requirements of the kernel at runtime, it tunes these parameters to exploit any imbalance in resource requirements. As new GPU architectures support different kernels on each SM, Equalizer runs on individual SMs to make decisions tailored for each kernel. It monitors the state of threads with four hardware counters that measure the number of active warps (groups of threads), warps waiting for data from memory, warps ready to execute arithmetic pipeline and warps ready to issue to memory pipeline over an execution window. At the end of a window, Equalizer performs two actions to tune the hardware.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the execution of multiple different kernels on each SM is monitored by Equalizer to ensure that the concurrent execution of kernels does not affect the hardware counters.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add wherein the machine-readable instructions comprise instructions to provide instructions to the control circuitry of the computing device to execute the first compute kernel and the one or more second compute kernels such, that the at least one hardware counter associated with the first slice is unaffected by the concurrent execution of the one or more second compute kernels, as seen in Sethia's invention, into Guim Bernat's invention because these modifications allow the use of a known technique to improve similar devices in the same way, so that the concurrent execution of kernels does not cause errors the user does not intend.
Regarding claim 10, Guim Bernat discloses the monitoring apparatus of claim 1, but does not explicitly disclose:
wherein the machine-readable instructions comprise instructions to obtain information on execution units being part of the respective slices from the computing device, and to determine the information on the execution of the first compute kernel further based on the information on the execution units being part of the respective slices.
However, Sethia discloses:
wherein the machine-readable instructions comprise instructions to obtain information on execution units being part of the respective slices from the computing device, and to determine the information on the execution of the first compute kernel further based on the information on the execution units being part of the respective slices. ([0045] “As new GPU architectures support different kernels on each SM, Equalizer runs on individual SMs to make decisions tailored for each kernel. It monitors the state of threads with four hardware counters that measure the number of active warps (groups of threads), warps waiting for data from memory, warps ready to execute arithmetic pipeline and warps ready to issue to memory pipeline over an execution window.” and [0046] “Firstly, it decides to increase, maintain, or decrease the number of concurrent threads on the SM. Secondly, it also takes a vote among different SMs to determine the overall resource requirement of the kernel based on the above counters. After determining the resource requirements of a kernel, Equalizer can work in either energy efficient or high performance modes.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the state information from the hardware counters of the streaming multiprocessor is analyzed to understand the execution status of the kernel and make any changes if necessary.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add wherein the machine-readable instructions comprise instructions to obtain information on execution units being part of the respective slices from the computing device, and to determine the information on the execution of the first compute kernel further based on the information on the execution units being part of the respective slices, as seen in Sethia's invention, into Guim Bernat's invention because these modifications represent an “obvious to try” solution with a reasonable expectation of success: the information from all the execution units is aggregated to ensure data synchronization for proper handling of kernel execution across different resource slices.
Regarding claim 15, Guim Bernat discloses the system of claim 13, but does not explicitly disclose:
wherein the at least one hardware counter associated with the first slice comprises at least one hardware counter per execution unit of the first slice.
However, Sethia discloses:
wherein the at least one hardware counter associated with the first slice comprises at least one hardware counter per execution unit of the first slice. ([0045] “As new GPU architectures support different kernels on each SM, Equalizer runs on individual SMs to make decisions tailored for each kernel. It monitors the state of threads with four hardware counters that measure the number of active warps (groups of threads), warps waiting for data from memory, warps ready to execute arithmetic pipeline and warps ready to issue to memory pipeline over an execution window.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the individual streaming multiprocessors have four hardware counters to monitor each thread.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add wherein the at least one hardware counter associated with the first slice comprises at least one hardware counter per execution unit of the first slice, as seen in Sethia's invention, into Guim Bernat's invention because these modifications allow the use of a known technique to improve similar devices in the same way: each execution unit separately tracks its status with an individual hardware counter, ensuring its proper execution.
Regarding claim 16, it is a system claim having the same limitations as cited in monitoring apparatus claim 2. Thus, claim 16 is also rejected under the same rationale as addressed in the rejection of claim 2 above.
Conclusion
7. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Bourd et al. (U.S. Pub. No. 2013/0194286) and Haberstro (U.S. Pub. No. 2020/0379864) are directed to GPU performance management. Bourd et al. discloses kernel/shader execution with a PMU tracking counter information, and Haberstro discloses kernel/shader execution with hardware counters tracking the performance status of the GPU over time. Also, Hartog et al. (U.S. Patent No. 8,963,933) discloses execution of compute kernels and shaders with tracking counters.
Examiner has cited particular columns/paragraphs/sections and line numbers in the references applied, and in those not relied upon, for the convenience of the applicant. Although the specified citations are representative of the teachings of the art and are applied to specific limitations within the individual claims, other passages and figures may apply as well. In preparing responses, the applicant is respectfully requested to fully consider each reference in its entirety as potentially teaching all or part of the claimed invention, as well as the context of each passage as taught by the prior art or disclosed by the Examiner.
When responding to the Office action, applicant is advised to clearly point out the patentable novelty the claims present in view of the state of the art disclosed by the reference(s) cited or the objections made. A showing of how the amendments avoid such references or objections must also be present. See 37 C.F.R. 1.111(c).
When responding to this Office action, applicant is advised to provide the line and page numbers in the application and/or reference(s) cited to assist in locating the appropriate paragraphs.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL B TRAINOR whose telephone number is (571)272-3710. The examiner can normally be reached Monday-Friday 9AM-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Vital can be reached at (571) 272-4215. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/D.T./Examiner, Art Unit 2198
/PIERRE VITAL/Supervisory Patent Examiner, Art Unit 2198