Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office action is in response to the amendment filed on 02/25/2026. By the amendment, Claim 2 is amended. Claims 1-20 are pending.
Priority
Applicant’s claims for priority to foreign application nos. EP22188051.1, EP22188053.7, and EP22386054.5, filed 08/01/2022; GB2214192.3, filed 09/28/2022; and provisional application no. 63394053, filed 08/01/2022, are acknowledged.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Surti et al. (US 20210258592 A1) in view of Yang et al. (US 20190324759 A1), hereinafter referred to as Surti and Yang, respectively.
Regarding Claim 1, Surti discloses A processor comprising: a command processing unit ([0035] The GPU may be communicatively coupled to the host processor/cores over a bus […] The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions. Please note that the dedicated circuitry of the GPU for command processing corresponds to Applicant’s processor comprising a command processing unit.) to:
receive, from a host processor, a sequence of commands to be executed ([0035] host/processor cores […] the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. Please note that the GPU receiving sequences of commands that are allocated as work from the processor cores corresponds to Applicant’s receiving a sequence of commands to be executed from a host processor, as the host/processor cores correspond to the host processor.);
and generate based on the sequence of commands a plurality of tasks ([0035] The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions. Please note the dedicated logic of the GPU for processing the commands corresponds to Applicant’s generating a plurality of tasks based on the sequence of commands, as it is known in the art that commands or instructions can be processed by generating respective tasks.);
and a plurality of compute units ([0046] When the host interface 206 receives a command buffer via the I/O unit 204, the host interface 206 can direct work operations to perform those commands to a front end 208. In one embodiment the front end 208 couples with a scheduler 210, which is configured to distribute commands or other work items to a processing cluster array 212. Please note the processing cluster array corresponds to Applicant’s plurality of compute units.),
wherein at least one of the plurality of compute units comprises: a first processing module for executing tasks of a first task type generated by the command processing unit ([0047] different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs or for performing different types of computations. Please note that a first particular cluster 214A of the processing cluster array 212 for a first particular type of computation corresponds to Applicant’s first processing module for executing tasks of a first task type generated by the command processing unit, since, as previously stated, the computations performed by the cluster are based on the received commands.);
a second processing module for executing tasks of a second task type, different from the first task type, generated by the command processing unit ([0047] different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs or for performing different types of computations. Please note that a second particular cluster 214B of the processing cluster array 212 for a second particular type of computation corresponds to Applicant’s second processing module for executing tasks of a second task type generated by the command processing unit, since, as previously stated, the computations performed by the cluster are based on the received commands.);
a local cache shared by at least the first processing module and the second processing module ([0053] In some embodiments, a local instance of the parallel processor memory 222 may be excluded in favor of a unified memory design that utilizes system memory in conjunction with local cache memory. Please note that the local cache used by the parallel processors as part of memory corresponds to Applicant’s local cache shared by the first and second processing module, as they are both part of the parallel processing system and therefore would both utilize this local cache.);
Surti does not explicitly disclose wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks.
However, Yang discloses wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks ([0066] using a common instruction unit configured to issue instructions to a set of processing engines within each one of the processing clusters. Please note that using a common instruction unit to issue instructions to a set of processing engines within each processing cluster corresponds to Applicant’s command processing unit issuing the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks. This is because the common instruction unit corresponds to Applicant’s command processing unit that issues the plurality of tasks to each processing engine, corresponding to at least one of the plurality of compute units, which processes it.).
Surti and Yang are both considered analogous art to the claimed invention because they are in the same field of computer command pipeline processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Surti’s system, in which the command processing unit generates tasks and the compute units comprise first and second processing modules for executing tasks of respective types together with a shared local cache, to incorporate the teachings of Yang such that the command processing unit issues the plurality of tasks to the compute units to be processed, allowing for greater flexibility of execution and improved efficiency via parallel processing, as described in Yang.
Regarding Claim 2, Surti-Yang as described in Claim 1, Surti further discloses wherein the command processing unit is to issue tasks of the first task type to the first processing module of a given compute unit of the plurality of compute units and to issue tasks of the second task type to the second processing module of the given compute unit ([0047] The scheduler 210 can allocate work to the clusters 214A-214N of the processing cluster array 212 using various scheduling and/or work distribution algorithms, which may vary depending on the workload arising for each type of program or computation. […] In one embodiment, different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs. Please note that the scheduler 210 allocating work to a first particular cluster 214A of the processing cluster array 212 for a first particular type of computation and to a second particular cluster 214B of the processing cluster array 212 for a second particular type of computation corresponds to Applicant’s command processing unit issuing tasks of the first and second task types to respective processing modules of a given compute unit, i.e., clusters 214A and 214B of cluster array 212.).
Regarding Claim 3, Surti-Yang as described in Claim 1, Surti further discloses wherein the first task type is a task for undertaking at least a portion of a graphics processing operation forming one of a set of pre-defined graphics processing operations which collectively enable the implementation of a graphics processing pipeline ([0121] The various parallel processing systems can implement the graphics processing pipeline 500 via one or more instances of the parallel processing unit (e.g., parallel processing unit 202 of FIG. 2A) as described herein. […] The graphics processing pipeline 500 may also be implemented using dedicated processing units for one or more functions. Please note that the parallel processing systems implementing the graphics processing pipeline 500, being configured to perform parallel graphics processing operations, and being implemented using processing units for their functions correspond to Applicant’s first task type being a task for undertaking a portion of a graphics processing operation forming one of a set of pre-defined graphics processing operations which collectively enable the implementation of a graphics processing pipeline, as [0121] describes a number of pre-defined graphics processing operations which collectively enable the implementation of the pipeline, and may be carried out by being defined as a graphics processing task.),
Yang further discloses and wherein the second task type is a task for undertaking at least a portion of a neural processing operation ([0152] The computing architecture provided by embodiments described herein can be configured to perform the types of parallel processing that is particularly suited for training and deploying neural networks for machine learning. Please note that the parallel processing performing training and deploying of neural networks corresponds to Applicant’s second task type being a task for undertaking a portion of a neural processing operation, as the same parallel processing system could be utilized to carry out neural processing operations based on the type of the commands.).
Regarding Claim 4, Surti-Yang as described in Claim 3, Surti further discloses wherein the graphics processing operation comprises at least one of: a graphics compute shader task; a vertex shader task; a fragment shader task; a tessellation task; and a geometry shader task ([0049] Additionally, the processing cluster array 212 can be configured to execute graphics processing related shader programs such as, but not limited to vertex shaders, tessellation shaders, geometry shaders. Please note that the processing cluster array being configured to execute graphics processing related shader programs such as vertex shaders, tessellation shaders, and geometry shaders corresponds to Applicant’s graphics processing operation comprising a vertex shader task, a tessellation task, and a geometry shader task. As the claim requires only “at least one of” the listed operations, this disclosure is interpreted as fulfilling the limitation.).
Regarding Claim 5, Surti-Yang as described in Claim 1, Surti further discloses wherein each compute unit is a shader core in a graphics processing unit ([0251] Graphics processor 2810 includes one or more shader cores 2815A-2815N. Please note that graphics processor 2810 including shader cores 2815A-2815N corresponds to Applicant’s each compute unit being a shader core in a graphics processing unit.).
Regarding Claim 6, Surti-Yang as described in Claim 1, Surti further discloses wherein the first processing module is a graphics processing module ([0121] FIG. 5 illustrates a graphics processing pipeline 500, according to an embodiment. In one embodiment a graphics processor can implement the illustrated graphics processing pipeline 500. The graphics processor can be included within the parallel processing subsystems as described herein, such as the parallel processor 200 of FIG. 2A. Please note that the graphics processor included in the parallel processing subsystems corresponds to Applicant’s first processing module being a graphics processing module.),
Yang further discloses and wherein the second processing module is a neural processing module ([0152] The computing architecture provided by embodiments described herein can be configured to perform the types of parallel processing that is particularly suited for training and deploying neural networks for machine learning. Please note that the parallel processing system for training and deploying neural networks corresponds to Applicant’s second processing module being a neural processing module.).
Regarding Claim 7, Surti-Yang as described in Claim 1, Surti further discloses wherein the command processing unit further comprises at least one dependency tracker to track dependencies between commands in the sequence of commands ([0199] While waiting for data from memory or one of the shared functions, dependency logic within the execution units 2008A-2008N causes a waiting thread to sleep until the requested data has been returned. Please note that the dependency logic within the execution units waiting for data from shared functions corresponds to Applicant’s command processing unit comprising a dependency tracker to track dependencies between commands in the sequence of commands, as it causes the thread to sleep while waiting, and is therefore aware of the data dependencies between the operations of the commands.);
and wherein the command processing unit is to use the at least one dependency tracker to wait for completion of processing of a given task of a first command in the sequence of commands before issuing an associated task of a second command in the sequence of commands for processing, where the associated task is dependent on the given task ([0199] While waiting for data from memory or one of the shared functions, dependency logic within the execution units 2008A-2008N causes a waiting thread to sleep until the requested data has been returned. While the waiting thread is sleeping, hardware resources may be devoted to processing other threads. For example, during a delay associated with a vertex shader operation, an execution unit can perform operations for a pixel shader, fragment shader, or another type of shader program, including a different vertex shader. Please note that the dependency logic within the execution units causing a waiting thread to sleep until the requested data has been returned corresponds to Applicant’s command processing unit using the dependency tracker to wait for completion of processing of a given task of a first command in the sequence of commands before issuing an associated task of a second command in the sequence of commands for processing, where the associated task is dependent on the given task, as the dependency logic causes the execution unit to wait for completion of the shared function, i.e., completion of processing of a given task of the first command, before proceeding to the next operation, corresponding to issuing the associated task of the second command in the sequence for processing. As the logic is based on the dependency of the second operation on the first, this corresponds to Applicant’s associated task being dependent on the given task.).
Regarding Claim 8, Surti-Yang as described in Claim 7, Surti further discloses wherein an output of the given task is stored in the local cache ([0231] the graphics processor also uses one or more return buffers to store output data and to perform cross thread communication. Please note that the return buffer storing output data of the graphics processor corresponds to Applicant’s output of the given task being stored in the local cache, since it allows for cross thread communication, indicating it is a local cache shared by the threads of task processing.).
Regarding Claim 9, Surti-Yang as described in Claim 7, Surti further discloses wherein each command in the sequence of commands has metadata, wherein the metadata comprises indications of at least a number of tasks in the command, and task types associated with each of the tasks ([0105] In one embodiment, each WD 484 is specific to a particular graphics acceleration module 446 and/or graphics processing engine 431-432, N.; [0109] The WD is formatted specifically for the graphics acceleration module 446 and can be in the form of a graphics acceleration module 446 command, an effective address pointer to a user-defined structure, an effective address pointer to a queue of commands, or any other data structure to describe the work to be done by the graphics acceleration module 446. Please note that the WD 484 specific to a particular graphics processing engine that is an address pointer to a queue of commands to describe work to be done corresponds to Applicant’s each command in the sequence of commands having metadata comprising indications of a number of tasks in the command and task types associated with each of the tasks, as the WD 484 provides the metadata for the work, i.e., number of tasks and types, to be done by a particular processing engine.).
Regarding Claim 10, Surti-Yang as described in Claim 9, Surti further discloses wherein the command processing unit allocates each command in the sequence of commands, a command identifier, and the dependency tracker tracks dependencies between commands in the sequence of commands based on the command identifier ([0199] While waiting for data from memory or one of the shared functions, dependency logic within the execution units 2008A-2008N causes a waiting thread to sleep until the requested data has been returned. While the waiting thread is sleeping, hardware resources may be devoted to processing other threads. For example, during a delay associated with a vertex shader operation, an execution unit can perform operations for a pixel shader, fragment shader, or another type of shader program, including a different vertex shader. Please note that since Applicant states in [0049] of the Specification that “The command identifier may be used to indicate the order in which the commands of the command stream 120 are to be processed”, the dependency logic being aware that a particular operation must be completed before another subsequent, dependent operation is completed at a later time corresponds to the command identifier, as there is a mechanism for the system to be aware of the order in which the commands are to be processed. Therefore, this corresponds to the dependency tracker, i.e., dependency logic, tracking dependencies between commands in the sequence of commands based on the command identifier.).
Regarding Claim 11, Surti-Yang as described in Claim 10, Surti further discloses wherein when the given task of the first command is dependent on the associated task of the second command, the command processing unit allocates the given task and the associated task a same task identifier ([0227] the commands may be issued as batch of commands in a command sequence, such that the graphics processor will process the sequence of commands in at least partially concurrence. Please note that a batch of commands in a command sequence corresponds to Applicant’s given task of the first command being dependent on the associated task of the second command, as they are provided as a batch in a command sequence, meaning they are to be processed together and in order. Additionally, since they are processed together by the graphics processor, this corresponds to the command processing unit allocating the given task and associated task a same task identifier.).
Regarding Claim 12, Surti-Yang as described in Claim 11, Surti further discloses wherein tasks of each of the commands that have been allocated the same task identifier are executed on the same compute unit of the plurality of compute units ([0227] the commands may be issued as batch of commands in a command sequence, such that the graphics processor will process the sequence of commands in at least partially concurrence. Please note that a batch of commands in a command sequence being processed by the graphics processor in concurrence corresponds to Applicant’s tasks of each of the commands that have been allocated the same task identifier being executed on the same compute unit, since they are processed together by the graphics processor when they are in the same batch, i.e., allocated the same task identifier.).
Regarding Claim 13, Surti-Yang as described in Claim 10, Surti further discloses wherein a task allocated a first task identifier is executed on a first compute unit of the plurality of compute units and a task allocated a second, different, task identifier is executed on a second compute unit of the plurality of compute units ([0186] In one embodiment, the ring buffer can additionally include batch command buffers storing batches of multiple commands. The commands for the 3D pipeline 1712 can also include references to data stored in memory, such as but not limited to vertex and geometry data for the 3D pipeline 1712 and/or image data and memory objects for the media pipeline 1716. The 3D pipeline 1712 and media pipeline 1716 process the commands and data by performing operations via logic within the respective pipelines. Please note that batches of multiple commands that are processed by respective pipelines correspond to tasks allocated first and second task identifiers being executed on respective compute units of the plurality of compute units, as the commands in a particular batch, i.e., having a particular task identifier, are processed by respective pipelines, corresponding to respective compute units.).
Regarding Claim 14, Surti-Yang as described in Claim 11, Surti further discloses wherein a task allocated a first task identifier, and of the first type, is executed on the first processing module of a given compute unit of the plurality of compute units, and a task allocated a second, different, task identifier, and of the second task type, is executed on the second processing module of the given compute unit of the plurality of compute units ([0186] In one embodiment, the ring buffer can additionally include batch command buffers storing batches of multiple commands. The commands for the 3D pipeline 1712 can also include references to data stored in memory, such as but not limited to vertex and geometry data for the 3D pipeline 1712 and/or image data and memory objects for the media pipeline 1716. The 3D pipeline 1712 and media pipeline 1716 process the commands and data by performing operations via logic within the respective pipelines. Please note that batches of multiple commands that are processed by respective pipelines correspond to tasks allocated first and second task identifiers and of respective task types being executed on respective compute units of the plurality of compute units, as the commands in a particular batch, i.e., having a particular task identifier and of a particular type to be processed by a specific pipeline, are processed by respective pipelines, corresponding to respective compute units. For example, the first type could indicate processing by the 3D pipeline 1712, and the second, different type could indicate processing by the media pipeline 1716.).
Regarding Claim 15, Surti-Yang as described in Claim 1, Surti further discloses wherein each of the plurality of compute units further comprise at least one queue of tasks, wherein the queue tasks comprise at least a part of the sequence of commands ([0099] the process elements 483 are stored in response to GPU invocations 481 from applications 480 executed on the processor 407. […] A work descriptor (WD) 484 contained in the process element 483 can be a single job requested by an application or may contain a pointer to a queue of jobs.; [0105] each WD 484 is specific to a particular graphics acceleration module 446 and/or graphics processing engine 431-432, N. It contains all the information a graphics processing engine 431-432, N requires to do its work or it can be a pointer to a memory location where the application has set up a command queue of work to be completed. Please note that the WD 484 in the process element 483 containing a pointer to a command queue of work to be completed corresponds to Applicant’s each of the plurality of compute units comprising a queue of tasks comprising a part of the sequence of commands.).
Regarding Claim 16, Surti-Yang as described in Claim 15, Surti further discloses wherein a given queue is associated with at least one task type ([0105] each WD 484 is specific to a particular graphics acceleration module 446 and/or graphics processing engine 431-432, N. It contains all the information a graphics processing engine 431-432, N requires to do its work or it can be a pointer to a memory location where the application has set up a command queue of work to be completed. Please note that the WD 484 pointing to the command queue of work being specific to a particular graphics processing engine corresponds to Applicant’s given queue being associated with at least one task type, as since it is specific to a particular graphics processing engine, this corresponds to being associated with a task type that is processed by that engine.).
Regarding Claim 17, Surti discloses A method of allocating tasks associated with commands in a sequence of commands ([0035] The GPU may be communicatively coupled to the host processor/cores over a bus […] The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions. Please note that the dedicated circuitry of the GPU for command processing corresponds to Applicant’s method of allocating tasks associated with commands in a sequence of commands.) comprising:
receiving at a command processing unit, from a host processor, the sequence of commands to be executed ([0035] host/processor cores […] the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. Please note that the GPU receiving sequences of commands that are allocated as work from the processor cores corresponds to Applicant’s receiving a sequence of commands to be executed from a host processor, as the host/processor cores correspond to the host processor.);
generating, at the command processing unit, based on the received sequence of commands a plurality of tasks ([0035] The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions. Please note the dedicated logic of the GPU for processing the commands corresponds to Applicant’s generating a plurality of tasks based on the sequence of commands, as it is known in the art that commands or instructions can be processed by generating respective tasks.);
and issuing, by the command processing unit, each task to a compute unit of a plurality of compute units for execution ([0046] When the host interface 206 receives a command buffer via the I/O unit 204, the host interface 206 can direct work operations to perform those commands to a front end 208. In one embodiment the front end 208 couples with a scheduler 210, which is configured to distribute commands or other work items to a processing cluster array 212. Please note the processing cluster array corresponds to Applicant’s plurality of compute units.),
each compute unit comprising: a first processing module for executing tasks of a first task type ([0047] different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs or for performing different types of computations. Please note that a first particular cluster 214A of the processing cluster array 212 for a first particular type of computation corresponds to Applicant’s first processing module for executing tasks of a first task type generated by the command processing unit, since, as previously stated, the computations performed by the cluster are based on the received commands.);
a second processing module for executing tasks of a second task type ([0047] different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs or for performing different types of computations. Please note that a second particular cluster 214B of the processing cluster array 212 for a second particular type of computation corresponds to Applicant’s second processing module for executing tasks of a second task type generated by the command processing unit, since, as previously stated, the computations performed by the cluster are based on the received commands.);
and a local cache shared by at least the first processing module and the second processing module ([0053] In some embodiments, a local instance of the parallel processor memory 222 may be excluded in favor of a unified memory design that utilizes system memory in conjunction with local cache memory. Please note that the local cache used by the parallel processors as part of memory corresponds to Applicant’s local cache shared by the first and second processing module, as they are both part of the parallel processing system and therefore would both utilize this local cache.);
Surti does not explicitly disclose wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks.
However, Yang discloses wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks ([0066] using a common instruction unit configured to issue instructions to a set of processing engines within each one of the processing clusters. Please note that using a common instruction unit to issue instructions to a set of processing engines within each processing cluster corresponds to Applicant’s command processing unit issuing the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks. This is because the common instruction unit corresponds to Applicant’s command processing unit that issues the plurality of tasks to each processing engine, corresponding to at least one of the plurality of compute units, which processes it.).
Surti and Yang are both considered analogous art to the claimed invention because they are in the same field of computer command pipeline processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Surti’s system, in which the command processing unit generates tasks and the compute units comprise first and second processing modules for executing tasks of respective types together with a shared local cache, to incorporate the teachings of Yang such that the command processing unit issues the plurality of tasks to the compute units to be processed, allowing for greater flexibility of execution and improved efficiency via parallel processing, as described in Yang.
Regarding Claim 18, Surti-Yang as described in Claim 17, Surti further discloses wherein the command processing unit waits for completion of processing of the tasks associated with the first command before issuing the tasks associated with the second command to the given compute unit, when the task associated with the second command is dependent on the task associated with the first command ([0199] While waiting for data from memory or one of the shared functions, dependency logic within the execution units 2008A-2008N causes a waiting thread to sleep until the requested data has been returned. While the waiting thread is sleeping, hardware resources may be devoted to processing other threads. For example, during a delay associated with a vertex shader operation, an execution unit can perform operations for a pixel shader, fragment shader, or another type of shader program, including a different vertex shader. Please note that the dependency logic within the execution units causing a waiting thread to sleep until the requested data has been returned corresponds to Applicant’s command processing unit waiting for completion of processing of tasks associated with the first command before issuing the tasks associated with the second command to the given compute unit, where the associated task is dependent on the given task, as the dependency logic causes the execution unit to wait for completion of the shared function, i.e., completion of processing of given tasks of the first command, before proceeding to the next operation, corresponding to issuing the associated tasks of the second command in the sequence for processing. As the logic is based on the dependency of the second operation on the first, this corresponds to Applicant’s task being associated with the second command being dependent on the task associated with the first command.).
Regarding Claim 19, Surti-Yang discloses the processor as described in Claim 17. Surti further discloses wherein each command has associated metadata comprising indications of at least a number of tasks in the given command, and task types associated with each of the plurality of tasks ([0105] In one embodiment, each WD 484 is specific to a particular graphics acceleration module 446 and/or graphics processing engine 431-432, N.; [0109] The WD is formatted specifically for the graphics acceleration module 446 and can be in the form of a graphics acceleration module 446 command, an effective address pointer to a user-defined structure, an effective address pointer to a queue of commands, or any other data structure to describe the work to be done by the graphics acceleration module 446. Please note that the WD 484 specific to a particular graphics processing engine that is an address pointer to a queue of commands to describe work to be done corresponds to Applicant’s each command having metadata comprising indications of a number of tasks in the given command and task types associated with each of the tasks, as the WD 484 provides the metadata for the work, i.e., number of tasks and types, to be done by a particular processing engine.).
Regarding Claim 20, Surti discloses A non-transitory computer-readable storage medium comprising a set of computer-readable instructions stored thereon which, when executed by at least one processor ([0255] One embodiments provide for a data processing system comprising a non-transitory machine-readable medium to store instructions for execution by one or more processors of the data processing system. Please note that the non-transitory machine-readable medium storing instructions for execution by the processors corresponds to Applicant’s non-transitory computer-readable storage medium comprising a set of computer-readable instructions stored thereon to be executed by at least one processor.) are arranged to allocate tasks associated with commands in a sequence of commands ([0035] The GPU may be communicatively coupled to the host processor/cores over a bus […] The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions. Please note that the dedicated circuitry of the GPU for command processing correspond to Applicant’s allocating tasks associated with commands in a sequence of commands.) wherein the instructions, when executed cause the at least one processor to:
receive at a command processing unit, from a host processor, the sequence of commands to be executed ([0035] host/processor cores […] the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. Please note that the GPU receiving sequences of commands that are allocated as work from the processor cores corresponds to Applicant’s receiving a sequence of commands to be executed from a host processor, as the host/processor cores correspond to the host processor.);
generate, at the command processing unit, based on the received sequence of commands a plurality of tasks ([0035] The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions. Please note the dedicated logic of the GPU for processing the commands corresponds to Applicant’s generating a plurality of tasks based on the sequence of commands, as it is known in the art that commands or instructions can be processed by generating respective tasks.);
and issue, by the command processing unit, each task to a compute unit of a plurality of compute units for execution ([0046] When the host interface 206 receives a command buffer via the I/O unit 204, the host interface 206 can direct work operations to perform those commands to a front end 208. In one embodiment the front end 208 couples with a scheduler 210, which is configured to distribute commands or other work items to a processing cluster array 212. Please note the processing cluster array corresponds to Applicant’s plurality of compute units.),
each compute unit comprising: a first processing module for executing tasks of a first task type ([0047] different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs or for performing different types of computations. Please note that a first particular cluster 214A of the processing cluster array 212 for a first particular type of computation corresponds to Applicant’s first processing module for executing tasks of a first task type generated by the command processing unit, since, as previously stated, the computations performed by the cluster are based on the received commands.);
a second processing module for executing tasks of a second task type ([0047] different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs or for performing different types of computations. Please note that a second particular cluster 214B of the processing cluster array 212 for a second particular type of computation corresponds to Applicant’s second processing module for executing tasks of a second task type generated by the command processing unit, since, as previously stated, the computations performed by the cluster are based on the received commands.);
and a local cache shared by at least the first processing module and the second processing module ([0053] In some embodiments, a local instance of the parallel processor memory 222 may be excluded in favor of a unified memory design that utilizes system memory in conjunction with local cache memory. Please note that the local cache used by the parallel processors as part of memory corresponds to Applicant’s local cache shared by the first and second processing module, as they are both part of the parallel processing system and therefore would both utilize this local cache.);
Surti does not explicitly disclose wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks.
However, Yang discloses wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks ([0066] using a common instruction unit configured to issue instructions to a set of processing engines within each one of the processing clusters. Please note that using a common instruction unit to issue instructions to a set of processing engines within each processing cluster corresponds to Applicant’s command processing unit issuing the plurality of tasks to at least one of the plurality of compute units, and wherein at least one of the plurality of compute units is to process at least one of the plurality of tasks. This is because the common instruction unit corresponds to Applicant’s command processing unit that issues the plurality of tasks to each processing engine, corresponding to at least one of the plurality of compute units, which processes it.).
Surti and Yang are both considered to be analogous art to the claimed invention because they are in the same field of computer command pipeline processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Surti's command processing system, which generates tasks and includes compute units with first and second processing modules for executing tasks of respective types and a shared local cache, to incorporate the teachings of Yang such that the command processing unit issues the plurality of tasks to the compute units to be processed, allowing for greater flexibility of execution and improved efficiency via parallel processing, as described in Yang.
Response to Arguments
Applicant's arguments filed 02/25/2026 have been fully considered but, except with respect to argument A as noted below, they are not persuasive.
Applicant’s arguments are summarized as follows:
A. Claim 2 has been sufficiently amended to overcome the rejection under 35 U.S.C. 112(b).
B. Surti-Yang fails to disclose a compute unit comprising (1) a first processing module, (2) a second processing module, and (3) a shared local cache, as recited in Claim 1. The rejection relies on a system-level mapping that is inconsistent with this definition, as it maps the “plurality of compute units” to Surti’s processing cluster array 212, and subsequently maps the “first processing module” to a first cluster 214A and the “second processing module” to a second, separate cluster 214B within that array. This mapping fails to teach the “shared local cache” limitation because (1) in Surti, 214A and 214B are separate hardware blocks, not modules within a single unit but rather distinct units within an array, and (2) Surti discloses that each cluster has its own local L1 cache, with no single local cache shared by both 214A and 214B. In the claimed invention, the different processing modules are co-located within the same compute unit so they can exchange data via the same low-latency cache, whereas Surti teaches away by separating tasks across different clusters with independent local caches, and Yang does not remedy the deficiency.
C. Since independent Claims 17 and 20 contain limitations similar to those of independent Claim 1, they should also have their rejections under 35 U.S.C. 103 withdrawn.
D. Since the dependent Claims depend on allowable independent Claims, they are also allowable and should have their rejections under 35 U.S.C. 103 withdrawn.
Regarding A, Applicant’s arguments with respect to Claim 2 have been fully considered and are persuasive. The Claim has been sufficiently amended to overcome the rejection under 35 U.S.C. 112 (b). Therefore, the rejection under 35 U.S.C. 112 (b) for the claim is withdrawn.
Regarding B, the examiner respectfully disagrees. First, as seen in Figure 2, the processing cluster array 212, comprising the clusters 214A and 214B, is part of one parallel processor 200. The clusters are therefore modules within a single unit, i.e., within the parallel processor 200, and are thus co-located within the same compute unit. Additionally, as cited in [0053] of Surti, the local instances of the parallel processor memory 222 may be excluded in favor of a unified memory design that utilizes local cache memory. Contrary to Applicant’s argument that each cluster has its own local L1 cache with no single cache shared by them, [0052] discloses that the parallel processor memory 222 can be accessed via the memory crossbar 216, which can receive memory requests from the processing cluster array 212; the clusters can thus access the local cache associated with the unified memory, corresponding to the claimed shared local cache.
Therefore, the recited features can be found in the cited combination of references, and independent Claim 1 remains rejected under 35 U.S.C. 103 for the reasons stated above; the cited combination would have been obvious to a person of ordinary skill in the art before the effective filing date of the application. The rejections under 35 U.S.C. 103 are maintained.
Regarding C, the examiner respectfully disagrees. Independent Claim 1 remains rejected for the reasons stated above, and the combinations cited would have been obvious to a person of ordinary skill in the art prior to the effective filing date of the application. Therefore, contrary to Applicant’s arguments, because independent Claims 17 and 20 contain similar limitations to unpatentable Claim 1 and do not add limitations that overcome the rejection, they likewise remain rejected, and the application is not in condition for allowance.
Regarding D, the examiner respectfully disagrees. Independent claims 1, 17, and 20 remain rejected for the reasons stated above, and the combinations cited would have been obvious to a person of ordinary skill in the art prior to the effective filing date of the application. Therefore, contrary to Applicant’s arguments, because the dependent claims depend on unpatentable claims and do not add limitations that overcome the rejection, they likewise remain rejected, and the application is not in condition for allowance.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Gurfinkel et al. (US 20210149734 A1) discloses a graphics pipeline, a job/command queue, shader cores, and shared local memory for sub-cores (see [0089], [0102], [0108], [0113], and [0174]).
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARAZ T AKBARI whose telephone number is (571)272-4166. The examiner can normally be reached Monday-Thursday 9:30am-7:30pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, April Blair, can be reached at (571)270-1014. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/FARAZ T AKBARI/Examiner, Art Unit 2196
/APRIL Y BLAIR/Supervisory Patent Examiner, Art Unit 2196