Prosecution Insights
Last updated: April 19, 2026
Application No. 17/967,768

CHAINED ACCELERATOR OPERATIONS WITH STORAGE FOR INTERMEDIATE RESULTS

Status: Non-Final Office Action (§103), Round 1
Filed: Oct 17, 2022
Examiner: WANG, JIN CHENG
Art Unit: 2617
Tech Center: 2600 — Communications
Assignee: Intel Corporation

At a glance: 59% grant probability (moderate) · 1-2 expected OA rounds · 3y 7m to grant · 69% grant probability with interview

Examiner Intelligence

Career allow rate: 59% (492 granted / 832 resolved), -2.9% vs Tech Center average
Interview lift: +10.3% (moderate), measured across resolved cases with interview
Typical timeline: 3y 7m average prosecution; 40 applications currently pending
Career history: 872 total applications across all art units

Statute-Specific Performance

§101: 11.8% (-28.2% vs TC avg)
§102: 7.6% (-32.4% vs TC avg)
§103: 62.7% (+22.7% vs TC avg)
§112: 15.5% (-24.5% vs TC avg)

Tech Center averages are estimates. Based on career data from 832 resolved cases.

Office Action: §103 (Non-Final)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7, 13-19, 23, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Madugula et al., US-PGPUB No. 2024/0069736 (hereinafter Madugula), in view of Mody et al., US-PGPUB No. 2020/0210351 (hereinafter Mody), and Hung et al., US-PGPUB No. 2023/0049442 (hereinafter Hung).

Re Claim 1: Madugula at least suggests an apparatus comprising a first accelerator having support for a chained accelerator operation, the first accelerator to be controlled as part of the chained accelerator operation to access input data from a source memory location in system memory, process the input data, generate first intermediate data, and store the first intermediate data to a storage.

Madugula teaches at FIG. 4 and Paragraph 0055 that the first accelerator processing subsystem 112(0) includes a GPC 208(0), PP memory 204(0), and an MMU 320(0), and that GPC 208(0) includes SMs 310(0) and a direct memory access controller 410(0); likewise, the second accelerator processing subsystem 112(1) includes a GPC 208(1), PP memory 204(1), and an MMU 320(1), and GPC 208(1) includes SMs 310(1) and a direct memory access controller 410(1). Madugula teaches at Paragraph 0033 that PPU 202 comprises a graphics processing unit (GPU) that may be configured to implement a graphics rendering pipeline to perform various operations related to generating pixel data based on graphics data supplied by CPU 102 and/or system memory 104, and at Paragraph 0040 that crossbar unit 210 has a connection to I/O unit 205, in addition to a connection to PP memory 204 via memory interface 214, thereby enabling the processing cores within the different GPCs 208 to communicate with system memory 104 or other memory not local to PPU 202. Madugula teaches at Paragraph 0041 that PPU 202 is configured to transfer data from system memory 104 and/or PP memory 204 to one or more on-chip memory units, process the data, and write result data back to system memory 104 and/or PP memory 204; the result data may then be accessed by other system components, including CPU 102, another PPU 202 within accelerator processing subsystem 112, or another accelerator processing subsystem 112 within computer system 100. Madugula teaches at Paragraph 0056 that the SMs 310(0) perform various processing operations by means of a set of functional execution units and employ the load-store units to store data to the local PP memory 204(0) and to the remote PP memory 204(1).

Madugula likewise at least suggests a second accelerator having support for the chained accelerator operation, the second accelerator to be controlled as part of the chained accelerator operation to receive the first intermediate data from the storage, without the first intermediate data having been sent to the system memory, process the first intermediate data, and generate additional data (Madugula, Paragraph 0041, cited above).

Madugula does not explicitly teach a chained accelerator operation, but implicitly teaches one in view of Mody. Mody teaches the recited first accelerator. Mody teaches at FIGS. 3-4 and Paragraph 0041 that the processing circuitry 402, the processing circuitry 406, and the processing circuitry 410 are arranged in series, such that the processing circuitry 406 processes the output of the processing circuitry 402 and the processing circuitry 410 processes the output of the processing circuitry 406. Mody teaches at Paragraph 0042 that the processing circuitry 402, 406, and 410 include circuitry to process a video stream in real time (e.g., a line of video at a time), and that data may be transferred between them in units of a pixel. Mody teaches at Paragraph 0043 that the load/store engine 412 retrieves image data from the shared memory 214 for processing by the stream processing accelerator 400, and transfers data processed by the stream processing accelerator 400 to the shared memory 214 for storage. Mody teaches at Paragraph 0044 that some implementations of the stream processing accelerator 400 may apply the load/store engine 412 to transfer data between two instances of the processing circuitry; for example, the load/store engine 412 may transfer output of the processing circuitry 402 to a circular buffer formed in the shared memory 214, and transfer data from the circular buffer as input to the processing circuitry 406.

Mody also teaches the recited second accelerator, which receives the first intermediate data from the storage without the data having been sent to system memory. Mody teaches at Paragraph 0014 that the real-time processing subsystem may include an output first-in-first-out (FIFO) memory to buffer output of the chain of accelerators for transfer to external memory by a dedicated direct memory access (DMA) controller. Mody teaches at Paragraph 0028 that, in the configuration of FIG. 3A, the stream accelerator 204 receives and processes the video stream 308 (e.g., one line of video at a time) and writes the results 310 to a circular buffer 302 formed in the shared memory 214; the circular buffer 302 may be implemented using one or more of the banks 216-222 of the shared memory 214; the depth (i.e., the number of units of storage) of the circular buffer 302, and of all circular buffers formed in the shared memory 214, is variable via software configuration to accommodate the size and format of data transferred and the transfer latency between hardware accelerators (for example, configuration information provided to the stream accelerators 204 and 206 by the GPP 106 may set the depth of the circular buffer 302); and the stream accelerator 206 retrieves the processed image data 312 from the circular buffer 302 for further processing. Mody teaches at Paragraph 0029 that the unit of storage applied in a circular buffer may vary with the data source: the stream accelerator 204 processes the video stream 308 received from the camera 102 line by line, so the unit of storage is a line with respect to the stream accelerator 204, while other sources, such as the memory-to-memory accelerator 208, may process data and write to a circular buffer in units of two-dimensional blocks. Mody teaches at Paragraph 0030 that the stream accelerator 206 processes the image data 312 retrieved from the circular buffer 302 and writes the results 314 to a circular buffer 304 formed in the shared memory 214, from which the memory-to-memory accelerator 208 retrieves the processed image data 316 for further processing; the depth of the circular buffer 304 is likewise software configurable, may be sized to store multiple lines of image data because the memory-to-memory accelerator 208 may process blocks spanning multiple lines, and may also be a function of the processing performed by the memory-to-memory accelerator 208.

It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Mody's chained accelerator operation into Madugula's graphics processing system, which includes a series of interconnected accelerators that read/load and write/store computational results to a storage. One of ordinary skill in the art would have been motivated to provide a memory storage dedicated for the accelerators.
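The chaining pattern the examiner reads onto Mody, where an upstream stage writes intermediate results into a software-configurable circular buffer in shared memory and a downstream stage consumes them without a round trip through system memory, can be sketched compactly. The following is a minimal single-threaded C sketch of that reading only; the buffer depth, the stage functions, and every identifier are illustrative assumptions rather than anything taken from the cited references.

```c
/* Minimal sketch of the circular-buffer chaining pattern attributed to Mody:
 * stage 1 writes intermediate results into a circular buffer carved out of
 * shared memory, and stage 2 drains them from there, so the intermediate
 * data never passes through system memory. All names are illustrative. */
#include <stdio.h>

#define BUF_DEPTH 4                 /* software-configurable depth (cf. Mody Paragraph 0028) */

typedef struct {
    int slots[BUF_DEPTH];           /* one unit of intermediate data per slot */
    int head, tail, count;
} circ_buf;

static void cb_push(circ_buf *b, int v) {
    b->slots[b->head] = v;
    b->head = (b->head + 1) % BUF_DEPTH;
    b->count++;
}

static int cb_pop(circ_buf *b) {
    int v = b->slots[b->tail];
    b->tail = (b->tail + 1) % BUF_DEPTH;
    b->count--;
    return v;
}

static int stage1_process(int in)  { return in * 2; }   /* first accelerator  */
static int stage2_process(int mid) { return mid + 1; }  /* second accelerator */

int main(void) {
    int system_memory[8] = {1, 2, 3, 4, 5, 6, 7, 8};    /* source data */
    circ_buf shared = {0};          /* circular buffer in shared on-chip memory */

    for (int i = 0; i < 8; i++) {
        /* stage 1: read from system memory, store intermediate to the buffer */
        cb_push(&shared, stage1_process(system_memory[i]));
        /* stage 2: drain the buffer whenever it fills (or at end of input) */
        if (shared.count == BUF_DEPTH || i == 7)
            while (shared.count > 0)
                printf("final result: %d\n", stage2_process(cb_pop(&shared)));
    }
    return 0;
}
```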
Hung also teaches the recited apparatus. Hung teaches at Paragraph 0219 that the accelerator(s) may read from the shared memory for configuration parameters and/or input data structures, and may write the output result data structures to shared memory. Hung teaches at Paragraph 0227 that an application programmer may program the processor(s) 802 and the accelerator(s) 804 with knowledge of what each is capable of, so that the application program may be split into parts, some for the processor(s) 802 and some for the accelerator(s) 804, and processing may thus be executed in parallel to decrease runtime and increase efficiency. A configuration message, shared via the accelerator interface and/or via shared memory 806, may be generated by the processor(s) 802 and used to indicate to the accelerator(s) 804 where the data to process starts in shared memory 806, how much data to process, and where to write the results back in the shared memory 806. The processor(s) 802 may generate an input buffer in the shared memory 806 at the specified location that includes the data for the accelerator(s) 804 to operate on. Once the configuration message is transmitted and the input buffer(s) are stored in shared memory 806, the accelerator(s) 804 may receive a trigger signal from the processor(s) 802 via the event interface (e.g., a programming interface), and the accelerator(s) 804 may begin processing the data. Once the accelerator(s) 804 is triggered, the processor(s) 802 may perform other work or enter a low power state, and once the accelerator(s) 804 is finished processing, it may indicate the same to the processor(s) 802 and wait for additional work. Hung teaches at Paragraph 0228 that the accelerator(s) 804, once aware of the particular mode or function, may configure the registers to properly read the data from memory 806, process it, and write it back to memory 806. Hung teaches at Paragraphs 0234-0236 that the method 850 includes, at block B806, reading data from an input buffer in the memory based at least in part on an indication of the input buffer's starting location included in the configuration information; at block B808, processing the data from the input buffer to compute output data; and at block B810, writing the output data to the memory at a location determined based at least in part on the configuration information, after which the accelerator(s) 804 may indicate to the processor(s) 802 that processing is complete, and the processor 802 may then use the output data to perform one or more second processing tasks of the processing pipeline. The same disclosures at Paragraphs 0219, 0227-0228, and 0234-0236 likewise teach the recited second accelerator, which receives the first intermediate data from the storage, without the first intermediate data having been sent to the system memory, processes it, and generates additional data.

It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have combined the features of Hung with Madugula's graphics processing system, whose accelerators read/load and write/store computational results to a storage. One of ordinary skill in the art would have been motivated to provide a memory storage dedicated for the accelerators.
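Hung's configure-trigger-complete handshake, as the examiner characterizes it, reduces to three pieces of state: a configuration message saying where the input starts, how much to process, and where to write results; a trigger; and a completion indication. The sketch below models that flow on the host in C; the struct fields, the multiply-by-three "processing," and all names are assumptions for illustration, not Hung's actual interface.

```c
/* Hedged sketch of the configure/trigger/complete flow attributed to Hung
 * (Paragraphs 0227, 0234-0236): the host describes the input buffer start,
 * the amount of data, and the output location; the accelerator is triggered,
 * processes, and reports completion. All names are illustrative. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t in_offset;   /* where the data to process starts in shared memory */
    uint32_t in_count;    /* how much data to process                          */
    uint32_t out_offset;  /* where to write the results back                   */
} config_msg;

static int32_t shared_mem[64];   /* stand-in for the shared memory */
static int done_flag;            /* completion indication to the host */

/* Accelerator side: runs when triggered, driven by the configuration message. */
static void accelerator_run(const config_msg *cfg) {
    for (uint32_t i = 0; i < cfg->in_count; i++)
        shared_mem[cfg->out_offset + i] = shared_mem[cfg->in_offset + i] * 3;
    done_flag = 1;               /* "processing is complete" */
}

int main(void) {
    /* Host side: build the input buffer and the configuration message. */
    config_msg cfg = { .in_offset = 0, .in_count = 4, .out_offset = 32 };
    for (uint32_t i = 0; i < cfg.in_count; i++)
        shared_mem[cfg.in_offset + i] = (int32_t)(i + 1);

    accelerator_run(&cfg);       /* function call stands in for the trigger signal */

    if (done_flag)               /* host consumes results for the next pipeline task */
        for (uint32_t i = 0; i < cfg.in_count; i++)
            printf("out[%u] = %d\n", (unsigned)i, (int)shared_mem[cfg.out_offset + i]);
    return 0;
}
```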
Re Claim 2: Claim 2 encompasses the same scope as claim 1, with the additional limitation that the storage corresponds to, and is dedicated for use by, at least a portion of one of the first and second accelerators. Madugula teaches this limitation at FIG. 4 and Paragraphs 0041 and 0055, discussed above: each accelerator processing subsystem 112 includes its own PP memory 204, and result data written back to PP memory 204 may be accessed by another PPU 202 or another accelerator processing subsystem 112. Mody further teaches this limitation at Paragraphs 0014, 0028-0030, 0041, and 0044, discussed above: circular buffers 302 and 304, formed in shared memory 214 with software-configurable depth, carry intermediate results between successive accelerators in the chain. The combination rationale given for claim 1 applies equally here.

Re Claim 3: Claim 3 encompasses the same scope as claim 2, with the additional limitation that the storage corresponds to, and is dedicated for use by, all of said one of the first and second accelerators. Madugula teaches this limitation at FIG. 4 and Paragraphs 0041 and 0055, and Mody further teaches it at Paragraphs 0014, 0028-0030, 0041, and 0044, all discussed above, with the same combination rationale.
Re Claim 4: Claim 4 encompasses the same scope as claim 2, with the additional limitation that the storage corresponds to, and is dedicated for use by, only a subset of accelerator resources of said one of the first and second accelerators.

Madugula teaches this limitation at Paragraphs 0080-0081. The SM 310(0) generates three store operations 540, 542, and 544 and transmits them to the remote high-speed hub 420(1), which forwards them to the remote PP memory 204(1). The SM 310(0) then generates a memory synchronization operation 546 and transmits it to the remote PP memory 204(1) via the remote high-speed hub 420(1). When the remote PP memory 204(1) determines that the three store operations have completed, and in response to the memory synchronization operation 546, it generates an acknowledgement (Ack) 550 and transmits it to the SM 310(0), indicating that the three store operations have completed. In response, the SM 310(0) generates an atomic operation 580 to set a flag stored in the remote PP memory 204(1). Meanwhile, the remote SM 310(1) polls the flag to determine whether the three store operations have completed. It generates a poll 560; at that time the store operations have not yet completed, so the remote PP memory 204(1) returns a status 562 indicating that the data is not yet available for loading. It generates another poll 564; by then the store operations have completed, the memory synchronization operation 546 has been received, and the acknowledgement 550 has been transmitted, but the atomic operation 580 that updates the flag has not yet been received, so the remote PP memory 204(1) returns a status 566 indicating that the data is still not available. It generates another poll 568; by then the PP memory 204(1) has received and processed the atomic operation 580, so it returns a status 570 indicating that the data is available for loading. In response, the remote SM 310(1) performs three load operations 590, 592, and 594 to load the data stored by the three store operations 540, 542, and 544, respectively. Madugula's FIG. 4 and Paragraphs 0041 and 0055, discussed above, further describe the per-subsystem PP memories 204(0) and 204(1).

Mody further teaches this limitation at FIG. 3A, which shows the circular buffer 302 dedicated only to the stream accelerators 204 and 206 and the circular buffer 304 dedicated only to the stream accelerators 206 and 208, together with Paragraphs 0014, 0028-0030, 0041, and 0044, discussed above. The combination rationale given for claim 1 applies equally here.
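The Madugula passage just cited describes a classic producer-consumer handshake: stores to a remote memory, a synchronization and acknowledgement, an atomic flag update, and a consumer that polls the flag before loading. A minimal host-side analogue using C11 atomics and POSIX threads is sketched below; the release/acquire flag stands in for the synchronization operation 546, acknowledgement 550, and atomic operation 580, and all names are illustrative assumptions.

```c
/* Hedged analogue of the store-then-flag handshake cited from Madugula
 * Paragraphs 0080-0081: a producer SM stores data to remote memory and sets
 * a flag once the stores are complete; the consumer SM polls the flag and
 * loads the data only after the flag is set. Compile with -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int remote_pp_memory[3];     /* destination of the three stores        */
static atomic_int data_ready_flag;  /* flag set by the producer's atomic op   */

static void *producer_sm(void *arg) {
    (void)arg;
    for (int i = 0; i < 3; i++)     /* store operations (cf. 540, 542, 544)   */
        remote_pp_memory[i] = (i + 1) * 10;
    /* release ordering makes the stores visible before the flag is observed,
     * standing in for the synchronization/acknowledgement round trip */
    atomic_store_explicit(&data_ready_flag, 1, memory_order_release);
    return NULL;
}

static void *consumer_sm(void *arg) {
    (void)arg;
    /* poll until the flag is set (cf. polls 560, 564, 568) */
    while (!atomic_load_explicit(&data_ready_flag, memory_order_acquire))
        ;
    for (int i = 0; i < 3; i++)     /* load operations (cf. 590, 592, 594)    */
        printf("loaded %d\n", remote_pp_memory[i]);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer_sm, NULL);
    pthread_create(&p, NULL, producer_sm, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```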
Re Claim 5: Claim 5 encompasses the same scope as claim 2, with the additional limitation that the storage corresponds to, and is dedicated for use by, only a subset of virtual accelerator resources of said one of the first and second accelerators.

Madugula teaches this limitation at Paragraphs 0080-0081 (the store, synchronize, flag, and poll sequence detailed above in the discussion of claim 4) and at FIG. 4 and Paragraphs 0041 and 0055, discussed above.

Mody further teaches this limitation at FIG. 3A (circular buffer 302 dedicated only to the stream accelerators 204 and 206; circular buffer 304 dedicated only to the stream accelerators 206 and 208) and at Paragraph 0052: the shared memory access circuitry 604 also includes virtual line conversion circuitry 610 that implements a "virtual line mode." The virtual line mode partitions a data stream into "lines" of any length to provide data to processing circuitry in units that promote efficient processing. For example, the shared memory access circuitry 604 may access the shared memory 214 to retrieve an entire row of pixel data from an image, or a specified number of bytes of pixel or other data (e.g., multiple lines), as best promotes efficient processing by processing circuitry coupled to the load/store engine 600. Similarly, output of processing circuitry that includes an arbitrary number of data units (bits, bytes, etc.) may be broken into lines by the virtual line conversion circuitry 610, where each line includes a predetermined number of data units, and written to the shared memory 214 as lines to optimize memory transfer efficiency. Mody's Paragraphs 0014, 0028-0030, 0041, and 0044, discussed above, also apply. The combination rationale given for claim 1 applies equally here.
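Mody's "virtual line mode," quoted above for claim 5, is essentially a re-chunking step: a stream of arbitrary length is cut into lines of a predetermined size before being written to shared memory. The sketch below shows that idea only; the line size, function names, and output format are assumptions for illustration.

```c
/* Minimal sketch of the "virtual line mode" quoted from Mody Paragraph 0052:
 * an arbitrary-length output stream is broken into fixed-size virtual lines
 * before being written to shared memory. All names are illustrative. */
#include <stdio.h>

#define LINE_BYTES 8   /* predetermined number of data units per virtual line */

static void write_line_to_shared_mem(const unsigned char *line, size_t n) {
    printf("line (%zu bytes):", n);
    for (size_t i = 0; i < n; i++)
        printf(" %02x", line[i]);
    printf("\n");
}

/* Break a stream of arbitrary length into virtual lines. */
static void virtual_line_convert(const unsigned char *stream, size_t len) {
    for (size_t off = 0; off < len; off += LINE_BYTES) {
        size_t n = (len - off < LINE_BYTES) ? len - off : LINE_BYTES;
        write_line_to_shared_mem(stream + off, n);
    }
}

int main(void) {
    unsigned char pixels[19];                      /* 19 bytes of sample data */
    for (size_t i = 0; i < sizeof pixels; i++)
        pixels[i] = (unsigned char)i;
    virtual_line_convert(pixels, sizeof pixels);   /* emits lines of 8, 8, 3 */
    return 0;
}
```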
Re Claim 6: Claim 6 encompasses the same scope as claim 2, with the additional limitation that a first connection fr
Read full office action

Prosecution Timeline

Oct 17, 2022: Application Filed
Dec 06, 2022: Response after Non-Final Action
Oct 31, 2025: Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12594883: DISPLAY DEVICE FOR DISPLAYING PATHS OF A VEHICLE. Granted Apr 07, 2026 (2y 5m to grant).
Patent 12597086: Tile Region Protection in a Graphics Processing System. Granted Apr 07, 2026 (2y 5m to grant).
Patent 12592012: METHOD, APPARATUS, ELECTRONIC DEVICE AND READABLE MEDIUM FOR COLLAGE MAKING. Granted Mar 31, 2026 (2y 5m to grant).
Patent 12586270: GENERATING AND MODIFYING DIGITAL IMAGES USING A JOINT FEATURE STYLE LATENT SPACE OF A GENERATIVE NEURAL NETWORK. Granted Mar 24, 2026 (2y 5m to grant).
Patent 12579709: IMAGE SPECIAL EFFECT PROCESSING METHOD AND APPARATUS. Granted Mar 17, 2026 (2y 5m to grant).

Study what changed to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA rounds: 1-2
Grant probability: 59% (69% with interview, a +10.3% lift)
Median time to grant: 3y 7m
PTA risk: Low

Based on 832 resolved cases by this examiner. Grant probability derived from career allow rate.
