Prosecution Insights
Last updated: April 19, 2026
Application No. 18/528,333

COMPUTATION OFFLOAD REQUESTS WITH DENIAL RESPONSE

Non-Final OA — §102, §103
Filed: Dec 04, 2023
Examiner: MILLER, DANIEL E
Art Unit: 2194
Tech Center: 2100 — Computer Architecture & Software
Assignee: Nvidia Corporation
OA Round: 1 (Non-Final)
Grant Probability: 41% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 3y 8m
With Interview: 78%

Examiner Intelligence

Career Allow Rate: 41% (22 granted / 54 resolved; -14.3% vs. TC average) — grants 41% of resolved cases
Interview Lift: +36.9% for resolved cases with an interview vs. without (a strong lift)
Typical Timeline: 3y 8m average prosecution; 10 applications currently pending
Career History: 64 total applications across all art units
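These figures are internally consistent: 22 granted out of 54 resolved is 22/54 ≈ 40.7%, which rounds to the 41% career allow rate, and adding the +36.9% interview lift to that baseline yields roughly 78%, the with-interview grant probability shown above.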

Statute-Specific Performance

§101: 22.3% (-17.7% vs. TC avg)
§103: 38.7% (-1.3% vs. TC avg)
§102: 15.7% (-24.3% vs. TC avg)
§112: 19.6% (-20.4% vs. TC avg)
Tech Center averages are estimates • Based on career data from 54 resolved cases

Office Action

§102 §103
Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. Drawings The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference sign(s) mentioned in the description: In paragraphs [0051]-[0066], GPC 450 is mentioned multiple times, but not included in FIG. 4. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance. Specification The disclosure is objected to because of the following informalities: In paragraph [0038] at line 3, the term “any source data (src) 240” should be re-written to include ““source location (src) 240 and any source data 255 needed ...”. In paragraph [0038] lines 6-7, “source data 240” should read “source data 255”. Appropriate correction is required. Claim Objections Claim 20 is objected to because of the following informalities: In Claim 20 line 1, “The system of claim 10,” should read “The system of claim 14,”. Appropriate correction is required. Claim Interpretation In claim 1 lines 2-3, the term “offload request” is used. In claim 1 lines 11-12, the term “denial message” is used. The following excerpts from the specification are not intended to limit the scope of the claim. However, these excerpts are useful in interpreting the broadest reasonable interpretation of “offload” and “denial message” in light of the specification. The first excerpt is: However, when a processing tile receives multiple offload requests, the processing tile may become a processing bottleneck, resulting in processing congestion. Multiple offload requests directed to the same processing tile may cause traffic congestion on inter-tile connections. The ability to deny a computation offload request enables computation offloading with reduced congestion. (Specification [0019] lines 3-7). The second excerpt is: As shown in Figure 1A, the processing tile 115-5 performs the offloaded computations associated with the offload requests 130 and 140, returning Acks 135 and 145, respectively. However, the processing tile 115-5 denies the offload request 120 and responds to the processing tile 115-8 with a denial 125 (NAck). (Specification [0027] lines 1-4). The third and last excerpt is: In an embodiment, the computation offload packet 230 functions as an active message that triggers execution of the computation. (Specification [0038] lines 5-6). 
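For orientation only, the offload/denial exchange described in these excerpts can be pictured with a minimal sketch. This is an illustrative model, not the applicant's implementation and not Putnam's microarchitecture; the class, method, and threshold names are assumptions introduced for the example.

```python
# Illustrative sketch of an offload request that can be denied under a congestion
# criterion (here, occupancy of an offload request buffer). All names are hypothetical.

class ProcessingTile:
    def __init__(self, tile_id, buffer_capacity=4):
        self.tile_id = tile_id
        self.buffer_capacity = buffer_capacity   # one possible congestion criterion
        self.pending = []                        # offload request buffer

    def receive_offload(self, request):
        # Evaluate the congestion criterion before accepting the request.
        if len(self.pending) >= self.buffer_capacity:
            return ("NAck", None)                # denial message back to the requester
        self.pending.append(request)
        result = request["compute"](*request["operands"])
        self.pending.remove(request)             # request completes, freeing buffer space
        return ("Ack", result)


def send_offload(target, request, max_retries=3):
    """Requester transmits an offload request and retries after a denial (cf. claims 6-7)."""
    for _ in range(max_retries):
        status, result = target.receive_offload(request)
        if status == "Ack":
            return result
    # Still denied after retries: fall back to executing the computation locally (cf. claim 9).
    return request["compute"](*request["operands"])


tile_5 = ProcessingTile("tile-5")
print(send_offload(tile_5, {"compute": lambda a, b: a + b, "operands": (2, 3)}))  # -> 5
```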
Note, these excerpts are useful for determining BRI of all of the claims. In claim 21 lines 8-13 and the dependent claims 22-23, that incorporate the language of claim 21 by reference, two if statements are used. Applicant is reminded of how “contingent limitations” are interpreted: The broadest reasonable interpretation of a method (or process) claim having contingent limitations requires only those steps that must be performed and does not include steps that are not required to be performed because the condition(s) precedent are not met. For example, assume a method claim requires step A if a first condition happens and step B if a second condition happens. If the claimed invention may be practiced without either the first or second condition happening, then neither step A or B is required by the broadest reasonable interpretation of the claim. If the claimed invention requires the first condition to occur, then the broadest reasonable interpretation of the claim requires step A. If the claimed invention requires both the first and second conditions to occur, then the broadest reasonable interpretation of the claim requires both steps A and B. (See MPEP 2111.04(II)). Claim Rejections - 35 USC § 102 The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. Claim(s) 1, 3-7, 14, and 16-18 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by “The Microarchitecture of a Pipelined WaveScalar Processor: An RTL-based Study” (2005-Putnam) With respect to claim 1, Putnam teaches A method, comprising (see section 4.1, [page 3 col 2 paragraph 6]-[page 4 col 2 paragraph 2]; the method is two runs of the execution pipeline shown [page 4 col 1 paragraph 2], where "generating" starts at "4. Execute" stage in the first pass through the pipeline, "transmitting" is the "5. Output" stage, and then evaluating congestion criteria... and transmitting a denial message... is the "1. Input" stage of the second pass through the pipeline): generating, by a first processing tile in a topology including processing tiles, an offload request for executing a computation (4. Execute: An instruction executes. Its result goes to the output queue, [page 4 col 1 paragraph 2 bullet 4]; where these are the five pipe stages of a PE, [page 4 col 1 paragraph 2 line 1]; and a PE (processing element) is a tile part of a large topology of processing elements as shown in FIG. 2 with a specific example in FIG. 1, [page 4]), wherein each processing tile includes one or more processing units that are each directly coupled to a local memory of a total memory (Each PE contains a functional unit, specialized memories to hold operands, and logic to control instruction execution and communication, [page 3 col 1 paragraph 4 lines 1-3]; a spec of the “total memory” is provided in Table 1, [page 4]) and the processing tile is indirectly coupled to the local memories included within each of the remaining processing tiles (see the right panel of FIG. 
1, showing data from one processing element can be passed to another processing element "indirectly", [page 4]; Table 1 gives details of the microarchitecture such as the capacity for instructions stored in each PE, the input queue capacity, and the output queue capacity, and the last column of Table 1 describes the "network latency", which shows how many cycles it takes a PE to "offload" from one PE to another PE, [page 4]); transmitting the offload request from the first processing tile to a second processing tile (5. OUTPUT: Instruction outputs are sent via the output bus to their consumer instructions, ... a remote PE, [page 4 col 1 paragraph 2 bullet 5]), wherein data needed for the computation is stored in the local memory within the second processing tile (an example is shown in the right most panel of FIG. 1, the result of the Shift Left #2 instruction is sitting in the PE in the last column on the right in the second row of PEs, and this PE is waiting for the result of the Mul #2 operation from the left neighboring PE, [page 4]; the actual location in local memory is the matching table; for more details see cycle 0 and cycle 1 in FIG. 4, showing A[1] sitting in the matching table, while A[0] queues up, [page 5]); evaluating congestion criteria by the second processing tile (INPUT: Operand messages arrive at the PE either from another PE or from itself. The PE may reject messages if too many arrive in one cycle; the senders will then retry on a later cycle, [page 4 col 1 paragraph 2 bullet 1]; the congestion criteria is qualitatively described as "too many", but further details are given later, "INPUT will accept inputs from up to four of these busses each cycle. If more than four arrive during one cycle, an arbiter selects among them, [page 5 col 1 paragraph 1 lines 1-3]); and in response to determining that the second processing tile is congested, transmitting a denial message from the second processing tile to the first processing tile (INPUT: Operand messages arrive at the PE either from another PE or from itself. The PE may reject messages if too many arrive in one cycle; the senders will then retry on a later cycle, [page 4 col 1 paragraph 2 bullet 1]; rejecting messages is transmitting a denial message; further details are given later, described with respect to the first processing tile, "It is possible, however, that the consumer cannot handle the value that cycle and will reject it. ACK/NACK signals require four cycles for the round trip", [page 7 col 2 paragraph 3 lines 3-5]; here the "round trip" means the result is output from the first PE to the second PE, and an ACK/NACK signal is given response, from the second PE back to the first PE, where "reject" would specifically mean a NACK signal is given in response.). With respect to claim 3, Putnam teaches all of the limitations of claim 1, as noted above. Putnam further teaches wherein the offload request includes source data needed for the computation (called an “operand message” because the request includes the “operand” meaning the input needed for the operation/computation, [page 4 col 1 paragraph 2 bullet 1]; more detail about the additional tags provided at [page 5 col 1 paragraph 2]). With respect to claim 4, Putnam teaches all of the limitations of claim 1, as noted above. Putnam further teaches each local memory comprises a memory stack that is aligned with the one or more processing units in either a vertical or horizontal direction (FIG. 
9 shows an architecture with local memory horizontally aligned with one or more PEs, [page 9]). With respect to claim 5, Putnam teaches all of the limitations of claim 1, as noted above. Putnam further teaches wherein the congestion criteria include a measure of at least one of a processing workload, a processing resource availability, and an offload request buffer capacity (INPUT: Operand messages arrive at the PE either from another PE or from itself. The PE may reject messages if too many arrive in one cycle; the senders will then retry on a later cycle, [page 4 col 1 paragraph 2 bullet 1]; “too many” corresponds with buffer capacity; processing workload and processing resource availability are also discussed with respect to the matching table, which are called “matching table misses”, [page 5 col 2 paragraph 2]). With respect to claim 6, Putnam teaches all of the limitations of claim 1, as noted above. Putnam further teaches retransmitting the offload request from the first processing tile to the second processing tile (Operand messages arrive at the PE either from another PE or from itself. The PE may reject messages if too many arrive in one cycle; the senders will then retry on a later cycle, [page 4 col 1 paragraph 2 bullet 1]; see specifically “retry”). With respect to claim 7, Putnam teaches all of the limitations of claim 6, as noted above. Putnam further teaches evaluating the congestion criteria by the second processing tile; and in response to determining that the second processing tile is not congested, executing the computation by the second processing tile (Operand messages arrive at the PE either from another PE or from itself. The PE may reject messages if too many arrive in one cycle; the senders will then retry on a later cycle, [page 4 col 1 paragraph 2 bullet 1]; see specifically “retry”; “retry” means the entire pipeline is repeated until the Ack signal is received, see “After several retries, operands resident in the matching table are evicted to memory, and the newly empty line is allocated to the new operand”, [page 5 col 2 paragraph 2 lines 10-12]). With respect to claim 14, Putnam teaches A system, comprising a plurality of processing tiles in a topology (see processing elements in a topology in Fig. 2, [page 4]), wherein each processing tile includes one or more processing units that are each directly coupled to a local memory of a total memory (Each PE contains a functional unit, specialized memories to hold operands, and logic to control instruction execution and communication, [page 3 col 1 paragraph 4 lines 1-3]) and the processing tile is indirectly coupled to the local memories that are included within each of the remaining processing tiles, and the plurality of processing tiles are configured to: (see the right panel of FIG. 1, showing data from one processing element can be passed to another processing element "indirectly", [page 4]; Table 1 gives details of the microarchitecture such as the capacity for instructions stored in each PE, the input queue capacity, and the output queue capacity, and the last column of Table 1 describes the "network latency", which shows how many cycles it takes a PE to "offload" from one PE to another PE, [page 4]). Regarding the rest of claim 14, incorporating the rejection of claim 1, claim 14 is rejected for a substantially similar rationale. With respect to claim 16, incorporating the rejection of claim 3 and claim 14, claim 16 is rejected for a substantially similar rationale. 
With respect to claim 17, incorporating the rejection of claim 4 and claim 14, claim 17 is rejected for a substantially similar rationale. With respect to claim 18, incorporating the rejection of claim 5 and claim 14, claim 18 is rejected for a substantially similar rationale. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 2, 9, and 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over “The Microarchitecture of a Pipelined WaveScalar Processor: An RTL-based Study” (2005-Putnam) in view of US 11,010,297 B2 (Hansson) With respect to claim 2, Putnam teaches all of the claim 1, as noted above. Putnam does not teach wherein the denial message includes at least a portion of the data needed for the computation. However, Hansson teaches wherein the denial message includes at least a portion of the data needed for the computation (In one embodiment, the process of FIG. 9 can be altered so that the data required to perform an operation may be propagated upstream with the NAck signal issued by a final level memory unit, until a memory unit is reached where the operation can be performed. In particular, if at step 912 it is determined that the memory unit is the last level of the memory system, then when the NAck signal is sent at step 914, that last level memory unit may also propagate the data required for the operation back to the upstream memory unit along with the NAck signal, [col 11 ln 15-26]). It would have been obvious to one skilled in the art before the effective filing date to combine Putnam with Hansson because a teaching, suggestion, or motivation in the prior art would have led one skilled in the art to combine prior art teaching to arrive at the claimed invention. Putnam discloses a system and method that teaches all of the claimed features except for sending data with the NAck signal. Hansson teaches why this is process is beneficial: The upstream memory unit will be waiting at step 920 for the Ack/NAck signal, and when the NAck signal is received, it will not only update its downstream capabilities register at step 922 to identify that there are no downstream memory units capable of performing the operation, but will also evaluate whether it is capable of performing the operation. If it can perform the operation, then it will perform the operation locally using the data that has been returned from the last level memory unit, and thereafter will propagate an Ack signal upstream. (Hansson [col 11 ln 26-35]). ... which will result at some point in the operation being performed by the first upstream memory that is capable of performing that operation. (Hansson [col 11 ln 40-42]). The NAck signal indicates that the downstream memory units are not capable of executing the instruction, so this algorithm ensures that the first upstream memory unit that is capable of performing the operation has the data to perform the operation. 
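The denial-with-data behavior quoted from Hansson above can be pictured roughly as follows. This is a hedged sketch under the assumption of a simple two-level request chain; the names are hypothetical and are not Hansson's actual interfaces.

```python
# Illustrative sketch: a denial (NAck) that carries the operand data back upstream
# so the first capable upstream unit can perform the operation locally.

class MemoryUnit:
    """Toy memory unit; 'capable' marks whether it can execute the offloaded operation."""
    def __init__(self, name, capable, data=None):
        self.name, self.capable, self.data = name, capable, data

    def try_operation(self, operation, operand_key):
        if not self.capable:
            # Deny, and send the needed data back upstream along with the NAck.
            return False, self.data.get(operand_key) if self.data else None
        return True, operation(self.data[operand_key])


def offload(upstream, downstream, operation, operand_key):
    ack, payload = downstream.try_operation(operation, operand_key)
    if ack:
        return payload            # downstream performed the operation
    if upstream.capable:
        return operation(payload) # upstream performs it using the data returned with the NAck
    return None                   # would propagate the NAck further upstream


last_level = MemoryUnit("last-level", capable=False, data={"x": 21})
upstream = MemoryUnit("upstream", capable=True)
print(offload(upstream, last_level, lambda v: v * 2, "x"))  # -> 42
```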
A person having skill in the art would have a reasonable expectation of having operations execute as quickly as possible in the system and method of Putnam by modifying Putnam with the data transfer along with the NAck signal of Hansson. Therefore, it would have been obvious to combine Putnam with Hansson to a person having ordinary skill in the art, and this claim is rejected under 35 U.S.C. 103. With respect to claim 9, Putnam teaches all of the limitations of claim 1, as noted above. Putnam does not teach wherein in response to receiving the denial message, the first processing tile executes the computation. However, Hansson teaches wherein in response to receiving the denial message, the first processing tile executes the computation The upstream memory unit will be waiting at step 920 for the Ack/NAck signal, and when the NAck signal is received, it will not only update its downstream capabilities register at step 922 to identify that there are no downstream memory units capable of performing the operation, but will also evaluate whether it is capable of performing the operation. If it can perform the operation, then it will perform the operation locally using the data that has been returned from the last level memory unit, and thereafter will propagate an Ack signal upstream, [col 11 ln 26-35]). It would have been obvious to one skilled in the art before the effective filing date to combine Putnam with Hansson because a teaching, suggestion, or motivation in the prior art would have led one skilled in the art to combine prior art teaching to arrive at the claimed invention. Putnam discloses a system and method that teaches all of the claimed features except for sending data with the NAck signal. Hansson teaches why this is process is beneficial: The upstream memory unit will be waiting at step 920 for the Ack/NAck signal, and when the NAck signal is received, it will not only update its downstream capabilities register at step 922 to identify that there are no downstream memory units capable of performing the operation, but will also evaluate whether it is capable of performing the operation. If it can perform the operation, then it will perform the operation locally using the data that has been returned from the last level memory unit, and thereafter will propagate an Ack signal upstream. (Hansson [col 11 ln 26-35]). ... which will result at some point in the operation being performed by the first upstream memory that is capable of performing that operation. (Hansson [col 11 ln 40-42]). The NAck signal indicates that the downstream memory units are not capable of executing the instruction, so this algorithm ensures that the first upstream memory unit that is capable of performing the operation has the data to perform the operation. A person having skill in the art would have a reasonable expectation of having operations execute as quickly as possible in the system and method of Putnam by modifying Putnam with the data transfer along with the NAck signal of Hansson. Therefore, it would have been obvious to combine Putnam with Hansson to a person having ordinary skill in the art, and this claim is rejected under 35 U.S.C. 103. With respect to claim 15, incorporating the rejection of claim 2 and claim 14, claim 15 is rejected for a substantially similar rationale. Claim(s) 8 and 19 is/are rejected under 35 U.S.C. 
103 as being unpatentable over “The Microarchitecture of a Pipelined WaveScalar Processor: An RTL-based Study” (2005-Putnam) in view of US 2014/0301241 A1 (Kumar) With respect to claim 8, Putnam teaches all of the limitations of claim 1, as noted above. Putnam does not teach determining that a distance between the first processing tile and the second processing tile exceeds an offload threshold. However, Kumar teaches determining that a distance between the first processing tile and the second processing tile exceeds an offload threshold (referred to as latency constraints/requirements in the art, see [0044] lines 1-27; for background see: “Shortest path routing may minimize the latency as Such routing reduces the number of hops from the source to the destination. For this reason, the shortest path may also be the lowest power path for communication between the two components, [0008] lines 8-12; for a description of how distance/latency is calculated in “hops” see FIG. 5(a); and for teaching the threshold: “For a route to be eligible the number of already existing router hops along the route in the layer and the number of new routers that must be instantiated when this flow will be mapped must be smaller than the latency requirement of the flow.”, [0044] lines 19-24). It would have been obvious to one skilled in the art before the effective filing date to combine Putnam with Kumar because a teaching, suggestion, or motivation in the prior art would have led one skilled in the art to combine prior art teaching to arrive at the claimed invention. Putnam already discloses a system and method that teaches all of the claimed features as well as the requirements for an instruction mapping algorithm: “instruction mapping algorithm (which maps dynamically as the program executes) are to place dependent instructions near each other to minimize producer-consumer latency, and to spread independent instructions out in order to utilize resources and exploit parallelism”, (Putnam [page 3 col 1 paragraph 4 lines 10-14]). Putnam focuses on the microarchitecture rather than the routing algorithm. Kumar goes into detail on latency/routing issues, (Kumar [0008]), and further explains: Different topologies may provide different latency characteristics and number of hops between various components providing routes between components that have different latencies. Thus, traffic flows may be mapped to the topologies based on their latency requirements. (Kumar [0039] lines 12-17). A person having skill in the art would have a reasonable expectation of successfully routing instructions “near each other to minimize producer-consumer latency” in the system and method Putnam by modifying Putnam with the latency requirements of Kumar. Therefore, it would have been obvious to combine Putnam with Kumar to a person having ordinary skill in the art, and this claim is rejected under 35 U.S.C. 103. With respect to claim 19, incorporating the rejection of claim 8 and claim 14, claim 19 is rejected for a substantially similar rationale. Claim(s) 10 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over “The Microarchitecture of a Pipelined WaveScalar Processor: An RTL-based Study” (2005-Putnam) in view of “Hardware-Accelerated Multi-Tile Streaming for Realtime Remote Visualization” (2018-Biedert) With respect to claim 10, Putnam teaches all of claim 1, as noted above. 
Putnam does not teach wherein at least one of the steps of generating, transmitting, or evaluating is performed on a server or in a data center to generate an image, and the image is streamed to a user device. However, Biedert teaches wherein at least one of the steps of generating, transmitting, or evaluating is performed on a server or in a data center to generate an image, and the image is streamed to a user device (in FIG. 1, see the vertical columns showing the buffer and encode stages in the HPC system, the lateral rows show the parallel processing pipeline; the following passage describes how the image is processed in parallel, “For instance, a server application could split the framebuffer into two tiles and use two concurrent pipelines to better utilize the available hardware encoding units. A similar approach can be employed at client-side, either to decode and display full frames to distinct monitors, or composite partial tiles of a single display, [page 35 col 2 paragraph 2 lines 18-23]). It would have been obvious to one skilled in the art before the effective filing date to combine Putnam with Biedert because a teaching, suggestion, or motivation in the prior art would have led one skilled in the art to combine prior art teaching to arrive at the claimed invention. Putnam discloses a system and method that teaches all of the claimed features except for a particular application (image processing) using server-client infrastructure for the application. Biedert teaches hardware-accelerated multi-tile streaming which is a parallel processing technique similar to WaveScalar applied to this application (i.e., see “Visualization” in the Title of Biedert) using the server-client infrastructure (i.e., “Realtime Remote” in the Title of Biedert). Biedert goes on to explain why such a use case is important: Being able to drive high-resolution displays directly from a remote supercomputer opens up novel use cases. In particular, it enables cheaper infrastructure at the client’s side, as all the heavy lifting is done on the server side. (Biedert [page 33 col 1 paragraph 3 lines 1-4]). A person having skill in the art would have a reasonable expectation of successfully applying the hardware accelerated microarchitecture in the system and method of Putnam to the Application area of Remote Visualization as in Biedert by modifying Putnam with the client-server infrastructure of the image visualization application of Biedert. Therefore, it would have been obvious to combine Putnam with Biedert to a person having ordinary skill in the art, and this claim is rejected under 35 U.S.C. 103. With respect to claim 20, incorporating the rejection of claim 10 and claim 14, claim 20 is rejected for a substantially similar rationale. Claim(s) 11-12, 21, and 23 is/are rejected under 35 U.S.C. 103 as being unpatentable over “The Microarchitecture of a Pipelined WaveScalar Processor: An RTL-based Study” (2005-Putnam) in view of “A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications” (2019-Liu) With respect to claim 11, Putnam teaches all of claim 1, as noted above. Putnam does not teach wherein at least one of the steps of generating, transmitting, or evaluating is performed within a cloud computing environment. 
However, Liu teaches wherein at least one of the steps of generating, transmitting, or evaluating is performed within a cloud computing environment (Table 2 gives a list of architectures classified as CGRAs, where row 7 is WaveScalar, [page 13]; section 2.4 gives a list of applications particularly suited for CGRAs, including signal and image processing, [page 14 paragraph 3]; and deep learning with DNNs, [page 14 paragraph 4]; Section 4.4 describes the “memory wall” problem as it applies to cloud servers, [page 28 paragraph 3 line 7]; and how the hardware solutions include “3D-stacked DRAM”, [page 28 paragraph 5 line 3], or integrating computation components in memory, [page 28 paragraph 6 line 10]; finally, FIG. 11 provides “killer applications”, meaning applications that are particularly suitable for CGRAs, specifically, “A comparison with Figure 11(a) indicates that CGRAs well match the scenarios on the top right, such as datacenters, the mobile end, cloud servers and robotics”).

It would have been obvious to one skilled in the art before the effective filing date to combine Putnam with Liu because a teaching, suggestion, or motivation in the prior art would have led one skilled in the art to combine the prior art teachings to arrive at the claimed invention. Putnam discloses a system and method that teaches all of the claimed features except for applying the microarchitecture on a cloud server. Liu teaches that features such as the energy efficiency and programming flexibility of these microarchitectures make them uniquely suited for certain applications like data centers and cloud servers (see Fig. 11, Liu [page 31]). A person having skill in the art would have a reasonable expectation of successfully applying the microarchitecture of Putnam to the applications disclosed in Liu by modifying Putnam with the potential applications of Liu. Therefore, it would have been obvious to combine Putnam with Liu to a person having ordinary skill in the art, and this claim is rejected under 35 U.S.C. 103.

With respect to claim 12, Putnam teaches all of claim 1, as noted above. Putnam does not teach wherein the computation is executed for training, testing, or certifying a neural network employed in a machine, robot, or autonomous vehicle. However, Liu teaches wherein the computation is executed for training, testing, or certifying a neural network employed in a machine, robot, or autonomous vehicle (Table 2 gives a list of architectures classified as CGRAs, where row 7 is WaveScalar, [page 13]; section 2.4 gives a list of applications particularly suited for CGRAs, including signal and image processing, [page 14 paragraph 3]; and deep learning with DNNs, [page 14 paragraph 4]; specifically, “CGRAs are capable of high-throughput computation and on-chip communication, making them a superior implementation for DNNs”, [page 14 paragraph 4 lines 4-5], where DNNs refer to learning/training a specific type of neural network with many layers; Section 4.4 describes the “memory wall” problem as it applies to cloud servers, [page 28 paragraph 3 line 7]; and how the hardware solutions include “3D-stacked DRAM”, [page 28 paragraph 5 line 3], or integrating computation components in memory, [page 28 paragraph 6 line 10]; finally, FIG. 11 provides “killer applications”, meaning applications that are particularly suitable for CGRAs, specifically, “A comparison with Figure 11(a) indicates that CGRAs well match the scenarios on the top right, such as datacenters, the mobile end, cloud servers and robotics”).
It would have been obvious to one skilled in the art before the effective filing date to combine Putnam with Liu because a teaching, suggestion, or motivation in the prior art would have led one skilled in the art to combine prior art teaching to arrive at the claimed invention. Putnam discloses a system and method that teaches all of the claimed features except for applying the microarchitecture on a robotics. Liu teaches that certain features like energy efficiency and flexibility of programming of these microarchitectures make them uniquely suited for certain applications like robotics (see Fig. 11, Liu [page 31]). A person having skill in the art would have a reasonable expectation of successfully applying the microarchitecture of Putnam to the applications disclosed in Liu by modifying Putnam with the potential applications of Liu. Therefore, it would have been obvious to combine Putnam with Liu to a person having ordinary skill in the art, and this claim is rejected under 35 U.S.C. 103. With respect to claim 21, Putnam teaches A method of operating a computer system, ... , the method comprising (see section 4.1, [page 3 col 2 paragraph 6]-[page 4 col 2 paragraph 2]; the method is two runs of the execution pipeline shown [page 4 col 1 paragraph 2], where "generating" starts at "4. Execute" stage in the first pass through the pipeline, "transmitting" is the "5. Output" stage, and then evaluating congestion criteria... and transmitting a denial message... is the "1. Input" stage of the second pass through the pipeline): a first processing unit of the plurality generating an offload request for retrieving data from a second processing unit (4. Execute: An instruction executes. Its result goes to the output queue, [page 4 col 1 paragraph 2 bullet 4]; where these are the five pipe stages of a PE, [page 4 col 1 paragraph 2 line 1]; and a PE (processing element) is a tile part of a large topology of processing elements as shown in FIG. 2, [page 4]); transmitting the offload request from the first processing unit to the second processing unit (5. OUTPUT: Instruction outputs are sent via the output bus to their consumer instructions, ... a remote PE, [page 4 col 1 paragraph 2 bullet 5]); if the second processing unit or an offload engine determines that the second processing unit is congested, the offload engine denying the offload request and returning the requested data to the first processing unit (note offload engine is the logic performing the “too many arrive” determination, “INPUT: Operand messages arrive at the PE either from another PE or from itself. The PE may reject messages if too many arrive in one cycle; the senders will then retry on a later cycle, [page 4 col 1 paragraph 2 bullet 1]; rejecting messages is transmitting a denial message; further details are given later, described with respect to the first processing tile, "It is possible, however, that the consumer cannot handle the value that cycle and will reject it. 
ACK/NACK signals require four cycles for the round trip", [page 7 col 2 paragraph 3 lines 3-5]; here the "round trip" means the result is output from the first PE to the second PE, and a ACK/NACK signal is given response, from the second PE back to the first PE, where "reject" would specifically mean a NACK signal is given in response); and if neither the second processing unit nor the offload engine determines that the second processing unit is congested, the second processing unit returning the requested data to the first processing unit (note offload engine is the logic performing the “too many arrive” determination, “INPUT: Operand messages arrive at the PE either from another PE or from itself. The PE may reject messages if too many arrive in one cycle; the senders will then retry on a later cycle, [page 4 col 1 paragraph 2 bullet 1]; rejecting messages is transmitting a denial message; further details are given later, described with respect to the first processing tile, "It is possible, however, that the consumer cannot handle the value that cycle and will reject it. ACK/NACK signals require four cycles for the round trip", [page 7 col 2 paragraph 3 lines 3-5]; here the "round trip" means the result is output from the first PE to the second PE, and a ACK/NACK signal is given response, from the second PE back to the first PE, where "accept" would specifically mean an ACK signal is given in response; the ACK signal is the requested data). Putnam does not teach the computer system comprising a plurality of DRAM-based parallel processing units tiled in two dimensions onto which local memories comprising a memory system are stacked in a third dimension. However, Liu teaches the computer system comprising a plurality of DRAM-based parallel processing units tiled in two dimensions onto which local memories comprising a memory system are stacked in a third dimension (3D-stacked DRAM, [page 28 paragraph 5], where 3D refers to the 3 dimensions of stacking). It would have been obvious to one skilled in the art before the effective filing date to combine Putnam with Liu because a teaching, suggestion, or motivation in the prior art would have led one skilled in the art to combine prior art teaching to arrive at the claimed invention. Putnam discloses a system and method that teaches all of the claimed features except for applying the microarchitecture to 3D stacked DRAM. Section 4.4 of Liu describes the “memory wall” problem as it applies to cloud servers, [page 28 paragraph 3 line 7]; and how the hardware solutions include “3D-stacked Dram, [page 28 paragraph 5 line 3] or integrating computation components in memory, [page 28 paragraph 6 line 10]. A person having skill in the art would have a reasonable expectation of successfully solving the memory wall problem by applying the microarchitecture of Putnam to the applications and 3D stacked Dram disclosed in Liu by modifying Putnam with the potential physical architecture of Liu. Therefore, it would have been obvious to combine Putnam with Liu to a person having ordinary skill in the art, and this claim is rejected under 35 U.S.C. 103. With respect to claim 23, incorporating the rejection of claim 3 and claim 21, claim 23 is rejected for a substantially similar rationale. Claim(s) 13 is/are rejected under 35 U.S.C. 
103 as being unpatentable over “The Microarchitecture of a Pipelined WaveScalar Processor: An RTL-based Study” (2005-Putnam) in view of “GPU Virtualization and Scheduling Methods: A Comprehensive Survey” (2017-Hong) With respect to claim 13, Putnam teaches all of claim 1, as noted above. Putnam does not teach wherein at least one of the steps of generating, transmitting, or evaluating is performed on a virtual machine comprising a portion of a graphics processing unit. However, Hong teaches wherein at least one of the steps of generating, transmitting, or evaluating (GPU scheduling methods, [Abstract] line 9]; see specifically section 7.3 Load Balancing, [page 25 paragraph 3]-[page 26 paragraph 3]; load balancing is a method of generating a task, transmitting the task, and evaluating congestion, for specific examples see specifically performance history method, [page 25 paragraph 3 lines 10-14]; dataflow programming method, [page 25 paragraph 4 lines 17-21]; and the work-stealing algorithm, [page 26 paragraph 3]) is performed on a virtual machine comprising a portion of a graphics processing unit (see FIG. 3 where the light blue “para & full virtualization stack” represent the virtual machine and the GPU is the GPU, [page 14]; see also FIG. 4 with a slightly different configuration, [page 17]; note the scheduling methods taught by section 7 are an overview of GPU scheduling, but are introduced with the intent to address challenges in GPU virtualization, [page 19 paragraph 5]-[page 20 paragraph 1]). It would have been obvious to one skilled in the art before the effective filing date to combine Putnam with Hong because a teaching, suggestion, or motivation in the prior art would have led one skilled in the art to combine prior art teaching to arrive at the claimed invention. Putnam discloses a system that teaches all of the claimed features except for GPU virtualization. Hong teaches the same basic microarchitecture but applied to a GPU, (see FIG. 1 of Hong, [page 3]) and notes the benefits of why this architecture is being adopted by cloud providers: Cloud computing platforms can leverage heterogeneous compute nodes to reduce the total cost of ownership and achieve higher performance and energy efficiency [Crago et al. 2011; Schadt et al. 2011]. A cloud with heterogeneous compute nodes would allow users to deploy computationally intensive applications without the need to acquire and maintain large-scale clusters. In addition to this benefit, heterogeneous computing can offer better performance within the same power budget compared to systems based on homogeneous processors, as computational tasks can be placed on either conventional processors or GPUs depending on the degree of parallelism. These combined benefits have been motivating cloud service providers to equip their offerings with GPUs and heterogeneous programming environments... (Hong [page 2 paragraph 2]). A person having skill in the art would have a reasonable expectation of getting the benefit of the GPU virtualization in the system and method of Putnam by modifying Putnam with the heterogeneous compute described by Hong. Therefore, it would have been obvious to combine Putnam with Hong to a person having ordinary skill in the art, and this claim is rejected under 35 U.S.C. 103. Claim(s) 22 is/are rejected under 35 U.S.C. 
103 as being unpatentable over “The Microarchitecture of a Pipelined WaveScalar Processor: An RTL-based Study” (2005-Putnam) in view of “A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications” (2019-Liu) in further view of US 2014/0301241 A1 (Kumar)

With respect to claim 22, Putnam in view of Liu teaches all of the limitations of claim 21, as noted above. Putnam teaches wherein, before transmitting the offload request, the first processing unit determines (part of the output stage is preparing the message, see the example, [page 9 col 1 paragraph 4, bullet labeled “Cycle 0”]: “Cycle 0: PE0 sends D0 to PE1 and PE2. The OUTPUT stage at PE0 prepares the message and broadcasts it, asserting the PE1 and PE2 receive lines.”; preparing in this case likely refers to setting all of the fields for the message, including setting the destination operand number, see [page 5 footnote 1]). Putnam and Liu do not teach determines that a distance between the first processing unit and the second processing unit exceeds an offload threshold. However, Kumar teaches wherein, before transmitting the offload request, the first processing unit determines that a distance between the first processing unit and the second processing unit exceeds an offload threshold (referred to as latency constraints/requirements in the art, see [0044] lines 1-27; for background see: “Shortest path routing may minimize the latency as such routing reduces the number of hops from the source to the destination. For this reason, the shortest path may also be the lowest power path for communication between the two components”, [0008] lines 8-12; for a description of how distance/latency is calculated in “hops” see FIG. 5(a); and for teaching the threshold: “For a route to be eligible the number of already existing router hops along the route in the layer and the number of new routers that must be instantiated when this flow will be mapped must be smaller than the latency requirement of the flow.”, [0044] lines 19-24).

It would have been obvious to one skilled in the art before the effective filing date to combine Putnam in view of Liu with Kumar because a teaching, suggestion, or motivation in the prior art would have led one skilled in the art to combine the prior art teachings to arrive at the claimed invention. Putnam in view of Liu already discloses a system and method that teaches all of the claimed features as well as the requirements for an instruction mapping algorithm: “instruction mapping algorithm (which maps dynamically as the program executes) are to place dependent instructions near each other to minimize producer-consumer latency, and to spread independent instructions out in order to utilize resources and exploit parallelism”, (Putnam [page 3 col 1 paragraph 4 lines 10-14]). Putnam focuses on the microarchitecture rather than the routing algorithm. Kumar goes into detail on latency/routing issues, (Kumar [0008]), and further explains: “Different topologies may provide different latency characteristics and number of hops between various components providing routes between components that have different latencies. Thus, traffic flows may be mapped to the topologies based on their latency requirements.” (Kumar [0039] lines 12-17).
A person having skill in the art would have a reasonable expectation of successfully routing instructions “near each other to minimize producer-consumer latency” in the system and method of Putnam in view of Liu by modifying Putnam in view of Liu with the latency requirements of Kumar. Therefore, it would have been obvious to combine Putnam in view of Liu with Kumar to a person having ordinary skill in the art, and this claim is rejected under 35 U.S.C. 103.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

US 20160085551 A1 (Greathouse) - A compute unit configured to execute multiple threads in parallel is presented. The compute unit includes one or more single instruction multiple data (SIMD) units and a fetch and decode logic. The SIMD units have differing numbers of arithmetic logic units (ALUs), such that each SIMD unit can execute a different number of threads. The fetch and decode logic is in communication with each of the SIMD units, and is configured to assign the threads to the SIMD units for execution based on such differing numbers of ALUs, [Abstract].

“Instruction Scheduling for a Tiled Dataflow Architecture” (2006-Mercaldi) - This paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned into large sections, the bottom-level algorithm must more carefully analyze program structure when producing the final schedule, [Abstract].

“Area-Performance Trade-offs in Tiled Dataflow Architectures” (2006-Swanson) - Tiled architectures, such as RAW, SmartMemories, TRIPS, and WaveScalar, promise to address several issues facing conventional processors, including complexity, wire-delay, and performance. The basic premise of these architectures is that larger, higher-performance implementations can be constructed by replicating the basic tile across the chip, [Abstract].

“Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors” (2007-Puttaswamy) - 3D integration technology greatly increases transistor density while providing faster on-chip communication. 3D implementations of processors can simultaneously provide both latency and power benefits due to reductions in critical wires. However, 3D stacking of active devices can potentially exacerbate existing thermal problems. In this work, we propose a family of Thermal Herding techniques that (1) reduces 3D power density and (2) locates a majority of the power on the top die closest to the heat sink, [Abstract]; with a TSM to apply to WaveScalar, [page 11 col 1 paragraph 1].

“WaveScalar” (2003-Swanson) - Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge. Ever increasing wire-delay relative to switching speed and the exponential cost of circuit complexity make simply scaling up existing processor designs futile. In this paper, we present an alternative to superscalar design, WaveScalar. WaveScalar is a dataflow instruction set architecture and execution model designed for scalable, low-complexity/high-performance processors. WaveScalar is unique among dataflow architectures in efficiently providing traditional memory semantics.
At last, a dataflow machine can run “real-world” programs, written in any language, without sacrificing parallelism, [Abstract]. Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL MILLER whose telephone number is (408) 918-7548. The examiner can normally be reached on Monday-Friday from 11am to 5pm (PT). If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kevin Young, can be reached at telephone number (571) 270-3180. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from Patent Center and the Private Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from Patent Center or Private PAIR. Status information for unpublished applications is available through Patent Center and Private PAIR to authorized users only. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) Form at https://www.uspto.gov/patents/uspto-automated- interview-request-air-form. /D.M./Examiner, Art Unit 2187 /KEVIN L YOUNG/Supervisory Patent Examiner, Art Unit 2194

Prosecution Timeline

Dec 04, 2023
Application Filed
Mar 21, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12421143 — COOPERATIVE OPTIMAL CONTROL METHOD AND SYSTEM FOR WASTEWATER TREATMENT PROCESS
Granted Sep 23, 2025 · 2y 5m to grant

Patent 12406113 — COMPUTER-AIDED ENGINEERING TOOLKIT FOR SIMULATED TESTING OF PRESSURE-CONTROLLING COMPONENT DESIGNS
Granted Sep 02, 2025 · 2y 5m to grant

Patent 12204835 — STORAGE MEDIUM WHICH STORES INSTRUCTIONS FOR A SIMULATION METHOD IN A SEMICONDUCTOR DESIGN PROCESS, SEMICONDUCTOR DESIGN SYSTEM THAT PERFORMS THE SIMULATION METHOD IN THE SEMICONDUCTOR DESIGN PROCESS, AND SIMULATION METHOD IN THE SEMICONDUCTOR DESIGN PROCESS
Granted Jan 21, 2025 · 2y 5m to grant

Patent 12154663 — METHOD OF IDENTIFYING PROPERTIES OF MOLECULES UNDER OPEN BOUNDARY CONDITIONS
Granted Nov 26, 2024 · 2y 5m to grant

Patent 12118279 — Lattice Boltzmann Based Solver for High Speed Flows
Granted Oct 15, 2024 · 2y 5m to grant
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 41%
With Interview (+36.9%): 78%
Median Time to Grant: 3y 8m
PTA Risk: Low
Based on 54 resolved cases by this examiner. Grant probability derived from career allow rate.
