Prosecution Insights
Last updated: April 19, 2026
Application No. 17/883,010

NEURAL NETWORK COMPUTING DEVICE FOR DETERMINING NEURAL NETWORK COMPUTATION SCHEDULE AND CONTROL METHOD THEREOF

Final Rejection: §103, §112
Filed
Aug 08, 2022
Examiner
NGUYEN, AN-AN NGOC
Art Unit
2195
Tech Center
2100 — Computer Architecture & Software
Assignee
Industry-Academic Cooperation Foundation Yonsei University
OA Round
2 (Final)
Grant Probability: 83% (Favorable)
OA Rounds: 3-4
To Grant: 3y 5m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 83% (5 granted / 6 resolved), +28.3% vs TC avg; grants above average
Interview Lift: +50.0% among resolved cases with interview (strong)
Typical Timeline: 3y 5m avg prosecution; 34 currently pending
Career History: 40 total applications across all art units

Statute-Specific Performance

§101: 20.6% (-19.4% vs TC avg)
§103: 57.9% (+17.9% vs TC avg)
§102: 11.2% (-28.8% vs TC avg)
§112: 10.3% (-29.7% vs TC avg)
Tech Center averages are estimates; based on career data from 6 resolved cases.

Office Action

§103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims

1. Claims 1, 6, 8, 10, 12, 17, and 19 are currently amended.
2. Claim 11 is cancelled.
3. Claims 1-10 and 12-20 are pending.
4. Claims 1-10 and 12-20 are rejected.

Response to Arguments

5. Regarding Objections to the Title of the Invention: Applicant’s amendments and arguments with respect to the objections to the title of the invention have been fully considered and are persuasive. The objections to the title of the invention have been withdrawn.

6. Regarding 35 U.S.C. 112(b) Rejections: Applicant’s amendments and arguments to claims 1 and 12 have been considered and are not persuasive. The rejections under 35 U.S.C. 112(b) are maintained. Additionally, applicant’s arguments are rejected under a new ground of rejection necessitated by the amendment. The full rejection can be found in the 35 U.S.C. 112(b) Rejections section below.

7. Regarding 35 U.S.C. 101 Rejections: Applicant’s amendments and arguments with respect to the 35 U.S.C. 101 rejections have been fully considered and are persuasive. The rejections under 35 U.S.C. 101 have been withdrawn.

8. Regarding Prior Art Rejections: Applicant’s amendments and arguments to claims 1 and 12 have been considered and are not persuasive. The rejections under 35 U.S.C. 103 are maintained. Additionally, applicant’s arguments are rejected under a new ground of rejection necessitated by the amendment.

Applicant argues in remarks:

Applicant respectfully submits that claim 1 is patentable because the proposed Dave-Bokam combination does not disclose or suggest to "perform first scheduling for a first movement of the neural network computation data between a first memory hierarchy and a second memory hierarchy having a higher level than the first memory hierarchy, the first scheduling being based on a size of data tiles storing the neural network computation data in the first memory hierarchy and the second memory hierarchy and a combination of components allocated to the data tiles", as claimed.

Dave is generally directed to discovering the most efficient way "to execute the perfectly nested loop of an application onto computational and memory resources of a given dataflow accelerator (execution method)". See Dave, Abstract. For example, Dave discloses a configuration in which the schedulable loop order for each loop can be limited according to the characteristics of operand data being processed. See id. at FIG. 6 and § 4.2. In particular, Dave discloses that while "tiling factors for L1, L2, and L3 loops determine the size of the data accessed from RF, SPM, and DRAM, the orderings of these loops determine the data reuse and scheduling of the data movement." Id. at p. 70:8, § 4.2.1. Dave further discloses that in a loop nest, data "operands (tensors) are often invariant of specific loops and can be reused", and thus, "for a given loop-nest, it is possible to create a list of all those loop-orderings (schedules) that feature unique reuse of operands, and the optimizer needs to target just those orderings." See id. at pp. 70:8-70:10, § 4.2.1. For example, Dave discloses, referring to FIG. 4, that "Iteration counts (tiling factors) and orderings for L3 loops determine data communicated to (reused in) SPM" and that "Iterations of L2 loops affect SPM accesses and the cost of data communication to RF via NOC." See id. at FIG. 4.
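For orientation only, the sketch below (Python, with invented array names and tile sizes) shows the kind of three-level tiled loop nest the quoted passages from Dave describe: the tiling factors chosen for the L1, L2, and L3 loops fix how much data is held at each memory level, and the loop ordering fixes which operand stays resident and is reused. It illustrates the framework discussed above, not the claimed bottom-up scheduling or any party's actual code.

    # Minimal sketch of a three-level tiled loop nest, loosely modeled on the
    # L1/L2/L3 loops Dave describes. The tiling factors (l3_tile, l2_tile) set
    # how much data is held at each memory level; the loop ordering decides
    # which operand stays resident (is reused) across the inner iterations.
    import numpy as np

    def tiled_matmul(inp, wgt, l3_tile=64, l2_tile=16):
        n, k = inp.shape
        k2, m = wgt.shape
        assert k == k2
        out = np.zeros((n, m))
        for i3 in range(0, n, l3_tile):                                # L3 loops: DRAM <-> SPM tiles
            for j3 in range(0, m, l3_tile):
                for i2 in range(i3, min(i3 + l3_tile, n), l2_tile):    # L2 loops: SPM <-> RF tiles
                    for j2 in range(j3, min(j3 + l3_tile, m), l2_tile):
                        for i1 in range(i2, min(i2 + l2_tile, n)):     # L1 loops: RF-resident work
                            for j1 in range(j2, min(j2 + l2_tile, m)):
                                # With this ordering, the output element is the
                                # operand reused across the innermost reduction.
                                out[i1, j1] = inp[i1, :] @ wgt[:, j1]
        return out

    x, w = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(tiled_matmul(x, w), x @ w)

Swapping the loop order in such a nest changes which of the input, weight, or output tiles is reused at each level, which is the pruning axis Dave's optimizer works over.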
According to Dave, during each loop iteration, "the data corresponding to each operand can be accessed from lower memory (e.g., SPM) and brought to the current memory level." Id. at p. 70:10, § 4.2.1. The scheduling, according to Dave, is not performed in a bottom-up manner between different levels of memory hierarchy such that the scheduling of a lower-level memory affects the scheduling of a higher-level memory. That is, at best, Dave discloses that tiling factors for L1, L2, and L3 loops determine the size of the data accessed and that the orderings of these loops determine the data reuse and scheduling of the data movement. In other words, when determining a loop order corresponding to a tiling factor at a certain hierarchy, Dave merely discloses a method of leaving only loop orders having unique data reuse patterns determined by the type of reused data, and excluding loop orders having substantially the same reuse pattern.

To the extent the Office is attempting to equate the L1, L2, and L3 loops of Dave to the claimed "first scheduling", claim 1 recites to "perform first scheduling for a first movement of the neural network computation data between a first memory hierarchy and a second memory hierarchy having a higher level than the first memory hierarchy, the first scheduling being based on a size of data tiles storing the neural network computation data in the first memory hierarchy and the second memory hierarchy and a combination of components allocated to the data tiles." Dave does not disclose that its L1, L2, and L3 loops are based on any such factors. That is, claim 1 recites a "bottom-up" scheduling, in which a second scheduling is performed based on a result of a first scheduling. However, Dave does not consider any such scheduling scheme. For example, if all possible schedule candidates are evaluated to calculate computational costs in order to find a low-cost schedule, according to Dave, the computation time for scheduling would increase significantly due to the excessive number of calculations. However, the present invention performs bottom-up scheduling of neural network computation data between different memory hierarchies, thereby enabling an optimal schedule to be found more efficiently. Such an effect cannot be expected from Dave. Therefore, Dave's purported disclosure of L1, L2, and L3 loops does not disclose the claimed "first scheduling". Consequently, Dave does not disclose or suggest at least the cited features recited in claim 1.

The Office Action does not allege that Bokam discloses the above-identified features of claim 1 nor does Bokam make up for at least these deficiencies of Dave. Thus, even assuming for the sake of argument only that the proposed Dave-Bokam combination would have been proper, which Applicant does not concede, the proposed combination does not disclose or suggest all features recited in claim 1. For at least these reasons, Applicant respectfully submits that claim 1 is patentable and requests withdrawal of the rejection.

With the newly amended claims, the overall scope of the claim does not read the same way it did before. Therefore, a new combination of art was introduced to better suit the new scope of the claims. Burke teaches of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data: Col.
11, lines 39-47, After the read and write processes illustrated in FIGS. 6 and 7, information concerning access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, may be updated and stored by the system described herein. The updated data access information or other characteristic information of the data and/or any portion of the data may, for example, be stored as an entry in a group element of the thin device table 112 (for example, the entry 166f of the group element 166' as shown in FIG. 5). Col. 14, lines 28-52, FIG. 10 is a schematic illustration of a fine grained tiered storage system 600 according to an embodiment of the system described herein. A storage device 630 is shown including a thin device 620, like the thin devices 71-74 discussed elsewhere herein, that may be coupled to multiple physical storage devices across multiple storage tiers. As discussed elsewhere herein, the storage tiers may be associated with data devices, like the data devices 61-67 discussed herein, so that, for example, there is one data device for each storage tier, one data device for multiple storage tiers, any portion of a data device for any portion of the pools of storage shown for the storage tiers, and/or any combinations thereof. For example, in an embodiment, a top tier storage pool 610 (e.g., tier 0) may include flash/solid state disk (SSD) drives that are relatively fast and expensive. Other storage pools 611-613 (e.g., tiers 1-3) may include disk drives of decreasing speeds or other configurations (i.e., 15 k rpm, 10 k rpm, 7.5 k rpm redundant array of independent disk (RAID) storage). The lowest tier of storage pool 614 (e.g., tier 4) may include, for example, tape storage, largest capacity disk drives (such as massive array of idle disks (MAID) storage). As illustrated, the last storage tier 614 may include storage devices external to the storage device 630 that may be suitable for long term storage of data that is infrequently accessed. Col. 17, lines 66 – Col. 18, lines 3, At the step 838, scheduling for the movement of the data may include relocating data in the particular requested tier, e.g. "faster" storage tier, to a lower tier, e.g. "slower" storage tier, to make memory available for the data temporarily stored in the global memory. Burke, however, does not specifically teach of a neural network environment. However, Venkataramani teaches of neural network operations that partition the entire memory state across tiles that hold all of the features and errors of the network ([0085]). This is similar to the information concerning access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, as discussed in Burke. Moreover, the tiles are coupled with a memory hierarchy, similarly to Burke. An operation is performed in order to output data, and the data can be stored in the memory. Together, Burke and Venkataramani teach of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data and Venkataramani shows that it can be done on a neural network device wherein the operation is a MAC operation and the data size is based on data tiles. Additionally, claims 2-10 and 13-20 depend from and further limit amended claims 1 and 12 and are therefore also rejected under 35 U.S.C 103. 
The full rejection can be found in the 35 U.S.C. 103 rejection section below.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

The claim(s), as amended, contain the following language that is unclear:

As per amended claim 1, line 1 recites “generating output data”. However, lines 4-5 recite “a memory including a plurality of memory hierarchies, and configured to store neural network computation data including the input data, the weight data, and the output data”. The preamble states that output data is being generated; however, the following lines say that the memory includes output data. Is the output data already there? Is it being generated and then placed into the memory? Is the output data the identified “neural network computation schedule for performing the convolution operation,” mentioned in subsequent lines? For examination purposes, examiner interprets the limitation as the output data being generated after the operation is completed and then stored in the memory. Once stored in the memory, it can be used for subsequent scheduling and neural network operations.

As per amended claim 1, lines 6-10 recite “an operator configured to: receive the input data and the weight data, perform the neural network operation by performing a multiply-and-accumulate (MAC) operation on the input data and the weight data to generate the output data, and provide the output data for storing in the memory.” Additionally, the final lines of the amended claim recite “control the operator to perform the neural network operation based on the neural network computation schedule.” However, was the neural network operation already done? Is the neural network operation completed once before the movement of data between memory hierarchies? Or is an operation completed once before the movement of data between memory hierarchies and again after the movement of data is completed? For examination purposes, examiner interprets the limitation as meaning that the neural network operation is completed after the movement of data between memory hierarchies has been completed.

As per amended claim 1, lines 14-16 recite “first scheduling being based on a size of data tiles storing the neural network computation data in the first memory hierarchy and the second memory hierarchy and a combination of components allocated to the data tiles.” It is unclear whether the scheduling is based on the data tile size in both the first and second memory hierarchies together with a combination of components, or on something else. Is it a size of data tiles storing the neural network computation data in the first memory hierarchy, the second memory hierarchy overall, and a combination of components allocated to the data tiles? For examination purposes, examiner interprets the limitation as meaning that the first scheduling is based on a size of a data tile storing the neural network computation data in the first and second memory hierarchies, in addition to a combination of components allocated to the data tiles.
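To make the adopted reading concrete, here is a minimal sketch (Python; the names TileSpec and first_scheduling, and the ordering heuristic, are hypothetical) of the interpretation stated above, under which the first scheduling is a function of the data-tile sizes in both the first and second memory hierarchies together with the combination of components allocated to the tiles. It paraphrases the interpretation only and is not drawn from the claims or the cited references.

    # Sketch of the interpretation adopted above: the first scheduling depends on
    # (a) the data-tile sizes in the first and second memory hierarchies and
    # (b) the combination of components allocated to the tiles.
    # TileSpec, first_scheduling, and the ordering heuristic are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class TileSpec:
        size_in_first_hierarchy: int     # e.g., bytes of the tile in the first (lower) hierarchy
        size_in_second_hierarchy: int    # e.g., bytes of the tile in the second (higher) hierarchy
        allocated_components: tuple      # e.g., ("PE0", "PE1") assigned to this tile

    def first_scheduling(tiles):
        # Order the first data movement (first -> second hierarchy) by a score
        # computed from both tile sizes and the allocated component combination.
        return sorted(
            tiles,
            key=lambda t: (t.size_in_first_hierarchy + t.size_in_second_hierarchy)
            / max(len(t.allocated_components), 1),
        )

    # Example: under this heuristic, the tile served by more components moves first.
    tiles = [TileSpec(256, 4096, ("PE0",)), TileSpec(256, 4096, ("PE0", "PE1"))]
    assert first_scheduling(tiles)[0].allocated_components == ("PE0", "PE1")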
Claim 2 is dependent on claim 1 and fails to cure the deficiencies set forth above for claim 1. Therefore, it is rejected under the same rationale above. Claims 3-10 are dependent on claim 2 and fail to cure the deficiencies set forth above for claim 1. Therefore, they are rejected under the same rationale above.

Regarding claim 12, it is a method claim having similar limitations as claim 1 above. Therefore, it is rejected under the same rationale above. Claim 13 is dependent on claim 12 and fails to cure the deficiencies set forth above for claim 1. Therefore, it is rejected under the same rationale above. Claims 14-19 are dependent on claim 13 and fail to cure the deficiencies set forth above for claim 1. Therefore, they are rejected under the same rationale above. Regarding claim 20, it is a computer program used to execute the method of claim 12. Therefore, it is rejected under the same rationale above.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

9. Claims 1, 4, 8-10, 12, 15, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Burke et al. (US 8838887 B1) in view of Venkataramani et al. (US 20190303743 A1).

10. With regard to claim 1, Burke teaches: A neural network computing device for generating output data by performing a neural network operation of input data and weight data (Fig. 6; Fig. 7; Col. 10, line 10 – Col. 11, line 52; Examiner’s Note: A read and write operation is performed. Once finished, information concerning access of the data is updated and stored. This data is output data.), the neural network computing device comprising: a memory including a plurality of memory hierarchies, and configured to store neural network computation data including the input data, the weight data, and the output data (Fig. 10; Col. 11, lines 39-43, After the read and write processes illustrated in FIGS. 6 and 7, information concerning access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, may be updated and stored by the system described herein; Col. 14, lines 28-51, FIG. 10 is a schematic illustration of a fine grained tiered storage system 600 according to an embodiment of the system described herein. A storage device 630 is shown including a thin device 620, like the thin devices 71-74 discussed elsewhere herein, that may be coupled to multiple physical storage devices across multiple storage tiers. As discussed elsewhere herein, the storage tiers may be associated with data devices, like the data devices 61-67 discussed herein, so that, for example, there is one data device for each storage tier, one data device for multiple storage tiers, any portion of a data device for any portion of the pools of storage shown for the storage tiers, and/or any combinations thereof.
For example, in an embodiment, a top tier storage pool 610 (e.g., tier 0) may include flash/solid state disk (SSD) drives that are relatively fast and expensive. Other storage pools 611-613 (e.g., tiers 1-3) may include disk drives of decreasing speeds or other configurations (i.e., 15 k rpm, 10 k rpm, 7.5 k rpm redundant array of independent disk (RAID) storage). The lowest tier of storage pool 614 (e.g., tier 4) may include, for example, tape storage, largest capacity disk drives (such as massive array of idle disks (MAID) storage). As illustrated, the last storage tier 614 may include storage devices external to the storage device 630 that may be suitable for long term storage of data that is infrequently accessed; Examiners’ Note: There is a tier storage system, which is analogous to a memory including a plurality of memory hierarchies. Data regarding access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, may be updated and stored by the system described herein, after read and write operations. Read and write operations indicate input and output data. Moreover, access frequency and/or other statistics can be weight data.); an operator configured to: receive the input data and the weight data, perform the neural network operation by performing a multiply-and-accumulate (MAC) operation on the input data and the weight data to generate the output data, and provide the output data for storing in the memory (Fig. 6; Fig. 7; Col. 10, lines 10 – Col. 11, lines 52; Examiner’s Note: A read and write operation is performed by one of the host adapters, which is analogous with an operator. Once finished, information concerning access of the data is updated and stored. This data is output data.); and a processor configured to: perform first scheduling for a first movement of the neural network computation data between a first memory hierarchy and a second memory hierarchy having a higher level than the first memory hierarchy, the first scheduling being based on a size of data tiles storing the neural network computation data in the first memory hierarchy and the second memory hierarchy and a combination of components allocated to the data tiles (Fig. 10; Col. 17, lines 66 – Col. 18, lines 3, At the step 838, scheduling for the movement of the data may include relocating data in the particular requested tier, e.g. "faster" storage tier, to a lower tier, e.g. "slower" storage tier, to make memory available for the data temporarily stored in the global memory; Col. 21, lines 54-64; ; Examiner’s Note: Scheduling is defined as relocating data in the particular requested tier. The first scheduling can be to an intermediate storage level based on a score.), after the first scheduling is performed, perform second scheduling for a second movement of the neural network computation data between the second memory hierarchy and a third memory hierarchy having a higher level than the second memory hierarchy, the second scheduling being based on the first scheduling (Fig. 10; Col. 17, lines 66 – Col. 18, lines 3, At the step 838, scheduling for the movement of the data may include relocating data in the particular requested tier, e.g. "faster" storage tier, to a lower tier, e.g. "slower" storage tier, to make memory available for the data temporarily stored in the global memory; Col. 
21, lines 34-36, Data portions having a score of I or higher are promoted to the highest level of storage; Examiner’s Note: Scheduling is defined as relocating data in the particular requested tier. The second scheduling can be to the highest storage level based on a score.), identify a neural network computation schedule for the performing of the neural network operation, based on a result of the first scheduling and a result of the second scheduling (Col. 15, lines 66 – Col. 16, lines 2, Predictive policies may be used to recognize that data blocks that will be needed before they are actually needed and promote the data blocks accordingly (for example, nightly batch jobs, etc.); Col. 16, lines 26-28, For example, blocks may be promoted in the background immediately before batch runs (e.g., billing runs etc.); Col. 17, lines 66 – Col. 18, lines 3, At the step 838, scheduling for the movement of the data may include relocating data in the particular requested tier, e.g. "faster" storage tier, to a lower tier, e.g. "slower" storage tier, to make memory available for the data temporarily stored in the global memory; Examiner’s Note: Scheduling is defined as relocating data in the particular requested tier. An example is given where data is promoted, moved up in tier, when they are needed at a certain time (i.e. nightly batch jobs or billing runs). Since data is promoted at this specific time, this is analogous with a schedule. The data that is promoted can be used for the operation.), and control the operator to perform the neural network operation based on the neural network computation schedule (Fig. 6; Fig. 7; Col. 10, lines 10 – Col. 11, lines 52; Col. 17, lines 66 – Col. 18, lines 3, At the step 838, scheduling for the movement of the data may include relocating data in the particular requested tier, e.g. "faster" storage tier, to a lower tier, e.g. "slower" storage tier, to make memory available for the data temporarily stored in the global memory; Examiner’s Note: Scheduling is defined as relocating data in the particular requested tier. A read and write operation is performed by one of the host adapters, which is analogous with an operator. Once finished, information concerning access of the data is updated and stored. This data is output data.). Burke teaches of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data. However, Burke fails to teach specifically that this is for a neural network, and that the operation is a multiply-and-accumulate (MAC) operation. However, in analogous art, Venkataramani teaches: [0067] FIG. 7 illustrates a compute intensive tile 700 (e.g., circuit) according to embodiments of the disclosure. The depicted compute intensive tile 700 in FIG. 7 comprises a 2D systolic array of processing elements 701 (2D-PE array). Each 2D-PE of the array may include a vector of fused multiply and add (accumulate) (FMA) units. A 1D array of accumulators 702 is located along the right border of the 2D-PE array in FIG. 7. Three sets of memory (e.g., streaming memory (SM)) elements (704, 706, 708) are placed along the left, top, and bottom borders in FIG. 7, e.g., to feed data operands to the 2D-PEs. Compute intensive tile 700 may also contain an auxiliary memory 710, e.g., to hold temporary outputs. 
Other components that may be utilized in the compute intensive tile 700 include an instruction memory 718, an instruction decode and control unit 712, a scalar register file 714, and an in-order scalar processing element 716, e.g., to execute control operations such as loop counters, pointer arithmetic, branches, etc. A compute intensive tile may be optimized to carry out batch convolution (e.g., one input, many kernels) and/or matrix multiplication operations. For example, batch 2D-convolution may be realized as follows: the rows of the convolution input are fed along the rows of the 2D-PE array 701 and the kernel rows fed along the columns. Each 2D-PE may compute a function (e.g., a dot product) of an input row with a kernel row. Convolution output may be produced by diagonally accumulating the outputs (e.g., dot products) in the 1D accumulator array 702. In one embodiment, for some convolution outputs, not all rowwise outputs (e.g., dot products) are produced in the same iteration of the 2D array. In such cases for example, the partial convolution outputs may be stored in the auxiliary memory 710, e.g., and fetched back by the 1D accumulator array 702 when the remaining outputs (e.g., dot products) are available. In an embodiment where the 2D-PEs have multiple execution lanes, an equivalent number of kernels may be simultaneously fed, e.g., enabling multiple convolutions, sharing the same input, to be evaluated in parallel. Note that although the terms rows and columns are utilized, other arrangements are possible, for example, a chip may be rotated relative to the perspective in the Figures. [0085] One key aspect of the mapping process is that (e.g., at compile time), the entire network state (e.g., features, errors, weights, and weight gradients) is partitioned and distributed across the memory intensive tiles in the chip. In one embodiment, each feature and error in the network is assigned a home memory intensive tile. Enough memory capacity may be provisioned, e.g., cumulatively across all memory intensive tiles, to hold all the features and errors of the network. In one embodiment, a neural (e.g., deep) network includes a few million neurons and utilizes 10s' of MB of memory capacity, e.g., which may be provisioned on-chip. In an embodiment when the features and errors do not fit on a single chip, the neural network may be split at the node-level and multiple chips utilized to realize the neural network. An embodiment of this is discussed in the context of the node architecture described in Section II-C. In one embodiment, e.g., depending on the memory capacity available, weights and weight gradients of selected layers are stored on-chip, e.g., in the memory intensive tiles where the corresponding features reside. Weights and gradients of the other layers may be stored in external memory, for example, and are prefetched into the memory intensive tiles (e.g., the (FIFO queue) of a memory intensive tile). The compute intensive tiles may produce and consume the neural network state stored in memory intensive tiles, e.g., those tiles directly connected to them. The SFUs present within the memory intensive tiles may also operate on the neural network state stored in them. In one embodiment, by partitioning the network state and associated computations spatially across multiple processing tiles of a chip, data movement is localized and/or interconnect bandwidth is minimized. 
This, e.g., coupled with a simplified memory hierarchy, may significantly add to the energy efficiency of a processor (e.g., chip). [0087] FIG. 11 illustrates the computations realized on a chip 1100 with compute intensive tiles and memory intensive tiles for the forward propagation of a convolutional layer according to embodiments of the disclosure. FIG. 11 illustrates how computations of a given layer are realized on a set of chip columns allocated to it by using the FP step of a convolutional layer as an example. Input features to the layer may be distributed evenly across the memory intensive tiles. In one embodiment, the output features produced are stored in the next set of columns, e.g., so that the next layer can be computed. The output features may be computed in batches of size equivalent to the lanes in the processing elements (e.g., 2D-Pes) of the compute intensive tile. In certain embodiments to compute each output features batch: (i) compute intensive tiles (e.g., compute intensive tile 1104) fetch an input feature from the left memory intensive tile (e.g., memory intensive tile 1102), (ii) compute intensive tile convolves the input feature to produce partial output features which are stored in the right memory intensive tile, (iii) Steps (i) and (ii) are repeated for all input features in the left memory intensive tile (e.g., memory intensive tile 1106); the right memory intensive tiles accumulate when the partial output features are stored, (iv) to produce the final weighted sum, the accumulated partial outputs in each right memory intensive tile are to be accumulated together. This may be achieved by identifying the “home” row of the output feature batch, e.g., and first accumulating the features vertically into the home row and then horizontally into the last column memory intensive tile, and (v) after this, the last-column home-row memory intensive tile may compute the activation function (e.g., and sampling if desired) before passing the output features to its home memory intensive tile. This process may be repeated until all output features are computed. If the layer weights are to be brought in from external memory, the compute intensive tile(s) may issue prefetch requests at the start of the previous output feature batch iteration. [0156] The memory hierarchy includes one or more levels of cache within the cores, a set or one or more shared cache units 2506, and external memory (not shown) coupled to the set of integrated memory controller units 2514. The set of shared cache units 2506 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect unit 2512 interconnects the integrated graphics logic 2508, the set of shared cache units 2506, and the system agent unit 2510/integrated memory controller unit(s) 2514, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 2506 and cores 2502-A-N. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Burke with the teachings of Venkataramani where this is done on a neural network device, and the neural network operation is a multiply-and-accumulate (MAC) operation. 
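As a reading aid for the limitation mapped above, and for the "bottom-up" character Applicant attributes to it, the following Python sketch performs a first scheduling between a first and second memory hierarchy, derives the second scheduling from the result of the first, and then runs the MAC operation under the combined schedule. All names, tile heuristics, and capacities are assumptions; the sketch tracks the claim wording, not Burke's or Venkataramani's disclosures.

    # Sketch of the claimed bottom-up flow: a second scheduling is derived from
    # the result of the first scheduling, and the operator then performs a MAC
    # operation under the resulting schedule. All names and capacities are
    # hypothetical illustrations of the claim language, not of the cited art.
    import numpy as np

    def schedule_and_run(inp, wgt, hierarchy_bytes=(4_096, 65_536, 1 << 24)):
        l1_bytes, l2_bytes, l3_bytes = hierarchy_bytes  # first, second, third hierarchies
        row_bytes = inp.shape[1] * inp.itemsize

        # First scheduling: movement between the first and second hierarchy,
        # based on data-tile size (largest row tile that fits in the first level).
        first_tile = max(1, min(l1_bytes // row_bytes, inp.shape[0]))

        # Second scheduling: movement between the second and third hierarchy,
        # computed from the first result (bottom-up): at least the first tile,
        # expanded to what fits in the second level.
        second_tile = max(first_tile, min(l2_bytes // row_bytes, inp.shape[0]))

        # Identify the computation schedule from both results and perform the
        # neural network operation; the third hierarchy (l3_bytes) is assumed
        # large enough to hold the full tensors.
        out = np.zeros((inp.shape[0], wgt.shape[1]))
        for r in range(0, inp.shape[0], second_tile):
            rows = slice(r, min(r + second_tile, inp.shape[0]))
            out[rows] = inp[rows] @ wgt   # multiply-and-accumulate on input and weights
        return out                        # output data, provided for storage in memory

    x, w = np.random.rand(32, 128), np.random.rand(128, 10)
    assert np.allclose(schedule_and_run(x, w), x @ w)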
As described above, Burke teaches of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data. Similarly, Venkataramani teaches of neural network operations that partition the entire memory state across tiles that hold all of the features and errors of the network ([0085]). This is similar to the information concerning access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, as discussed in Burke. Moreover, the tiles are coupled with a memory hierarchy, similarly to Burke. An operation is performed in order to output data, and the data can be stored in the memory. Together, Burke and Venkataramani teach of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data and Venkataramani shows that it can be done on a neural network device wherein the operation is a MAC operation and the data size is based on data tiles. 11. With regard to claim 4, Burke and Venkataramani teach: wherein the processor: performs the first scheduling based on the first strategy to minimize movement costs of the neural network computation data which performs the first movement (Col. 13, lines 41-45, For example, data that is frequently used may be moved to a relatively fast storage device whereas data that has not been used over a certain period of time may be moved to a relatively slow storage device according to the data processing as discussed elsewhere herein; Col. 17, lines 66 – Col. 18, lines 3, At the step 838, scheduling for the movement of the data may include relocating data in the particular requested tier, e.g. "faster" storage tier, to a lower tier, e.g. "slower" storage tier, to make memory available for the data temporarily stored in the global memory; Examiner’s Note: Scheduling is defined as relocating data in the particular requested tier. Data that is frequently used is scheduled/moved to a fast storage device.); and performs the second scheduling based on the first strategy to minimize movement costs of the neural network computation data which performs the second movement (Col. 13, lines 41-45, For example, data that is frequently used may be moved to a relatively fast storage device whereas data that has not been used over a certain period of time may be moved to a relatively slow storage device according to the data processing as discussed elsewhere herein; Col. 17, lines 66 – Col. 18, lines 3, At the step 838, scheduling for the movement of the data may include relocating data in the particular requested tier, e.g. "faster" storage tier, to a lower tier, e.g. "slower" storage tier, to make memory available for the data temporarily stored in the global memory; Examiner’s Note: Scheduling is defined as relocating data in the particular requested tier. Data that is infrequently used is scheduled/moved to a slower storage device.), wherein the movement costs include at least one of energy and time required for transmission and reception of the neural network computation data (Col. 12, lines 18-27, The system described herein allows for the remapping of physical data based on policy criteria or other statistics. 
For example, the policy may be based on the last time data was used and/or accessed. Alternatively, the policy may be based on anticipated use of data over specific times and/or dates. For example, data that is expected to be used at a particular time may be stored on (or relocated to) relatively fast tier and then moved to relatively slow tier when it is expected that the data will not be used again for a lengthy period of time; Col 15, lines 51-55, Other possible criteria include the time of day, the size of the incoming write operation (e.g. very large sequential writes vs. smaller random writes), file name, file type, host OS type, data type, access patterns, inter-dependent accesses to other data, etc.). Burke teaches of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data. However, Burke fails to teach specifically that this is for a neural network. However, in analogous art, Venkataramani teaches the neural network, see at least [0067]; [0085]; [0087]; [0156]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Burke with the teachings of Venkataramani wherein the processor: performs the first scheduling based on the first strategy to minimize movement costs of the computation data which performs the first movement; and performs the second scheduling based on the first strategy to minimize movement costs of the computation data which performs the second movement, wherein the movement costs include at least one of energy and time required for transmission and reception of the computation data is done on a neural network. As described above, Burke teaches of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data. Similarly, Venkataramani teaches of neural network operations that partition the entire memory state across tiles that hold all of the features and errors of the network ([0085]). This is similar to the information concerning access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, as discussed in Burke. Moreover, the tiles are coupled with a memory hierarchy, similarly to Burke. An operation is performed in order to output data, and the data can be stored in the memory. Together, Burke and Venkataramani teach of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data and Venkataramani shows that it can be done on a neural network device. Additionally, this is done using a strategy to minimize movement costs. 12. With regard to claim 8, Burke and Venkataramani further teach: wherein a plurality of neural network parameters of the neural network computation data include a plurality of input data parameters related to factors of the input data, a plurality of weight data parameters related to factors of the weight data, and a plurality of output data parameters related to factors of the output data (Fig. 10; Col. 
11, lines 39-43 After the read and write processes illustrated in FIGS. 6 and 7, information concerning access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, may be updated and stored by the system described herein; Col. 14, lines 28-51, FIG. 10 is a schematic illustration of a fine grained tiered storage system 600 according to an embodiment of the system described herein. A storage device 630 is shown including a thin device 620, like the thin devices 71-74 discussed elsewhere herein, that may be coupled to multiple physical storage devices across multiple storage tiers. As discussed elsewhere herein, the storage tiers may be associated with data devices, like the data devices 61-67 discussed herein, so that, for example, there is one data device for each storage tier, one data device for multiple storage tiers, any portion of a data device for any portion of the pools of storage shown for the storage tiers, and/or any combinations thereof. For example, in an embodiment, a top tier storage pool 610 (e.g., tier 0) may include flash/solid state disk (SSD) drives that are relatively fast and expensive. Other storage pools 611-613 (e.g., tiers 1-3) may include disk drives of decreasing speeds or other configurations (i.e., 15 k rpm, 10 k rpm, 7.5 k rpm redundant array of independent disk (RAID) storage). The lowest tier of storage pool 614 (e.g., tier 4) may include, for example, tape storage, largest capacity disk drives (such as massive array of idle disks (MAID) storage). As illustrated, the last storage tier 614 may include storage devices external to the storage device 630 that may be suitable for long term storage of data that is infrequently accessed; Examiners’ Note: Data regarding access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, may be updated and stored by the system described herein, after read and write operations. Read and write operations indicate input and output data. Moreover, access frequency and/or other statistics can be weight data. The parameters surrounding the data relate to how often they are accessed. These parameters decided which storage tier the data is moved to.), and the dataflow is one of an input stationary (IS) dataflow where data related to the plurality of input data parameters is reused, a weight stationary (WS) dataflow where data related to the plurality of weight data parameters is reused, and an output stationary (OS) dataflow where data related to the plurality of output data parameters is reused (Col. 11, lines 39-47, After the read and write processes illustrated in FIGS. 6 and 7, information concerning access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, may be updated and stored by the system described herein. The updated data access information or other characteristic information of the data and/or any portion of the data may, for example, be stored as an entry in a group element of the thin device table 112 (for example, the entry 166f of the group element 166' as shown in FIG. 5); Col. 12, lines 16-27, A policy may be configured by an administrator on a system-wide level or may be specific to a particular user on a specific logical device. The system described herein allows for the remapping of physical data based on policy criteria or other statistics. For example, the policy may be based on the last time data was used and/or accessed. 
Alternatively, the policy may be based on anticipated use of data over specific times and/or dates. For example, data that is expected to be used at a particular time may be stored on (or relocated to) relatively fast tier and then moved to relatively slow tier when it is expected that the data will not be used again for a lengthy period of time; Examiner’s note: The flow of data is determined by how often the data is accessed, which is analogous with reuse.). Burke teaches of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data. However, Burke fails to teach specifically that this is for a neural network. However, in analogous art, Venkataramani teaches the neural network, see at least [0067]; [0085]; [0087]; [0156]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Burke with the teachings of Venkataramani where this is done on a neural network device. As described above, Burke teaches of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data. Similarly, Venkataramani teaches of neural network operations that partition the entire memory state across tiles that hold all of the features and errors of the network ([0085]). This is similar to the information concerning access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, as discussed in Burke. Moreover, the tiles are coupled with a memory hierarchy, similarly to Burke. An operation is performed in order to output data, and the data can be stored in the memory. Together, Burke and Venkataramani teach of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data and Venkataramani shows that it can be done on a neural network device. Additionally, it would be beneficial that the dataflow is based on a data parameter of reuse so that data that is more often accessed is placed in higher storage tiers. 13. With regard to claim 9, Burke teaches: wherein the plurality of schedule candidates satisfy constraint conditions, which are determined based on at least one of a storage capacity of the plurality of memory hierarchies, and a number of components including the plurality of memory hierarchies (Fig. 10; Col. 14, lines 28-48, FIG. 10 is a schematic illustration of a fine grained tiered storage system 600 according to an embodiment of the system described herein. A storage device 630 is shown including a thin device 620, like the thin devices 71-74 discussed elsewhere herein, that may be coupled to multiple physical storage devices across multiple storage tiers. As discussed elsewhere herein, the storage tiers may be associated with data devices, like the data devices 61-67 discussed herein, so that, for example, there is one data device for each storage tier, one data device for multiple storage tiers, any portion of a data device for any portion of the pools of storage shown for the storage tiers, and/or any combinations thereof. 
For example, in an embodiment, a top tier storage pool 610 (e.g., tier 0) may include flash/solid state disk (SSD) drives that are relatively fast and expensive. Other storage pools 611-613 (e.g., tiers 1-3) may include disk drives of decreasing speeds or other configurations (i.e., 15 k rpm, 10 k rpm, 7.5 k rpm redundant array of independent disk (RAID) storage). The lowest tier of storage pool 614 (e.g., tier 4) may include, for example, tape storage, largest capacity disk drives (such as massive array of idle disks (MAID) storage); Col. 17, lines 66 – Col. 18, lines 3, At the step 838, scheduling for the movement of the data may include relocating data in the particular requested tier, e.g. "faster" storage tier, to a lower tier, e.g. "slower" storage tier, to make memory available for the data temporarily stored in the global memory; Col. 21, lines 44-53, Note that the test steps 1058, 1059 effectively set the threshold according to either the policy for a particular tier (e.g., fill up to fifty percent of capacity) or according to a value for I that will provide that data not promoted to a higher tier will still be serviced properly. For example, if a policy provides that the highest tier may be filled with up to fifty percent of its capacity, but most of the data of a particular storage group is rarely accessed (has relatively low scores), then the test at the step 1059 prevents data that is rarely accessed from being promoted to the highest tier; Examiner’s Note: Scheduling is defined as relocating data in the particular requested tier. Each tier is a scheduling candidate that has storage capacity constraints.). 14. With regard to claim 10, Burke and Venkataramani teach: wherein the processor: identifies computation costs required for performing the neural network operation for each of the plurality of schedule candidates (Col. 12, lines 42-45, In an embodiment herein, data may be moved between physical disk drives (or other physical storage) having different characteristics, such as speed, cost, reliability, availability, security and/or other characteristics; Col. 17, lines 66 – Col. 18, lines 3, At the step 838, scheduling for the movement of the data may include relocating data in the particular requested tier, e.g. "faster" storage tier, to a lower tier, e.g. "slower" storage tier, to make memory available for the data temporarily stored in the global memory; Examiner’s Note: Scheduling is defined as relocating data in the particular requested tier. Each tier has particular cost associated with it.); and identifies a schedule candidate, which has smallest computation costs among the plurality of schedule candidates, as the neural network computation schedule, wherein the computation costs include at least one of energy and time required for performing the neural network operation (Col. 13, lines 20-37, The policy information provides the specific criteria used for data storage and management. After the step 504, processing proceeds to a step 506 where the policy is applied to the stored data. The policy may include criteria used for managing stored data such as criteria concerning frequency of use of data and/or criteria with respect to specific users and/or other criteria, such as file name, file type, file path, requesting application, expected time to re-use of the data, temporary storage only, life expectancy of the data, data type (e.g., compressed, encrypted, de-duped) and/or protection requirements of the data (e.g., store on an encrypted tier). 
The policy may be applied to identify data for lifecycle management according to characteristics of entire data volumes or any portions thereof. The policy may also consider the access history, effective performance or other characteristics about the data that might be utilized to optimize the performance, cost, availability or retention requirements of the data.). Burke teaches of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data. However, Burke fails to teach specifically that this is for a neural network, and that the operation is a multiply-and-accumulate (MAC) operation. However, in analogous art, Venkataramani teaches the neural network, see at least [0067]; [0085]; [0087]; [0156]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Burke with the teachings of Venkataramani where this is done on a neural network device. As described above, Burke teaches of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data. Similarly, Venkataramani teaches of neural network operations that partition the entire memory state across tiles that hold all of the features and errors of the network ([0085]). This is similar to the information concerning access of the data, such as access frequency, time of last access or use and/or other characteristics and statistics, as discussed in Burke. Moreover, the tiles are coupled with a memory hierarchy, similarly to Burke. An operation is performed in order to output data, and the data can be stored in the memory. Together, Burke and Venkataramani teach of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data and Venkataramani shows that it can be done on a neural network device. Additionally, it would be beneficial to identify a scheduling candidate with the smallest computation cost in order to efficiently execute the neural network operation. 15. Regarding claim 12, it is rejected under the same reasoning as claim 1 above. Therefore, it is rejected under the same rationale. 16. Regarding claim 15, it is rejected under the same reasoning as claim 4 above. Therefore, it is rejected under the same rationale. 17. Regarding claim 19, it is rejected under the same reasoning as claim 10 above. Therefore, it is rejected under the same rationale. 18. With regard to claim 20, Burke and Venkataramani teach: A computer program stored in a non-transitory computer-readable recording medium to execute the method of controlling a neural network computing device of claim 12 (Col. 30, lines 46-51, Further, computer software, stored in a computer-readable medium (non-transitory computer-readable medium), may be provided according to the system described herein including executable code for carrying out any of the steps and processes described herein.). 
Burke teaches of scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data. However, Burke fails to teach specifically that this is for a neural network. However, in analogous art, Venkataramani teaches that this can be done on a neural network device. The explanation can be found in the mapping for claim 1. 19. Claims 2-3, 5-7, 13-14, and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Burke et al. US 8838887 B1 and Venkataramani et al. US 20190303743 A1, as applied in claim 1, in further view of Dave et al. in dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators. Dave et al. in dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators was cited in IDS filed on November 4, 2024. 20. With regard to claim 2, Burke and Venkataramani teach the neural network computing device of claim 1 and Dave et al. in dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators teaches: wherein the processor performs the first scheduling based on at least one of a plurality of strategies, performs the second scheduling based on at least one of the plurality of strategies, identifies a plurality of schedule candidates based on the result of the first scheduling and the result of the second scheduling, and identifies one of the plurality of schedule candidates as the neural network computation schedule (4.2.3, However, to allow re-compiling applications by users and rapid design space explorations, the optimizer should be able to generate a highly efficient solution promptly. So, dMazeRunner embeds a pruning heuristic that achieves close-to-optimal solutions in second(s) through the following strategies: OPT 1, OPT 2, OPT 3, OPT 4, and OPT5; 4.2.3, OPT 4) For example, in Table 2, only schedules #8 and #15 maximize the reuse of weights and ofmap respectively. Thus, schedules #2–#7 and #9–#14 are discarded; Examiner’s Note: Each different OPT is a different strategy. An example of OPT 4 is given where it identifies the schedules that maximize the reuse of weights and ofmap respectively and discards the others, identifying the two ideal computation schedules.), and the plurality of strategies include a first strategy for enhancing a utilization rate of a low level memory hierarchy among the plurality of memory hierarchies (4.2.3, OPT 2) Discard execution methods requiring several memory accesses of noncontiguous data: Some IVs of loops correspond to a minor dimension of tensors (fy and fx for W[m][c][fy][fx]). For such IVs, when tiling factors of L3 loops (i.e., Fy_DRAM) are greater than 1, it requires many DMA invocations with small burst-sizes. Thus, it results in higher DMA cycles and may introduce the miss penalty for SPM management. So, dMazeRunner discards such execution methods which are susceptible to higher execution time; Examiner’s Note: This strategy discards methods that require several memory accesses of noncontiguous data. This would mean that there would need to be multiple individual requests to access the data. Doing so impacts low-level memory hierarchy by increasing the execution time. 
By discarding methods that require several memory accesses, low-level memory hierarchy utilization is enhanced.), a second strategy for enhancing a utilization rate of a high level memory hierarchy among the plurality of memory hierarchies (4.2.3, OPT1) Targeting execution methods featuring high resource utilization: dMazeRunner explores only those tiling factors that highly utilize (e.g., 60%) RFs, SPM, and PEs. High utilization improves data reuse and reduces DRAM accesses. Note that very high utilization does not guarantee an optimal solution, as it may not effectively interleave computation and communication cycles; Examiner’s Note: This strategy targets execution methods that have high resource utilization. High level memory hierarchies are faster and are able to store recently accessed data for quicker retrieval. This is better for data reuse and reduces DRAM accesses, which enhance a utilization rate of high-level memory hierarchy.), a third strategy for keeping a balance between the utilization rate of the high level memory hierarchy and the utilization rate of the low level memory hierarchy (4.2.3, OPT3) Discard execution methods that require inter-PE communication: Often a read + write (r+w) operand (O) is an invariant of few IVs (c, fy, and fx). If loops corresponding to these IVs execute spatially, it requires inter-PE communication (for reduction), which may introduce stall cycles and often costs higher energy. Therefore, to avoid inter-PE communication, dMazeRunner decides not to execute such loops in space. This strategy discards several dataflow mechanisms (e.g., weight-stationary, row-stationary); Examiner’s Note: This strategy discards execution methods that require inter-PE communication and several dataflow mechanisms like weight-stationary and row-stationary. These dataflow mechanisms are fixed and might not be optimal all the time. To find a balance between low and high level hierarchies, it might be better to adapt and find the best way to utilize both low and high level memory hierarchies. By discarding execution methods that require inter-PE communication and several dataflow mechanisms like weight-stationary and row-stationary, the system can find a balance of low and high level hierarchies and choose the best memory level(s) for optimal use.), and a fourth strategy for preventing repeated transmission and reception of the neural network computation data (4.2.3, OPT 4) Targeting execution methods that maximize the reuse of operands: Although dMazeRunner determines all loop-orderings featuring unique reuse factors, space can be pruned to few orderings that maximize the data reuse. For example, in Table 2, only schedules #8 and #15 maximize the reuse of weights and ofmap respectively. Thus, schedules #2–#7 and #9–#14 are discarded; Examiner’s Note: This strategy maximizes data reuse, which indicates that it is reusing current data instead of transmitting or receiving additional data. This is analogous with preventing repeated transmission and reception of the neural network computation data.). 
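To summarize the pruning behavior the cited OPT strategies describe, the sketch below (Python; the candidate fields and example values are assumptions, while the 60% utilization figure follows the quoted text) discards schedule candidates with low resource utilization, candidates requiring non-contiguous DRAM accesses or inter-PE reduction, and candidates that do not maximize operand reuse. It paraphrases the quoted strategies rather than reproducing dMazeRunner's implementation.

    # Sketch of OPT-style pruning over schedule candidates, paraphrasing the
    # strategies quoted from Dave, § 4.2.3. Candidate fields and the example
    # values are assumptions; the 60% utilization threshold follows the text.
    def prune_candidates(candidates):
        survivors = []
        best_reuse = max(c["operand_reuse"] for c in candidates)
        for c in candidates:
            if c["utilization"] < 0.60:           # OPT 1: keep high RF/SPM/PE utilization
                continue
            if c["noncontiguous_dram_access"]:    # OPT 2: avoid many small DMA bursts
                continue
            if c["needs_inter_pe_reduction"]:     # OPT 3: avoid inter-PE communication
                continue
            if c["operand_reuse"] < best_reuse:   # OPT 4: keep only max-reuse orderings
                continue
            survivors.append(c)
        return survivors

    # Example: only the high-utilization, contiguous, max-reuse candidate survives.
    candidates = [
        {"utilization": 0.75, "noncontiguous_dram_access": False,
         "needs_inter_pe_reduction": False, "operand_reuse": 16},
        {"utilization": 0.40, "noncontiguous_dram_access": False,
         "needs_inter_pe_reduction": False, "operand_reuse": 16},
        {"utilization": 0.80, "noncontiguous_dram_access": True,
         "needs_inter_pe_reduction": False, "operand_reuse": 16},
    ]
    assert len(prune_candidates(candidates)) == 1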
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Burke and Venkataramani with the teachings of Dave wherein the processor performs the first scheduling based on at least one of a plurality of strategies, performs the second scheduling based on at least one of the plurality of strategies, identifies a plurality of schedule candidates based on the result of the first scheduling and the result of the second scheduling, and identifies one of the plurality of schedule candidates as the neural network computation schedule, and the plurality of strategies include a first strategy for enhancing a utilization rate of a low level memory hierarchy among the plurality of memory hierarchies, a second strategy for enhancing a utilization rate of a high level memory hierarchy among the plurality of memory hierarchies, a third strategy for keeping a balance between the utilization rate of the high level memory hierarchy and the utilization rate of the low level memory hierarchy, and a fourth strategy for preventing repeated transmission and reception of the neural network computation data. Together, Burke and Venkataramani teach scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data, and Venkataramani shows that it can be done on a neural network device wherein the operation is a MAC operation. Similarly, Dave teaches memory hierarchies used for neural network operations. Dave further teaches different strategies for scheduling data movement between memory hierarchies. With multiple scheduling strategies available, the most suitable schedule can be selected to best execute the scheduling and the neural network operations. 21. With regard to claim 3, Dave further teaches: wherein the processor performs the first scheduling and the second scheduling for each of a plurality of dataflow combinations which are generated based on a dataflow of the first movement and a dataflow of the second movement (4.3.3, Accurate modeling of continuous data reuse through several RF+SPM passes: Depending on the ordering of the loops, some operand gets reused continuously throughout all RF passes of an SPM pass and through several such SPM passes. For example, for an ordering of L2 loops with IVs {n_L2, m_L2, oy_L2, ox_L2, c_L2, fy_L2, fx_L2} (outermost to innermost) with TCs <1,1,1,1,4,3,3>, total RF passes in an SPM pass are 4×3×3 = 36. In each RF pass, operands I and W are communicated from SPM to RFs via NoC while O is reused in RFs. Now, for an ordering of L3 loops with IVs {oy_L3, ox_L3, fy_L3, fx_L3, n_L3, m_L3, c_L3} with TCs <1,1,1,1,2,32,16>, O gets reused in consecutive 16 SPM passes. Thus, write-back of O occurs just once after every 16 SPM passes; each SPM pass consists of 36 RF passes.
We refer to such reuse of data at consecutive memory levels as a continuous reuse and accurately model it for various operands; Examiner’s Note: This is an example where, based on the dataflow of L2 (dataflow of the first movement) and the dataflow of L3 (dataflow of the second movement), a scheduling is determined in which write-back of O occurs just once after every 16 SPM passes; each SPM pass consists of 36 RF passes.), and the dataflow is determined based on a parameter about which data is reused, among a plurality of neural network parameters of the neural network computation data (4.2.3, OPT 4) Targeting execution methods that maximize the reuse of operands: Although dMazeRunner determines all loop-orderings featuring unique reuse factors, space can be pruned to few orderings that maximize the data reuse. For example, in Table 2, only schedules #8 and #15 maximize the reuse of weights and ofmap respectively. Thus, schedules #2–#7 and #9–#14 are discarded; Examiner’s Note: In OPT 4, the dataflow is determined based on the reuse of operands; execution methods that maximize the reuse of operands are targeted. There are also other parameters of the neural network computation data, as can be seen in the other OPTs.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Burke and Venkataramani with the teachings of Dave wherein the processor performs the first scheduling and the second scheduling for each of a plurality of dataflow combinations which are generated based on a dataflow of the first movement and a dataflow of the second movement, and the dataflow is determined based on a parameter about which data is reused, among a plurality of neural network parameters of the neural network computation data. Together, Burke and Venkataramani teach scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data, and Venkataramani shows that it can be done on a neural network device wherein the operation is a MAC operation. Similarly, Dave teaches memory hierarchies used for neural network operations. Dave further teaches different strategies, such as performing the first scheduling and the second scheduling for each of a plurality of dataflow combinations which are generated based on a dataflow of the first movement and a dataflow of the second movement, where the dataflow is determined based on a parameter about which data is reused, among a plurality of neural network parameters of the neural network computation data. By doing so, data movement is minimized and overall energy consumption is reduced. 22. With regard to claim 5, Dave further teaches: wherein the processor: performs the first scheduling based on the second strategy to maximize a data tile size of the neural network computation data which performs the first movement (4.1 Execution on the dataflow accelerators takes place by means of executing the loop iterations onto the PE array both spatially and temporally. To determine spatial execution onto PEs and the data accessed from RFs, SPM, and DRAM, we explicitly tile each loop of the loop-nest at these four levels; 4.1, The seven loops at levels L1, L2, and L3 execute temporally on each PE and are configured to specify the accesses to RF, SPM, and DRAM.
Here, tiling factors (e.g., N_SPM=2) impact the size of the data accessed from L1/L2/L3 memory (Section 4.3.1 provides the exact calculation), and ordering of the loop determines the schedule of data movement i.e., data reuse/eviction. In the proposed representation, since each loop of the input nest is explicitly modeled for spatial execution and for accessing data from L1/L2/L3 memory, it allows capturing the vast space of execution methods; Examiner’s Note: “To determine spatial execution onto PEs and the data accessed from RFs, SPM, and DRAM, we explicitly tile each loop of the loop-nest at these four levels” indicates that data tiling is maximized since it is done for each loop of the loop-nest at the four stated levels. As expressed earlier, each loop L1/L2/L3 represents scheduling of data movement between memory hierarchies. The movement of the data through these loops depicts a scheduling of a first movement of data from a first memory hierarchy (L3) to a second memory hierarchy having a higher level than the first (L2).); and performs the second scheduling based on the second strategy to maximize a data tile size of the neural network computation data which performs the second movement (4.1 Execution on the dataflow accelerators takes place by means of executing the loop iterations onto the PE array both spatially and temporally. To determine spatial execution onto PEs and the data accessed from RFs, SPM, and DRAM, we explicitly tile each loop of the loop-nest at these four levels; 4.1, The seven loops at levels L1, L2, and L3 execute temporally on each PE and are configured to specify the accesses to RF, SPM, and DRAM. Here, tiling factors (e.g., N_SPM=2) impact the size of the data accessed from L1/L2/L3 memory (Section 4.3.1 provides the exact calculation), and ordering of the loop determines the schedule of data movement i.e., data reuse/eviction. In the proposed representation, since each loop of the input nest is explicitly modeled for spatial execution and for accessing data from L1/L2/L3 memory, it allows capturing the vast space of execution methods; Examiner’s Note: “To determine spatial execution onto PEs and the data accessed from RFs, SPM, and DRAM, we explicitly tile each loop of the loop-nest at these four levels” indicates that data tiling is maximized since it is done for each loop of the loop-nest at the four stated levels. As expressed earlier, each loop L1/L2/L3 represents scheduling of data movement between memory hierarchies. The movement of the data through these loops depicts a scheduling of a first movement of data from a first memory hierarchy (L3) to a second memory hierarchy having a higher level than the first (L2). The tiling and scheduling are done on each L1/L2/L3 loop; therefore, a second scheduling can be performed after the first one.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Burke and Venkataramani with the teachings of Dave wherein the processor: performs the first scheduling based on the second strategy to maximize a data tile size of the neural network computation data which performs the first movement; and performs the second scheduling based on the second strategy to maximize a data tile size of the neural network computation data which performs the second movement.
Together, Burke and Venkataramani teach scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data, and Venkataramani shows that it can be done on a neural network device wherein the operation is a MAC operation. Similarly, Dave teaches memory hierarchies used for neural network operations. Dave further teaches different strategies, such as maximizing a data tile size. By maximizing a data tile size, data locality and hardware utilization are improved. 23. With regard to claim 6, Dave further teaches: wherein the processor: performs the first scheduling based on the third strategy to maximize a number of mapping candidates for the second movement (4.2.1, Therefore, for a given loop-nest, it is possible to create a list of all those loop-orderings (schedules) that feature unique reuse of operands, and the optimizer needs to target just those orderings; Examiner’s Note: A list of all loop orderings that feature unique reuse of operands is created, which maximizes the number of mapping candidates.); and performs the second scheduling based on the third strategy to maximize the number of mapping candidates for a third movement of the neural network computation data between the third memory hierarchy and a fourth memory hierarchy having a higher level than the third memory hierarchy (4.2.1, Figure 6 depicts a 4-deep loop-nest along with information about each operand being invariant of certain loops. To explain the impact of orderings, in this example, we assume that the current memory level (e.g., RF) can accommodate 3 data elements. Thus, during each loop iteration (total 192), the data corresponding to each operand can be accessed from lower memory (e.g., SPM) and brought to the current memory level; Examiner’s Note: There is a four-deep loop-nest, which means there are at least a first scheduling, a second scheduling, and a third movement. Each loop is able to access data from a lower memory and move it to the current memory. This is analogous to moving data between the third and fourth memory hierarchies, wherein the fourth has a higher level than the third memory hierarchy, and likewise for the other levels. Therefore, the number of valid mappings is maximized since a list of all loop orders that have unique reuse of operands is created, and scheduling is done to move data from lower levels to higher levels based on this list.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Burke and Venkataramani with the teachings of Dave wherein the processor: performs the first scheduling based on the third strategy to maximize a number of mapping candidates for the second movement; and performs the second scheduling based on the third strategy to maximize the number of mapping candidates for a third movement of the neural network computation data between the third memory hierarchy and a fourth memory hierarchy having a higher level than the third memory hierarchy.
Together, Burke and Venkataramani teach scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data, and Venkataramani shows that it can be done on a neural network device wherein the operation is a MAC operation. Similarly, Dave teaches memory hierarchies used for neural network operations. Dave further teaches different strategies, such as maximizing a number of mapping candidates. By maximizing the number of mapping candidates, there are more choices for finding the ideal candidate for scheduling the movement of data needed to perform the neural network operation. 24. With regard to claim 7, Dave further teaches: wherein the processor: performs the first scheduling based on the fourth strategy to prevent the repeated transmission and reception of the neural network computation data in the second movement (4.2.3, OPT 4) Targeting execution methods that maximize the reuse of operands: Although dMazeRunner determines all loop-orderings featuring unique reuse factors, space can be pruned to few orderings that maximize the data reuse. For example, in Table 2, only schedules #8 and #15 maximize the reuse of weights and ofmap respectively. Thus, schedules #2–#7 and #9–#14 are discarded; Examiner’s Note: This strategy maximizes data reuse, which indicates that it is reusing current data instead of transmitting or receiving additional data. This is analogous to preventing repeated transmission and reception of the neural network computation data. The example given shows a first scheduling that is done based on discarding schedules that do not maximize reuse of weights and ofmap. Schedule #8 is the first scheduling.); and performs the second scheduling based on the fourth strategy to prevent the repeated transmission and reception of the neural network computation data in a third movement of the neural network computation data between the third memory hierarchy and a fourth memory hierarchy having a higher level than the third memory hierarchy (4.2.3, OPT 4) Targeting execution methods that maximize the reuse of operands: Although dMazeRunner determines all loop-orderings featuring unique reuse factors, space can be pruned to few orderings that maximize the data reuse. For example, in Table 2, only schedules #8 and #15 maximize the reuse of weights and ofmap respectively. Thus, schedules #2–#7 and #9–#14 are discarded; Examiner’s Note: This strategy maximizes data reuse, which indicates that it is reusing current data instead of transmitting or receiving additional data. This is analogous to preventing repeated transmission and reception of the neural network computation data. The example given shows a second scheduling that is done based on discarding schedules that do not maximize reuse of weights and ofmap. Schedule #15 is the second scheduling.).
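As a rough, illustrative sketch of how maximizing operand reuse prevents repeated transmission and reception of the same data, the snippet below counts how many times a single operand would be fetched from the lower memory level with and without reuse across invariant loops. The loop names, trip counts, and invariance sets are assumptions chosen for illustration, not figures taken from the cited references (Python).

# Counts fetches of one operand from the lower memory level; loops the
# operand is invariant of contribute no additional transfers because the
# data already resident at the current level is reused.
def count_transfers(loop_trip_counts, invariant_loops):
    transfers = 1
    for loop_name, trips in loop_trip_counts.items():
        if loop_name not in invariant_loops:
            transfers *= trips
    return transfers

# Assumed 4-deep nest; a weight tensor W[m][c][fy][fx] does not depend on the
# batch loop n or the output-row loop oy, so iterating those loops reuses W.
trip_counts = {"n": 2, "m": 32, "oy": 3, "c": 16}
print(count_transfers(trip_counts, invariant_loops={"n", "oy"}))  # 512 fetches
print(count_transfers(trip_counts, invariant_loops=set()))        # 3072 fetches

Under these assumed counts, exploiting invariance cuts the number of transfers by a factor of six, which is the effect the fourth strategy is credited with above.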
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Burke and Venkataramani with the teachings of Dave wherein the processor: performs the first scheduling based on the fourth strategy to prevent the repeated transmission and reception of the neural network computation data in the second movement; and performs the second scheduling based on the fourth strategy to prevent the repeated transmission and reception of the neural network computation data in a third movement of the neural network computation data between the third memory hierarchy and a fourth memory hierarchy having a higher level than the third memory hierarchy. Together, Burke and Venkataramani teach scheduling the movement of data between memory hierarchies based on a data size, identifying a computation schedule for the operation, and using the schedule to perform an operation that uses the input and weight data to generate output data, and Venkataramani shows that it can be done on a neural network device wherein the operation is a MAC operation. Similarly, Dave teaches memory hierarchies used for neural network operations. Dave further teaches different strategies, such as preventing repeated transmission and reception of neural network computation data. By doing so, bandwidth consumption is lowered and latency is reduced. 25. Regarding claim 13, it is rejected under the same reasoning and rationale as claim 2 above. 26. Regarding claim 14, it is rejected under the same reasoning and rationale as claim 3 above. 27. Regarding claim 16, it is rejected under the same reasoning and rationale as claim 5 above. 28. Regarding claim 17, it is rejected under the same reasoning and rationale as claim 6 above. 29. Regarding claim 18, it is rejected under the same reasoning and rationale as claim 7 above. Conclusion Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to AN-AN N NGUYEN whose telephone number is (571)272-6147. The examiner can normally be reached Monday-Friday 8:00-5:00 ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, AIMEE LI can be reached at (571) 272-4169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /AN-AN NGOC NGUYEN/Examiner, Art Unit 2195 /Aimee Li/Supervisory Patent Examiner, Art Unit 2195
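As a numerical aside, the continuous-reuse arithmetic quoted in the claim 3 mapping above (36 RF passes per SPM pass, with write-back of O occurring once every 16 SPM passes) can be reproduced with a minimal sketch; the variable names below are assumptions, while the trip counts are the ones quoted from Dave (Python).

# Reproduces the continuous-reuse arithmetic from the passage of Dave,
# Section 4.3.3, quoted in the claim 3 mapping; variable names are
# illustrative assumptions.
from math import prod

l2_trip_counts = (1, 1, 1, 1, 4, 3, 3)         # n, m, oy, ox, c, fy, fx at level L2
rf_passes_per_spm_pass = prod(l2_trip_counts)  # 4 * 3 * 3 = 36
spm_passes_o_reused = 16                       # c_L3 trip count from the L3 ordering

print(rf_passes_per_spm_pass)  # 36
print(f"O is written back once every {spm_passes_o_reused} SPM passes "
      f"(= {spm_passes_o_reused * rf_passes_per_spm_pass} RF passes)")  # 576 RF passes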

Prosecution Timeline

Aug 08, 2022
Application Filed
Jul 16, 2025
Non-Final Rejection — §103, §112
Oct 28, 2025
Response Filed
Jan 12, 2026
Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561130
MAINTENANCE MODE IN HCI ENVIRONMENT
2y 5m to grant; granted Feb 24, 2026
Patent 12511156
CREDIT-BASED SCHEDULING USING LOAD PREDICTION
2y 5m to grant; granted Dec 30, 2025

Prosecution Projections

3-4
Expected OA Rounds
83%
Grant Probability
99%
With Interview (+50.0%)
3y 5m
Median Time to Grant
Moderate
PTA Risk
Based on 6 resolved cases by this examiner. Grant probability derived from career allow rate.
