DETAILED ACTION
This action is responsive to the application filed on 11/25/2025. Claims 21, 24-25, 27-31, 34-36, and 38-48 are pending and have been examined. This action is Final.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Applicant’s claim for the benefit of a prior-filed application under 35 U.S.C. 119(e) or under 35 U.S.C. 120, 121, 365(c), or 386(c) is acknowledged.
Response to Arguments
Applicant's arguments filed 11/25/2025 have been fully considered but they are not persuasive, except where otherwise indicated. Please see the following responses below:
Argument 1: The applicant argues that the prior-art rejection under 35 U.S.C. 103 should be withdrawn because the applied references, whether considered individually or in combination, fail to disclose or suggest the amended limitations of the independent claims and therefore do not render the claims obvious. For claims 21, 27-31, and 38-40, the applicant contends that Chen and Du do not teach a reader or convolution-feeder architecture in which, under control of a convolution sequencer, the feeder sequentially reads data groups each having more pieces of data than the unit data throughput defined by UnitSize(#MAC), stores those data groups in an input data queue, and transmits a queued data group to a shift buffer for partial per-cycle consumption, nor do they teach a processor that further accumulates output data in K or more accumulators having a width equal to UnitSize(#MAC). The applicant further argues that Dally does not cure these deficiencies and has been mischaracterized in the rejection because Dally relies on an input-stationary PTIS sparse dataflow in which the same input activations are reused across output channels, rather than reading, storing, or transmitting data groups larger than the unit data throughput as claimed. The applicant additionally asserts that Ben-Cheikh, which addresses tiling and halo techniques to improve data locality and compute intensity, likewise fails to disclose or suggest the claimed oversized data-group queuing, shift-buffer transmission, or the required accumulation behavior, and does not provide a rationale to modify Chen, Du, or Dally to arrive at the claimed invention. For claims 24, 25, and 34-36, the applicant argues that the additional reference Sun similarly fails to supply the missing limitations or a motivation to combine with the other references.
For new claims 41 to 48, the applicant asserts patentability at least by dependency from amended claims 21 and 31 and further contends that the cited references still fail to disclose or suggest the additional limitations recited therein.
Examiner Response to Argument 1: The examiner has considered the argument set forth above; however, in light of the amendments and new claims, the rejection is maintained for claims 21, 27-31, and 38-40 because Chen, Du, Dally, and Ben-Cheikh, as applied, collectively teach or render obvious the claimed convolution processing architecture, including the amended reader, feeder, and accumulation features. Chen teaches executing convolution operations using K×K filters on input feature maps to generate output feature maps and further teaches organizing partial-sum accumulation based on filter geometry, including vertically accumulating rows of partial sums and determining processing-set height based on filter height, which is directed to accumulation and storage behavior tied to filter height. Du teaches reading input data from memory, buffering a portion of the input data for later output, and generating remap data for convolution, and further teaches generating multiple sets of remap data and executing multiple convolution operations, which is directed to sequentially staging groups of input data for reuse and performing convolution multiple times on such staged data based on a unit processing throughput.
Dally teaches a sequencer-controlled architecture that controls memory reads, stages vectors in FIFO structures, and sequences through multiple activation vectors for reuse, which is directed to queue- and buffer-based staging and successive transmission of vectors corresponding to sequential parts of a staged group. Dally further teaches an array of processing elements each including multiplier and accumulation logic, vectors distributed into an FxI multiplier array, and accumulation of multiplication results to produce output activations, which is directed to accumulating output data using multiple accumulators with an accumulation width matched to the per-cycle unit throughput of the parallel multiplier array, consistent with the amended accumulation requirement; Dally also teaches decoded-address-based selection of a jth accumulator unit and count-based control of processing iterations, which is directed to the dependent accumulation-count and accumulator-indexing features. Ben-Cheikh teaches halo-augmented tile sizing where a tile interior width is padded by a radius corresponding to half the kernel width to avoid extra global memory exchange, which is directed to defining staged data groups larger than a unit processing window by floor(K/2) on each side, consistent with the amended group-size expression, and supports the rationale to size staged data to support reuse without redundant fetches. For claims 24, 25, and 34-36, the rejection is likewise maintained because Sun, as additionally applied, teaches further convolution processing and control features consistent with the dependent limitations, and the applied combination continues to render the claims obvious for the same reasons the independent claims remain unpatentable.
For new claims 41-48, the rejection is maintained because these claims depend from amended claims 21 and 31 and are not patentable for at least that reason, and further because the additional limitations are taught by the applied references, including Dally’s count-controlled processing and accumulator-selection logic and Chen’s filter-height-based organization of accumulation, such that the new dependent features represent predictable variations of known accumulator control and indexing techniques in convolution accelerators. Accordingly, for at least the reasons above, the rejection is maintained.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this
Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not
identically disclosed as set forth in section 102, if the differences between the claimed invention and the
prior art are such that the claimed invention as a whole would have been obvious before the effective filing
date of the claimed invention to a person having ordinary skill in the art to which the claimed invention
pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are
summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 21, 27-31, and 38-48 are rejected under 35 U.S.C. 103 as being unpatentable over NPL reference “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks” by Chen et al. (referred to herein as Chen) in view of US10162799B2 by Du et al. (referred to herein as Du), in view of US10891538B2 by Dally et al. (referred to herein as Dally), and further in view of NPL reference “Parallelization Strategies for Modern Computing Platforms: Application to Illustrative Image Processing and Computer Vision Applications” by Ben-Cheikh et al. (referred to herein as Ben-Cheikh).
Regarding claim 21, Chen teaches:
A device for processing convolution operations, comprising: a processor that: executes, in a neural network, a convolution operation on input data in a form of width×height×input channel and on a filter in a form of K×K×input channel or K×K to correspond to a form of the input data, K being an integer greater than or equal to one, and generates output data in a form of width×height×output channel; and ([Chen, page 3, sec 3] “A CONV layer applies filters on the input fmaps (ifmaps) to extract embedded visual characteristics and generate the output fmaps (ofmaps). The dimensions of both filters and fmaps are 4D: each filter or fmap is a 3D structure consisting of multiple 2D planes, i.e., channels, and a batch of 3D ifmaps is processed by a group of 3D filters in a CONV layer.”, wherein the examiner interprets “The dimensions of both filters and fmaps are 4D: each filter or fmap is a 3D structure consisting of multiple 2D planes, i.e., channels, and a batch of 3D ifmaps is processed by a group of 3D filters in a CONV layer” to be the same as “data in a form of width×height×input channel” because “input channel” is a “2D plane”, hence width×height×2D is 4D, similar to the fmaps and output fmaps (ofmaps).)
Chen does not teach a reader that: sequentially reads, from a memory storing the input data, a data group having more pieces of data than unit data throughput of an operator, and provides the data group to the operator to reuse at least one piece of data constituting the data group in the convolution operation, wherein the processor further executes, by using one or more operators identical to the operator, the convolution operation on the data constituting the data group and on the filter multiple times based on the unit data throughput.
Du teaches a reader that: sequentially reads, from a memory storing the input data, a data group having more pieces of data than unit data throughput of an operator, and provides the data group to the operator to reuse at least one piece of data constituting the data group in the convolution operation, ([Du, col 2, lines 52-67] “To achieve the above objectives, the present invention discloses a buffer device, which is coupled to a memory and includes input lines, an input buffer unit and a remapping unit. The input lines are coupled to the memory and configured to be inputted with data from the memory in a current clock. The input buffer unit is coupled to the input lines and configured to buffer a part of the inputted data and output the part of the inputted data in a later clock. The remapping unit is coupled to the input lines and the input buffer unit, and configured to generate remap data for a convolution operation according to the data on the input lines and an output of the input buffer unit in the current clock.”, wherein the examiner interprets “input buffer unit is coupled to the input lines and configured to buffer a part of the inputted data” to be the same as “sequentially reads, from memory storing the input data” where data is buffered as a group of data. The examiner further interprets “and configured to generate remap data for a convolution operation” to be the same as “constituting the data group in the convolution operation” as they are both performing convolution on the buffered data which is group data.)
wherein the processor further executes, by using one or more operators identical to the operator, the convolution operation on the data constituting the data group and on the filter multiple times based on the unit data throughput ([Du, col 7, lines 42-47] “For example, W data are inputted from the memory 1 to the input lines 21 in the current clock, and the remapping unit 23 generates W sets of remap data, which are inputted to the convolution operation module 3. Then, the convolution operation module 3 executes W convolution operations according to the W sets of remap data.”, wherein the examiner interprets “executes W convolution operations” to be the same as “convolution operation on the data constituting the data group and on the filter multiple times” because W is an integer and indicates the number of convolution operations, which is clearly multiple times.)
Chen and Du does not teach the reader comprises: a convolution feeder, and a convolution sequencer comprising an input data queue and a shift buffer, and the convolution feeder: sequentially reads data groups each having more pieces of data than the unit data throughput from the memory under control of the convolution sequencer, stores the data groups in the input data queue, and transmits one of the data groups stored in the input data queue to the shift buffer, and the convolution sequencer, transmits a data array having a data amount that is the same as the unit data throughput from the shift buffer to the processor, and transmits another data array having a data amount that is the same as the unit data throughput but different from the data array from the shift buffer to the processor, and the data array and the other data array correspond to a sequential part of the data constituting the one of the data groups and have same data part and different data parts as and from each other, wherein: an amount of data in the data array is the same as UnitSize(#MAC) that is the unit data throughput, and an amount in each of the data groups is defined by formula {floor(K/2)+UnitSize(#MAC)+floor(K/2)} or more obtained by adding floor(K/2) that is a maximum integer value of K/2, to the UnitSize(#MAC) that is the unit data throughput, twice, where K is a constant determined based on the form of the filter K x K x input channel or K x K and is an integer greater than or equal to one.
Dally teaches:
the reader comprises: a convolution feeder, and a convolution sequencer comprising an input data queue and a shift buffer, and the convolution feeder: sequentially reads data groups each having more pieces of data than the unit data throughput from the memory under control of the convolution sequencer, stores the data groups in the input data queue, and transmits one of the data groups stored in the input data queue to the shift buffer, and the convolution sequencer, transmits a data array having a data amount that is the same as the unit data throughput from the shift buffer to the processor, and transmits another data array having a data amount that is the same as the unit data throughput but different from the data array from the shift buffer to the processor, and the data array and the other data array correspond to a sequential part of the data constituting the one of the data groups and have same data part and different data parts as and from each other; ([Dally, [0046-0048]] “The layer sequencer 215 controls the reading of the memory to obtain the compact input activations and compact weights…In one embodiment, the layer sequencer 215 broadcasts a weight vector to each PE 210 and sequences through multiple activation vectors before broadcasting another weight vector.”, and [Dally, [0081 - 0086]] “First, a vector F of compressed weights and a vector I of compressed input activations are fetched from the weight buffer 305 and the input activations buffer 310, respectively. The vectors are distributed into the FxI multiplier array 325 that computes a form of the cartesian product of the vectors…In one embodiment, the weight buffer 305 is a first-in first-out FIFO buffer (WFIFO).” “In one embodiment, the weight buffer 305 is a FIFO buffer that includes a tail pointer, a channel pointer, and a head pointer. 
The layer sequencer 215 controls the ‘input’ side of the weight buffer 305, pushing weight vectors into the weight buffer 305.”, and [Dally, [0050]] “Additionally, the input activation vectors may be reused within each PE 210 in an input stationary fashion against a number of weight vectors to reduce data accesses.”, wherein the examiner interprets the layer sequencer’s control of memory reads and vector broadcasts together with pointer-managed FIFO staging (tail/channel/head pointers and pointer sequencing) to be the same as a convolution sequencer comprising an input data queue and a shift buffer because they are both directed to orchestrating sequential delivery of buffered data groups from memory into queues/buffers for reuse during convolution, and interprets the “first-in first-out FIFO buffer (WFIFO)” and the layer sequencer “pushing weight vectors” to be the same as storing data groups in an input data queue because they are both directed to enqueuing groups larger than a per-cycle throughput for later staged consumption. The examiner further interprets the fetching of “a vector F … and a vector I …” to be the same as transmitting a data array having a data amount that is the same as the unit data throughput because they are both directed to delivering fixed-width vectors to the processor each cycle. Finally, the examiner further interprets the sequencer “sequences through multiple activation vectors” with “input activation vectors … reused … in an input stationary fashion” to be the same as transmitting another, different data array corresponding to a sequential part of the same data group that shares some data with the prior array because they are both directed to issuing successive, partially overlapping arrays (same data part and different data parts) for convolutional reuse.)
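As an illustration only (the names, sizes, and values below are hypothetical and are not drawn from Dally, Du, or the claims), the queue-and-shift-buffer staging described in the interpretation above amounts to cutting successive fixed-width, partially overlapping arrays from one staged data group:

```python
from collections import deque

UNIT_SIZE = 4        # hypothetical per-cycle unit throughput (#MAC)
K = 3                # hypothetical K x K filter width
HALO = K // 2        # floor(K/2) padding on each side of the group

# A staged data group is wider than the unit throughput: 1 + 4 + 1 = 6 elements.
group = list(range(HALO + UNIT_SIZE + HALO))

input_data_queue = deque([group])           # the feeder stores data groups here
shift_buffer = input_data_queue.popleft()   # one group moves to the shift buffer

# The sequencer emits K successive arrays of UNIT_SIZE elements, each shifted
# by one position, so consecutive arrays overlap (same data part) while also
# differing (different data parts).
arrays = [shift_buffer[s:s + UNIT_SIZE] for s in range(K)]
assert all(len(a) == UNIT_SIZE for a in arrays)
assert arrays[0][1:] == arrays[1][:-1]  # consecutive arrays share UNIT_SIZE - 1 elements
```

Each emitted array has exactly the unit-throughput width, and consecutive arrays share all but one element, which is the overlap pattern on which the interpretation of the “same data part and different data parts” limitation turns.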
the processor further accumulates the output data in K or more accumulators with a width equal to the UnitSize(#MAC). ([Dally, [0049]] “The accelerator includes an array of processing elements 210, each including a multiplier and accumulation logic.” and [Dally, [0082]] “The vectors are distributed into the FxI multiplier array 325 that computes a form of the Cartesian product of the vectors.” and [Dally, [0083]] “Each processing element accumulates the results of the multiplications to produce output activations.”, wherein the examiner interprets “an array of processing elements 210, each including a multiplier and accumulation logic” to be the same as “K or more accumulators” because each processing element includes its own accumulation logic for accumulating convolution results and the array provides multiple accumulation structures operating in parallel. The examiner further interprets the “FxI multiplier array” distributing vectors and producing parallel multiplication results as establishing a fixed per-cycle unit throughput corresponding to UnitSize(#MAC), and interprets the accumulation logic that “accumulates the results of the multiplications” as having a width corresponding to that same fixed per-cycle throughput, because the accumulation logic accumulates the parallel multiplication outputs produced by the multiplier array. Thus, Dally teaches accumulating output data using multiple accumulators with an accumulation width corresponding to the unit data throughput, as recited.)
Chen, Du, and Dally do not teach wherein: an amount of data in the data array is the same as UnitSize(#MAC) that is the unit data throughput, and an amount in each of the data groups is defined by formula {floor(K/2)+UnitSize(#MAC)+floor(K/2)} or more obtained by adding floor(K/2) that is a maximum integer value of K/2, to the UnitSize(#MAC) that is the unit data throughput, twice, where K is a constant determined based on the form of the filter K x K x input channel or K x K and is an integer greater than or equal to one.
Ben-Cheikh teaches wherein: an amount of data in the data array is the same as UnitSize(#MAC) that is the unit data throughput, and an amount in each of the data groups is defined by formula {floor(K/2)+UnitSize(#MAC)+floor(K/2)} or more obtained by adding floor(K/2) that is a maximum integer value of K/2, to the UnitSize(#MAC) that is the unit data throughput, twice, where K is a constant determined based on the form of the filter K x K x input channel or K x K and is an integer greater than or equal to one. ([Ben-Cheikh, p. 91] “For convolution with use of shared memory, each tile is augmented with an extra halo : (2× radius) … TL′ = TLy × (TLx + 2× radius) (5.23)” AND [Ben-Cheikh, p. 97] “TL = (Ty × Ny) × (Tx × Nx) (5.26)” AND [Ben-Cheikh, p. 92] “W = 2∗radius +1”, wherein the examiner interprets TLx (or equivalently Tx × Nx, the tile’s interior width) to be the same as UnitSize(#MAC) that is the unit data throughput, interprets “radius” to be the same as ⌊K/2⌋ because W = 2·radius + 1 defines K, and interprets the halo-augmented tile width (TLx + 2× radius) to be the same as the claim’s {⌊K/2⌋ + UnitSize(#MAC) + ⌊K/2⌋} group size, because they are both defining a data group whose width equals the unit-throughput window padded by a halo of ⌊K/2⌋ elements on each side for a K×K (or K×K×input-channel) filter.)
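As a numeric check only (the UnitSize and K values below are hypothetical), the claimed group-size expression and Ben-Cheikh’s halo-augmented tile width coincide for odd K, since W = 2·radius + 1 gives radius = ⌊K/2⌋:

```python
import math

def group_size(unit_size: int, k: int) -> int:
    # Claimed minimum data-group amount: floor(K/2) + UnitSize(#MAC) + floor(K/2)
    return math.floor(k / 2) + unit_size + math.floor(k / 2)

def halo_tile_width(tile_interior: int, kernel_width: int) -> int:
    # Ben-Cheikh-style halo-augmented tile width: TLx + 2*radius,
    # where W = 2*radius + 1, i.e., radius = (W - 1) // 2
    radius = (kernel_width - 1) // 2
    return tile_interior + 2 * radius

# With UnitSize(#MAC) = 16 and K = W = 3: 1 + 16 + 1 = 18 on both sides.
assert group_size(16, 3) == halo_tile_width(16, 3) == 18
assert group_size(16, 5) == halo_tile_width(16, 5) == 20
```

The check confirms the mapping relied on above: for an odd K×K kernel, padding a unit-throughput window by a halo of ⌊K/2⌋ elements on each side yields exactly the claimed group size.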
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to reader/feeder and data-staging mechanisms for convolution accelerators that fetch windowed data groups larger than per-cycle throughput and reuse overlapping data using sequencers, queues/buffers, and halo-based group sizing based on kernel width.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the convolution device disclosed by Chen to include the buffer device disclosed by Du. One would be motivated to do so to efficiently stage and pipeline input data for reuse across sequential convolution steps, as suggested by Du (Du, [col. 2, lines 52–67] “The input buffer unit is coupled to the input lines and configured to buffer a part of the inputted data and output the part of the inputted data in a later clock.”).
It would have also been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the convolution device disclosed by Chen to include the layer sequencer disclosed by Dally. One would be motivated to do so to effectively control memory reads/feeding into per-PE (processing element) buffers, as suggested by Dally (Dally, [0046-0048] “the layer sequencer 215 broadcasts a weight vector to each PE 210 and sequences through multiple activation vectors before broadcasting another weight vector.”).
It would have also been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the convolution device disclosed by Chen to include the “extra halo” padding disclosed by Ben-Cheikh. One would be motivated to do so to efficiently size each staged data group to cover edge computations without redundant external fetches, as suggested by Ben-Cheikh ([Ben-Cheikh, p. 91] “For convolution with use of shared memory, each tile is augmented with an extra halo : (2× radius) … TL′ = TLy × (TLx + 2× radius) … to avoid any extra data exchange with global memory”).

Claim 31 is analogous to claim 21, and thus faces the same rejection set forth above.
Regarding claim 27, Chen, Du, Dally, and Ben-Cheikh teach The device of claim 21, (see rejection of claim 21).
Du further teaches wherein the other data array is of an area shifted based on a preset standard from the data array in the data group of the shift buffer. ([Du, col 7, lines 14-17] “The moving distance of the sliding window S is a stride. The size of the stride is smaller than the size of the sliding window or the convolution size.”, wherein the examiner interprets “the sliding window” to be the same as “other data array” (i.e., the convolution filter) and “The moving distance of the sliding window S is a stride” to be the same as “area shifted on a preset standard” since both are conducting convolution of one area to another.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to shifting data arrays for convolution based on a preset standard.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the device of claim 21 disclosed by Chen, Du, Dally, and Ben-Cheikh to include the “the sliding window S is a stride” disclosed by Du. One would be motivated to do so to efficiently determine and increase convolution precision, as suggested by Du ([Du, col 7, lines 14-17] “The size of the stride is smaller than the size of the sliding window or the convolution size.”).
Regarding claim 28, Chen, Du, Dally, and Ben-Cheikh teach The device of claim 21, (see rejection of claim 21).
Du further teaches wherein a number of data arrays transmitted from the shift buffer to the processor for the one of the data groups by the convolution sequencer is K, and as the convolution operation on the filter is executed K times for the data array transmitted from the shift buffer by the operator, a number of times of using data of the one of the data groups is K² times. ([Du, col 2, lines 15-17] “In one embodiment, W data are inputted from the memory to the input lines in the current clock, and the remapping unit generates W sets of the remap data for W convolution operations.” And [Du, col 3, lines 7-10] “In one embodiment, the latest K data of the W data, which are still not inputted in the previous convolution operation, are kept in the buffer for the next convolution operation.”, wherein the examiner interprets the fixed number (K) of data elements buffered and then reused for the following convolution operation to be the same as “number of times of using data of the one of the data groups is K² times” when the convolution operation is executed K times.)
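The K² count is simple arithmetic: K data arrays are transmitted per data group, each consumed by K executions of the convolution operation, yielding K × K uses of the group’s data. A minimal check (the value of K below is hypothetical):

```python
K = 3
arrays_per_group = K              # data arrays transmitted per data group
ops_per_array = K                 # convolution executed K times per array
uses_per_group = arrays_per_group * ops_per_array
assert uses_per_group == K ** 2 == 9
```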
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to buffering and reusing a fixed number of data elements for multiple convolution operations.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the device of claim 21 disclosed by Chen, Du, Dally, and Ben-Cheikh to include the “remapping unit generates W sets of the remap data for W convolution operations.” disclosed by Du. One would be motivated to do so to efficiently increase throughput, as suggested by Du ([Du, col 2, lines 15-17] “W data are inputted from the memory to the input lines in the current clock, and the remapping unit generates W sets of the remap data for W convolution operations.”)
Regarding claim 29, Chen, Du, Dally, and Ben-Cheikh teach The device of claim 21, (see rejection of claim 21).
Du further teaches wherein result data calculated by the processor is transformed into a preset form and stored in the memory. ([Du, col 2, lines 10-15] “The remapping unit is coupled to the input lines and the input buffer unit, and configured to generate remap data for a convolution operation according to the data on the input lines and an output of the input buffer unit in the current clock.”, wherein the examiner interprets “…generate remap data for a convolution operation…output of the input buffer unit” to be the same as taking result data computed by the processor and transforming it into a preset format, which is then stored in the memory.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to transforming result data into a preset form and storing it in the memory.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the device of claim 21 disclosed by Chen, Du, Dally, and Ben-Cheikh to include the process of how to “generate remap data for a convolution operation” disclosed by Du. One would be motivated to do so to effectively streamline data handling, as suggested by Du ([Du, col 2, lines 10-15] “generate remap data for a convolution operation according to the data on the input lines”).
Regarding claim 30, Chen, Du, Dally, and Ben-Cheikh teach The device of claim 21, (see rejection of claim 21).
Du further teaches wherein the reader further comprises: a fetch buffer from which data stored in the memory is taken, a fetch sequencer that takes data from the memory to the fetch buffer, and a fetch network that transmits the taken data to the convolution feeder. ([Du, col 2, lines 1-15] “To achieve the above objectives, the present invention discloses a buffer device, which is coupled to a memory and includes input lines, an input buffer unit and a remapping unit. The input lines are coupled to the memory and configured to be inputted with data from the memory in a current clock. The input buffer unit is coupled to the input lines and configured to buffer a part of the inputted data and output the part of the inputted data in a later clock. The remapping unit is coupled to the input lines and the input buffer unit, and configured to generate remap data for a convolution operation according to the data on the input lines and an output of the input buffer unit in the current clock.”, wherein the examiner interprets the “input buffer unit”, which receives data stored in the memory, to be the same as the fetch buffer, and interprets the “remapping unit”, which sequences and routes the buffered data toward the convolution operation, to be the same as the fetch sequencer and fetch network that deliver data to the convolution feeder.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to fetching data from memory and transmitting it to the convolution feeder.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the device of claim 21 disclosed by Chen, Du, Dally, and Ben-Cheikh to include the “buffer device, which is coupled to a memory and includes input lines, an input buffer unit and a remapping unit.” disclosed by Du. One would be motivated to do so to effectively remap the data and convolve the data, as suggested by Du (Du, [col 2, lines 1-15] “generate remap data for a convolution operation according to the data on the input lines and an output of the input buffer unit in the current clock”).
Regarding claim 38, Chen, Du, Dally, and Ben-Cheikh teach The method of claim 31, (see rejection of claim 31).
Du further teaches wherein the other data array is of an area shifted based on a preset standard from the data array in the data group of the shift buffer. ([Du, col 7, lines 14-17] “The moving distance of the sliding window S is a stride. The size of the stride is smaller than the size of the sliding window or the convolution size.”, wherein the examiner interprets “the sliding window” to be the same as “other data array” (i.e. the convolution filter) and “The moving distance of the sliding window S is a stride” to be the same as “area shifted on a preset standard” since both are conducting convolution of one area to another.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to shifting data arrays for convolution based on a preset standard.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 31 disclosed by Chen, Du, Dally, and Ben-Cheikh to include the “the sliding window S is a stride” disclosed by Du. One would be motivated to do so to efficiently determine and increase convolution precision, as suggested by Du ([Du, col 7, lines 14-17] “The size of the stride is smaller than the size of the sliding window or the convolution size.”)
Regarding claim 39, Chen, Du, Dally, and Ben-Cheikh teach The method of claim 31, (see rejection of claim 31).
Du further teaches wherein a number of data arrays transmitted from the shift buffer to the processor for the one of the data groups is K, and as the convolution operation on the filter is executed K times for the data array transmitted from the shift buffer by the operator, a number of times of using data of the one of the data groups is K² times. ([Du, col 2, lines 15-17] “In one embodiment, W data are inputted from the memory to the input lines in the current clock, and the remapping unit generates W sets of the remap data for W convolution operations.” And [Du, col 3, lines 7-10] “In one embodiment, the latest K data of the W data, which are still not inputted in the previous convolution operation, are kept in the buffer for the next convolution operation.”, wherein the examiner interprets the fixed number (K) of data elements buffered and then reused for the following convolution operation to be the same as “number of times of using data of the one of the data groups is K² times” when the convolution operation is executed K times.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to buffering and reusing a fixed number of data elements for multiple convolution operations.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 31 disclosed by Chen, Du, Dally, and Ben-Cheikh to include the teaching of Du that the “remapping unit generates W sets of the remap data for W convolution operations.” One would be motivated to do so to efficiently increase throughput, as suggested by Du ([Du, col 2, lines 15-17] “W data are inputted from the memory to the input lines in the current clock, and the remapping unit generates W sets of the remap data for W convolution operations.”).
Regarding claim 40, Chen, Du, Dally, and Ben-Cheikh teach The method of claim 31, (see rejection of claim 31).
Du further teaches further comprising transforming calculated result data into a preset form and storing the data in the memory. ([Du, col 2, lines 10-15] “The remapping unit is coupled to the input lines and the input buffer unit, and configured to generate remap data for a convolution operation according to the data on the input lines and an output of the input buffer unit in the current clock.”, wherein the examiner interprets “the remapping unit …generate a remap data for a convolution operation … output of the input buffer unit” to be the same as “a commit unit” that takes result data computed by the processor and transforms it into a preset format which is then stored into memory.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to transforming result data into a preset form and storing it in the memory.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 31 disclosed by Chen, Du, Dally, and Ben-Cheikh to include the process of how to “generate remap data for a convolution operation” disclosed by Du. One would be motivated to do so to effectively streamline data handling, as suggested by Du ([Du, col 2, lines 10-15] “generate remap data for a convolution operation according to the data on the input lines”).
Regarding claim 41, Chen, Du, Dally, and Ben-Cheikh teach The device of claim 21, (see rejection of claim 21).
Dally further teaches wherein the processor accumulates as much as set by an accumulation count register. ([Dally, page 31, col 18, lines 10-16] “At the start of the channel IACnt and WCnt are initialized to the number of I-wide or F-wide entries for the channel. IACnt and WCnt are decremented after each vector is consumed and checked for zero to determine the end of the channel. In one embodiment, to avoid losing a processing cycle reading IACnt and WCnt for a channel, the counts are kept in a pair of separate small RAMS one for weight counts and one for IA counts”, wherein the examiner interprets “IACnt and WCnt are initialized” and “IACnt and WCnt are decremented” and “the counts are kept in a pair of separate small RAMS” to be the same as “an accumulation count register” because they are both directed to storing a count value in hardware storage and using that stored count to control how many processing iterations occur before completion, which corresponds to accumulating output data as much as set by the stored count value.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to hardware architectures for performing convolution operations using parallel processing elements and accumulation of convolution results under hardware control.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the device of claim 21 as disclosed by Chen, Du, Dally, and Ben-Cheikh to include the accumulation count control disclosed by Dally. One would be motivated to do so to efficiently control the number of accumulation operations performed for convolution output generation and to manage completion of accumulation for a given channel or filter window, as suggested by Dally ([Dally, page 31, col 18, lines 10-16] “IACnt and WCnt are initialized to the number of I-wide or F-wide entries for the channel … and checked for zero to determine the end of the channel.”).
Regarding claim 42, Chen, Du, Dally, and Ben-Cheikh teach The device of claim 41, (see rejection of claim 41).
Dally further teaches wherein the processor accumulates K×K times or K×K×input channel times. ([Dally, page 26, col 7, lines 35-37, 52-54] “the filter output for each of the C channels are accumulated together element-wise into a single output activation plane…for r = 1 to R” and “for s = 1 to S…Each point in the seven-dimensional space formed from the temporal variables represents a single multiply-accumulate operation.”, wherein the examiner interprets “for r = 1 to R” and “for s = 1 to S” to be the same as “accumulates K×K times” because they are both directed to performing accumulation across all kernel positions of a 2D convolution window, and the claim’s K×K corresponds to iterating across the kernel height and kernel width, which Dally expresses as the R and S loop bounds. The examiner further interprets “the filter output for each of the C channels are accumulated together element-wise into a single output activation plane” to be the same as “accumulates K×K×input channel times” because they are both directed to accumulating contributions across both the kernel positions and the input channels, and Dally expressly describes accumulation across C channels together with the kernel loops such that accumulation is performed across the kernel positions for each input channel and across the input channels.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to convolution processing architectures that perform accumulation of partial results across kernel dimensions and input channels to generate output activations.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the device of claim 41 as disclosed by Chen, Du, Dally, and Ben-Cheikh, to include accumulation K×K times or K×K×input channel times as disclosed by Dally. One would be motivated to do so to correctly compute convolution output values by accumulating contributions from each kernel position and, when applicable, from each input channel, which is a fundamental requirement of convolution operations and ensures accurate generation of output activation planes. Such motivation is suggested by Dally, which describes iterating over kernel dimensions and accumulating filter outputs across channels to form a single output activation ([Dally, page 26, col 7, lines 35-37, 52-54] “the filter output for each of the C channels are accumulated together element-wise into a single output activation plane”).
Regarding claim 43, Chen, Du, Dally, and Ben-Cheikh teach The device of claim 41, (see rejection of claim 41).
Dally further teaches wherein the processor further comprises an accumulator indexer specifying an index to be fed from the processor to the operator; ([Dally, page 31, col 18, lines 10-16] “The output of the FIFO 362 consists of a product pᵢ and an address aᵢ. Product pᵢ from input i is connected to the ith input of the multiplexer 366 at the input to each accumulator unit 368. The low bits of address aᵢ are decoded by the decoder 364 to a one hot request vector rᵢⱼ. Across all inputs, if rᵢⱼ is true, input i is making a request for the jth accumulator unit 368.”, wherein the examiner interprets “address aᵢ” and “the low bits of address aᵢ” that are “decoded by the decoder 364” to generate a “request for the jth accumulator unit 368” to be the same as “an accumulator indexer specifying an index to be fed from the processor to the operator” because they are both directed to generating and providing an index value that selects a specific accumulator destination for a multiplication result within the convolution processing pipeline.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to convolution processing architectures that accumulate convolution results using multiple accumulators and include control logic for selecting a target accumulator during accumulation.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the device of claim 41 as disclosed by Chen, Du, Dally, and Ben-Cheikh to include an accumulator indexer specifying an index to be fed from the processor to the operator, as disclosed by Dally. One would be motivated to do so to efficiently route each generated multiplication result to a selected accumulator when multiple accumulators are used for accumulation, thereby ensuring correct accumulation of partial sums and orderly accumulation across parallel accumulation structures. Such motivation is suggested by Dally ([Dally, page 31, col 18, lines 10-16] “The low bits of address aᵢ are decoded by the decoder 364 to a one hot request vector rᵢⱼ … making a request for the jth accumulator unit 368.”).
Regarding claim 44, Chen, Du, Dally, and Ben-Cheikh teach The device of claim 43, (see rejection of claim 43).
Chen further teaches wherein the accumulator indexer stores the output data in the accumulator corresponding to a height value of the filter among the K or more accumulators. ([Chen, page 6, sec 3] “each filter row and ifmap row are horizontally and diagonally reused, respectively, and each row of psums is vertically accumulated. The height and width of a logical PE set are determined by the filter height (R) and ofmap height (E), respectively.”, wherein the examiner interprets “each row of psums is vertically accumulated” to be the same as “stores the output data in the accumulator corresponding to a height value of the filter” because they are both directed to accumulating and storing partial sums in a manner organized by filter row height, and the examiner interprets “the height of a logical PE set are determined by the filter height (R)” to be the same as “height value of the filter among the K or more accumulators” because they are both directed to using a filter height parameter to determine a corresponding accumulation grouping for partial sums.)
Regarding claim 45, Chen, Du, Dally, and Ben-Cheikh teach The method of claim 31, (see rejection of claim 31).
Dally further teaches further comprising accumulating as much as set by an accumulation count register. ([Dally, page 30, col 15, lines 42-49] “At the start of the channel IACnt and WCnt are initialized to the number of I-wide or F-wide entries for the channel. IACnt and WCnt are decremented after each vector is consumed and checked for zero to determine the end of the channel. In one embodiment, to avoid losing a processing cycle reading IACnt and WCnt for a channel, the counts are kept in a pair of separate small RAMS one for weight counts and one for IA counts”, wherein the examiner interprets “IACnt and WCnt are initialized” and “IACnt and WCnt are decremented” and “the counts are kept in a pair of separate small RAMS” to be the same as “an accumulation count register” because they are both directed to storing a count value in hardware storage and using that stored count to control how many processing iterations occur before completion, which corresponds to accumulating output data as much as set by the stored count value.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to convolution processing architectures that perform accumulation of convolution results under hardware control using stored count values that govern iteration and completion of processing.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 31 as disclosed by Chen, Du, Dally, and Ben-Cheikh to further comprise accumulating as much as set by an accumulation count register, as disclosed by Dally. One would be motivated to do so to efficiently control the amount of accumulation performed for a channel or processing segment and to determine completion of accumulation using a stored count value, as suggested by Dally ([Dally, page 30, col 15, lines 42-49] “IACnt and WCnt are decremented after each vector is consumed and checked for zero to determine the end of the channel.”).
Regarding claim 46, Chen, Du, Dally, and Ben-Cheikh teach The method of claim 45, (see rejection of claim 45).
Dally further teaches further comprising accumulating K×K times or K×K×input channel times; ([Dally, page 26, col 7, lines 52-53 and Table 1] “for r = 1 to R” and “for s = 1 to S” and “Each point in the seven-dimensional space formed from the temporal variables represents a single multiply-accumulate operation.”, wherein the examiner interprets “for r = 1 to R” and “for s = 1 to S” to be the same as “accumulating K×K times” because they are both directed to iterating over the kernel height and kernel width positions of a convolution filter and performing a multiply-accumulate at each kernel position. The examiner further interprets “for c = 1 to C” together with the multiply-accumulate operation described by “Each point in the seven-dimensional space formed from the temporal variables represents a single multiply-accumulate operation” to be the same as “accumulating K×K×input channel times” because they are both directed to accumulating contributions across kernel positions and across input channels during convolution processing.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to convolution processing methods that accumulate partial sums across kernel dimensions and input channels to generate output activations.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 45 as disclosed by Chen, Du, Dally, and Ben-Cheikh to further comprise accumulating K×K times or K×K×input channel times, as disclosed by Dally. One would be motivated to do so to correctly compute convolution outputs by accumulating contributions across each kernel position and across each input channel, as suggested by Dally ([Dally, page 27] “Each point in the seven-dimensional space formed from the temporal variables represents a single multiply-accumulate operation.”).
Regarding claim 47, Chen, Du, Dally, and Ben-Cheikh teach The method of claim 31, (see rejection of claim 31).
Dally further teaches wherein the accumulating further comprises specifying an index to be fed from the processor to the operator. ([Dally, page 31, col 18, lines 10-16] “The output of the FIFO 362 consists of a product p[i] and an address a[i]. Product p[i] from input i is connected to the ith input of the multiplexer 366 at the input to each accumulator unit 368. The low bits of address a[i] are decoded by the decoder 364 to a one-hot request vector r[i][j]. Across all inputs, if r[i][j] is true, it implies that input i is making a request for the jth accumulator unit 368.”, wherein the examiner interprets “address a[i]” and “the low bits of address a[i]” that are decoded into a “request for the jth accumulator unit 368” to be the same as “specifying an index to be fed from the processor to the operator” because they are both directed to generating and providing an index value that selects a specific accumulator destination for a multiplication result within the accumulation pipeline.)
Chen, Du, Dally, Ben-Cheikh, and the instant application are analogous art because they are all directed to convolution processing methods that accumulate results using multiple accumulators and include selection logic that specifies which accumulator receives each partial sum.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 31 as disclosed by Chen, Du, Dally, and Ben-Cheikh such that the accumulating further comprises specifying an index to be fed from the processor to the operator, as disclosed by Dally. One would be motivated to do so to efficiently route each generated product to a selected accumulator during accumulation when multiple accumulators are used, thereby ensuring correct partial sum accumulation and orderly accumulation across parallel accumulation structures, as suggested by Dally ([Dally, page 31, col 18, lines 10-16] “The low bits of address a[i] are decoded by the decoder 364 to a one-hot request vector r[i][j] … making a request for the jth accumulator unit 368.”).
Regarding claim 48, Chen, Du, Dally, and Ben-Cheikh teach The method of claim 47, (see rejection of claim 47).
Chen further teaches wherein the accumulating comprises storing the output data in the accumulator corresponding to a height value of the filter among the K or more accumulators. ([Chen, page 6, sec 3] “Fig. 6 shows a logical PE set, where each filter row and ifmap row are horizontally and diagonally reused, respectively, and each row of psums is vertically accumulated. The height and width of a logical PE set are determined by the filter height (R) and ofmap height (E), respectively.”, wherein the examiner interprets “each row of psums is vertically accumulated” to be the same as “storing the output data in the accumulator corresponding to a height value of the filter” because they are both directed to accumulating and storing partial sums in a manner organized by filter row height. The examiner further interprets “The height … of a logical PE set are determined by the filter height (R)” to be the same as “a height value of the filter among the K or more accumulators” because they are both directed to using a filter height parameter to determine an accumulation grouping corresponding to filter height.)
Claims 24-25 and 34-36 are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Du, in view of Dally, in view of Ben-Cheikh, and further in view of US 11487989 B2, “Data reuse method based on convolutional neural network accelerator” by Sun et al. (referred to herein as Sun).
Regarding claim 24, Chen, Du, Dally, and Ben-Cheikh teach The device of claim 21, (see rejection of claim 21).
Chen and Du do not teach wherein the processor executes the convolution operation on the data array transmitted from the shift buffer and on the filter by using the operator to reuse at least one piece of data constituting the one of the data groups.
Sun teaches wherein the processor executes the convolution operation on the data array transmitted from the shift buffer and on the filter by using the operator to reuse at least one piece of data constituting the one of the data groups. ([Sun, col 2, lines 1-5] “The memory module sequentially returns the tile block data to the input activation weight buffer unit, and the input activation weight buffer unit saves the received tile block data to implement data reuse and sends the received tile block data to the calculation processing unit. PE.”, wherein the examiner interprets “sequentially returns the tile block data to the input activation weight buffer unit” to be the same as “transmitted from the shift buffer” and “the tile block data to implement data reuse” to be the same as “reuse at least one piece of data” because both are reusing some piece of the data.)
Chen, Du, Dally, Ben-Cheikh, Sun, and the instant application are analogous art because they are all directed to executing a convolution operation with data reuse from a shift buffer.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the device of claim 21 as disclosed by Chen, Du, Dally, and Ben-Cheikh to include the process by which “The memory module sequentially returns the tile block data to the input activation weight buffer unit, and the input activation weight buffer unit saves the received tile block data to implement data reuse” disclosed by Sun. One would be motivated to do so to efficiently process and transmit data, as suggested by Sun ([Sun, col 2, lines 1-5] “sends the received tile block data to the calculation processing unit. PE”).
Regarding claim 25, Chen, Du, Dally, and Ben-Cheikh teach The device of claim 21, (see rejection of claim 21).
Du further teaches sequentially reads, from the memory, data groups that have more pieces of data than the unit data throughput and are different from the data groups stored in the input data queue, ([Du, col 2, lines 24-30] “In one embodiment, each set of the remap data includes M remap data, and the convolution operation is an M×M convolution operation. In one embodiment, the remapping unit retrieves M data from the output of the input buffer unit and the input lines as a set of the remap data, and the data of the output of the input buffer unit and the input lines in sequence are retrieved by M times every J strides.”, wherein the examiner interprets “MxM convolution operation … sequence are retrieved by M times every J strides” and “input buffer” to be the same as “sequentially reads, from the memory, data groups” and “data groups stored” since both have to do with sequential data processing and the storing of data in a buffer.)
Chen further teaches stores the data groups in the input data queue when a control completion notification is issued for the data groups stored in the input data queue, and controls the different data groups. ([Chen, page 2, sec 2] “The PE includes an ALU datapath, which is capable of doing multiply-and-accumulate (MAC) and addition, a register file (RF) as a local scratchpad, and a PE FIFO (pFIFO) used to control the traffic going in and out of the ALU.”, wherein the examiner interprets data being “stored in pFIFO” and “control the traffic going in and out of the ALU” to be the same as the data groups being stored and controls being in place for the different data groups because Chen shows that data are buffered (i.e., stored in a queue) and the data flow is controlled for processing by the ALU.)
Chen and Du do not teach wherein the convolution sequencer: sequentially transmits data groups stored in the input data queue to the shift buffer, transmits the data array of each of the data groups stored in the shift buffer to the processor to reuse at least any one piece of the data constituting the data groups stored in the input data queue in the convolution operation.
Sun teaches wherein the convolution sequencer: sequentially transmits data groups stored in the input data queue to the shift buffer, transmits the data array of each of the data groups stored in the shift buffer to the processor to reuse at least any one piece of the data constituting the data groups stored in the input data queue in the convolution operation, ([Sun, col 2, lines 1-5] “The memory module sequentially returns the tile block data to the input activation weight buffer unit, and the input activation weight buffer unit saves the received tile block data to implement data reuse and sends the received tile block data to the calculation processing unit. PE.”, wherein the examiner interprets “memory module sequentially returns the tile block data” and “buffer unit” to be the same as “sequencer: sequentially transmits data groups stored in the input data queue” and “shift buffer” since both are directed to moving or shifting data in sequence and using a data buffer.)
Chen, Du, Dally, Ben-Cheikh, Sun, and the instant application are analogous art because they are all directed to storing data groups in the input data queue and controlling those groups.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the device of claim 21 disclosed by Chen, Du, Dally, and Ben-Cheikh to include the “memory module sequentially returns the tile block data” disclosed by Sun. One would be motivated to do so to effectively increase throughput and facilitate the passing of data, as suggested by Sun ([Sun, col 2, lines 1-5] “implement data reuse and sends the received tile block data to the calculation processing unit. PE”).
Regarding claim 34, Chen, Du, Dally, and Ben-Cheikh teach The method of claim 31, (see rejection of claim 31).
Chen, Du, Dally, and Ben-Cheikh do not teach further comprising: executing the convolution operation on the data array transmitted from the shift buffer and on the filter by using the operator to reuse at least one piece of data constituting the one of the data groups.
Sun teaches further comprising: executing the convolution operation on the data array transmitted from the shift buffer and on the filter by using the operator to reuse at least one piece of data constituting the one of the data groups. ([Sun, col 2, lines 1-5] “The memory module sequentially returns the tile block data to the input activation weight buffer unit, and the input activation weight buffer unit saves the received tile block data to implement data reuse and sends the received tile block data to the calculation processing unit. PE.”, wherein the examiner interprets “sequentially returns the tile block data to the input activation weight buffer unit” to be the same as “transmitted from the shift buffer” and “the tile block data to implement data reuse” to be the same as “reuse at least one piece of data” because both are reusing some piece of the data.)
Chen, Du, Dally, Ben-Cheikh, Sun, and the instant application are analogous art because they are all directed to executing a convolution operation with data reuse from a shift buffer.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 31 disclosed by Chen, Du, Dally, and Ben-Cheikh to include the process by which “The memory module sequentially returns the tile block data to the input activation weight buffer unit, and the input activation weight buffer unit saves the received tile block data to implement data reuse” disclosed by Sun. One would be motivated to do so to efficiently process and transmit data, as suggested by Sun ([Sun, col 2, lines 1-5] “sends the received tile block data to the calculation processing unit. PE”).
Regarding claim 35, Chen, Du, Dally, and Ben-Cheikh teach The method of claim 31, (see rejection of claim 31).
Chen, Du, Dally, and Ben-Cheikh do not teach further comprising: sequentially transmitting data groups stored in the input data queue to the shift buffer; transmitting the data array of each of the data groups stored in the shift buffer to a processor; and reusing at least any one piece of the data constituting the data groups stored in the input data queue in the convolution operation.
Sun teaches further comprising: sequentially transmitting data groups stored in the input data queue to the shift buffer; transmitting the data array of each of the data groups stored in the shift buffer to a processor; and reusing at least any one piece of the data constituting the data groups stored in the input data queue in the convolution operation. ([Sun, col 2, lines 1-5] “The memory module sequentially returns the tile block data to the input activation weight buffer unit, and the input activation weight buffer unit saves the received tile block data to implement data reuse and sends the received tile block data to the calculation processing unit. PE.”, wherein the examiner interprets “sequentially returns the tile block data to the input activation weight buffer unit” to be the same as “transmitting data groups stored in the input data queue to the shift buffer” and “the tile block data to implement data reuse” to be the same as “reusing at least any one piece of the data constituting the data groups stored in the input data queue” because both are reusing some piece of the data for the convolution operation.)
Chen, Du, Dally, Ben-Cheikh, Sun, and the instant application are analogous art because they are all directed to sequential transmission and reuse of input data for convolution operations.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 31 disclosed by Chen, Du, Dally, and Ben-Cheikh to include the “implement data reuse” disclosed by Sun. One would be motivated to do so to effectively increase throughput as data is moved forward for processing, as suggested by Sun ([Sun, col 2, lines 1-5] “implement data reuse and sends the received tile block data to the calculation processing unit.”)
Regarding claim 36, Chen, Du, Dally, Ben-Cheikh, and Sun teach The method of claim 35, (see rejection of claim 35).
Chen further teaches further comprising: when a control completion notification is issued for the data groups stored in the input data queue, sequentially reading data groups that have more pieces of data than the unit data throughput and are different from the data groups stored in the input data queue, from the memory, and storing the data groups in the input data queue; and controlling the different data groups. ([Chen, page 2, sec 2] “The PE includes an ALU datapath, which is capable of doing multiply-and-accumulate (MAC) and addition, a register file (RF) as a local scratchpad, and a PE FIFO (pFIFO) used to control the traffic going in and out of the ALU.”, wherein the examiner interprets “stored in pFIFO” and “control the traffic going in and out of the ALU” to be the same as data groups stored and the controls are in place for the different data because Chen shows that data are buffered (i.e. stored in a queue) and the data flow is controlled for processing by the ALU.)
Chen, Du, Dally, Ben-Cheikh, Sun, and the instant application are analogous art because they are all directed to storing data groups in the input data queue and controlling those groups.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 35 disclosed by Chen, Du, Dally, Ben-Cheikh, and Sun to include the “memory module sequentially returns the tile block data” disclosed by Sun. One would be motivated to do so to effectively increase throughput and facilitate the passing of data, as suggested by Sun ([Sun, col 2, lines 1-5] “implement data reuse and sends the received tile block data to the calculation processing unit. PE”).
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DEVAN KAPOOR whose telephone number is (703)756-1434. The examiner can normally be reached Monday - Friday: 9:00AM - 5:00 PM EST (times may vary).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DEVAN KAPOOR/Examiner, Art Unit 2126
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126