Prosecution Insights
Last updated: April 19, 2026
Application No. 17/538,138

Indexing Operations In Neural Network Processor

Status: Final Rejection (§103)
Filed: Nov 30, 2021
Examiner: GODO, MORIAM MOSUNMOLA
Art Unit: 2148
Tech Center: 2100 — Computer Architecture & Software
Assignee: Apple Inc.
OA Round: 2 (Final)
Grant Probability: 44% (Moderate)
Expected OA Rounds: 3-4
Estimated Time to Grant: 4y 8m
Grant Probability With Interview: 78%

Examiner Intelligence

Career Allow Rate: 44% (30 granted / 68 resolved; -10.9% vs TC avg)
Interview Lift: +33.4% higher allowance rate in resolved cases with an interview than without
Typical Timeline: 4y 8m average prosecution; 47 applications currently pending
Career History: 115 total applications across all art units
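
The headline figures above can be reproduced from the raw counts. Below is a minimal Python sketch, assuming the allow rate is simply granted divided by resolved cases and that the "+33.4%" interview lift is an additive percentage-point adjustment; both are assumptions about how the dashboard derives its numbers, not documented formulas.

# Sketch of the derived examiner metrics shown above (assumed formulas).
granted = 30               # granted cases from the card above
resolved = 68              # resolved cases
interview_lift_pp = 33.4   # reported interview lift, in percentage points

allow_rate = 100.0 * granted / resolved          # about 44.1% career allow rate
with_interview = allow_rate + interview_lift_pp  # about 77.5%, displayed as ~78%

print(f"Career allow rate: {allow_rate:.1f}%")
print(f"Estimated allow rate with interview: {with_interview:.1f}%")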

Statute-Specific Performance

§101: 16.1% (-23.9% vs TC avg)
§103: 56.7% (+16.7% vs TC avg)
§102: 12.7% (-27.3% vs TC avg)
§112: 12.9% (-27.1% vs TC avg)
Tech Center averages are estimates. Based on career data from 68 resolved cases.
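
The "vs TC avg" deltas imply a common Tech Center baseline. A small Python sketch, assuming each delta is simply the examiner's rate minus the Tech Center average in percentage points (an assumption about the dashboard's arithmetic):

# Recover the implied Tech Center average from each statute-specific rate
# and its "vs TC avg" delta. Assumption: delta = examiner_rate - tc_average.
statute_rates = {
    "§101": (16.1, -23.9),
    "§103": (56.7, +16.7),
    "§102": (12.7, -27.3),
    "§112": (12.9, -27.1),
}
for statute, (rate, delta) in statute_rates.items():
    tc_avg = rate - delta  # comes out to 40.0% for every statute shown here
    print(f"{statute}: examiner {rate:.1f}%, implied TC average {tc_avg:.1f}%")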

Office Action

Final Rejection under §103 (mailed Jan 21, 2026)
DETAILED ACTION

1. This office action is in response to the amendment filed on 11/07/2025 in Application No. 17/538,138. Claims 1-20 are presented for examination and are currently pending.

Response to Arguments

2. Applicant's arguments have been considered but are moot in light of the newly added reference Cohen in view of Fishel in view of Barnard. On page 9 of the remarks, the Applicant argued that “In addition, independent claims 12 and 19 recite similar features as claim 1 and are allowable for at least the same reasons as for claim 1. Further, dependent claims 2-11, 13-18, and 20, are allowable for being dependent from an allowable base claim in addition to their own allowable features. Applicant respectfully requests that the rejections under 35 U.S.C. §§ 102 and 103 of claims 1-20 be reconsidered and withdrawn, and that these claims be passed to allowance”. Claims 1, 12 and 19 are not allowable because the newly added reference Cohen in view of Fishel in view of Barnard has now been applied. Furthermore, dependent claims 2-11, 13-18 and 20, which depend directly or indirectly from claims 1, 12 and 19, are not allowable for the same reasons argued above regarding the independent claims.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

3. Claims 1-3, 12, 13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Cohen et al. (US11714653, filed 02/15/2021) in view of Fishel et al. (US20190340488) and further in view of Barnard et al. (US20180101763).

Regarding claim 1, Cohen teaches a neural processor circuit (FIG. 1 is block diagram that schematically illustrates elements of a computer system with fine-grained pipelining, in accordance with an embodiment of the invention, (col. 3, ln 41-44)), comprising: a plurality of neural engine circuits, at least one of the neural engine circuits configured to perform a convolution operation (In some embodiments, at least some of the processors, for example producer processors 22, comprise multiplier/accumulator processing elements 34, which are configured to carry out matrix and tensor operations, such as large-scale tensor convolutions (col. 5, ln 15-19). The Examiner notes producer processors 22 in Fig. 1 are neural engine circuits which perform convolution operation) on input data (data input, Fig. 1) to generate output data (a first processing stage in which multiple producer processors 22 compute and output data, col. 2, ln 5-6); and a data processor circuit (buffer 24 is created in memory 26, which is shared by the producing and consuming processing stages (col. 7, ln 54-55), Fig. 1. The Examiner notes memory 26 involves processing operations) directly coupled to the at least one neural engine circuit (producer processors 22, Fig. 1), the data processor circuit (buffer 24 is created in memory 26, which is shared by the producing and consuming processing stages (col. 7, ln 54-55), Fig. 1.
The Examiner notes memory 26 involves processing operations) comprising: a buffer memory configured to store an index tensor (In applications that involve matrix and tensor operations, the buffers are multi-dimensional (col. 4, ln 48-49); As noted earlier, scheduler 30 can initiate a given consumer work unit as soon as the range of input data that is mapped to the index of the work unit is ready in buffer 24 (col. 6, ln 18-20). The Examiner notes input data mapped to the index in the buffer 24 is index tensor) and the output data as a source tensor (a first processing stage in which multiple producer processors 22 compute and output data to respective locations in a buffer 24 in a memory 26 (col. 2, ln 5-6). The Examiner notes the output data in buffer 24 is the source tensor), and an indexing circuit (consumers processors 28, Fig. 1) coupled to the buffer memory (In similar fashion, consumer processors 28 receive command inputs in respective command buffers 40 (col. 5, ln 34-36)), the indexing circuit (consumer processors 28, Fig. 1) configured to fetch a portion of the source tensor from the buffer memory (output data from buffer 24, Fig. 1) by referencing the index tensor representing indexing information into the portion of the source tensor (These command inputs drive respective processing elements 42 to read a certain range of input data from buffer 24, apply the corresponding work unit to the input data, output the resulting output data (col. 5, ln 36-40). The Examiner notes processing elements are within consumer processors 28, input data is index tensor, and output data is the source sensor), Cohen does not explicitly teach wherein the indexing circuit comprises a rasterizer and an index tensor fetching circuit coupled to the rasterizer, wherein the rasterizer is configured to generate one or more index values for referencing the index tensor stored in the buffer memory, and wherein the index tensor fetching circuit is configured to generate one or more address values for referencing the index tensor in the buffer memory based on the one or more index values. Fishel teaches wherein the indexing circuit comprises a rasterizer (A rasterizer is a circuit in various components of neural processor circuit 218 that keeps track of the segment of the input/output data (e.g., group, work unit, input channel, output channel) and instructs the components of neural processor circuit for proper handling of the segment of the input data [0071]. The Examiner notes neural processor circuit 218 is the indexing circuit) and an index tensor fetching circuit (Neural task manager 310 manages the overall operation of neural processor circuit 218 … In one or more embodiments, the neural task manager 310 sends rasterizer information to the components of the neural processor circuit 218 to enable each of the components to track, retrieve or process appropriate portions of the input data [0047].The Examiner notes neural task manager 310 is the index fetching circuit and the input data is the index tensor) coupled to the rasterizer (neural task manager 310 programs rasterizers 714, 718, 720, 722 [0074], Fig. 
7), It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Cohen to incorporate the teachings of Fishel for the benefit of neural processor circuit 218 is a configurable circuit that performs these operations in a fast and power-efficient manner (Fishel [0037]) Modified Cohen does not explicitly teach wherein the rasterizer is configured to generate one or more index values for referencing the index tensor stored in the buffer memory , and wherein the index tensor fetching circuit is configured to generate one or more address values for referencing the index tensor in the buffer memory based on the one or more index values. Barnard teaches wherein the rasterizer is configured to generate one or more index values for referencing the index tensor (wherein the input data values are is received in a rasterised order in which the coordinates of the input data values are sequentially incremented first by plane index p (see claim 14)) stored in the buffer memory (and storing the received input data values at the determined addresses in the buffer, abstract), and wherein the index tensor fetching circuit is configured to generate one or more address values (Specifically, the addressing scheme is used to generate MEMADDR and BANKSEL values. Each input data value is then stored in the input data buffer at a location based on the values calculated for MEMADDR and BANKSEL [0064]; The memory locations are identifiable by a Bank number, from 0 to 7 and an address position (MEMADDR) in respective banks. For example, the top left memory location is identified by MEMADDR=0 and BANKSEL=0 and the bottom right value is identified by MEMADDR=7 and BANKSEL=7; Subsequently, values for MEMADDR and BANKSEL are determined for each input data value [0068]) for referencing the index tensor in the buffer memory based on the one or more index values (wherein an address of an input data value in the buffer is determined based upon a plane index p for that input data value (see claim 6)). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Barnard for the benefit of data throughput from the input data buffer to the convolution engines is both fast and consistent (Barnard [0105]). Regarding claim 2, Modified Cohen teaches the neural processor circuit of claim 1, Cohen teaches the wherein the indexing circuit (consumer processors 28, Fig. 1) is further configured to fetch elements of the source tensor along a dimension of the source tensor (… one or more consumer processors 28 read the data from buffer 24 and apply a computational task to the data read from the buffer (col. 5, ln 2-4)) using a corresponding value in the index tensor (In the present example, it is assumed that consumer processors 28 execute a program stage called “sum,” which sums the values in each row of buffer 24 to produce a single scalar value per input row (col. 6, ln 32-35)). Regarding claim 3, Modified Cohen teaches the neural processor circuit of claim 2, Cohen teaches wherein the data processor circuit (consumer processors 28 comprising processing elements 42, Fig. 1) is further configured to broadcast the fetched elements of the source tensor (consumer processors 28 … output the resulting output data, and report completion back to scheduler 30 (col. 
5, ln 35-40)) to the plurality of neural engine circuits (A scheduler 30 reads program instructions from a program memory 32 and distributes corresponding command inputs to producer processors 22 (col. 5, ln 24-26), Fig. 1). Regarding claim 12, Cohen teaches a method of operating a neural processor circuit (FIG. 1 is block diagram that schematically illustrates elements of a computer system 20 with fine-grained pipelining, in accordance with an embodiment of the invention, col., 3, ln 41-44), comprising: operating at least one of a plurality of neural engine circuits in the neural processor circuit to perform a convolution operation (In some embodiments, at least some of the processors, for example producer processors 22, comprise multiplier/accumulator processing elements 34, which are configured to carry out matrix and tensor operations, such as large-scale tensor convolutions (col. 5, ln 15-19). The Examiner notes producer processors 22 in Fig. 1 are neural engine circuits which perform convolution operation) on input data (data input, Fig. 1) to generate output data (a first processing stage in which multiple producer processors 22 compute and output data, col. 2, ln 5-6); storing an index tensor (In applications that involve matrix and tensor operations, the buffers are multi-dimensional (col. 4, ln 48-49); As noted earlier, scheduler 30 can initiate a given consumer work unit as soon as the range of input data that is mapped to the index of the work unit is ready in buffer 24 (col. 6, ln 18-20). The Examiner notes input data mapped to the index in the buffer 24 is index tensor) and the output data as a source tensor in a buffer memory (a first processing stage in which multiple producer processors 22 compute and output data to respective locations in a buffer 24 in a memory 26 (col. 2, ln 5-6). The Examiner notes the output data in buffer 24 is the source tensor) of a data processor circuit (buffer 24 is created in memory 26, which is shared by the producing and consuming processing stages (col. 7, 54-55), Fig. 1. The Examiner notes memory 26 involves processing operations) directly coupled to the at least one neural engine circuit (producer processors 22, Fig. 1); and fetching (output data from buffer 24, Fig. 1), by the indexing circuit of the data processor circuit coupled to the buffer memory, a portion of the source tensor from the buffer memory (consumer processors 28, Fig. 1) by referencing the index tensor representing indexing information into the portion of the source tensor (These command inputs drive respective processing elements 42 to read a certain range of input data from buffer 24, apply the corresponding work unit to the input data, output the resulting output data (col. 5, ln 36-40). The Examiner notes processing elements are within consumer processors 28, input data is index tensor, and output data is the source sensor). 
Cohen does not explicitly teach generating, by a rasterizer of an indexing circuit, one or more index values for referencing the index tensor stored in the buffer memory; generating, by the index tensor fetching circuit, one or more address values for referencing the index tensor in the buffer memory based on the one or more index values; Fishel teaches a rasterizer of an indexing circuit (A rasterizer is a circuit in various components of neural processor circuit 218 that keeps track of the segment of the input/output data (e.g., group, work unit, input channel, output channel) and instructs the components of neural processor circuit for proper handling of the segment of the input data [0071]. The Examiner notes neural processor circuit 218 is the indexing circuit), the index tensor fetching circuit (Neural task manager 310 manages the overall operation of neural processor circuit 218 … In one or more embodiments, the neural task manager 310 sends rasterizer information to the components of the neural processor circuit 218 to enable each of the components to track, retrieve or process appropriate portions of the input data [0047].The Examiner notes neural task manager 310 is the index fetching circuit and the input data is the index tensor) It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Cohen to incorporate the teachings of Fishel for the benefit of neural processor circuit 218 is a configurable circuit that performs these operations in a fast and power-efficient manner (Fishel [0037]) Modified Cohen does not explicitly teach generating, by a rasterizer, one or more index values for referencing the index tensor stored in the buffer memory; generating, one or more address values for referencing the index tensor in the buffer memory based on the one or more index values; Barnard teaches generating, by a rasterizer, one or more index values for referencing the index tensor (wherein the input data values are is received in a rasterised order in which the coordinates of the input data values are sequentially incremented first by plane index p (see claim 14)) stored in the buffer memory (and storing the received input data values at the determined addresses in the buffer, abstract); generating, one or more address values (Specifically, the addressing scheme is used to generate MEMADDR and BANKSEL values. Each input data value is then stored in the input data buffer at a location based on the values calculated for MEMADDR and BANKSEL [0064]; The memory locations are identifiable by a Bank number, from 0 to 7 and an address position (MEMADDR) in respective banks. 
For example, the top left memory location is identified by MEMADDR=0 and BANKSEL=0 and the bottom right value is identified by MEMADDR=7 and BANKSEL=7; Subsequently, values for MEMADDR and BANKSEL are determined for each input data value [0068]) for referencing the index tensor in the buffer memory based on the one or more index values (wherein an address of an input data value in the buffer is determined based upon a plane index p for that input data value (see claim 6)); It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Barnard for the benefit of data throughput from the input data buffer to the convolution engines is both fast and consistent (Barnard [0105]) Regarding claim 13, Modified Cohen teaches the method of claim 12, Cohen teaches further comprising: fetching elements of the source tensor along a dimension of the source tensor (… one or more consumer processors 28 read the data from buffer 24 and apply a computational task to the data read from the buffer (col. 5, ln 2-4)) using a corresponding value in the index tensor (In the present example, it is assumed that consumer processors 28 execute a program stage called “sum,” which sums the values in each row of buffer 24 to produce a single scalar value per input row (col. 6, ln 32-35)); and broadcasting the fetched elements of the source tensor (consumer processors 28 … output the resulting output data, and report completion back to scheduler 30 (col. 5, ln 35-40)) to the plurality of neural engine circuits (A scheduler 30 reads program instructions from a program memory 32 and distributes corresponding command inputs to producer processors 22 (col. 5, ln 24-26), Fig. 1). Regarding claim 19, Cohen teaches an electronic device, comprising: a system memory storing input data (There is also provided, in accordance with an embodiment of the invention, computing apparatus, including a memory, ); and a neural processor circuit coupled to the system memory (Producer processors 22 coupled to program memory 32, Fig. 1), the neural processor circuit including: a data processor circuit (buffer 24 is created in memory 26, which is shared by the producing and consuming processing stages (col. 7, 54-55), Fig. 1. The Examiner notes memory 26 involves processing operations) configured to receive the input data from the system memory (buffer 24 receive input data from program memory 32, Fig. 1), a plurality of neural engine circuits (In some embodiments, at least some of the processors, for example producer processors 22, comprise multiplier/accumulator processing elements 34 (col. 5, ln 15-19). The Examiner notes producer processors 22 in Fig. 1 are neural engine circuits) coupled to the data processor circuit (buffer 24 is created in memory 26, which is shared by the producing and consuming processing stages (col. 7, 54-55), Fig. 1) at least one of the neural engine circuits directly coupled to the data processor circuit and configured to perform a convolution operation (In some embodiments, at least some of the processors, for example producer processors 22, comprise multiplier/accumulator processing elements 34, which are configured to carry out matrix and tensor operations, such as large-scale tensor convolutions (col. 5, ln 15-19). The Examiner notes producer processors 22 in Fig. 
1 are neural engine circuits which perform convolution operation) on at least a portion of the input data (data input, Fig. 1) from the data processor circuit to generate output data, the data processor circuit comprising (buffer 24 is created in memory 26, which is shared by the producing and consuming processing stages (col. 7, 54-55), Fig. 1. The Examiner notes memory 26 involves processing operations): a buffer memory configured to store an index tensor (buffer 24 is created in memory 26, which is shared by the producing and consuming processing stages (col. 7, 54-55), Fig. 1. The Examiner notes memory 26 involves processing operations) and the output data as a source tensor (buffer 24 is created in memory 26, which is shared by the producing and consuming processing stages (col. 7, 54-55), Fig. 1. The Examiner notes memory 26 involves processing operations), and an indexing circuit (consumers processors 28, Fig. 1) coupled to the buffer memory (In similar fashion, consumer processors 28 receive command inputs in respective command buffers 40 (col. 5, ln 34-36)), the indexing circuit (consumers processors 28, Fig. 1) configured to fetch a portion of the source tensor from the buffer memory (output data from buffer 24, Fig. 1) by referencing the index tensor representing indexing information into the portion of the source tensor (These command inputs drive respective processing elements 42 to read a certain range of input data from buffer 24, apply the corresponding work unit to the input data, output the resulting output data (col. 5, ln 36-40). The Examiner notes processing elements are within consumer processors 28, input data is index tensor, and output data is the source sensor), Cohen does not explicitly teach the indexing circuit comprises a rasterizer and an index tensor fetching circuit coupled to the rasterizer, wherein the rasterizer is configured to generate one or more index values for referencing the index tensor stored in the buffer memory, and wherein the index tensor fetching circuit is configured to generate one or more address values for referencing the index tensor in the buffer memory based on the one or more index values. Fishel teaches the indexing circuit comprises a rasterizer (A rasterizer is a circuit in various components of neural processor circuit 218 that keeps track of the segment of the input/output data (e.g., group, work unit, input channel, output channel) and instructs the components of neural processor circuit for proper handling of the segment of the input data [0071]. The Examiner notes neural processor circuit 218 is the indexing circuit) and an index tensor fetching circuit (Neural task manager 310 manages the overall operation of neural processor circuit 218 … In one or more embodiments, the neural task manager 310 sends rasterizer information to the components of the neural processor circuit 218 to enable each of the components to track, retrieve or process appropriate portions of the input data [0047].The Examiner notes neural task manager 310 is the index fetching circuit and the input data is the index tensor) coupled to the rasterizer (neural task manager 310 programs rasterizers 714, 718, 720, 722 [0074], Fig. 
7), It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Cohen to incorporate the teachings of Fishel for the benefit of neural processor circuit 218 is a configurable circuit that performs these operations in a fast and power-efficient manner (Fishel [0037]) Modified Cohen does not explicitly teach wherein the rasterizer is configured to generate one or more index values for referencing the index tensor stored in the buffer memory, and wherein the index tensor fetching circuit is configured to generate one or more address values for referencing the index tensor in the buffer memory based on the one or more index values. Barnard teaches wherein the rasterizer is configured to generate one or more index values for referencing the index tensor (wherein the input data values are is received in a rasterised order in which the coordinates of the input data values are sequentially incremented first by plane index p (see claim 14)) stored in the buffer memory (and storing the received input data values at the determined addresses in the buffer, abstract), and wherein the index tensor fetching circuit is configured to generate one or more address values (Specifically, the addressing scheme is used to generate MEMADDR and BANKSEL values. Each input data value is then stored in the input data buffer at a location based on the values calculated for MEMADDR and BANKSEL [0064]; The memory locations are identifiable by a Bank number, from 0 to 7 and an address position (MEMADDR) in respective banks. For example, the top left memory location is identified by MEMADDR=0 and BANKSEL=0 and the bottom right value is identified by MEMADDR=7 and BANKSEL=7; Subsequently, values for MEMADDR and BANKSEL are determined for each input data value [0068]) for referencing the index tensor in the buffer memory based on the one or more index values (wherein an address of an input data value in the buffer is determined based upon a plane index p for that input data value (see claim 6)). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Barnard for the benefit of data throughput from the input data buffer to the convolution engines is both fast and consistent (Barnard [0105]) 4. Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Cohen et al. (US11714653 filed 02/15/2021) in view of Fishel et al. (US20190340488) in view of Barnard et al. (US20180101763) and further in view of Frumkin et al. (US20200342632 filed 04/29/2019) Regarding claim 4, Modified Cohen teaches the neural processor circuit of claim 2, Modified Cohen does not explicitly teach wherein the data processor circuit further comprises a formatting circuit coupled to the indexing circuit, the formatting circuit configured to: transpose the fetched elements of the source tensor to generate a transposed version of the source tensor. Frumkin teaches wherein the data processor circuit (FIG. 
1B shows a block diagram of an example processing system 100 [0056]) further comprises a formatting circuit (diagonal storage format 104 [0053]) coupled to the indexing circuit (index data indicating matrix locations of the non-zero values in the stream; and executing a plurality of threads to determine, based on the index data, matrix and/or transposed matrix coordinates of the non-zero values [0017]), the formatting circuit (diagonal storage format 104 [0053]) configured to: transpose the fetched elements of the source tensor to generate a transposed version of the source tensor (original sparse matrix array 106 a and the transposed version 106 c of the sparse matrix array [0053], Fig. 1B). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Frumkin for the benefit of reducing system storage and/or communication resources, and increase computation speeds (Frumkin [0063]). Regarding claim 14, Modified Cohen teaches the method of claim 12, Cohen teaches further comprising: fetching elements of the source tensor along a dimension of the source tensor (… one or more consumer processors 28 read the data from buffer 24 and apply a computational task to the data read from the buffer (col. 5, ln 2-4)) using a corresponding value in the index tensor (In the present example, it is assumed that consumer processors 28 execute a program stage called “sum,” which sums the values in each row of buffer 24 to produce a single scalar value per input row (col. 6, ln 32-35)); and Modified Cohen does not explicitly teach transposing the fetched elements of the source tensor to generate a transposed version of the source tensor. Frumkin teaches transposing the fetched elements of the source tensor to generate a transposed version of the source tensor (original sparse matrix array 106 a and the transposed version 106 c of the sparse matrix array [0053], Fig. 1B). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Frumkin for the benefit of reducing system storage and/or communication resources, and increase computation speeds (Frumkin [0063]) 5. Claims 5-9, 15-17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Cohen et al. (US11714653 filed 02/15/2021) in view of Fishel et al. (US20190340488) in view of Barnard et al. (US20180101763) and further in view of Albericio et al. ("Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing." ACM SIGARCH Computer Architecture News 44.3 (2016): 1-13) Regarding claim 5, Modified Cohen teaches the neural processor circuit of claim 1, Modified Cohen does not explicitly teach wherein the indexing circuit is further configured to fetch a slice of the source tensor along a dimension of the source tensor starting from an offset value obtained from the index tensor, the slice being of a size that fits into the buffer memory. Albericio teaches wherein the indexing circuit (CNV allows direct indexing at a finer granularity sacrificing any memory footprint savings, pg. 7, right col., fist full para.) is further configured to fetch a slice of the source tensor along a dimension of the source (Each cycle, one neuron per slice is fetched resulting into a group of 16 neurons one per lane thus keeping all lanes busy, pg. 8, left col., second para.) 
starting from an offset value obtained from the index tensor (For example, let e(x,y,z) be the (neuron,offset) pair stored at location (x,y,z) of an input array in ZFNAf. In cycle 0, the encoded neurons at position e(0,0,0), e(0,0,16), ..., e(0,0,240) will be fetched, pg. 8, left col., second para.), the slice being of a size that fits into the buffer memory (CNV divides the window evenly into 16 slices, one per neuron lane, pg. 8, left col., second para.; CNV uses 16- element bricks. Bottom: NBin store format, Fig. 7, pg. 8, left col., The Examiner notes NBin implies Input Buffer Transfer (NBin) and cnvlutin (CNV) is a value-based approach to hardware acceleration). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Albericio for the benefit of skipping over the ineffectual computations, enables further performance and energy efficiency improvements and average performance improvements increase to 1.52× without any loss in accuracy (Albericio, abstract) Regarding claim 6, Modified Cohen teaches the neural processor circuit of claim 5, Albericio teaches the data processor circuit (Processing Order in CNV, pg. 8, left col., first para.) is further configured to broadcast the fetched slice of the source tensor to the plurality of neural engine circuits (In cycle 0, the encoded neurons at position e(0,0,0), e(0,0,16), ..., e(0,0,240) will be fetched and broadcast to all units and processed by neuron lanes 0 through 15, respectively, pg. 8, left col., second para. The same motivation to combine dependent claim 5 applies here. Regarding claim 7, Modified Cohen teaches the neural processor circuit of claim 1, Modified Cohen does not explicitly teach wherein the data processor circuit further comprises a formatting circuit coupled to the indexing circuit, the formatting circuit configured to: transpose the fetched slice of the source tensor to generate a transposed version of the source tensor. Albericio teaches wherein the data processor circuit (Processing Order in CNV, pg. 8, left col., first para.) further comprises a formatting circuit coupled to the indexing circuit (Fig. 7: Top: ZFNAf (Zero-Free Neuron Array Format) for 4-element bricks, Fig. 7, pg. 8, left col.,), the formatting circuit configured to: transpose the fetched slice of the source tensor to generate a transposed version of the source tensor (This proves to be equivalent to transposing the SB store order per subunit. Since the synapses are known in advance this rearrangement can be done statically in software (pg. 8, right col., first para.); The synapses are stored in the SBs in the order shown in the figure, so that the units can fetch the appropriate synapses in parallel (pg. 6, right col., first para.)). 
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Albericio for the benefit of skipping over the ineffectual computations, enables further performance and energy efficiency improvements and average performance improvements increase to 1.52× without any loss in accuracy (Albericio, abstract) Regarding claim 8, Modified Cohen teaches the neural processor circuit of claim 1, Modified Cohen does not explicitly teach further comprising a planar engine circuit directly coupled to the data processor circuit, the planar engine circuit configured to: generate the index tensor as a result of a reduction operation applied on at least a portion of the input data; and store the generated index tensor into the buffer memory. Albericio teaches further comprising a planar engine circuit directly coupled to the data processor circuit, the planar engine circuit (The output now is a 2×2×2 array, with each filter producing one of the two planes or layers of the output, pg. 3, Fig. 2) configured to: generate the index tensor as a result of a reduction operation applied on at least a portion of the input data (Each synapse s(x,y,z) is multiplied by the corresponding input neuron n(x, y,z), e.g., n(0,0,0)×s(0,0,0), and n(0,1,0)×s(0,1,0), for a total of 2×2×2 or eight products. The eight products are reduced into a single output neuron using addition, pg. 3, Fig. 2); and store the generated index tensor into the buffer memory (An adder tree per filter lane reduces two products into a partial sum that accumulates into an Output Neuron Buffer (NBout) lane per filter (pg. 4, left col., second to the last para.)). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Albericio for the benefit of skipping over the ineffectual computations, enables further performance and energy efficiency improvements and average performance improvements increase to 1.52× without any loss in accuracy (Albericio, abstract) Regarding claim 9, Modified Cohen teaches the neural processor circuit of claim 1, Modified Cohen does not explicitly teach wherein the data processor circuit further comprises a formatting circuit coupled to the indexing circuit, the formatting circuit configured to: receive the fetched portion of the source tensor from the indexing circuit; and perform formatting and aligning of the fetched portion of the source tensor to generate an aligned version of the source tensor for the at least one neural engine circuit. Albericio teaches wherein the data processor circuit further comprises a formatting circuit coupled to the indexing circuit (The Zero-Free Neuron Array Format: Figure 7 shows the Zero-Free Neuron Array format (ZFNAf) that enables CNV to avoid computations with zero-valued neurons, pg.7, left col., last para.), the formatting circuit configured to: receive the fetched portion of the source tensor from the indexing circuit (Specifically, ZFNAf encodes neurons as (value,offset) pairs in groups called bricks, pg. 
7, right col., first full para.); and perform formatting and aligning of the fetched portion of the source tensor to generate an aligned version of the source tensor for the at least one neural engine circuit (Each brick corresponds to a fetch block of the DaDianNao design, that is an aligned, continuous along the input features dimension i group of 16 neurons, i.e., they all have the same x and y coordinates, pg. 7, right col., first full para.). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Albericio for the benefit of skipping over the ineffectual computations, enables further performance and energy efficiency improvements and average performance improvements increase to 1.52× without any loss in accuracy (Albericio, abstract) Regarding claim 15, Modified Cohen teaches the method of claim 12, Modified Cohen does not explicitly teach further comprising: fetching a slice of the source tensor along a dimension of the source tensor starting from an offset value obtained from the index tensor, the slice being of a size that fits into the buffer memory; and broadcasting the fetched slice of the source tensor to the plurality of neural engine circuits. Albericio teaches further comprising: fetching a slice of the source tensor along a dimension of the source tensor (Each cycle, one neuron per slice is fetched resulting into a group of 16 neurons one per lane thus keeping all lanes busy, pg. 8, left col., second para.) starting from an offset value obtained from the index tensor (For example, let e(x,y,z) be the (neuron,offset) pair stored at location (x,y,z) of an input array in ZFNAf. In cycle 0, the encoded neurons at position e(0,0,0), e(0,0,16), ..., e(0,0,240) will be fetched, pg. 8, left col., second para.), the slice being of a size that fits into the buffer memory (CNV divides the window evenly into 16 slices, one per neuron lane, pg. 8, left col., second para.; CNV uses 16- element bricks. Bottom: NBin store format, Fig. 7, pg. 8, left col., The Examiner notes NBin implies Input Buffer Transfer (NBin) and cnvlutin (CNV) is a value-based approach to hardware acceleration); and broadcasting the fetched slice of the source tensor to the plurality of neural engine circuits (In cycle 0, the encoded neurons at position e(0,0,0), e(0,0,16), ..., e(0,0,240) will be fetched and broadcast to all units and processed by neuron lanes 0 through 15, respectively, pg. 8, left col., second para.). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Albericio for the benefit of skipping over the ineffectual computations, enables further performance and energy efficiency improvements and average performance improvements increase to 1.52× without any loss in accuracy (Albericio, abstract) Regarding claim 16, Modified Cohen teaches the method of claim 12, Modified Cohen does not explicitly teach further comprising: fetching a slice of the source tensor along a dimension of the source tensor starting from an offset value obtained from the index tensor, the slice being of a size that fits into the buffer memory; and transposing the fetched slice of the source tensor to generate a transposed version of the source tensor. 
Albericio teaches further comprising: fetching a slice of the source tensor along a dimension of the source tensor (Each cycle, one neuron per slice is fetched resulting into a group of 16 neurons one per lane thus keeping all lanes busy, pg. 8, left col., second para.) starting from an offset value obtained from the index tensor (For example, let e(x,y,z) be the (neuron,offset) pair stored at location (x,y,z) of an input array in ZFNAf. In cycle 0, the encoded neurons at position e(0,0,0), e(0,0,16), ..., e(0,0,240) will be fetched, pg. 8, left col., second para.), the slice being of a size that fits into the buffer memory (CNV divides the window evenly into 16 slices, one per neuron lane, pg. 8, left col., second para.; CNV uses 16- element bricks. Bottom: NBin store format, Fig. 7, pg. 8, left col., The Examiner notes NBin implies Input Buffer Transfer (NBin) and cnvlutin (CNV) is a value-based approach to hardware acceleration); and transposing the fetched slice of the source tensor to generate a transposed version of the source tensor (This proves to be equivalent to transposing the SB store order per subunit. Since the synapses are known in advance this rearrangement can be done statically in software (pg. 8, right col., first para.); The synapses are stored in the SBs in the order shown in the figure, so that the units can fetch the appropriate synapses in parallel (pg. 6, right col., first para.)). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Albericio for the benefit of skipping over the ineffectual computations, enables further performance and energy efficiency improvements and average performance improvements increase to 1.52× without any loss in accuracy (Albericio, abstract) Regarding claim 17, Modified Cohen teaches the method of claim 12, Modified Cohen does not explicitly teach further comprising: performing formatting and aligning of the fetched portion of the source tensor to generate an aligned version of the source tensor. Albericio teaches further comprising: performing formatting and aligning of the fetched portion of the source tensor to generate an aligned version of the source tensor (Each brick corresponds to a fetch block of the DaDianNao design, that is an aligned, continuous along the input features dimension i group of 16 neurons, i.e., they all have the same x and y coordinates, pg. 7, right col., first full para.). 
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Albericio for the benefit of skipping over the ineffectual computations, enables further performance and energy efficiency improvements and average performance improvements increase to 1.52× without any loss in accuracy (Albericio, abstract) Regarding claim 20, Modified Cohen teaches the electronic device of claim 19, Modified Cohen does not explicitly teach wherein: the indexing circuit is further configured to fetch a slice of the source tensor along a dimension of the source tensor starting from an offset value obtained from the index tensor; and the data processor circuit further comprising a formatting circuit coupled to the indexing circuit, the formatting circuit configured to transpose the fetched slice of the source tensor to generate a transposed version of the source tensor. Albericio teaches wherein: the indexing circuit (CNV allows direct indexing at a finer granularity sacrificing any memory footprint savings, pg. 7, right col., fist full para.) is further configured to fetch a slice of the source tensor along a dimension of the source tensor (Each cycle, one neuron per slice is fetched resulting into a group of 16 neurons one per lane thus keeping all lanes busy, pg. 8, left col., second para.) starting from an offset value obtained from the index tensor (For example, let e(x,y,z) be the (neuron,offset) pair stored at location (x,y,z) of an input array in ZFNAf. In cycle 0, the encoded neurons at position e(0,0,0), e(0,0,16), ..., e(0,0,240) will be fetched, pg. 8, left col., second para.); and the data processor circuit (Processing Order in CNV, pg. 8, left col., first para.) further comprising a formatting circuit coupled to the indexing circuit (Fig. 7: Top: ZFNAf (Zero-Free Neuron Array Format) for 4-element bricks, Fig. 7, pg. 8, left col.,), the formatting circuit configured to transpose the fetched slice of the source tensor to generate a transposed version of the source tensor (This proves to be equivalent to transposing the SB store order per subunit. Since the synapses are known in advance this rearrangement can be done statically in software (pg. 8, right col., first para.); The synapses are stored in the SBs in the order shown in the figure, so that the units can fetch the appropriate synapses in parallel (pg. 6, right col., first para.)). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Albericio for the benefit of skipping over the ineffectual computations, enables further performance and energy efficiency improvements and average performance improvements increase to 1.52× without any loss in accuracy (Albericio, abstract) 6. Claims 10, 11 and 18 rejected under 35 U.S.C. 103 as being unpatentable over Cohen et al. (US11714653 filed 02/15/2021) in view of Fishel et al. (US20190340488) in view of Barnard et al. (US20180101763) in view of Albericio et al. ("Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing." ACM SIGARCH Computer Architecture News 44.3 (2016): 1-13) and further in view of Nagy et al. 
(US20210241082 PCT filed 04/09/2019) Regarding claim 10, Modified Cohen teaches the neural processor circuit of claim 9, Modified Cohen does not explicitly teach further comprising a planar engine circuit directly coupled to the data processor circuit, the planar engine circuit configured to: receive the aligned version of the source tensor; perform a planar operation on at least a portion of the aligned version of the source tensor to generate a planar version of the source tensor; and write back the planar version of the source tensor into the buffer memory. Nagy teaches further comprising a planar engine circuit directly coupled to the data processor circuit (convolution control is coupled to planar engine 439, Fig. 4), the planar engine circuit (In FIG. 2, an input plane is defined as an input plane tensor 201 comprising a plurality of two-dimensional input planes of size IW×IH each [0077]) configured to: receive the aligned version of the source tensor (The results of the plurality of processing elements 418 may be provided to several subsequent processing elements of the convolution core 415, including one or more of a result scaling [0102]; result scaling is received by result store 439, Fig. 4); perform a planar operation on at least a portion of the aligned version of the source tensor to generate a planar version of the source tensor (the convolution control unit 433 may trigger a read of the result store 439 aligned to a read to the data buffer 431 and coefficient buffer 429 to present accumulated partial results of the same output plane at a position for which the actual product is being calculated [0131]); and write back the planar version of the source tensor into the buffer memory (The product added to the accumulated partial results (output of the sum 437) may be written back to the result store 439 as a further accumulated partial result [0131]). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Cohen to incorporate the teachings of Nagy for the benefit of accelerating any tasks or scenarios that involve a huge number of (convolution or any other) operations to be performed on input data [0084] and accelerating operations for various tasks and operations, which efficiently exploits the available resources and decreases the amount of data communicated via the interface to an external host (Nagy [0030]). Regarding claim 11, Modified Cohen teaches the neural processor circuit of claim 9, Modified Cohen does not explicitly teach wherein the at least one neural engine circuit is further configured to: receive the aligned version of the source tensor; perform another convolution operation on at least a portion of the aligned version of the source tensor to generate a processed version of the source tensor; and write back the processed version of the source tensor into the buffer memory. Nagy teaches wherein the at least one neural engine circuit is further configured to: receive the aligned version of the source tensor (The results of the plurality of processing elements 418 may be provided to several subsequent processing elements of the convolution core 415, including one or more of a result scaling [0102]; result scaling is received by result store 439, Fig. 
4); perform another convolution operation on at least a portion of the aligned version of the source tensor to generate a processed version of the source tensor (The convolution control unit 433 may further repeat a convolution operation by the processing elements 418 in groups of M utilizing a parallelism of the plurality of processing elements 418 to produce up to R*M output tiles of output planes that will be stored in the result store 439 [0137]); and write back the processed version of the source tensor into the buffer memory (The product added to the accumulated partial results (output of the sum 437) may be written back to the result store 439 as a further accumulated partial result [0131]). The same motivation to combine dependent claim 10 applies here. Regarding claim 18, claim 18 is similar to claim 10. It is rejected in the same manner and reasoning applying. Conclusion Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 8am-5pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle T Bechtold can be reached on (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /M.G./Examiner, Art Unit 2148 /MICHELLE T BECHTOLD/Supervisory Patent Examiner, Art Unit 2148
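
For readers skimming the claim mapping above, the operation at the center of the dispute, an indexing circuit that fetches a portion of a source tensor from buffer memory by referencing an index tensor, with a rasterizer generating index values and an index tensor fetching circuit turning them into buffer addresses, can be pictured with a short software model. This is only an illustrative sketch of the behavior recited in claim 1; the flat row-major buffer layout, the gather pattern, and all helper names are assumptions, and it does not represent the applicant's hardware or any cited reference.

import numpy as np

# Buffer memory modeled as two flat, row-major regions: one holding the source
# tensor (the convolution output) and one holding the index tensor. This layout
# is assumed for illustration only.
rows, cols = 4, 6
source_buf = np.arange(rows * cols, dtype=np.float32)  # source tensor, flattened
index_buf = np.array([5, 2, 0, 3], dtype=np.int64)     # index tensor, one entry per row

def rasterizer(n):
    # Stand-in for the claimed rasterizer: generate index values in raster order.
    yield from range(n)

def index_tensor_addresses(index_values, base=0, stride=1):
    # Stand-in for the index tensor fetching circuit: map the rasterizer's index
    # values to addresses of index-tensor entries in buffer memory.
    return [base + stride * v for v in index_values]

# Read the index tensor at the generated addresses, then use its entries to fetch
# the corresponding portion of the source tensor (one element per row).
addrs = index_tensor_addresses(rasterizer(rows))
gathered_cols = index_buf[addrs]
fetched = source_buf[[r * cols + int(c) for r, c in enumerate(gathered_cols)]]

print(fetched)  # [ 5.  8. 12. 21.]

In the claimed design, the gathered elements would then be broadcast to the neural engine circuits (claims 2 and 3) or handed to a formatting circuit for transposition or alignment (claims 4 and 9); the sketch stops at the gather itself.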

Prosecution Timeline

Nov 30, 2021: Application Filed
Jun 27, 2025: Non-Final Rejection (§103)
Oct 16, 2025: Applicant Interview (Telephonic)
Oct 16, 2025: Examiner Interview Summary
Nov 07, 2025: Response Filed
Jan 21, 2026: Final Rejection (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602586: SUPERVISORY NEURON FOR CONTINUOUSLY ADAPTIVE NEURAL NETWORK (granted Apr 14, 2026; 2y 5m to grant)
Patent 12530583: VOLUME PRESERVING ARTIFICIAL NEURAL NETWORK AND SYSTEM AND METHOD FOR BUILDING A VOLUME PRESERVING TRAINABLE ARTIFICIAL NEURAL NETWORK (granted Jan 20, 2026; 2y 5m to grant)
Patent 12511528: NEURAL NETWORK METHOD AND APPARATUS (granted Dec 30, 2025; 2y 5m to grant)
Patent 12367381: CHAINED NEURAL ENGINE WRITE-BACK ARCHITECTURE (granted Jul 22, 2025; 2y 5m to grant)
Patent 12314847: TRAINING OF MACHINE READING AND COMPREHENSION SYSTEMS (granted May 27, 2025; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 44%
With Interview: 78% (+33.4%)
Median Time to Grant: 4y 8m
PTA Risk: Moderate
Based on 68 resolved cases by this examiner; grant probability is derived from the career allow rate.
