Prosecution Insights
Last updated: April 19, 2026
Application No. 17/946,231

DEEP NEURAL NETWORK (DNN) ACCELERATORS WITH WEIGHT LAYOUT REARRANGEMENT

Non-Final OA: §102, §112
Filed: Sep 16, 2022
Examiner: LE, UYEN T
Art Unit: 2156
Tech Center: 2100 — Computer Architecture & Software
Assignee: Intel Corporation
OA Round: 1 (Non-Final)
Grant Probability: 84% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 11m
Grant Probability With Interview: 94%

Examiner Intelligence

Career Allow Rate: 84% (669 granted / 797 resolved; +28.9% vs TC avg), above average
Interview Lift: +9.7% for resolved cases with interview (moderate, roughly +10%)
Typical Timeline: 2y 11m average prosecution; 24 applications currently pending
Career History: 821 total applications across all art units

Statute-Specific Performance

§101: 15.8% (-24.2% vs TC avg)
§103: 27.6% (-12.4% vs TC avg)
§102: 20.0% (-20.0% vs TC avg)
§112: 22.2% (-17.8% vs TC avg)
Tech Center averages are estimates; based on career data from 797 resolved cases.
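The per-statute deltas imply a Tech Center baseline that can be recovered by simple subtraction. The sketch below does that arithmetic; it assumes each delta is expressed in percentage points as the examiner's rate minus the Tech Center average, which the page does not state explicitly.

```python
# Recover the implied Tech Center averages from the examiner's statute-specific
# rates and the "vs TC avg" deltas. Assumption: each delta is the examiner's
# rate minus the Tech Center average, in percentage points.
examiner_rate = {"101": 15.8, "103": 27.6, "102": 20.0, "112": 22.2}
delta_vs_tc = {"101": -24.2, "103": -12.4, "102": -20.0, "112": -17.8}

for statute, rate in examiner_rate.items():
    tc_avg = rate - delta_vs_tc[statute]
    print(f"Section {statute}: examiner {rate:.1f}%, implied TC average {tc_avg:.1f}%")
```

Under that reading, all four statutes imply roughly the same 40% baseline, consistent with the note above that the Tech Center averages are estimates.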

Office Action

§102, §112
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-25 are pending.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on 3-19-2024, 1-10-2023, 11-30-2022, and 9-16-2022 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-25 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being incomplete for omitting essential structural cooperative relationships of elements, such omission amounting to a gap between the necessary structural connections. See MPEP § 2172.01. The omitted structural cooperative relationships are:
- between the data blocks from different ones of the plurality of virtual sub-banks and the partitioned weight tensor;
- between the data blocks and the interleaving of the data blocks (it is not clear how the data blocks are identified and how they are interleaved);
- between the weights arranged in a three-dimensional matrix and the data blocks of the plurality of virtual sub-banks; and
- between the linear data structure and the structure of the processing elements to execute convolutions.
The dependent claims do not cure the deficiencies of the parent claims.

Claims 1-25 are also rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Claim 1 recites "a method of deep learning" in the preamble; however, the end result seems merely to "write the linear data structure into a second memory associated with a part of the array". The claim language "to be used by an array of processing elements (PEs) to execute a convolution" recited at lines 4-5 appears to be merely a statement of intended use. The body of the claim does not show any actual operation performed by the array of processing elements to execute any convolution. Claims 11 and 21 essentially recite similar limitations and thus contain similar defects. The dependent claims do not cure the deficiencies of their parent claims.

Art rejections are applied to claims 1-25 as best understood in light of the rejections under 35 U.S.C. 112 discussed above.
Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-25 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Yan et al. (US 20230111362 A1).

Regarding claim 1, Yan discloses a method of deep learning, the method comprising: reading a weight tensor from a first memory (see at least Fig. 9, block 910), wherein the weight tensor comprises weights in one or more convolutional kernels, and the weights are arranged in a three-dimensional matrix (see [0050]: During the convolution processing 220 shown in FIG. 2, MAC operations are performed on the filter 224 and each depth-wise slice, such as the first depth-wise slice 223, of the input tensor to generate a dot product, such as the dot product 228. For example, the first depth-wise slice 223 of the input tensor 222 is a 1x1x3 tensor at the top left of the input tensor 222 (the three grey cubes). Both the first depth-wise slice 223 and the filter 224 have a size of 1x1x3. See also [0053]: In some embodiments, each of the filters may be similarly segmented along its depth dimension into a plurality of sub-filters according to the plurality of channel groups 324. That is, each of the sub-tensor of an input tensor and each of the sub-filter of a filter includes a same number of channels) and are to be used by an array of processing elements (PEs) to execute a convolution (see at least Fig. 9, block 940); partitioning the weight tensor into a plurality of virtual banks based on an arrangement of the PEs in the array (see at least [0006]: In some embodiments, the segmenting the plurality of filters into a plurality of sub-filter groups comprises: grouping the plurality of filters into a plurality of filter groups; segmenting each of the plurality of filters into a plurality of sub-filters according to the plurality of channel groups; and determining the sub-filters of a same filter group and of a same channel group as a sub-filter group).
Note that the claimed weight tensor partitioned into a plurality of virtual banks is met by the filters segmented into a plurality of sub-filter groups according to the plurality of channel groups of Yan; partitioning a virtual bank of the plurality of virtual banks into a plurality of virtual sub-banks (see at least Fig. 9, block 930); identifying data blocks from different ones of the plurality of virtual sub-banks (see at least Fig. 9, block 940); forming a linear data structure by interleaving the data blocks, the linear data structure comprising the data blocks arranged in a linear sequence (see at least Fig. 9, block 940); and writing the linear data structure into a second memory associated with a part of the array (see at least Fig. 9, block 950).

Regarding claim 2, Yan teaches the method of claim 1, wherein the array of PEs comprises PEs arranged in columns, and the part of the array is one of the columns (see Fig. 8A).

Regarding claim 3, Yan teaches the method of claim 2, wherein each respective virtual bank of the plurality of virtual banks corresponds to a different one of the columns (see Fig. 8A).

Regarding claim 4, Yan teaches the method of claim 1, wherein the array of PEs constitutes at least part of a convolutional layer in a deep neural network (DNN) and is to perform the convolution on the weights and an input tensor to generate an output tensor (see Fig. 8B).

Regarding claim 5, Yan teaches the method of claim 4, wherein the three-dimensional matrix has a first dimension determined based on a number of input channels in the input tensor, a second dimension determined based on a size of the one or more convolutional kernels, and a third dimension determined based on a number of output channels in the output tensor (see at least [0048]: In FIG. 2, the filter 224 may be a 1∗1∗3 matrix. The height and the width (e.g., 1(R)∗1(S)) of the filter 224 in each channel may be referred to as a kernel (the filter 224 has three kernels in the three channels, respectively). See also [0050]: The number of channels in the output tensor 225 equals to the number of filters that have applied during the convolution. Since the convolution processing 220 only uses one filter 224, the corresponding output tensor 228 only has one channel.).

Regarding claim 6, Yan further teaches or suggests the method of claim 5, wherein partitioning the virtual bank into a plurality of virtual sub-banks comprises: partitioning the virtual bank in the third dimension, wherein a dimension of a virtual sub-bank equals an integral divisor of the number of output channels in the output tensor (see at least [0050]: the output tensor 225 may be determined after the filter 224 convolves (e.g., moves) through all the depth-wise slices in the input tensor 222 (9 slices in FIG. 2). The number of channels in the output tensor 225 equals to the number of filters that have applied during the convolution. Since the convolution processing 220 only uses one filter 224, the corresponding output tensor 228 only has one channel).

Regarding claim 7, Yan further teaches the method of claim 5, wherein a data block has a dimension that equals a predetermined number of input channels (see at least [0017]: in some embodiments, each of the sub-tensors and each of the sub-filters comprise the same number of channels).
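For readers who want to connect the claim language in the mapping above to a concrete data layout, the rearrangement recited in claim 1 (with the column-wise banks of claims 2-3, the output-channel sub-banks of claim 6, the input-channel blocks of claim 7, the alternating order of claim 8, and the optional zero removal of claim 9) can be sketched roughly as follows. This is an illustrative reconstruction only, not code from the application or from Yan; the tensor dimensions, the number of PE columns and sub-banks, the block size, and the round-robin interleave order are all assumptions chosen to make the steps concrete.

```python
import numpy as np

# Illustrative sketch of the claim 1 weight-layout rearrangement.
# All sizes below are assumptions for the example, not values from the
# application or from Yan (US 20230111362 A1).
IC, K, OC = 16, 9, 32          # input channels, flattened kernel size, output channels (claim 5)
weights = np.arange(IC * K * OC, dtype=np.int32).reshape(IC, K, OC)  # weight tensor in the "first memory"

NUM_PE_COLUMNS = 4             # one virtual bank per PE column (claims 2-3)
SUB_BANKS_PER_BANK = 2         # each virtual bank split into virtual sub-banks (claim 6)
BLOCK_IC = 4                   # each data block spans a fixed number of input channels (claim 7)

# Step 1: partition the weight tensor into virtual banks along the
# output-channel dimension, one bank per PE column.
banks = np.split(weights, NUM_PE_COLUMNS, axis=2)

def rearrange_bank(bank, drop_zeros=False):
    """Turn one virtual bank into the linear data structure for one PE column."""
    # Step 2: partition the bank into virtual sub-banks (again along output channels).
    sub_banks = np.split(bank, SUB_BANKS_PER_BANK, axis=2)
    # Step 3: identify data blocks of BLOCK_IC input channels in each sub-bank.
    blocks_per_sub_bank = [np.split(sb, sb.shape[0] // BLOCK_IC, axis=0) for sb in sub_banks]
    # Step 4: interleave the blocks so that blocks from different sub-banks
    # alternate (claim 8), flattening everything into one linear sequence.
    linear = []
    for block_group in zip(*blocks_per_sub_bank):
        for block in block_group:
            flat = block.ravel()
            if drop_zeros:                 # optional zero removal before layout (claim 9)
                flat = flat[flat != 0]
            linear.append(flat)
    # Step 5: the concatenated sequence is what gets written to the "second
    # memory" associated with this PE column.
    return np.concatenate(linear)

linear_structures = [rearrange_bank(bank) for bank in banks]
for col, buf in enumerate(linear_structures):
    print(f"PE column {col}: linear layout holds {buf.size} weights")
```

On this reading, each PE column ends up with a single linear buffer in which blocks from different sub-banks alternate; it is exactly this relationship between the linear data structure and the PE structure that the §112 rejection says the claims do not clearly tie together.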
Regarding claim 8, Yan further teaches the method of claim 1, wherein: the data blocks comprise first data blocks from a first virtual sub-bank of the plurality of virtual sub-banks and second data blocks from a second virtual sub-bank of the plurality of virtual sub-banks, and the first data blocks alternate with the second data blocks in the linear data structure (see at least [0110]: Block 940 includes respectively assigning a plurality of combinations of the sub-tensors and the sub-filter groups to a plurality of processors, wherein each of the plurality of combinations comprises a sub-tensor and a sub-filter group corresponding to the same channel group).

Regarding claim 9, Yan further teaches or suggests the method of claim 1, further comprising: before identifying data blocks from different ones of the plurality of virtual sub-banks, removing weights having zero values from at least some of the plurality of virtual sub-banks (see at least [0008]: In some embodiments, zero values of the sub-tensor and the sub-filter group in the assigned combination are not stored).

Regarding claim 10, Yan further teaches or suggests the method of claim 1, wherein the first memory is outside the array of PEs, and the second memory is inside the array of PEs (see at least [0038]: In some embodiments, each PE may store the assigned sub-tensor and sub-filter group in energy-efficient and memory-efficient representations by only storing the non-zero values in index-value pairs within each PE. These representations may significantly reduce the storage footprint of the neural network and make the solution suitable for devices with limited memory resources. See also [0064]: As shown in FIG. 5, the hardware accelerator 500 may comprise a scheduler 570 to control the workflow within the accelerator 500 and interactions with off-chip components such as a host CPU 510 and double data rate (DDR) memories 520. For example, the accelerator 500 may interact with the host CPU 510 through a peripheral component interconnect express (PCIe) physical layer (PHY) controller 512, and an off-chip DDR memory 520 through a DDR interface 530. The accelerator 500 may fetch data from the off-chip DDR memory 520 through a direct memory access (DMA) controller 540 that communicates with the off-chip DDR memory 520 via the DDR interface 530. The fetched data may be stored in an on-chip buffer, called global buffer 550, to prepare for parallel convolution computations. The global buffer 550 may be logically divided into multiple sections, such as an input buffer 552, a weight buffer 554, and an output buffer 556).

Claims 11-20 essentially recite limitations similar to claims 1-10 in the form of a DNN accelerator and thus are rejected for the same reasons discussed for claims 1-10 above. Claims 21-25 essentially recite limitations similar to claims 1, 2, 4, 8, and 9, respectively, in the form of a non-transitory computer readable memory, and thus are rejected for the same reasons discussed for claims 1, 2, 4, 8, and 9 above.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Wu et al. (US 10346093 B1) teach a circuit arrangement (500) having multiple RAM circuits that include a read port and a write port. A memory controller (506) accesses tensor data arranged in multiple banks of tensor buffers in the RAM circuits. The memory controller generates read addresses, a read enable signal, write addresses, and write enable signals to access different tensor buffers in the RAM circuits at different times.
An array of processing circuits (505) includes multiple rows and multiple columns of processing circuits. Each subset of multiple subsets of rows of the processing elements is coupled to the read port of the RAM circuits by a read data bus. A last row of processing elements is coupled to the write port of each of the RAM circuits by a write data bus. The array of processing circuits performs tensor operations on the tensor data. The processing circuits in each row in the array of processing circuits are coupled to input the same tensor data.

Goulding et al. (US 20190087708 A1) teach a dynamically adaptive neural network processing system that includes memory to store instructions representing a neural network in contiguous blocks, hardware acceleration (HA) circuitry to execute the neural network, direct memory access (DMA) circuitry to transfer the instructions from the contiguous blocks of the memory to the HA circuitry, and a central processing unit (CPU) to dynamically modify a linked list representing the neural network during execution of the neural network by the HA circuitry to perform machine learning, and to generate the instructions in the contiguous blocks of the memory based on the linked list.

Kuzmin et al. (US 20230108248 A1) teach a method that includes retrieving, for a layer of a set of layers of an artificial neural network (ANN), a dense quantized matrix representing a codebook and a sparse quantized matrix representing linear coefficients. The dense quantized matrix and the sparse quantized matrix may be associated with a weight tensor of the layer. The processor-implemented method also includes determining, for the layer of the set of layers, the weight tensor based on a product of the dense quantized matrix and the sparse quantized matrix. The processor-implemented method further includes processing, at the layer, an input based on the weight tensor. The instructions loaded into the general-purpose processor 102 may comprise code to generate, during training of the ANN, the weight tensor based on a transformation of an original weight tensor from a four-dimensional weight tensor to a two-dimensional weight tensor, and code to factorize, during the training, the weight tensor into a dense matrix and a sparse matrix.

Bokam et al. (US 20210191765 A1) teach a method for scheduling an artificial neural network that includes: accessing a processor representation of a multicore processor comprising processor cores, direct memory access cores, and a cost model; and accessing a network structure defining a set of layers. The method also includes, for each layer in the set of layers: generating a graph based on the processor representation, the graph defining compute nodes, data transfer nodes, and edges representing dependencies between the compute nodes and the data transfer nodes; and generating a schedule for the layer based on the graph, the schedule assigning the compute nodes to the processor cores and assigning the data transfer nodes to the direct memory access cores. The method further includes aggregating the schedule for each layer in the set of layers to generate a complete schedule for the artificial neural network.

Modha (US 20190332925 A1) teaches networks and encodings therefor to provide increased energy efficiency and speed for convolutional operations. In various embodiments, a neural network comprises a plurality of neural cores. Each of the plurality of neural cores comprises a memory. A network interconnects the plurality of neural cores.
The memory of each of the plurality of neural cores comprises at least a portion of a weight tensor, the weight tensor comprising a plurality of weights. Each neural core is adapted to retrieve locally or receive a portion of an input image, apply the portion of the weight tensor thereto, and store locally or send a result therefrom via the network to other of the plurality of neural cores.

Yang, Dingqing, et al., "Procrustes: a dataflow and accelerator for sparse deep neural network training," 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE, 2020. Abstract: The success of DNN pruning has led to the development of energy-efficient inference accelerators that support pruned models with sparse weight and activation tensors. Because the memory layouts and dataflows in these architectures are optimized for the access patterns during inference, however, they do not efficiently support the emerging sparse training techniques. In this paper, we demonstrate (a) that accelerating sparse training requires a co-design approach where algorithms are adapted to suit the constraints of hardware, and (b) that hardware for sparse DNN training must tackle constraints that do not arise in inference accelerators. As proof of concept, we adapt a sparse training algorithm to be amenable to hardware acceleration; we then develop dataflow, data layout, and load-balancing techniques to accelerate it. The resulting system is a sparse DNN training accelerator that produces pruned models with the same accuracy as dense models without first training, then pruning, and finally retraining, a dense model. Compared to training the equivalent unpruned models using a state-of-the-art DNN accelerator without sparse training support, Procrustes consumes up to 3.26× less energy and offers up to 4× speedup across a range of models, while pruning weights by an order of magnitude and maintaining unpruned accuracy.

Song, Linghao, et al., "AccPar: Tensor partitioning for heterogeneous deep learning accelerators," 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2020. Abstract: Deep neural network (DNN) accelerators as an example of domain-specific architecture have demonstrated great success in DNN inference. However, the architecture acceleration for equally important DNN training has not yet been fully studied. With data forward, error backward and gradient calculation, DNN training is a more complicated process with higher computation and communication intensity. Because the recent research demonstrates a diminishing specialization return, namely, the "accelerator wall", we believe that a promising approach is to explore coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations on multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. We present AccPar, a principled and systematic method of determining the tensor partition among heterogeneous accelerator arrays. Compared to prior empirical or unsystematic methods, AccPar considers the complete tensor partition space and can reveal previously unknown new parallelism configurations. AccPar optimizes the performance based on a cost model that takes into account both computation and communication costs of a heterogeneous execution environment. Hence, our method can avoid the drawbacks of existing approaches that use communication as a proxy of the performance.
The enhanced flexibility of tensor partitioning in AccPar allows the flexible ratio of computations to be distributed among accelerators with different performances. The proposed search algorithm is also applicable to the emerging multi-path patterns in modern DNNs such as ResNet. We simulate AccPar on a heterogeneous accelerator array composed of both TPU-v2 and TPU-v3 accelerators for the training of large-scale DNN models such as AlexNet, the VGG series, and the ResNet series. The average performance improvements of the state-of-the-art "one weird trick" (OWT), HYPAR, and AccPar, normalized to the baseline data parallelism scheme where each accelerator replicates the model and processes different input data in parallel, are 2.98×, 3.78×, and 6.30×, respectively.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to UYEN T LE, whose telephone number is (571) 272-4021. The examiner can normally be reached M-F 9-5. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Ajay M Bhatia, can be reached at (571) 272-3906. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/UYEN T LE/
Primary Examiner, Art Unit 2156
20 January 2026

Prosecution Timeline

Sep 16, 2022: Application Filed
Nov 02, 2022: Response after Non-Final Action
Jan 21, 2026: Non-Final Rejection — §102, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591550: SHARE REPLICATION BETWEEN REMOTE DEPLOYMENTS (2y 5m to grant; granted Mar 31, 2026)
Patent 12591540: DATA MIGRATION IN A DISTRIBUTIVE FILE SYSTEM (2y 5m to grant; granted Mar 31, 2026)
Patent 12581301: MEDIA AGNOSTIC CONTENT ACCESS MANAGEMENT (2y 5m to grant; granted Mar 17, 2026)
Patent 12579189: METHOD, DEVICE, AND COMPUTER PROGRAM PRODUCT FOR GENERATING OBJECT IDENTIFIER (2y 5m to grant; granted Mar 17, 2026)
Patent 12561371: GRAPH OPERATIONS ENGINE FOR TENANT MANAGEMENT IN A MULTI-TENANT SYSTEM (2y 5m to grant; granted Feb 24, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 84%
With Interview (+9.7%): 94%
Median Time to Grant: 2y 11m
PTA Risk: Low
Based on 797 resolved cases by this examiner. Grant probability is derived from the career allow rate.
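The headline figures follow from simple arithmetic on the career statistics above. A minimal sketch, assuming the interview lift is additive in percentage points on top of the baseline (the page does not state the exact model):

```python
# Reproduce the projection figures from the examiner's career statistics.
# Assumption: the interview lift is additive, in percentage points.
granted, resolved = 669, 797
baseline = 100 * granted / resolved        # career allow rate
interview_lift = 9.7                       # percentage points
with_interview = baseline + interview_lift

print(f"Baseline grant probability: {baseline:.1f}%")       # ~83.9%, displayed as 84%
print(f"With interview:             {with_interview:.1f}%")  # ~93.6%, displayed as 94%
```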
