DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/29/2026 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Specification
The specification is objected to under 37 C.F.R. 1.74, which requires the detailed description to refer to the different parts of the figures by use of reference letters or reference numerals. Implicit in this rule is that the detailed description correctly reference the figures. In this application, the figures and the detailed description are inconsistent, as explained below.
A. In paragraph [0066] line 3, “instruction fetch unit 203” should read “instruction fetch unit 223” instead.
B. In paragraph [0080] line 4, “selectors 516, 512, and 514” should read “selectors 516, 515, and 514” instead.
Claim Objections
Claims 17-20 are objected to under 37 C.F.R. 1.71(a) which requires “full, clear, concise, and exact terms” as to enable any person skilled in the art or science to which the invention or discovery appertains, or with which it is most nearly connected, to make and use the same. The following should be corrected.
A. In claim 17 line 32, “weight data and activation data” should read “the weight data and the activation data” instead because weight data and activation data are already introduced in line 20. Claims 18-20 inherit the same deficiency as claim 17 by reason of dependence.
Claim Interpretation
The term “high-dimensional” is interpreted as three-or-more-dimensional, as defined in paragraph [0090].
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 8-9, 13 and 15 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 8 recites “the configured preferred mapping method” in line 8. There is insufficient antecedent basis for this limitation in the claim. For purposes of examination, this is interpreted to refer to the selected mapping method. Claim 9 inherits the same deficiency as claim 8 by reason of dependence.
Claim 13 recites “the one of the plurality of distribution circuits” in lines 1-2 and “the distribution circuit” in line 3. There is insufficient antecedent basis for these limitations in the claim. It is unclear which specific distribution circuit of the plurality of distribution circuits these are meant to refer to. For purposes of examination, “the one of the plurality of distribution circuits” in lines 1-2 is interpreted as each distribution circuit of the plurality of distribution circuits. Claim 15 recites “the distribution circuit” in lines 2-3 and is rejected for the same reason; claim 15 also inherits the same deficiency as claim 13 by reason of dependence.
Claim 15 recites “the combination of an index and a non-zero value” in line 3. There is insufficient antecedent basis for this limitation in the claim. For purposes of examination, the claim is interpreted to depend on claim 14, which introduces a combination of an index and a non-zero value, and is interpreted to recite the combination of the index and the non-zero value.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 4, 6-9, 12-13, 16-17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over “AMD Graphics Core Next Architecture, Generation 3”, hereinafter AMD, in view of Sankaradas et al. (NPL – “Massively Parallel Coprocessor for Convolutional Neural Networks”), hereinafter Sankaradas, and Hu et al. (NPL – “A Resources-Efficient Configurable Accelerator for Deep Convolutional Neural Networks”), hereinafter Hu.
Regarding claim 17, AMD teaches a system, comprising:
an accelerator, comprising (AMD Fig. 1.1 accelerator - Volcanic Islands Series Processor/GPU):
a direct memory access circuit, configured to load operation data of a plurality of sub-operations (AMD Fig. 1.1 direct memory access circuit – DMAs; page 1-1 to 1-2 “The GCN memory controller has direct access to all GCN device memory and the host specified areas of system memory. To satisfy read and write requests, the memory controller performs the functions of a direct-memory access (DMA) controller, including computing memory-address offsets based on the format of the requested data in memory … Request the GPU’s DMA engine to write data by pointing to the location of the source data on CPU memory, then pointing at the offset in the GPU memory”; plurality of sub-operations – kernels);
a plurality of cluster groups, wherein each of the cluster groups comprises a plurality of processing clusters (AMD Figs. 1.1 and 2.1 plurality of cluster groups – compute units or SIMD units; plurality of processing clusters – SIMD units or vector ALUs (VALU));
an on-chip memory, comprising a plurality of storage units that are respectively corresponding to the plurality of cluster groups, and each of the plurality of storage units is configured to store an instruction sequence and operation data for the corresponding cluster group (AMD Figs. 1.1 and 2.1 on-chip memory – L1, L2 and instruction cache; plurality of storage units - Texture R/W Cache L1; page 2-4 “On the primary read path, the device consists of multiple channels of L2 read-only cache that provides data to an L1 cache for each compute unit”; instruction sequence – data stored in the instruction cache);
a command processor, configured to (AMD Fig. 1.1 command processor - Command Processors; page 1-1 “The GCN command processor reads commands that the host has written to memory-mapped GCN registers in the system-memory address space”); and
a plurality of distribution circuits, respectively coupled to the plurality of storage units, and respectively coupled to the plurality of cluster groups, wherein each distribution circuit is configured to read the instruction sequence and the operation data of the instruction sequence from the storage unit coupled to the distribution circuit, and sends the instruction sequence and the operation data of the instruction sequence to the cluster group coupled to the distribution circuit (AMD Fig. 1.1 and 2.1 plurality of distribution circuits – plurality of bidirectional data bus to and from L1 and instruction cache to the compute units);
a scheduler, configured to instruct the accelerator to perform the operation (AMD Fig. 1.1 scheduler – host CPU; page 1-2 “A host application cannot write to the GCN device memory directly, but it can command the GCN device to copy programs and data between system memory and device memory. For the CPU to write to GPU memory, there are two ways … Upload a kernel to run on the shaders that access the memory through the PCIe link, then process it and store it in the GPU memory … The GCN programs are controlled by host commands, which … cause the GCN GPU to begin execution of a program”); and
a memory, configured to store (AMD Fig. 1.1 memory – system and/or device memory).
AMD does not explicitly teach a direct memory access circuit, configured to load operation data of a plurality of sub-operations for a plurality of times; a command processor, configured to decompose an operation associated with a specified neural network model into the plurality of sub-operations, the operation corresponding to a high-dimensional matrix operation, and each sub-operation corresponding to a two-dimensional matrix operation on a sub-matrix, convert the plurality of sub-operations into a plurality of instruction sequences executable on the plurality of processing clusters, select a mapping method from an input stationary mapping method and a weight stationary mapping method, wherein the selected mapping method specifies which one of activation data or weight data is kept in the on-chip memory longer during the plurality of sub-operations, and specify operation data for execution of each of the instruction sequences; a scheduler, configured to instruct the accelerator to perform an operation associated with the specified neural network model; and a memory, configured to store weight data and activation data of the specified neural network model.
However, in the same field of endeavor, Sankaradas discloses loading operation data of a plurality of sub-operations for a plurality of times (Sankaradas Fig. 3 and page 56 section III.B.1 “The left side of the figure shows a series of input FIFOs that may be implemented using two-port memory blocks to feed the first PE in each row. Input image pixels stream into one FIFO, and subsequently into the other FIFOs as shown in the figure”). Further, Sankaradas discloses decomposing an operation associated with a specified neural network model into a plurality of sub-operations, converting the plurality of sub-operations into a plurality of instruction sequences executable on a plurality of processing clusters, and specifying operation data for execution of each of the instruction sequences and instructing an accelerator to perform the operation associated with the specified neural network model (Sankaradas Fig. 4 and page 2 left col top “A CNN is decomposed into convolutions (that match the hardware primitives) and necessary data movement instructions to program the controller”; page 57 section 2 “A CNN is processed into instructions for the microcontroller. Each instruction enables a VPE cluster to independently add or skip convolutions, summation, non-linearity and sub-sampling. All CNNs are mapped into the meta-operation represented by a VPE cluster … The instruction set includes instructions to specify each CNN layer i.e., the number of input planes, number of output planes, their respective sizes as well as the number and sizes of all kernels.
Given the sizes, the system loads data into suitable locations in the off-chip memory banks”; page 57 section 3 “Given a CNN, and the target architecture consisting of M VPE clusters, each with N kxk convolver primitives, the pre-processor’s task is to decompose all convolutions into kxk convolutions, and organize the computations such that at most M groups of N convolutions are performed and summed together at a time”; specified neural network model - CNN). Further, Sankaradas discloses storing weight data and activation data of the specified neural network model (Fig. 2).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify AMD using Sankaradas and configure the GCN processor to implement a specified neural network model by storing weight and activation data related to the neural network model in the system or device memory; configure the command processor to decompose the convolution operations associated with the neural network model into a plurality of smaller convolution operations based on the number of processing elements in each compute unit; then instruct the compute units to perform the smaller convolution operations by configuring the DMAs to load the weight and the activation data related to the convolution operations to the on-chip memory accessible by the compute units in order to provide a system with a parallel microarchitecture that provides an excellent platform for any data-intensive application that exhibits high bandwidth needs or significant computational requirements such as a neural network application (AMD page 1-1 top).
Therefore, the combination of AMD as modified in view of Sankaradas teaches a direct memory access circuit, configured to load operation data of a plurality of sub-operations for a plurality of times; a command processor, configured to decompose an operation associated with a specified neural network model into the plurality of sub-operations, convert the plurality of sub-operations into a plurality of instruction sequences executable on the plurality of processing clusters, and specify operation data for execution of each of the instruction sequences; a scheduler, configured to instruct the accelerator to perform the operation associated with the specified neural network model; and a memory, configured to store weight data and activation data of the specified neural network model.
AMD as modified in view of Sankaradas does not explicitly teach the operation corresponding to a high-dimensional matrix operation, and each sub-operation corresponding to a two-dimensional matrix operation on a sub-matrix, and select a mapping method from an input stationary mapping method and a weight stationary mapping method, wherein the selected mapping method specifies which one of activation data or weight data is kept in the on-chip memory longer during the plurality of sub-operations.
However, in the same field of endeavor, Hu discloses decomposing an operation associated with a specified neural network model into a plurality of sub-operations, the operation corresponding to a high-dimensional matrix operation, and each sub-operation corresponding to a two-dimensional matrix operation on a sub-matrix and selecting a mapping method from an input stationary mapping method and a weight stationary mapping method, wherein the selected mapping method specifies which one of activation data or weight data is kept in the on-chip memory longer during the plurality of sub-operations (Hu Fig. 2 and page 72114 section II.B; operation – 3D convolution operation; plurality of sub-operations – tiled convolution; sub-matrix – input tile; page 72117 section III.B.2 "the tiling Imap are loaded and kept within InBuf for reuse, and the related weight are sequentially loaded into WBuf ... Note that each tiling Imap will be loaded only once from off-chip memory"; III.B.3 "the tiling weights are loaded and kept within WBuf for reuse, and relevant tiling Imap are loaded in InBuf ... Note that each tiling weight will be loaded only once from off-chip memory"; page 72117 section III.B.4 "Hybrid stationary (HS) storage pattern combines OS, IS and WS and selects the optimal pattern among them for each layer").
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify AMD in view of Sankaradas using Hu and decompose the operation associated with the specified neural network model into the plurality of sub-operations by converting the 3D convolution into a plurality of tiled convolution operations because the tiling methodology is a common method to resolve the problem of the limited on-chip resources (Hu page 72114 section II.B). Further, it would have been obvious to configure the command processor to select a mapping method from an input stationary mapping method and a weight stationary mapping method to specify which one of activation data or weight data is kept in the on-chip memory longer during the plurality of tiled convolution operations in order to select the mapping method that brings the lowest off-chip memory accesses (Hu page 72117 section III.B.4).
Therefore, the combination of AMD as modified in view of Sankaradas and Hu teaches a command processor, configured to decompose an operation associated with a specified neural network model into the plurality of sub-operations, the operation corresponding to a high-dimensional matrix operation, and each sub-operation corresponding to a two-dimensional matrix operation on a sub-matrix, convert the plurality of sub-operations into a plurality of instruction sequences executable on the plurality of processing clusters, select a mapping method from an input stationary mapping method and a weight stationary mapping method, wherein the selected mapping method specifies which one of activation data or weight data is kept in the on-chip memory longer during the plurality of sub-operations, and specify operation data for execution of each of the instruction sequences.
Regarding claim 20, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 17 as stated above. Further, AMD as modified in view of Sankaradas and Hu teaches wherein the decomposing the operation associated with the specified neural network model into the plurality of sub-operations comprises: converting the high-dimensional matrix operation of the weight data and the activation data into a plurality of two-dimensional matrix operations; and the converting the plurality of sub-operations into the plurality of instruction sequences executable on the plurality of processing clusters comprises: converting the plurality of two-dimensional matrix operations into the plurality of instruction sequences executable on the plurality of processing clusters (Sankaradas Fig. 4 and page 2 left col top; page 57 section 2; page 57 section 3; Hu Fig. 3 and page 72114 section II.B).
Regarding claims 1 and 4, they are directed to the accelerator of the system of claims 17 and 20, respectively. All components and component functions of the accelerator of claims 1 and 4 are included in the accelerator of the system of claims 17 and 20, respectively. The analysis of claims 17 and 20 applies equally to claims 1 and 4, respectively.
Regarding claim 6, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 4 as stated above.
AMD does not explicitly teach wherein the converting the high-dimensional matrix operation of the weight data and the activation data into a plurality of two-dimensional matrix operations further comprises: when a size of a two-dimensional matrix exceeds a preset standard, dividing the two-dimensional matrix by rows and/or columns into a plurality of sub-matrices, and converting the plurality of two-dimensional matrix operations into matrix operations based on the plurality of sub-matrices.
However, in the same field of endeavor, Hu discloses when a size of a two-dimensional matrix exceeds a preset standard, dividing the two-dimensional matrix by rows and/or columns into a plurality of sub-matrices, and converting the plurality of two-dimensional matrix operations into matrix operations based on the plurality of sub-matrices (Hu Fig. 2 and section II.B; “In general, hardware resources in embedded system, such as DSP and on-chip memory, are very limited. Therefore, it is difficult to compute and save an entire convolutional layer simultaneously. The tiling methodology is a common method to resolve the problem of the limited on-chip resources. Figure 2 shows a tiling strategy in a convolutional layer”; section III.A.2).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify AMD in view of Sankaradas using Hu and divide the two-dimensional matrices into tiles (i.e., by rows and/or columns) and convert the two-dimensional matrix operations into matrix operations using the tiles when a size of a two-dimensional matrix exceeds a preset standard in order to multiply two matrices larger than a dimension of hardware resources such as the compute/SIMD units and the on-chip memory (Hu section II.B).
Therefore, the combination of AMD as modified in view of Sankaradas and Hu teaches wherein the converting the high-dimensional matrix operation of the weight data and the activation data into a plurality of two-dimensional matrix operations further comprises: when a size of a two-dimensional matrix exceeds a preset standard, dividing the two-dimensional matrix by rows and/or columns into a plurality of sub-matrices, and converting the plurality of two-dimensional matrix operations into matrix operations based on the plurality of sub-matrices.
Regarding claim 7, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 4 as stated above. Further, AMD as modified in view of Sankaradas and Hu teaches wherein the command processor configures a plurality of mapping methods to convert the high-dimensional matrix operation of the weight data and the activation data into a plurality of two-dimensional matrix operations, wherein the high-dimensional matrix operation comprises operating a plurality of three-or-more-dimension matrices (Hu Figs. 2 and 6 and page 72117; plurality of mapping methods - output stationary (OS); input stationary (IS); weight stationary (WS); hybrid stationary (HS)).
Regarding claim 8, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 7 as stated above. Further, AMD as modified in view of Sankaradas and Hu teaches wherein the mapping method is selected for a specific operation associated with a specified neural network model, for causing the command processor to use the configured preferred mapping method for the specific operation (Hu page 72117 section III.B.4 “Hybrid stationary (HS) storage pattern combines OS, IS and WS and selects the optimal pattern among them for each layer. Given a system based on the used DSP and on-chip memory resources with limited off-chip memory bandwidth, we propose a two-step strategy, which yields the best performance and the least off-chip memory accesses”; preferred mapping method - optimal pattern which yields the best performance and the least off-chip memory accesses).
Regarding claim 9, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 8 as stated above. Further, AMD as modified in view of Sankaradas and Hu teaches wherein the specific operation associated with the specified neural network model is one of matrix multiplication, convolution, and depth convolution (Hu page 72116 section III.B; the specific operation is a convolution operation; Figs. 1-3).
Regarding claim 12, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 1 as stated above. Further, AMD as modified in view of Sankaradas and Hu teaches wherein the command processor is further configured to: receive indication information, and determine according to the indication information, the operation associated with the specified neural network model and a storage location of operation data of the operation (AMD Fig. 1.1 and page 1-1 “The GCN command processor reads commands that the host has written to memory-mapped GCN registers in the system-memory address space”; indication information – commands; Sankaradas page 56-57 section III.B.1-2 “The microcontroller can be programmed to fetch instructions from certain locations in the off-chip instruction memory, load corresponding data from off-chip data memory, send commands to the different VPE clusters, receive and store results to specific locations in the off-chip data memory … A CNN is processed into instructions for the microcontroller. Each instruction enables a VPE cluster to independently add or skip convolutions, summation, non-linearity and sub-sampling. All CNNs are mapped into the meta-operation represented by a VPE cluster”).
Regarding claim 13, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 1 as stated above. Further, AMD as modified in view of Sankaradas and Hu teaches wherein the one of the plurality of distribution circuits is further configured to: store intermediate result data of a processing cluster coupled to the distribution circuit into a corresponding storage unit, and store the intermediate result data into an external memory by using the direct memory access circuit (AMD Fig. 1.1, 2.1 and 10.2; page 1-1 to 1-2 and 10-2 to 10-3).
Regarding claim 16, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 1 as stated above. Further, AMD as modified in view of Sankaradas and Hu teaches wherein the command processor is further configured to: convert a special function in the specified neural network model into a special instruction that is executable on an execution unit (Sankaradas page 57 section 2 “A CNN is processed into instructions for the microcontroller. Each instruction enables a VPE cluster to independently add or skip convolutions, summation, non-linearity and sub-sampling. All CNNs are mapped into the meta-operation represented by a VPE cluster”; special function – non-linearity function; special instruction - non-linearity function instruction; execution unit - VPE).
Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over AMD in view of Sankaradas and Hu as applied to claim 1 above, and further in view of Yao et al. (US 20220207293 A1), hereinafter Yao.
Regarding claim 2, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 1 as stated above. Further, AMD as modified in view of Sankaradas and Hu teaches
wherein each of the plurality of distribution circuits is (AMD Fig. 1.1 and 2.1).
AMD does not explicitly teach wherein each of the plurality of distribution circuits is coupled to the plurality of processing clusters in the corresponding cluster group by using a first bus, each distribution circuit sends the instruction sequence and the operation data of the instruction sequence to the first bus, and the plurality of processing clusters coupled to the distribution circuit obtains the instruction sequence and the operation data of the instruction sequence from the first bus.
However, in the same field of endeavor, Yao discloses a distribution circuit coupled to a plurality of processing clusters in a corresponding cluster group by using a first bus (Yao Fig. 2A distribution circuit – memory crossbar 216; paragraph [0054] “the memory crossbar 216 receives commands directed to performing memory operations”; paragraph [0061] “Each of the one or more instances of the parallel processing unit 202 can couple with parallel processor memory 222. The parallel processor memory 222 can be accessed via the memory crossbar 216, which can receive memory requests from the processing cluster array 212 as well as the I/O unit 204”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify AMD in view of Sankaradas and Hu and generalize the teaching of Yao by including a corresponding crossbar for each cluster group to read and load the instruction sequence and input data from the on-chip caches to the SIMD units in order to perform memory operations related to the neural network computation (Yao paragraph [0054]).
Therefore, the combination of AMD as modified in view of Sankaradas, Hu and Yao teaches wherein each of the plurality of distribution circuits is coupled to the plurality of processing clusters in the corresponding cluster group by using a first bus, each distribution circuit sends the instruction sequence and the operation data of the instruction sequence to the first bus, and the plurality of processing clusters coupled to the distribution circuit obtains the instruction sequence and the operation data of the instruction sequence from the first bus.
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over AMD in view of Sankaradas and Hu as applied to claim 1 above, and further in view of Moreau et al. (NPL – “SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration”), hereinafter Moreau.
Regarding claim 3, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 1 as stated above. Further, AMD as modified in view of Sankaradas and Hu teaches
wherein one of the plurality of processing clusters comprises (AMD Fig. 1.1 and 2.1 plurality of execution units – SIMD units or VALU units; page 1-2 “The DPP array is the heart of the GCN processor. The array is organized as a set of compute unit pipelines, each independent from the others, that operate in parallel on streams of floating-point or integer data … When it receives a request, the compute unit pipeline loads instructions and data from memory, begins execution, and continues until the end of the kernel. As kernels are running, the GCN hardware automatically fetches instructions and data from memory into on-chip caches”; page 10-2).
AMD does not explicitly teach wherein one of the plurality of processing clusters comprises a cluster control unit and a plurality of execution units that are coupled to the cluster control unit by using a second bus and that have the same function, the cluster control unit obtains the instruction sequence and controls the plurality of execution units coupled to the cluster control unit to separately execute the instruction sequence, and the plurality of execution units coupled to the cluster control unit load the operation data required by the plurality of execution units from the second bus when executing a data loading instruction.
However, in the same field of endeavor, Moreau discloses a processing cluster comprising a cluster control unit and a plurality of execution units that are coupled to the cluster control unit by using a second bus, the cluster control unit obtains the instruction sequence and controls the plurality of execution units coupled to the cluster control unit to separately execute the instruction sequence (Moreau Fig. 1; processing cluster – PU; cluster control unit – control; plurality of execution units – PEs; page 605 right col top “Our design, shown in Figure 1, consists of a cluster of Processing Units (PUs) connected through a bus. Each PU is composed of a control block, a chain of Processing Elements (PEs), and a sigmoid unit, denoted by the SIG block. The PEs form a one-dimensional systolic array that feeds into the sigmoid unit … The PU control block contains a configurable sequencer that orchestrates communication between the PEs and the sigmoid unit”; page 607-608 section IV.V “Sequencer. The sequencer is a finite-state machine that processes microcoded instructions to orchestrate data movement between PEs, input and output queues, and the sigmoid unit within each PU. Each instruction is translated by the sequencer into commands that get forwarded to a physical PE along with the corresponding input data”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify AMD in view of Sankaradas and Hu using Moreau and configure each processing cluster to include a control unit to control operation of the execution units within a cluster group in order to orchestrate communication and data movement in the processing cluster (Moreau page 605 right col top; page 607-608 section IV.V).
Therefore, the combination of AMD as modified in view of Sankaradas, Hu and Moreau teaches wherein one of the plurality of processing clusters comprises a cluster control unit and a plurality of execution units that are coupled to the cluster control unit by using a second bus and that have the same function, the cluster control unit obtains the instruction sequence and controls the plurality of execution units coupled to the cluster control unit to separately execute the instruction sequence, and the plurality of execution units coupled to the cluster control unit load the operation data required by the plurality of execution units from the second bus when executing a data loading instruction.
Claims 5 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over AMD in view of Sankaradas and Hu as applied to claims 1 and 17 above, and further in view of Daultani et al. (WO 2018073975 A1), hereinafter Daultani.
Regarding claim 5, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 4 as stated above.
AMD as modified in view of Sankaradas and Hu does not explicitly teach wherein the converting the high-dimensional matrix operation of the weight data and the activation data into the plurality of two-dimensional matrix operations comprises: converting four-dimensional activation data into a two-dimensional activation data by mapping three dimensions of the four-dimensional activation data into one dimension of the two-dimensional activation data; and converting four-dimensional weight data into a two-dimensional weight data by mapping three dimensions of the four-dimensional weight data into one dimension of the two-dimensional weight data.
However, in the same field of endeavor, Daultani discloses converting a high-dimensional matrix operation of weight data and activation data into a plurality of two-dimensional matrix operations (Daultani Fig. 6B and 8B; page 22 line 21 to page 24 line 2).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify AMD in view of Sankaradas and Hu using Daultani and convert the four-dimensional weight data and activation data into two-dimensional matrices and convert the two-dimensional matrix operations into instruction sequences in order to implement the convolution operations as matrix multiplications when matrix multiplication operations outperform direct convolution operations based on the shape of the weight data and activation data (Daultani page 5 line 19 to page 6 line 5).
Therefore, the combination of AMD as modified in view of Sankaradas, Hu and Daultani teaches wherein the converting the high-dimensional matrix operation of the weight data and the activation data into the plurality of two-dimensional matrix operations comprises: converting four-dimensional activation data into a two-dimensional activation data by mapping three dimensions of the four-dimensional activation data into one dimension of the two-dimensional activation data; and converting four-dimensional weight data into a two-dimensional weight data by mapping three dimensions of the four-dimensional weight data into one dimension of the two-dimensional weight data.
Regarding claim 14, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 4 as stated above.
AMD as modified in view of Sankaradas and Hu does not explicitly teach wherein the weight data is represented as a combination of an index and a non-zero value.
However, in the same field of endeavor, Daultani discloses wherein the weight data is represented as a combination of an index and a non-zero value (Daultani Fig. 7A and 8C; page 25 lines 11-16 “Each entry of the list has a nonzero value and an index of the nonzero element which indicates a position of the value from the learned sparse kernels. More specifically, the index information is the index of the nonzero element along each dimension in a 4-dimensional matrix representation of the learned sparse kernels”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify AMD in view of Sankaradas and Hu using Daultani and configure the command processor to represent the weight data in a sparse representation comprising a combination of an index and a non-zero value in order to reduce unnecessary convolution arithmetic operations (i.e., multiplication with zero) and to avoid unnecessary loading of values from memory, thereby resulting in an overall increase in computation efficiency (Daultani page 18 lines 1-5).
Therefore, the combination of AMD as modified in view of Sankaradas, Hu and Daultani teaches wherein the weight data is represented as a combination of an index and a non-zero value.
Regarding claim 15, AMD as modified in view of Sankaradas, Hu and Daultani teaches all the limitations of claim 14 as stated above. Further, AMD as modified in view of Sankaradas, Hu and Daultani teaches wherein before loading the weight data, the command processor or the distribution circuit converts the weight data into the combination of an index and a non-zero value (Daultani Fig. 3 and page 17 lines 14-23).
Claims 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over AMD in view of Sankaradas and Hu as applied to claim 17 above, and further in view of Vemuri et al. (US 11704535 B1), hereinafter Vemuri.
Regarding claim 18, AMD as modified in view of Sankaradas and Hu teaches all the limitations of claim 17 as stated above.
AMD as modified in view of Sankaradas and Hu does not explicitly teach wherein each of the plurality of storage units comprises a first buffer unit and a second buffer unit, the first buffer unit is configured to load data from an external memory while the second buffer unit is configured to feed data stored therein into the corresponding cluster group.
However, in the same field of endeavor, Vemuri discloses a storage unit comprising a first buffer unit and a second buffer unit, the first buffer unit is configured to load data from an external memory while the second buffer unit is configured to feed data stored therein into an array of processing units (Vemuri Fig. 7 first buffer unit and second buffer unit – ping-pong buffers 714 and/or 716; col 9 lines 26-45 “storage structures (e.g., buffers) of the reconfigurable IC are ping-pong-buffered to allow for processing of one buffer while the IO controller writes to the other buffer or reads from the external memory (e.g., DRAM) to the other buffer. This scheme hides the external memory access latencies and data transfer latencies between on-chip buffers behind compute processes of the reconfigurable IC. This ping-pong-buffering of each storage structure results in a ping buffer and a pong buffer for each storage structure”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify AMD in view of Sankaradas and Hu using Vemuri and configure each storage unit to include a ping-pong buffer arrangement in order to hide the external memory access latencies and data transfer latencies between on-chip buffers (Vemuri col 9 lines 30-33).
Therefore, the combination of AMD as modified in view of Sankaradas, Hu and Vemuri teaches wherein each of the plurality of storage units comprises a first buffer unit and a second buffer unit, the first buffer unit is configured to load data from an external memory while the second buffer unit is configured to feed data stored therein into the corresponding cluster group.
Regarding claim 19, AMD as modified in view of Sankaradas, Hu and Vemuri teaches all the limitations of claim 18 as stated above. Further, AMD as modified in view of Sankaradas, Hu and Vemuri teaches wherein the first buffer unit and the second buffer unit switch roles after each iteration of processing in the corresponding cluster group (Vemuri col 10 lines 6-16 “The feeding multiplexer 722 passes to the DPE array 730 the contents of one of the ping-pong buffers thereby emptying the buffer while withholding the contents of the other, and then while reconfigurable IC fills the emptied buffer of the ping-pong buffers, the feeding multiplexer 722 passes on to the DPE array 730 the contents of the other buffer. This alternating multiplexing pattern continues between the LF state and the PF state, discussed in further details below”).
Response to Arguments
In view of the amendments made, the objections to the drawings and the claims have been withdrawn. However, the amendments raise a new claim objection as discussed above.
Applicant has not provided any argument with respect to the objection to the specification.
The amendments made did not address all of the 35 U.S.C. 112(b) rejections discussed in the previous non-final Office action mailed on 09/17/2025. Further, the amendments raise a new 35 U.S.C. 112(b) rejection as discussed above.
Applicant's arguments filed 11/20/2025 (see Remarks, pages 11-12) with respect to the 35 U.S.C. 103 rejection of claims 1-9 and 12-20 have been fully considered but are not persuasive.
Applicant amended the claim to recite “decompose an operation associated with a specified neural network model into the plurality of sub-operations, the operation corresponding to a high-dimensional matrix operation, and each sub-operation corresponding to a two-dimensional matrix operation on a sub-matrix”, and “select a mapping method from an input stationary mapping method and a weight stationary mapping method, wherein the selected mapping method specifies which one of activation data or weight data is kept in the on-chip memory longer during the plurality of sub-operations”. Applicant argues that Hu does not teach the tile-level residency specification tied to the very 2-D sub-operations produced by the decomposition because the selection of mapping method in Hu chooses a storage pattern at the layer level, not at the level of the generated 2-D sub-operations.
Response: Examiner respectfully disagrees. Applicant is arguing unclaimed features. The claims do not recite or require that the selection of a mapping method be tied to or based on each sub-operation (i.e., a two-dimensional matrix operation on a sub-matrix). The claims only require that the mapping method be selected from an input stationary mapping method and a weight stationary mapping method, wherein the selected mapping method specifies which one of the activation data or the weight data is kept in the on-chip memory longer during the plurality of sub-operations. This is fairly disclosed by Hu in at least page 72117, sections III.B.2 to III.B.4, which disclose selecting from an input stationary mapping, in which the tiled Imap (tiled input) is kept in the on-chip memory longer during the plurality of sub-operations (tiled convolutions), or a weight stationary mapping, in which the tiled weights are kept in the on-chip memory longer during the plurality of sub-operations (tiled convolutions).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Carlo Waje whose telephone number is (571)272-5767. The examiner can normally be reached 9:00-6:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, James Trujillo can be reached at (571) 272-3677. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Carlo Waje/Examiner, Art Unit 2151 (571)272-5767