DETAILED ACTION
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/10/2025 has been entered.
Examiner notes the following: Claims 1, 2, 4, 8-11, and 15-18 have been amended. Claims 1-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Examiner’s Remarks
Certain dependent claims, for example claims 2 and 3, use the transitional phrase "where". Applicant should consider amending the claims to use the more conventional "wherein". See MPEP 2111.04.
Specification
The disclosure is objected to because of the following informalities:
In paragraph 196, line 4, “BX[ k ]” (emphasis added) should read “BY[ k ]” (emphasis added). Examiner notes that this objection was also made in the prior Office action.
Appropriate correction is required.
The lengthy specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant’s cooperation is requested in correcting any errors of which applicant may become aware in the specification.
Claim Objections
Claims 1 and 8 are objected to because of the following informalities:
In claim 1, line 19, “having i elements and” should read as “having i elements; and” (emphasis added).
In claim 8, last line, “column. .” should read as “column.”
Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Umuroglu et al. (NPL: "BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing"), hereinafter Umuroglu, in view of Digilent (NPL: “PYNQ-Z1 Board Reference Manual”), and further in view of Cowan et al. (NPL: “Automatic Generation of High-Performance Quantized Machine Learning Kernels”), hereinafter Cowan, further in view of Pothos et al. (NPL: “Deep Learning Inference with Dynamic Graphs on Heterogeneous Platforms”), hereinafter Pothos.
Regarding claim 1, Umuroglu discloses:
A memory configured to store matrix data [A. Hardware Architecture: “The Fetch Stage: is responsible for reading matrix data from main memory and populating the matrix buffers with data”; C. Programming BISMO: “The RunFetch instruction specifies from where in main memory to read data and the destination matrix buffers to store read data…. The RunResult instruction specifies the base address of the result matrix stored in main memory”; Teaches that memory can hold matrices to be used for matrix multiplication]
A matrix multiply accelerator (MMA) coupled to memory configured to [III The Bit-Serial Matrix Multiplication Overlay: "BISMO consists of a hardware part and a software part. The hardware part is composed of a scalable bit-serial matrix multiplication datapath and associated memory and control logic. The software part generates instructions for the hardware for a given matrix size and precision", teaches the subsystem of Fig 2, with various components and is used for matrix multiplication]:
receive the bit slice weight tensor and the bit slice input data tensor, and multiply the bit slice weight tensor and the bit slice input data tensor to generate an output data matrix [III The Bit-Serial Matrix Multiplication Overlay: The Fetch Stage "reading matrix data from main memory and populating the matrix buffers with data"; The Execute Stage "for performing the matrix multiplication on the data present in the matrix buffers...The DPU computes a partial result of the dot product between a row and column of two bit-matrices"].
However, Umuroglu does not explicitly disclose:
a memory configured to store at least one converted weight matrix and at least one converted input data matrix, the converted weight matrix having a number of rows, a number of columns, a number of elements and a bit resolution, the converted input data matrix including a number of rows, a number of columns, a number of elements and a bit resolution;
A processor, the processor configured to:
generate each row of a converted weight matrix from a weight tensor of a convolution neural network layer, the weight tensor having a height, a width, and a depth greater than one and each row having a number (i) of elements; and generate each column of a converted input matrix from an input feature map of the convolutional neural network layer, each column having i elements; and
for the converted weight matrix: generate, based on the bit resolution, a number of bit slice vectors for each row, each bit slice vector having i elements; and generate a bit slice weight tensor based on the bit slice vectors for each row; and
for the converted input data matrix: generate, based on the bit resolution, a number of bit slice vectors for each column each bit slice vector having i elements and generate a bit slice input data tensor based on the bit slice vectors for each column; and
A MMA coupled to the processor.
where said generate, based on the bit resolution, the number of bit slice vectors for each row of the converted weight matrix includes: arrange elements of the row in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the row; and where said generate, based on the bit resolution, the number of bit slice vectors for each column of the converted input data matrix includes: arrange elements of the column in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the column.
In the analogous art of hardware architectures for co-processor systems, Digilent teaches:
A memory configured to store data [DDR3, Page 9: "The DDR3 is connected to the hard memory controller in the Processor Subsystem (PS), as outlined in the Zynq documentation.", where the DDR3 is connected to the Multiport DRAM Controller in figure 2.1].
A processor coupled to the memory [Application Processing Unit "(APU, which includes 2 Cortex-A9 processors)" (ARM A9 CPUs Page 1), Page 4, where the APU is coupled to the memory as shown in figure 2.1].
A programmable logic subsystem, coupled to the processor and the memory ["The programmable logic is also connected to the interconnect as a slave, and designs can implement multiple cores in the FPGA fabric that each also contain addressable control registers. Furthermore, cores implemented in the PL can trigger interrupts to the processors (connections not shown in Fig. 3) and perform DMA accesses to DDR3 memory." Page 4, this teaches that the programmable logic has cores (controllers) and with figure 2.1 also teaches the use of DSP and RAM components].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu and Digilent before him before the effective filing date of the claimed invention, to implement the dual subsystem taught by Digilent by implementing the MMA [Umuroglu: BISMO] as disclosed by Umuroglu as the programmable logic subsystem taught by Digilent, thereby obtaining various improvements in performance, since Umuroglu already evaluated “BISMO on the Xilinx PYNQ-Z1 board” [Umuroglu: I. Introduction].
However, Umuroglu and Digilent do not explicitly disclose:
a memory configured to store at least one converted weight matrix and at least one converted input data matrix, the converted weight matrix having a number of rows, a number of columns, a number of elements and a bit resolution, the converted input data matrix including a number of rows, a number of columns, a number of elements and a bit resolution;
A processor, the processor configured to:
generate each row of a converted weight matrix from a weight tensor of a convolution neural network layer, the weight tensor having a height, a width, and a depth greater than one and each row having a number (i) of elements; and generate each column of a converted input matrix from an input feature map of the convolutional neural network layer, each column having i elements; and
for the converted weight matrix: generate, based on the bit resolution, a number of bit slice vectors for each row, each bit slice vector having i elements; and generate a bit slice weight tensor based on the bit slice vectors for each row; and
for the converted input data matrix: generate, based on the bit resolution, a number of bit slice vectors for each column each bit slice vector having i elements and generate a bit slice input data tensor based on the bit slice vectors for each column; and
where said generate, based on the bit resolution, the number of bit slice vectors for each row of the converted weight matrix includes: arrange elements of the row in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the row; and where said generate, based on the bit resolution, the number of bit slice vectors for each column of the converted input data matrix includes: arrange elements of the column in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the column.
In the analogous art of quantized matrix multiplication, Cowan teaches:
At least one weight matrix and at least one input data matrix, the weight matrix having a number of rows, a number of columns, a number of elements and a bit resolution, the input data matrix including a number of rows, a number of columns, a number of elements and a bit resolution [Page 310, Kernel Specification: "For example, the schedule might require a matrix multiplication between an 8 × 16 matrix with 2-bit values (the weights) and a 16 × 1 matrix with 1-bit values (the activations)" teaches the use of weight and input (activation) matrices that have rows, columns, and a bit resolution];
A processor, configured to [Arm Neon: "To target low-power ARM processors, we synthesize code in a subset of the ARM NEON vectorized instruction set." Page 311, teaches the use of ARM NEON instruction set to use ARM processors]:
for the weight matrix: generate, based on the bit resolution, a number of bit slice vectors for each row, each bit slice vector having i elements; and generate a bit slice weight tensor based on the bit slice vectors for each row; and for the input data matrix: generate, based on the bit resolution, a number of bit slice vectors for each column each bit slice vector having i elements and generate a bit slice input data tensor based on the bit slice vectors for each column; and ["We can extend the above approach to larger weights and activations by slicing the bits of the weights and activations into bitplanes and then packing them into vectors." Page 307, Multi-bit quantization for both weights and activations (inputs); Figure 2 shows the breakdown of data into bitplane vectors and shows the number of elements in the bit slice vectors is equal to the number of elements in the corresponding vectors for matrix multiplication; "It takes as input a d-dimensional tensor and returns a d + 1-dimensional tensor, with a new bit axis that indexes the bitplanes of the original values." Page 308, Bit-Slicing Schedules, teaches expanding into a higher dimensional tensor to include the bitplanes];
where said generate, based on the bit resolution, the number of bit slice vectors for each row of the weight matrix includes: arrange elements of the row in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the row [Multi-bit quantization: “The first step is to decompose each value in the vectors w and a into their constituent bits at the corresponding bitwidth. The resulting vectors are called bitplanes; for example, the first bitplane vector w0 holds the least-significant bit of each weight in the vector w.”; Figure 2, shows expanding the weight data into a bit form and generating a bitslice vector based on the corresponding bitwidth]; and
where said generate, based on the bit resolution, the number of bit slice vectors for each column of the input data matrix includes: arrange elements of the column in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the column [Multi-bit quantization: “The first step is to decompose each value in the vectors w and a into their constituent bits at the corresponding bitwidth. The resulting vectors are called bitplanes; for example, the first bitplane vector w0 holds the least-significant bit of each weight in the vector w.”; Figure 2, shows expanding the activation data into a bit form and generating a bitslice vector based on the corresponding bitwidth].
And wherein the operations are applied to convolutional neural networks [“we focus on both fully connected and convolutional neural networks” Sec.2 Quantized Models].
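For illustration only (this sketch is not code from any cited reference; the function name, the assumption of unsigned values, and the array layout are the editor's), the bitplane decomposition quoted above from Cowan can be summarized as:

```python
import numpy as np

def bit_slice(vec, bits):
    """Decompose an unsigned integer vector into its bitplanes.

    Returns a (bits, len(vec)) array: row b holds bit b of every
    element, so row 0 is the least-significant bitplane (Cowan's w0).
    """
    vec = np.asarray(vec, dtype=np.uint32)
    return np.stack([((vec >> b) & 1).astype(np.uint8) for b in range(bits)])

# Example: 2-bit weights w = [3, 1, 2, 0]
planes = bit_slice([3, 1, 2, 0], bits=2)
# planes[0] (LSB plane) -> [1, 1, 0, 0]; planes[1] (MSB plane) -> [1, 0, 1, 0]
```

Each bit slice vector has the same number of elements as the original row or column, and the number of planes equals the bit resolution, consistent with the mapping above.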
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before him before the effective filing date of the claimed invention to incorporate the bit-slicing instruction set as taught by Cowan into the processor as disclosed by Umuroglu and Digilent, to allow for bitwise operations on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors [Cowan: Quantized Models Pages 306-307, ARM NEON Page 311].
However, Umuroglu, Digilent, and Cowan do not explicitly disclose:
a memory configured to store at least one converted weight matrix and at least one converted input data matrix, the converted weight matrix having a number of rows, a number of columns, a number of elements and a bit resolution, the converted input data matrix including a number of rows, a number of columns, a number of elements and a bit resolution;
A processor, the processor configured to:
generate each row of a converted weight matrix from a weight tensor of a convolution neural network layer, the weight tensor having a height, a width, and a depth greater than one and each row having a number (i) of elements; and generate each column of a converted input matrix from an input feature map of the convolutional neural network layer, each column having i elements; and
for the converted weight matrix: generate, based on the bit resolution, a number of bit slice vectors for each row, each bit slice vector having i elements; and generate a bit slice weight tensor based on the bit slice vectors for each row; and
for the converted input data matrix: generate, based on the bit resolution, a number of bit slice vectors for each column each bit slice vector having i elements and generate a bit slice input data tensor based on the bit slice vectors for each column; and
where said generate, based on the bit resolution, the number of bit slice vectors for each row of the converted weight matrix includes: arrange elements of the row in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the row; and where said generate, based on the bit resolution, the number of bit slice vectors for each column of the converted input data matrix includes: arrange elements of the column in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the column.
In the analogous art of General matrix multiplication algorithms for convolutional neural networks, Pothos teaches:
A processor [“the above procedure, described for CPU-based implementations” p.166], the processor configured to:
generate each row of a converted weight matrix from a weight tensor of a convolution neural network layer, the weight tensor having a height, a width, and a depth greater than one and each row having a number (i) of elements; and generate each column of a converted input matrix from an input feature map of the convolutional neural network layer, each column having i elements [Figure 2, discloses converting filters and input layers into two matrices, wherein each filter tensor is converted to a row of Kc × Kh × Kw elements producing the converted weight matrix and the input tensor is converted to a converted input matrix with Kc × Kh × Kw elements in each column].
Cowan further discloses that the bit-slicing can support GEMM operators [“We have implemented a library of operators that support the bitpacking transformation ... The library targets common neural network operators such as 2D convolutions and dense matrix multiplication (GEMM).” Operators supporting bit-packing scheduling, Sec. 3.1].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, Cowan, and Pothos before him before the effective filing date of the claimed invention, to modify the processor disclosed by the combination of Umuroglu, Digilent, and Cowan to preprocess the filter and input tensors into their respective converted matrices, allowing the use of GEMM-optimized libraries with respect to data prefetching/caching, vectorization, and threading mechanisms for improved performance [Pothos: Sec. 3.1], while still using the matrix operations and respective MMA given by the combination of Umuroglu, Digilent, and Cowan. The combination of Umuroglu, Digilent, Cowan, and Pothos discloses a converted weight matrix and a converted input matrix that are simply the matrices to be multiplied. As such, the combination of Umuroglu, Digilent, Cowan, and Pothos discloses the limitations using the converted matrices in place of the matrices disclosed by Umuroglu, Digilent, and Cowan.
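For illustration only (the function names, stride-1/no-padding convention, and element ordering are the editor's assumptions, not taken from Pothos), the tensor-to-matrix conversion of Pothos Figure 2 can be sketched as:

```python
import numpy as np

def convert_weights(filters):
    """Flatten each (Kc, Kh, Kw) filter into one row of Kc*Kh*Kw elements."""
    n = filters.shape[0]
    return filters.reshape(n, -1)

def convert_input(fmap, kh, kw):
    """im2col-style conversion: each receptive field of the (Kc, H, W)
    input feature map becomes one column of Kc*kh*kw elements
    (stride 1, no padding assumed)."""
    kc, h, w = fmap.shape
    cols = [fmap[:, r:r + kh, c:c + kw].reshape(-1)
            for r in range(h - kh + 1)
            for c in range(w - kw + 1)]
    return np.stack(cols, axis=1)

# The convolution then reduces to a single GEMM:
# out = convert_weights(filters) @ convert_input(fmap, Kh, Kw)
```

With this layout, the number of columns of the converted weight matrix (Kc × Kh × Kw) equals the number of rows of the converted input matrix, as relied upon for claim 2.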
Regarding claim 2, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 1 above.
Cowan teaches:
The number of columns of the weight matrix is the same as the number of rows of the input data matrix [Page 310, Kernel Specification: "For example, the schedule might require a matrix multiplication between an 8 × 16 matrix with 2-bit values (the weights) and a 16 × 1 matrix with 1-bit values (the activations)" teaches the use of weight and input (activation) matrices that have an equal number of columns and rows, respectively, which is a natural consequence of matrix multiplication];
However, Umuroglu does not explicitly disclose: the number of columns of the converted weight matrix is the same as the number of rows of the converted input data matrix; for each row of the converted weight matrix: each bit slice vector includes one bit from each element within the row; each bit slice vector is associated with a different bit position; and the number of bit slice vectors is the same as the bit resolution of the converted weight matrix; and the bit resolution of the converted weight matrix is variable.
In the analogous art of hardware architectures for co-processor systems, Digilent teaches:
A processor coupled to the memory [Application Processing Unit "(APU, which includes 2 Cortex-A9 processors)" (ARM A9 CPUs Page 1), Page 4, where the APU is coupled to the memory as shown in figure 2.1].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu and Digilent before him before the effective filing date of the claimed invention, to implement the dual subsystem taught by Digilent by implementing the MMA [Umuroglu: BISMO] as disclosed by Umuroglu as the programmable logic subsystem taught by Digilent, thereby obtaining various improvements in performance, since Umuroglu already evaluated “BISMO on the Xilinx PYNQ-Z1 board” [Umuroglu: I. Introduction].
However, Umuroglu and Digilent do not explicitly disclose: the number of columns of the converted weight matrix is the same as the number of rows of the converted input data matrix; for each row of the converted weight matrix: each bit slice vector includes one bit from each element within the row; each bit slice vector is associated with a different bit position; and the number of bit slice vectors is the same as the bit resolution of the converted weight matrix; and the bit resolution of the converted weight matrix is variable.
In the analogous art of quantized matrix multiplication, Cowan teaches:
for each row of the weight matrix:
each bit slice vector includes one bit from each element within the row [Single-bit quantization, "pack the bits for each weight w_ij^k in row i into a single four-bit bit-vector ŵ^k" teaches packing weights in a row into a bit-vector];
each bit slice vector is associated with a different bit position; and the number of bit slice vectors is the same as the bit resolution of the weight matrix [Multi bit quantization, page 307, "We can extend the above approach to larger weights and activations by slicing the bits of the weights and activations into bitplanes and then packing them into vectors", "The first step is to decompose each value in the vectors w and a into their constituent bits at the corresponding bitwidth. The resulting vectors are called bitplanes;"],
and the bit resolution of the weight matrix is variable [Figure 11, shows the bit resolution can be variable].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before him before the effective filing date of the claimed invention to incorporate the bit-slicing instruction set as taught by Cowan into the processor as disclosed by Umuroglu and Digilent, to allow for bitwise operations on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors [Cowan: Quantized Models Pages 306-307, ARM NEON Page 311].
However, Umuroglu, Digilent, and Cowan do not explicitly disclose: the number of columns of the converted weight matrix is the same as the number of rows of the converted input data matrix; for each row of the converted weight matrix: each bit slice vector includes one bit from each element within the row; each bit slice vector is associated with a different bit position; and the number of bit slice vectors is the same as the bit resolution of the converted weight matrix; and the bit resolution of the converted weight matrix is variable.
In the analogous art of General matrix multiplication algorithms for convolutional neural networks, Pothos teaches converted weight matrix and input data matrix, wherein the number of columns of the converted weight matrix equals the number of rows of the converted input data matrix [Figure 2, discloses converting filters and input layers into 2 matrices, wherein each filter tensor is converted to a converted weight matrix with the number of columns is Kc × Kh × Kw elements and input tensor is converted to converted input matrix with a number of row equal to Kc x Kh x Kw].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, Cowan, and Pothos before him before the effective filing date of the claimed invention, to modify the processor disclosed by the combination of Umuroglu, Digilent, and Cowan to preprocess the filter and input tensors into their respective converted matrices, allowing the use of GEMM-optimized libraries with respect to data prefetching/caching, vectorization, and threading mechanisms for improved performance [Pothos: Sec. 3.1], while still using the matrix operations and respective MMA given by the combination of Umuroglu, Digilent, and Cowan. The combination of Umuroglu, Digilent, Cowan, and Pothos discloses the additional limitations of the claim.
Regarding claim 3, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 2 above. However, Umuroglu and Digilent do not explicitly disclose the additional limitations of claim 3.
In the analogous art of quantized matrix multiplication, Cowan teaches:
for each column of the input data matrix:
each bit slice vector includes one bit from each element within the column [Fig 1, "Computing the output of a layer involves a matrix multiplication followed by an activation function. For example, in a fully connected network, we can compute the activations for layer k +1 as the matrix-vector product of the weights for layer k and activations for layer k, as shown in Figure 1. A convolutional network is similar but with higher dimension tensors for the weights and activations (i.e., more dot products required)" Page 306, teaches the input data in a format of a column as a natural consequence of matrix multiplication; Single-bit quantization: "pack the activations a_i^k into a four-bit bit-vector â_i^k" Page 307];
each bit slice vector is associated with a different bit position; and the number of bit slice vectors is the same as the bit resolution of the input data matrix [Multi bit quantization, page 307, "We can extend the above approach to larger weights and activations by slicing the bits of the weights and activations into bitplanes and then packing them into vectors", "The first step is to decompose each value in the vectors w and a into their constituent bits at the corresponding bitwidth. The resulting vectors are called bitplanes;"].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before him before the effective filing date of the claimed invention to incorporate the bit-slicing instruction set as taught by Cowan into the processor as disclosed by Umuroglu and Digilent, to allow for bitwise operations on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors [Cowan: Quantized Models Pages 306-307, ARM NEON Page 311].
However, Umuroglu, Digilent, and Cowan do not explicitly disclose, for each column of the converted input data matrix: each bit slice vector includes one bit from each element within the column; each bit slice vector is associated with a different bit position; and the number of bit slice vectors is the same as the bit resolution of the converted input data matrix.
In the analogous art of General matrix multiplication algorithms for convolutional neural networks, Pothos teaches converted weight matrix and input data matrix [Figure 2, discloses converting filters and input layers into 2 matrices, wherein each filter tensor is converted to a converted weight matrix and input tensor is converted to converted input matrix].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, Cowan, and Pothos before him before the effective filing date of the claimed invention, to modify the processor disclosed by the combination of Umuroglu, Digilent, and Cowan to preprocess the filter and input tensors into their respective converted matrices, allowing the use of GEMM-optimized libraries with respect to data prefetching/caching, vectorization, and threading mechanisms for improved performance [Pothos: Sec. 3.1], while still using the matrix operations and respective MMA given by the combination of Umuroglu, Digilent, and Cowan. The combination of Umuroglu, Digilent, Cowan, and Pothos discloses the additional limitations of the claim.
Regarding claim 4, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 3 above.
Umuroglu further teaches:
where the MMA includes:
a local memory [Main Memory; Runtime Performance, Peak Binary Compute: "we assume the matrices have already been fetched into on-chip memory and disregard the cost of result writing" teaches the use of local memory];
a controller coupled to the local memory [Figure 2, set of synced Controllers controlling each stage based on instructions; The Fetch Stage: "is responsible for reading matrix data from main memory and populating the matrix buffers with data", Matrix buffers are input/weight buffers (i.e. the first and second registers); "The Execute Stage: is responsible for performing the matrix multiplication on the data present in the matrix buffers. The core of the stage consists of an array of dot product units (DPUs)"; "The Result Stage: is responsible for writing the results... accumulated dot-products are written to the result buffer from which the result stage writes them to main memory." where the result buffer is the "third register"];
a first register, coupled to the controller and the local memory, configured to store at least a portion of the bit slice input data tensor; a second register, coupled to the controller and the local memory, configured to store at least a portion of the bit slice weight tensor [The Fetch Stage: "responsible for reading matrix data from main memory and populating the matrix buffers with data...The read data and its destination form a packet that is carried through the interconnect to the appropriate matrix buffer" Figure 3, Blue Matrix Buffers D0-Dm-1 and Teal Matrix Buffers D0 to Dn-1];
a third register, coupled to the controller and the local memory, configured to store at least a portion of the output data matrix [The Result Stage: "When the execute stage has produced a new set of results, the accumulated dot-products are written to the result buffer from which the result stage writes them to main memory"]; and
an array of bit slice dot product (BSDP) elements [Figure 3, Array of DPUs], coupled to the controller and the first, second and third registers, configured to multiply the bit slice weight tensor and the bit slice input data tensor, each BSDP element configured to generate a dot product between one row of the weight matrix and one column of the input data matrix [The Execute Stage: "The DPU computes a partial result of the dot product between a row and column of two bit-matrices… The single bit multiplications are performed by a multi-bit logic AND operation and the summation is a simple population count (popcount) of the result."].
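For illustration only (this is the editor's sketch of the AND/popcount scheme quoted above, not Umuroglu's implementation; the plane-packing convention and function name are assumptions), a DPU-style bit-serial dot product can be modeled as:

```python
def bitserial_dot(w_planes, a_planes):
    """Dot product of two quantized vectors from packed bitplanes.

    Each plane is an int whose binary digits hold one bit position of
    every element (plane 0 = LSBs). Single-bit multiplications are a
    bitwise AND, the summation is a popcount, and each partial result
    is weighted by 2**(i+j) and accumulated, mirroring the quoted
    AND-plus-popcount description of the DPU.
    """
    acc = 0
    for i, wp in enumerate(w_planes):
        for j, ap in enumerate(a_planes):
            acc += bin(wp & ap).count("1") << (i + j)
    return acc

# 2-bit weights w = [3, 1, 2, 0] -> LSB plane 0b0011, MSB plane 0b0101
# 1-bit activations a = [1, 1, 0, 1] -> single plane 0b1011
# bitserial_dot([0b0011, 0b0101], [0b1011]) == 3*1 + 1*1 + 2*0 + 0*1 == 4
```

The nested loop over weight and activation bitplanes is what lets the same AND/popcount hardware serve variable bit resolutions.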
However, Umuroglu, Digilent, and Cowan do not explicitly disclose each BSDP element configured to generate a dot product between one row of the converted weight matrix and one column of the converted input data matrix.
In the analogous art of general matrix multiplication algorithms for convolutional neural networks, Pothos teaches a converted weight matrix and a converted input data matrix [Figure 2, discloses converting filters and input layers into two matrices, wherein each filter tensor is converted to a converted weight matrix and the input tensor is converted to a converted input matrix].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, Cowan, and Pothos before him before the effective filing date of the claimed invention, to modify the processor disclosed by the combination of Umuroglu, Digilent, and Cowan to preprocess the filter and input tensors into their respective converted matrices, allowing the use of GEMM-optimized libraries with respect to data prefetching/caching, vectorization, and threading mechanisms for improved performance [Pothos: Sec.3.1], while still using the matrix operations and respective MMA given by the combination of Umuroglu, Digilent, and Cowan. The combination of Umuroglu, Digilent, Cowan, and Pothos discloses the additional limitations of the claim.
Regarding claim 5, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 4 above.
Umuroglu further teaches:
where each BSDP element [Figure 4 dot product unit (DPU)] includes:
a bit-wise AND circuit configured to input a first operand from the first register, input a second operand from the second register, and output a resultant value [The Execution Stage: "where each DPU is fed with a design-time configurable number of bits (Dk) from the left-hand-side and right-hand-side matrix buffers” i.e. first and second registers; “The single bit multiplications are performed by a multi-bit logic AND operation”];
a popcount circuit configured to receive the resultant value and output an intermediate value [The Execution Stage: “the summation is a simple population count (popcount) of the result” (of the multi-bit logic AND operation) and “The weight in Algorithm 1 is implemented by a left-shift unit and optional negation”];
an ADDER circuit configured to add the intermediate value to an accumulated value; and an accumulation register configured to store the accumulated value, and output a final accumulated value to the third register [The Execution Stage: “The partial results are accumulated and stored in a register (Acc.) of width A, which is typically 32 bits [5], [6] to avoid overflow"].
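For illustration, the DPU datapath cited above (bitwise AND for the single-bit multiplications, a population count of the result, a left-shift weighting, and accumulation into a register) can be sketched as follows. The function and variable names are the examiner's illustrative choices, not from Umuroglu:

```python
def dpu_partial(lhs_bits: int, rhs_bits: int, shift: int, acc: int) -> int:
    """One DPU step: AND the packed bit vectors, popcount the result,
    left-shift by the bit-position weight, and add into the accumulator."""
    anded = lhs_bits & rhs_bits           # single-bit multiplications
    contribution = bin(anded).count("1")  # population count (popcount)
    return acc + (contribution << shift)  # weighted accumulation

# Example: 4-element binary vectors packed into integers
acc = dpu_partial(0b1011, 0b1101, shift=0, acc=0)  # -> 2 (binary dot product)
```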
Regarding claim 6, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 5 above. However, Umuroglu and Digilent do not explicitly disclose the additional limitations of the claim.
In the analogous art of quantized matrix multiplication, Cowan teaches:
The use of any register as inputs for operations [Compute Sketch: "where each instruction can be either a bitwise operation (and, or, not, addition, etc.) or a special population count intrinsic instruction. In both cases, the synthesizer is free to choose any live registers as the inputs to the instruction" Page 310, teaches the use of any active register to hold values for bitwise operations, such as AND and SHIFT bitwise operations, for input into the operations].
the first operand is a bit slice vector from the bit slice input data tensor having an index k equal to the associated bit position of the bit slice vector; and the second operand is a bit slice vector from the bit slice weight tensor having an index j equal to the associated bit position of the bit slice vector [Multi-bit quantization: "The resulting vectors are called bitplanes; for example, the first bitplane vector w0 holds the least-significant bit of each weight in the vector w. We can then pack each bitplane into larger elements in this case, a single uint4 value, but other configurations such as two uint2s or four uint1s are possible. Given these bit packed values wi and ai, we can compute ŵ * â = Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} 2^(n+m) popcount(ŵ_n & â_m), where N and M are the bitwidths for weights and activations, respectively (N = 3 and M = 2 in the example)" Page 307, teaches the index m representing index k for input, and index n representing index j for weights, where the equation shows the relation of the indices and the element being used as an operand].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before him before the effective filing date of the claimed invention to incorporate the bit-slicing instruction set as taught by Cowan into the processor as disclosed by Umuroglu and Digilent, to allow for bitwise operations on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors [Cowan: Quantized Models Pages 306-307, ARM NEON Page 311].
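The Cowan equation quoted above can be checked numerically. The following is the examiner's illustrative sketch; the values and names are hypothetical, not taken from the reference:

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

# 4-element vectors with N = 3-bit weights and M = 2-bit activations
w = [5, 3, 6, 1]   # weights, values fit in 3 bits
a = [2, 1, 3, 0]   # activations, values fit in 2 bits
N, M = 3, 2

def bitplane(vec, n):
    """Pack bitplane n of a vector into one integer (element e -> bit e)."""
    return sum(((v >> n) & 1) << e for e, v in enumerate(vec))

# w^ * a^ = sum_{n=0}^{N-1} sum_{m=0}^{M-1} 2^(n+m) popcount(w^_n & a^_m)
result = sum(
    (1 << (n + m)) * popcount(bitplane(w, n) & bitplane(a, m))
    for n in range(N) for m in range(M)
)
assert result == sum(wi * ai for wi, ai in zip(w, a))  # equals the dot product
```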
Regarding claim 7, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 6 above.
Umuroglu further teaches:
the index value being equal to j+k [Algorithm 1, Line 7, teaches adding both indices i (index j) and j (index k) together for shifting of data as “weight”];
count a number of bits set to one in the resultant value to generate a population count value [The Execute Stage: "The single bit multiplications are performed by a multi-bit logic AND operation and the summation is a simple population count (popcount) of the result." ]; and
left-shift the population count value based on the index value to generate the intermediate value [Algorithm 1, Line 7, binary shift based on the sum of two indices i and j, and accounts for both signed and unsigned integers; The Execution Stage: "The weight in Algorithm 1 is implemented by a left-shift unit and optional negation"].
In the analogous art of quantized matrix multiplication, Cowan teaches:
The use of any register as inputs for operations [Compute Sketch: "where each instruction can be either a bitwise operation (and, or, not, addition, etc.) or a special population count intrinsic instruction. In both cases, the synthesizer is free to choose any live registers as the inputs to the instruction" Page 310, teaches the use of any active register to hold values for bitwise operations or a special population count instruction, such as popcount and SHIFT bitwise operations, for input]
receive an index value from the second register, the index value being equal to j+k [Multi-bit quantization: “we can compute ŵ * â = Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} 2^(n+m) popcount(ŵ_n & â_m), where N and M are the bitwidths for weights and activations” teaches the sum of n+m for shifting the popcount value];
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before him before the effective filing date of the claimed invention to incorporate the bit-slicing instruction set as taught by Cowan into the processor as disclosed by Umuroglu and Digilent, to allow for bitwise operations on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors [Cowan: Quantized Models Pages 306-307, ARM NEON Page 311].
Regarding claim 8, Umuroglu discloses:
A memory configured to store matrix data [A. Hardware Architecture: “The Fetch Stage: is responsible for reading matrix data from main memory and populating the matrix buffers with data”; C. Programming BISMO: “The RunFetch instruction specifies from where in main memory to read data and the destination matrix buffers to store read data…. The RunResult instruction specifies the base address of the result matrix stored in main memory”; Teaches that memory can hold matrices to be used for matrix multiplication]
a matrix multiply accelerator (MMA), coupled to the memory, including a local memory, an array of bit slice dot product (BSDP) elements, and a controller coupled to the local memory and the array [Figure 2; III The Bit-Serial Matrix Multiplication Overlay: "BISMO consists of a hardware part and a software part. The hardware part is composed of a scalable bit-serial matrix multiplication datapath and associated memory and control logic. The software part generates instructions for the hardware for a given matrix size and precision", teaches the subsystem of Fig 2, with various components and is used for matrix multiplication; Main Memory; Runtime Performance, Peak Binary Compute: "we assume the matrices have already been fetched into on-chip memory and disregard the cost of result writing" teaches the use of local memory; Figure 3, Array of DPUs]:
The controller is configured to execute instructions [Hardware Architecture: "Fig. 2 provides an overview of the BISMO hardware. The architecture is organized into three pipeline stages fetch, execute, and result. Each stage communicates data to the next stage via shared on-chip memory buffers. Inter-stage synchronization is achieved by blocking reads and writes to synchronization FIFOs. All stage operations, including datapath control and synchronization, are controlled by instructions, which are fetched from instruction queues and executed in order", teaches the controllers executing instructions for the accelerator];
receive the weight matrix and the input data matrix [III The Bit-Serial Matrix Multiplication Overlay: The Fetch Stage "reading matrix data from main memory and populating the matrix buffers with data"; The Execute Stage "for performing the matrix multiplication on the data present in the matrix buffers...The DPU computes a partial result of the dot product between a row and column of two bit-matrices"].
the array is configured to:
multiply the bit slice weight tensor and the bit slice input data tensor to generate an output data matrix. [The Result stage: “When the execute stage has produced a new set of results, the accumulated dot-products are written to the result buffer from which the result stage writes them to main memory”, The Run Instructions: “The RunResult instruction specifies the base address of the result matrix stored in main memory and an offset to which the current results are to be written”].
However, Umuroglu does not explicitly disclose:
a memory configured to store at least one converted weight matrix and at least one converted input data matrix, the converted weight matrix having a number of rows, a number of columns, a number of elements and a bit resolution, the converted input data matrix including a number of rows, a number of columns, a number of elements and a bit resolution;
A processor coupled to memory; the processor configured to:
generate each row of a converted weight matrix from a weight tensor of a convolution neural network layer, the weight tensor having a height, a width, and a depth greater than one and each row having a number (i) of elements; and generate each column of a converted input matrix from an input feature map of the convolutional neural network layer, each column having i elements; and
A MMA coupled to the processor.
the controller configured to:
receive the converted weight matrix and the converted input data matrix,
for the converted weight matrix: generate, based on the bit resolution, a number of bit slice vectors for each row, each bit slice vector having i elements; and generate a bit slice weight tensor based on the bit slice vectors for each row; and
for the converted input data matrix: generate, based on the bit resolution, a number of bit slice vectors for each column each bit slice vector having i elements and generate a bit slice input data tensor based on the bit slice vectors for each column; and
where said generate, based on the bit resolution, the number of bit slice vectors for each row of the converted weight matrix includes: arrange elements of the row in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the row; and where said generate, based on the bit resolution, the number of bit slice vectors for each column of the converted input data matrix includes: arrange elements of the column in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the column.
In the analogous art of Hardware architecture for co-processors systems, Digilent teaches:
A Memory configured to store data [DDR3 Page 9; "The DDR3 is connected to the hard memory controller in the Processor Subsystem (PS), as outlined in the Zynq documentation." Where the DDR3 is connected to the Multiport DRAM Controller in figure 2.1].
A processor coupled to the memory [Application Processing Unit "(APU, which includes 2 Cortex-A9 processors)" (ARM A9 CPUs Page 1), Page 4, where the APU is coupled to the memory as shown in figure 2.1].
A programmable logic subsystem, coupled to the processor and the memory ["The programmable logic is also connected to the interconnect as a slave, and designs can implement multiple cores in the FPGA fabric that each also contain addressable control registers. Furthermore, cores implemented in the PL can trigger interrupts to the processors (connections not shown in Fig. 3) and perform DMA accesses to DDR3 memory." Page 4, this teaches that the programmable logic has cores (controllers) and with figure 2.1 also teaches the use of DSP and RAM components].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu and Digilent before him before the effective filing date of the claimed invention to implement the dual subsystem taught by Digilent, by implementing the MMA [Umuroglu: BISMO] as disclosed by Umuroglu as the programmable logic subsystem taught by Digilent, since Umuroglu already evaluated “BISMO on the Xilinx PYNQ-Z1 board” [Umuroglu: I. Introduction], for various improvements in performance.
However, Umuroglu and Digilent do not explicitly disclose:
a memory configured to store at least one converted weight matrix and at least one converted input data matrix, the converted weight matrix having a number of rows, a number of columns, a number of elements and a bit resolution, the converted input data matrix including a number of rows, a number of columns, a number of elements and a bit resolution;
A processor, the processor configured to:
generate each row of a converted weight matrix from a weight tensor of a convolution neural network layer, the weight tensor having a height, a width, and a depth greater than one and each row having a number (i) of elements; and generate each column of a converted input matrix from an input feature map of the convolutional neural network layer, each column having i elements; and
the controller configured to:
receive the converted weight matrix and the converted input data matrix,
for the converted weight matrix: generate, based on the bit resolution, a number of bit slice vectors for each row, each bit slice vector having i elements; and generate a bit slice weight tensor based on the bit slice vectors for each row; and
for the converted input data matrix: generate, based on the bit resolution, a number of bit slice vectors for each column each bit slice vector having i elements and generate a bit slice input data tensor based on the bit slice vectors for each column; and
where said generate, based on the bit resolution, the number of bit slice vectors for each row of the converted weight matrix includes: arrange elements of the row in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the row; and where said generate, based on the bit resolution, the number of bit slice vectors for each column of the converted input data matrix includes: arrange elements of the column in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the column.
In the analogous art of quantized matrix multiplication, Cowan teaches:
At least one weight matrix and at least one input data matrix, the weight matrix having a number of rows, a number of columns, a number of elements and a bit resolution, the input data matrix including a number of rows, a number of columns, a number of elements and a bit resolution [Page 310, Kernel Specification: "For example, the schedule might require a matrix multiplication between an 8 × 16 matrix with 2-bit values (the weights) and a 16 × 1 matrix with 1-bit values (the activations)" teaches the use of weight and input (activation) matrices that have rows, columns, and a bit resolution];
A processor [Arm Neon: "To target low-power ARM processors, we synthesize code in a subset of the ARM NEON vectorized instruction set." Page 311, teaches the use of ARM NEON instruction set to use ARM processors]:
An instruction set for generating bit sliced data [Arm Neon: "To target low-power ARM processors, we synthesize code in a subset of the ARM NEON vectorized instruction set." Page 311, teaches the use of the ARM NEON instruction set to target ARM processors; Other platforms: ”We have also implemented a synthesis backend for x86’s AVX2 vector instruction set… our experience with x86 suggests that synthesis enables rapid porting to new architectures, including potentially to programmable accelerators”]
for the weight matrix: generate, based on the bit resolution, a number of bit slice vectors for each row, each bit slice vector having i elements; and generate a bit slice weight tensor based on the bit slice vectors for each row; and for the input data matrix: generate, based on the bit resolution, a number of bit slice vectors for each column each bit slice vector having i elements and generate a bit slice input data tensor based on the bit slice vectors for each column; and ["We can extend the above approach to larger weights and activations by slicing the bits of the weights and activations into bitplanes and then packing them into vectors." Page 307, Multi-bit quantization for both weights and activations (inputs); Figure 2 shows the breakdown of data into bitplane vectors and shows the number of elements in each bit slice vector is equal to the number of elements in the corresponding vector for matrix multiplication; "It takes as input a d-dimensional tensor and returns a d + 1-dimensional tensor, with a new bit axis that indexes the bitplanes of the original values." Page 308, Bit-Slicing Schedules, teaches expanding into a higher dimensional tensor to include the bitplanes];
where said generate, based on the bit resolution, the number of bit slice vectors for each row of the weight matrix includes: arrange elements of the row in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the row [Multi-bit quantization: “The first step is to decompose each value in the vectors w and a into their constituent bits at the corresponding bitwidth. The resulting vectors are called bitplanes; for example, the first bitplane vector w0 holds the least-significant bit of each weight in the vector w.”; Figure 2, shows expanding the weight data into a bit form and generating a bitslice vector based on the corresponding bitwidth]; and
where said generate, based on the bit resolution, the number of bit slice vectors for each column of the input data matrix includes: arrange elements of the column in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the column [Multi-bit quantization: “The first step is to decompose each value in the vectors w and a into their constituent bits at the corresponding bitwidth. The resulting vectors are called bitplanes; for example, the first bitplane vector w0 holds the least-significant bit of each weight in the vector w.”; Figure 2, shows expanding the activation data into a bit form and generating a bitslice vector based on the corresponding bitwidth].
And wherein the operations are applied to convolutional neural networks [“we focus on both fully connected and convolutional neural networks” Sec.2 Quantized Models].
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before him before the effective filing date of the claimed invention to incorporate the bit-slicing instruction set as taught by Cowan into the controller as disclosed by Umuroglu and Digilent, to allow for bitwise operations on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors [Cowan: Quantized Models Pages 306-307, ARM NEON Page 311].
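The bitplane decomposition cited from Cowan above (each element decomposed into its constituent bits, with bitplane 0 holding the least-significant bit of every element) can be sketched as follows; this is the examiner's minimal illustration, and the names are not from the reference:

```python
def to_bitplanes(vec, bitwidth):
    """Decompose each element of vec into its bits; bitplane b holds
    bit b of every element (bitplane 0 = least-significant bits)."""
    return [[(v >> b) & 1 for v in vec] for b in range(bitwidth)]

w = [5, 3, 6, 1]             # 3-bit weights
planes = to_bitplanes(w, 3)
# planes[0] holds the least-significant bit of each weight:
# planes == [[1, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 0]]
```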
However, Umuroglu, Digilent, and Cowan do not explicitly disclose:
a memory configured to store at least one converted weight matrix and at least one converted input data matrix, the converted weight matrix having a number of rows, a number of columns, a number of elements and a bit resolution, the converted input data matrix including a number of rows, a number of columns, a number of elements and a bit resolution;
A processor, the processor configured to:
generate each row of a converted weight matrix from a weight tensor of a convolution neural network layer, the weight tensor having a height, a width, and a depth greater than one and each row having a number (i) of elements; and generate each column of a converted input matrix from an input feature map of the convolutional neural network layer, each column having i elements; and
the controller configured to:
receive the converted weight matrix and the converted input data matrix,
for the converted weight matrix: generate, based on the bit resolution, a number of bit slice vectors for each row, each bit slice vector having i elements; and generate a bit slice weight tensor based on the bit slice vectors for each row; and
for the converted input data matrix: generate, based on the bit resolution, a number of bit slice vectors for each column each bit slice vector having i elements and generate a bit slice input data tensor based on the bit slice vectors for each column; and
where said generate, based on the bit resolution, the number of bit slice vectors for each row of the converted weight matrix includes: arrange elements of the row in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the row; and where said generate, based on the bit resolution, the number of bit slice vectors for each column of the converted input data matrix includes: arrange elements of the column in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the column.
In the analogous art of General matrix multiplication algorithms for convolutional neural networks, Pothos teaches:
A processor [“the above procedure, described for CPU-based implementations” p.166], the processor configured to:
generate each row of a converted weight matrix from a weight tensor of a convolution neural network layer, the weight tensor having a height, a width, and a depth greater than one and each row having a number (i) of elements; and generate each column of a converted input matrix from an input feature map of the convolutional neural network layer, each column having i element [Figure 2, discloses converting filters and input layers into 2 matrices, wherein each filter tensor is converted to a row of Kc × Kh × Kw elements producing the converted weight matrix and input tensor is converted to converted input matrix with Kc x Kh x Kw elements in each column]
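For illustration, the Figure 2 lowering cited from Pothos (each filter flattened to a row of Kc × Kh × Kw elements, each input receptive field to a matching column) can be sketched as follows; the code is the examiner's minimal example under those assumptions, not code from the reference:

```python
import numpy as np

def lower_to_gemm(filters, fmap):
    """filters: (F, Kc, Kh, Kw); fmap: (Kc, H, W).
    Returns (weight_matrix, input_matrix) whose product gives the
    convolution output (valid padding, stride 1)."""
    F, Kc, Kh, Kw = filters.shape
    _, H, W = fmap.shape
    weight_matrix = filters.reshape(F, Kc * Kh * Kw)  # one row per filter
    cols = []
    for y in range(H - Kh + 1):
        for x in range(W - Kw + 1):
            patch = fmap[:, y:y + Kh, x:x + Kw]       # one receptive field
            cols.append(patch.reshape(-1))            # one column per patch
    input_matrix = np.stack(cols, axis=1)             # (Kc*Kh*Kw, positions)
    return weight_matrix, input_matrix

# Example: two 1x2x2 filters over a 1x3x3 feature map
filters = np.arange(8).reshape(2, 1, 2, 2)
fmap = np.arange(9).reshape(1, 3, 3)
Wm, Xm = lower_to_gemm(filters, fmap)
# Wm has shape (2, 4): one row of Kc*Kh*Kw = 4 elements per filter;
# Xm has shape (4, 4): one column per sliding-window position.
```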
Cowan further discloses the bit-slicing can support GEMM operators [“We have implemented a library of operators that support the bitpacking transformation ... The library targets common neural network operators such as 2D convolutions and dense matrix multiplication (GEMM).” Operators supporting bit-packing scheduling, Sec.3.1]
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, Cowan, and Pothos before him before the effective filing date of the claimed invention, to modify the processor disclosed by the combination of Umuroglu, Digilent, and Cowan to preprocess the filter and input tensors into their respective converted matrices, allowing the use of GEMM-optimized libraries with respect to data prefetching/caching, vectorization, and threading mechanisms for improved performance [Pothos: Sec.3.1], while still using the matrix operations and respective MMA given by the combination of Umuroglu, Digilent, and Cowan. The combination of Umuroglu, Digilent, Cowan, and Pothos discloses a converted weight matrix and a converted input matrix, which are the matrices to be multiplied. As such, the combination of Umuroglu, Digilent, Cowan, and Pothos discloses the limitations using the converted matrices in place of the matrices disclosed by Umuroglu, Digilent, and Cowan.
System claims 9-14 correspond to claims 2-7, respectively. System claims 9-14 are therefore rejected for the reasons given above for claims 2-7.
Regarding Method claim 15, Umuroglu, Digilent, Cowan, and Pothos disclose the systems according to claims 1 and 8, which implement the method of claim 15 in a processor and an MMA, respectively. Method claim 15 is therefore rejected for the reasons given above for claims 1 and 8.
Method claims 16-19 correspond to system claims 2-5 respectively. A mere change in statutory class is obvious. Method claims 16-19 are therefore rejected for the reasons given above for system claims 2-5.
Regarding Method claim 20, Umuroglu, Digilent, Cowan, and Pothos disclose the method according to claim 19. Method claim 20 corresponds to system claim 7 (including the limitations of claim 6). Method claim 20 is therefore rejected for the reasons given above for system claim 7 (and claim 6).
Response to Arguments
Applicant’s arguments, see page 13, filed 12/10/2025, with respect to Objections to the Specification have been fully considered but they are not persuasive. The Objections to the Specification of the Office Action mailed 07/16/2025 has been maintained. See objections above.
Applicant's arguments, see page 14, filed 12/10/2025, with respect to Rejections under 35 U.S.C. 103 have been fully considered but they are not persuasive. Regarding claim 1 (and claims 8/15), the applicant argues that Umuroglu and Cowan do not teach how to implement a convolutional neural network. However, the applicant’s argument mischaracterizes Cowan. Cowan describes its operations on convolutional neural network data. See at least Cowan Sec.2. Additionally, the argument is directed to the references individually and not to the references as combined. The examiner respectfully disagrees with the applicant’s assertion to the contrary for at least the reasons given above.
Applicant's arguments, see page 14-15, filed 12/10/2025, with respect to the new limitations for the Rejections under 35 U.S.C. 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Kenny K. Bui whose telephone number is (571)270-0604. The examiner can normally be reached 8:00 am to 3:00 pm on Monday, 8:00 am to 4:00 pm on Tuesday to Friday ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew T Caldwell can be reached at (571)272-3702. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KENNY K. BUI/Patent Examiner, Art Unit 2182 (571)270-0604
/ANDREW CALDWELL/Supervisory Patent Examiner, Art Unit 2182