Prosecution Insights
Last updated: April 19, 2026
Application No. 17/708,919

Matrix Multiply Accelerator for Variable Bitwidth Operands

Final Rejection: §103, §112, §DP
Filed: Mar 30, 2022
Examiner: BUI, KENNY KIM
Art Unit: 2182
Tech Center: 2100 — Computer Architecture & Software
Assignee: Arm Limited
OA Round: 2 (Final)
Grant Probability: 60% (Moderate)
Estimated OA Rounds: 3-4
Estimated Time to Grant: 4y 0m
Grant Probability with Interview: 85%

Examiner Intelligence

Career Allow Rate: 60% (6 granted / 10 resolved cases), +5.0% vs Tech Center average
Interview Lift: +25.0% among resolved cases with interview (strong)
Typical Timeline: 4y 0m average prosecution; 27 applications currently pending
Career History: 37 total applications across all art units

Statute-Specific Performance

§101: 29.8% (-10.2% vs TC avg)
§102: 7.7% (-32.3% vs TC avg)
§103: 38.3% (-1.7% vs TC avg)
§112: 22.6% (-17.4% vs TC avg)

Tech Center averages are estimates. Based on career data from 10 resolved cases.

Office Action

§103 §112 §DP
DETAILED ACTION

This Office Action is sent in response to Applicant's Communication received on 12/15/2025 for application number 17/708,919. The Office hereby acknowledges receipt of the following items, which have been placed of record in the file: Applicant's remarks, and amendments to the claims, specification, and drawings.

Examiner notes the following: Claims 1-13, 15, and 18 have been amended, and claim 17 has been cancelled. Claims 1-16 and 18-20 are pending.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Examiner's Remarks

Certain dependent claims use the transitional phrase "where", for example claims 2 and 3. Applicant should consider amending the claims to use the more conventional "wherein". See MPEP 2111.04.

Claim Objections

Claims 5, 7, and 15 are objected to because of the following informalities:

In claims 7 and 15, "second bit-slice tensor" should read "second bitslice vector".
In claim 5, "in row of the" should read "in the row of the".
In claim 5, "in a column of the" should read "in the column of the".

Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 6 and 14 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claim 1 states: "a first bitslice vector comprising i-elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at bit position j of an m-bit element…" and "a second bitslice vector comprising i-elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at bit position k of an n-bit element…". Because each bitslice vector has i elements, and each element corresponds to a bit position of its respective m-bit or n-bit element, the claim implies that i=m and i=n, i.e., m=i=n.

Regarding claim 6 (and identical claim 14): the claim states that "m is less than n". This claim is unclear because it specifies a condition that is inconsistent with the parent claim, for the reasons given above for claim 1. As such, claims 6 and 14 are rejected under 35 U.S.C. 112(b). For purposes of examination, specifically for claims 6 and 14, the parent limitations of "a first bitslice vector comprising i-elements" and "a second bitslice vector comprising i-elements" will be interpreted as permitting different numbers of elements.
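For orientation on the terminology the §112 analysis turns on, here is a minimal sketch (illustrative only, not part of the record; the function name and bit layout are assumptions) of forming bitslice vectors from a row of i unsigned m-bit elements. On this layout, the number of elements in a bitslice vector (i, the row length) and the number of bitslice vectors per row (m, the bit width) are independent quantities:

```python
def bitslice_row(row, m):
    """Split a row of i unsigned m-bit elements into m bitslice vectors.

    bitslices[j][e] is bit j of element e: each bitslice vector has i
    elements (one bit per matrix element), and there are m vectors per row.
    """
    return [[(x >> j) & 1 for x in row] for j in range(m)]

row = [5, 3, 6, 1]                      # i = 4 elements, m = 3 bits each
for j, bs in enumerate(bitslice_row(row, 3)):
    print(f"bit {j}: {bs}")             # bit 0: [1, 1, 0, 1], and so on
```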
Double Patenting

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees.

A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).

The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.

The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.

Claims 1-5, 7-13, 15-16, and 18-20 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claim 7 of copending Application No. 17/493,420 (hereinafter Reference claim 7) in view of Umuroglu et al. (NPL: "BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing", from the IDS filed 04/16/2025), hereinafter Umuroglu. This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.

Regarding claim 1, Reference claim 7 discloses:
Claim chart (Instant Application claim 1 | Reference Application No. 17/493,420, Reference claim 7, claim chain 7/6/5/4/3/2/1):

Instant claim 1: An apparatus for implementing a convolution neural network, the apparatus comprising: a memory configured to store one or more weight tensors of a convolution neural network layer, where the weight tensor has a height, a width, and a depth greater than one; and a processor coupled to the memory and configured to: generate a converted weight matrix having a plurality of rows, each row comprising i m-bit elements of a weight tensor of a convolution neural network; generate a converted input matrix having a plurality of columns, each column comprising i m-bit elements of an input feature map of a convolution neural network; generate an output matrix, each element of the output matrix generated as a product of a row of the converted weight matrix and a column of the converted input matrix.

Reference claims: Claim 1: "A system, comprising: a memory configured to store at least one converted weight matrix and at least one converted input data matrix, the converted weight matrix having a number of rows, a number of columns, a number of elements and a bit resolution, the converted input data matrix including a number of rows, a number of columns, a number of elements and a bit resolution; a processor, coupled to the memory, configured to: generate each row of a converted weight matrix from a weight tensor of a convolution neural network layer, the weight tensor having a height, a width, and a depth greater than one and each row having a number (i) of elements; generate each column of a converted input matrix from an input feature map of the convolutional neural network layer, each column having i elements; for the converted weight matrix: receive the bit slice weight tensor and the bit slice input data tensor, and multiply the bit slice weight tensor and the bit slice input data tensor to generate an output data matrix."

Instant claim 1 (continued): Where said generate the product of a row of the converted weight matrix and a column of the converted input matrix includes: obtain a first bitslice vector comprising i elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at bit position j of an m-bit element of the row of the converted weight matrix; obtain a second bitslice vector comprising n elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at a particular bit position of a respective n-bit operand of a number of n-bit operands, where n and m are greater than one.

Reference claims: Claim 1: "generate, based on the bit resolution, a number of bit slice vectors for each row, each bit slice vector having i elements… for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the row"; Claim 2: "…the number of bit slice vectors is the same as the bit resolution of the weight matrix"; Claim 1: "generate, based on the bit resolution, a number of bit slice vectors for each column, each bit slice vector having i elements… for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the column"; Claim 3: "…the number of bit slice vectors is the same as the bit resolution of the input data matrix."
Instant claim 1 (continued): provide the first bitslice vector as a first input to a single-bit dot product unit; and provide the second bitslice vector as a second input to the single-bit dot product unit; and where the single-bit dot product unit is configured to: generate a bitslice dot product as a dot product of the first and second bitslice vectors; shift the bitslice dot product by j+k bit positions to provide a shifted bitslice dot product; and accumulate the shifted bitslice dot products for all values of j and k to provide the element of the output matrix.

Reference claims: Claim 1: "generate, based on the bit resolution, a number of bit slice vectors for each row… generate, based on the bit resolution, a number of bit slice vectors for each column"; Claim 4: "the system according to claim 3, where the MMA includes: an array of bit slice dot product (BSDP) elements… each BSDP element configured to generate a dot product between one row of the weight matrix and one column of the input data matrix"; Claim 7: "receive an index value… the index value being equal to j + k; count a number of bits set to one in the resultant value to generate a population count value; and left-shift the population count value based on the index value to generate the intermediate value"; Claim 5: "an ADDER circuit configured to add the intermediate value to an accumulated value".

However, Reference claim 7 does not explicitly disclose: when the first bitslice vector corresponds to sign bits of elements of the row of the converted weight matrix and the second bitslice vector does not correspond to sign bits of elements of the column of the converted input matrix, negate the bitslice dot product; and when the first bitslice vector does not correspond to sign bits of elements of the row of the converted weight matrix and the second bitslice vector does correspond to sign bits of elements of the column of the converted input matrix, negate the bitslice dot product.

In the analogous art of bit-wise matrix multiplication, Umuroglu teaches: when the first bitslice vector corresponds to sign bits of elements of the row of the converted weight matrix and the second bitslice vector does not correspond to sign bits of elements of the column of the converted input matrix, negate the bitslice dot product; and when the first bitslice vector does not correspond to sign bits of elements of the row of the converted weight matrix and the second bitslice vector does correspond to sign bits of elements of the column of the converted input matrix, negate the bitslice dot product [Algorithm 1 discloses, on lines 7 and 12, that for signed inputs the first bits are sign bits and are used to determine negation].

It would have been obvious to one of ordinary skill in the art, having the teachings of the Reference Application and Umuroglu before the effective filing date of the claimed invention, to incorporate the dot product unit structure [Umuroglu: BISMO] as taught by Umuroglu into the second circuit as disclosed by the Reference Application, to implement support for signed data values using bit-serial matrix multiplication, in order to efficiently compute matrix multiplications using available instructions for most processors [Umuroglu: II. Bit-Serial Matrix Multiplication]. The combination of the Reference Application and Umuroglu discloses the limitations of claim 1.
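To make the negate/shift/accumulate limitations concrete, the following is a minimal sketch (an illustration assuming two's-complement operands; it is not the claimed circuit and not BISMO's hardware) of a signed bit-serial dot product in the style the rejection attributes to Umuroglu's Algorithm 1: each pair of bitslices contributes a popcount shifted by j+k, negated when exactly one of the two slices is a sign-bit slice:

```python
def bitslice(vals, bits):
    """Return `bits` bitslice vectors for a list of two's-complement values."""
    return [[(v >> j) & 1 for v in vals] for j in range(bits)]

def bitserial_dot(a, b, m, n):
    """Signed dot product built from single-bit dot products (a sketch).

    A sign-bit slice (j == m-1 or k == n-1) carries weight -2^j or -2^k in
    two's complement, so a partial product is negated when exactly one of
    the two slices is a sign-bit slice (and positive again when both are).
    """
    a_sl, b_sl = bitslice(a, m), bitslice(b, n)
    acc = 0
    for j in range(m):
        for k in range(n):
            dot = sum(x & y for x, y in zip(a_sl[j], b_sl[k]))  # popcount of AND
            if (j == m - 1) != (k == n - 1):   # exactly one sign-bit slice
                dot = -dot
            acc += dot << (j + k)              # shift by j + k, then accumulate
    return acc

a, b = [-2, 3, 1], [1, -4, 2]                  # 4-bit signed operands
assert bitserial_dot(a, b, 4, 4) == sum(x * y for x, y in zip(a, b))  # -12
```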
Regarding claim 2, Reference claim 7 of the Reference Application and Umuroglu disclose the invention substantially as claimed. See the discussion of claim 1 above. Reference claim 7 discloses:

Claim chart (Instant Application claim 2 | Reference Application No. 17/493,420, Reference claim 7, claim chain 7/6/5/4/3/2/1):

Instant claim 2: said provide the first bitslice vector includes provide the elements of the first bitslice vector in a first sequence; and said provide the second bitslice vector includes provide the elements of the second bitslice vector in a second sequence.

Reference claims: Claim 1: "generate, based on the bit resolution, a number of bit slice vectors for each row… generate a bit slice weight tensor based on the bit slice vectors for each row… generate, based on the bit resolution, a number of bit slice vectors for each column… generate a bit slice input data tensor based on the bit slice vectors for each column"; Claim 4: "the system according to claim 3… an array of bit slice dot product (BSDP) elements, coupled to the controller and the first, second and third registers, configured to multiply the bit slice weight tensor and the bit slice input data tensor, each BSDP element configured to generate a dot product between one row of the converted weight matrix and one column of the converted input data matrix."

Regarding claim 3, Reference claim 7 of the Reference Application and Umuroglu disclose the invention substantially as claimed. See the discussion of claim 2 above. Reference claim 7 discloses:

Claim chart (Instant Application claim 3 | Reference Application No. 17/493,420, Reference claim 7, claim chain 7/6/5/4/3/2/1):

Instant claim 3: The apparatus according to claim 2, where: said obtain the first bitslice vector includes… said obtain the second bitslice vector includes…

Reference claims: Claim 1: "generate a bit slice weight tensor based on the bit slice vectors for each row; generate a bit slice input data tensor based on the bit slice vectors for each column"; Claim 4: "the system according to claim 3… a local memory; a controller coupled to the local memory; a first register, coupled to the controller and the local memory, configured to store at least a portion of the bit slice input data tensor; a second register, coupled to the controller and the local memory, configured to store at least a portion of the bit slice weight tensor."

However, the Reference Application does not explicitly disclose: said obtain the first bit-slice vector includes read the first bit vector from a storage; and said obtain the second bit-slice vector includes read the second bit vector from the storage.

In the analogous art of bit-wise matrix multiplication, Umuroglu teaches said obtain the first vector from a storage, and said obtain the second vector from the storage [C. Programming BISMO: "BISMO provides programmability through the use of instructions that control each of the pipeline stages… The RunFetch instruction specifies from where in main memory to read data and the destination matrix buffers to store read data."; Figure 3 teaches loading data from main memory into two different sets of buffers, i.e., The Execute Stage: "the left-hand-side and right-hand-side matrix buffers"].

It would have been obvious to one of ordinary skill in the art, having the teachings of the Reference Application and Umuroglu before the effective filing date of the claimed invention, to incorporate the dot product array instruction set for data fetching [Umuroglu: BISMO] as taught by Umuroglu into the system as disclosed by the Reference Application, to implement the instruction set for fetching data, in order to avoid bottlenecks and ensure efficient use of bandwidth [Umuroglu: III. The Bit-Serial Matrix Multiplication Overlay: A. Hardware Architecture].
The combination of the Reference Application and Umuroglu discloses the limitations of claim 3.

Regarding claim 4, Reference claim 7 of the Reference Application and Umuroglu disclose the invention substantially as claimed. See the discussion of claim 2 above. Reference claim 7 discloses:

Claim chart (Instant Application claim 4 | Reference Application No. 17/493,420, Reference claim 7, claim chain 7/6/5/4/3/2/1):

Instant claim 4: The apparatus according to claim 2, where: said obtain the first bitslice vector includes: for each m-bit element of the row of the converted weight matrix, generate a bit vector having m bits, each bit corresponding to a bit value at a particular bit position of the m-bit elements, and generate the first bitslice vector based on the bit vectors for the m-bit elements.

Reference claims: Claim 1: "where said generate, based on the bit resolution, the number of bit slice vectors for each row of the weight matrix includes: arrange elements of the row in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the row"; Claim 2: "The system according to claim 1… the number of bit slice vectors is the same as the bit resolution of the weight matrix".

Instant claim 4 (continued): said obtain the second bitslice vector includes: for each n-bit element of the column of the converted input matrix, generate a bit vector having n bits, each bit corresponding to a bit value at a particular bit position of the n-bit elements, and generate the second bitslice vector based on the bit vectors for the n-bit elements.

Reference claims: Claim 1: "where said generate, based on the bit resolution, the number of bit slice vectors for each column of the input data matrix includes: arrange elements of the column in bit vector form as a bit vector including a sequence of bits; and for each bit position in the sequence of bits: form a bit slice vector as values of the bits in the bit position for elements of the column"; Claim 3: "The system according to claim 2… the number of bit slice vectors is the same as the bit resolution of the input data matrix."

Regarding claim 5, Reference claim 7 of the Reference Application and Umuroglu disclose the invention substantially as claimed. See the discussion of claim 4 above. Reference claim 7 discloses:

Claim chart (Instant Application claim 5 | Reference Application No. 17/493,420, Reference claim 7, claim chain 7/6/5/4/3/2/1):

Instant claim 5: The apparatus according to claim 4, where the number of m-bit elements in the row of the converted weight matrix is the same as the number of n-bit elements in a column of the converted input matrix.

Reference claims: Claim 2: "The system according to claim 1… the number of columns of the weight matrix is the same as the number of rows of the input data matrix."

Regarding claim 7, Reference claim 7 of the Reference Application and Umuroglu disclose the invention substantially as claimed. See the discussion of claim 5 above. Reference claim 7 discloses:

Claim chart (Instant Application claim 7 | Reference Application No. 17/493,420, Reference claims 4/3/2/1):

Instant claim 7: provide the first bitslice vector as the first input to an array of single-bit dot product units; provide the second bit-slice tensor as the second input to the array of single-bit dot product units; and obtain, from the array of single-bit dot product units, an output comprising a product of the multiplication of the converted weight matrix and the converted input matrix.
Reference claims: "a matrix multiply accelerator (MMA), coupled to the processor and the memory, configured to: receive the bit slice weight tensor and the bit slice input data tensor, and multiply the bit slice weight tensor and the bit slice input data tensor to generate an output data matrix"; Claim 4: "the system according to claim 3… an array of bit slice dot product (BSDP) elements, coupled to the controller and the first, second and third registers, configured to multiply the bit slice weight tensor and the bit slice input data tensor, each BSDP element configured to generate a dot product between one row of the weight matrix and one column of the input data matrix."

Regarding claim 8, Reference claim 7 of the Reference Application and Umuroglu disclose the invention substantially as claimed. See the discussion of claim 7 above. Reference claim 7 discloses:

Claim chart (Instant Application claim 8 | Reference Application No. 17/493,420, Reference claims 4/3/2/1):

Instant claim 8: The apparatus according to claim 7, where: each first bitslice vector is provided as the first input to each single-bit dot product unit in one row of the array of single-bit dot product units; and each second bitslice vector is provided as the second input to each single-bit dot product unit in one column of the array of single-bit dot product units.

Reference claims: Claim 4: "the system according to claim 3… an array of bit slice dot product (BSDP) elements, coupled to the controller and the first, second and third registers, configured to multiply the bit slice weight tensor and the bit slice input data tensor, each BSDP element configured to generate a dot product between one row of the weight matrix and one column of the input data matrix."

Method claims 9-13 and 15-16 of the instant application correspond to apparatus claims 1-5 and 7-8, respectively. A mere change in statutory class is obvious. Method claims 9-13 and 15-16 are therefore provisionally rejected for the reasons given above.

Regarding claim 18, Reference claim 7 of the Reference Application and Umuroglu disclose the invention substantially as claimed. See the discussion of claim 1 above. Reference claim 7 discloses:

Claim chart (Instant Application claim 18 | Reference Application No. 17/493,420, Reference claim 7, claim chain 7/6/5/4/3/2/1):

Instant claim 18: The apparatus according to claim 17, where the single-bit dot product unit includes: a first circuit configured to input a first operand, input a second operand, and perform a bit-wise AND operation to produce a resultant value.

Reference claims: Claim 5: "The system according to claim 4, where each BSDP element includes: a bit-wise AND circuit configured to input a first operand from the first register, input a second operand from the second register, and output a resultant value."

Instant claim 18 (continued): a second circuit configured to input an index parameter, …

Reference claims: Claim 5: "a popcount circuit configured to receive the resultant value and output an intermediate value"; Claim 7: "where the popcount circuit is configured to: receive an index value from the second register, the index value being equal to j + k… left-shift the population count value based on the index value to generate the intermediate value."
Instant claim 18 (continued): a third circuit configured to receive the shifted bitslice dot product from the second circuit, and add the shifted bitslice dot product to an accumulated value; an accumulation storage configured to store the accumulated value, and output a final accumulated value as the element of the output matrix.

Reference claims: Claim 5: "an ADDER circuit configured to add the intermediate value to an accumulated value"; Claim 5: "an accumulation register configured to store the accumulated value, and output a final accumulated value to the third register"; Claim 4: "…each BSDP element configured to generate a dot product between…".

However, the Reference Application does not explicitly disclose: a second circuit configured to input an index parameter, input a sign parameter, receive the resultant value from the first circuit, and output an intermediate value based on the index parameter, the sign parameter and the resultant value.

In the analogous art of bit-wise matrix multiplication, Umuroglu teaches a second circuit configured to input an index parameter, input a sign parameter, receive the resultant value from the first circuit, and output an intermediate value based on the index parameter, the sign parameter and the resultant value [The Execution Stage: "the summation is a simple population count (popcount) of the result" (of the multi-bit logic AND operation) and "The weight in Algorithm 1 is implemented by a left-shift unit and optional negation"].

It would have been obvious to one of ordinary skill in the art, having the teachings of the Reference Application and Umuroglu before the effective filing date of the claimed invention, to incorporate the dot product unit structure [Umuroglu: BISMO] as taught by Umuroglu into the second circuit as disclosed by the Reference Application, to implement support for signed data values using bit-serial matrix multiplication, in order to efficiently compute matrix multiplications using available instructions for most processors [Umuroglu: II. Bit-Serial Matrix Multiplication].
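As a concrete picture of the circuit organization being mapped here (first circuit: bitwise AND; second circuit: popcount, shift by the index j+k, and optional sign; third circuit: adder with accumulation storage), here is a minimal behavioral sketch. The class and method names are illustrative assumptions, not language from either application:

```python
class BSDPElement:
    """Behavioral sketch of one bit-slice dot product element (not RTL).

    first circuit:  bitwise AND of the two packed operand bitslices
    second circuit: popcount, left-shift by the index (j + k), sign applied
    third circuit:  adder feeding an accumulation register
    """

    def __init__(self):
        self.acc = 0  # accumulation storage

    def step(self, first_operand: int, second_operand: int,
             index: int, sign: int) -> None:
        resultant = first_operand & second_operand   # first circuit
        popcount = bin(resultant).count("1")         # second circuit: popcount
        intermediate = sign * (popcount << index)    # shift by j + k, apply sign
        self.acc += intermediate                     # third circuit: accumulate

# One partial step: bitslices packed as integers, j = 1, k = 2, unsigned (sign = +1)
pe = BSDPElement()
pe.step(0b1011, 0b0110, index=1 + 2, sign=+1)  # popcount(0b0010) = 1 -> acc = 8
print(pe.acc)
```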
Regarding claim 19, Reference claim 7 of the Reference Application and Umuroglu disclose the invention substantially as claimed. See the discussion of claim 18 above. Reference claim 7 discloses:

Claim chart (Instant Application claim 19 | Reference Application No. 17/493,420, Reference claim 7, claim chain 7/6/5/4/3/2/1):

Instant claim 19: the first operand is an element of the first bitslice vector having an index j equal to the associated bit position of the element. Reference claim 6: "the first operand is a bit slice vector from the bit slice input data tensor having an index k equal to the associated bit position of the bit slice vector."

Instant claim 19 (continued): the second operand is an element of the second bitslice vector having an index k equal to the associated bit position of the element. Reference claim 6: "the second operand is a bit slice vector from the bit slice weight tensor having an index j equal to the associated bit position of the bit slice vector."

Instant claim 19 (continued): the second circuit is configured to: count a number of bits set to one in the resultant value to generate a population count value. Reference claim 7: "where the popcount circuit is configured to: count a number of bits set to one in the resultant value to generate a population count value."

Instant claim 19 (continued): left-shift the population count value based on the index parameter to generate the intermediate value. Reference claim 7: "left-shift the population count value based on the index value to generate the intermediate value."

Instant claim 19 (continued): the index parameter is equal to j + k. Reference claim 7: "the index value being equal to j + k."

However, the Reference Application does not explicitly disclose: multiply the intermediate value by the sign parameter.

In the analogous art of bit-wise matrix multiplication, Umuroglu teaches multiply the intermediate value by the sign parameter [Algorithm 1, Line 7, teaches handling the sign of the integers as part of the "weight" multiplication, which is done within the dot product unit].

It would have been obvious to one of ordinary skill in the art, having the teachings of the Reference Application and Umuroglu before the effective filing date of the claimed invention, to incorporate the Algorithm methodology and dot product unit structure [Umuroglu: BISMO] as taught by Umuroglu into the second circuit as disclosed by the Reference Application, to implement support for signed data values using bit-serial matrix multiplication, in order to efficiently compute matrix multiplications using available instructions for most processors [Umuroglu: II. Bit-Serial Matrix Multiplication].

Regarding claim 20, Reference claim 7 of the Reference Application and Umuroglu disclose the invention substantially as claimed. See the discussion of claim 19 above. Reference claim 7 does not disclose the additional limitation of claim 20. More specifically, Reference claim 7 does not teach where: when the m-bit and n-bit operands are unsigned elements, the sign parameter is equal to 1; and when the m-bit and n-bit operands are signed elements, the sign parameter is equal to 1 or -1, based on the index j and the index k.

In the analogous art of bit-wise matrix multiplication, Umuroglu teaches where: when the m-bit and n-bit operands are unsigned elements, the sign parameter is equal to 1 [Figure 1, "Fig. 1. Example of a bit-serial matrix multiplication on unsigned integers… weight on line 8 is always positive", teaches that unsigned integers are always positive, which would make the sign equal to 1]; and when the m-bit and n-bit operands are signed elements, the sign parameter is equal to 1 or -1, based on the index j and the index k [Algorithm 1, Lines 5-7: the sign parameter for signed integers is equal to 1 or -1, based on i or j (i.e., j and k)].
It would have been obvious to one of ordinary skill in the art, having the teachings of the Reference Application and Umuroglu before the effective filing date of the claimed invention, to incorporate the Algorithm methodology and dot product unit structure [Umuroglu: BISMO] as taught by Umuroglu into the second circuit as disclosed by the Reference Application, to implement support for signed data values using bit-serial matrix multiplication, in order to efficiently compute matrix multiplications using available instructions for most processors [Umuroglu: II. Bit-Serial Matrix Multiplication].

Claims 6 and 14 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claim 4 of copending Application No. 17/493,420 and Umuroglu, in view of Cowan et al. (NPL: "Automatic Generation of High-Performance Quantized Machine Learning Kernels", from the IDS filed 04/16/2025), hereinafter Cowan.

Regarding claim 6, claim 4 of the copending Application No. 17/493,420 (Reference Application) and Umuroglu teach the invention substantially as claimed. See the discussion of claim 5 above. Reference claim 4 does not disclose the additional limitation of claim 6. More specifically, Reference claim 4 does not teach a scenario where m is less than n.

In the analogous art of quantized matrix multiplication, Cowan teaches where m is less than n [Page 310, Kernel Specification: "For example, the schedule might require a matrix multiplication between an 8 × 16 matrix with 2-bit values (the weights) and a 16 × 1 matrix with 1-bit values (the activations)", which teaches the ability to use the algorithm with 1-bit activations and 2-bit weights].

It would have been obvious to one of ordinary skill in the art, having the teachings of the Reference Application, Umuroglu, and Cowan before the effective filing date of the claimed invention, to incorporate the multi-bit quantization bit-slicing instruction set as taught by Cowan into the processor as disclosed by the Reference Application and Umuroglu, to allow for bitwise operations on mixed-bitwidth quantized data for improvement in computations [Cowan: Quantized Models, Pages 306-307].

Method claim 14 of the instant application corresponds to apparatus claim 6. A mere change in statutory class is obvious. Method claim 14 is therefore provisionally rejected for the reasons given above for apparatus claim 6.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.
Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-16 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Umuroglu et al. (NPL: "BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing", from the IDS filed 04/16/2025), hereinafter Umuroglu, in view of Digilent (NPL: "PYNQ-Z1 Board Reference Manual", from the IDS filed 04/16/2025), further in view of Cowan et al. (NPL: "Automatic Generation of High-Performance Quantized Machine Learning Kernels", from the IDS filed 04/16/2025), hereinafter Cowan, and further in view of Pothos et al. (NPL: "Deep Learning Inference with Dynamic Graphs on Heterogeneous Platforms"), hereinafter Pothos.

Regarding claim 1, Umuroglu discloses:

A memory configured to store matrix data [A. Hardware Architecture: "The Fetch Stage: is responsible for reading matrix data from main memory and populating the matrix buffers with data"; C. Programming BISMO: "The RunFetch instruction specifies from where in main memory to read data and the destination matrix buffers to store read data…. The RunResult instruction specifies the base address of the result matrix stored in main memory"; this teaches that memory can hold matrices to be used for matrix multiplication].

A dot product array; provide the first bit vector as a first input to a single-bit dot product unit; provide at least the second bit vector as a second input to the single-bit dot product unit; generate an output matrix, each element of the output matrix generated as a product of a row of the left matrix and a column of the right matrix [III. The Bit-Serial Matrix Multiplication Overlay: "BISMO consists of a hardware part and a software part. The hardware part is composed of a scalable bit-serial matrix multiplication datapath and associated memory and control logic. The software part generates instructions for the hardware for a given matrix size and precision.", teaches the subsystem of Fig. 2, with various components, used for matrix multiplication; The Execute Stage: "for performing the matrix multiplication on the data present in the matrix buffers... The DPU computes a partial result of the dot product between a row and column of two bit-matrices"].

Where the single-bit dot product unit is configured to: generate a dot product as a dot product of the first and second bit vectors ["The DPU computes a partial result of the dot product between a row and column of two bit-matrices", Sec. III.A.2].
when the first bit vector corresponds to sign bits of elements of the row of the left matrix and the second bit vector does not correspond to sign bits of elements of the column of the right matrix, negate the dot product; when the first bit vector does not correspond to sign bits of elements of the row of the left matrix and the second bit vector does correspond to sign bits of elements of the column of the right matrix, negate the dot product [Algorithm 1, Lines 5-7, teaches considering the sign bits to determine whether the value to be accumulated is to be negated];

shift the dot product by j+k bit positions to provide a shifted bit dot product [Algorithm 1, Line 7, teaches adding both indices i (index j) and j (index k) together for shifting of data as a "weight"]; and

accumulate the shifted bit dot products for all values of j and k to provide the element of the output matrix [Algorithm 1, Lines 3-7 and 12, teaches accumulating using the shifted products].

However, Umuroglu does not explicitly disclose: a memory configured to store one or more weight tensors of a convolution neural network layer, where the weight tensor has a height, a width, and a depth greater than one; and a processor coupled to the memory and configured to: generate a converted weight matrix having a plurality of rows, each row comprising i m-bit elements of a weight tensor of a convolution neural network; generate a converted input matrix having a plurality of columns, each column comprising i m-bit elements of an input feature map of a convolution neural network; generate an output matrix, each element of the output matrix generated as a product of a row of the converted weight matrix and a column of the converted input matrix; where said generate the product of a row of the converted weight matrix and a column of the converted input matrix includes: obtain a first bitslice vector comprising i elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at bit position j of an m-bit element of the row of the converted weight matrix; obtain a second bitslice vector comprising n elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at a particular bit position of a respective n-bit operand of a number of n-bit operands, where n and m are greater than one; provide the first bitslice vector as a first input to a single-bit dot product unit; and provide the second bitslice vector as a second input to the single-bit dot product unit; and where the single-bit dot product unit is configured to: generate a bitslice dot product as a dot product of the first and second bitslice vectors; when the first bitslice vector corresponds to sign bits of elements of the row of the converted weight matrix and the second bitslice vector does not correspond to sign bits of elements of the column of the converted input matrix, negate the bitslice dot product; when the first bitslice vector does not correspond to sign bits of elements of the row of the converted weight matrix and the second bitslice vector does correspond to sign bits of elements of the column of the converted input matrix, negate the bitslice dot product; shift the bitslice dot product by j+k bit positions to provide a shifted bitslice dot product; and accumulate the shifted bitslice dot products for all values of j and k to provide the element of the output matrix.
In the analogous art of hardware architecture for co-processor systems, Digilent teaches:

A memory configured to store data [DDR3, Page 9: "The DDR3 is connected to the hard memory controller in the Processor Subsystem (PS), as outlined in the Zynq documentation.", where the DDR3 is connected to the Multiport DRAM Controller in Figure 2.1].

A processor and a programmable logic subsystem coupled to the processor [Application Processing Unit ("APU, which includes 2 Cortex-A9 processors", ARM A9 CPUs, Pages 1 and 4); "The programmable logic is also connected to the interconnect as a slave, and designs can implement multiple cores in the FPGA fabric that each also contain addressable control registers. Furthermore, cores implemented in the PL can trigger interrupts to the processors (connections not shown in Fig. 3) and perform DMA accesses to DDR3 memory.", Page 4; this teaches that the programmable logic has cores (controllers) and, with Figure 2.1, also teaches the use of a DSP].

It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu and Digilent before the effective filing date of the claimed invention, to implement the dual subsystem taught by Digilent, by implementing the dot product array [Umuroglu: BISMO] as disclosed by Umuroglu as the programmable logic subsystem taught by Digilent, since Umuroglu already evaluated "BISMO on the Xilinx PYNQ-Z1 board" [Umuroglu: I. Introduction], for various improvements in performance.

However, Umuroglu and Digilent do not explicitly disclose: a memory configured to store one or more weight tensors of a convolution neural network layer, where the weight tensor has a height, a width, and a depth greater than one; and a processor coupled to the memory and configured to: generate a converted weight matrix having a plurality of rows, each row comprising i m-bit elements of a weight tensor of a convolution neural network; generate a converted input matrix having a plurality of columns, each column comprising i m-bit elements of an input feature map of a convolution neural network; generate an output matrix, each element of the output matrix generated as a product of a row of the converted weight matrix and a column of the converted input matrix; where said generate the product of a row of the converted weight matrix and a column of the converted input matrix includes: obtain a first bitslice vector comprising i elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at bit position j of an m-bit element of the row of the converted weight matrix; obtain a second bitslice vector comprising n elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at a particular bit position of a respective n-bit operand of a number of n-bit operands, where n and m are greater than one; provide the first bitslice vector as a first input to a single-bit dot product unit; and provide the second bitslice vector as a second input to the single-bit dot product unit; and where the single-bit dot product unit is configured to: generate a bitslice dot product as a dot product of the first and second bitslice vectors; when the first bitslice vector corresponds to sign bits of elements of the row of the converted weight matrix and the second bitslice vector does not correspond to sign bits of elements of the column of the converted input matrix, negate the bitslice dot product;
when the first bitslice vector does not correspond to sign bits of elements of the row of the converted weight matrix and the second bitslice vector does correspond to sign bits of elements of the column of the converted input matrix, negate the bitslice dot product; shift the bitslice dot product by j+k bit positions to provide a shifted bitslice dot product; and accumulate the shifted bitslice dot products for all values of j and k to provide the element of the output matrix.

In the analogous art of quantized matrix multiplication, Cowan teaches:

The operations are applied to convolutional neural networks and higher-dimension tensors ["we focus on both fully connected and convolutional neural networks… for example, in a fully connected network, we can compute the activations for layer k+1 as the matrix-vector product of the weights for layer k and activations for layer k… A convolutional network is similar but with higher-dimension tensors for the weights and activations (i.e., more dot products required)", Sec. 2, Quantized Models; "It takes as input a d-dimensional tensor and returns a d+1-dimensional tensor, with a new bit axis that indexes the bitplanes of the original values", Sec. 3.1, Bit-Slicing Schedules].

A processor configured to [ARM NEON: "To target low-power ARM processors, we synthesize code in a subset of the ARM NEON vectorized instruction set.", Page 311, teaches the use of the ARM NEON instruction set on ARM processors]:

Generate an output matrix, each element of the output matrix generated as a product of a row of the weight matrix and a column of the input matrix, where said generate the product of a row of the weight matrix and a column of the input matrix [see the Multi-bit Quantization section] includes:

obtain a first bitslice vector comprising i elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at bit position j of an m-bit element of the row of the weight matrix [Figure 2, "slicing the values of… activations into bitplanes", and the Multi-bit Quantization section, Page 307, teach a bit-packed vector set wherein "decompose each value in the vector… into their constituent bits at the corresponding bitwidth" teaches a number of bitplane vectors equal to the number of bits of each element, using a particular bit position of a respective operand];

obtain a second bitslice vector comprising i elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at bit position k of an n-bit element of the column of the input matrix, where n and m are greater than one [Figure 2, "slicing the values of weights... into bitplanes", and the Multi-bit Quantization section, Page 307, teach a bit-packed vector set wherein "decompose each value in the vector… into their constituent bits at the corresponding bitwidth" teaches a number of bitplane vectors equal to the number of bits of each element, using a particular bit position of a respective operand].

It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before the effective filing date of the claimed invention, to incorporate the bit-slicing instruction set as taught by Cowan into the processor as disclosed by Umuroglu and Digilent, to allow for bitwise operations on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors and with the capability to handle higher-dimension tensors by converting them into bit-slices [Cowan: Quantized Models, Pages 306-307; ARM NEON, Page 311].
However, Umuroglu, Digilent, and Cowan do not explicitly disclose: a processor configured to: generate a converted weight matrix having a plurality of rows, each row comprising i m-bit elements of a weight tensor of a convolution neural network; generate a converted input matrix having a plurality of columns, each column comprising i m-bit elements of an input feature map of a convolution neural network; generate an output matrix, each element of the output matrix generated as a product of a row of the converted weight matrix and a column of the converted input matrix; where said generate the product of a row of the converted weight matrix and a column of the converted input matrix includes: obtain a first bitslice vector comprising i elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at bit position j of an m-bit element of the row of the converted weight matrix; obtain a second bitslice vector comprising n elements, where each element is a bit vector having a plurality of bits, where each bit corresponds to a bit value at a particular bit position of a respective n-bit operand of a number of n-bit operands, where n and m are greater than one; provide the first bitslice vector as a first input to a single-bit dot product unit; and provide the second bitslice vector as a second input to the single-bit dot product unit; and where the single-bit dot product unit is configured to: generate a bitslice dot product as a dot product of the first and second bitslice vectors; when the first bitslice vector corresponds to sign bits of elements of the row of the converted weight matrix and the second bitslice vector does not correspond to sign bits of elements of the column of the converted input matrix, negate the bitslice dot product; when the first bitslice vector does not correspond to sign bits of elements of the row of the converted weight matrix and the second bitslice vector does correspond to sign bits of elements of the column of the converted input matrix, negate the bitslice dot product; shift the bitslice dot product by j+k bit positions to provide a shifted bitslice dot product; and accumulate the shifted bitslice dot products for all values of j and k to provide the element of the output matrix.

In the analogous art of general matrix multiplication algorithms for convolutional neural networks, Pothos teaches: a convolution neural network layer [Figure 4]; and a processor ["the above procedure, described for CPU-based implementations", p. 166], the processor configured to: generate a converted weight matrix having a plurality of rows, each row comprising i m-bit elements of a weight tensor of a convolution neural network; generate a converted input matrix having a plurality of columns, each column comprising i m-bit elements of an input feature map of a convolution neural network [Figure 2 discloses converting filters and input layers into two matrices, wherein each filter tensor is converted to a row of Kc × Kh × Kw elements, producing the converted weight matrix, and the input tensor is converted to a converted input matrix with Kc × Kh × Kw elements in each column].
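For readers unfamiliar with the conversion that Pothos's Figure 2 describes, here is a minimal im2col-style sketch (the function name and the stride-1, no-padding setting are illustrative assumptions): each filter tensor becomes a row of Kc*Kh*Kw elements, and each output position's receptive field becomes a column with the same number of elements:

```python
import numpy as np

def im2col_convert(weights, ifm):
    """Convert conv tensors to matrices (stride 1, no padding assumed).

    weights: (K, Kc, Kh, Kw) filters  -> (K, Kc*Kh*Kw) converted weight matrix
    ifm:     (Kc, H, W) feature map   -> (Kc*Kh*Kw, out_h*out_w) converted input matrix
    """
    K, Kc, Kh, Kw = weights.shape
    _, H, W = ifm.shape
    out_h, out_w = H - Kh + 1, W - Kw + 1
    w_mat = weights.reshape(K, Kc * Kh * Kw)          # one filter per row
    cols = [ifm[:, y:y + Kh, x:x + Kw].reshape(-1)    # one receptive field per column
            for y in range(out_h) for x in range(out_w)]
    return w_mat, np.stack(cols, axis=1)

w_mat, x_mat = im2col_convert(np.ones((4, 3, 2, 2)), np.ones((3, 5, 5)))
out = w_mat @ x_mat   # output matrix: each element is a row-by-column product
print(w_mat.shape, x_mat.shape, out.shape)  # (4, 12) (12, 16) (4, 16)
```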
It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, Cowan, and Pothos before the effective filing date of the claimed invention, to modify the processor disclosed by the combination of Umuroglu, Digilent, and Cowan to preprocess the filter and input tensors into their respective converted matrices, to allow for the use of GEMM-optimized libraries with respect to data prefetching/caching, vectorization, and threading mechanisms for improved performance [Pothos: Sec. 3.1], while still using the matrix operations and respective MMA given by the combination of Umuroglu, Digilent, and Cowan. The combination of Umuroglu, Digilent, Cowan, and Pothos discloses a converted weight matrix and a converted input matrix that are, in effect, the matrices to be multiplied. As such, the combination of Umuroglu, Digilent, Cowan, and Pothos discloses the limitations of using the converted matrices based on tensor data in place of the matrices disclosed by Umuroglu, Digilent, and Cowan.

Regarding claim 2, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 1 above. Umuroglu and Digilent do not explicitly disclose the additional limitations of claim 2. More specifically, Umuroglu and Digilent do not disclose: said provide at least one element of the first bitslice vector includes provide the elements of the first bitslice vector in a first sequence; said provide at least one element of the second bitslice vector includes provide the elements of the second bitslice vector in a second sequence; and the output is a dot product of the first and second bitslice vectors.

In the analogous art of quantized matrix multiplication, Cowan teaches said provide the first bitslice vector includes provide the elements of the first bitslice vector in a first sequence, and said provide the second bitslice vector includes provide the elements of the second bitslice vector in a second sequence [Multi-bit Quantization: "Given these bit packed values ŵ_i and â_i, we can compute ŵ · â = Σ_{n=0}^{N-1} Σ_{m=0}^{M-1} 2^(n+m) popcount(ŵ_n & â_m), where N and M are the bit widths", which teaches a sequence of providing each of the elements of a bitslice vector set in order to compute the dot product of the two bitslice vector sets].

It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before the effective filing date of the claimed invention, to incorporate the bit-slicing instruction set as taught by Cowan into the processor as disclosed by the combination of Umuroglu and Digilent, to allow for bitwise multiplication on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors [Cowan: Quantized Models, Pages 306-307; ARM NEON, Page 311].
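As a quick numeric check of the quoted popcount identity (a sketch; the packing helper below is an assumption, with element e of slice n holding bit n of value e):

```python
def pack_bitslices(vals, bits):
    """Pack bit n of every value into integer slice n (bit e = element e)."""
    return [sum(((v >> n) & 1) << e for e, v in enumerate(vals))
            for n in range(bits)]

w, a = [3, 1, 2, 3], [1, 2, 3, 1]            # unsigned 2-bit vectors
w_sl, a_sl = pack_bitslices(w, 2), pack_bitslices(a, 2)

# Sum over bit positions: 2^(n+m) * popcount(w_slice_n AND a_slice_m)
total = sum((1 << (n + m)) * bin(w_sl[n] & a_sl[m]).count("1")
            for n in range(2) for m in range(2))
assert total == sum(x * y for x, y in zip(w, a))  # 3 + 2 + 6 + 3 = 14
```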
Regarding claim 3, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 2 above.

Umuroglu discloses said obtain the first vector from a storage, and said obtain the second vector from the storage [C. Programming BISMO: "BISMO provides programmability through the use of instructions that control each of the pipeline stages… The RunFetch instruction specifies from where in main memory to read data and the destination matrix buffers to store read data."; Figure 3 teaches loading data from main memory into two different sets of buffers, i.e., The Execute Stage: "the left-hand-side and right-hand-side matrix buffers"].

In the analogous art of hardware architecture for co-processor systems, Digilent teaches the processor coupled to a storage [DDR3, Page 9: "The DDR3 is connected to the hard memory controller in the Processor Subsystem (PS), as outlined in the Zynq documentation.", where the DDR3 is connected to the Multiport DRAM Controller in Figure 2.1; the APU is coupled to the memory as shown in Figure 2.1].

It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu and Digilent before the effective filing date of the claimed invention, to implement the dual subsystem taught by Digilent, by implementing the dot product array [Umuroglu: BISMO] as disclosed by Umuroglu as the programmable logic subsystem taught by Digilent, since Umuroglu already evaluated "BISMO on the Xilinx PYNQ-Z1 board" [Umuroglu: I. Introduction], for various improvements in performance.

However, Umuroglu and Digilent do not explicitly disclose: said obtain the first bitslice vector includes read the first bit vector from a storage; and said obtain the second bitslice vector includes read the second bit vector from the storage.

In the analogous art of quantized matrix multiplication, Cowan teaches the use of any register as an input for bitslice vector operations [Compute Sketch: "where each instruction can be either a bitwise operation (and, or, not, addition, etc.) or a special population count intrinsic instruction. In both cases, the synthesizer is free to choose any live registers as the inputs to the instruction", Page 310, teaches the use of any active register to hold values for bitwise operations, such as AND and SHIFT, as inputs to the operations; "Given these bit packed values ŵ_i and â_i, we can compute ŵ · â = Σ_{n=0}^{N-1} Σ_{m=0}^{M-1} 2^(n+m) popcount(ŵ_n & â_m)", Page 307, teaches computing the operation on two bit-packed values, i.e., the bitslice vectors].

It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before the effective filing date of the claimed invention, to incorporate the bit-slicing instruction set as taught by Cowan into the processor as disclosed by the combination of Umuroglu and Digilent, to allow for bitwise multiplication on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors [Cowan: Quantized Models, Pages 306-307; ARM NEON, Page 311]. The combination of Umuroglu, Digilent, and Cowan discloses the limitations of claim 3.

Regarding claim 4, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 2 above. Umuroglu and Digilent do not explicitly disclose the additional limitations of claim 4. More specifically, Umuroglu and Digilent do not disclose where: said obtain the first bitslice vector includes: for each m-bit element of the row of the converted weight matrix, generate a bit vector having m bits, each bit corresponding to a bit value at a particular bit position of the m-bit element, and generate the first bitslice vector based on the bit vectors for the m-bit elements; and said obtain the second bitslice vector includes: for each n-bit element of the column of the converted input matrix, generate a bit vector having n bits, each bit corresponding to a bit value at a particular bit position of the n-bit elements, and generate the second bitslice vector based on the bit vectors for the n-bit elements.
Regarding claim 4, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 2 above. Umuroglu and Digilent do not explicitly disclose the additional limitations of claim 4. More specifically, Umuroglu and Digilent do not disclose where: said obtain the first bitslice vector includes: for each m-bit element of the row of the converted weight matrix, generate a bit vector having m bits, each bit corresponding to a bit value at a particular bit position of the m-bit element, and generate the first bitslice vector based on the bit vectors for the m-bit elements; and said obtain the second bitslice vector includes: for each n-bit element of the column of the converted input matrix, generate a bit vector having n bits, each bit corresponding to a bit value at a particular bit position of the n-bit elements, and generate the second bitslice vector based on the bit vectors for the n-bit elements.

In the analogous art of quantized matrix multiplication, Cowan teaches where: said obtain the first bitslice vector includes: for each m-bit element of the row of the weight matrix, generate a bit vector having m bits, each bit corresponding to a bit value at a particular bit position of the m-bit element, and generate the first bitslice vector based on the bit vectors for the m-bit elements; and said obtain the second bitslice vector includes: for each n-bit element of the column of the input matrix, generate a bit vector having n bits, each bit corresponding to a bit value at a particular bit position of the n-bit elements, and generate the second bitslice vector based on the bit vectors for the n-bit elements [Figure 2, "Slicing values of weights and activations into bitplanes", and Multi-bit Quantization: "We can extend the above approach to larger weights and activations by slicing the bits of the weights and activations into bitplanes and then packing them into vectors.", teaches that x-bit operands are decomposed into bitplanes holding the bits of the respective operands at each bit position; 3.1 Bit-Slicing Schedules: "We introduce a parameterizable bit-packing Layout Transformation to the scheduling process. It takes as input a d-dimensional tensor and returns a d+1-dimensional tensor, with a new bit axis that indexes the bitplanes of the original values", teaches generating the bitplanes from a d-dimensional tensor as part of the scheduling process; Operators supporting bit-packing scheduling: "We have implemented a library of operators that support the bit packing transformation as part of their schedules"]. It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before them before the effective filing date of the claimed invention, to incorporate the bit-slicing scheduling instruction set as taught by Cowan into the processor as disclosed by the combination of Umuroglu and Digilent, to implement bitwise multiplication on quantized data for improvement in computations and parallelism, while also allowing implementation on various platforms [Cowan: Quantized Models Pages 306-307, 4. Microkernel Synthesis Page 309, and 4.3 Implementation Page 311]. However, Umuroglu, Digilent, and Cowan do not explicitly disclose using the converted weight matrix and the converted input matrix.

In the analogous art of general matrix multiplication algorithms for convolutional neural networks, Pothos teaches: generate a converted weight matrix having a plurality of rows, each row comprising i m-bit elements of a weight tensor of a convolutional neural network; and generate a converted input matrix having a plurality of columns, each column comprising i m-bit elements of an input feature map of a convolutional neural network [Figure 2 discloses converting filters and input layers into two matrices, wherein each filter tensor is converted to a row of Kc × Kh × Kw elements producing the converted weight matrix, and the input tensor is converted to the converted input matrix with Kc × Kh × Kw elements in each column]. It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, Cowan, and Pothos before them before the effective filing date of the claimed invention, to modify the processor disclosed by the combination of Umuroglu, Digilent, and Cowan to preprocess the filter and input tensors into their respective converted matrices, to allow for the use of GEMM-optimized libraries with respect to data prefetching/caching, vectorization, and threading mechanisms for improved performance [Pothos: Sec. 3.1], while still using the matrix operations and respective MMA given by the combination of Umuroglu, Digilent, and Cowan. The combination of Umuroglu, Digilent, Cowan, and Pothos discloses a converted weight matrix and a converted input matrix, which are the matrices to be multiplied. As such, the combination of Umuroglu, Digilent, Cowan, and Pothos discloses the limitations of using the converted matrices based on tensor data in place of the matrices disclosed by Umuroglu, Digilent, and Cowan.
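Cowan's described layout transformation, a d-dimensional tensor gaining a new bit axis, can be sketched in a few lines of NumPy (our construction; the function name `to_bitplanes` is hypothetical):

```python
import numpy as np

def to_bitplanes(t: np.ndarray, bits: int) -> np.ndarray:
    """d-dim unsigned tensor -> (d+1)-dim tensor; new axis 0 indexes bitplanes."""
    return np.stack([(t >> j) & 1 for j in range(bits)], axis=0)

weights = np.array([[3, 1], [2, 0]], dtype=np.uint8)   # 2-bit values
planes = to_bitplanes(weights, bits=2)
assert planes.shape == (2, 2, 2)                       # bit axis prepended
# Summing 2**j * plane_j reconstructs the original tensor exactly.
assert (sum(planes[j].astype(int) << j for j in range(2)) == weights).all()
```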
Regarding claim 5, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 4 above. Umuroglu and Digilent do not explicitly disclose the additional limitations of claim 5. More specifically, Umuroglu and Digilent do not disclose where the number of m-bit elements in the row of the converted weight matrix is the same as the number of n-bit elements in a column of the converted input matrix.

In the analogous art of quantized matrix multiplication, Cowan teaches where the number of m-bit elements in the row of the weight matrix is the same as the number of n-bit elements in a column of the input matrix [Page 310, Kernel Specification: "For example, the schedule might require a matrix multiplication between an 8 × 16 matrix with 2-bit values (the weights) and a 16 × 1 matrix with 1-bit values (the activations)" teaches the use of weight and input (activation) matrices that have an equal number of columns and rows, respectively, i.e., the same number of operands, which is a natural consequence of matrix multiplication]. It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before them before the effective filing date of the claimed invention, to incorporate the bit-slicing matrix multiplication algorithm as taught by Cowan into the processor as disclosed by the combination of Umuroglu and Digilent, to implement bitwise multiplication on quantized data for improvement in computations and parallelism [Cowan: Quantized Models Pages 306-307]. However, Umuroglu, Digilent, and Cowan do not explicitly disclose using the converted weight matrix and the converted input matrix.

In the analogous art of general matrix multiplication algorithms for convolutional neural networks, Pothos teaches where the number of m-bit elements in the row of the converted weight matrix is the same as the number of n-bit elements in a column of the converted input matrix [Figure 2 discloses converting filters and input layers into two matrices, wherein each filter tensor is converted to a row of Kc × Kh × Kw elements producing the converted weight matrix, and the input tensor is converted to the converted input matrix with Kc × Kh × Kw elements in each column]. It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, Cowan, and Pothos before them before the effective filing date of the claimed invention, to modify the processor disclosed by the combination of Umuroglu, Digilent, and Cowan to preprocess the filter and input tensors into their respective converted matrices, to allow for the use of GEMM-optimized libraries with respect to data prefetching/caching, vectorization, and threading mechanisms for improved performance [Pothos: Sec. 3.1], while still using the matrix operations and respective MMA given by the combination of Umuroglu, Digilent, and Cowan. The combination of Umuroglu, Digilent, Cowan, and Pothos discloses a converted weight matrix and a converted input matrix, which are the matrices to be multiplied. As such, the combination of Umuroglu, Digilent, Cowan, and Pothos discloses the limitations of using the converted matrices based on tensor data in place of the matrices disclosed by Umuroglu, Digilent, and Cowan.
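As a rough picture of the Pothos-style conversion, the toy im2col-like lowering below (our construction; all helper names are hypothetical) flattens each filter into one row of the converted weight matrix and each input patch into one column of the converted input matrix, so rows and columns share the same element count i = Kc × Kh × Kw:

```python
import numpy as np

def convert_weights(filters: np.ndarray) -> np.ndarray:
    """(F, Kc, Kh, Kw) filters -> (F, Kc*Kh*Kw): one filter per row."""
    return filters.reshape(filters.shape[0], -1)

def convert_input(fmap: np.ndarray, kh: int, kw: int) -> np.ndarray:
    """(Kc, H, W) feature map -> (Kc*Kh*Kw, n_patches): one patch per column."""
    kc, h, w = fmap.shape
    cols = [fmap[:, y:y + kh, x:x + kw].reshape(-1)
            for y in range(h - kh + 1) for x in range(w - kw + 1)]
    return np.stack(cols, axis=1)

filters = np.arange(2 * 1 * 2 * 2).reshape(2, 1, 2, 2)  # two 1x2x2 filters
fmap = np.arange(1 * 3 * 3).reshape(1, 3, 3)            # one-channel input
out = convert_weights(filters) @ convert_input(fmap, kh=2, kw=2)
assert out.shape == (2, 4)   # rows and columns share i = Kc*Kh*Kw elements
```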
Regarding claim 6, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 5 above. Umuroglu and Digilent do not explicitly disclose the additional limitations of claim 6. More specifically, Umuroglu and Digilent do not disclose where m is less than n. In the analogous art of quantized matrix multiplication, Cowan teaches where m is less than n [Page 310, Kernel Specification: "For example, the schedule might require a matrix multiplication between an 8 × 16 matrix with 2-bit values (the weights) and a 16 × 1 matrix with 1-bit values (the activations)"]. It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before them before the effective filing date of the claimed invention, to incorporate the bit-slicing matrix multiplication algorithm as taught by Cowan into the processor as disclosed by the combination of Umuroglu and Digilent, to implement bitwise multiplication on quantized data for improvement in computations and parallelism [Cowan: Quantized Models Pages 306-307].
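Nothing in the popcount formulation ties m to n, so the m < n case runs unchanged through the `bitserial_dot` sketch above; here is a quick check with 1-bit weights against 3-bit activations (our example values, not from the cited references):

```python
# Asymmetric bitwidths: work scales with m * n plane pairs (here 1 * 3).
w = [1, 0, 1, 1]   # m = 1 bit per weight
a = [5, 7, 2, 4]   # n = 3 bits per activation
assert bitserial_dot(w, a, m_bits=1, n_bits=3) == 5 + 2 + 4   # = 11
```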
Regarding claim 7, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 5 above. Umuroglu discloses: the processor to: provide the first vector as the first input to an array of single-bit dot product units; provide the second bitslice tensor as the second input to the array of single-bit dot product units; and obtain, from the array of single-bit dot product units, an output comprising a product of the multiplication of the first and second matrices [Figure 3; The Execute Stage: "The core of the stage consists of an array of dot product units (DPUs), where each DPU is fed with a design-time configurable number of bits (Dk) from the left-hand-side and right-hand-side matrix buffers… A single software controllable sequence generator is responsible for reading out the appropriate data from the matrix buffers."; The Result Stage: "is responsible for writing the results generated by the execute stage to main memory"]. However, Umuroglu, Digilent, and Cowan do not explicitly disclose using the converted weight matrix and the converted input matrix.

In the analogous art of general matrix multiplication algorithms for convolutional neural networks, Pothos teaches where the number of m-bit elements in the row of the converted weight matrix is the same as the number of n-bit elements in a column of the converted input matrix [Figure 2 discloses converting filters and input layers into two matrices, wherein each filter tensor is converted to a row of Kc × Kh × Kw elements producing the converted weight matrix, and the input tensor is converted to the converted input matrix with Kc × Kh × Kw elements in each column]. It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, Cowan, and Pothos before them before the effective filing date of the claimed invention, to modify the processor disclosed by the combination of Umuroglu, Digilent, and Cowan to preprocess the filter and input tensors into their respective converted matrices, to allow for the use of GEMM-optimized libraries with respect to data prefetching/caching, vectorization, and threading mechanisms for improved performance [Pothos: Sec. 3.1], while still using the matrix operations and respective MMA given by the combination of Umuroglu, Digilent, and Cowan. The combination of Umuroglu, Digilent, Cowan, and Pothos discloses a converted weight matrix and a converted input matrix, which are the matrices to be multiplied. As such, the combination of Umuroglu, Digilent, Cowan, and Pothos discloses the limitations of using the converted matrices based on tensor data in place of the matrices disclosed by Umuroglu, Digilent, and Cowan.
Regarding claim 8, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 7 above. Umuroglu discloses: where: each first vector is provided as the first input to each single-bit dot product unit in one row of the array of single-bit dot product units; and each second vector is provided as the second input to each single-bit dot product unit in one column of the array of single-bit dot product units [Figure 3; "The DPUs on the same row...broadcasted by the left-hand-side matrix buffer... DPUs on the same column... broadcasted by the right-hand-side matrix buffer… A single software controllable sequence generator is responsible for reading out the appropriate data from the matrix buffers"]. However, Umuroglu and Digilent do not explicitly disclose that each first bitslice vector is provided as the first input to each single-bit dot product unit in one row of the array of single-bit dot product units; and each second bitslice vector is provided as the second input to each single-bit dot product unit in one column of the array of single-bit dot product units.

In the analogous art of quantized matrix multiplication, Cowan teaches providing the first and second bitslice vectors to compute the dot product [Figure 2 shows the breakdown of vectors into respective bitslice vectors; Multi-bit Quantization: "Given these bit packed values $\hat{w}_i$ and $\hat{a}_i$, we can compute $\hat{w} * \hat{a} = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} 2^{n+m}\,\mathrm{popcount}(\hat{w}_n \,\&\, \hat{a}_m)$, where N and M are the bit widths"]. It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before them before the effective filing date of the claimed invention, to incorporate the bit-slicing scheduling instruction set as taught by Cowan into the processor as disclosed by the combination of Umuroglu and Digilent, in order to modify the sequence of data from the matrices to allow for bitwise operations and formatting on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors [Cowan: Quantized Models Pages 306-307, ARM NEON Page 311]. The combination of Umuroglu, Digilent, and Cowan discloses the limitations of the claim.
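The row/column broadcast mapped onto claims 7 and 8 can be modeled as a toy grid of single-bit dot product units (our simplification of the cited DPU array, reusing `popcount` and `bitplanes` from the first sketch): each row of the grid receives one left-hand bitslice word and each column one right-hand bitslice word, so the grid covers every (j, k) plane pair.

```python
def dpu(lhs_word: int, rhs_word: int) -> int:
    """One single-bit dot product unit: bitwise AND, then popcount."""
    return popcount(lhs_word & rhs_word)

def dpu_array(lhs_planes, rhs_planes):
    """Broadcast lhs planes across rows and rhs planes down columns."""
    return [[dpu(wp, ap) for ap in rhs_planes] for wp in lhs_planes]

grid = dpu_array(bitplanes([3, 1, 2, 0], 2), bitplanes([1, 1, 0, 1], 1))
total = sum(grid[j][k] << (j + k)
            for j in range(len(grid)) for k in range(len(grid[0])))
assert total == 4   # agrees with the single-unit result above
```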
Method claims 9-16 correspond to apparatus claims 1-8, and a mere change in statutory class is obvious. Method claims 9-16 are therefore rejected for the reasons given above for their respective apparatus claims 1-8.

Regarding claim 18, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 1 above. Umuroglu discloses: where the single-bit dot product unit includes: a first circuit configured to input a first operand, input a second operand, and perform a bit-wise AND operation to produce a resultant value [The Execution Stage: "where each DPU is fed with a design-time configurable number of bits (Dk) from the left-hand-side and right-hand-side matrix buffers", i.e., the first and second operands; "The single bit multiplications are performed by a multi-bit logic AND operation"]; a second circuit configured to input an index parameter, input a sign parameter, receive the resultant value from the first circuit, and output the shifted bit dot product based on the index parameter, the sign parameter, and the resultant value [The Execution Stage: "the summation is a simple population count (popcount) of the result" (of the multi-bit logic AND operation) and "The weight in Algorithm 1 is implemented by a left-shift unit and optional negation"]; a third circuit configured to receive the shifted bit dot product from the second circuit, and add the shifted bit dot product to an accumulated value; and an accumulation storage configured to store the accumulated value, and output a final accumulated value as the element of the output matrix [The Execution Stage: "The partial results are accumulated and stored in a register (Acc.) of width A, which is typically 32 bits [5], [6] to avoid overflow"; The Result Stage: "When the execute stage has produced a new set of results, the accumulated dot-products are written to the result buffer from which the result stage writes them to main memory"].

However, Umuroglu and Digilent do not explicitly disclose using bitslice vectors to generate the bitslice dot product. In the analogous art of quantized matrix multiplication, Cowan teaches obtaining the bitslice vectors for matrix multiplication [Figure 2, "Slicing values of weights and activations into bitplanes", and Multi-bit Quantization: "We can extend the above approach to larger weights and activations by slicing the bits of the weights and activations into bitplanes and then packing them into vectors.", teaches that x-bit operands are decomposed into bitplanes holding the bits of the respective operands at each bit position; 3.1 Bit-Slicing Schedules: "We introduce a parameterizable bit-packing Layout Transformation to the scheduling process. It takes as input a d-dimensional tensor and returns a d+1-dimensional tensor, with a new bit axis that indexes the bitplanes of the original values", teaches generating the bitplanes from a d-dimensional tensor as part of the scheduling process; Operators supporting bit-packing scheduling: "We have implemented a library of operators that support the bit packing transformation as part of their schedules"]. It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before them before the effective filing date of the claimed invention, to incorporate the bit-slicing scheduling instruction set as taught by Cowan into the processor as disclosed by the combination of Umuroglu and Digilent, to implement bitwise multiplication on quantized data for improvement in computations and parallelism, while also allowing implementation on various platforms [Cowan: Quantized Models Pages 306-307, 4. Microkernel Synthesis Page 309, and 4.3 Implementation Page 311].
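The four-part unit mapped onto claim 18 (an AND circuit, a popcount/shift/sign circuit, an adder, and an accumulation register) can be sketched as a toy software model; the class layout below is our assumption, and the cited design is hardware, not Python. It reuses `popcount` and `bitplanes` from the first sketch.

```python
class SingleBitDPU:
    """Toy model of the mapped unit: AND, popcount/shift/sign, add, store."""

    def __init__(self):
        self.acc = 0                              # accumulation storage

    def step(self, op_a: int, op_b: int, index: int, sign: int) -> int:
        anded = op_a & op_b                       # first circuit: bitwise AND
        shifted = popcount(anded) << index        # second: popcount, then shift
        self.acc += sign * shifted                # third: add into accumulator
        return self.acc

unit = SingleBitDPU()
for j, wp in enumerate(bitplanes([3, 1, 2, 0], 2)):
    for k, ap in enumerate(bitplanes([1, 1, 0, 1], 1)):
        unit.step(wp, ap, index=j + k, sign=1)    # unsigned: sign stays 1
assert unit.acc == 4                              # final output matrix element
```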
Regarding claim 19, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 18 above. Umuroglu discloses: the first operand is an element of the first vector having an index j equal to the associated bit position of the element; the second operand is an element of the second vector having an index k equal to the associated bit position of the element [Algorithm 1, Lines 3-4, teaches indices i and j, which indicate the bit positions as shown in Figure 1]; the second circuit is configured to: count a number of bits set to one in the resultant value to generate a population count value, left-shift the population count value based on the index parameter to generate the intermediate value [The Execution Stage: "the summation is a simple population count (popcount) of the result" (of the multi-bit logic AND operation) and "The weight in Algorithm 1 is implemented by a left-shift unit and optional negation"], and multiply the intermediate value by the sign parameter [Algorithm 1, Line 7, teaches handling the sign of the integers as part of the "weight" multiplication]; and the index parameter is equal to j + k [Algorithm 1, Line 7, teaches adding both indices i (index j) and j (index k) together for shifting of data as the "weight"].

However, Umuroglu and Digilent do not explicitly disclose that the first operand is an element of the first bitslice vector having an index j equal to the associated bit position of the element, and the second operand is an element of the second bitslice vector having an index k equal to the associated bit position of the element. In the analogous art of quantized matrix multiplication, Cowan teaches: the first operand is an element of the first bitslice vector having an index j equal to the associated bit position of the element; and the second operand is an element of the second bitslice vector having an index k equal to the associated bit position of the element [Multi-bit Quantization: "Given these bit packed values $\hat{w}_i$ and $\hat{a}_i$, we can compute $\hat{w} * \hat{a} = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} 2^{n+m}\,\mathrm{popcount}(\hat{w}_n \,\&\, \hat{a}_m)$, where N and M are the bitwidths for weights and activations", teaches the sum n + m for shifting the popcount value, which also represents the associated bit positions]. It would have been obvious to one of ordinary skill in the art, having the teachings of Umuroglu, Digilent, and Cowan before them before the effective filing date of the claimed invention, to incorporate the bit-slicing scheduling instruction set and bitplane multiplication algorithm as taught by Cowan into the processor as disclosed by the combination of Umuroglu and Digilent, in order to modify the sequence of data from the matrices to allow for bitwise operations and formatting on quantized data for improvement in computations and parallelism, while also improving performance on ARM processors [Cowan: Quantized Models Pages 306-307, ARM NEON Page 311]. The combination of Umuroglu, Digilent, and Cowan discloses the limitations of claim 19.

Regarding claim 20, Umuroglu, Digilent, Cowan, and Pothos disclose the invention substantially as claimed. See the discussion of claim 19 above. Umuroglu discloses when the m-bit and n-bit operands are unsigned elements, the sign parameter is equal to 1 [Figure 1, "Fig. 1. Example of a bit-serial matrix multiplication on unsigned integers… weight on line 8 is always positive", teaches that unsigned integers are always positive, which makes the sign equal to 1]; and when the m-bit and n-bit operands are signed elements, the sign parameter is equal to 1 or -1, based on the index j and the index k [Algorithm 1, Lines 5-7: the sign parameter for signed integers is equal to 1 or -1, based on i or j (j and k)].
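The sign handling read onto claims 19 and 20 can be checked with a small worked example (our construction, reusing the earlier helpers; the sign rule below is our restatement of two's-complement arithmetic, not quoted from the references): the most significant bitplane of a signed operand carries negative weight, so the (j, k) term is negated exactly when one index, but not both, is its operand's MSB position.

```python
def signed_bitserial_dot(w, a, m_bits, n_bits):
    """Two's-complement bit-serial dot product with the +/-1 sign rule."""
    wp = bitplanes([v & ((1 << m_bits) - 1) for v in w], m_bits)
    ap = bitplanes([v & ((1 << n_bits) - 1) for v in a], n_bits)
    acc = 0
    for j in range(m_bits):
        for k in range(n_bits):
            # Negate when exactly one of j, k is its operand's MSB position.
            sign = -1 if (j == m_bits - 1) != (k == n_bits - 1) else 1
            acc += sign * (popcount(wp[j] & ap[k]) << (j + k))
    return acc

w, a = [-2, 1, 3], [1, -4, 2]   # 3-bit signed operands
assert signed_bitserial_dot(w, a, 3, 3) == sum(x * y for x, y in zip(w, a))
```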
Response to Arguments

Applicant's arguments, see page 10, filed 12/15/2025, with respect to the Objections to the Specification, Drawings, and Claims have been fully considered and are persuasive. The Objections to the Specification, Drawings, and Claims of the Office Action mailed 07/16/2025 have been withdrawn. Applicant's arguments, see page 11, filed 12/15/2025, with respect to the rejection under 35 U.S.C. 112 have been fully considered and are persuasive. The rejection under 35 U.S.C. 112 of the Office Action mailed 07/16/2025 has been withdrawn.

Applicant's arguments, see page 13, filed 12/15/2025, with respect to the rejections under 35 U.S.C. 103 have been fully considered, but they are not persuasive. Regarding claim 1 (and claims 8/15), the applicant argues that Umuroglu and Cowan do not teach how to implement a convolutional neural network. However, the applicant's argument mischaracterizes Cowan: Cowan describes its operations based on convolutional neural network data. See at least Cowan Sec. 2. Additionally, the claim recites the implementation of the CNN only in the preamble, and a preamble is not ordinarily given patentable weight. The argument is also directed to the references individually rather than to the combination of the references as combined. The examiner respectfully disagrees with the applicant's assertions to the contrary for at least the reasons given above.

Applicant's arguments, see pages 11-14, filed 12/15/2025, with respect to the new limitations for the rejections under 35 U.S.C. 103 and Double Patenting have been considered but are moot in view of the different grounds of rejection necessitated by the claim amendments.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Kenny K. Bui, whose telephone number is (571) 270-0604. The examiner can normally be reached 8:00 am to 3:00 pm on Monday, and 8:00 am to 4:00 pm Tuesday through Friday, ET. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew T. Caldwell, can be reached at (571) 272-3702. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KENNY K. BUI/
Patent Examiner, Art Unit 2182
(571) 270-0604

/ANDREW CALDWELL/
Supervisory Patent Examiner, Art Unit 2182

Prosecution Timeline

Mar 30, 2022
Application Filed
Jul 09, 2025
Non-Final Rejection — §103, §112, §DP
Dec 15, 2025
Response Filed
Jan 20, 2026
Final Rejection — §103, §112, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12425047
METHODS AND APPARATUS TO PERFORM WEIGHT AND ACTIVATION COMPRESSION AND DECOMPRESSION
2y 5m to grant Granted Sep 23, 2025
Study what changed to get past this examiner. Based on the most recent grant.


Prosecution Projections

3-4
Expected OA Rounds
60%
Grant Probability
85%
With Interview (+25.0%)
4y 0m
Median Time to Grant
Moderate
PTA Risk
Based on 10 resolved cases by this examiner. Grant probability derived from career allow rate.
