Prosecution Insights
Last updated: April 19, 2026
Application No. 18/187,465

MANAGING INPUT REGISTERS FOR A COMPUTE ENGINE OF A MATRIX MATH ASSIST ACCELERATOR

Final Rejection — §102, §103, §112
Filed: Mar 21, 2023
Examiner: VICARY, KEITH E
Art Unit: 2183
Tech Center: 2100 — Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 6 (Final)

Grant Probability: 58% (Moderate)
Expected OA Rounds: 7-8
Median Time to Grant: 3y 8m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 58% — grants 58% of resolved cases (393 granted / 683 resolved; +2.5% vs TC avg)
Interview Lift: +41.2% — strong lift, comparing resolved cases with an interview vs without
Typical Timeline: 3y 8m average prosecution; 41 applications currently pending
Career History: 724 total applications across all art units
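
As a quick cross-check, the headline numbers are internally consistent, assuming (our reading of the dashboard, not a stated formula) that the "with interview" figure is simply the career allow rate plus the interview lift in percentage points:

    393 / 683 ≈ 57.5%, displayed as 58%   (career allow rate)
    58% + 41.2 pp ≈ 99%                   (grant probability with interview)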

Statute-Specific Performance

§101: 8.7% (-31.3% vs TC avg)
§102: 12.0% (-28.0% vs TC avg)
§103: 34.0% (-6.0% vs TC avg)
§112: 37.6% (-2.4% vs TC avg)

Tech Center averages are estimates. Based on career data from 683 resolved cases.

Office Action — §102, §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 are pending and presented for examination. Claims 1, 6, 11, and 16 are newly amended by the response received February 19, 2026.

In amended claim 11, line 10, the limitation “of the MMA accelerator” has been deleted without appropriate strikethrough. Therefore, it is unclear whether this limitation was intended to be deleted. Examiner recommends, in the next office action, restoring this limitation with no markup if this limitation was intended to be retained, or restoring this limitation with strikethrough if this limitation was intended to be deleted.

In amended claim 16, line 12, the limitation “of the MMA accelerator” has been deleted without appropriate strikethrough. Therefore, it is unclear whether this limitation was intended to be deleted. Examiner recommends, in the next office action, restoring this limitation with no markup if this limitation was intended to be retained, or restoring this limitation with strikethrough if this limitation was intended to be deleted.

Claim Objections

Claims 1-20 are objected to because of the following informalities. Appropriate correction is required.

In claim 1, the “and” at the end of line 19 should be moved to line 20 in order to precede the last recited operation of the method. Claims 2-10 are objected to for failing to alleviate the objection of claim 1 above.

In claim 11, the “and” at the end of line 23 should be moved to line 24 in order to precede the last recited operation. Claims 12-15 are objected to for failing to alleviate the objection of claim 11 above.

In claim 16, the “and” at the end of line 25 should be moved to line 26 in order to precede the last recited operation. Claims 17-20 are objected to for failing to alleviate the objection of claim 16 above.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim 1 recites the limitation “the input data from each of the first plurality of input registers” in line 17. However, there is insufficient antecedent basis for this limitation in the claims. Claim 1 recites the limitation “the input data from each other input register of the first plurality of input registers” in line 18. However, there is insufficient antecedent basis for this limitation in the claims. Claims 2-10 are rejected for failing to alleviate the rejections of claim 1 above.

Claim 11 recites the limitation “the input data from each of the first plurality of input registers” in line 21. However, there is insufficient antecedent basis for this limitation in the claims. Claim 11 recites the limitation “the input data from each other input register of the first plurality of input registers” in line 22. However, there is insufficient antecedent basis for this limitation in the claims. Claims 12-15 are rejected for failing to alleviate the rejections of claim 11 above.

Claim 16 recites the limitation “the input data from each of the first plurality of input registers” in line 23. However, there is insufficient antecedent basis for this limitation in the claims. Claim 16 recites the limitation “the input data from each other input register of the first plurality of input registers” in line 24. However, there is insufficient antecedent basis for this limitation in the claims. Claims 17-20 are rejected for failing to alleviate the rejections of claim 16 above.

First Grounds of Rejection

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claim(s) 1-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Bhat et al. (Bhat) (Matrix-Multiply Assist Best Practices Guide).

Consider claim 1, Bhat discloses a method, comprising: configuring a defined set of input registers (page 12, section 2.2, VSRs) coupled to a compute engine (page 14, Figure 2-2, the blocks that are not registers, for example) of a Matrix Math Assist (MMA) accelerator (page 11, Matrix-Multiply Assist Architecture), wherein the defined set of input registers comprises a first plurality of input registers configured to provide multiple matrix data inputs to the compute engine in a first direction and a second input register configured to provide a vector data input to the compute engine in a second direction orthogonal to the first direction (page 14, which shows that VSRs are used to input multiple matrix data inputs and a vector data input; page 14, Figure 2-2, which shows that VSRs may provide data from the top of Figure 2-2 to the bottom of Figure 2-2, and may provide data from the left of Figure 2-2 to the right of Figure 2-2), and wherein the first plurality of input registers represents a matrix comprising a plurality of rows and a plurality of columns (page 6, Figure 1-6, for example, which shows a plurality of VSRs, e.g., VSR0 and VSR1, represents a matrix comprising a plurality of rows, e.g., VSR0 as a first row of data and VSR1 as a second row of data, and a plurality of columns, e.g., a bit position of VSR0 in conjunction with that same bit position in VSR1); providing a plurality of predefined instructions to the compute engine to process at least one of: matrix-vector multiply compute operations or multiply-add compute operations (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger); selectively feeding input data from the defined set of input registers (page 12, section 2.2, VSRs; Figure 2-2, which shows input data from XA and XB in particular being selectively fed) to the compute engine (page 14, Figure 2-2, the blocks that are not registers, for example); processing, based on selectively feeding input data from the defined set of input registers (page 12, section 2.2, VSRs; Figure 2-2, which shows input data from XA and XB in particular being selectively fed) to the compute engine (page 14, Figure 2-2, the blocks that are not registers, for example), at least one of: the matrix-vector multiply compute operations or the multiply-add compute operations (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger), wherein the processing comprises using the input data from each of the first plurality of input registers in parallel with the input data from each other input register of the first plurality of input registers (page 14, which shows XA being used in parallel with XB); and generating compute results; accumulating the compute results using an accumulator of the compute engine (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2).

Consider claim 2, Bhat discloses the method of claim 1 (see above), wherein the second input register and each of the first plurality of input registers are Vector Scalar Registers (VSRs) (page 12, section 2.2, VSRs) coupled to the compute engine (page 14, Figure 2-2, the blocks that are not registers, for example), wherein the VSRs have a size based on the compute engine (page 12, section 2.2, first paragraph, line 3, each VSR is 128 bits; in other words, the VSR is not, for example, one bit, since the compute engine operates on a greater number of bits).

Consider claim 3, Bhat discloses the method of claim 1 (see above), wherein the plurality of predefined instructions comprise a plurality of predefined Vector Scalar Extension (VSX) instructions added to an instruction set architecture (ISA) (page 6, Vector Scalar Extension (VSX), VSX instruction, VSX ISA) of the MMA accelerator (page 12, section 2.1, MMA architecture) to process the matrix-vector multiply compute operations and the multiply-add compute operations (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger).
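
For readers tracking the technology rather than the claim mapping: the xvf32ger behavior the rejection repeatedly cites (each 32-bit value in one VSR multiplied with each 32-bit value in another, yielding a 4×4 array of results held in a 512-bit accumulator) can be sketched with the POWER10 MMA built-ins that recent GCC and Clang expose. This is a minimal illustration of that outer-product semantics only; the function name and memory handling are ours, not code from Bhat or from the application.

#include <altivec.h>
#include <string.h>

typedef __vector unsigned char vec_t;   /* image of one 128-bit VSR */

/* c[i][j] = a[i] * b[j]: a 4x4 block of 32-bit results (512 bits)
   produced by a single rank-1 update into one MMA accumulator.
   Build with gcc -mcpu=power10 -O2 on a POWER10 target;
   "rank1_f32" is an illustrative name. */
void rank1_f32(const float a[4], const float b[4], float c[4][4])
{
    vec_t va, vb;
    memcpy(&va, a, sizeof va);                /* four 32-bit values, like XA */
    memcpy(&vb, b, sizeof vb);                /* four 32-bit values, like XB */

    __vector_quad acc;                        /* 512-bit accumulator (ACCn)  */
    __builtin_mma_xvf32ger(&acc, va, vb);     /* 4x4 outer product into acc  */
    __builtin_mma_disassemble_acc(c, &acc);   /* move acc contents to memory */
}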
Consider claim 4, Bhat discloses the method of claim 1 (see above), wherein the compute engine comprises a compute array (page 14, an array of blocks of Figure 2-2), and wherein based on one or more of the plurality of predefined instructions (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger), each row of the compute array is loaded with matrix data from a respective different one of the first plurality of input registers, which is multiplied with vector data from the second input register (for example, a compute array corresponds to two rows of the compute engine, and for a first predefined instruction specifying a first VSR and a second VSR, a first row is loaded with matrix data from the first VSR, which is multiplied with vector data from the second VSR, and for a second predefined instruction specifying a third VSR and the second VSR, a second row is loaded with matrix data from the third VSR, which is multiplied with vector data from the second VSR).

Consider claim 5, Bhat discloses the method of claim 1 (see above), wherein selectively feeding the input data from the defined set of input registers to the compute engine comprises selectively feeding input matrix data elements from a respective one of the first plurality of input registers to a respective row of a compute array of the compute engine and feeding a respective input vector element from the second input register to the respective row of input matrix data elements, and wherein the processing comprises multiplying the respective input vector element to the input matrix data elements of each of the respective compute array rows (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger; for example, for a first predefined instruction specifying a first VSR and a second VSR, input matrix data elements are selectively fed from a first VSR to a first row of a compute array of the compute engine and an input vector element is fed from the second VSR to the first row of input matrix data elements, and the input vector element is multiplied by the input matrix data elements of the first row, and for a second predefined instruction specifying a third VSR and a second VSR, input matrix data elements are selectively fed from a third VSR to a second row of a compute array of the compute engine and another input vector element is fed from the second VSR to the second row of input matrix data elements, and the other input vector element is multiplied by the input matrix data elements of the second row).

Consider claim 6, Bhat discloses the method of claim 1 (see above), wherein selectively feeding input data from the defined set of input registers to the compute engine comprises feeding respective input matrix data elements and respective input vector elements from the defined set of input registers to the compute engine (page 14, which shows that VSRs are used to input respective input matrix data elements and respective input vector elements), and wherein the processing comprises multiplying the respective input matrix data elements and the respective input vector elements (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger).

Consider claim 7, Bhat discloses the method of claim 1 (see above), wherein selectively feeding the input data from the defined set of input registers to the compute engine comprises feeding consecutive vector data elements from the defined set of input registers to a respective row of a compute array and wherein the processing comprises multiplying a scalar element from one defined register to each vector data element based on one or more of the plurality of predefined instructions to process the multiply-add compute operations of the compute engine (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger).

Consider claim 8, Bhat discloses the method of claim 1 (see above), wherein the plurality of predefined instructions comprise predefined MMA Vector Scalar Extension (VSX) instructions (page 6, Vector Scalar Extension (VSX), VSX instruction, VSX ISA) to process predefined precision levels of floating point and integer operations (page 12, section 2.1, MMA architecture supports both floating-point and integer data types, FP32, FP64, FP16, bfloat16, INT16, INT8, INT4), wherein the predefined MMA VSX instructions control operations of the compute engine including the accumulator of the compute engine (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger).

Consider claim 9, Bhat discloses the method of claim 1 (see above), wherein the second input register and each of the first plurality of input registers are Vector Scalar Registers (VSRs) of 128 bits (page 12, section 2.2, first paragraph, line 3, each VSR is 128 bits) storing four 32-bit data elements (page 14, last paragraph, four 32-bit single precision values) to be multiplied and accumulated to the accumulator, wherein the accumulator comprises a 512-bit accumulator (page 14, last paragraph, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2.).
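
The examiner's hypothetical for claims 4-5 (successive instructions, each pairing a different "matrix" VSR with the shared "vector" VSR while results gather in the compute engine) corresponds to chaining a ger instruction with its accumulating "pp" variant. Same assumptions as the sketch above (POWER10 MMA built-ins, -mcpu=power10); an illustration of the accumulation pattern, not the application's claimed register management:

#include <altivec.h>

typedef __vector unsigned char vec_t;

/* Two rank-1 updates gathered in one accumulator: the first ger pairs
   va0 with the shared vb; the accumulating gerpp then pairs va1 with
   the same vb and adds onto the prior 4x4 contents. */
void two_updates_f32(vec_t va0, vec_t va1, vec_t vb, float c[4][4])
{
    __vector_quad acc;
    __builtin_mma_xvf32ger(&acc, va0, vb);    /* acc  = va0 (outer) vb */
    __builtin_mma_xvf32gerpp(&acc, va1, vb);  /* acc += va1 (outer) vb */
    __builtin_mma_disassemble_acc(c, &acc);   /* spill 512-bit acc     */
}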
Consider claim 10, Bhat discloses the method of claim 1 (see above), further comprising: receiving an input MxN matrix having M rows and N columns with consecutive data elements in the M rows, wherein the input MxN matrix is larger than a compute array of the compute engine, and transforming the input MxN matrix into a transformed matrix to provide the consecutive data elements in the N columns, and dividing the transformed matrix into submatrixes based on a size of the compute array of the compute engine (page 35, section 4.3, To compute the full resultant 256 x 256 matrix C, the program loops over A in blocks of size 8x256 and over B in blocks of size 16x256, as shown in Example 4-5; page 4, section 1.2, Matrix A is transposed and then the computation is performed in a blocked manner, as follows: The first 8x4 block of transposed matrix A (AT) is iterated over the two 8x4 blocks of matrix B, computing outer products of the corresponding rows from blocks of AT and B, to generate two 4x4 results. The second 8x4 block of matrix A transposed is iterated over the same two blocks of matrix B to generate the next two 4x4 results. This operation is explained in Figure 1-3).

Consider claim 11, Bhat discloses a system, comprising: a processor (page 6, section 1.3, IBM POWER7 processor); and a memory, wherein the memory includes a computer program product which, when executed, configures the processor (page 6, section 1.3, IBM POWER7 processor; note that the IBM POWER7 has such memory) to perform operations for implementing a Matrix Math Assist (MMA) accelerator (page 11, Matrix-Multiply Assist Architecture), the operations comprising: configuring a defined set of input registers (page 12, section 2.2, VSRs) coupled to a compute engine (page 14, Figure 2-2, the blocks that are not registers, for example) of the MMA accelerator (page 11, Matrix-Multiply Assist Architecture), wherein the defined set of input registers comprises a first plurality of input registers configured to provide multiple matrix data inputs to the compute engine in a first direction and a second input register configured to provide a vector data input to the compute engine in a second direction orthogonal to the first direction (page 14, which shows that VSRs are used to input multiple matrix data inputs and a vector data input; page 14, Figure 2-2, which shows that VSRs may provide data from the top of Figure 2-2 to the bottom of Figure 2-2, and may provide data from the left of Figure 2-2 to the right of Figure 2-2), and wherein the first plurality of input registers represents a matrix comprising a plurality of rows and a plurality of columns (page 6, Figure 1-6, for example, which shows a plurality of VSRs, e.g., VSR0 and VSR1, represents a matrix comprising a plurality of rows, e.g., VSR0 as a first row of data and VSR1 as a second row of data, and a plurality of columns, e.g., a bit position of VSR0 in conjunction with that same bit position in VSR1); providing a plurality of predefined instructions to the compute engine to process at least one of: matrix-vector multiply compute operations or multiply-add compute operations (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger); selectively feeding input data from the defined set of input registers (page 12, section 2.2, VSRs; Figure 2-2, which shows input data from XA and XB in particular being selectively fed) to the compute engine (page 14, Figure 2-2, the blocks that are not registers, for example); processing, based on selectively feeding input data from the defined set of input registers (page 12, section 2.2, VSRs; Figure 2-2, which shows input data from XA and XB in particular being selectively fed) to the compute engine (page 14, Figure 2-2, the blocks that are not registers, for example), at least one of: the matrix-vector multiply compute operations or the multiply-add compute operations (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger), wherein the processing comprises using the input data from each of the first plurality of input registers in parallel with the input data from each other input register of the first plurality of input registers (page 14, which shows XA being used in parallel with XB); and generating compute results; accumulating the compute results using an accumulator of the compute engine (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2).

Consider claim 12, Bhat discloses the system of claim 11 (see above), wherein the second input register and each of the first plurality of input registers are Vector Scalar Registers (VSRs) (page 12, section 2.2, VSRs) coupled to the compute engine (page 14, Figure 2-2, the blocks that are not registers, for example), wherein the VSRs have a size based on the compute engine (page 12, section 2.2, first paragraph, line 3, each VSR is 128 bits; in other words, the VSR is not, for example, one bit, since the compute engine operates on a greater number of bits).

Consider claim 13, Bhat discloses the system of claim 11 (see above), wherein the plurality of predefined instructions comprise a plurality of predefined Vector Scalar Extension (VSX) instructions added to an instruction set architecture (ISA) (page 6, Vector Scalar Extension (VSX), VSX instruction, VSX ISA) of the MMA accelerator (page 12, section 2.1, MMA architecture) to process the matrix-vector multiply compute operations and the multiply-add compute operations (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger).
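
Claim 10's citation to Bhat's section 4.3 is the standard blocked decomposition: tile the output so each tile matches the accumulator geometry, then sweep the shared dimension as a sum of outer products. A plain-C sketch of that general pattern, with a 4×4 tile chosen to mirror one MMA accumulator (the tile size, names, and divisibility assumption are ours, not Bhat's example code):

enum { TM = 4, TN = 4 };   /* output tile matching one 4x4 accumulator */

/* C = A * B for row-major A (MxK), B (KxN), C (MxN); assumes M % TM == 0
   and N % TN == 0 for brevity. Each tile accumulates K rank-1 updates,
   the same outer-product formulation the MMA instructions accelerate. */
void gemm_tiled(int M, int N, int K,
                const float *A, const float *B, float *C)
{
    for (int i0 = 0; i0 < M; i0 += TM)
        for (int j0 = 0; j0 < N; j0 += TN) {
            float acc[TM][TN] = {{0}};            /* stands in for an ACC */
            for (int k = 0; k < K; k++)           /* sum of outer products */
                for (int i = 0; i < TM; i++)
                    for (int j = 0; j < TN; j++)
                        acc[i][j] += A[(i0 + i) * K + k] * B[k * N + (j0 + j)];
            for (int i = 0; i < TM; i++)          /* write the tile back */
                for (int j = 0; j < TN; j++)
                    C[(i0 + i) * N + (j0 + j)] = acc[i][j];
        }
}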
Consider claim 14, Bhat discloses the system of claim 11 (see above), wherein the compute engine comprises a compute array (page 14, an array of blocks of Figure 2-2), and wherein based on one or more of the plurality of predefined instructions (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger), each row of the compute array is loaded with matrix data from a respective different one of the first plurality of input registers, which is multiplied with vector data from the second input register (for example, a compute array corresponds to two rows of the compute engine, and for a first predefined instruction specifying a first VSR and a second VSR, a first row is loaded with matrix data from the first VSR, which is multiplied with vector data from the second VSR, and for a second predefined instruction specifying a third VSR and the second VSR, a second row is loaded with matrix data from the third VSR, which is multiplied with vector data from the second VSR).

Consider claim 15, Bhat discloses the system of claim 11 (see above), wherein selectively feeding the input data from the defined set of input registers to the compute engine comprises feeding consecutive vector data elements from the defined set of input registers to a respective row of a compute array and wherein the processing comprises multiplying a scalar element from one defined register to each vector data element based on one or more of the plurality of predefined instructions to process the multiply-add compute operations of the compute engine (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger).

Consider claim 16, Bhat discloses a computer program product (page 6, section 1.3, IBM POWER7 processor; note that the IBM POWER7 has such a computer program product; also see the program code throughout Bhat) for accelerating operations (page 11, Matrix-Multiply Assist Architecture) of matrix-vector multiply, multiply-add compute and mixed matrix-matrix multiply and matrix-vector multiply compute patterns with a Matrix Math Assist (MMA) accelerator (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger), the computer program product comprising: a computer-readable storage medium comprising computer-readable program code embodied therewith, the computer-readable program code executable by one or more computer processors to perform an operation comprising (page 6, section 1.3, IBM POWER7 processor; note that the IBM POWER7 has such a computer program product; also see the program code throughout Bhat): configuring a defined set of input registers (page 12, section 2.2, VSRs) coupled to a compute engine (page 14, Figure 2-2, the blocks that are not registers, for example) of the MMA accelerator (page 11, Matrix-Multiply Assist Architecture), wherein the defined set of input registers comprises a first plurality of input registers configured to provide multiple matrix data inputs to the compute engine in a first direction and a second input register configured to provide a vector data input to the compute engine in a second direction orthogonal to the first direction (page 14, which shows that VSRs are used to input multiple matrix data inputs and a vector data input; page 14, Figure 2-2, which shows that VSRs may provide data from the top of Figure 2-2 to the bottom of Figure 2-2, and may provide data from the left of Figure 2-2 to the right of Figure 2-2), and wherein the first plurality of input registers represents a matrix comprising a plurality of rows and a plurality of columns (page 6, Figure 1-6, for example, which shows a plurality of VSRs, e.g., VSR0 and VSR1, represents a matrix comprising a plurality of rows, e.g., VSR0 as a first row of data and VSR1 as a second row of data, and a plurality of columns, e.g., a bit position of VSR0 in conjunction with that same bit position in VSR1); providing a plurality of predefined instructions to the compute engine to process at least one of: matrix-vector multiply compute operations or multiply-add compute operations (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger); selectively feeding input data from the defined set of input registers (page 12, section 2.2, VSRs; Figure 2-2, which shows input data from XA and XB in particular being selectively fed) to the compute engine (page 14, Figure 2-2, the blocks that are not registers, for example); processing, based on selectively feeding input data from the defined set of input registers (page 12, section 2.2, VSRs; Figure 2-2, which shows input data from XA and XB in particular being selectively fed) to the compute engine (page 14, Figure 2-2, the blocks that are not registers, for example), at least one of: the matrix-vector multiply compute operations or the multiply-add compute operations (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger), wherein the processing comprises using the input data from each of the first plurality of input registers in parallel with the input data from each other input register of the first plurality of input registers (page 14, which shows XA being used in parallel with XB); and generating compute results; accumulating the compute results using an accumulator of the compute engine (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2).

Consider claim 17, Bhat discloses the computer program product of claim 16 (see above), wherein the second input register and each of the first plurality of input registers are Vector Scalar Registers (VSRs) (page 12, section 2.2, VSRs) coupled to the compute engine (page 14, Figure 2-2, the blocks that are not registers, for example), wherein the VSRs have a size based on the compute engine (page 12, section 2.2, first paragraph, line 3, each VSR is 128 bits; in other words, the VSR is not, for example, one bit, since the compute engine operates on a greater number of bits).

Consider claim 18, Bhat discloses the computer program product of claim 16 (see above), wherein the plurality of predefined instructions comprise a plurality of predefined Vector Scalar Extension (VSX) instructions added to an instruction set architecture (ISA) (page 6, Vector Scalar Extension (VSX), VSX instruction, VSX ISA) of the MMA accelerator (page 12, section 2.1, MMA architecture) to process the matrix-vector multiply compute operations and the multiply-add compute operations (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger).

Consider claim 19, Bhat discloses the computer program product of claim 16 (see above), wherein the compute engine comprises a compute array (page 14, an array of blocks of Figure 2-2), and wherein based on one or more of the plurality of predefined instructions (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger), each row of the compute array is loaded with matrix data from a respective different one of the first plurality of input registers, which is multiplied with vector data from the second input register (for example, a compute array corresponds to two rows of the compute engine, and for a first predefined instruction specifying a first VSR and a second VSR, a first row is loaded with matrix data from the first VSR, which is multiplied with vector data from the second VSR, and for a second predefined instruction specifying a third VSR and the second VSR, a second row is loaded with matrix data from the third VSR, which is multiplied with vector data from the second VSR).
Consider claim 20, Bhat discloses the computer program product of claim 16 (see above), wherein selectively feeding the input data from the defined set of input registers to the compute engine comprises feeding consecutive vector data elements from the defined set of input registers to a respective row of a compute array and wherein the processing comprises multiplying a scalar element from one defined register to each vector data element based on one or more of the plurality of predefined instructions to process the multiply-add compute operations of the compute engine (page 11, new compute instructions for the matrix multiplication operation; page 14, each value in VSR 32 is multiplied with each value in VSR 33, generating a 4×4 array of 32-bit results (a total of 512 bits of output). The output is accumulated with the content of ACC1, as shown in Figure 2-2; page 4, vector outer product operation; page 14, xvf32ger).

Second Grounds of Rejection

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Bhat et al. (Bhat) (Matrix-Multiply Assist Best Practices Guide) in view of Applicant-Admitted Prior Art (AAPA). The rejections of claims 1-20 in this second grounds of rejection are identical to the rejections of claims 1-20 in the first grounds of rejection, except that, to any extent to which Bhat as cited does not explicitly, implicitly, or inherently disclose matrix-vector multiply and multiply-add operations under the broadest reasonable interpretation of these limitations, AAPA explicitly discloses the well-known matrix-vector multiply and multiply-add operations ([0002], last sentence, traditional MMA units only weakly support SIMD acceleration to perform regular multiply add or matrix-vector multiplication), and it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of AAPA with the invention of Bhat in order to increase the capability of the MMA via supporting matrix-vector multiply and multiply-add operations.

Response to Arguments

Applicant on page 9 argues: “As agreed during the interview, the claims as amended overcome the current rejection under 35 U.S.C. § 112(a). Therefore, withdrawal of this rejection is respectfully requested.” In view of the associated claim amendments, the previously presented rejections under 35 U.S.C. § 112(a) are withdrawn.
Applicant on page 10 argues: “With respect to Claims 1, 11, and 16, the Examiner argues that "it is indefinite, at least in part due to the 'wherein' language in line 14 and 'to perform the processing' language in lines 16-17, whether the 'each of the first set plurality of input registers is used in parallel with each other input register of the first plurality of input registers' step is a step that occurs as part of the recited processing (which appears to occur after the recited selective feeding), or a step that occurs as part of the recited selective feeding." Office Action, pp. 4-5. In the interest of expeditious prosecution, Applicant has amended Claims 1, 6, and 11 accordingly. Withdrawal of this rejection is respectfully requested.” In view of the associated claim amendments, the previously presented rejections under 35 U.S.C. § 112(b) are withdrawn.

Applicant across pages 11-12 argues: ‘Even assuming arguendo that Bhat's multiplication of XA and XB means they are used in parallel, this is distinct from "using the input data from each of the first plurality of input registers in parallel with the input data from each other input register of the first plurality of input registers" at least because XA and XB cannot both be part of the "first plurality of input registers." As recited earlier in the claim, the first plurality of input registers is "configured to provide multiple matrix data inputs to the compute engine in a first direction" and "represents a matrix comprising a plurality of rows and a plurality of columns." As Figure 2-2 of Bhat shows, XA and XB provide input in different directions.’

However, after full consideration of the broadest reasonable interpretation of the claim language, Examiner submits that a VSR of Bhat is “configured to” (e.g., arranged, put together, manufactured, and/or designed to) provide data input to the compute engine in a first direction, because the physical architecture and instruction set architecture of Bhat support a VSR being designated to correspond to XA. Similarly, a VSR of Bhat is also “configured to” (e.g., arranged, put together, manufactured, and/or designed to) provide data input to the compute engine in a second direction orthogonal to the first direction, because the physical architecture and instruction set architecture of Bhat support a VSR being designated to correspond to XB. Therefore, even if a particular instance of an instruction uses VSR 32 to provide the XA operand and VSR 33 to provide the XB operand, such does not preclude VSR 32 and VSR 33 from being “configured to” provide multiple matrix data inputs to the compute engine in a [same] first direction, under the broadest reasonable interpretation of “configured to”.

Applicant on page 12 argues: ‘Furthermore, neither XA nor XB of Bhat "represents a matrix comprising a plurality of rows and a plurality of columns" because both XA and XB of Bhat are input registers of vector (i.e., one-dimensional) data. For at least the foregoing reasons, Bhat fails to disclose these features of the claims.’

However, Examiner submits that a first VSR and a second VSR considered collectively may be reasonably considered to represent a matrix comprising a plurality of rows and a plurality of columns, as explained in the rejection above.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a).
Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEITH E VICARY whose telephone number is (571)270-1314. The examiner can normally be reached Monday to Friday, 9:00 AM to 5:00 PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached at (571)270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KEITH E VICARY/
Primary Examiner, Art Unit 2183

Prosecution Timeline

Mar 21, 2023 — Application Filed
Aug 14, 2024 — Non-Final Rejection — §102, §103, §112
Sep 04, 2024 — Applicant Interview (Telephonic)
Sep 04, 2024 — Examiner Interview Summary
Nov 08, 2024 — Response Filed
Nov 22, 2024 — Final Rejection — §102, §103, §112
Jan 08, 2025 — Examiner Interview Summary
Jan 08, 2025 — Applicant Interview (Telephonic)
Jan 22, 2025 — Response after Non-Final Action
Feb 27, 2025 — Request for Continued Examination
Mar 04, 2025 — Response after Non-Final Action
Apr 07, 2025 — Non-Final Rejection — §102, §103, §112
Jul 09, 2025 — Applicant Interview (Telephonic)
Jul 09, 2025 — Examiner Interview Summary
Jul 11, 2025 — Response Filed
Jul 21, 2025 — Final Rejection — §102, §103, §112
Sep 23, 2025 — Response after Non-Final Action
Oct 23, 2025 — Request for Continued Examination
Oct 25, 2025 — Response after Non-Final Action
Nov 15, 2025 — Non-Final Rejection — §102, §103, §112
Feb 11, 2026 — Applicant Interview (Telephonic)
Feb 11, 2026 — Examiner Interview Summary
Feb 19, 2026 — Response Filed
Apr 08, 2026 — Final Rejection — §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602349 — HANDLING DYNAMIC TENSOR LENGTHS IN A RECONFIGURABLE PROCESSOR THAT INCLUDES MULTIPLE MEMORY UNITS — granted Apr 14, 2026 (2y 5m to grant)
Patent 12572360 — Cache Preload Operations Using Streaming Engine — granted Mar 10, 2026 (2y 5m to grant)
Patent 12554507 — SYSTEMS AND METHODS FOR PROCESSING FORMATTED DATA IN COMPUTATIONAL STORAGE — granted Feb 17, 2026 (2y 5m to grant)
Patent 12554494 — APPARATUSES, METHODS, AND SYSTEMS FOR INSTRUCTIONS TO REQUEST A HISTORY RESET OF A PROCESSOR CORE — granted Feb 17, 2026 (2y 5m to grant)
Patent 12547401 — Load Instruction Fusion — granted Feb 10, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 7-8
Grant Probability: 58%
With Interview: 99% (+41.2%)
Median Time to Grant: 3y 8m
PTA Risk: High

Based on 683 resolved cases by this examiner. Grant probability derived from career allow rate.
