Prosecution Insights
Last updated: April 19, 2026
Application No. 18/930,671

SYSTEMS, METHODS, AND APPARATUSES FOR TILE MATRIX MULTIPLICATION AND ACCUMULATION

Non-Final OA: §103 and nonstatutory double patenting
Filed: Oct 29, 2024
Examiner: SUN, MICHAEL
Art Unit: 2183
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Intel Corporation
OA Round: 1 (Non-Final)
Grant Probability: 88% (Favorable)
Predicted OA Rounds: 1-2
Estimated Time to Grant: 2y 5m
Grant Probability with Interview: 87%

Examiner Intelligence

Career Allow Rate: 88% (679 granted / 768 resolved), +33.4% vs. TC average; grants above average
Interview Lift: -1.6% (minimal), measured over resolved cases with an interview
Typical Timeline: 2y 5m average prosecution; 17 applications currently pending
Career History: 785 total applications across all art units
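The headline percentages in these cards can be reproduced from the raw counts. A quick sanity check in Python (note the Tech Center baseline is inferred here from the reported +33.4% delta; it is not stated in the export):

```python
# Reproduce the examiner's career allowance rate from the raw counts
# quoted above (679 granted out of 768 resolved applications).
granted = 679
resolved = 768

allow_rate = granted / resolved
print(f"Career allowance rate: {allow_rate:.1%}")  # 88.4%, shown as 88%

# The card reports +33.4% vs. the Tech Center average, which implies
# a TC 2100 baseline of roughly 55% (inferred, not stated directly).
implied_tc_average = allow_rate - 0.334
print(f"Implied TC average: {implied_tc_average:.1%}")
```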

Statute-Specific Performance

§101: 5.8% (-34.2% vs. TC avg)
§103: 39.8% (-0.2% vs. TC avg)
§102: 36.9% (-3.1% vs. TC avg)
§112: 5.3% (-34.7% vs. TC avg)

Based on career data from 768 resolved cases; Tech Center averages are estimates.

Office Action

Rejections: 35 U.S.C. § 103 and nonstatutory double patenting
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Status of the Application

This Office Action is in response to Applicant's Continuation filed on 10/29/2024 and subsequent Preliminary Amendment filed 9/24/2025. Claims 25-39 are pending for this examination. Claims 1-24 were cancelled. Claims 25-39 have been added.

Information Disclosure Statement

The information disclosure statements (IDSs) submitted on 12/31/2024; 12/31/2024; 12/31/2024; 12/31/2024; 4/07/2025; and 8/13/2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Obviousness-Type Double Patenting

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA.

A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.

The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission.
For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.

Claims 25, 30, and 35 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 8, and 15 of U.S. Patent No. 11,086,623 (parent application s/n 16/487,787). Although the claims at issue are not identical, they are not patentably distinct from each other because claims 25, 30, and 35 of the instant Application respectively contain every element of claims 1, 8, and 15 of U.S. Patent No. 11,086,623 (in the original claim chart, underlining highlights the differences between the two), and as such are anticipated by the claims of U.S. Patent No. 11,086,623:

Instant Application, independent claim 25:

A processor, comprising: fetch circuitry to fetch a single instruction having an opcode, and one or more operands to indicate a first source matrix tile comprising a first plurality of source matrix data elements, a second source matrix tile comprising a second plurality of source matrix data elements, and a third source matrix tile comprising a third plurality of source matrix data elements; and a register file comprising a plurality of vector registers, each vector register to store a plurality of packed matrix data elements, wherein a first set of vector registers is to be allocated to store the first source matrix tile, a second set of the vector registers is to be allocated to store the second source matrix tile, and a third set of the vector registers is to be allocated to store the third source matrix tile and a result matrix tile; execution circuitry comprising a plurality of multiply-accumulate circuits coupled to the plurality of vector registers, the execution circuitry to perform operations corresponding to the fetched single instruction, wherein the execution circuitry is to: read the first source matrix tile from the first set of vector registers, the second source matrix tile from the second set of vector registers, and the third source matrix tile from the third set of vector registers; multiply the first plurality of source matrix data elements by corresponding source matrix data elements of the second plurality of source matrix data elements to generate a corresponding plurality of products; add each data element of the third plurality of source matrix data elements to one or more corresponding products of the plurality of products to generate a corresponding result data element of the result matrix tile, each result data element to be stored in the third set of vector registers in a data element location of a corresponding data element of the third plurality of source matrix data elements.

U.S. Patent No. 11,086,623, independent claim 1:

A processor comprising: decode circuitry to decode an instruction having fields for an opcode, an identifier for a first source matrix operand, an identifier of a second source matrix operand, and an identifier for a source/destination matrix operand, wherein the matrices are multi-dimensional and the opcode is to indicate that execution circuitry is to multiply the identified first source matrix operand by the identified second source matrix operand, add a result of the multiplication to the identified source/destination matrix operand, and store a result of the addition in the identified source/destination matrix operand; and execution circuitry to execute the decoded instruction to multiply the identified first source matrix operand by the identified second source matrix operand, add a result of the multiplication to the identified source/destination matrix operand, and store a result of the addition in the identified source/destination matrix operand.

Analysis: The Examiner points out that the instant claims are a more specific, narrower version of the claims of U.S. Patent No. 11,086,623. The general idea shared between the two is an instruction with an opcode and three operands, where the first operand is a source matrix, the second operand is another source matrix to be multiplied with the first, and the resulting products are then added to the contents of the third operand (the accumulation operation), a source/destination matrix that both supplies the third matrix for the adding operation and stores the result of the adding. Although the parent claims recite decoding the instruction first, the major elements of the multiply-accumulate operation are performed in both cases. As such, the instant claims, which contain many more details, would be anticipated by the already allowed broader claims of U.S. Patent No. 11,086,623.

Instant Application, independent claim 30:

A method comprising: fetching a single instruction having an opcode, and one or more operands to indicate a first source matrix tile comprising a first plurality of source matrix data elements, a second source matrix tile comprising a second plurality of source matrix data elements, and a third source matrix tile comprising a third plurality of source matrix data elements; and decoding the single instruction; executing the decoded single instruction using execution circuitry by: reading the first source matrix tile from a first set of vector registers, the second source matrix tile from a second set of vector registers, and the third source matrix tile from a third set of vector registers; multiplying the first plurality of source matrix data elements by corresponding source matrix data elements of the second plurality of source matrix data elements to generate a corresponding plurality of products; adding each data element of the third plurality of source matrix data elements to one or more corresponding products of the plurality of products to generate a corresponding result data element of the result matrix tile, each result data element to be stored in the third set of vector registers in a data element location of a corresponding data element of the third plurality of source matrix data elements.

U.S. Patent No. 11,086,623, independent claim 8:

A method comprising: decoding an instruction having fields for an opcode, an identifier for a first source matrix operand, an identifier of a second source matrix operand, and an identifier for a source/destination matrix operand, wherein the matrices are multi-dimensional and the opcode is to indicate the decoded instruction is to cause execution circuitry to multiply the identified first source matrix operand by the identified second source matrix operand, add a result of the multiplication to the identified source/destination matrix operand, and store a result of the addition in the identified source/destination matrix operand; and executing the decoded instruction to multiply the identified first source matrix operand by the identified second source matrix operand, add a result of the multiplication to the identified source/destination matrix operand, and store a result of the addition in the identified source/destination matrix operand.

Analysis: The Examiner points out that the instant claims are a more specific, narrower version of the claims of U.S. Patent No. 11,086,623. The general idea shared between the two is an instruction with an opcode and three operands, where the first operand is a source matrix, the second operand is another source matrix to be multiplied with the first, and the resulting products are then added to the contents of the third operand (the accumulation operation), a source/destination matrix that both supplies the third matrix for the adding operation and stores the result of the adding. Although the parent claims recite decoding the instruction first, the major elements of the multiply-accumulate operation are performed in both cases. As such, the instant claims, which contain many more details, would be anticipated by the already allowed broader claims of U.S. Patent No. 11,086,623.

Instant Application, independent claim 35:

A non-transitory machine readable medium having stored thereon an instance of a single instruction which, when handled by an apparatus, is to cause a method to be performed, the method comprising: fetching a single instruction having an opcode, and one or more operands to indicate a first source matrix tile comprising a first plurality of source matrix data elements, a second source matrix tile comprising a second plurality of source matrix data elements, and a third source matrix tile comprising a third plurality of source matrix data elements; and decoding the single instruction; executing the decoded single instruction using execution circuitry by: reading the first source matrix tile from a first set of vector registers, the second source matrix tile from a second set of vector registers, and the third source matrix tile from a third set of vector registers; multiplying the first plurality of source matrix data elements by corresponding source matrix data elements of the second plurality of source matrix data elements to generate a corresponding plurality of products; adding each data element of the third plurality of source matrix data elements to one or more corresponding products of the plurality of products to generate a corresponding result data element of the result matrix tile, each result data element to be stored in the third set of vector registers in a data element location of a corresponding data element of the third plurality of source matrix data elements.
U.S. Patent No. 11,086,623, independent claim 15:

A non-transitory machine-readable medium storing an instruction which causes a processor to perform a method, the method comprising: decoding the instruction having fields for an opcode, an identifier for a first source matrix operand, an identifier of a second source matrix operand, and an identifier for a source/destination matrix operand, wherein the matrices are multi-dimensional and the opcode is to indicate the decoded instruction is to cause execution circuitry to multiply the identified first source matrix operand by the identified second source matrix operand, add a result of the multiplication to the identified source/destination matrix operand, and store a result of the addition in the identified source/destination matrix operand; and executing the decoded instruction to multiply the identified first source matrix operand by the identified second source matrix operand, add a result of the multiplication to the identified source/destination matrix operand, and store a result of the addition in the identified source/destination matrix operand.

Analysis: The Examiner points out that the instant claims are a more specific, narrower version of the claims of U.S. Patent No. 11,086,623. The general idea shared between the two is an instruction with an opcode and three operands, where the first operand is a source matrix, the second operand is another source matrix to be multiplied with the first, and the resulting products are then added to the contents of the third operand (the accumulation operation), a source/destination matrix that both supplies the third matrix for the adding operation and stores the result of the adding. Although the parent claims recite decoding the instruction first, the major elements of the multiply-accumulate operation are performed in both cases. As such, the instant claims, which contain many more details, would be anticipated by the already allowed broader claims of U.S. Patent No. 11,086,623.

Claim 25 is rejected on the ground of nonstatutory double patenting as being unpatentable over claim 1 of U.S. Patent No. 12,147,804 (parent application s/n 17/382,917). Although the claims at issue are not identical, they are not patentably distinct from each other because claim 25 of the instant Application contains every element of claim 1 of U.S. Patent No. 12,147,804 (in the original claim chart, underlining highlights the differences between the two), and as such is anticipated by the claims of U.S. Patent No. 12,147,804:

Instant Application, independent claim 25:

A processor, comprising: fetch circuitry to fetch a single instruction having an opcode, and one or more operands to indicate a first source matrix tile comprising a first plurality of source matrix data elements, a second source matrix tile comprising a second plurality of source matrix data elements, and a third source matrix tile comprising a third plurality of source matrix data elements; and a register file comprising a plurality of vector registers, each vector register to store a plurality of packed matrix data elements, wherein a first set of vector registers is to be allocated to store the first source matrix tile, a second set of the vector registers is to be allocated to store the second source matrix tile, and a third set of the vector registers is to be allocated to store the third source matrix tile and a result matrix tile; execution circuitry comprising a plurality of multiply-accumulate circuits coupled to the plurality of vector registers, the execution circuitry to perform operations corresponding to the fetched single instruction, wherein the execution circuitry is to: read the first source matrix tile from the first set of vector registers, the second source matrix tile from the second set of vector registers, and the third source matrix tile from the third set of vector registers; multiply the first plurality of source matrix data elements by corresponding source matrix data elements of the second plurality of source matrix data elements to generate a corresponding plurality of products; add each data element of the third plurality of source matrix data elements to one or more corresponding products of the plurality of products to generate a corresponding result data element of the result matrix tile, each result data element to be stored in the third set of vector registers in a data element location of a corresponding data element of the third plurality of source matrix data elements.

U.S. Patent No. 12,147,804, independent claim 1:

A processor comprising: decode circuitry to decode an instance of a single instruction having fields for an opcode, an identifier for a first source multidimensional matrix operand, an identifier of a second source multidimensional matrix operand, and an identifier for a source/destination multidimensional matrix operand; and execution circuitry to execute the decoded instance of the single instruction to multiply the identified first source multidimensional matrix operand by the identified second source multidimensional matrix operand, add a result of the multiplication to the identified source/destination multidimensional matrix operand, and store a result of the addition in the identified source/destination multidimensional matrix operand.

Analysis: The Examiner points out that the instant claims are a more specific, narrower version of the claims of U.S. Patent No. 12,147,804. The general idea shared between the two is an instruction with an opcode and three operands, where the first operand is a source matrix, the second operand is another source matrix to be multiplied with the first, and the resulting products are then added to the contents of the third operand (the accumulation operation), a source/destination matrix that both supplies the third matrix for the adding operation and stores the result of the adding. Although the parent claims recite decoding the instruction first, the major elements of the multiply-accumulate operation are performed in both cases. As such, the instant claims, which contain many more details, would be anticipated by the already allowed broader claims of U.S. Patent No. 12,147,804.

Claim Rejections - 35 U.S.C. § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 25-29 are rejected under 35 U.S.C. 103 as being unpatentable over Nair et al. (US 2004/0111587), herein referred to as Nair ‘587, in view of Sih et al. (US 2005/0198472), herein referred to as Sih ‘472, and further in view of Yeh et al. (US 2003/0105943), herein referred to as Yeh ‘943.

Referring to claim 25, Nair ‘587 teaches a processor (see Fig. 1, matrix data coprocessor 120), comprising: fetch circuitry (see
Paragraph 0021, where instructions are fetched from memory) to fetch a single instruction (see Paragraph 0050, wherein a single instruction can be used to implement the matrix operations; see Fig. 4) having an opcode, and one or more operands to indicate a first source matrix tile comprising a first plurality of source matrix data elements, a second source matrix tile comprising a second plurality of source matrix data elements, and a third source matrix tile comprising a third plurality of source matrix data elements (see Fig. 4, instruction 400 having an opcode field, two source matrix fields M.sub.x, M.sub.y, and one destination field M.sub.d, where the processor can perform a plurality of matrix operations including matrix-accumulate, see Paragraph 0090 and Table 1; see Paragraphs 0043 and 0047, wherein the data elements are packed matrix data); and a register file (see Fig. 1, register set 200; see Paragraphs 0048-0049, wherein the register file stores operands and data elements) comprising a plurality of vector registers (see Paragraph 0045, wherein the microprocessors can include vector architectures, which would mean the data set can be configured as matrices and vectors, i.e. the registers that store the matrices / vectors would also be configured as such), each vector register to store a plurality of packed matrix data elements, wherein a first set of vector registers is to be allocated to store the first source matrix tile, a second set of the vector registers is to be allocated to store the second source matrix tile, and a third set of the vector registers is to be allocated to store the third source matrix tile and a result matrix tile (see Paragraphs 0047-0049, wherein registers can store the data of the two source matrix fields M.sub.x, M.sub.y, and the one destination field M.sub.d); execution circuitry (see Paragraph 0031, wherein instructions are executed by processors / coprocessors, which would imply the existence of execution units in each processor), the execution circuitry to perform operations corresponding to the fetched single instruction (see Fig. 7), wherein the execution circuitry is to: read the first source matrix tile from the first set of vector registers, the second source matrix tile from the second set of vector registers (see Fig. 7, identify elements of source matrices 730 and 740); multiply the first plurality of source matrix data elements by corresponding source matrix data elements of the second plurality of source matrix data elements to generate a corresponding plurality of products (see Fig. 7, perform matrix operation to generate results 750, wherein in a multiply-accumulate operation the first operation performed is a multiplication of the two source matrices); add each data element of the third plurality of source matrix data elements to one or more corresponding products of the plurality of products to generate a corresponding result data element of the result matrix tile (see Paragraph 0090 and Table 1, where the processor can perform a plurality of matrix operations including a matrix-accumulate operation, which would involve multiplying the two source matrices and accumulating the results with a destination matrix, which is known in the art to mean adding the results of the multiplication to whatever is stored in the destination matrix), each result data element to be stored in the third set of vector registers in a data element location of a corresponding data element of the third plurality of source matrix data elements (see Fig. 7, store results in destination matrix 770).

However, Nair ‘587 does not teach the execution unit comprising a plurality of multiply-accumulate circuits coupled to the plurality of vector registers, or the reading of the third source matrix tile from the third set of vector registers.

Sih ‘472 teaches a digital signal processor system (see Abstract) that utilizes a dual multiply-accumulate (MAC) unit connected to the output of the register file (see Fig. 1, multipliers 122 and adders 140 connected to register file 110). Nair ‘587 and Sih ‘472 apply as analogous prior art, as both pertain to the same field of endeavor of processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Nair ‘587 system as set forth above to have a plurality of multiply-accumulate units connected to the registers for the purposes of performing the multiply-add / multiply-accumulate operations, as taught by Sih ‘472, as a person of ordinary skill in the art would be motivated to include multiple MAC units to allow for a higher number of operations per unit of time and to provide flexibility to perform different types of operations concurrently, allowing for better utilization of available hardware (see Paragraph 0006).

Yeh ‘943 teaches a pipelined processor system (see Abstract), wherein an instruction that can be queued for execution can include first source, second source, and third source fields for implementing multiply-add operations (see Paragraph 0040). Nair ‘587, Sih ‘472, and Yeh ‘943 apply as analogous prior art, as all three pertain to the same field of endeavor of processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Nair ‘587 and Sih ‘472 as set forth above to have the multiply-add / multiply-accumulate instructions include three source operands, such that reading the operands to perform the multiply-add / multiply-accumulate operation would involve reading three source operands, as taught by Yeh ‘943, as a person of ordinary skill in the art would recognize that a multiply-accumulate operation requires three different operands: at least two sources, and a third operand which is normally a source/destination operand, where the data in this third operand is simply added to the results of the multiply operation, i.e. why it is called an accumulate operation. The Examiner points out that naming the third operand as a source operand or a destination operand for a multiply-accumulate operation is a matter of design choice, as Applicant's claim language here clearly indicates this third operand is where the results of the add operations are stored, meaning the third source is being used as a source/destination operand.

As to claim 26, Nair ‘587 does not teach the processor of claim 25, wherein the execution circuitry comprises a plurality of lanes in which to perform parallel multiplications of the first plurality of source matrix data elements by corresponding source matrix data elements of the second plurality of source matrix data elements. Sih ‘472 teaches a digital signal processor system (see Abstract) that utilizes a dual multiply-accumulate (MAC) unit connected to the output of the register file (see Fig. 1, multipliers 122 and adders 140 connected to register file 110), wherein the register file can output / execute multiple multiply operations in parallel. Nair ‘587 and Sih ‘472 apply as analogous prior art, as both pertain to the same field of endeavor of processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.
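As an aside for readers less familiar with the terminology used throughout these rejections: the operation both the claims and the cited references describe is the standard tiled multiply-accumulate update, C[i][j] += sum over k of A[i][k] * B[k][j], with the third (source/destination) tile serving as the accumulator. A minimal sketch in plain Python, purely for illustration (the function name, shapes, and scalar loop are assumptions of this sketch, not taken from the claims or the cited references):

```python
# Illustrative sketch of a tile multiply-accumulate: each element of
# the source/destination tile c is added to the corresponding dot
# product of a row of a and a column of b, and the result is stored
# back in place (the "accumulate" step).
def tile_multiply_accumulate(a, b, c):
    rows, inner, cols = len(a), len(b), len(b[0])
    for i in range(rows):
        for j in range(cols):
            acc = c[i][j]                 # third source operand element
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # products of corresponding elements
            c[i][j] = acc                 # third operand doubles as destination
    return c

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 0], [0, 1]]
print(tile_multiply_accumulate(a, b, c))  # [[20, 22], [43, 51]]
```

Whether the third operand is labeled a "source/destination" (as in the parent patents) or a third "source" whose register set also receives the result (as in the instant claims), the update performed is the same.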
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Nair ‘587 system as set forth above to have a plurality of multiply-accumulate units connected to the registers for the purposes of performing the multiply-add / multiply-accumulate operations, where multiple multiplication operations are done in parallel lanes, as taught by Sih ‘472, as a person of ordinary skill in the art would be motivated to include multiple MAC units to allow for higher number of operations per unit of time and provide flexibility to perform different types of operations concurrently to allow for better utilization of available hardware (see Paragraph 0006). As to claim 27, Nair ‘587 does not specifically teach the processor of claim 26, wherein each lane of the plurality of lanes has a width equal to a width of the plurality of result matrix data elements and the third plurality of source matrix data elements. Sih ‘472 teaches a digital signal processor system (see Abstract) that utilizes a dual multiply-accumulate (MAC) unit connected to the output of the register file (see Fig. 1, multipliers 122 and adders 140 connected to register file 110), wherein the register file can output / execute multiple multiply operations in parallel. Nair ‘587 and Sih ‘472 apply as analogous prior arts as both pertain to the same field of endeavor of processor system executing instructions, particularly multiply-add / multiply-accumulate instructions. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Nair ‘587 system as set forth above to have a plurality of multiply-accumulate units connected to the registers for the purposes of performing the multiply-add / multiply-accumulate operations, where multiple multiplication operations are done in parallel lanes, where the width of the lanes would be equal to the width of the data elements of the third operand / result data, as taught by Sih ‘472, as a person of ordinary skill in the art would be motivated to include multiple MAC units to allow for higher number of operations per unit of time and provide flexibility to perform different types of operations concurrently to allow for better utilization of available hardware (see Paragraph 0006), wherein Examiner points out that the width of the lanes, which is used to transmit data would inherently need to be of the same size as the width of the third operand / destination operand in order to complete a multiply-accumulate operations, i.e. the product of the multiplication has to have the same width size (in bits) as the data to be added in order to work. As to claim 28, Nair ‘587 teaches the processor of claim 25, wherein the single instruction specifies a data element size and format of 8-bit integer for the first and second plurality of source data elements and specifies a data element size and format of 32-bit integer for the third plurality of source matrix data elements (see Paragraph 0006, wherein data elements can be 32, 64, and 128 bits, with smaller representative data resolutions of 4, 8, and 16 bits). 
Yeh ‘943 teaches a pipelined processor system (see Abstract), wherein an instruction that can be queued for execution can include first source, second source, and third source fields for implementing multiply-add operations (see Paragraph 0040), where data can be implemented in integer or floating point formats (see Paragraph 0041). Nair ‘587, Sih ‘472, and Yeh ‘943 are analogous prior art, as all pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined Nair ‘587 and Sih ‘472 system as set forth above to have the data implemented in integer or floating point format, as taught by Yeh ‘943, as a person of ordinary skill in the art would recognize that these two data formats are known in the art, where Yeh ‘943 specifically provides execution units for both integer and floating point types in its processors, thereby allowing for the usage of integer format, floating point format, or a combination of both.

As to claim 29, Nair ‘587 teaches the processor of claim 25, wherein the single instruction specifies a data element size and format of 16-bit floating point (FP16) for the first and second plurality of source data elements and specifies a data element size and format of 32-bit floating point (FP32) for the third plurality of source matrix data elements (see Paragraph 0006, wherein data elements can be 32, 64, and 128 bits, with smaller representative data resolutions of 4, 8, and 16 bits).
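The FP16-source / FP32-accumulator pairing recited in claim 29 can be illustrated with a small sketch (illustrative only; the helper names are assumptions, and Python's `struct` half/single rounding stands in for the claimed hardware formats):

```python
import struct

def fp16(x):
    # round a Python float to IEEE 754 half precision (models an FP16 source element)
    return struct.unpack('<e', struct.pack('<e', x))[0]

def fp32(x):
    # round to IEEE 754 single precision (models the FP32 accumulator)
    return struct.unpack('<f', struct.pack('<f', x))[0]

def fp16_fp32_mac(a, b, acc):
    # multiply two FP16-rounded sources, accumulate in an FP32 destination
    return fp32(acc + fp16(a) * fp16(b))
```

The wider FP32 accumulator preserves precision across many accumulations that the 10-bit FP16 mantissa would otherwise lose.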
Yeh ‘943 teaches a pipelined processor system (see Abstract), wherein an instruction that can be queued for execution can include first source, second source, and third source fields for implementing multiply-add operations (see Paragraph 0040), where data can be implemented in integer or floating point formats (see Paragraph 0041). Nair ‘587, Sih ‘472, and Yeh ‘943 are analogous prior art, as all pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined Nair ‘587 and Sih ‘472 system as set forth above to have the data implemented in integer or floating point format, as taught by Yeh ‘943, as a person of ordinary skill in the art would recognize that these two data formats are known in the art, where Yeh ‘943 specifically provides execution units for both integer and floating point types in its processors, thereby allowing for the usage of integer format, floating point format, or a combination of both.

Claims 30, 33-35, and 38-39 are rejected under 35 U.S.C. 103 as being unpatentable over Nair ‘587 in view of Yeh ‘943.

Referring to claim 30, Nair ‘587 teaches a method (see Abstract) comprising: fetching (see Paragraph 0021, where instructions are fetched from memory) a single instruction (see Paragraph 0050, wherein a single instruction can be used to implement the matrix operations, see Fig. 4) having an opcode, and one or more operands to indicate a first source matrix tile comprising a first plurality of source matrix data elements, a second source matrix tile comprising a second plurality of source matrix data elements, and a third source matrix tile comprising a third plurality of source matrix data elements (see Fig.
4, instruction 400 having an opcode field, two source matrix fields M.sub.x and M.sub.y, and one destination field M.sub.d, where the processor can perform a plurality of matrix operations including matrix-accumulate, see Paragraph 0090 and Table 1; see Paragraphs 0043 and 0047, wherein the data elements are packed matrix data); and decoding the single instruction (see Paragraph 0024, wherein instruction decoding is done, usually before the execution of the instruction); executing the decoded single instruction using execution circuitry (see Paragraph 0031, wherein instructions are executed by processors / coprocessors, which would imply the existence of execution units in each processor; see Fig. 7) by: reading the first source matrix tile from a first set of vector registers, the second source matrix tile from a second set of vector registers (see Fig. 7, identify elements on source matrices 730 and 740); multiplying the first plurality of source matrix data elements by corresponding source matrix data elements of the second plurality of source matrix data elements to generate a corresponding plurality of products (see Fig.
7, perform matrix operation to generate results 750, wherein in a multiply-accumulate operation, the first operation done is a multiplication of the two source matrices); adding each data element of the third plurality of source matrix data elements to one or more corresponding products of the plurality of products to generate a corresponding result data element of the result matrix tile (see Paragraph 0090 and Table 1, where the processor can perform a plurality of matrix operations including a matrix-accumulate operation, which would involve multiplying the two source matrices and accumulating the results with a destination matrix, which, as known in the art, means adding the results of the multiplication to whatever is stored in the destination matrix), each result data element to be stored in the third set of vector registers in a data element location of a corresponding data element of the third plurality of source matrix data elements (see Fig. 7, store results in destination matrix 770).

However, Nair ‘587 does not teach reading the third source matrix tile from a third set of vector registers. Yeh ‘943 teaches a pipelined processor system (see Abstract), wherein an instruction that can be queued for execution can include first source, second source, and third source fields for implementing multiply-add operations (see Paragraph 0040). Nair ‘587 and Yeh ‘943 are analogous prior art, as both pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Nair ‘587 system as set forth above to have the multiply-add / multiply-accumulate instructions include three source operands, such that reading of operands to perform the multiply-add / multiply-accumulate operation would involve reading three source operands, as taught by Yeh ‘943, as a person of ordinary skill in the art would recognize that a multiply-accumulate operation requires three different operands: at least two sources, and a third operand which is normally a source/destination operand, where the data in this third operand is simply added to the results of the multiply operation (i.e., why it is called an accumulate operation). Examiner points out that naming the third operand as a source operand or a destination operand for a multiply-accumulate operation is a matter of design choice, as Applicant's claim language here clearly indicates this third operand is where the results of the add operations are stored, meaning the third source is being used as a source/destination operand.

Referring to claim 33, Nair ‘587 teaches the method of claim 30, wherein the single instruction specifies a data element size and format of 8-bit integer for the first and second plurality of source data elements and specifies a data element size and format of 32-bit integer for the third plurality of source matrix data elements (see Paragraph 0006, wherein data elements can be 32, 64, and 128 bits, with smaller representative data resolutions of 4, 8, and 16 bits).
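The claim 30 execution steps (multiply corresponding source elements, add the third source/destination tile, store in place) can be sketched as follows. This is a minimal model of the claimed semantics only; register reads/writes and the parallel lanes of the real execution circuitry are not modelled, and the function name is an illustrative assumption:

```python
def tile_mac(a, b, c):
    """Tile multiply-accumulate: C[i][j] += sum over k of A[i][k] * B[k][j]."""
    rows, inner, cols = len(a), len(b), len(b[0])
    for i in range(rows):
        for j in range(cols):
            acc = c[i][j]                 # third source: prior accumulator value
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # products of corresponding source elements
            c[i][j] = acc                 # result stored back in the same element location
    return c
```

Writing each result back into `c`'s own element location mirrors the claim language that the third source tile doubles as the destination.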
Yeh ‘943 teaches a pipelined processor system (see Abstract), wherein an instruction that can be queued for execution can include first source, second source, and third source fields for implementing multiply-add operations (see Paragraph 0040), where data can be implemented in integer or floating point formats (see Paragraph 0041). Nair ‘587 and Yeh ‘943 are analogous prior art, as both pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Nair ‘587 system as set forth above to have the data implemented in integer or floating point format, as taught by Yeh ‘943, as a person of ordinary skill in the art would recognize that these two data formats are known in the art, where Yeh ‘943 specifically provides execution units for both integer and floating point types in its processors, thereby allowing for the usage of integer format, floating point format, or a combination of both.

As to claim 34, Nair ‘587 teaches the method of claim 30, wherein the single instruction specifies a data element size and format of 16-bit floating point (FP16) for the first and second plurality of source data elements and specifies a data element size and format of 32-bit floating point (FP32) for the third plurality of source matrix data elements (see Paragraph 0006, wherein data elements can be 32, 64, and 128 bits, with smaller representative data resolutions of 4, 8, and 16 bits).
Yeh ‘943 teaches a pipelined processor system (see Abstract), wherein an instruction that can be queued for execution can include first source, second source, and third source fields for implementing multiply-add operations (see Paragraph 0040), where data can be implemented in integer or floating point formats (see Paragraph 0041). Nair ‘587 and Yeh ‘943 are analogous prior art, as both pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Nair ‘587 system as set forth above to have the data implemented in integer or floating point format, as taught by Yeh ‘943, as a person of ordinary skill in the art would recognize that these two data formats are known in the art, where Yeh ‘943 specifically provides execution units for both integer and floating point types in its processors, thereby allowing for the usage of integer format, floating point format, or a combination of both.

As to claim 35, Nair ‘587 teaches a non-transitory machine readable medium (see Paragraphs 0101-0102) having stored thereon an instance of a single instruction (see Paragraph 0050, wherein a single instruction can be used to implement the matrix operations, see Fig. 4) which, when handled by an apparatus (see Fig. 1, system 100 with matrix data coprocessor 120), is to cause a method to be performed (see Abstract), the method comprising: fetching (see Paragraph 0021, where instructions are fetched from memory) a single instruction (see Paragraph 0050, wherein a single instruction can be used to implement the matrix operations, see Fig.
4) having an opcode, and one or more operands to indicate a first source matrix tile comprising a first plurality of source matrix data elements, a second source matrix tile comprising a second plurality of source matrix data elements, and a third source matrix tile comprising a third plurality of source matrix data elements (see Fig. 4, instruction 400 having an opcode field, two source matrix fields M.sub.x and M.sub.y, and one destination field M.sub.d, where the processor can perform a plurality of matrix operations including matrix-accumulate, see Paragraph 0090 and Table 1; see Paragraphs 0043 and 0047, wherein the data elements are packed matrix data); and decoding the single instruction (see Paragraph 0024, wherein instruction decoding is done, usually before the execution of the instruction); executing the decoded single instruction using execution circuitry (see Paragraph 0031, wherein instructions are executed by processors / coprocessors, which would imply the existence of execution units in each processor; see Fig. 7) by: reading the first source matrix tile from a first set of vector registers, the second source matrix tile from a second set of vector registers (see Fig. 7, identify elements on source matrices 730 and 740); multiplying the first plurality of source matrix data elements by corresponding source matrix data elements of the second plurality of source matrix data elements to generate a corresponding plurality of products (see Fig.
7, perform matrix operation to generate results 750, wherein in a multiply-accumulate operation, the first operation done is a multiplication of the two source matrices); adding each data element of the third plurality of source matrix data elements to one or more corresponding products of the plurality of products to generate a corresponding result data element of the result matrix tile (see Paragraph 0090 and Table 1, where the processor can perform a plurality of matrix operations including a matrix-accumulate operation, which would involve multiplying the two source matrices and accumulating the results with a destination matrix, which, as known in the art, means adding the results of the multiplication to whatever is stored in the destination matrix), each result data element to be stored in the third set of vector registers in a data element location of a corresponding data element of the third plurality of source matrix data elements (see Fig. 7, store results in destination matrix 770).

However, Nair ‘587 does not teach reading the third source matrix tile from a third set of vector registers. Yeh ‘943 teaches a pipelined processor system (see Abstract), wherein an instruction that can be queued for execution can include first source, second source, and third source fields for implementing multiply-add operations (see Paragraph 0040). Nair ‘587 and Yeh ‘943 are analogous prior art, as both pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Nair ‘587 system as set forth above to have the multiply-add / multiply-accumulate instructions include three source operands, such that reading of operands to perform the multiply-add / multiply-accumulate operation would involve reading three source operands, as taught by Yeh ‘943, as a person of ordinary skill in the art would recognize that a multiply-accumulate operation requires three different operands: at least two sources, and a third operand which is normally a source/destination operand, where the data in this third operand is simply added to the results of the multiply operation (i.e., why it is called an accumulate operation). Examiner points out that naming the third operand as a source operand or a destination operand for a multiply-accumulate operation is a matter of design choice, as Applicant's claim language here clearly indicates this third operand is where the results of the add operations are stored, meaning the third source is being used as a source/destination operand.

As to claim 38, Nair ‘587 teaches the non-transitory machine readable medium of claim 35, wherein the single instruction specifies a data element size and format of 8-bit integer for the first and second plurality of source data elements and specifies a data element size and format of 32-bit integer for the third plurality of source matrix data elements (see Paragraph 0006, wherein data elements can be 32, 64, and 128 bits, with smaller representative data resolutions of 4, 8, and 16 bits).
Yeh ‘943 teaches a pipelined processor system (see Abstract), wherein an instruction that can be queued for execution can include first source, second source, and third source fields for implementing multiply-add operations (see Paragraph 0040), where data can be implemented in integer or floating point formats (see Paragraph 0041). Nair ‘587 and Yeh ‘943 are analogous prior art, as both pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Nair ‘587 system as set forth above to have the data implemented in integer or floating point format, as taught by Yeh ‘943, as a person of ordinary skill in the art would recognize that these two data formats are known in the art, where Yeh ‘943 specifically provides execution units for both integer and floating point types in its processors, thereby allowing for the usage of integer format, floating point format, or a combination of both.

As to claim 39, Nair ‘587 teaches the non-transitory machine readable medium of claim 35, wherein the single instruction specifies a data element size and format of 16-bit floating point (FP16) for the first and second plurality of source data elements and specifies a data element size and format of 32-bit floating point (FP32) for the third plurality of source matrix data elements (see Paragraph 0006, wherein data elements can be 32, 64, and 128 bits, with smaller representative data resolutions of 4, 8, and 16 bits).
Yeh ‘943 teaches a pipelined processor system (see Abstract), wherein an instruction that can be queued for execution can include first source, second source, and third source fields for implementing multiply-add operations (see Paragraph 0040), where data can be implemented in integer or floating point formats (see Paragraph 0041). Nair ‘587 and Yeh ‘943 are analogous prior art, as both pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Nair ‘587 system as set forth above to have the data implemented in integer or floating point format, as taught by Yeh ‘943, as a person of ordinary skill in the art would recognize that these two data formats are known in the art, where Yeh ‘943 specifically provides execution units for both integer and floating point types in its processors, thereby allowing for the usage of integer format, floating point format, or a combination of both.

Claims 31-32 and 36-37 are rejected under 35 U.S.C. 103 as being unpatentable over Nair ‘587 in view of Yeh ‘943, and further in view of Sih ‘472.

As to claim 31, Nair ‘587 does not teach the method of claim 30, wherein the execution circuitry comprises a plurality of lanes in which to perform parallel multiplications of the first plurality of source matrix data elements by corresponding source matrix data elements of the second plurality of source matrix data elements. Sih ‘472 teaches a digital signal processor system (see Abstract) that utilizes a dual multiply-accumulate (MAC) unit connected to the output of the register file (see Fig. 1, multipliers 122 and adders 140 connected to register file 110), wherein multiple multiply operations can be output from the register file and executed in parallel.
Nair ‘587, Yeh ‘943, and Sih ‘472 are analogous prior art, as all pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined Nair ‘587 and Yeh ‘943 system as set forth above to have a plurality of multiply-accumulate units connected to the registers for the purposes of performing the multiply-add / multiply-accumulate operations, where multiple multiplication operations are done in parallel lanes, as taught by Sih ‘472, as a person of ordinary skill in the art would be motivated to include multiple MAC units to allow for a higher number of operations per unit of time and to provide flexibility to perform different types of operations concurrently, allowing for better utilization of available hardware (see Paragraph 0006).

As to claim 32, Nair ‘587 does not specifically teach the method of claim 31, wherein each lane of the plurality of lanes has a width equal to a width of the plurality of result matrix data elements and the third plurality of source matrix data elements. Sih ‘472 teaches a digital signal processor system (see Abstract) that utilizes a dual multiply-accumulate (MAC) unit connected to the output of the register file (see Fig. 1, multipliers 122 and adders 140 connected to register file 110), wherein multiple multiply operations can be output from the register file and executed in parallel. Nair ‘587, Yeh ‘943, and Sih ‘472 are analogous prior art, as all pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.
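The "plurality of lanes" of claims 31-32 can be pictured as equal-width slices of a packed register, each feeding one MAC unit. A minimal sketch, assuming hypothetical helper names (real tile hardware wires lanes physically rather than shifting bits):

```python
def split_lanes(packed, lane_bits, n_lanes):
    # unpack a register value into equal-width lanes (least-significant lane first)
    mask = (1 << lane_bits) - 1
    return [(packed >> (i * lane_bits)) & mask for i in range(n_lanes)]

def join_lanes(lanes, lane_bits):
    # repack per-lane values into a single register-sized integer
    out = 0
    for i, v in enumerate(lanes):
        out |= (v & ((1 << lane_bits) - 1)) << (i * lane_bits)
    return out
```

With 32-bit lanes, each lane slot is exactly the width of one 32-bit result/third-source element, which is the correspondence claim 32 recites.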
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined Nair ‘587 and Yeh ‘943 system as set forth above to have a plurality of multiply-accumulate units connected to the registers for the purposes of performing the multiply-add / multiply-accumulate operations, where multiple multiplication operations are done in parallel lanes and where the width of the lanes would be equal to the width of the data elements of the third operand / result data, as taught by Sih ‘472, as a person of ordinary skill in the art would be motivated to include multiple MAC units to allow for a higher number of operations per unit of time and to provide flexibility to perform different types of operations concurrently, allowing for better utilization of available hardware (see Paragraph 0006). Examiner points out that the width of the lanes, which are used to transmit data, would inherently need to be the same size as the width of the third operand / destination operand in order to complete a multiply-accumulate operation, i.e., the product of the multiplication has to have the same width (in bits) as the data to be added in order to work.

As to claim 36, Nair ‘587 does not teach the non-transitory machine readable medium of claim 35, wherein the execution circuitry comprises a plurality of lanes in which to perform parallel multiplications of the first plurality of source matrix data elements by corresponding source matrix data elements of the second plurality of source matrix data elements. Sih ‘472 teaches a digital signal processor system (see Abstract) that utilizes a dual multiply-accumulate (MAC) unit connected to the output of the register file (see Fig. 1, multipliers 122 and adders 140 connected to register file 110), wherein multiple multiply operations can be output from the register file and executed in parallel.
Nair ‘587, Yeh ‘943, and Sih ‘472 are analogous prior art, as all pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined Nair ‘587 and Yeh ‘943 system as set forth above to have a plurality of multiply-accumulate units connected to the registers for the purposes of performing the multiply-add / multiply-accumulate operations, where multiple multiplication operations are done in parallel lanes, as taught by Sih ‘472, as a person of ordinary skill in the art would be motivated to include multiple MAC units to allow for a higher number of operations per unit of time and to provide flexibility to perform different types of operations concurrently, allowing for better utilization of available hardware (see Paragraph 0006).

As to claim 37, Nair ‘587 does not specifically teach the non-transitory machine readable medium of claim 36, wherein each lane of the plurality of lanes has a width equal to a width of the plurality of result matrix data elements and the third plurality of source matrix data elements. Sih ‘472 teaches a digital signal processor system (see Abstract) that utilizes a dual multiply-accumulate (MAC) unit connected to the output of the register file (see Fig. 1, multipliers 122 and adders 140 connected to register file 110), wherein multiple multiply operations can be output from the register file and executed in parallel. Nair ‘587, Yeh ‘943, and Sih ‘472 are analogous prior art, as all pertain to the same field of endeavor: processor systems executing instructions, particularly multiply-add / multiply-accumulate instructions.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined Nair ‘587 and Yeh ‘943 system as set forth above to have a plurality of multiply-accumulate units connected to the registers for the purposes of performing the multiply-add / multiply-accumulate operations, where multiple multiplication operations are done in parallel lanes and where the width of the lanes would be equal to the width of the data elements of the third operand / result data, as taught by Sih ‘472, as a person of ordinary skill in the art would be motivated to include multiple MAC units to allow for a higher number of operations per unit of time and to provide flexibility to perform different types of operations concurrently, allowing for better utilization of available hardware (see Paragraph 0006). Examiner points out that the width of the lanes, which are used to transmit data, would inherently need to be the same size as the width of the third operand / destination operand in order to complete a multiply-accumulate operation, i.e., the product of the multiplication has to have the same width (in bits) as the data to be added in order to work.

Relevant Prior Art

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Corbal et al. (US 10,146,535) teaches a system and method for implementing fused multiply add instructions, where the instruction includes fields for an opcode, a plurality of packed data source operands of a first type, packed data element source operands of a second type, source operands of a third type, and a packed data destination operand, the instruction being used to implement a multiply-add operation on matrices. Harrison et al.
(US 2005/0289208) teaches a system wherein fused multiply-accumulate instructions are used and executed, but provides no details of the makeup of the instruction other than the fact that the instruction includes a multiply instruction and an addition instruction.

Contact Information

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL SUN, whose telephone number is (571) 270-1724. The examiner can normally be reached Monday-Friday, 8am-4pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Jyoti Mehta, can be reached at 571-270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MICHAEL SUN/
Primary Examiner, Art Unit 2183

Prosecution Timeline

Oct 29, 2024 — Application Filed
Sep 24, 2025 — Response after Non-Final Action
Jan 24, 2026 — Non-Final Rejection: §103, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591434
SHADOW CACHE FOR SECURING CONDITIONAL SPECULATIVE INSTRUCTION EXECUTION
2y 5m to grant Granted Mar 31, 2026
Patent 12585612
MEMORY DEVICE WITH EMBEDDED DEEP LEARNING ACCELERATOR IN MULTI-CLIENT ENVIRONMENT
2y 5m to grant Granted Mar 24, 2026
Patent 12585598
STORAGE DEVICE WITH HARDWARE ACCELERATOR
2y 5m to grant Granted Mar 24, 2026
Patent 12572478
Method and Apparatus for Dual Issue Multiply Instructions
2y 5m to grant Granted Mar 10, 2026
Patent 12561249
PREFETCHING USING A DIRECT MEMORY ACCESS ENGINE
2y 5m to grant Granted Feb 24, 2026
Based on this examiner's 5 most recent grants.


Prosecution Projections

Expected OA rounds: 1-2 · Grant probability: 88% (87% with interview, -1.6% lift) · Median time to grant: 2y 5m · PTA risk: Low
Based on 768 resolved cases by this examiner. Grant probability derived from career allow rate.
