Prosecution Insights
Last updated: April 19, 2026
Application No. 17/548,344

METHOD AND APPARATUS FOR SEPARABLE CONVOLUTION FILTER OPERATIONS ON MATRIX MULTIPLICATION ARRAYS

Non-Final OA: §101, §103, §112

Filed: Dec 10, 2021
Examiner: GUDAS, JAKOB OSCAR
Art Unit: 2151
Tech Center: 2100 — Computer Architecture & Software
Assignee: Intel Corporation
OA Round: 3 (Non-Final)

Grant Probability: 44% (Moderate)
OA Rounds: 3-4
To Grant: 4y 2m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 44% (grants 44% of resolved cases; 4 granted / 9 resolved; -10.6% vs TC avg)
Interview Lift: +71.1% (strong; based on resolved cases with interview)
Avg Prosecution: 4y 2m (typical timeline); 28 currently pending
Career History: 37 total applications across all art units

Statute-Specific Performance

§101: 33.2% (-6.8% vs TC avg)
§103: 37.0% (-3.0% vs TC avg)
§102: 8.0% (-32.0% vs TC avg)
§112: 19.9% (-20.1% vs TC avg)

Tech Center averages are estimates. Based on career data from 9 resolved cases.
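The summary metrics above are simple ratios over the examiner's resolved cases. As a rough sketch of how they could be derived (the analytics tool's exact methodology is not disclosed in this report, and any with/without-interview split you pass in is hypothetical):

```python
# Sketch of the dashboard's summary metrics. The 4-granted / 9-resolved
# figures come from the report itself; the interview-lift inputs are whatever
# hypothetical per-interview allowance rates the caller supplies.

def allow_rate(granted: int, resolved: int) -> float:
    """Career allowance rate: share of resolved cases that granted."""
    return granted / resolved

def interview_lift(rate_with: float, rate_without: float) -> float:
    """Percentage-point gain in allowance rate when an interview was held."""
    return rate_with - rate_without

career = allow_rate(4, 9)   # the report rounds this to 44%
print(f"{career:.1%}")      # 44.4%
```

This matches the "4 granted / 9 resolved" career figure; the +71.1% lift would require the underlying with/without-interview counts, which the report does not give.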

Office Action

§101 §103 §112
Detailed Action

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This Office action is non-final and is in response to claims filed on 09/25/2025 via amendment. Claims 1-20 are pending for examination. Claims 1, 4-5, 12, 15, and 17-18 are currently amended. Claims 2-3, 6-11, 13-14, 16, and 19-20 are as originally filed.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 02/04/2026 has been entered.

Rejections Under 35 U.S.C. 112

With respect to the rejection under 35 U.S.C. 112 for claim 1, applicant has amended or cancelled the claims at issue and the previous rejections have therefore been withdrawn. With respect to the rejection under 35 U.S.C. 112 for claims 12, 15, and 18, they still recite the limitation that was rejected under 35 U.S.C. 112 and thus the previous rejections are maintained.

Rejections Under 35 U.S.C. 101

Applicant's arguments regarding the 35 U.S.C. 101 rejections have been fully considered. Regarding the rejection under 35 U.S.C. 101, Applicant argues that “claim 1 has been amended to integrate non-generic hardware components into the claims, including a plurality of Fused Multiply-Add (FMA) blocks”. See Remarks 6 filed 02/04/2026. Examiner respectfully disagrees with Applicant's arguments. Merely reciting the FMA blocks at a high level of generality is generally linking to a field of use as well as a clear apply-it scenario. See MPEP 2106.05(h). The recitation is not an improvement to a circuit but to the abstract ideas (e.g., applying the convolution kernels to the data, generating the convolution kernels, etc.). “It is important to note, the judicial exception alone cannot provide the improvement. The improvement can be provided by one or more additional elements. See the discussion of Diamond v. Diehr, 450 U.S. 175, 187 and 191-92, 209 USPQ 1, 10 (1981) in subsection II, below. In addition, the improvement can be provided by the additional element(s) in combination with the recited judicial exception... However, it is important to keep in mind that an improvement in the abstract idea itself (e.g. a recited fundamental economic concept) is not an improvement in technology....” See MPEP 2106.05(a).

Rejections Under 35 U.S.C. 103

Applicant's arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Objections

Claims 1, 12, 15, and 18 are objected to because of the following informalities: Claims 1, 12, 15, and 18 recite “wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically to a second FMA block in row p+1 and column q of the matrix processing array and coupled diagonally to a third FMA block in row p+1 and column q-1 of the matrix processing array”. This should be changed to “wherein a first FMA block in a row p and a column q of the matrix processing array is coupled vertically to a second FMA block in a row p+1 and the column q of the matrix processing array and coupled diagonally to a third FMA block in the row p+1 and a column q-1 of the matrix processing array”. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 12-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claims 12, 15, and 18 recite “wherein the matrix processing array is to write out an Nth row to storage and set to zero one or more adder inputs to one or more of the plurality of FMA blocks in the first row after the Nth row.” It is unclear if the “Nth row” and the “first row after the Nth row” are referring to the FMA blocks or the output data. For examination purposes, examiner has interpreted “wherein the matrix processing array is to write out an Nth row to storage and set to zero one or more adder inputs to one or more of the plurality of FMA blocks in the first row after the Nth row” as “wherein the matrix processing array is to write out an Nth row of the matrix processing array to storage and set to zero one or more adder inputs to one or more of the plurality of FMA blocks in the first row of the matrix processing array after the Nth row of the matrix processing array.” Claims 13-14, 16-17, and 19-20 are rejected for being dependent on an above rejected claim.

Claim Rejections - 35 USC § 101

35 U.S.C.
101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to abstract ideas without significantly more.

With regards to claim 1, at Step 1, the claim is directed to a machine, which is a statutory category of invention. At Step 2A Prong 1, the examiner notes that the claim is directed to mental processes and/or mathematical concepts. The claim language has been reproduced below:

An apparatus comprising: logic circuitry to generate a first convolution kernel and a second convolution kernel based on a two-dimensional convolution kernel; (mathematical calculation) a matrix processing array comprising a plurality of Fused Multiply-Add (FMA) blocks, (mental process, evaluation) wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically to a second FMA block in row p+1 and column q of the matrix processing array (mental process, evaluation) and coupled diagonally to a third FMA block in row p+1 and column q-1 of the matrix processing array, (mental process, evaluation) to apply the first convolution kernel to input data during a first pass to generate an intermediate data; (mathematical calculation) the matrix processing array to apply the second convolution kernel to the intermediate data to generate output data (mathematical calculation).

Each of the non-bolded limitations is a mental process and/or mathematical calculation. The “to generate a first convolution kernel” limitation is a mathematical calculation that can be performed by generating the two kernels by hand using pen and paper. The “a matrix processing array comprising” limitation is an evaluation mental process that can be performed by choosing what comprises the matrix processing array. The “wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically” limitation is an evaluation mental process that can be performed by choosing how the blocks are connected. The “and coupled diagonally to a third FMA block in row p+1 and column q-1” limitation is an evaluation mental process that can be performed by choosing how the blocks are connected. The “to apply the first convolution kernel” limitation is a mathematical calculation that can be performed by convolving the input and the first kernel by hand using pen and paper. The “to apply the second convolution kernel” limitation is a mathematical calculation that can be performed by convolving the intermediate data and the second kernel by hand using pen and paper.

At Step 2A Prong 2, the additional elements are bolded above. The “matrix processing array”, “Fused Multiply-Add (FMA) blocks”, and “FMA block” are generally linking a computer component to a mathematical calculation. The remaining bolded limitations are generic computer components that amount to no more than components comprising mere instructions to apply the exception and do not integrate the judicial exception into a practical application. See MPEP 2106.05(f). Under Step 2B, the claims do not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.

With regards to claim 12, it recites similar language to claim 1, and is rejected for, at least, the same reasons therein. Claim 12 is directed towards the statutory category of a machine, thus also satisfying Step 1.
Under Step 2A Prong 1, the “decode an instruction” limitation is a mathematical calculation and evaluation mental process that can be performed by decoding the instruction by hand using pen and paper and choosing the instruction to have a field for an operand value. The “to execute the decoded instruction to perform” limitation is an evaluation mental process that can be performed by performing the instruction by hand. The “wherein the matrix processing array is to write out an Nth row” limitation is an evaluation mental process and mathematical calculation that can be performed by choosing what to do with the inputs after storing the row.

Under Step 2A Prong 2, the “write out an Nth row to storage” limitation, as claimed and under BRI, is an additional element that is insignificant extra-solution activity. For example, “write out an Nth row to storage” in the context of this claim encompasses mere data gathering. See MPEP 2106.05(g). The remaining additional elements (the decode circuitry, the execution circuitry, etc.) are no more than high-level generic computer components that amount to no more than components comprising mere instructions to apply the exception and do not integrate the judicial exception into a practical application. See MPEP 2106.05(f).

Under Step 2B, the claim recites “write out an Nth row to storage” and, per MPEP 2106.05(d)(II), the courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity:

i. Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016) (using a telephone for image transmission); OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network); and

iv. Storing and retrieving information in memory, Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015); OIP Techs., 788 F.3d at 1363, 115 USPQ2d at 1092-93.

With regards to claim 15, it recites similar language to claim 12, and is rejected for, at least, the same reasons therein. Claim 15 is directed towards the statutory category of an article of manufacture, thus also satisfying Step 1. Moreover, under Step 2A Prong 2, the additional elements are one or more non-transitory computer-readable media and a processor. These are no more than high-level generic computer components that amount to no more than components comprising mere instructions to apply the exception and do not integrate the judicial exception into a practical application. See MPEP 2106.05(f). Under Step 2B, the claims do not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.

With regards to claim 18, it recites similar language to claim 12, and is rejected for, at least, the same reasons therein. Claim 18 is directed towards the statutory category of an article of manufacture, thus also satisfying Step 1. Moreover, under Step 2A Prong 2, the additional elements are one or more non-transitory computer-readable media and a processor. These are no more than high-level generic computer components that amount to no more than components comprising mere instructions to apply the exception and do not integrate the judicial exception into a practical application. See MPEP 2106.05(f).
Under Step 2B, the claims do not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.

With regards to claim 2, it is directed to an evaluation mental process that can be performed by choosing the input to be image data. Under Steps 2A Prong 2 and 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor does it amount to significantly more than the judicial exception.

With regards to claims 3 and 16, they are directed to an evaluation mental process that can be performed by choosing the first and second kernels to be one-dimensional vectors. Under Steps 2A Prong 2 and 2B, the claims do not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.

With regards to claims 4 and 17, they are directed to a mathematical calculation that can be performed by calculating the Nx1 and 1xN kernels by hand using pen and paper. Under Steps 2A Prong 2 and 2B, the claims do not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.

With regards to claim 5, it is directed to an evaluation mental process that can be performed by choosing to couple memory to the FMA blocks. Under Step 2A Prong 2, the “store” limitation, as claimed under BRI, is an additional element that is insignificant extra-solution activity. For example, the “store” in the context of the claim encompasses mere data gathering used for the claimed convolution step. None of the remaining additional elements regarding the generic computer components (i.e., the memory, etc.) are more than high-level generic computer components that amount to no more than components comprising mere instructions to apply the exception and do not integrate the judicial exception into a practical application. See MPEP 2106.05(f). Under Step 2B, the claim recites “memory to store one or more kernel values” and, per MPEP 2106.05(d)(II), the courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity: iv. Storing and retrieving information in memory, Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015); OIP Techs., 788 F.3d at 1363, 115 USPQ2d at 1092-93.

With regards to claim 6, it is directed to mental processes and/or mathematical concepts. The “wherein the matrix processing array comprises” limitation is an evaluation mental process that can be performed by choosing what the matrix processing array comprises. The “with each column element coupled vertically to its neighboring” limitation is an evaluation mental process that can be performed by choosing how to couple the FMA blocks. The “where data is to be” limitation is an evaluation mental process that can be performed by choosing when to store the data. Under Step 2A Prong 2, the “stored” limitation, as claimed under BRI, is an additional element that is insignificant extra-solution activity. For example, the “stored” in the context of the claim encompasses mere data gathering. Under Step 2B, the claim recites “where data is to be stored after a last FMA element operation” and, per MPEP 2106.05(d)(II), the courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity: iv. Storing and retrieving information in memory, Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015); OIP Techs., 788 F.3d at 1363, 115 USPQ2d at 1092-93.

With regards to claim 7, it is directed to an evaluation mental process that can be performed by choosing what the logic circuitry comprises. None of the remaining additional elements regarding the generic computer components (i.e., processor, etc.) are more than high-level generic computer components that amount to no more than components comprising mere instructions to apply the exception and do not integrate the judicial exception into a practical application. See MPEP 2106.05(f). Under Step 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor does it amount to significantly more than the judicial exception.

With regards to claim 8, it is directed to an evaluation mental process that can be performed by choosing what the processor comprises. None of the remaining additional elements regarding the generic computer components (i.e., graphics processing unit, general-purpose processor, etc.) are more than high-level generic computer components that amount to no more than components comprising mere instructions to apply the exception and do not integrate the judicial exception into a practical application. See MPEP 2106.05(f). Under Step 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor does it amount to significantly more than the judicial exception.

With regards to claim 9, it is directed to an evaluation mental process that can be performed by choosing what the instructions do. Under Steps 2A Prong 2 and 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor does it amount to significantly more than the judicial exception.

With regards to claim 10, it is directed to an evaluation mental process that can be performed by choosing what the kernels are applied to. Under Steps 2A Prong 2 and 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor does it amount to significantly more than the judicial exception.

With regards to claims 11, 14, and 20, they are directed to an evaluation mental process that can be performed by choosing what comprises the matrix processing array. Under Step 2A Prong 2, the systolic array is merely generally linking the mathematical calculations of claim 1 with general systolic DPAS computing. See MPEP 2106.05(h). Under Step 2B, the claims do not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.

With regards to claims 13 and 19, they are directed to an evaluation mental process that can be performed by choosing the operations to perform. Under Steps 2A Prong 2 and 2B, the claims do not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7 and 10-11 are rejected under 35 U.S.C. 103 as being unpatentable over Nichani et al. (“SAP: Design of a Systolic Array Processor for Computations in Vision”), hereinafter Nichani, in view of Yoon et al. (US 20230015148 A1), hereinafter Yoon-1, further in view of Yoon et al. (US 11507452 B1), hereinafter Yoon-2.

With regards to claim 1, Nichani teaches An apparatus comprising: logic circuitry to generate a first convolution kernel and a second convolution kernel based on a two-dimensional convolution kernel (Nichani Section 2 Separability Section page 316: A two-dimensional Gaussian filter can be separated into two one-dimensional Gaussians, one along the x direction and the other along the y direction); to apply the first convolution kernel to input data during a first pass to generate an intermediate data (Nichani Section 2 Separability Section page 316: Therefore, the Gaussian filter can be applied to an image, by convolving at first with a one-dimensional Gaussian along each row); and the matrix processing array to apply the second convolution kernel to the intermediate data to generate output data (Nichani Section 2 Separability Section page 316: and then convolving the result again with a one-dimensional Gaussian along each column).
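The separability principle Nichani is cited for above can be verified numerically: convolving with a separable 2-D kernel gives the same result as a row pass followed by a column pass. The sketch below is an illustrative aside; the kernel values, image size, and use of "valid"-mode correlation are assumptions, not taken from any cited reference.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Direct 2-D 'valid' correlation, written out plainly for illustration."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A separable 3x3 kernel: the outer product of a column and a row vector
# (a binomial approximation to a Gaussian; the values are an assumption).
col = np.array([1.0, 2.0, 1.0]) / 4.0
row = np.array([1.0, 2.0, 1.0]) / 4.0
kernel_2d = np.outer(col, row)

rng = np.random.default_rng(0)
image = rng.random((8, 8))

# One pass with the full 2-D kernel...
direct = conv2d_valid(image, kernel_2d)
# ...equals two passes with the 1-D kernels: along rows, then along columns.
intermediate = conv2d_valid(image, row[np.newaxis, :])     # first pass
two_pass = conv2d_valid(intermediate, col[:, np.newaxis])  # second pass

assert np.allclose(direct, two_pass)  # identical up to floating-point error
```

For an NxN kernel this reduces the per-output-pixel work from N*N multiplies to 2N, which is the efficiency argument behind separable filtering.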
Nichani fails to teach a matrix processing array comprising a plurality of Fused Multiply-Add (FMA) blocks, wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically to a second FMA block in row p+1 and column q of the matrix processing array. However, Yoon-1 does teach a matrix processing array comprising a plurality of Fused Multiply-Add (FMA) blocks (Yoon-1 [0036]: FIG. 4A depicts an example systolic array 400 that is used to multiply the matrices A 410 and B 420 to produce an output matrix C 430. In particular, systolic array 400 may be a 2D array of multiply-and-accumulate (MAC) units...The output b values of the Booth encoder 440, and the scalar values of the matrix A 410 may be used as input to a fused multiplier and adder in each MAC unit in the systolic array) wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically to a second FMA block in row p+1 and column q of the matrix processing array (Yoon-1 [0036]: The fused multiplier and adder may be used in computing, at least partially, the dot products of each row in matrix A 410 and each column in matrix B 420. CPA 445 may operate similar to the several parallel segmented adders implemented as a CPA described above. The outputs of the fused multiplier and adder in MAC units at the bottom of the systolic array 400 may be output to bottom CPA 445 to be added to produce the values of the matrix C 430).

Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to replace the Wallace multiplier and accumulator of Nichani with the FMA units of Yoon-1. One of ordinary skill in the art would be motivated to make this replacement due to the increase in speed and efficiency of FMAs compared to separate multipliers, adders, and accumulators, allowing for more data throughput, as well as because FMA units reduce the rounding error.
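The rationale above contrasts fused multiply-add units with separate multiplier/adder/accumulator stages. Functionally, a column of vertically coupled FMA blocks reduces to a chained multiply-add, as in this toy model (names and structure are illustrative assumptions, not code from Yoon-1 or Nichani):

```python
# Toy model of one column of a matrix processing array: each FMA block
# performs a single fused multiply-add, and vertical coupling feeds its
# partial sum into the adder input of the block below.

def fma(a, b, c):
    """One FMA block: fused multiply-add, a * b + c (one rounding step)."""
    return a * b + c

def fma_column(weights, activations):
    """Vertically coupled FMA blocks: block p+1 receives block p's sum."""
    partial = 0.0
    for w, x in zip(weights, activations):
        partial = fma(w, x, partial)  # the adder input comes from the block above
    return partial

# A column of three chained FMA blocks computes a dot product:
print(fma_column([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32.0
```

In hardware, fusing the multiply and add into one operation rounds once instead of twice per step, which is the rounding-error point made in the motivation statement.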
Nichani in view of Yoon-1 fails to teach and coupled diagonally to a third FMA block in row p+1 and column q-1 of the matrix processing array. However, Yoon-2 teaches and coupled diagonally to a third FMA block in row p+1 and column q-1 of the matrix processing array (Yoon-2 Column 6 Lines 31-33: the systolic array 110 can include one or more data buses along processing elements positioned diagonally up or down relative to one another in the array). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teaching of Nichani in view of Yoon-1 with coupling the blocks diagonally as taught by Yoon-2. One of ordinary skill in the art would be motivated to make this combination because it would increase the flexibility of the system as the data could be sent more quickly to where it is needed than if the array was not connected diagonally. Also, data can travel across each data bus or interconnect of the systolic array 110 independent of one another. In other words, different processing elements may receive and transmit different data to a neighboring processing element at different points in time as taught by Yoon-2 (Yoon-2 Column 7 Lines 12-16).

With regards to claim 2, Nichani in view of Yoon-1 further in view of Yoon-2 teaches all of the limitations of claim 1 above. Nichani further teaches wherein the input data comprises image data (Nichani Section 2 Separability Section page 316: Therefore, the Gaussian filter can be applied to an image).

With regards to claim 3, Nichani in view of Yoon-1 further in view of Yoon-2 teaches all of the limitations of claim 1 above. Nichani further teaches wherein the first convolution kernel and the second convolution kernel each comprise a one-dimensional vector (Nichani Section 2 Separability Section page 316: A two-dimensional Gaussian filter can be separated into two one-dimensional Gaussians).
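How a 2-D kernel yields the two one-dimensional vectors discussed for claim 3 can be sketched with a rank-1 factorization. The SVD-based method below is purely an assumption for illustration; neither the application's claims as reproduced here nor Nichani is quoted as specifying this mechanism.

```python
import numpy as np

# The record does not say how the Nx1 and 1xN kernels are generated from the
# NxN kernel; a standard approach, assumed here for illustration, is a rank-1
# factorization via SVD, which is exact when the 2-D kernel is separable.
def separate_kernel(kernel_2d):
    u, s, vt = np.linalg.svd(kernel_2d)
    col = u[:, 0] * np.sqrt(s[0])   # Nx1 convolution kernel
    row = vt[0, :] * np.sqrt(s[0])  # 1xN convolution kernel
    return col, row

# A separable 5x5 kernel built from a binomial (Gaussian-like) 1-D filter.
gauss_1d = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
kernel = np.outer(gauss_1d, gauss_1d)

col, row = separate_kernel(kernel)
assert np.allclose(np.outer(col, row), kernel)  # outer product reconstructs it
```

For a kernel that is not exactly separable, keeping only the leading singular pair gives the best rank-1 (separable) approximation in the least-squares sense.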
With regards to claim 4, Nichani in view of Yoon-1 further in view of Yoon-2 teaches all of the limitations of claim 1 above. Nichani further teaches wherein, for an NxN two-dimensional convolution kernel, the logic circuitry is to generate an Nx1 convolution kernel and a 1xN convolution kernel (Nichani Section 2 Separability Section page 316: A two-dimensional Gaussian filter can be separated into two one-dimensional Gaussians, one along the x direction and the other along the y direction).

With regards to claim 5, Nichani in view of Yoon-1 further in view of Yoon-2 teaches all of the limitations of claim 1 above. Nichani further teaches is coupled to memory to store one or more kernel values (Nichani Section 4 Weight FIFO Buffer Section Page 316: WFIFO is a circular buffer consisting of 18 registers with a data width of 8 bits). Nichani fails to teach wherein a subset of the plurality of FMA blocks. However, Yoon-1 does teach wherein a subset of the plurality of FMA blocks (Yoon-1 [0036]: The output b values of the Booth encoder 440, and the scalar values of the matrix A 410 may be used as input to a fused multiplier and adder in each MAC unit in the systolic array). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to replace the Wallace multiplier and accumulator of Nichani with the FMA units of Yoon-1. One of ordinary skill in the art would be motivated to make this replacement due to the increase in speed and efficiency of FMAs compared to separate multipliers, adders, and accumulators, allowing for more data throughput, as well as because FMA units reduce the rounding error.

With regards to claim 6, Nichani in view of Yoon-1 further in view of Yoon-2 teaches all of the limitations of claim 1 above.
Nichani fails to teach wherein the matrix processing array comprises a two-dimensional matrix of the plurality of FMA blocks with each column element coupled vertically to its neighboring downstream FMA element, where data is to be stored after a last FMA element operation. However, Yoon-1 does teach wherein the matrix processing array comprises a two-dimensional matrix of the plurality of FMA blocks (Yoon-1 [0036]: FIG. 4A depicts an example systolic array 400 that is used to multiply the matrices A 410 and B 420 to produce an output matrix C 430. In particular, systolic array 400 may be a 2D array of multiply-and-accumulate (MAC) units...The output b values of the Booth encoder 440, and the scalar values of the matrix A 410 may be used as input to a fused multiplier and adder in each MAC unit in the systolic array) with each column element coupled vertically to its neighboring downstream FMA element (Yoon-1 [0036]: The fused multiplier and adder may be used in computing, at least partially, the dot products of each row in matrix A 410 and each column in matrix B 420. CPA 445 may operate similar to the several parallel segmented adders implemented as a CPA described above. The outputs of the fused multiplier and adder in MAC units at the bottom of the systolic array 400 may be output to bottom CPA 445 to be added to produce the values of the matrix C 430.), where data is to be stored after a last FMA element operation (Yoon-1 [0069]: FIG. 8 depicts a block diagram of an example electronic device 800. The electronic device 800 may include one or more processor 810, such as one or more xPUs, system memory 820, a bus 830, the networking interface(s) 840, and other components (not shown), such as storage(s), output device interface(s), input device interface(s). A bus 830 may be used for communicating between the processor 810, the system memory 820; Yoon-1 [0070]: The processor 810 may include a systolic array, such as the systolic array described in connection with FIGS. 4A). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani with the 2D array of FMA units of Yoon-1. One of ordinary skill in the art would be motivated to make this combination because it would allow for parallel operation of the array allowing for more throughput of calculations.

With regards to claim 7, Nichani in view of Yoon-1 further in view of Yoon-2 teaches all of the limitations of claim 1 above. Nichani further teaches wherein a processor, having one or more processor cores, comprises the logic circuitry (Nichani Section 6 page 317: The PE architecture has been designed, verified and fabricated on a 4.6 x 6.8 mm MOSIS standard frame. The testing of the fabricated chip).

With regards to claim 10, Nichani in view of Yoon-1 further in view of Yoon-2 teaches all of the limitations of claim 1 above. Nichani further teaches wherein the matrix processing array is to apply the first and second convolution kernels to execute operations in one or more of: image processing, data filtering, and data encoding or decoding (Nichani Section 2 Separability Section page 316: A two-dimensional Gaussian filter can be separated into two one-dimensional Gaussians, one along the x direction and the other along the y direction. Therefore, the Gaussian filter can be applied to an image, by convolving at first with a one-dimensional Gaussian along each row and then convolving the result again with a one-dimensional Gaussian along each column).

With regards to claim 11, Nichani in view of Yoon-1 further in view of Yoon-2 teaches all of the limitations of claim 1 above.
Nichani further teaches wherein the matrix processing array comprises a systolic array (Nichani Section 4 Page 316: The systolic array processor (SAP)). Claims 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Venkataramani et al. (US 20190303743 A1) hereinafter Venkataramani. With regards to claim 8, Nichani in view of Yoon-1 further in view of Yoon-2 teaches all of the limitations of claim 7 above. While Nichani teaches a processor, Nichani fails to teach wherein the processor comprises a graphics processing unit and/or a general-purpose processor. However, Venkataramani does teach wherein the processor comprises a graphics processing unit and/or a general-purpose processor (Venkataramani [0155]: the processor 2500 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit)). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani with the graphics processor or general-purpose processor of Venkataramani. One of ordinary skill in the art would be motivated to make this combination because it would allow for more flexibility in manufacturing and using the circuits. With regards to claim 9, Nichani in view of Yoon-1 further in view of Yoon-2 teaches all of the limitations of claim 1 above. Nichani fails to teach wherein one or more instructions are to be executed to configure, load, execute, and/or store the first and second convolution kernels on the matrix processing array. However, Venkataramani does teach wherein one or more instructions are to be executed to configure, load, execute, and/or store the first and second convolution kernels on the matrix processing array (Venkataramani [0077]: FIGS. 
10A-10B illustrate instructions 1000-1001 according to embodiments of the disclosure… Coarse-grained Data Instructions 1004: e.g., compute dominant instructions such as convolutions (nD-convolutions); Venkataramani Fig. 10A: shows the NDCONV instruction which convolves the input with the kernel). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani with the convolution instruction of Venkataramani. One of ordinary skill in the art would be motivated to make this combination because it would allow for the circuitry to perform the instructions needed to convolve the matrix with the kernels. Claims 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu et al. (US 20200334322 A1) hereinafter Liu. With regards to claim 15, Nichani teaches logic circuitry to generate a first convolution kernel and a second convolution kernel based on a two-dimensional convolution kernel (Nichani Section 2 Separability Section page 316: A two-dimensional Gaussian filter can be separated into two one dimensional Gaussians, one along the x direction and the other along the y direction); to apply the first convolution kernel to input data during a first pass to generate an intermediate data (Nichani Section 2 Separability Section page 316: Therefore, the Gaussian filter can be applied to an image, by convolving at first with a one-dimensional Gaussian along each row); and the matrix processing array to apply the second convolution kernel to the intermediate data to generate output data (Nichani Section 2 Separability Section page 316: and then convolving the result again with a one-dimensional Gaussian along each column). 
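For context, the row-then-column separability that Nichani relies on can be sketched in a few lines. This is a hypothetical illustration only: the kernel taps and image values below are invented for the example and are not from the record. The sketch confirms the identity that applying a separable 2D kernel (the outer product of two 1D kernels) equals a first 1D pass along each row followed by a second 1D pass along each column.

```python
# Illustrative sketch of the separable-filter identity: a 2D kernel that is
# the outer product of two 1D kernels can be applied as a row pass followed
# by a column pass. All values here are hypothetical.

def conv1d(seq, kernel):
    """'Valid'-mode 1D correlation of seq with kernel."""
    n = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(n))
            for i in range(len(seq) - n + 1)]

def conv_rows(img, kernel):
    # First pass: 1D filter along each row.
    return [conv1d(row, kernel) for row in img]

def conv_cols(img, kernel):
    # Second pass: 1D filter along each column (transpose, filter, transpose back).
    cols = [conv1d(list(c), kernel) for c in zip(*img)]
    return [list(r) for r in zip(*cols)]

def conv2d(img, kernel2d):
    # Direct 2D 'valid'-mode correlation for comparison.
    kh, kw = len(kernel2d), len(kernel2d[0])
    h, w = len(img), len(img[0])
    return [[sum(img[i + u][j + v] * kernel2d[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]

k = [1.0, 2.0, 1.0]                       # hypothetical Gaussian-like 1D taps
k2d = [[a * b for b in k] for a in k]     # separable 3x3 kernel (outer product)
img = [[float(5 * i + j) for j in range(5)] for i in range(5)]

two_pass = conv_cols(conv_rows(img, k), k)   # rows first, then columns
direct = conv2d(img, k2d)
assert all(abs(a - b) < 1e-9 for ra, rb in zip(two_pass, direct)
           for a, b in zip(ra, rb))
```

The two passes cost O(2N) multiplies per pixel for an NxN separable kernel instead of O(N^2), which is the efficiency argument underlying the separability teaching.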
Nichani fails to teach a matrix processing array comprising a plurality of Fused Multiply-Add (FMA) blocks, wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically to a second FMA block in row p+1 and column q of the matrix processing array and One or more non-transitory computer-readable media comprising one or more instructions that when executed on a processor configure the processor to perform one or more operations to cause. However, Yoon-1 does teach a matrix processing array comprising a plurality of Fused Multiply-Add (FMA) blocks (Yoon-1 [0036]: FIG. 4A depicts an example systolic array 400 that is used to multiply the matrices A 410 and B 420 to produce an output matrix C 430. In particular, systolic array 400 may be a 2D array of multiply-and-accumulate (MAC) units...The output b values of the Booth encoder 440, and the scalar values of the matrix A 410 may be used as input to a fused multiplier and adder in each MAC unit in the systolic array). One or more non-transitory computer-readable media comprising one or more instructions that when executed on a processor configure the processor to perform one or more operations to cause (Yoon-1 [0078]: Aspects of the present disclosure may be implemented as a computer implemented process, a system, or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. 
The computer readable storage medium may be readable by an electronic device and may comprise instructions for causing an electronic device or other device to perform processes and techniques described in the present disclosure) wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically to a second FMA block in row p+1 and column q of the matrix processing array (Yoon-1 [0036]: The fused multiplier and adder may be used in computing, at least partially, the dot products of each row in matrix A 410 and each column in matrix B 420. CPA 445 may operate similar to the several parallel segmented adders implemented as a CPA described above. The outputs of the fused multiplier and adder in MAC units at the bottom of the systolic array 400 may be output to bottom CPA 445 to be added to produce the values of the matrix C 430). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to replace the Wallace multiplier and accumulator of Nichani with the FMA units of Yoon-1 and combine the teachings of Nichani with the non-transitory computer readable medium of Yoon-1. One of ordinary skill in the art would be motivated to make this replacement due to the increase in speed and efficiency of FMAs compared to separate multipliers, adders, and accumulators, allowing for more data throughput, as well as because FMA units reduce the rounding error. Nichani in view of Yoon-1 fails to teach and coupled diagonally to a third FMA block in row p+1 and column q-1 of the matrix processing array. However, Yoon-2 teaches and coupled diagonally to a third FMA block in row p+1 and column q-1 of the matrix processing array (Yoon-2 Column 6 Lines 31-33: the systolic array 110 can include one or more data buses along processing elements positioned diagonally up or down relative to one another in the array). 
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teaching of Nichani in view of Yoon-1 with coupling the blocks diagonally as taught by Yoon-2. One of ordinary skill in the art would be motivated to make this combination because it would increase the flexibility of the system as the data could be sent more quickly to where it is needed than if the array was not connected diagonally. Also, data can travel across each data bus or interconnect of the systolic array 110 independent of one another. In other words, different processing elements may receive and transmit different data to a neighboring processing element at different points in time as taught by Yoon-2 (Yoon-2 Column 7 Lines 12-16). Nichani in view of Yoon-1 further in view of Yoon-2 fails to teach wherein the matrix processing array is to write out an Nth row to storage and set to zero one or more adder inputs to one or more of the plurality of FMA blocks in the first row after the Nth row. However, Liu teaches wherein the matrix processing array is to write out an Nth row to storage and set to zero one or more adder inputs to one or more of the plurality of FMA blocks in the first row after the Nth row (Liu [0078]: the matrix multiplier 60 further includes a third memory 605, and the third memory 605 is configured to store operation results of the vector multiplication circuit and the addition circuit, and store operation results in different clock cycles. It can be understood that the third memory 605 in this application may include X*Y storage units, and each storage unit is configured to store an operation result obtained each time a corresponding operation unit performs an operation; Liu [0085]: when N % X≠0, computation is not performed on a row (N+1) to a row (T*X−N) of the first matrix, and a value of a result is assigned 0). 
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani in view of Yoon-1 further in view of Yoon-2 with storing the row and setting the input to zero as taught by Liu. One of ordinary skill in the art would be motivated to make this combination because, in this way, read and operation power consumption of the corresponding operation unit can be reduced, as taught by Liu (Liu [0085]). With regards to claim 16, Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu teaches all of the limitations of claim 15 above. Nichani further teaches the first convolution kernel and the second convolution kernel to each comprise a one-dimensional vector (Nichani Section 2 Separability Section page 316: A two-dimensional Gaussian filter can be separated into two one-dimensional Gaussians). Nichani fails to teach further comprising one or more instructions that when executed on the at least one processor configure the at least one processor to perform one or more operations to cause the. However, Yoon-1 does teach further comprising one or more instructions that when executed on the at least one processor configure the at least one processor to perform one or more operations to cause the (Yoon-1 [0078]: Aspects of the present disclosure may be implemented as a computer implemented process, a system, or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. 
The computer readable storage medium may be readable by an electronic device and may comprise instructions for causing an electronic device or other device to perform processes and techniques described in the present disclosure). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani with the non-transitory computer readable medium of Yoon-1. One of ordinary skill in the art would be motivated to make this combination because it would allow for the wider manufacture of the circuitry, allowing for it to be distributed more quickly. With regards to claim 17, Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu teaches all of the limitations of claim 15 above. Nichani further teaches generate an Nx1 convolution kernel and a 1xN convolution kernel for an NxN two-dimensional convolution kernel (Nichani Section 2 Separability Section page 316: A two-dimensional Gaussian filter can be separated into two one-dimensional Gaussians, one along the x direction and the other along the y direction). Nichani fails to teach further comprising one or more instructions that when executed on the at least one processor configure the at least one processor to perform one or more operations to cause the logic circuitry to. However, Yoon-1 does teach further comprising one or more instructions that when executed on the at least one processor configure the at least one processor to perform one or more operations to cause the logic circuitry to (Yoon-1 [0078]: Aspects of the present disclosure may be implemented as a computer implemented process, a system, or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. 
The computer readable storage medium may be readable by an electronic device and may comprise instructions for causing an electronic device or other device to perform processes and techniques described in the present disclosure). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani with the non-transitory computer readable medium of Yoon-1. One of ordinary skill in the art would be motivated to make this combination because it would allow for the wider manufacture of the circuitry, allowing for it to be distributed more quickly. Claims 12-14 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu further in view of Venkataramani. With regards to claim 12, Nichani teaches An apparatus comprising: logic circuitry to generate a first convolution kernel and a second convolution kernel based on a two-dimensional convolution kernel (Nichani Section 2 Separability Section page 316: A two-dimensional Gaussian filter can be separated into two one dimensional Gaussians, one along the x direction and the other along the y direction); to apply the first convolution kernel to input data during a first pass to generate an intermediate data (Nichani Section 2 Separability Section page 316: Therefore, the Gaussian filter can be applied to an image, by convolving at first with a one-dimensional Gaussian along each row); and the matrix processing array to apply the second convolution kernel to the intermediate data to generate output data (Nichani Section 2 Separability Section page 316: and then convolving the result again with a one-dimensional Gaussian along each column). 
Nichani fails to teach a matrix processing array comprising a plurality of Fused Multiply-Add (FMA) blocks, wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically to a second FMA block in row p+1 and column q of the matrix processing array. However, Yoon-1 does teach a matrix processing array comprising a plurality of Fused Multiply-Add (FMA) blocks (Yoon-1 [0036]: FIG. 4A depicts an example systolic array 400 that is used to multiply the matrices A 410 and B 420 to produce an output matrix C 430. In particular, systolic array 400 may be a 2D array of multiply-and-accumulate (MAC) units...The output b values of the Booth encoder 440, and the scalar values of the matrix A 410 may be used as input to a fused multiplier and adder in each MAC unit in the systolic array) wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically to a second FMA block in row p+1 and column q of the matrix processing array (Yoon-1 [0036]: The fused multiplier and adder may be used in computing, at least partially, the dot products of each row in matrix A 410 and each column in matrix B 420. CPA 445 may operate similar to the several parallel segmented adders implemented as a CPA described above. The outputs of the fused multiplier and adder in MAC units at the bottom of the systolic array 400 may be output to bottom CPA 445 to be added to produce the values of the matrix C 430). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to replace the Wallace multiplier and accumulator of Nichani with the FMA units of Yoon-1. One of ordinary skill in the art would be motivated to make this replacement due to the increase in speed and efficiency of FMAs compared to separate multipliers, adders, and accumulators, allowing for more data throughput, as well as because FMA units reduce the rounding error. 
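The vertically coupled column of FMA blocks described in the Yoon-1 mapping (each block in row p passes its accumulated partial sum down to the block in row p+1, with the bottom value stored) can be modeled with a short sketch. This is a minimal behavioral model, not the claimed circuit or any cited reference's design; the matrix sizes and values are hypothetical.

```python
# Behavioral sketch of a column of vertically coupled fused multiply-add
# (FMA) blocks: each block computes a*b + partial and forwards the result
# downward; the output of the last FMA element in the column is stored.
# Computes C = A @ B for small matrices. Values are illustrative.

def fma(a, b, c):
    # One FMA block: multiply and accumulate in a single step.
    return a * b + c

def systolic_matmul(A, B):
    rows_a, inner, cols_b = len(A), len(B), len(B[0])
    C = [[0.0] * cols_b for _ in range(rows_a)]
    for i in range(rows_a):
        for q in range(cols_b):       # one vertical column of FMA blocks
            partial = 0.0             # adder input of the first (top) block
            for p in range(inner):    # partial sum flows down row by row
                partial = fma(A[i][p], B[p][q], partial)
            C[i][q] = partial         # stored after the last FMA operation
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert systolic_matmul(A, B) == [[19.0, 22.0], [43.0, 50.0]]
```

In hardware the inner loop is unrolled across the physical rows of the array so all columns accumulate in parallel, which is the throughput rationale the rejection cites for the combination.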
Nichani in view of Yoon-1 fails to teach and coupled diagonally to a third FMA block in row p+1 and column q-1 of the matrix processing array. However, Yoon-2 teaches and coupled diagonally to a third FMA block in row p+1 and column q-1 of the matrix processing array (Yoon-2 Column 6 Lines 31-33: the systolic array 110 can include one or more data buses along processing elements positioned diagonally up or down relative to one another in the array). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teaching of Nichani in view of Yoon-1 with coupling the blocks diagonally as taught by Yoon-2. One of ordinary skill in the art would be motivated to make this combination because it would increase the flexibility of the system as the data could be sent more quickly to where it is needed than if the array was not connected diagonally. Also, data can travel across each data bus or interconnect of the systolic array 110 independent of one another. In other words, different processing elements may receive and transmit different data to a neighboring processing element at different points in time as taught by Yoon-2 (Yoon-2 Column 7 Lines 12-16). Nichani in view of Yoon-1 further in view of Yoon-2 fails to teach and wherein the matrix processing array is to write out an Nth row to storage and set to zero one or more adder inputs to one or more of the plurality of FMA blocks in the first row after the Nth row. 
However, Liu teaches and wherein the matrix processing array is to write out an Nth row to storage and set to zero one or more adder inputs to one or more of the plurality of FMA blocks in the first row after the Nth row (Liu [0078]: the matrix multiplier 60 further includes a third memory 605, and the third memory 605 is configured to store operation results of the vector multiplication circuit and the addition circuit, and store operation results in different clock cycles. It can be understood that the third memory 605 in this application may include X*Y storage units, and each storage unit is configured to store an operation result obtained each time a corresponding operation unit performs an operation; Liu [0085]: when N % X≠0, computation is not performed on a row (N+1) to a row (T*X−N) of the first matrix, and a value of a result is assigned 0). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani in view of Yoon-1 further in view of Yoon-2 with storing the row and setting the input to zero as taught by Liu. One of ordinary skill in the art would be motivated to make this combination because, in this way, read and operation power consumption of the corresponding operation unit can be reduced, as taught by Liu (Liu [0085]). Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu fails to teach decode circuitry to decode an instruction having a field for an operand value; and execution circuitry to execute the decoded instruction to perform one or more operations on a matrix processing array. 
However, Venkataramani does teach decode circuitry to decode an instruction having a field for an operand value (Venkataramani [0143]: The decode unit 2340 (or decoder or decoder unit) may decode instructions (e.g., macro-instructions), and generate as an output one or more micro-operations, micro-code entry points, micro-instructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions); and execution circuitry to execute the decoded instruction to perform one or more operations on a matrix processing array (Venkataramani [0142]: shows processor core 2390 including a front end unit 2330 coupled to an execution engine unit). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu with the decode and execution circuitry of Venkataramani. One of ordinary skill in the art would be motivated to make this combination to allow for more flexibility in the circuit by allowing for different instructions to be performed by the circuit and allow for the execution of instructions from the decode circuitry to perform the needed operations. With regards to claim 13, Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu further in view of Venkataramani teaches all of the limitations of claim 12 above. Nichani fails to teach wherein the one or more operations comprise: a load array configuration operation, a matrix load operation, a filter load operation, an array filter operation, and/or a matrix store operation. However, Venkataramani does teach wherein the one or more operations comprise: a load array configuration operation, a matrix load operation, a filter load operation, an array filter operation, and/or a matrix store operation (Venkataramani [0077]: FIGS. 
10A-10B illustrate instructions 1000-1001 according to embodiments of the disclosure… Coarse-grained Data Instructions 1004: e.g., compute dominant instructions such as convolutions (nD-convolutions); Venkataramani Fig. 10A: shows the NDCONV instruction which convolves the input with the kernel). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani with the convolution instruction of Venkataramani. One of ordinary skill in the art would be motivated to make this combination because it would allow for the circuitry to perform the instructions needed to convolve the matrix with the kernels. With regards to claim 14, Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu further in view of Venkataramani teaches all of the limitations of claim 12 above. Nichani further teaches wherein the matrix processing array comprises a systolic array (Nichani Section 4 Page 316: The systolic array processor (SAP)). With regards to claim 18, Nichani teaches logic circuitry to generate a first convolution kernel and a second convolution kernel based on a two-dimensional convolution kernel (Nichani Section 2 Separability Section page 316: A two-dimensional Gaussian filter can be separated into two one dimensional Gaussians, one along the x direction and the other along the y direction); to apply the first convolution kernel to input data during a first pass to generate an intermediate data (Nichani Section 2 Separability Section page 316: Therefore, the Gaussian filter can be applied to an image, by convolving at first with a one-dimensional Gaussian along each row); and the matrix processing array to apply the second convolution kernel to the intermediate data to generate output data (Nichani Section 2 Separability Section page 316: and then convolving the result again with a one-dimensional Gaussian along each column). 
Nichani fails to teach a matrix processing array comprising a plurality of Fused Multiply-Add (FMA) blocks, wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically to a second FMA block in row p+1 and column q of the matrix processing array and One or more non-transitory computer-readable media comprising one or more instructions that when executed on a processor configure the processor to perform one or more operations to cause. However, Yoon-1 does teach a matrix processing array comprising a plurality of Fused Multiply-Add (FMA) blocks (Yoon-1 [0036]: FIG. 4A depicts an example systolic array 400 that is used to multiply the matrices A 410 and B 420 to produce an output matrix C 430. In particular, systolic array 400 may be a 2D array of multiply-and-accumulate (MAC) units...The output b values of the Booth encoder 440, and the scalar values of the matrix A 410 may be used as input to a fused multiplier and adder in each MAC unit in the systolic array) wherein a first FMA block in row p and column q of the matrix processing array is coupled vertically to a second FMA block in row p+1 and column q of the matrix processing array (Yoon-1 [0036]: The fused multiplier and adder may be used in computing, at least partially, the dot products of each row in matrix A 410 and each column in matrix B 420. CPA 445 may operate similar to the several parallel segmented adders implemented as a CPA described above. The outputs of the fused multiplier and adder in MAC units at the bottom of the systolic array 400 may be output to bottom CPA 445 to be added to produce the values of the matrix C 430). 
One or more non-transitory computer-readable media comprising one or more instructions that when executed on a processor configure the processor to perform one or more operations to cause (Yoon-1 [0078]: Aspects of the present disclosure may be implemented as a computer implemented process, a system, or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by an electronic device and may comprise instructions for causing an electronic device or other device to perform processes and techniques described in the present disclosure). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to replace the Wallace multiplier and accumulator of Nichani with the FMA units of Yoon-1. One of ordinary skill in the art would be motivated to make this replacement due to the increase in speed and efficiency of FMAs compared to separate multipliers, adders, and accumulators, allowing for more data throughput, as well as because FMA units reduce the rounding error. Nichani in view of Yoon-1 fails to teach and coupled diagonally to a third FMA block in row p+1 and column q-1 of the matrix processing array. However, Yoon-2 teaches and coupled diagonally to a third FMA block in row p+1 and column q-1 of the matrix processing array (Yoon-2 Column 6 Lines 31-33: the systolic array 110 can include one or more data buses along processing elements positioned diagonally up or down relative to one another in the array). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teaching of Nichani in view of Yoon-1 with coupling the blocks diagonally as taught by Yoon-2. 
One of ordinary skill in the art would be motivated to make this combination because it would increase the flexibility of the system as the data could be sent more quickly to where it is needed than if the array was not connected diagonally. Also, data can travel across each data bus or interconnect of the systolic array 110 independent of one another. In other words, different processing elements may receive and transmit different data to a neighboring processing element at different points in time as taught by Yoon-2 (Yoon-2 Column 7 Lines 12-16). Nichani in view of Yoon-1 further in view of Yoon-2 fails to teach and wherein the matrix processing array is to write out an Nth row to storage and set to zero one or more adder inputs to one or more of the plurality of FMA blocks in the first row after the Nth row. However, Liu teaches and wherein the matrix processing array is to write out an Nth row to storage and set to zero one or more adder inputs to one or more of the plurality of FMA blocks in the first row after the Nth row (Liu [0078]: the matrix multiplier 60 further includes a third memory 605, and the third memory 605 is configured to store operation results of the vector multiplication circuit and the addition circuit, and store operation results in different clock cycles. It can be understood that the third memory 605 in this application may include X*Y storage units, and each storage unit is configured to store an operation result obtained each time a corresponding operation unit performs an operation; Liu [0085]: when N % X≠0, computation is not performed on a row (N+1) to a row (T*X−N) of the first matrix, and a value of a result is assigned 0). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani in view of Yoon-1 further in view of Yoon-2 with storing the row and setting the input to zero as taught by Liu. 
One of ordinary skill in the art would be motivated to make this combination because, in this way, read and operation power consumption of the corresponding operation unit can be reduced, as taught by Liu (Liu [0085]). Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu fails to teach decode circuitry to decode an instruction having a field for an operand value; and execution circuitry to execute the decoded instruction to perform one or more operations on a matrix processing array. However, Venkataramani does teach decode circuitry to decode an instruction having a field for an operand value (Venkataramani [0143]: The decode unit 2340 (or decoder or decoder unit) may decode instructions (e.g., macro-instructions), and generate as an output one or more micro-operations, micro-code entry points, micro-instructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions); and execution circuitry to execute the decoded instruction to perform one or more operations on a matrix processing array (Venkataramani [0142]: shows processor core 2390 including a front end unit 2330 coupled to an execution engine unit). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu with the decode and execution circuitry of Venkataramani. One of ordinary skill in the art would be motivated to make this combination to allow for more flexibility in the circuit by allowing for different instructions to be performed by the circuit and allow for the execution of instructions from the decode circuitry to perform the needed operations. 
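The Liu-style behavior applied against claims 12 and 18 above, writing out an Nth row of results to storage and then zeroing the adder inputs of the first row of FMA blocks so the next accumulation starts fresh, can be illustrated with a small sketch. This is a hedged behavioral model under invented names and values; it is not the circuit of any cited reference.

```python
# Illustrative model: accumulate N rows of partial results, write the Nth
# row's totals out to storage, then zero the first-row adder inputs so the
# next group of rows accumulates from scratch. All names are hypothetical.

def process_tiles(data_rows, n):
    storage = []                          # models the result memory
    acc = [0.0] * len(data_rows[0])       # adder inputs start zeroed
    count = 0
    for row in data_rows:
        acc = [a + x for a, x in zip(acc, row)]   # FMA-style accumulation
        count += 1
        if count == n:                    # Nth row reached: write out...
            storage.append(acc)
            acc = [0.0] * len(row)        # ...and zero the adder inputs
            count = 0
    return storage

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
assert process_tiles(rows, 2) == [[4.0, 6.0], [12.0, 14.0]]
```

Zeroing the adder inputs rather than re-reading stale partial sums is what allows the skipped rows to contribute nothing, which matches the power-saving motivation quoted from Liu [0085].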
With regards to claim 19, Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu further in view of Venkataramani teaches all of the limitations of claim 18 above. Nichani fails to teach wherein the one or more operations comprise: a load array configuration operation, a matrix load operation, a filter load operation, an array filter operation, and/or a matrix store operation. However, Venkataramani does teach wherein the one or more operations comprise: a load array configuration operation, a matrix load operation, a filter load operation, an array filter operation, and/or a matrix store operation (Venkataramani [0077]: FIGS. 10A-10B illustrate instructions 1000-1001 according to embodiments of the disclosure… Coarse-grained Data Instructions 1004: e.g., compute dominant instructions such as convolutions (nD-convolutions); Venkataramani Fig. 10A: shows the NDCONV instruction which convolves the input with the kernel). Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Nichani with the convolution instruction of Venkataramani. One of ordinary skill in the art would be motivated to make this combination because it would allow for the circuitry to perform the instructions needed to convolve the matrix with the kernels. With regards to claim 20, Nichani in view of Yoon-1 further in view of Yoon-2 further in view of Liu further in view of Venkataramani teaches all of the limitations of claim 18 above. Nichani further teaches wherein the matrix processing array comprises a systolic array (Nichani Section 4 Page 316: The systolic array processor (SAP)). Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jakob O Gudas whose telephone number is (571)272-0695. The examiner can normally be reached Monday-Thursday: 7:30AM-5:00PM Friday: 7:30AM-4:00PM. 
Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, James Trujillo, can be reached at (571) 272-3677. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J.O.G./
Examiner, Art Unit 2151

/James Trujillo/
Supervisory Patent Examiner, Art Unit 2151

Prosecution Timeline

Dec 10, 2021: Application Filed
May 02, 2022: Response after Non-Final Action
Apr 21, 2025: Non-Final Rejection — §101, §103, §112
Sep 25, 2025: Response Filed
Oct 30, 2025: Final Rejection — §101, §103, §112
Feb 04, 2026: Request for Continued Examination
Feb 13, 2026: Response after Non-Final Action
Feb 19, 2026: Non-Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602200: ANALOG MULTIPLY-ACCUMULATE UNIT FOR MULTIBIT IN-MEMORY CELL COMPUTING (granted Apr 14, 2026; 2y 5m to grant)
Patent 12566586: HIGH-SPEED QUANTUM RANDOM NUMBER GENERATOR BASED ON VACUUM STATE FLUCTUATION TECHNOLOGY (granted Mar 03, 2026; 2y 5m to grant)
Based on the 2 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 44%
With Interview: 99% (+71.1%)
Median Time to Grant: 4y 2m
PTA Risk: High
Based on 9 resolved cases by this examiner. Grant probability derived from career allow rate.
