Last updated: May 29, 2026
Application No. 18/906,859
SYSTOLIC DISAGGREGATION WITHIN A MATRIX ACCELERATOR ARCHITECTURE

Non-Final OA §103 §DP
Filed
Oct 04, 2024
Priority
Mar 15, 2019 — provisional 62/819,361 +5 more
Examiner
SHENG, XIN
Art Unit
2619
Tech Center
2600 — Communications
Assignee
Intel Corporation
OA Round
1 (Non-Final)
Interview Optional

Examiner has a relatively high allowance rate (72%) and a +17.2% interview lift. A written response may suffice.
Based on 404 resolved cases, 2023–2026
Examiner Intelligence

SHENG, XIN
Grants 72% — above average
Career Allowance Rate
293 granted / 404 resolved
+10.5% vs TC avg
Strong +17% interview lift
Interview Lift: +17.2% (resolved cases with interview vs. without)
Typical timeline
2y 4m
Avg Prosecution
14 currently pending
Career history
421
Total Applications
across all art units
Statute-Specific Performance

§101
1.6%
-38.4% vs TC avg
§103
94.5%
+54.5% vs TC avg
§102
1.0%
-39.0% vs TC avg
§112
0.3%
-39.7% vs TC avg
Based on career data from 404 resolved cases
Office Action

§103 §DP
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Double Patenting

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to final Office action, see 37 CFR 1.113(c). A request for reconsideration while not provided for in 37 CFR 1.113(c) may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.

The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.

Claims 1-4, 8-12, 15-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-4, 8-12, 15-20 of U.S. Patent No. 12141094.

US App #18906850    | 1 | 2 | 3 | 4 | 8 | 9 | 10    | 11    | 12 | 15 | 16 | 17 | 18 | 19 | 20
US Patent #12141094 | 1 | 2 | 3 | 4 | 8 | 9 | 10,13 | 10,11 | 12 | 15 | 16 | 17 | 18 | 19 | 20

US App #18906850 Claim 1:
1. A general-purpose graphics processing unit comprising: a matrix accelerator including: memory to store input data; a systolic array coupled with the memory, the systolic array including multiple stages, wherein each of the multiple stages include multiple processing elements; and circuitry to bypass a matrix multiply operation having zero-value inputs, the bypass performed based on metadata associated with the input data.
US Patent #12141094 Claim 1:
1. A general-purpose graphics processing unit comprising: a matrix accelerator including: memory to store input data; a systolic array coupled with the memory, the systolic array including multiple stages, wherein each of the multiple stages include multiple processing elements; and circuitry to bypass a matrix multiply operation having zero-value inputs, the bypass performed based on metadata associated with the inputs, wherein each of the multiple processing elements include hardware logic to detect a zero-value input and bypass a matrix multiply operation based on the zero-value input.

US App #18906850 Claim 2:
2. The general-purpose graphics processing unit as in claim 1, wherein the matrix accelerator is to receive the metadata as input in association with operands that specify a location for a zero-value input.
US Patent #12141094 Claim 2:
2. The general-purpose graphics processing unit as in claim 1, wherein the matrix accelerator is to receive the metadata as input in association with operands that specify a location for the zero-value input.

US App #18906850 Claim 3:
3. The general-purpose graphics processing unit as in claim 1, wherein the metadata is to be pre-generated for an entire set of input data.
US Patent #12141094 Claim 3:
3. The general-purpose graphics processing unit as in claim 2, wherein the metadata is to be pre-generated for an entire set of input data.

US App #18906850 Claim 4:
4. The general-purpose graphics processing unit as in claim 1, wherein the metadata is to be pre-generated based on a row of a first matrix for input to the matrix accelerator or a column of a second matrix for input to the matrix accelerator.
US Patent #12141094 Claim 4:
4. The general-purpose graphics processing unit as in claim 2, wherein the metadata is to be pre-generated based on a row of a first matrix for input to the matrix accelerator or a column of a second matrix for input to the matrix accelerator.

US App #18906850 Claim 8:
8. The general-purpose graphics processing unit as in claim 1, wherein each processing element includes hardware logic to detect a zero value input and bypass the matrix multiply operation based on the zero value input.
US Patent #12141094 Claim 8:
8. The general-purpose graphics processing unit as in claim 1, wherein to bypass the matrix multiply operation based on the zero-value input, a processing element of the multiple processing elements is to: detect, based on the metadata, that at least one of multiple inputs to a first portion of the matrix multiply operation is a zero-value input; and bypass a load for the multiple inputs into the processing element associated with the first portion of the matrix multiply operation.

US App #18906850 Claim 9 and US Patent #12141094 Claim 9 (identical):
9. The general-purpose graphics processing unit as in claim 8, wherein a processing element is to bypass a first matrix multiply operation having zero value inputs and load input for a second matrix multiply operation within a single clock cycle.

US App #18906850 Claim 10:
10. A method comprising: on a general-purpose graphics processor having a matrix accelerator: generating metadata based on a row of a first matrix for input to the matrix accelerator and a column of a second matrix for input to the matrix accelerator; analyzing the metadata for input to a matrix multiply operation to be performed by the matrix accelerator, the input to the matrix multiply operation including one or more elements of multiple input matrices; determining, based on the metadata, whether the input to the matrix multiply operation includes a zero-value input; and bypassing at least a first portion of the matrix multiply operation in response to a determination that the matrix multiply operation includes a zero-value input.
US Patent #12141094 Claim 13:
13. The method as in claim 10, further comprising generating the metadata based on a row of a first matrix for input to the matrix accelerator and a column of a second matrix for input to the matrix accelerator.
US Patent #12141094 Claim 10:
10. A method comprising: on a general-purpose graphics processor having a matrix accelerator: analyzing metadata for input to a matrix multiply operation to be performed by the matrix accelerator, the input to the matrix multiply operation including one or more elements of multiple input matrices; determining, based on the metadata, whether the input to the matrix multiply operation includes a zero value input; and bypassing at least a first portion of the matrix multiply operation in response to a determination that the matrix multiply operation includes a zero-value input, wherein bypassing at least the first portion of the matrix multiply operation includes: determining that at least one of multiple inputs to the first portion of the matrix multiply operation is a zero-value input; and bypassing loading of the multiple inputs into a processing element associated with the first portion of the matrix multiply operation.

US App #18906850 Claim 11:
11. The method as in claim 10, wherein bypassing at least the first portion of the matrix multiply operation includes: determining that at least one of multiple inputs to the first portion of the matrix multiply operation is a zero-value input; bypassing loading of the multiple inputs into a processing element associated with the first portion of the matrix multiply operation; and determining that each of multiple inputs to a second portion of the matrix multiply operation are non-zero-value input; loading the multiple inputs to the second portion of the matrix multiply operation into the processing element; and performing the second portion of the matrix multiply operation via the processing element.
US Patent #12141094 Claim 10:
10. As quoted above in the comparison for claim 10.
US Patent #12141094 Claim 11:
11. The method as in claim 10, wherein bypassing at least the portion of the matrix multiply operation additionally includes: determining that each of multiple inputs to a second portion of the matrix multiply operation are non-zero value input; loading the multiple inputs to the second portion of the matrix multiply operation into the processing element; and performing the second portion of the matrix multiply operation via the processing element.

US App #18906850 Claim 12 and US Patent #12141094 Claim 12 (identical):
12. The method as in claim 11, the method additionally comprising: bypassing loading of the multiple inputs into the processing element associated with the first portion of the matrix multiply operation during a first clock cycle; and loading the multiple inputs to the second portion of the matrix multiply operation into the processing element during the first clock cycle.

US App #18906850 Claim 15:
15. A data processing system comprising: a memory device; and an accelerator device coupled with the memory device, wherein the accelerator device comprises: a matrix accelerator including circuitry to bypass a matrix multiply operation having zero-value inputs, the bypass performed based on metadata associated with inputs to the matrix accelerator, wherein the matrix accelerator includes multiple processing elements and is to receive the metadata as input in association with operands that specify a location for a zero-value input or generate the metadata based on data referenced by input operands, the data including the zero-value inputs.
US Patent #12141094 Claim 15:
15. A data processing system comprising: a memory device; and a general-purpose graphics processing unit coupled with the memory device, wherein the general-purpose graphics processing unit comprises: a matrix accelerator including circuitry to bypass a matrix multiply operation having zero-value inputs, the bypass performed based on metadata associated with the inputs, wherein the matrix accelerator includes multiple processing elements, each of the multiple processing elements include hardware logic to detect a zero value input and bypass the matrix multiply operation based on the zero value input, and the matrix accelerator is to receive the metadata as input in association with operands that specify a location for the zero-value input or generate the metadata based on data referenced by input operands, the data including the zero-value inputs.

US App #18906850 Claim 16 and US Patent #12141094 Claim 16 (identical):
16. The data processing system as in claim 15, wherein the multiple processing elements are configured as systolic array of processing elements.

US App #18906850 Claim 17 and US Patent #12141094 Claim 17 (identical):
17. The data processing system as in claim 15, wherein the metadata is analyzed or generated in relation to a sub-matrix of input before the input is loaded into the multiple processing elements.

US App #18906850 Claim 18 and US Patent #12141094 Claim 18 (identical):
18. The data processing system as in claim 15, wherein the matrix accelerator is to generate the metadata based on the data referenced by the input operands, the matrix accelerator to generate the metadata based on a row of a first matrix for input to the matrix accelerator and a column of a second matrix for input to the matrix accelerator.

US App #18906850 Claim 19:
19. The data processing system as in claim 15, wherein each processing element includes hardware logic to detect a zero value input and bypass the matrix multiply operation based on the zero value input, a processing element of the multiple processing elements is configured to: detect, based on the metadata, that at least one of multiple inputs to a first portion of the matrix multiply operation is a zero-value input; and bypass a load for the multiple inputs into the processing element associated with the first portion of the matrix multiply operation.
US Patent #12141094 Claim 19:
19. The data processing system as in claim 15, wherein to bypass the matrix multiply operation based on the zero value input, a processing element of the multiple processing elements is to: detect, based on the metadata, that at least one of multiple inputs to a first portion of the matrix multiply operation is a zero-value input; and bypass a load for the multiple inputs into the processing element associated with the first portion of the matrix multiply operation.

US App #18906850 Claim 20 and US Patent #12141094 Claim 20 (identical):
20. The data processing system as in claim 19, wherein a processing element is to bypass a first matrix multiply operation having zero value inputs and load input for a second matrix multiply operation within a single clock cycle.

Although the claims at issue are not identical, they are not patentably distinct from each other. For example, Claim 15 of Patent #12141094 discloses "a general-purpose graphics processing unit coupled with the memory device, wherein the general-purpose graphics processing unit comprises" while App #18906850 Claim 15 discloses "an accelerator device coupled with the memory device, wherein the accelerator device comprises". It is obvious to a person with ordinary skill in the art that a computer including GPU and/or CPU can function as an accelerator device. Therefore, Patent #12141094 Claim 15 discloses all limitations of App #18906850 Claim 15.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 15-17 are rejected under 35 U.S.C.
103 as being unpatentable over Frumkin et al (US20190278600) in view of Lu et al (US11550971) further in view of Wu (GB2286909) . Regarding Claim 1. Frumkin teaches A general-purpose graphics processing unit comprising: a matrix accelerator (Frumkin, abstract, the invention describes implementation for the processing of sparse matrices for mathematical and programmatic operations. In particular, various embodiments utilize a tiling approach that divides a sparse matrix into submatrices, many of which will include only zero value entities. These empty tiles can be ignored, and only the tiles with non-zero entries processed, which reduces resource and time requirements for the processing. An indexing approach can be used for each entity that is a combination of the tile identifier and an offset value, which enables the values to be multiplied correctly against, for example, values of a dense matrix. The tiles can be processed in parallel and the results accumulated to generate a matrix product. The matrix product can then be passed to the next step in a process or operation, such as to a next layer in a deep neural network. [0021] In some embodiments the matrix multiplication is implemented on a single GPU that includes multiple streaming multiprocessors (SMs). SMs can work independently and in parallel. Each SM can be assigned one or more of the tiles of the sparse matrix for processing as discussed later herein, and can perform the appropriate multiplication of those tiles times the dense matrix. The tiling can be performed as part of a pre-processing step in some embodiments, either on the GPU or on a CPU. For large matrices or numbers of tiles, multiple GPUs may be utilized for the processing. The final product can then be sent to the CPU if needed, but is often kept on the GPU and passed to the next layer of the neural network.) 
including: Frumkin does not explicitly teach, but Lu teaches memory to store input data; a systolic array coupled with the memory (Lu, abstract, the invention describes operations comprising configuring a simulated environment to be representative of a physical device based, at least in part, on an initial description of the physical device that described structural parameters of the physical device. The operations further comprise performing a physics simulation with an artificial intelligence ("AI") accelerator. The AI accelerator includes a matrix multiply unit for computing convolution operations via a plurality of multiply-accumulate units. The operations further comprise computing a field response in response of the physical device in response to an excitation source within the simulated environment when performing the physics simulation. The field response is computed, at least in part, with the convolution operations to perform spatial differencing. Col 5, line 40-67, In the illustrated embodiment, AI accelerator 211 is a distributed processing platform including a plurality of tensor processing units (TPUs) 212 interconnected with one another by bus 223. Each of the TPUs 212 may individually be considered a machine learning accelerator, that when coupled together, form a distributed system that scales computational speed (e.g., linearly) with respect to the number of TPUs within AI accelerator 211. Each of the plurality of TPUs 212 includes, inter alia, a buffer 214 and a matrix multiply unit (MXU) 216. The buffer 214 provides a storage medium (e.g., memory) for storing instructions (e.g., inputs) and outputs (e.g., result of matrix multiplication or convolution operations). The matrix multiply unit 216 includes a plurality of multiply-accumulate (MAC) units.
Each MAC unit is a hardware unit that performs a multiply-accumulate operation (or fused multiply-accumulate operation) , which computes the product of two numbers and adds that product to an accumulator. The precision of such operations are dependent on system design, but may include eight-bit integer multipliers, sixteen-bit floating point multipliers, and the like. In one embodiment, each of the MAC units can perform eight-bit multiply-accumulate operations. In the illustrated embodiment, the MAC units of each of the MXUs 216 are arranged in N rows by N columns to form a systolic array . Each column of the systolic array produces a partial product (e.g., of a matrix multiply or convolution operation) that may be summed to determine the result of the matrix multiplication or convolution operation. Thus, through the MXUs 216 of the plurality of TPUs 212 the AI accelerator 211 may be able to provide hardware acceleration of the matrix multiplication or convolution operations that make up the bulk of the computational costs of neural networks.). Frumkin and Lu are analogous art because they both teach enhancement to parallel processing for processor including matrix multiplying accelerator. Lu further teaches the matrix multiplying units in a processor is arranged to form a systolic array. Therefore, it would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention, to modify the matrix multiplying accelerator (taught in Frumkin) to further arrange the matrix multiplying units in a systolic array (taught in Lu), so as to provide method for performing physics simulation on machine-learning accelerated hardware platforms, which can adjust for suitability to a particular application based on input design parameters (Lu, col 1, line 13-28, col 2, line 18-28). 
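For illustration only, the arrangement quoted from Lu (MAC units in N rows by N columns, with each column accumulating a result) can be sketched as a small functional model. This is a minimal sketch under stated assumptions, not Lu's hardware and not the claimed accelerator: it ignores cycle timing and operand skew, and the names `mac`, `systolic_matvec`, and `systolic_matmul` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def mac(acc: float, a: float, b: float) -> float:
    """One multiply-accumulate step: add the product a * b to the accumulator."""
    return acc + a * b


def systolic_matvec(x: List[float], weights: Matrix) -> List[float]:
    """Functional model of one pass through a grid of MAC units.

    Assumption: the cell at (row, col) holds weights[row][col]; input x[row]
    is shared along its row, and the running sum moves down each column.
    The value leaving the bottom of column `col` is sum_row x[row] * weights[row][col].
    """
    outputs = []
    for col in range(len(weights[0])):
        acc = 0.0
        for row in range(len(weights)):
            acc = mac(acc, x[row], weights[row][col])
        outputs.append(acc)
    return outputs


def systolic_matmul(a: Matrix, weights: Matrix) -> Matrix:
    """Stream the rows of `a` through the grid one at a time."""
    return [systolic_matvec(row, weights) for row in a]


A = [[1.0, 2.0], [3.0, 4.0]]
W = [[5.0, 6.0], [7.0, 8.0]]
result = systolic_matmul(A, W)  # [[19.0, 22.0], [43.0, 50.0]]
```

The model only shows how per-column accumulation yields a matrix product; a real systolic array overlaps these steps across cells in time.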
The combination of Frumkin and Lu fails to explicitly teach, however, Wu teaches the systolic array including multiple stages, wherein each of the multiple stages include multiple processing elements; and (Wu, abstract, the invention describes a pipelined SIMD-systolic array processor comprising a number of processing elements constructed as an array architecture, multiport memory, registers, multiplexers, and controller, wherein the registers and multiplexers are connected for transferring data between the multiport memory and processing elements. The array processor can have a faster processing speed and, through using a multiport memory, each processing element requires only a small amount of storage. Page 23, par 3, page 24, par 1, As shown in Fig. 33, the array processing architecture is a stage-pipelined embodiment of the present invention. Such an array processing architecture comprises n pipelined SIMD-Systolic array processing architectures, which are cascaded in a pipelined manner, and is called stage-pipelined architecture. Also, such architecture can be combined with a general-purpose processor 1001 to enhance its computational performance. Shown as Fig. 34, the computation of 1008-point discrete Fourier transform is used as an example for explanation. A general purpose processor 1001 is cascaded with three pipelined SIMD-Systolic array processing architectures 3000, 3001, 3002 which are for computing 7-point, 9-point, 16-point discrete Fourier transform respectively. By using such an architecture, the 1008-point discrete Fourier transform can be computed with a high computational performance. As shown in Fig. 35, the array processing architecture is an embodiment of combining the present invention with systolic architecture which comprises of multiple processing elements.) Frumkin, Lu and Wu are analogous art because they all teach enhancement to parallel processing for processor including matrix multiplying accelerator.
Wu further teaches the matrix multiplying units systolic array include many stages. Therefore, it would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention, to modify the matrix multiplying accelerator with systolic array arrangement (taught in Frumkin and Lu) to further implement the systolic array in a stage-pipelined structure (taught in Wu), so as to provide the array processor with faster processing speed (Wu, abstract). The combination of Frumkin, Lu and Wu further teaches circuitry to bypass a matrix multiply operation having zero-value inputs, the bypass performed based on metadata associated with the input data (Frumkin, [0013] Approaches in accordance with various embodiments provide for the processing of sparse matrices for mathematical and programmatic operations. In particular, various embodiments utilize a tiling approach that divides a sparse matrix into submatrices, many of which will include only zero-value entities. These empty tiles can be ignored for purposes of the computation, and only the tiles with non-zero entries processed, which reduces resource requirements for the processing. An indexing approach can be used for each entity that is a combination of the tile identifier and an offset value , which enables the values to be multiplied correctly against, for example, values of a dense matrix. The tiles can be processed in parallel and the results accumulated to generate a matrix product. The product can then be passed to a next step in a process, such as to a next layer in a deep neural network. Therefore, the index of the tile is the input metadata for matrix multiplying operations.) . Regarding Claim 2. 
The combination of Frumkin, Lu and Wu further teaches The general-purpose graphics processing unit as in claim 1, wherein the matrix accelerator is to receive the metadata as input in association with operands that specify a location for a zero-value input (Frumkin, [0013] Approaches in accordance with various embodiments provide for the processing of sparse matrices for mathematical and programmatic operations. In particular, various embodiments utilize a tiling approach that divides a sparse matrix into submatrices, many of which will include only zero-value entities. These empty tiles can be ignored for purposes of the computation, and only the tiles with non-zero entries processed, which reduces resource requirements for the processing. An indexing approach can be used for each entity that is a combination of the tile identifier and an offset value, which enables the values to be multiplied correctly against, for example, values of a dense matrix. [0054] FIG. 5 illustrates an example process 500 for using tiling for sparse matrix multiplication that can be utilized in accordance with various embodiments. … The tiles can be analyzed and any tile including only zero-value elements, or any empty tiles, can be discarded 506 or ignored for purposes of the matrix multiplication operation. As part of the tiling algorithm, elements of the tiles can be provided 508 with individual indices using the tile identifiers and positional offsets. The elements of the individual tiles can then be multiplied 510 by the dense matrix. Therefore, the tile index specifies the location of the zero value tiles.).

Regarding Claim 3. The combination of Frumkin, Lu and Wu further teaches The general-purpose graphics processing unit as in claim 1, wherein the metadata is to be pre-generated for an entire set of input data (Frumkin, [0013] Approaches in accordance with various embodiments provide for the processing of sparse matrices for mathematical and programmatic operations. In particular, various embodiments utilize a tiling approach that divides a sparse matrix into submatrices, many of which will include only zero-value entities. These empty tiles can be ignored for purposes of the computation, and only the tiles with non-zero entries processed, which reduces resource requirements for the processing. An indexing approach can be used for each entity that is a combination of the tile identifier and an offset value, which enables the values to be multiplied correctly against, for example, values of a dense matrix. Therefore, the indexing metadata is pre-generated at least for the set of tiles the processing unit is operating on.).

Regarding Claim 15. The combination of Frumkin, Lu and Wu further teaches A data processing system (Frumkin, abstract, the invention describes implementation for the processing of sparse matrices for mathematical and programmatic operations. In particular, various embodiments utilize a tiling approach that divides a sparse matrix into submatrices, many of which will include only zero value entities. These empty tiles can be ignored, and only the tiles with non-zero entries processed, which reduces resource and time requirements for the processing. An indexing approach can be used for each entity that is a combination of the tile identifier and an offset value, which enables the values to be multiplied correctly against, for example, values of a dense matrix. The tiles can be processed in parallel and the results accumulated to generate a matrix product. The matrix product can then be passed to the next step in a process or operation, such as to a next layer in a deep neural network. [0021] In some embodiments the matrix multiplication is implemented on a single GPU that includes multiple streaming multiprocessors (SMs). SMs can work independently and in parallel. Each SM can be assigned one or more of the tiles of the sparse matrix for processing as discussed later herein, and can perform the appropriate multiplication of those tiles times the dense matrix. The tiling can be performed as part of a pre-processing step in some embodiments, either on the GPU or on a CPU. For large matrices or numbers of tiles, multiple GPUs may be utilized for the processing. The final product can then be sent to the CPU if needed, but is often kept on the GPU and passed to the next layer of the neural network.) comprising: a memory device; and an accelerator device coupled with the memory device, wherein the accelerator device (Lu, abstract, the invention describes operations comprising configuring a simulated environment to be representative of a physical device based, at least in part, on an initial description of the physical device that describes structural parameters of the physical device. The operations further comprise performing a physics simulation with an artificial intelligence ("AI") accelerator. The AI accelerator includes a matrix multiply unit for computing convolution operations via a plurality of multiply-accumulate units. The operations further comprise computing a field response of the physical device in response to an excitation source within the simulated environment when performing the physics simulation. The field response is computed, at least in part, with the convolution operations to perform spatial differencing. Col. 5, lines 40-67, In the illustrated embodiment, AI accelerator 211 is a distributed processing platform including a plurality of tensor processing units (TPUs) 212 interconnected with one another by bus 223. Each of the TPUs 212 may individually be considered a machine learning accelerator, that when coupled together, form a distributed system that scales computational speed (e.g., linearly) with respect to the number of TPUs within AI accelerator 211.
Each of the plurality of TPUs 212 includes, inter alia, a buffer 214 and a matrix multiply unit (MXU) 216. The buffer 214 provides a storage medium (e.g., memory) for storing instructions (e.g., inputs) and outputs (e.g., result of matrix multiplication or convolution operations). The matrix multiply unit 216 includes a plurality of multiply-accumulate (MAC) units. Each MAC unit is a hardware unit that performs a multiply-accumulate operation (or fused multiply-accumulate operation), which computes the product of two numbers and adds that product to an accumulator. The precision of such operations are dependent on system design, but may include eight-bit integer multipliers, sixteen-bit floating point multipliers, and the like. In one embodiment, each of the MAC units can perform eight-bit multiply-accumulate operations. In the illustrated embodiment, the MAC units of each of the MXUs 216 are arranged in N rows by N columns to form a systolic array. Each column of the systolic array produces a partial product (e.g., of a matrix multiply or convolution operation) that may be summed to determine the result of the matrix multiplication or convolution operation. Thus, through the MXUs 216 of the plurality of TPUs 212 the AI accelerator 211 may be able to provide hardware acceleration of the matrix multiplication or convolution operations that make up the bulk of the computational costs of neural networks.) comprises: a matrix accelerator including circuitry to bypass a matrix multiply operation having zero-value inputs, the bypass performed based on metadata associated with inputs to the matrix accelerator (Frumkin, [0013] Approaches in accordance with various embodiments provide for the processing of sparse matrices for mathematical and programmatic operations. In particular, various embodiments utilize a tiling approach that divides a sparse matrix into submatrices, many of which will include only zero-value entities. These empty tiles can be ignored for purposes of the computation, and only the tiles with non-zero entries processed, which reduces resource requirements for the processing. An indexing approach can be used for each entity that is a combination of the tile identifier and an offset value, which enables the values to be multiplied correctly against, for example, values of a dense matrix. The tiles can be processed in parallel and the results accumulated to generate a matrix product. The product can then be passed to a next step in a process, such as to a next layer in a deep neural network. Therefore, the index of the tile is the input metadata for matrix multiplying operations.), wherein the matrix accelerator includes multiple processing elements (Wu, abstract, the invention describes a pipelined SIMD-systolic array processor comprising a number of processing elements constructed as an array architecture, multiport memory, registers, multiplexers, and controller, wherein the registers and multiplexers are connected for transferring data between the multiport memory and processing elements. The array processor can have a faster processing speed and, through using a multiport memory, each processing element requires only a small amount of storage. Page 23, par 3, page 24, par 1, As shown in Fig. 33, the array processing architecture is a stage-pipelined embodiment of the present invention. Such an array processing architecture comprises n pipelined SIMD-Systolic array processing architectures, which are cascaded in a pipelined manner, and is called stage-pipelined architecture. Also, such architecture can be combined with a general-purpose processor 1001 to enhance its computational performance. Shown as Fig. 34, the computation of 1008-point discrete Fourier transform is used as an example for explanation. A general purpose processor 1001 is cascaded with three pipelined SIMD-Systolic array processing architectures 3000, 3001, 3002, which are for computing 7-point, 9-point, 16-point discrete Fourier transform respectively. By using such an architecture, the 1008-point discrete Fourier transform can be computed with a high computational performance. As shown in Fig. 35, the array processing architecture is an embodiment of combining the present invention with systolic architecture which comprises of multiple processing elements.) and is to receive the metadata as input in association with operands that specify a location for a zero-value input or generate the metadata based on data referenced by input operands, the data including the zero-value inputs (Frumkin, [0013] Approaches in accordance with various embodiments provide for the processing of sparse matrices for mathematical and programmatic operations. In particular, various embodiments utilize a tiling approach that divides a sparse matrix into submatrices, many of which will include only zero-value entities. These empty tiles can be ignored for purposes of the computation, and only the tiles with non-zero entries processed, which reduces resource requirements for the processing. An indexing approach can be used for each entity that is a combination of the tile identifier and an offset value, which enables the values to be multiplied correctly against, for example, values of a dense matrix. [0054] FIG. 5 illustrates an example process 500 for using tiling for sparse matrix multiplication that can be utilized in accordance with various embodiments. … The tiles can be analyzed and any tile including only zero-value elements, or any empty tiles, can be discarded 506 or ignored for purposes of the matrix multiplication operation. As part of the tiling algorithm, elements of the tiles can be provided 508 with individual indices using the tile identifiers and positional offsets. The elements of the individual tiles can then be multiplied 510 by the dense matrix. Therefore, the tile index specifies the location of the zero value tiles.). The reasoning for combination of Frumkin, Lu and Wu is the same as described in Claim 1.

Regarding Claim 16. The combination of Frumkin, Lu and Wu further teaches The data processing system as in claim 15, wherein the multiple processing elements are configured as systolic array of processing elements (Wu, abstract, the invention describes a pipelined SIMD-systolic array processor comprising a number of processing elements constructed as an array architecture, multiport memory, registers, multiplexers, and controller, wherein the registers and multiplexers are connected for transferring data between the multiport memory and processing elements. The array processor can have a faster processing speed and, through using a multiport memory, each processing element requires only a small amount of storage. Page 23, par 3, page 24, par 1, As shown in Fig. 33, the array processing architecture is a stage-pipelined embodiment of the present invention. Such an array processing architecture comprises n pipelined SIMD-Systolic array processing architectures, which are cascaded in a pipelined manner, and is called stage-pipelined architecture. Also, such architecture can be combined with a general-purpose processor 1001 to enhance its computational performance. Shown as Fig. 34, the computation of 1008-point discrete Fourier transform is used as an example for explanation. A general purpose processor 1001 is cascaded with three pipelined SIMD-Systolic array processing architectures 3000, 3001, 3002, which are for computing 7-point, 9-point, 16-point discrete Fourier transform respectively. By using such an architecture, the 1008-point discrete Fourier transform can be computed with a high computational performance. As shown in Fig.
35, the array processing architecture is an embodiment of combining the present invention with systolic architecture which comprises of multiple processing elements.). The reasoning for combination of Frumkin, Lu and Wu is the same as described in Claim 1.

Regarding Claim 17. The combination of Frumkin, Lu and Wu further teaches The data processing system as in claim 15, wherein the metadata is analyzed or generated in relation to a sub-matrix of input before the input is loaded into the multiple processing elements (Frumkin, [0026] Approaches in accordance with various embodiments can reduce the memory overhead needed to represent a sparse matrix with respect to one of these conventional formats. Further, such approaches can improve memory locality through the organization of non-zero elements into tiles, or submatrices. FIG. 3A illustrates an example tiling approach that can be utilized in accordance with various embodiments. In this example, a (small for example purposes) sparse matrix 300 is illustrated wherein a significant majority of the entries have zero values. The sparse matrix can be divided or segmented into a number of tiles, or submatrices, of the same size, although in other embodiments different sizes might be used as well. It is obvious to a person with ordinary skill in the art that the zero value tiles are sorted out before the processor operates on the tile, so as to reduce the memory overheads.).

Claims 5-7 are rejected under 35 U.S.C. 103 as being unpatentable over Frumkin et al (US20190278600) in view of Lu et al (US11550971) and Wu (GB2286909), further in view of Latorre et al (US20200272425).

Regarding Claim 5. The combination of Frumkin, Lu and Wu fails to explicitly teach, however, Latorre teaches The general-purpose graphics processing unit as in claim 1, wherein the input data has a block sparsity pattern (Latorre, abstract, the invention describes computing systems that process data organized in a matrix format. For sparse matrices that hold few significant values and many values that can be ignored, transmitting and processing all the values in such matrices is wasteful. Thus, this invention introduces a method for storing a sparse matrix in a compressed format that allows for a matrix transpose operation to be performed on the compressed matrix without having to first decompress the compressed matrix. By utilizing the introduced techniques, more matrix operations can be performed than conventional systems. [0025] Similar to patterns of non-zero elements, patterns of non-zero element square blocks in a matrix under a sparsity restriction can also be used as metadata. FIG. 1B illustrates a 4x4 matrix 180, that is under a 2 dimension, 2 element sparsity restriction and hence has 90 possible patterns when 1x1 or individual non-zero element is considered as an element of the matrix 180. The number of possible patterns is also 90 for both an 8x8 matrix 185 and a 16x16 matrix under the same sparsity restriction when 2x2 and 4x4 non-zero element blocks are considered as an element, respectively. When using patterns of non-zero element square blocks in a matrix as metadata, the size of the blocks is stored, e.g., as a field in the LUT.).

Frumkin, Lu, Wu and Latorre are analogous art because they all teach enhancements to parallel processing for processors, including matrix multiply operations. Frumkin and Latorre both teach sparse matrix data. Latorre further teaches input data as a block sparsity pattern. Therefore, it would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention, to modify the matrix multiplying accelerator with systolic array arrangement (taught in Frumkin, Lu and Wu) to further implement the method of storing a sparse matrix in a compressed format (taught in Latorre), so as to allow for a matrix transpose operation to be performed on the compressed matrix without having to first decompress the compressed matrix (Latorre, abstract, [0003-0004]).

Regarding Claim 6. The combination of Frumkin, Lu, Wu and Latorre further teaches The general-purpose graphics processing unit as in claim 5, wherein the block sparsity pattern was induced via block sparsity pruning (Latorre, [0021] FIG. 1A illustrates diagrams of a sparse matrix 100 that is compressed and encoded according to the principles of the disclosure. In the illustrated embodiment, a sparse 4x4 matrix 100 that has been constrained under a 2 dimension, 2 element sparsity restriction is provided. "2 dimension" refers to a number of dimensions along which the sparsity constraint is imposed, e.g., along row N and column M, and "2 element" refers to a number of non-zero elements in each dimension along which the sparsity constraint is imposed. As a two element constraint is imposed on two dimensions, the matrix 100 has two non-zero elements per row and also per column. The hatched squares such as 120s refer to zero elements. Therefore, the sparsity constraint/restriction is equivalent to sparsity pruning.). The reasoning for combination of Frumkin, Lu, Wu and Latorre is the same as described in Claim 5.

Regarding Claim 7.
The combination of Frumkin, Lu, Wu and Latorre further teaches The general-purpose graphics processing unit as in claim 5, wherein the metadata is to be pre-generated in relation to a sub-matrix of input data before the input data is loaded into the multiple processing elements (Frumkin, [0026] Approaches in accordance with various embodiments can reduce the memory overhead needed to represent a sparse matrix with respect to one of these conventional formats. Further, such approaches can improve memory locality through the organization of non-zero elements into tiles, or submatrices. FIG. 3A illustrates an example tiling approach that can be utilized in accordance with various embodiments. In this example, a (small for example purposes) sparse matrix 300 is illustrated wherein a significant majority of the entries have zero values. The sparse matrix can be divided or segmented into a number of tiles, or submatrices, of the same size, although in other embodiments different sizes might be used as well. It is obvious to a person with ordinary skill in the art that the zero value tiles are sorted out before the processor operates on the tile, so as to reduce the memory overheads.).

Allowable Subject Matter

Claims 4, 8-9, 18-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Claims 10-14 are allowable if the double patenting rejection is overcome.

The following is a statement of reasons for the indication of allowable subject matter:

Regarding Claim 4, it recites “The general-purpose graphics processing unit as in claim 1, wherein the metadata is to be pre-generated based on a row of a first matrix for input to the matrix accelerator or a column of a second matrix for input to the matrix accelerator” in the context of Claim 4. The prior art of record, either alone or in combination, fails to teach or suggest the above quoted limitation of Claim 4. Therefore, Claim 4 is allowable over prior art.

Regarding Claim 8, it recites “The general-purpose graphics processing unit as in claim 1, wherein each processing element includes hardware logic to detect a zero value input and bypass the matrix multiply operation based on the zero value input” in the context of Claim 8. The prior art of record, either alone or in combination, fails to teach or suggest the above quoted limitation of Claim 8. Therefore, Claim 8 is allowable over prior art. Claim 9 depends from Claim 8 with respective additional limitations. Therefore, Claim 9 is allowable over prior art.

Regarding Claim 10, it recites “A method comprising: on a general-purpose graphics processor having a matrix accelerator: generating metadata based on a row of a first matrix for input to the matrix accelerator and a column of a second matrix for input to the matrix accelerator; analyzing the metadata for input to a matrix multiply operation to be performed by the matrix accelerator, the input to the matrix multiply operation including one or more elements of multiple input matrices; determining, based on the metadata, whether the input to the matrix multiply operation includes a zero-value input; and bypassing at least a first portion of the matrix multiply operation in response to a determination that the matrix multiply operation includes a zero-value input.” in the context of Claim 10. The prior art of record, either alone or in combination, fails to teach or suggest the above quoted limitation of Claim 10. Therefore, Claim 10 is allowable over prior art. Claims 11-14 depend from Claim 10 with respective additional limitations. Therefore, Claims 11-14 are allowable over prior art.

Regarding Claim 18, it recites “The data processing system as in claim 15, wherein the matrix accelerator is to generate the metadata based on the data referenced by the input operands, the matrix accelerator to generate the metadata based on a row of a first matrix for input to the matrix accelerator and a column of a second matrix for input to the matrix accelerator.” in the context of Claim 18. The prior art of record, either alone or in combination, fails to teach or suggest the above quoted limitation of Claim 18. Therefore, Claim 18 is allowable over prior art.

Regarding Claim 19, it recites “The data processing system as in claim 15, wherein each processing element includes hardware logic to detect a zero value input and bypass the matrix multiply operation based on the zero value input, a processing element of the multiple processing elements is configured to: detect, based on the metadata, that at least one of multiple inputs to a first portion of the matrix multiply operation is a zero-value input; and bypass a load for the multiple inputs into the processing element associated with the first portion of the matrix multiply operation.” in the context of Claim 19. The prior art of record, either alone or in combination, fails to teach or suggest the above quoted limitation of Claim 19. Therefore, Claim 19 is allowable over prior art. Claim 20 depends from Claim 19 with respective additional limitations. Therefore, Claim 20 is allowable over prior art.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIN SHENG whose telephone number is (571)272-5734. The examiner can normally be reached M-F 9:30AM-3:30PM and 6:00PM-8:30PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jason Chan can be reached at 5712723022. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /Xin Sheng/ Primary Examiner, Art Unit 2619
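The tiling approach that the rejections quote from Frumkin ([0013], [0054]) can be illustrated with a minimal sketch. It is not Frumkin's implementation: the function names `tile_sparse` and `tiled_matmul`, the dictionary layout, and the square tile size are all assumptions made for illustration. The sketch divides a sparse matrix into tiles, never materializes all-zero tiles, indexes each kept element by tile identifier plus offset, and multiplies the kept elements against a dense matrix.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
TileId = Tuple[int, int]            # (tile_row, tile_col), hypothetical layout
Entry = Tuple[int, int, float]      # (row_offset, col_offset, value)


def tile_sparse(a: Matrix, tile: int) -> Dict[TileId, List[Entry]]:
    """Group the non-zero elements of `a` by tile.

    A tile that contains only zeros never gets a dictionary entry, which
    plays the role of "discarding" empty tiles in the quoted passage.
    """
    tiles: Dict[TileId, List[Entry]] = {}
    for i, row in enumerate(a):
        for j, value in enumerate(row):
            if value != 0:
                tiles.setdefault((i // tile, j // tile), []).append(
                    (i % tile, j % tile, value)
                )
    return tiles


def tiled_matmul(a: Matrix, b: Matrix, tile: int) -> Matrix:
    """Multiply sparse `a` by dense `b`, visiting only non-empty tiles."""
    out = [[0.0] * len(b[0]) for _ in range(len(a))]
    for (tile_row, tile_col), entries in tile_sparse(a, tile).items():
        for row_off, col_off, value in entries:
            # Tile identifier plus offset recovers the global position.
            i = tile_row * tile + row_off
            k = tile_col * tile + col_off
            for j in range(len(b[0])):
                out[i][j] += value * b[k][j]
    return out
```

For a 4x4 input whose non-zero entries sit only in the top-left and bottom-right 2x2 tiles, `tile_sparse` returns two of the four possible tiles, so half of the tile-level work is skipped before any multiplication happens.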
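The figure of 90 possible patterns quoted from Latorre [0025] for a 4x4 matrix under a "2 dimension, 2 element" sparsity restriction can be checked by brute force. The sketch below is only a verification aid, not anything described in Latorre; the function name `count_patterns` is hypothetical. It enumerates every way to place two non-zero positions in each row and keeps the placements that also leave two non-zero positions in each column.

```python
from itertools import combinations, product


def count_patterns(n: int = 4, k: int = 2) -> int:
    """Count n x n non-zero masks with exactly k non-zeros per row and column."""
    row_choices = list(combinations(range(n), k))  # column sets for one row
    count = 0
    for rows in product(row_choices, repeat=n):
        col_sums = [0] * n
        for cols in rows:
            for c in cols:
                col_sums[c] += 1
        if all(s == k for s in col_sums):
            count += 1
    return count


print(count_patterns())  # 90
```

The same count applies to the 8x8 and 16x16 cases in the quoted passage because, once each 2x2 or 4x4 block is treated as a single element, the block grid is again 4x4.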
Prosecution Timeline

Oct 04, 2024
Application Filed
Mar 26, 2026
Non-Final Rejection mailed — §103, §DP (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/173,623
Patent 12626326
IMAGE STITCHING WITH AN ADAPTIVE THREE-DIMENSIONAL BOWL MODEL OF THE SURROUNDING ENVIRONMENT FOR SURROUND VIEW VISUALIZATION
3y 2m to grant Granted May 12, 2026
18/367,119
Patent 12620165
SYSTEMS, METHODS, AND COMPUTER PROGRAM PRODUCTS FOR POPULATING ENVIRONMENT MODELS
2y 7m to grant Granted May 05, 2026
18/367,115
Patent 12614341
SYSTEMS, METHODS, AND COMPUTER PROGRAM PRODUCTS FOR POPULATING ENVIRONMENT MODELS
2y 7m to grant Granted Apr 28, 2026
18/490,458
Patent 12614337
SYSTEM AND METHODS FOR CUSTOMIZING 3D MODELS
2y 6m to grant Granted Apr 28, 2026
18/796,576
Patent 12614366
AUTOMATIC POINT CLOUD BUILDING ENVELOPE SEGMENTATION (AUTO-CuBES) USING MACHINE LEARNING
1y 8m to grant Granted Apr 28, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

1-2
Expected OA Rounds
72%
Grant Probability
90%
With Interview (+17.2%)
2y 4m (~8m remaining)
Median Time to Grant
Low
PTA Risk
Based on 404 resolved cases by this examiner. Grant probability derived from career allowance rate.
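The projection figures above can be approximately reproduced from numbers shown elsewhere on this page. The sketch below is a reader's reconstruction under stated assumptions, not the page's actual method: it assumes the interview lift is added as percentage points to the career allowance rate, and that the time remaining is the filing date plus the 2y 4m median, measured from the "Last updated" date of May 29, 2026. The helper `add_months` is hypothetical.

```python
from datetime import date

# Figures copied from the page; how the page combines them is an assumption.
GRANTED, RESOLVED = 293, 404
INTERVIEW_LIFT_PTS = 17.2             # treated as additive percentage points
FILED = date(2024, 10, 4)
LAST_UPDATED = date(2026, 5, 29)
MEDIAN_MONTHS_TO_GRANT = 2 * 12 + 4   # "2y 4m"


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, keeping the day of month."""
    years, month_index = divmod(d.month - 1 + months, 12)
    return d.replace(year=d.year + years, month=month_index + 1)


allowance_pct = 100 * GRANTED / RESOLVED            # about 72.5
with_interview_pct = allowance_pct + INTERVIEW_LIFT_PTS
projected_grant = add_months(FILED, MEDIAN_MONTHS_TO_GRANT)
months_remaining = (projected_grant - LAST_UPDATED).days / 30.44

print(f"allowance rate: {allowance_pct:.1f}%")
print(f"with interview: {with_interview_pct:.1f}%")
print(f"projected grant: {projected_grant}, about {months_remaining:.0f} months out")
```

Under these assumptions the allowance rate works out to about 72.5%, which the page displays as 72%, and the with-interview figure to about 89.7%, which the page displays as 90%. The rounding rule the page applies is not stated, so the small differences are left as they are.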