Last updated: May 29, 2026
Application No. 18/318,521
IN-PLACE TENSOR FORMAT CHANGE

Non-Final OA §103
Filed
May 16, 2023
Examiner
KASSIM, IMAD MUTEE
Art Unit
2129
Tech Center
2100 — Computer Architecture & Software
Assignee
Microsoft Technology Licensing, LLC
OA Round
1 (Non-Final)
Interview Optional

— +33.0% interview lift. Examiner has a relatively high allowance rate (72%); +33.0% interview lift. A written response may suffice.
Based on 163 resolved cases, 2023–2026
Examiner Intelligence

KASSIM, IMAD MUTEE View full profile →
Grants 72% — above average
Career Allowance Rate
118 granted / 163 resolved
+17.4% vs TC avg
Strong +33% interview lift
Without
With
+33.0%
Interview Lift
resolved cases with interview
Typical timeline
3y 8m
Avg Prosecution
10 currently pending
Career history
187
Total Applications
across all art units
Statute-Specific Performance

§101
5.7%
-34.3% vs TC avg
§103
82.1%
+42.1% vs TC avg
§102
8.5%
-31.5% vs TC avg
§112
2.1%
-37.9% vs TC avg
Black line = Tech Center average estimate • Based on career data from 163 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 1-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Pappu et al. (US 20230109990 A1) in view of Wang et al. (US 20210334142 A1).

Regarding claim 1. 
Pappu teaches a neural processing unit (NPU) (see ¶ 97, “The tensor cores 371 may include a plurality of execution units specifically designed to perform matrix operations, which are the fundamental compute operation used to perform deep learning operations.”, also see ¶ 98, “In deep learning implementations, parallel matrix multiplication work may be scheduled for execution on the tensor cores 371.”), comprising:
a data arbiter configured to receive tensor data in a first data format and tensor metadata comprising a tensor data descriptor corresponding to the first data format (see ¶ 58, “When the host interface 206 receives a command buffer via the I/O unit 204”, also see ¶ 63, “the processing cluster array 212 can receive processing tasks to be executed via the scheduler 210, which receives commands defining processing tasks from front end 208. For graphics processing operations, processing tasks can include indices of data to be processed, e.g., surface (patch) data, primitive data, vertex data, and/or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). The scheduler 210 may be configured to fetch the indices corresponding to the tasks or may receive the indices from the front end 208.”, also see ¶ 99, “Matrix elements may be stored at different precisions depending on the particular implementation, including 16-bit words, 8-bit bytes (e.g., INT8) and 4-bit half-bytes (e.g., INT4). Different precision modes may be specified for the tensor cores 371 to ensure that the most efficient precision is used for different workloads (e.g., such as inferencing workloads which can tolerate quantization to bytes and half-bytes)… Supported formats additionally include 64-bit floating point (FP64) and non-IEEE floating point formats such as the bfloat16 format…”, also see ¶ 100, “Compressed, encoded, and/or compressed and encoded matrix data, along with associated compression and/or encoding metadata, can be read by the tensor cores 371 and the non-zero values can be extracted.”);
an input data handler (see ¶ 79, “The graphics multiprocessor 234 has an execution pipeline including but not limited to an instruction cache 252, an instruction unit 254, an address mapping unit 256, a register file 258, one or more general purpose graphics processing unit (GPGPU) cores 262, and one or more load/store units 266. The GPGPU cores 262 and load/store units 266 are coupled with cache memory 272 and shared memory 270 via a memory and cache interconnect 268. The graphics multiprocessor 234 may additionally include tensor and/or ray-tracing cores 263 that include hardware logic to accelerate matrix and/or ray-tracing operations.”, also see ¶ 84, “the memory and cache interconnect 268 is a crossbar interconnect that allows the load/store unit 266 to implement load and store operations between the shared memory 270 and the register file 258.”, also see ¶ 98, “Before the matrix multiply begins, one entire matrix is loaded into tile registers and at least one column of a second matrix is loaded each cycle for N cycles. Each cycle, there are N dot products that are processed.”);
a data router (see ¶ 66, “The memory crossbar 216 can be configured to transfer the output of each cluster 214A-214N to any partition unit 220A-220N or to another cluster 214A-214N, which can perform additional processing operations on the output.”, also see ¶ 72, “The graphics multiprocessor 234 can process data and a data crossbar 240 can be used to distribute the processed data to one of multiple possible destinations, including other shader units.”, also see ¶ 87, “The interconnect fabric 327 may include one or more crossbar switches to enable communication between the various components of the graphics multiprocessor 325…The interconnect fabric 327 can arbitrate communication within the graphics multiprocessor 325 to ensure a fair bandwidth allocation between components.”, i.e. crossbar 2015, data crossbar 240 and interconnect fabric 327 corresponds to data router);
a systolic array comprising a plurality of clusters, each including a cluster memory and cluster processing logic (see ¶ 59, “The processing cluster array 212 can include up to “N” processing clusters (e.g., cluster 214A, cluster 214B, through cluster 214N).”, also see ¶ 75, “The graphics multiprocessor 234 may include an internal cache memory to perform load and store operations. Optionally, the graphics multiprocessor 234 can forego an internal cache and use a cache memory (e.g., level 1 (L1) cache 248) within the processing cluster 214.”, also see ¶ 93, “One or more combined level 1 (L1) caches and shared memory units 373 store graphics data such as texture data, vertex data, pixel data, ray data, bounding volume data, etc., locally within each multi-core group 365A.”, also see ¶ 98, “Before the matrix multiply begins, one entire matrix is loaded into tile registers and at least one column of a second matrix is loaded each cycle for N cycles. Each cycle, there are N dot products that are processed.”); 
and wherein the data arbiter is configured to send the tensor data and a command corresponding to the tensor data descriptor to the input data handler (see ¶ 58, “When the host interface 206 receives a command buffer via the I/O unit 204, the host interface 206 can direct work operations to perform those commands to a front end 208.”,also see ¶ 90, “the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.”);
in response to the command, see ¶ 76, “The MMU 245 includes a set of page table entries (PTEs) used to map a virtual address to a physical address of a tile and optionally a cache line index...The physical address is processed to distribute surface data access locality to allow efficient request interleaving among partition units.”, also see ¶ 80, “An instruction can access any of a local, shared, or global address space by specifying an address within a unified address space. The address mapping unit 256 can be used to translate addresses in the unified address space into a distinct memory address that can be accessed by the load/store units 266.”, also see ¶ 98, “The training of neural networks, in particular, requires a significant number of matrix dot product operations. In order to process an inner-product formulation of an N×N×N matrix multiply, the tensor cores 371 may include at least N dot-product processing elements. Before the matrix multiply begins, one entire matrix is loaded into tile registers and at least one column of a second matrix is loaded each cycle for N cycles. Each cycle, there are N dot products that are processed.”, also see ¶ 100, “Compressed, encoded, and/or compressed and encoded matrix data, along with associated compression and/or encoding metadata, can be read by the tensor cores 371 and the non-zero values can be extracted.”, I.e. metadata are implicitly generated, e.g., information corresponding to each cycle, such as column reference(s) for each cycle, address translations); 
and the data router is configured to, see ¶ 66, “The memory crossbar 216 can be configured to transfer the output of each cluster 214A-214N to any partition unit 220A-220N or to another cluster 214A-214N, which can perform additional processing operations on the output.”, also see ¶ 77, “Each graphics multiprocessor 234 outputs processed tasks to the data crossbar 240 to provide the processed task to another processing cluster 214 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via the memory crossbar 216.”, also see ¶ 87, “The interconnect fabric 327 can arbitrate communication within the graphics multiprocessor 325 to ensure a fair bandwidth allocation between components.”, also see ¶ 92, “integer registers for storing integer values, floating point registers for storing floating point values, vector registers for storing packed data elements (integer and/or floating-point data elements) and tile registers for storing tensor/matrix values. The tile registers may be implemented as combined sets of vector registers.”, also see ¶ 98, “Before the matrix multiply begins, one entire matrix is loaded into tile registers and at least one column of a second matrix is loaded each cycle for N cycles. Each cycle, there are N dot products that are processed.”, also see ¶ 99, “Matrix elements may be stored at different precisions depending on the particular implementation).
Pappu do not specifically teach the input data handler is configured to generate first metadata corresponding to the first data format; and according to the first metadata, route the tensor data into the plurality of cluster memories.
Wang teaches the input data handler is configured to generate first metadata corresponding to the first data format (see ¶ 80, “systolic array output is written back to memory with data masks to merge with previous N×N data blocks. For example, a data mask module may receive a pattern generation signal from a controller and automatically generate mask bits for 2n-1 consecutive writes. The data mask generator may generate a data mask pattern according to the input from the controller to support different data formats, e.g., (int8, fp16, bf16, fp32, etc).”); and according to the first metadata, route the tensor data into the plurality of cluster memories (see ¶ 84, “the controller 706 may send addresses and read commands to memory sub-system 710 every cycle to produce the systolic array input columns. During output cycles, controller 706 may issue write commands and addresses to memory sub-system 710 to write systolic array output columns to specified memory address locations. Additionally, controller 706 may send compute control information to compute engine 702 to cause compute engine 702 to start to compute, accumulate, and output matrices in specific cycles.”, i.e. controller memory addressing, also see ¶ 94, “A multiplexer (MUX) 906 may receive segments of a matrix block of data from a systolic array (not shown) and mask data from data mask generator 904. For a byte where the masked bit is 1, the corresponding data of the segment is read through to memory 902. If the masked bit of the corresponding byte is 0, the corresponding data of the segment is ignored and will not be written to the memory.”, I.e. mask bit (metadata) determines whether data is writing and memory placement controlled by metadata, also see ¶ 98, “Memory 902 may subsequently receive a second output 914 and may align data segments based on complementary masks. For example, the segment at addr2 having mask “0 0 1” may be matched with the complementary segment of the second output, which is associated with a mask of “1 1 0.” The complementary segments are then merged such that each address stores useful information, e.g., no placeholder values.”). 
Both Pappu and Wang pertain to the problem of memory data placement and routing, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Pappu and Wang to teach the above limitations. The motivation for doing so would be “System 700 eliminates the overheard associated with traditional methods for storing staged input and output data and enables subsequent computations to be completed using output data, thereby increasing the utilization rate over the previously described systems.” (see Wang ¶ 81) and “data reuse or share schemes for overlapped data among receptive fields b1 to b3 can be beneficial for improving overall system throughput by reducing data stored in the buffer or by minimizing data transfer bandwidth usage. Embodiments of the present disclosure can provide an accelerator enabling efficient processing of CNN operations.” (see Wang ¶ 32).

Regarding claim 2. 
Pappu and Wang teaches the NPU of claim 1, 
Pappu further teaches wherein the cluster processing logic of each of the plurality of clusters is configured to perform a first operation on the tensor data stored in the respective cluster memory to generate a first cluster result for each cluster, and wherein the first cluster results for all clusters collectively comprise first output data (see ¶ 59, “The processing cluster array 212 can include up to “N” processing clusters (e.g., cluster 214A, cluster 214B, through cluster 214N).”, also see ¶ 98, “The training of neural networks, in particular, requires a significant number of matrix dot product operations. In order to process an inner-product formulation of an N×N×N matrix multiply, the tensor cores 371 may include at least N dot-product processing elements. Before the matrix multiply begins, one entire matrix is loaded into tile registers and at least one column of a second matrix is loaded each cycle for N cycles. Each cycle, there are N dot products that are processed.”).

Regarding claim 3. 
Pappu and Wang teaches the NPU of claim 2, 
Pappu further teaches wherein the NPU further comprises an output data handler coupled between the systolic array and the input data handler, the output data handler configured to receive the first output data from the systolic array and to send the first output data to the input data handler (see ¶ 77, “a processing cluster 214 may be configured such that each graphics multiprocessor 234 is coupled to a texture unit 236 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read from an internal texture L1 cache (not shown) or in some embodiments from the L1 cache within graphics multiprocessor 234 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed. Each graphics multiprocessor 234 outputs processed tasks to the data crossbar 240 to provide the processed task to another processing cluster 214 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via the memory crossbar 216.”).
Also Wang teaches (see ¶ 84, “During output cycles, controller 706 may issue write commands and addresses to memory sub-system 710 to write systolic array output columns to specified memory address locations. Additionally, controller 706 may send compute control information to compute engine 702 to cause compute engine 702 to start to compute, accumulate, and output matrices in specific cycles.”, also see ¶ 96, “Components of accelerator 900 may be integrated into NPU architecture 200, for example, as part of a core 202, shown in FIG. 2B. For example, memory 902 may correspond to local memory 2032, while data mask generator 904 and mux 906 may be additional components of core 202 or of first operational unit 2020.”). 
The motivation utilized in the combination of claim 1 super, applies equally as well to claim 3.

Regarding claim 4. 
Pappu and Wang teaches the NPU of claim 3, 
Wang further teaches wherein the output data handler is configured to format the output data in a second data format (see ¶ 64, “multiplier 240 or accumulator 250 may be configured to perform its operation on different data type from what the element-wise operation processor 260 performs its operations on. For example, multiplier 240 or accumulator 250 can be configured to perform its operations on integer type data such as Int 8, Int 16, and the like and element-wise operation processor 260 can perform its operations on floating point type data”, also see ¶ 80, “a data mask module may receive a pattern generation signal from a controller and automatically generate mask bits for 2n-1 consecutive writes. The data mask generator may generate a data mask pattern according to the input from the controller to support different data formats, e.g., (int8, fp16, bf16, fp32, etc).”).
The motivation utilized in the combination of claim 1 super, applies equally as well to claim 4.

Regarding claim 5. 
Pappu and Wang teaches the NPU of claim 4, 
Wang further teaches wherein the input data handler is further configured to generate second metadata corresponding to the second data format, and to send the first output data and the second metadata to the data router, the data router configured to route the first output data to the plurality of cluster memories of the systolic array according to the second metadata (see ¶ 94, “multiplexer (MUX) 906 may receive segments of a matrix block of data from a systolic array (not shown) and mask data from data mask generator 904. For a byte where the masked bit is 1, the corresponding data of the segment is read through to memory 902. If the masked bit of the corresponding byte is 0, the corresponding data of the segment is ignored and will not be written to the memory.”, also see ¶ 105, “At step 1130, the system may generate a pattern generation signal. The pattern generation signal may be generated by a controller (e.g., controller 706, shown in FIG. 7), or by other circuitry configured to generate a signal during each clock cycle.”, also see ¶ 106, “At step 1140, the system may generate a mask based on the pattern generation signal. The mask may be generated by circuitry, for example, by a shift register, or by a data mask generator (e.g., data mask generator 708, shown in FIG. 7). The pattern may be based on the width of the systolic array output and may be incremented based on the pattern generation signal.”).
The motivation utilized in the combination of claim 1 super, applies equally as well to claim 5.

Regarding claim 6. 
Pappu and Wang teaches the NPU of claim 5, 
Wang further teaches wherein the cluster processing logic of each of the plurality of clusters is configured to perform a second operation on the first output data stored in the respective cluster memory to generate a second cluster result for each cluster, and wherein the second cluster results for all clusters collectively comprise second output data (see ¶ 107, “the head and tail may be merged such to generate output data that does not include any placeholder values. The merged data may be stored in a memory (e.g., memory sub-system 710, shown in FIG. 7) of the system or may be operated on by the systolic array to perform subsequent operations.”).
The motivation utilized in the combination of claim 1 super, applies equally as well to claim 6.

Regarding claim 7. 
Pappu and Wang teaches the NPU of claim 6, 
Wang further teaches wherein the output data handler is configured to receive the second output data from the systolic array and to send the second output data to the input data handler (see ¶ 107, “the head and tail may be merged such to generate output data that does not include any placeholder values. The merged data may be stored in a memory (e.g., memory sub-system 710, shown in FIG. 7) of the system or may be operated on by the systolic array to perform subsequent operations.”, also see ¶ 84, “During output cycles, controller 706 may issue write commands and addresses to memory sub-system 710 to write systolic array output columns to specified memory address locations. Additionally, controller 706 may send compute control information to compute engine 702 to cause compute engine 702 to start to compute, accumulate, and output matrices in specific cycles.”).
The motivation utilized in the combination of claim 1 super, applies equally as well to claim 7.


Regarding claim 8. 
Pappu and Wang teaches the NPU of claim 7, 
Wang further teaches wherein the input data handler is further configured to format the second output data according to a third data format and to send the formatted second output data to the data arbiter, the data arbiter configured to export the formatted second output data (see ¶ 80, “a data mask module may receive a pattern generation signal from a controller and automatically generate mask bits for 2n-1 consecutive writes. The data mask generator may generate a data mask pattern according to the input from the controller to support different data formats, e.g., (int8, fp16, bf16, fp32, etc).”, also see ¶ 64, quantizer and de-quantization).
The motivation utilized in the combination of claim 1 super, applies equally as well to claim 8.

Regarding claim 9. 
Pappu and Wang teaches the NPU of claim 6, 
Wang further teaches wherein the first and the second operations comprise convolution operations (see ¶ 52, “sequencer 2026 can distribute convolution commands”, also see ¶ 68, “FIG. 3 illustrates an exemplary systolic array-based accelerator. As described above, such a configuration may be used to facilitate operations of convolutional neural networks.”).
The motivation utilized in the combination of claim 1 super, applies equally as well to claim 9.

Regarding claim 10. 
Pappu and Wang teaches the NPU of claim 1, 
Wang further teaches wherein tensor data comprises 3-dimensional tensor data, wherein the 3-dimensional tensor data comprises a plurality of 2-dimensional channels, wherein each of the plurality of 2-dimensional channels comprises a plurality of data elements (see ¶ 68, “Activation memory 304 (e.g., first buffer 2034 shown in FIG. 2C) may store an activation matrix, which may be fed into each row of the systolic array at a rate of one row per clock cycle. ”, also see ¶ 57, “input data stored in first buffer 2034 can be activation data for a convolution operation.”, i.e. CNN activation are 3D (H x W x C)).
The motivation utilized in the combination of claim 1 super, applies equally as well to claim 10.

Regarding claim 11. 
Pappu and Wang teaches the NPU of claim 10, 
Wang further teaches wherein the data router is configured to route the tensor data according to the first metadata by designating particular ones of the plurality of cluster memories to receive the data elements corresponding to respective ones of the plurality of 2-dimentional channels, and routing the tensor data thereto (see ¶ 84, “The controller 706 may send addresses and read commands to memory sub-system 710 every cycle to produce the systolic array input columns.”, also see ¶ 94, “multiplexer (MUX) 906 may receive segments of a matrix block of data from a systolic array (not shown) and mask data from data mask generator 904. For a byte where the masked bit is 1, the corresponding data of the segment is read through to memory 902. If the masked bit of the corresponding byte is 0, the corresponding data of the segment is ignored and will not be written to the memory.”).
The motivation utilized in the combination of claim 1 super, applies equally as well to claim 11.

Claims 12-20 recites a method to perform the neural processing unit (NPU) recited in claims 1-11. Therefore the rejection of claims 1-11 above applies equally here. 

Related prior arts:
Parra et al. (US 12189571 B2) teaches a systolic array having multiple pipelines, each of the multiple pipelines including multiple pipeline stages, wherein the multiple pipelines include a first pipeline, a second pipeline, and a common input shared between the first pipeline and the second pipeline.
Huynh et al. (US 12026607 B1) teaches load a first weight data element of an array of weight data elements from a memory into a systolic array; extract, from the instructions, information indicating a first number of input data elements to be obtained from a first address of the memory and a second number of input data elements to be skipped between adjacent input data elements to be obtained, the first address being based on first coordinates of the first weight data element, and the first and second numbers being based on a stride of a convolution operation; based on the information, obtain first input data elements from the first address of the memory; and control the systolic array to perform first computations based on the first weight data element and the first input data elements to generate first output data elements of an output data array.
Surti et al. (US 20220129521 A1) teaches disaggregation of special function compute arrays via a shared reg file. One embodiment enables packed data compress and expand operations on a GPGPU. One embodiment provides techniques to exploit block sparsity within the cache hierarchy of a GPGPU.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IMAD M KASSIM whose telephone number is (571)272-2958. The examiner can normally be reached 10:30AM-5:30PM, M-F (E.S.T.).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J. Huntley can be reached at (303) 297 - 4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/IMAD KASSIM/Primary Examiner, Art Unit 2129
Read full office action
Prosecution Timeline

May 16, 2023
Application Filed
Mar 09, 2026
Non-Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

19/056,627
Patent 12619876
FLEXIBILE ENTITY RESOLUTION NETWORKS
1y 2m to grant Granted May 05, 2026
17/745,103
Patent 12614096
ANOMALY SCORE NORMALISATION BASED ON EXTREME VALUE THEORY
3y 11m to grant Granted Apr 28, 2026
16/960,448
Patent 12608617
MODEL TRAINING APPARATUS, MODEL TRAINING METHOD, AND PROGRAM
5y 9m to grant Granted Apr 21, 2026
16/152,868
Patent 12596923
MACHINE LEARNING OF KEYWORDS
7y 6m to grant Granted Apr 07, 2026
17/250,758
Patent 12572843
AGENT SYSTEM FOR CONTENT RECOMMENDATIONS
5y 0m to grant Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

1-2
Expected OA Rounds
72%
Grant Probability
99%
With Interview (+33.0%)
3y 8m (~7m remaining)
Median Time to Grant
Low
PTA Risk
Based on 163 resolved cases by this examiner. Grant probability derived from career allowance rate.