Last updated: May 29, 2026
Application No. 18/901,027
INSTRUCTIONS AND LOGIC TO PERFORM FLOATING POINT AND INTEGER OPERATIONS FOR MACHINE LEARNING

Non-Final OA: §102, §103, §DP
Filed: Sep 30, 2024
Priority: Apr 28, 2017 (provisional 62/491,699, +4 more)
Examiner: MUSHAMBO, MARTIN
Art Unit: 2615
Tech Center: 2600 (Communications)
Assignee: Intel Corporation
OA Round: 1 (Non-Final)
Interview Optional

Interview lift (+14.0%) is below the 15.0% threshold. A written response is recommended.
Based on 823 resolved cases, 2023–2026
Examiner Intelligence

MUSHAMBO, MARTIN
Career Allowance Rate: 85% (above average)
697 granted / 823 resolved
+22.7% vs TC avg
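The headline allowance rate can be checked from the two counts shown. This is a minimal sketch; the function name `allowance_rate` is illustrative and not part of the source page.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Return the fraction of resolved cases that were granted."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return granted / resolved


# Counts taken from the page: 697 granted out of 823 resolved.
rate = allowance_rate(697, 823)
print(f"{rate:.1%}")  # about 84.7%, which the page rounds to 85%
```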
Interview Lift: +14.0% (moderate), resolved cases with interview vs. without
Typical timeline: 2y 5m average prosecution; 17 currently pending
Career history: 833 total applications across all art units
Statute-Specific Performance

§101: 7.4% (-32.6% vs TC avg)
§103: 68.0% (+28.0% vs TC avg)
§102: 11.8% (-28.2% vs TC avg)
§112: 5.6% (-34.4% vs TC avg)
Based on career data from 823 resolved cases
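As a consistency check on the statute figures, the sketch below assumes each "vs TC avg" value is a percentage-point difference from a Tech Center baseline (the page does not say so explicitly). Under that assumption, all four rows imply the same baseline. The names `rows` and `implied_baseline` are illustrative.

```python
# (examiner rate %, stated difference vs TC avg) as shown on the page.
rows = {
    "§101": (7.4, -32.6),
    "§103": (68.0, 28.0),
    "§102": (11.8, -28.2),
    "§112": (5.6, -34.4),
}


def implied_baseline(rate: float, delta: float) -> float:
    """Baseline implied if delta is a percentage-point difference (an assumption)."""
    return round(rate - delta, 1)


baselines = {statute: implied_baseline(*pair) for statute, pair in rows.items()}
for statute, base in baselines.items():
    print(statute, base)
```

If the differences were relative rather than absolute, the implied baselines would not coincide, so the agreement supports the percentage-point reading without proving it.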
Office Action

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on 02/25/2026, 01/05/2026, 10/16/2025, 08/22/2025, 03/17/2025, 01/07/2025, and 09/30/2024 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Double Patenting

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement.
See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).

The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to final Office action, see 37 CFR 1.113(c). A request for reconsideration while not provided for in 37 CFR 1.113(c) may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.

The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.

Claims 1-7 and 10-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-11, 13-18, and 22 of U.S. Patent No. 12141578 B2. Although the claims at issue are not identical, they are not patentably distinct from each other because claims 1-7 and 10-20 are obvious variants of claims 1-11, 13-18, and 22 of U.S. Patent No. 12141578 B2, as shown below.
Claim comparison: current application vs. US 12141578 B2

Current application, claim 1: An apparatus comprising: an interconnect fabric; a memory interface coupled to the interconnect fabric; an input/output, IO, unit coupled to the interconnect fabric; an array of multiprocessors coupled to the interconnect fabric, a multiprocessor in the array of multiprocessors to execute a mixed-precision instruction in parallel across multiple threads, the multiprocessor in the array of multiprocessors comprising: a plurality of registers to store packed floating-point operand values; and execution circuitry to execute one or more of the mixed-precision instructions to perform a fused multiply-accumulate operation, the execution circuitry comprising: a 16-bit multiplier to multiply a first 16-bit floating point source value and a second 16-bit floating point source value to generate an intermediate result; and a 32-bit accumulator to add the intermediate result to an accumulated floating point value to generate a new accumulation result.
US 12141578 B2, claim 1: An apparatus comprising: an interconnect fabric; a memory interface coupled to the interconnect fabric; an input/output, IO, unit coupled to the interconnect fabric; an array of multiprocessors coupled to the interconnect fabric, a multiprocessor in the array of multiprocessors to execute a mixed-precision instruction in parallel across multiple threads; and virtualization circuitry to share the array of multiprocessors with a plurality of virtual machines, each virtual machine of the plurality of virtual machines having a dedicated slice of resources provided by the array of multiprocessors, the dedicated slice of resources including the multiprocessor in the array of multiprocessors, the multiprocessor comprising: a plurality of registers to store packed floating-point operand values; and execution circuitry to execute one or more of the mixed-precision instructions to perform a fused multiply-accumulate operation, the execution circuitry comprising: a 16-bit multiplier to multiply a first 16-bit floating point source value and a second 16-bit floating point source value to generate an intermediate result; and a 32-bit accumulator to add the intermediate result to an accumulated floating point value to generate a new accumulation result.

Current application, claim 2: The apparatus of claim 1, further comprising: a parallel processor die comprising the interconnect fabric, memory interface, the input/output (IO) unit, and the array of multiprocessors, the parallel processor die further comprising: a local memory interconnect to couple the memory interface to stacked memory dies, the local memory interconnect comprising independent groups of memory interfaces, the independent groups of memory interfaces associated with respective memory dies of the stacked memory dies.
US 12141578 B2, claim 2: The apparatus of claim 1 further comprising: a parallel processor die comprising the interconnect fabric, memory interface, the input/output (IO) unit, and the array of multiprocessors, the parallel processor die further comprising: a local memory interconnect to couple the memory interface to stacked memory dies, the local memory interconnect comprising independent groups of memory interfaces, the independent groups of memory interfaces associated with respective memory dies of the stacked memory dies.

Current application, claim 3: (Currently Amended) The apparatus of claim 2, wherein the mixed-precision instructions are primitives of a machine learning framework.
US 12141578 B2, claim 3: The apparatus of claim 2 wherein the mixed-precision instructions are primitives of a machine learning framework.

Current application, claim 4: (Currently Amended) The apparatus of claim 3, wherein the first and second 16-bit floating point source values comprise data elements a first matrix and a second matrix and wherein each of the plurality of multiplications comprises a multiplication of a data element from the first matrix and a data element from the second matrix.
US 12141578 B2, claim 4: The apparatus of claim 3 wherein the first and second 16-bit floating point source values comprise data elements a first matrix and a second matrix and wherein each of the plurality of multiplications comprises a multiplication of a data element from the first matrix and a data element from the second matrix.

Current application, claim 5: (Currently Amended) The apparatus of claim 4, wherein the first and second matrices are to be associated with a convolutional layer of the machine learning framework.
US 12141578 B2, claim 5: The apparatus of claim 4 wherein the first and second matrices are to be associated with a convolutional layer of the machine-learning framework.

Current application, claim 6: (Currently Amended) The apparatus of claim 5, wherein the machine learning framework comprises a neural network.
US 12141578 B2, claim 6: The apparatus of claim 3 wherein the machine learning framework comprises a neural network.

Current application, claim 7: (Currently Amended) The apparatus of claim 6, wherein the machine learning framework comprises a recurrent neural network, RNN.
US 12141578 B2, claim 7: The apparatus of claim 3 wherein the machine learning framework comprises a recurrent neural network, RNN.

Current application, claim 10: (Original) The apparatus of claim 2, wherein the memory interface is to couple the interconnect fabric to a memory device, the memory interface to use virtual channels to separate traffic streams to access the memory device.
US 12141578 B2, claim 8: The apparatus of claim 2, wherein the memory interface is to couple the interconnect fabric to a memory device, the memory interface to use virtual channels to separate traffic streams to access the memory device.

Current application, claim 11: (Currently Amended) The apparatus of claim 1, further comprising: a cache hierarchy to store data for the array of multiprocessors, the cache hierarchy including an L1 cache and an L2 cache to be shared between a plurality of multiprocessors within the array of multiprocessors.
US 12141578 B2, claim 13: The apparatus of claim 1 further comprising: a cache hierarchy to store data for the array of multiprocessors, the cache hierarchy including an L1 cache and an L2 cache to be shared between the plurality of multiprocessors.

Current application, claim 12: (Original) The apparatus of claim 2, further comprising: a memory management unit (MMU) coupled to the interconnect fabric, the MMU comprising a translation lookaside buffer (TLB) to store virtual-to-physical address translations to access the stacked memory dies.
US 12141578 B2, claim 9: The apparatus of claim 2, further comprising: a memory management unit (MMU) coupled to the interconnect fabric, the MMU comprising a translation lookaside buffer (TLB) to store virtual-to-physical address translations to access the stacked memory dies.

Current application, claim 13: (Currently Amended) The apparatus of claim 12, wherein the MMU is to use a shared virtual memory address space to access the stacked memory dies and one or more system memory devices.
US 12141578 B2, claim 10: The apparatus of claim 9 wherein the MMU is to use a shared virtual memory address space to access the stacked memory dies and one or more system memory devices.

Current application, claim 14: (Currently Amended) The apparatus of claim 2, wherein the stacked memory dies comprise a High Bandwidth Memory, HBM, device.
US 12141578 B2, claim 11: The apparatus of claim 2 wherein the stacked memory dies comprise a High Bandwidth Memory, HBM, device.

Current application, claim 15: (Original) An apparatus comprising: an interconnect fabric; a memory interface coupled to the interconnect fabric; an input/output, IO, unit coupled to the interconnect fabric; an array of multiprocessors coupled to the interconnect fabric, a multiprocessor in the array of multiprocessors comprising: a plurality of registers to store packed floating-point and packed integer operand values including 32-bit floating-point values, 16-bit floating-point values, and 8-bit integer values; and a decoder to decode a plurality of mixed-precision fused multiply-accumulate (FMA) instructions including a first FMA instruction indicating N 16-bit floating-point source operands and a 32-bit floating-point source operand, and a second FMA instruction indicating 2N 8-bit integer source operands and a 32-bit integer source operand, and parallel multiplication circuitry to: perform N/2 parallel 16-bit floating-point multiplications responsive to the first FMA instruction to produce N/2 floating-point products; perform N parallel 8-bit integer multiplications responsive to the second FMA instruction to produce N integer products; and accumulation circuitry to: add the N/2 floating point products to the 32-bit floating-point source operand responsive to the first FMA instruction to generate an accumulated 32-bit floating-point result; and add the N integer products to the 32-bit integer source operand responsive to the second FMA instruction to generate an accumulated 32-bit integer result.
US 12141578 B2, claim 14: An apparatus comprising: an interconnect fabric; a memory interface coupled to the interconnect fabric; an input/output, IO, unit coupled to the interconnect fabric; an array of multiprocessors coupled to the interconnect fabric; and virtualization circuitry to share the array of multiprocessors with a plurality of virtual machines, each virtual machine of the plurality of virtual machines having a dedicated slice of resources provided by the array of multiprocessors, the dedicated slice of resources including a multiprocessor in the array of multiprocessors, the multiprocessor comprising: a plurality of registers to store packed floating-point and packed integer operand values including 32-bit floating-point values, 16-bit floating-point values, and 8-bit integer values; and a decoder to decode a plurality of mixed-precision fused multiply-accumulate (FMA) instructions including a first FMA instruction indicating N 16-bit floating-point source operands and a 32-bit floating-point source operand, and a second FMA instruction indicating 2N 8-bit integer source operands and a 32-bit integer source operand, and parallel multiplication circuitry to: perform N/2 parallel 16-bit floating-point multiplications responsive to the first FMA instruction to produce N/2 floating-point products; perform N parallel 8-bit integer multiplications responsive to the second FMA instruction to produce N integer products; and accumulation circuitry to: add the N/2 floating point products to the 32-bit floating-point source operand responsive to the first FMA instruction to generate an accumulated 32-bit floating-point result; and add the N integer products to the 32-bit integer source operand responsive to the second FMA instruction to generate an accumulated 32-bit integer result.

Current application, claim 16: (Currently Amended) The apparatus of claim 15, wherein the N/2 parallel 16-bit floating-point or N parallel 8-bit integer multiplications are performed in a single clock cycle.
US 12141578 B2, claim 15: The apparatus of claim 14 wherein the N/2 parallel 16-bit floating-point or N parallel 8-bit integer multiplications are performed in a single clock cycle.

Current application, claim 17: (Currently Amended) The apparatus of claim 15, wherein the decoder is to further decode a third FMA instruction indicating N/2 32-bit floating point source operands, the parallel multiplication circuitry, responsive to the third FMA instruction, to further perform N/4 single-precision floating point multiplications to generate N/4 product(s).
US 12141578 B2, claim 16: The apparatus of claim 14 wherein the decoder is to further decode a third FMA instruction indicating N/2 32-bit floating point source operands, the parallel multiplication circuitry, responsive to the third FMA instruction, to further perform N/4 single-precision floating point multiplications to generate N/4 product(s).

Current application, claim 18: (Currently Amended) The apparatus of claim 15, wherein the 32-bit floating-point source operand and the 32-bit integer source operand comprise accumulated values from one or more prior instances of the first FMA instruction and the second FMA instruction, respectively.
US 12141578 B2, claim 17: The apparatus of claim 14 wherein the 32-bit floating-point source operand and the 32-bit integer source operand comprise accumulated values from one or more prior instances of the first FMA instruction and the second FMA instruction, respectively.

Current application, claim 19: (Currently Amended) The apparatus of claim 15, further comprising: a parallel processor die comprising the interconnect fabric, memory interface, the input/output (IO) unit, and the array of multiprocessors, the parallel processor die further comprising: a local memory interconnect to couple the memory interface to stacked memory dies, the local memory interconnect comprising independent groups of memory interfaces, the independent groups of memory interfaces associated with respective memory dies of the stacked memory dies.
US 12141578 B2, claim 18:
The apparatus of claim 14 further comprising: a parallel processor die comprising the interconnect fabric, memory interface, the input/output (IO) unit, and the array of multiprocessors, the parallel processor die further comprising: a local memory interconnect to couple the memory interface to stacked memory dies, the local memory interconnect comprising independent groups of memory interfaces, the independent groups of memory interfaces associated with respective memory dies of the stacked memory dies, wherein the stacked memory dies comprise a High Bandwidth Memory (HBM) device.

Current application, claim 20: (Currently Amended) The apparatus of claim 15, wherein the mixed-precision instructions are primitives of a machine learning framework.
US 12141578 B2, claim 22: The apparatus of claim 14 wherein the mixed-precision instructions are primitives of a machine learning framework.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 and 11 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Boswell et al., U.S. Patent Application Publication No. 2018/0321938 (herein Boswell).

Claim 1. Boswell discloses an apparatus comprising:
an interconnect fabric [XBar 270. Boswell at paragraph 36; FIG. 2];
a memory interface coupled to the interconnect fabric [Memory interface 370 is coupled to XBar 270. See Boswell at paragraphs 36 and 48; FIG. 3B];
an input/output, IO, unit coupled to the interconnect fabric [Input/Output (I/O) unit 205 is coupled to XBar 270. Boswell at paragraphs 30 and 31; FIG. 2];
an array of multiprocessors coupled to the interconnect fabric, a multiprocessor in the array of multiprocessors to execute a mixed-precision instruction in parallel across multiple threads [An array of streaming multiprocessors (SMs) that execute instructions across multiple threads simultaneously, wherein a matrix multiply and accumulate (MMA) instruction that has operands from a plurality of precision formats (i.e. a mixed-precision instruction) is executed. Boswell at paragraphs 45, 82-83, 86; FIG. 3A], the multiprocessor in the array of multiprocessors comprising:
a plurality of registers to store packed floating-point operand values [Each SM includes a register file 420 that stores operands including floating point values. Boswell at paragraphs 53 and 83; FIG. 4]; and
execution circuitry to execute one or more of the mixed-precision instructions to perform a fused multiply-accumulate operation [Core 450 of each SM.
Boswell at paragraph 107], the execution circuitry comprising:
a 16-bit multiplier to multiply a first 16-bit floating point source value and a second 16-bit floating point source value to generate an intermediate result [Each core 450 (i.e. execution circuitry) comprises a half-precision matrix multiply accumulate (HMMA) datapath 930 that comprises a half-precision (i.e. 16-bit) multiplier that multiplies the input operand values (i.e. 16-bit values) A and B. Boswell at paragraphs 107 and 117; FIG. 11]; and
a 32-bit accumulator to add the intermediate result to an accumulated floating-point value to generate a new accumulation result [The HMMA datapath accumulates the product of A and B values with full-precision (i.e. 32-bit) value C using adder 1144 (i.e. 32-bit accumulator). Boswell at paragraphs 106, 121-122; FIG. 11].

Claim 11. (Currently Amended) The apparatus of claim 1, further comprising: a cache hierarchy to store data for the array of multiprocessors, the cache hierarchy including an L1 cache and an L2 cache to be shared between a plurality of multiprocessors within the array of multiprocessors [The cache hierarchy includes an L1 cache 490 and L2 cache 360 that is shared among GPCs and SMs. Boswell at paragraphs 48 and 50; FIGS. 3B, 4].

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 2-4 and 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Macri, “AMD’s Next Generation GPU and High Bandwidth Memory Architecture: FURY”.

Regarding claim 2, Boswell teaches the apparatus of claim 1 further comprising: a parallel processor die comprising the interconnect fabric, memory interface, the input/output (IO) unit, and the array of multiprocessors [The PPU/GPU including the XBar, memory interface, I/O unit, and array of SMs is comprised on a single chip (i.e. die). Boswell at paragraphs 29, 154-155]. Boswell does not teach the parallel processor die further comprising: a local memory interconnect to couple the memory interface to stacked memory dies, the local memory interconnect comprising independent groups of memory interfaces, the independent groups of memory interfaces associated with respective memory dies of the plurality of stacked memory dies.
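The local memory interconnect limitation recited above (independent groups of memory interfaces, each group tied to its own stacked die) can be pictured with a small address-routing model. This is only an illustrative sketch: the interleaving scheme, the `route_address` helper, and the default of 4 dies with 8 interfaces each are assumptions made here, not details taken from the application, Boswell, or Macri.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    die: int        # index of the stacked memory die
    interface: int  # index of the interface inside that die's own group


def route_address(addr: int, num_dies: int = 4,
                  interfaces_per_die: int = 8, stride: int = 256) -> Route:
    """Map a physical address to (die, interface) by simple interleaving.

    Each die owns an independent group of `interfaces_per_die` interfaces,
    so a request is only ever served by the group of the die it targets.
    All parameters are hypothetical.
    """
    block = addr // stride
    die = block % num_dies
    interface = (block // num_dies) % interfaces_per_die
    return Route(die, interface)


print(route_address(0))         # Route(die=0, interface=0)
print(route_address(256))       # Route(die=1, interface=0)
print(route_address(4 * 256))   # Route(die=0, interface=1)
```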
However, Macri teaches a GPU comprising: a local memory interconnect to couple the memory interface to stacked memory dies, the local memory interconnect comprising independent groups of memory interfaces, the independent groups of memory interfaces associated with respective memory dies of the plurality of stacked memory dies [The graphics memory comprises a local interconnect (i.e. wide interface) to stacked DRAM/HBM (i.e. memory dies) with groups of memory controllers (MCs) (i.e. memory interfaces) associated with respective stacked DRAM/HBM dies. See Macri at pages 3, 4, 10, and 14]. The die-stacked memory system improves GPU system performance and energy efficiency [Macri at pages 12, 14, and 15]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Boswell with the teachings of Macri, since both are analogous art in the field of graphics processing units. One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to combine the teachings of Boswell with the teachings of Macri in order to improve system performance while being energy efficient.

Claim 3. (Currently Amended) The apparatus of claim 2, wherein the mixed-precision instructions are primitives of a machine learning framework [The MMA operations are applied to machine learning operations. Boswell at paragraph 150]. Same rationale as claim 2.

Claim 4. (Currently Amended) The apparatus of claim 3,
wherein the first and second 16-bit floating point source values comprise data elements a first matrix and a second matrix and wherein each of the plurality of multiplications comprises a multiplication of a data element from the first matrix and a data element from the second matrix [The MMA instruction operands include a first source matrix A and a second source matrix B, wherein the multiplications comprise multiplying packed elements from source matrix A and elements from source matrix B. Boswell at paragraphs 81 and 85]. Same rationale as claim 2.

Claim 12. (Original) The apparatus of claim 2, further comprising: a memory management unit (MMU) coupled to the interconnect fabric [The GPC 250 includes MMU 390 coupled to XBar 270. Boswell at paragraphs 36 and 39; FIGS. 2, 3A], the MMU comprising a translation lookaside buffer (TLB) to store virtual-to-physical address translations to access the stacked memory dies [MMU 390 includes a TLB for translating virtual to physical addresses. Boswell at paragraph 46]. Same rationale as claim 2.

Claim 13. (Currently Amended) The apparatus of claim 12, wherein the MMU is to use a shared virtual memory address space to access the stacked memory dies and one or more system memory devices [MMU provides virtual address translation to access shared memories 204 and 470 (i.e. stacked memory dies; see claim 2 rejection above) and system memory devices. See Boswell at paragraphs 46, 48, and 56]. Same rationale as claim 2.

Claim 14. (Currently Amended) The apparatus of claim 2, wherein the stacked memory dies comprise a High Bandwidth Memory, HBM, device [Macri at page 10 (see claim 2 rejection above)]. Same rationale as claim 2.

Claims 5-7 are rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Macri and further in view of Sze et al., “Efficient Processing of Deep Neural Networks: A Tutorial and Survey” (herein Sze).
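The mixed-precision multiply-accumulate at issue in claims 1 and 4 (products of 16-bit floating-point values summed into a 32-bit accumulator) can be emulated in software. The sketch below is an illustration only, not the claimed circuitry or Boswell's datapath: it uses Python's `struct` half-precision (`e`) and single-precision (`f`) formats to round values, and the helper names are invented here. Because the arithmetic passes through Python's 64-bit floats before the final rounding, it approximates a fused operation rather than reproducing hardware bit-for-bit.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (32-bit) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fma_fp16_fp32(a: float, b: float, acc: float) -> float:
    """Multiply two 16-bit operands and add to a 32-bit accumulator.

    The product of two fp16 values is exact in a 64-bit float, so the only
    rounding of the result happens when it is stored back as fp32.
    """
    product = to_fp16(a) * to_fp16(b)
    return to_fp32(product + to_fp32(acc))


def dot_mixed(row, col, acc: float = 0.0) -> float:
    """Accumulate element-wise products of a matrix row and column."""
    for a, b in zip(row, col):
        acc = fma_fp16_fp32(a, b, acc)
    return acc


print(to_fp16(0.1))                          # 0.0999755859375 (fp16 rounding of 0.1)
print(dot_mixed([1.0, 2.0], [3.0, 4.0]))     # 11.0
```

Keeping the accumulator at 32 bits means the running sum is not re-rounded to 16 bits after every step, which is the practical motivation for pairing a narrow multiplier with a wider accumulator.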
Claim 5. The combination of Boswell and Macri does not disclose the apparatus of claim 4, wherein the first and second matrices are to be associated with a convolutional layer of the machine learning framework. Sze teaches an apparatus executing mixed-precision instructions [Processor executing instructions using variable quantization/precision. Sze at section V, 1st paragraph; section VII(A), 1st – 4th paragraphs] that are primitives of a machine learning framework, wherein first and second matrices are to be associated with a convolutional layer of the machine-learning framework [Sze at section III, 6th – 8th paragraphs; section V(A), 1st paragraph]. Convolutional neural networks share sets of weights [See Sze at section III, 6th paragraph], thereby reducing the memory requirements of the machine learning framework. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Boswell and Macri with the teachings of Sze, since all are analogous art in the field of graphics processing units. One of ordinary skill in the art, before the effective filing date of the invention, would have been motivated to modify the apparatus of the combination of Boswell and Macri so that the first and second matrices are to be associated with a convolutional layer of the machine-learning framework, as taught by Sze, in order to reduce the memory requirements of the machine-learning framework.

Claim 6. (Currently Amended) The apparatus of claim 5, wherein the machine learning framework comprises a neural network. Sze teaches an apparatus executing mixed-precision instructions [Processor executing instructions using variable quantization/precision. Sze at section V, 1st paragraph; section VII(A), 1st – 4th paragraphs] that are primitives of a machine learning framework, wherein the machine learning framework comprises a deep neural network [Sze at section II(B), 1st – 2nd paragraphs]. Same rationale as claim 5.

Claim 7.
(Currently Amended) The apparatus of claim 6, wherein the machine learning framework comprises a recurrent neural network, RNN. In the analogous art of neural network processing, Sze teaches an apparatus executing mixed-precision instructions [Processor executing instructions using variable quantization/precision. Sze at section V, 1st paragraph; section VII(A), 1st – 4th paragraphs] that are primitives of a machine learning framework, wherein the machine learning framework comprises a recurrent neural network, RNN [Sze at section III, 2nd – 3rd paragraphs]. Recurrent neural networks have internal memory that allows long-term dependencies to affect the output [Sze at section III, 3rd paragraph], thereby increasing accuracy of the machine learning model. Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify the apparatus of Boswell so that the machine learning framework comprises a recurrent neural network, RNN, as taught by Sze, in order to allow long-term dependencies to affect the output, thereby increasing accuracy of the machine learning model. Same rationale as claim 5.

Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Herrera, Alex, “NVIDIA GRID: Graphics Accelerated VDI With the Visual Performance of a Workstation” (herein Herrera).

Claim 8. Boswell does not explicitly teach the apparatus of claim 1, further comprising: virtualization circuitry to share the array of multiprocessors with a plurality of virtual machines. Herrera teaches a GPU comprising: virtualization circuitry to share the array of multiprocessors with a plurality of virtual machines [Herrera at page 15, 1st-2nd paragraphs]. Including virtualization circuitry in the GPU enables the serving of multiple concurrent users (CCUs) without the performance and latency penalty of excessive software overhead [Herrera at page 15, 3rd paragraph].
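Sharing an array of multiprocessors among virtual machines, as discussed for claim 8, can be modeled as handing each VM its own non-overlapping subset of the array. The sketch below is a software analogy under assumptions made here (an even split, the `assign_slices` name, and the VM labels); it does not describe Herrera's hardware or the claimed virtualization circuitry.

```python
def assign_slices(num_multiprocessors: int, vm_names: list) -> dict:
    """Give each VM a dedicated, non-overlapping slice of multiprocessor IDs.

    Uses an even split; any remainder multiprocessors are left unassigned.
    """
    if not vm_names:
        raise ValueError("at least one virtual machine is required")
    per_vm = num_multiprocessors // len(vm_names)
    if per_vm == 0:
        raise ValueError("not enough multiprocessors for every VM")
    return {
        vm: list(range(i * per_vm, (i + 1) * per_vm))
        for i, vm in enumerate(vm_names)
    }


slices = assign_slices(8, ["vm0", "vm1"])
print(slices)  # {'vm0': [0, 1, 2, 3], 'vm1': [4, 5, 6, 7]}
```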
It would have been obvious to one ordinary skilled in the art before the filing of the claimed invention to combine the teachings of Boswell with the teachings of Herrera since they are all analogous in graphics processing units GPUs related field. Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to modify Bowell’s apparatus to further comprise: virtualization circuitry to share the array of multiprocessors with a plurality of virtual machines, as taught by Herrera, in order to serve multiple CCUs without the performance and latency penalty of excessive software overhead. Claim 9. (Currently Amended) The apparatus of claim 8, wherein the virtualization circuitry comprises multiple sets of control registers to be associated with multiple corresponding virtual machines, a group of control registers to store one or more address pointers to identify a region of memory associated with a corresponding virtual machine [The MMU maps and translates the virtual address space to give each process (i.e. virtual machine in this context) its own physical address space, thereby inherently storing (i.e. in a register) an address pointer for each VM. Herrera at page 15, 2nd paragraph; Fig. 8] . Same rationale as claim 8 . 07-21-aia AIA Claim (s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Macri and, further, in view Yuan et al., “Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures” (herein Yuan) . Claim 10, the combination of Boswell and Macri discloses the apparatus of claim 2, wherein the memory interface is to couple the interconnect fabric to a memory device [The memory interface 370 couples the XBar 270 (i.e. interconnect fabric) to memory device 204. See Boswell at paragraph 36, 48; FIG. 3B] , the memory interface to use virtual channels to separate traffic streams to access the memory device. 
The combination of Boswell and Macri does not disclose that the memory interface is to use virtual channels to separate traffic streams to access a memory device. Yuan teaches a memory interface that is to use virtual channels to separate traffic streams to access a memory device [Memory controller (i.e. interface) uses virtual channels to separate requests to memory. Yuan at section 3.3; Fig. 5]. Using virtual channels reduces router pipeline stalls, thereby reducing network latency [Yuan at section 3.3, 1st paragraph]. It would have been obvious to one of ordinary skill in the art before the filing of the claimed invention to combine the teachings of the combination of Boswell and Macri with the teachings of Yuan since all are in the analogous field of many-core architectures (e.g. GPU architectures). One of ordinary skill in the art, before the effective filing date of the invention, would have been motivated to modify Boswell’s memory interface to use virtual channels to separate traffic streams to access a memory device, as taught by Yuan, in order to reduce network latency.

Claims 15-18, 20, and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Mansell, U.S. Patent Application Publication No. 2020/0218538 (herein Mansell).

Regarding claim 15, Boswell discloses an apparatus comprising:
an interconnect fabric [XBar 370. Boswell at paragraph 36; FIG. 2];
a memory interface coupled to the interconnect fabric [Partition unit 208 implements a memory interface. Boswell at paragraphs 37 and 47; FIG. 2];
an input/output, IO, unit coupled to the interconnect fabric [Input/Output (I/O) unit 205. Boswell at paragraphs 30 and 31; FIG. 2];
an array of multiprocessors coupled to the interconnect fabric [An array of streaming multiprocessors (SMs). Boswell at paragraphs 43 and 45; FIGS. 3A, 3B], a multiprocessor in the array of multiprocessors comprising:
a plurality of registers to store packed floating-point and packed integer operand values including 32-bit floating-point values, 16-bit floating-point values, and 8-bit integer values [Each SM includes a register file 420 that stores operands including floating point values. Boswell at paragraphs 53 and 83; FIG. 4]; and
a plurality of mixed-precision fused multiply-accumulate (FMA) instructions [Instructions include matrix multiply and accumulate (MMA) instructions (i.e. FMA instructions). Boswell at paragraph 81] including a first FMA instruction indicating N 16-bit floating-point source operands and a 32-bit floating-point source operand [The MMA instructions include an instruction with half precision (i.e. 16-bit floating point) source matrix operands and a full-precision (i.e. 32-bit floating point) collector matrix operand. Boswell at paragraphs 83, 85, and 86], and a second FMA instruction indicating integer source operands [The MMA instructions include another instruction with integer source operands. See Boswell at paragraphs 83 and 86], and
parallel multiplication circuitry to: perform N/2 parallel 16-bit floating-point multiplications responsive to the first FMA instruction to produce N/2 floating-point products [Each core 450 comprises a half-precision matrix multiply accumulate (HMMA) datapath 930 that performs MMA operation using half precision (i.e. 16-bit) floating point multiplications to produce parallel products. See Boswell at paragraphs 107 and 117; FIGS. 11 and 13]; and
accumulation circuitry to: add the N/2 floating point products to the 32-bit floating-point source operand responsive to the first FMA instruction to generate an accumulated 32-bit floating-point result [The HMMA datapath accumulates the products of A and B values with full-precision (i.e. 32-bit) value C to generate an accumulated full-precision (i.e. 32-bit) value. Boswell at paragraphs 106, 121-122; FIGS. 11 and 13].

Boswell does not explicitly teach: a decoder to decode the plurality of mixed-precision fused multiply-accumulate (FMA) instructions including a second FMA instruction indicating 2N 8-bit integer source operands and a 32-bit integer source operand; and parallel multiplication circuitry to: perform N parallel 8-bit integer multiplications responsive to the second FMA instruction to produce N integer products; and accumulation circuitry to: add the N integer products to the 32-bit integer source operand responsive to the second FMA instruction to generate an accumulated 32-bit integer result.

Mansell teaches:
a decoder to decode a plurality of mixed-precision fused multiply-accumulate (FMA) instructions including a second FMA instruction indicating 2N 8-bit integer source operands and a 32-bit integer source operand [Decode circuitry 18 decodes multiply and add, or dot product, (i.e. FMA) operations, including one that includes 8-bit integer source operands and a 32-bit source/accumulator operand. Mansell at paragraphs 51, 61, and 63; see FIG. 11]; and
parallel multiplication circuitry to: perform N parallel 8-bit integer multiplications responsive to the second FMA instruction to produce N integer products [The elements of the source operands (i.e. 8 bit integers) are multiplied in parallel to produce products. Mansell at paragraphs 61 and 66; FIGS. 11, 15A]; and
accumulation circuitry to: add the N integer products to the 32-bit integer source operand responsive to the second FMA instruction to generate an accumulated 32-bit integer result [The products are added together to generate a lane sized (i.e. 32-bit) integer accumulation result. Mansell at paragraphs 61 and 66; FIGS. 11, 15A].

A person of ordinary skill in the art would have recognized that circuitry to decode instructions (i.e. a decoder) is required to execute instructions, and Mansell teaches that this circuitry and instruction provide efficient and compact parallel processing and are particularly beneficial in applications where smaller values need parallel processing [Mansell at paragraphs 32, 34]. It would have been obvious to one of ordinary skill in the art before the filing of the claimed invention to combine the teachings of Boswell with the teachings of Mansell since both are in the analogous field of data processing. One of ordinary skill in the art, before the effective filing date of the invention, would have been motivated to modify Boswell’s apparatus with the teachings of Mansell in order to provide for instruction decoding as well as efficient and compact parallel processing, particularly where smaller values need parallel processing.

Claim 16. (Currently Amended) The apparatus of claim 15, wherein the N/2 parallel 16-bit floating-point or N parallel 8-bit integer multiplications are performed in a single clock cycle [The MMA operation (including those from claim 15, e.g. N/2 parallel 16-bit FP) performs multiple parallel dot products (i.e. including multiplications) in a single cycle. See Boswell at paragraph 148]. Same rationale as claim 15.

Claim 17. (Currently Amended) The apparatus of claim 15, wherein the decoder is to further decode a third FMA instruction indicating N/2 32-bit floating point source operands [The MMA instructions include an instruction with full precision (i.e. 32-bit floating point) source matrix operands. Boswell at paragraphs 81, 83, and 84], the parallel multiplication circuitry, responsive to the third FMA instruction, to further perform N/4 single-precision floating point multiplications to generate N/4 product(s) [The MMA/dot product (i.e. FMA) operation multiplies pairs of the elements from the source operands (i.e. single-precision operands) to generate products. See Boswell at paragraph 93; FIG. 8]. Same rationale as claim 15.
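For readers tracking the operand counts recited in claim 15, the arithmetic can be sketched in software. The Python below is an illustrative model only and is not drawn from Boswell, Mansell, or the application: the function names are hypothetical, the pairing of the first half of the source operands with the second half is an assumption, and so are the rounding of the floating-point sum once to 32 bits and the wrap-around of the integer sum on overflow.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_i32(x: int) -> int:
    """Wrap an integer to signed 32-bit two's complement (assumed overflow rule)."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def fma_f16_f32(src, acc):
    """First FMA: N 16-bit FP sources -> N/2 products, added to a 32-bit FP accumulator."""
    if len(src) % 2:
        raise ValueError("expected an even number of source operands")
    half = len(src) // 2
    a = [to_f16(v) for v in src[:half]]
    b = [to_f16(v) for v in src[half:]]
    # Each fp16 x fp16 product is exact in double precision.
    products = [x * y for x, y in zip(a, b)]
    # Assumption: the sum is rounded once to single precision.
    return to_f32(to_f32(acc) + sum(products))


def fma_i8_i32(src, acc):
    """Second FMA: 2N 8-bit integer sources -> N products, added to a 32-bit integer accumulator."""
    if len(src) % 2:
        raise ValueError("expected an even number of source operands")
    if any(not -128 <= v <= 127 for v in src):
        raise ValueError("source operands must fit in a signed 8-bit integer")
    n = len(src) // 2
    products = [x * y for x, y in zip(src[:n], src[n:])]
    return to_i32(acc + sum(products))


if __name__ == "__main__":
    print(fma_f16_f32([1.5, 2.0, 4.0, 0.5], 10.0))      # N = 4, two products
    print(fma_i8_i32([1, 2, 3, 4, 5, 6, 7, 8], 100))    # 2N = 8, four products
```

With N = 4, the first function forms two products from four half-precision sources, and the second forms four products from eight 8-bit sources, which matches the N/2 and N product counts recited in the claim.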
Claim 18. (Currently Amended) The apparatus of claim 15, wherein the 32-bit floating-point source operand and the 32-bit integer source operand comprise accumulated values from one or more prior instances of the first FMA instruction and the second FMA instruction, respectively [The collector matrix/operand (i.e. 32-bit FP source operand and 32-bit integer source operand, as modified in claim 15) accumulates values from prior operations. See Boswell at paragraph 88; FIG. 7]. Same rationale as claim 15.

Claim 20. (Currently Amended) The apparatus of claim 15, wherein the mixed-precision instructions are primitives of a machine learning framework [The MMA operations are applied to machine learning operations. Boswell at paragraph 150]. Same rationale as claim 15.

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Boswell in view of Mansell and, further, in view of Macri.

Regarding claim 19, Boswell, as modified, teaches the apparatus of claim 15 further comprising: a parallel processor die comprising the interconnect fabric, memory interface, the input/output (I/O) unit, and the array of multiprocessors [The PPU/GPU including the XBar, memory interface, I/O unit, and array of SMs is comprised on a single chip (i.e. die). Boswell at paragraphs 29, 154-155]. The combination of Boswell and Mansell does not teach the parallel processor die further comprising: a local memory interconnect to couple the memory interface to stacked memory dies, the local memory interconnect comprising independent groups of memory interfaces, the independent groups of memory interfaces associated with respective memory dies of the plurality of stacked memory dies.
Macri teaches a GPU comprising: a local memory interconnect to couple the memory interface to stacked memory dies, the local memory interconnect comprising independent groups of memory interfaces, the independent groups of memory interfaces associated with respective memory dies of the plurality of stacked memory dies [The graphics memory comprises a local interconnect (i.e. wide interface) to stacked DRAM/HBM (i.e. memory dies) with groups of memory controllers MCs (i.e. memory interfaces) associated with respective stacked DRAM/HBM dies. See Macri at pages 3, 4, 10, and 14]. The die-stacked memory system improves GPU system performance and energy efficiency [Macri at pages 12, 14, and 15]. It would have been obvious to one of ordinary skill in the art before the filing of the claimed invention to combine the teachings of Boswell and Mansell with the teachings of Macri since all are in the analogous field of graphics processing units. It would have been obvious to one of ordinary skill in the art before the filing of the claimed invention to modify the teachings of Boswell and Mansell, as taught by Macri, in order to improve system performance while being energy efficient.

Conclusion

The prior art made of record and not relied upon, considered pertinent to applicant's disclosure, is as follows:

US 10324689 B2: Systems and methods for matrix-solve applications include a memory-optimized hardware acceleration (HWA) solution with scalable architecture (i.e. specialized circuitry) for HWA matrix-solve operations. The matrix-solve solutions described herein may include a scalable hardware architecture with parallel processing (e.g., “within column” processing), which provides the ability to compute several output values in parallel. The HWA matrix-solve solutions described herein may include simultaneous multi-column processing, which provides a lower execution cycle count and a reduced total number of memory accesses. This HWA matrix-solve provides low-latency and energy-efficient matrix-solve solutions, which may be used to reduce energy consumption and improve performance in various matrix-based applications, such as computer vision, SLAM, AR/VR/mixed-reality, machine learning, data analytics, and other matrix-based applications.

US 20190042195 A1: The hardware accelerated matrix-solve system comprises a fetch-A block to retrieve and provide a portion of a matrix A. A matrix column computation block is provided, which has an X-buffer block to fetch one value of a matrix X, and a column parallel compute block to generate multiple partial dot products based on the portion of the matrix A and on the value of matrix X. A serial compute block is provided to generate a new element of matrix X based on the partial dot products, where the new element of matrix X is provided to the X-buffer block for storage in a memory.

US 9928034 B2: A method, computer readable medium, and system are disclosed for processing a segmented data set. The method includes the steps of receiving a data structure storing a plurality of values segmented into a plurality of sequences; assigning a plurality of processing elements to process the plurality of values; and processing the plurality of values by the plurality of processing elements according to a merge-based algorithm. Each processing element in the plurality of processing elements identifies a portion of values in the plurality of values allocated to the processing element based on the merge-based algorithm. In one embodiment, the processing elements are threads executed in parallel by a parallel processing unit.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN MUSHAMBO whose telephone number is (571)270-3390. The examiner can normally be reached Monday-Friday (8:00AM-5:00PM).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alicia Harrington, can be reached at (571) 272-2330. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MARTIN MUSHAMBO/ Primary Examiner, Art Unit 2615 03/23/2026
Application/Control Number: 18/901,027 Page 2 Art Unit: 2615
Prosecution Timeline

Sep 30, 2024
Application Filed
Mar 27, 2026
Non-Final Rejection mailed — §102, §103, §DP (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/904,454
Patent 12639789
METHOD FOR PROCESSING INSAR IMAGES TO EXTRACT GROUND DEFORMATION SIGNALS
3y 9m to grant Granted May 26, 2026
18/574,701
Patent 12641213
HEAD MOTION DEPENDENT VIEWPORT REGION MODIFICATION FOR OMNIDIRECTIONAL CONVERSATIONAL VDD
2y 5m to grant Granted May 26, 2026
18/263,706
Patent 12620183
MOVING MEDIA IN EXTENDED REALITY
2y 9m to grant Granted May 05, 2026
18/670,573
Patent 12620129
POSE QUANTIZATION-BASED KEYFRAME PRUNING FOR SIMULTANEOUS LOCALIZATION AND MAPPING
1y 11m to grant Granted May 05, 2026
18/441,528
Patent 12614364
METHOD AND DEVICE FOR CREATING HEAD AVATAR USING SHOT VIDEO
2y 2m to grant Granted Apr 28, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

1-2
Expected OA Rounds
85%
Grant Probability
99%
With Interview (+14.0%)
2y 5m (~9m remaining)
Median Time to Grant
Low
PTA Risk
Based on 823 resolved cases by this examiner. Grant probability derived from career allowance rate.