Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office action is responsive to the amendment filed on 3/11/2026.
Claims 1-36 are pending in the application.
The information disclosure statements filed on 4/2/2026 and 3/12/2026 have been considered.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-36 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more. Specifically, claims 1-36 are directed to an abstract idea.
Per claim 1, the claim is directed to an idea of itself, i.e., mental processes that can be performed in the human mind or by a human using pen and paper. The steps of generating respective coordinates, identifying two or more reduction operations, combining the identified two or more reduction operations into a single software kernel, and generating executable program code, as drafted, are functions that, under their broadest reasonable interpretation, recite the abstract idea of a mental process. There is no recitation of how the combination into a single software kernel is performed in any particular manner. The single software kernel is recited as mere code or an instruction. A developer can combine operations into a single piece of code mentally or using pen and paper. Furthermore, with respect to generating executable program code, because the claim does not recite any particular manner or details of the generation of the executable code, and because the code need not be large, a developer can certainly generate a small piece of executable code. The limitations encompass a human mind carrying out the functions through observation, evaluation, judgment and/or opinion, or even with the aid of pen and paper. Thus, these limitations recite and fall within the “Mental Processes” grouping of abstract ideas under Step 2A, Prong One. The additional limitation of one or more circuits is a mere generic computing element for automation, recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer to perform the judicial exception. See MPEP 2106.05(f) and 2106.05(h). Therefore, the additional limitation does not integrate the judicial exception into a practical application.
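For illustration only, the examiner notes that combining two dependent reduction operations into a single routine is the kind of exercise a developer could perform with pen and paper. The following Python sketch is hypothetical (the function names and data are the examiner's own and are not drawn from the claims or any cited reference):

```python
# Illustrative sketch only (hypothetical code): two dependent reductions --
# a row-wise sum followed by a max over the row sums -- combined into a
# single "kernel" routine.

def separate_reductions(matrix):
    # First reduction: 2-D input -> 1-D output (fewer dimensions).
    row_sums = [sum(row) for row in matrix]
    # Second reduction depends on the first reduction's output:
    # 1-D input -> scalar output.
    return max(row_sums)

def fused_reduction(matrix):
    # Both reductions performed in one pass over the input, keeping
    # the intermediate partial sums in local variables.
    result = float("-inf")
    for row in matrix:
        partial = 0
        for value in row:
            partial += value
        if partial > result:
            result = partial
    return result

data = [[1, 2, 3], [4, 5, 6]]
assert separate_reductions(data) == fused_reduction(data) == 15
```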
More particularly, the claims do not recite (i) an improvement to the functionality of a computer or other technology or technical field; (ii) a “particular machine” to apply or use the judicial exception; (iii) a particular transformation of an article to a different thing or state; or (iv) any other meaningful limitation. See MPEP § 2106.05(a)-(c), (e). Rather, claim 1 recites an abstract idea as identified in Step 2A, Prong One, supra, and none of the limitations integrates the judicial exception into a practical application. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components or insignificant extra-solution activities (e.g., processors, devices, program instructions, data gathering, and displaying output), then it falls within the "Mental Processes" grouping of abstract ideas (2019 PEG, Step 2A, Prong One: abstract idea grouping? Yes, mental process). Viewing the limitations individually and as a combination, the additional element merely performs the mental steps using generic computer components without integrating the abstract idea into a practical application. For at least these reasons, claim 1 is not patent eligible.
Per claims 2-6, these claims are directed to the same abstract idea as claim 1, reciting details of the mental steps and a generic computing element, such as a parallel processing unit, to apply the mental steps without adding any other additional element that is significantly more. Therefore, the claims are rejected for the same reasons as claim 1.
Per claims 7-12, these claims are directed to the same abstract idea as claims 1-6, reciting details of data and the mental steps without adding any other additional element that is significantly more. The compiler is recited as a generic tool to perform the mental steps. Therefore, the claims are rejected for the same reasons as claims 1-6.
Per claims 13-18, these claims are directed to the same abstract idea as claims 1-6, reciting details of data and the mental steps without adding any other additional element that is significantly more. The processor, parallel processing unit, or graphics processing unit is recited as a generic tool to perform the mental steps. Therefore, the claims are rejected for the same reasons as claims 1-6.
Per claims 19-24, these claims are directed to the same abstract idea as claims 1-6, reciting details of data and the mental steps without adding any additional element that is significantly more. Therefore, the claims are rejected for the same reasons as claims 1-6.
Per claims 25-30, these claims are directed to the same abstract idea as claims 1-6, reciting details of data and the mental steps without adding any additional element that is significantly more. The processors, parallel processing unit, or graphics processing unit are recited as generic tools to perform the mental steps. Therefore, the claims are rejected for the same reasons as claims 1-6.
Per claims 31-36, these claims are directed to the same abstract idea as claims 1-6, reciting details of data and the mental steps without adding any additional element that is significantly more. The computer vision system including a processor; one or more of a propulsion system, a directional control system, and a vehicle operator notification system; a compiler; and a neural network are recited as generic tools to perform the mental steps. Therefore, the claims are rejected for the same reasons as claims 1-6.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 7-9 and 11-18 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Majnemer et al. (US20210049231, hereafter Majnemer).
7. A processor, comprising: one or more circuits to perform a software kernel comprising two or more reduction operations including a second reduction operation that depends on a first output of a first reduction operation, (Majnemer, see at least [0080], The process 600 includes fusing (610) the at least two operations of the set of operations; [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. … fuse (e.g., combine) the repetitive operations of the calculation into a single operation…. The intermediate result can be stored and referenced repeatedly as needed to complete the convolution computation; [0008], fusing at least two operations comprises causing one or more processors of the matrix multiplication circuitry to latch data representing kernel values for the convolution operation; [0009]; [0011]; [0037], the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations. The dependence graph can be updated by the compiler 102 to show where parallelization or pipelining may occur in the dependence graph in response to receiving the specification 108; [0049], The specification 108 can include a description of the GPU. The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. 
In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed; Note that the addition step (which accumulates products into partial sums, producing an output with fewer dimensions) and the reduction step (which collapses the partial sums into a final result that depends on the addition step’s output, producing an output with even fewer dimensions) of the reduce_sum operation are reduction operations, one depending on the other’s output, combined into a single GPU kernel with intermediate values stored in registers);
wherein the first reduction operation generates the first output with fewer dimensions than an input to the first reduction operation, and the second reduction operation generates a second output with fewer dimensions than an input to the second reduction operation (Majnemer, see at least [0041]; [0049] The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed. Reducing memory accesses off-chip reduces memory bandwidth usage, which can be a source of latency for processing these operations (e.g., the cause of a computation bottleneck); [0079] The compiler 102 thus can instruct the intermediate values to be stored and combine steps of the matrix multiplication and summation for the convolution operation, reducing processing time; [0080] fusing two or more operations of the set of operations by the matrix multiplication circuitry indicated by the specification. The process includes determining (608), based on the applying, that the computing cost is reduced by fusing at least two operations of the set of operations when the fused operations are executed by the matrix multiplication circuitry. The process 600 includes fusing (610) the at least two operations of the set of operations. 
The process 600 includes generating (612) executable logic configured to sequence the fused operations into the matrix multiplication circuitry; Note that the addition step (which accumulates products into partial sums, producing an output with fewer dimensions) and the reduction step (which collapses the partial sums into a final result that depends on the addition step’s output, producing an output with even fewer dimensions) of the reduce_sum operation are reduction operations. In the dependence graph, an edge from one operation to another indicates that one operation depends on the output of the other. The intermediate values stored in registers create these dependency edges. In the reduce_sum operation, the first reduction (addition) accumulates products into partial sums, so its output has fewer values than its input, and the reduction step (the second reduction) depends on the addition step’s output and collapses the partial sums into the final result. The two dependent reductions and the multiplication are fused into a single GPU kernel).
wherein performance of the software kernel includes fewer accesses to a global memory than performing the first reduction operation and the second reduction operation in separate software kernels (Majnemer, see at least [0038]; [0049], The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed. Reducing memory accesses off-chip reduces memory bandwidth usage, which can be a source of latency for processing these operations (e.g., the cause of a computation bottleneck); [0035]; [0079]; [0080], The process includes determining (608), based on the applying, that the computing cost is reduced by fusing at least two operations of the set of operations when the fused operations are executed by the matrix multiplication circuitry. The process 600 includes fusing (610) the at least two operations of the set of operations; Note that the addition, multiplication, and reduction steps of a reduce_sum operation are combined into the single operation, that is, the two dependent reductions and multiplication are fused into a single GPU kernel; Note that off-chip (global) memory traffic between dependent reductions is reduced by the fusion with the GPU registers for the intermediate values).
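For illustration of the fewer-global-memory-accesses limitation as mapped above, the following hypothetical Python sketch (the class and function names are the examiner's own, not Majnemer's) simulates the reduced off-chip traffic when two dependent reductions are fused and the intermediate result stays in local storage:

```python
# Hypothetical sketch: counting simulated "global memory" accesses for
# separate vs. fused reduction kernels.

class GlobalMemory:
    def __init__(self):
        self.accesses = 0
        self.store = {}
    def write(self, key, value):
        self.accesses += 1
        self.store[key] = value
    def read(self, key):
        self.accesses += 1
        return self.store[key]

def separate_kernels(mem, matrix):
    mem.write("input", matrix)
    # Kernel 1: row sums written back out to global memory.
    mem.write("partial", [sum(row) for row in mem.read("input")])
    # Kernel 2: must re-read the intermediate from global memory.
    mem.write("output", sum(mem.read("partial")))
    return mem.read("output")

def fused_kernel(mem, matrix):
    mem.write("input", matrix)
    # Intermediate partial sums stay in local ("register") storage.
    partial = [sum(row) for row in mem.read("input")]
    mem.write("output", sum(partial))
    return mem.read("output")

data = [[1, 2], [3, 4]]
m1, m2 = GlobalMemory(), GlobalMemory()
assert separate_kernels(m1, data) == fused_kernel(m2, data) == 10
assert m2.accesses < m1.accesses  # fusion performs fewer global accesses
```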
8. The processor of claim 7, wherein the two or more reduction operations have been combined into the software kernel by a compiler based, at least in part, on coordinates for a tensor used by one or more reduction operations of the two or more reduction operations (Majnemer, see at least [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. …The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation; [0015], fusing the operations comprises concatenation of two tensors by multi-dimensional broadcasting; [0037] the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations; [0049], how the operations are combined is dependent on what the processing capabilities are of the GPU, for example which processing resources are available for use and how they are connected. …the intermediate representation 104 can be loaded into the compiler 102 at run time to determine how best to combine operations, the compiler 102 can run as JIT compiler. For example, the compiler 102 can receive bytecode representing tensor flow graphs. 
The compiler 102 can optimize those graphs (if possible) using target-independent fusions 112; [0008], fusing at least two operations comprises causing one or more processors of the matrix multiplication circuitry to latch data representing kernel values for the convolution operation; [0049] The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel).
9. The processor of claim 7, wherein the two or more reduction operations have been combined into the software kernel with one or more elementwise operations by a compiler (Majnemer, see at least [0049] The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel; [0070] to perform element-wise multiplication, a cell may receive a value at the sum in register 406 and the received value may be output to an adjacent cell, i.e., without the summation circuitry 412 adding the product to the received value. The cell may also provide the product produced by the multiplication circuitry 410, corresponding to an element-wise multiplication by the cell, to an adjacent cell without summing the product and a value received at the sum in register 406. The result is that the array 306 can provide output vectors corresponding to element-wise multiplication of activation inputs and weights).
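For illustration of combining a reduction with an elementwise operation, as in the reduce_sum mapping above, consider the following hypothetical Python sketch (the names are illustrative only, not drawn from the claims or the reference):

```python
# Hypothetical sketch: a reduction fused with an elementwise operation,
# as in combining elementwise multiplication with a summation reduction.

def fused_multiply_reduce(xs, ws):
    # Elementwise multiply and accumulate in a single pass; the
    # elementwise products are never materialized as a separate array.
    total = 0
    for x, w in zip(xs, ws):
        total += x * w
    return total

assert fused_multiply_reduce([1, 2, 3], [4, 5, 6]) == 32
```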
Per claim 11, Majnemer further teaches:
wherein the two or more reduction operations include two or more of a mean operation, a sum operation, a product operation, a min operation, or a max operation (Majnemer, see at least [0002], An operation can include a computing step of an algorithm, such as an addition step…. an operation includes a sequence of steps, such as a multiplication operation. For example, in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. For multiplying tensors without optimization, some of the intermediate values are repeatedly calculated to be included in the final sum. The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation; [0027], the intermediate representation 104 can represent a cross replica sum operation, which computes a sum across each processing unit and in which the output shape of the data is the same as the input shape. For example, the operations can include concatenation of two values, matrix multiplication, global gather, global scatter (or global broadcast), convolution, transposition of a matrix, retrieving a matrix slice, and so forth.
These operations are a representative number of operations, and this list is non-exhaustive; [0038], A subset of the activation values are stored in memory local to a processing unit and repeatedly summed by that processing unit with portions of the array of parameter values. Each processing unit can be assigned a different subset of the activation values, and their results can be broadcast to other processing units after parallel summing occurs; [0042], the compiler 102 is configured to introduce parallel loading of data into the processing units (if the hardware architecture is configured to do so) to further optimize execution of the instruction set. Parallel loading can be introduced for parallel processing or for pipelining the processing of data by the multi-dimensional array of processing units; [0046]; [0066]).
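For illustration of the recited reduction operators, the following hypothetical Python sketch (the examiner's own, not from Majnemer) computes several such reductions in a single traversal of the input:

```python
# Hypothetical sketch: several of the recited reduction operators
# (sum, product, min, max, mean) computed in one pass over the data.

def multi_reduce(values):
    total, product = 0, 1
    lo = hi = values[0]
    for v in values:
        total += v
        product *= v
        lo = min(lo, v)
        hi = max(hi, v)
    return {"sum": total, "product": product, "min": lo,
            "max": hi, "mean": total / len(values)}

r = multi_reduce([2, 4, 6])
assert r["sum"] == 12 and r["product"] == 48
assert r["min"] == 2 and r["max"] == 6 and r["mean"] == 4.0
```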
12. The processor of claim 7, wherein the one or more circuits are to perform the software kernel after receiving a kernel launch command from a host computer system (Majnemer, see at least [0049] The target-specific instructions 116 can be sent to target hardware for execution. For example, the target hardware including the multi-dimensional array of processing units can be a part of a graphical processing unit (GPU). The specification 108 can include a description of the GPU. The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel; Note that based on receiving the instructions to execute on the target system (i.e., a kernel launch command), the GPU kernel is performed).
13. A non-transitory machine-readable medium having stored thereon a set of instructions, which if performed by a processor, cause the processor to at least: perform a software kernel comprising two or more reduction operations including a second reduction operation that depends on a first output of a first reduction operation, (Majnemer, see at least [0080], The process 600 includes fusing (610) the at least two operations of the set of operations; [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. … fuse (e.g., combine) the repetitive operations of the calculation into a single operation…. The intermediate result can be stored and referenced repeatedly as needed to complete the convolution computation; [0008], fusing at least two operations comprises causing one or more processors of the matrix multiplication circuitry to latch data representing kernel values for the convolution operation; [0009]; [0011]; [0037], the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations. The dependence graph can be updated by the compiler 102 to show where parallelization or pipelining may occur in the dependence graph in response to receiving the specification 108; [0049], The specification 108 can include a description of the GPU. The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. 
In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed; Note that the addition step (which accumulates products into partial sums, producing an output with fewer dimensions) and the reduction step (which collapses the partial sums into a final result that depends on the addition step’s output, producing an output with even fewer dimensions) of the reduce_sum operation are reduction operations, one depending on the other’s output, combined into a single GPU kernel with intermediate values stored in registers);
wherein the first reduction operation generates the first output with fewer dimensions than an input to the first reduction operation, and the second reduction operation generates a second output with fewer dimensions than an input to the second reduction operation (Majnemer, see at least [0041]; [0049] The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed. Reducing memory accesses off-chip reduces memory bandwidth usage, which can be a source of latency for processing these operations (e.g., the cause of a computation bottleneck); [0079] The compiler 102 thus can instruct the intermediate values to be stored and combine steps of the matrix multiplication and summation for the convolution operation, reducing processing time; [0080] fusing two or more operations of the set of operations by the matrix multiplication circuitry indicated by the specification. The process includes determining (608), based on the applying, that the computing cost is reduced by fusing at least two operations of the set of operations when the fused operations are executed by the matrix multiplication circuitry. The process 600 includes fusing (610) the at least two operations of the set of operations. 
The process 600 includes generating (612) executable logic configured to sequence the fused operations into the matrix multiplication circuitry; Note that the addition step (which accumulates products into partial sums, producing an output with fewer dimensions) and the reduction step (which collapses the partial sums into a final result that depends on the addition step’s output, producing an output with even fewer dimensions) of the reduce_sum operation are reduction operations. In the dependence graph, an edge from one operation to another indicates that one operation depends on the output of the other. The intermediate values stored in registers create these dependency edges. In the reduce_sum operation, the first reduction (addition) accumulates products into partial sums, so its output has fewer values than its input, and the reduction step (the second reduction) depends on the addition step’s output and collapses the partial sums into the final result. The two dependent reductions and the multiplication are fused into a single GPU kernel).
wherein performance of the software kernel includes fewer accesses to a global memory than performing the first reduction operation and the second reduction operation in separate software kernels (Majnemer, see at least [0038]; [0049], The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed. Reducing memory accesses off-chip reduces memory bandwidth usage, which can be a source of latency for processing these operations (e.g., the cause of a computation bottleneck); [0035]; [0079]; [0080], The process includes determining (608), based on the applying, that the computing cost is reduced by fusing at least two operations of the set of operations when the fused operations are executed by the matrix multiplication circuitry. The process 600 includes fusing (610) the at least two operations of the set of operations; Note that the addition, multiplication, and reduction steps of a reduce_sum operation are combined into the single operation, that is, the two dependent reductions and multiplication are fused into a single GPU kernel; Note that off-chip (global) memory traffic between dependent reductions is reduced by the fusion with the GPU registers for the intermediate values).
14. The non-transitory machine-readable medium of claim 13, wherein the two or more reduction operations were combined into the software kernel by a compiler (Majnemer, see at least [0008], fusing at least two operations comprises causing one or more processors of the matrix multiplication circuitry to latch data representing kernel values for the convolution operation; [0049] The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel; [0037] the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations; [0049], how the operations are combined is dependent on what the processing capabilities are of the GPU, for example which processing resources are available for use and how they are connected. …the intermediate representation 104 can be loaded into the compiler 102 at run time to determine how best to combine operations, the compiler 102 can run as JIT compiler).
15. The machine-readable medium of claim 13, wherein the two or more reduction operations were combined into the software kernel by a compiler with one or more of an elementwise operation or a copy operation (Majnemer, see at least [0049] The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel; [0070] to perform element-wise multiplication, a cell may receive a value at the sum in register 406 and the received value may be output to an adjacent cell, i.e., without the summation circuitry 412 adding the product to the received value. The cell may also provide the product produced by the multiplication circuitry 410, corresponding to an element-wise multiplication by the cell, to an adjacent cell without summing the product and a value received at the sum in register 406. The result is that the array 306 can provide output vectors corresponding to element-wise multiplication of activation inputs and weights).
16. The non-transitory machine-readable medium of claim 13, wherein the software kernel includes instructions to be performed in parallel, and the two or more reduction operations were combined into the software kernel by a compiler based, at least in part, on a plurality of threads assigned to one or more tensors used by one or more of the two or more reduction operations, wherein the plurality of threads are to perform one or more operations in parallel (Majnemer, see at least [0036], The compiler 102 compiles the fused operations into target-specific instructions 114 comprising executable code specifying that the fused operations be performed together in the multi-dimensional array of processing units; [0038], The compiler 102 can determine, for example, how large a portion of a first array can be stored for summing repeatedly with portions of the second array to reduce the computation time from exponential to sub-exponential. The size of the portion of the first array can be chosen to match the dimensions of the multi-dimensional array of processing units. This results in the compiler 102 combining the operations (such as the addition operations) in different ways depending on the size of the multi-dimensional array of processing units as specified in the specification 108. In this example, the first array can include an array of parameter values, and the second array can include an array of activation values. A subset of the activation values are stored in memory local to a processing unit and repeatedly summed by that processing unit with portions of the array of parameter values. Each processing unit can be assigned a different subset of the activation values, and their results can be broadcast to other processing units after parallel summing occurs; [0042], the compiler 102 is configured to introduce parallel loading of data into the processing units (if the hardware architecture is configured to do so) to further optimize execution of the instruction set. 
Parallel loading can be introduced for parallel processing or for pipelining the processing of data by the multi-dimensional array of processing units; [0046]; [0066]).
Per claim 17, this is the medium version of claim 11. See the rejection of claim 11 above.
18. The non-transitory machine-readable medium of claim 13, wherein the software kernel is to be performed on a parallel processing unit or a graphics processing unit (Majnemer, see at least [0002], parallel processing; [0049], the target hardware including the multi-dimensional array of processing units can be a part of a graphical processing unit (GPU)).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-6 and 19-30 are rejected under 35 U.S.C. 103 as being unpatentable over Majnemer et al. (US20210049231, hereafter Majnemer) in view of Brevdo (US20180204117).
Per claim 1:
Majnemer teaches A processor, comprising: one or more circuits to: generate respective coordinates for input tensors and output tensors of a plurality of operations of a graph representation of a computer program (Majnemer, see at least [0009] fusing … the graph including one or more optimization pathways. In some implementations, generating executable logic comprises just-in-time compiling. …the processing units of the matrix multiplication circuitry are arranged to form a systolic array; [0037], The dependence graph can be updated as more operations are received in the intermediate representation 104 and in response to status changes for the multi-dimensional array; [0002]; [0003]; [0038] executing a matrix multiplication computation, the compiler 102 optimizes the process of matrix multiplication for a particular configuration of the multi-dimensional array of processing units as specified in the specification 108. …depending on the size of the multi-dimensional array of processing units as specified in the specification; [0060] The matrix computation circuitry 300 includes a two-dimensional systolic array 306. The array 306 includes multiple cells 310. In some implementations, a first dimension of the array 306 corresponds to columns of cells and a second dimension of the array 306 corresponds to rows of cells. 
The array 306 can have more rows than columns, more columns than rows, or an equal number of columns and rows; [0061] In the illustrated example, value loaders 202a, 202b, 202c, and 202d send activation inputs to rows of the array 306 and a weight fetcher interface 302 sends weight inputs to columns of the array 306; [0064]; [0066]; [0076]; [0027] The intermediate representation 104 can be a language representing operations that are specified by a computation graph, e.g., operations specified in a software framework that represents neural network computations as computation graphs—Note that the specification determines which processing unit of the multi-dimensional array receives which data and in what order each vector of the matrix is processed. This is coordinate generation (determining which vector goes to which processing unit, and in what order), which is fundamentally assigning positional coordinates to tensor data across the array. The compiler references the specification, which includes a description of the array, and determines which operations to combine to optimize execution for the array described in the specification. The first dimension of the array corresponds to columns of cells and the second dimension to rows of cells. Value loaders send activation inputs to rows of the array and a weight fetcher interface sends weight inputs to columns of the array (coordinates/indices); this is a coordinate assignment. The graph is a computation graph of the neural network, and the compiler generates coordinates for tensors within that graph through the specification).
identify, based on the generated respective coordinates, two or more reduction operations of the plurality of operations for combination (Majnemer, see at least [0080], The process 600 includes fusing (610) the at least two operations of the set of operations; [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. … fuse (e.g., combine) the repetitive operations of the calculation into a single operation…. The intermediate result can be stored and referenced repeatedly as needed to complete the convolution computation; [0008], fusing at least two operations comprises causing one or more processors of the matrix multiplication circuitry to latch data representing kernel values for the convolution operation; [0009]; [0011]; [0037], the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations. The dependence graph can be updated by the compiler 102 to show where parallelization or pipelining may occur in the dependence graph in response to receiving the specification 108; [0049], The specification 108 can include a description of the GPU. The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. 
In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed; Note that the steps of addition (accumulates products into partial sums, producing output with fewer dimensions) and reduction (collapses partial sums into the final result depending on the addition step’s output, producing output with even fewer dimensions) of the reduce_sum operation are reduction operations, one dependent on the other’s output, combined into a single GPU kernel with intermediates stored in registers);
the two or more reduction operations including a second reduction operation that depends on a first output of a first reduction operation, wherein the first reduction operation generates the first output with fewer dimensions than an input to the first reduction operation, and the second reduction operation generates a second output with fewer dimensions than an input to the second reduction operation (Majnemer, see at least [0041]; [0049] The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed. Reducing memory accesses off-chip reduces memory bandwidth usage, which can be a source of latency for processing these operations (e.g., the cause of a computation bottleneck); [0079] The compiler 102 thus can instruct the intermediate values to be stored and combine steps of the matrix multiplication and summation for the convolution operation, reducing processing time; [0080] fusing two or more operations of the set of operations by the matrix multiplication circuitry indicated by the specification. The process includes determining (608), based on the applying, that the computing cost is reduced by fusing at least two operations of the set of operations when the fused operations are executed by the matrix multiplication circuitry. The process 600 includes fusing (610) the at least two operations of the set of operations.
The process 600 includes generating (612) executable logic configured to sequence the fused operations into the matrix multiplication circuitry—Note that the steps of addition (accumulates products into partial sums, producing output with fewer dimensions) and reduction (collapses partial sums into the final result depending on the addition step’s output, producing output with even fewer dimensions) of the reduce_sum operation are reduction operations. In the dependence graph, an edge from one operation to another indicates that one operation depends on the output of another. The intermediate values stored in registers create these dependency edges. In the reduce_sum operation, the first reduction (addition) accumulates products into partial sums, and its output has fewer values than its input; the reduction step (the second reduction) depends on the addition’s output and collapses the partial sums into the final result. The two dependent reductions and the multiplication are fused into a single GPU kernel).
combine the identified two or more reduction operations including the first reduction operation and the second reduction operation into a single software kernel (Majnemer, see at least [0038]; [0049], The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed. Reducing memory accesses off-chip reduces memory bandwidth usage, which can be a source of latency for processing these operations (e.g., the cause of a computation bottleneck); [0035]; [0079]; [0080], The process includes determining (608), based on the applying, that the computing cost is reduced by fusing at least two operations of the set of operations when the fused operations are executed by the matrix multiplication circuitry. The process 600 includes fusing (610) the at least two operations of the set of operations; Note that the addition, multiplication, and reduction steps of a reduce_sum operation are combined into the single operation, that is, the two dependent reductions and multiplication are fused into a single GPU kernel).
generate executable program code including executable program code for the single software kernel (Majnemer, see at least [0002] The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation. …. This results in more efficient processing of the operation because the amount of data that is loaded and extracted from the multi-dimensional array of processing units is reduced; [0049] the target hardware including the multi-dimensional array of processing units can be a part of a graphical processing unit (GPU). The specification 108 can include a description of the GPU. The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel; [0010], generating executable logic configured to sequence the fused operations into the matrix multiplication circuitry; [0007] determining, based on the applying, that the computing cost is reduced by fusing at least two operations of the set of operations when the fused operations are executed by the matrix multiplication circuitry; fusing the at least two operations of the set of operations; and generating executable logic configured to sequence the fused operations into the matrix multiplication circuitry; Note that the reduction operations are combined into a single operation which includes a sequence of operations that are combined for execution by a single GPU kernel, meaning, the single operation is a single GPU kernel operation, that is, a single software kernel. The generated executable code for the fused operations is to run on a single GPU kernel).
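For illustration of the mapping above, the fused reduce_sum (multiplication, a first dimension-reducing summation, and a second reduction that depends on the first reduction’s output, with the intermediate kept local rather than written to off-chip memory) can be sketched as follows. This is an illustrative NumPy sketch under assumed shapes and names; it is not code from Majnemer:

```python
import numpy as np

def fused_reduce_sum(a, b):
    """Illustrative sketch of the fused reduce_sum discussed above.

    Element-wise products are accumulated into partial sums (first
    reduction: 2-D -> 1-D), and the partial sums are then collapsed
    into a scalar (second reduction: 1-D -> 0-D).  In a fused GPU
    kernel the intermediate `partial` would live in registers rather
    than being written back to off-chip memory.
    """
    products = a * b                  # multiplication step
    partial = products.sum(axis=1)    # first reduction: fewer dims than its input
    total = partial.sum()             # second reduction: depends on the first output
    return total

a = np.arange(6.0).reshape(2, 3)
b = np.ones((2, 3))
assert fused_reduce_sum(a, b) == (a * b).sum()
```

Each reduction produces an output with fewer dimensions than its input, and the second consumes the first’s output, matching the dependency structure discussed above.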
Majnemer teaches eliminating off-chip (global) memory traffic between dependent reductions and gradient vectors, which are the core output of backpropagation (Majnemer, see at least [0049]; abstract, performing reduction of gradient vectors). However, Majnemer does not explicitly teach indexing or coordinating across tensor dimensions through backpropagation. Brevdo teaches such backpropagation through the graph representation (Brevdo, see at least [0017] The tensor array also allows lazy execution of operations by ensuring operations occur in sequence using control dependencies. The tensor array also ensures that back propagation calculation occurs correctly by using the control dependencies and intelligent resource management when executing create the back propagation graph; [0065] During forward propagation, tensor arrays have the property that a tensor at any given index may be written only once. This property ensures the source of the write to be able to send the source to the correct gradient tensor during backpropagation. If more than one write occurs, the system does not know what values to pass to the writing operations during backpropagation). It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have modified Majnemer’s kernel-fusion system to incorporate backpropagation as taught by Brevdo, with a reasonable expectation of success, since the references are analogous art from the same field of endeavor related to machine learning. Combining Brevdo’s functionality with that of Majnemer results in a system that supports backpropagation.
The modification would be obvious because one having ordinary skill in the art would be motivated to make this combination to optimize speed and calculation precision by reusing calculations for fast processing, and to allow a model to learn spatial relationships (Brevdo, see at least [0017] The tensor array also allows lazy execution of operations by ensuring operations occur in sequence using control dependencies. The tensor array also ensures that back propagation calculation occurs correctly by using the control dependencies and intelligent resource management when executing create the back propagation graph; [0065] During forward propagation, tensor arrays have the property that a tensor at any given index may be written only once. This property ensures the source of the write to be able to send the source to the correct gradient tensor during backpropagation. If more than one write occurs, the system does not know what values to pass to the writing operations during backpropagation).
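Brevdo’s cited write-once property ([0065]) can be sketched as follows. The class and method names are hypothetical, chosen only to illustrate why a second write to the same index would make gradient routing ambiguous during backpropagation:

```python
class WriteOnceTensorArray:
    """Illustrative sketch (names are hypothetical, not from Brevdo):
    a tensor at a given index may be written only once during forward
    propagation, so the source of each write can be matched to the
    correct gradient tensor during backpropagation."""

    def __init__(self, size):
        self._slots = [None] * size
        self._written = [False] * size

    def write(self, index, tensor):
        if self._written[index]:
            # A second write would leave the backpropagation pass unable
            # to determine which writer should receive the gradient.
            raise ValueError(f"index {index} already written; "
                             "gradient routing would be ambiguous")
        self._slots[index] = tensor
        self._written[index] = True

    def read(self, index):
        return self._slots[index]

ta = WriteOnceTensorArray(3)
ta.write(0, [1.0, 2.0])
try:
    ta.write(0, [3.0, 4.0])   # second write to the same index is rejected
except ValueError:
    pass
```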
2. The processor of claim 1, the one or more circuits cause coordinates for one or more elements of input tensors to the second reduction operation to be generated, and cause the two or more reduction operations to be combined based, at least in part, on the generated coordinates (Majnemer, see at least [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. …The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation; [0015], fusing the operations comprises concatenation of two tensors by multi-dimensional broadcasting; [0037] the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations; [0049], how the operations are combined is dependent on what the processing capabilities are of the GPU, for example which processing resources are available for use and how they are connected. …the intermediate representation 104 can be loaded into the compiler 102 at run time to determine how best to combine operations, the compiler 102 can run as JIT compiler. For example, the compiler 102 can receive bytecode representing tensor flow graphs.
The compiler 102 can optimize those graphs (if possible) using target-independent fusions 112; [0038]; [0042]; [0009] fusing the operations comprises concatenation of two tensors; [0011] store at least a portion of a result of a fused operation and an arithmetic processing unit (ALU) configured to operate on the portion of the result; [0037], the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations. The dependence graph can be updated by the compiler 102 to show where parallelization or pipelining may occur in the dependence graph in response to receiving the specification 108).
3. The processor of claim 1, wherein the two or more reduction operations include two or more of a mean operation, a sum operation, a product operation, a min operation, or a max operation (Majnemer, see at least [0002], An operation can include a computing step of an algorithm, such as an addition step…. an operation includes a sequence of steps, such as a multiplication operation. For example, in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. For multiplying tensors without optimization, some of the intermediate values are repeatedly calculated to be included in the final sum. The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation; [0027], the intermediate representation 104 can represent a cross replica sum operation, which computes a sum across each processing unit and in which the output shape of the data is the same as the input shape. For example, the operations can include concatenation of two values, matrix multiplication, global gather, global scatter (or global broadcast), convolution, transposition of a matrix, retrieving a matrix slice, and so forth.
These operations are a representative number of operations, and this list is non-exhaustive; [0038], A subset of the activation values are stored in memory local to a processing unit and repeatedly summed by that processing unit with portions of the array of parameter values. Each processing unit can be assigned a different subset of the activation values, and their results can be broadcast to other processing units after parallel summing occurs; [0042], the compiler 102 is configured to introduce parallel loading of data into the processing units (if the hardware architecture is configured to do so) to further optimize execution of the instruction set. Parallel loading can be introduced for parallel processing or for pipelining the processing of data by the multi-dimensional array of processing units; [0046]; [0066]).
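The reduction operations recited in claim 3 (mean, sum, product, min, max) share the dimension-reducing character relied on above. An illustrative NumPy sketch (not from the cited references) of that shared property:

```python
import numpy as np

# Each listed reduction collapses an axis, so the output has fewer
# dimensions than the input (illustrative NumPy equivalents only).
REDUCTIONS = {
    "mean": np.mean,
    "sum": np.sum,
    "product": np.prod,
    "min": np.min,
    "max": np.max,
}

x = np.array([[1.0, 2.0], [3.0, 4.0]])
for name, op in REDUCTIONS.items():
    reduced = op(x, axis=0)
    assert reduced.ndim == x.ndim - 1   # fewer dimensions than the input
```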
4. The processor of claim 1, wherein the one or more circuits cause one or more threads to be assigned to elements of tensors used by the two or more reduction operations, and cause the two or more reduction operations to be combined into the software kernel based, at least in part, on the one or more assigned threads (Majnemer, see at least [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. …The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation; [0015], fusing the operations comprises concatenation of two tensors by multi-dimensional broadcasting; [0037] the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations; [0049], how the operations are combined is dependent on what the processing capabilities are of the GPU, for example which processing resources are available for use and how they are connected. …the intermediate representation 104 can be loaded into the compiler 102 at run time to determine how best to combine operations, the compiler 102 can run as JIT compiler. For example, the compiler 102 can receive bytecode representing tensor flow graphs. The compiler 102 can optimize those graphs (if possible) using target-independent fusions 112; [0038], A subset of the activation values are stored in memory local to a processing unit and repeatedly summed by that processing unit with portions of the array of parameter values. 
Each processing unit can be assigned a different subset of the activation values, and their results can be broadcast to other processing units after parallel summing occurs; [0042], the compiler 102 is configured to introduce parallel loading of data into the processing units (if the hardware architecture is configured to do so) to further optimize execution of the instruction set. Parallel loading can be introduced for parallel processing or for pipelining the processing of data by the multi-dimensional array of processing units; [0046]; [0066]).
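The mapping for claim 4 (threads assigned to subsets of tensor elements, partial results combined after parallel summing, per Majnemer [0038]) can be sketched as follows. The chunking strategy and function names are illustrative assumptions, not drawn from the reference:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(tensor, num_threads=4):
    """Illustrative sketch: assign threads to subsets of tensor
    elements (analogous to processing units each summing its own
    subset in parallel) and then combine the partial results."""
    flat = tensor.ravel()
    chunks = np.array_split(flat, num_threads)      # one subset per thread
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(np.sum, chunks))   # parallel partial sums
    return sum(partials)                            # combine partial results

x = np.arange(10.0)
assert parallel_sum(x) == x.sum()
```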
5. The processor of claim 1, wherein the one or more circuits cause one or more multiplication operations between a vector and a set of data to be replaced with a set of replacement operations, and cause the two or more reduction operations to be combined with the set of replacement operations into the software kernel (Majnemer, see at least [0038] In an example of executing a matrix multiplication computation, the compiler 102 optimizes the process of matrix multiplication for a particular configuration of the multi-dimensional array of processing units as specified in the specification 108. The compiler 102 receives the specification detailing a size of the multi-dimensional array of processing units, how any memory associated with the multi-dimensional array of processing units is configured, how the processing units are connected to one another, and so forth. The compiler 102 can determine, for example, how large a portion of a first array can be stored for summing repeatedly with portions of the second array to reduce the computation time from exponential to sub-exponential; [0065] The output can be passed to adjacent cells along the same dimensions as the given weight input and output without being accumulated, e.g., to perform element-wise multiplication of a set of weights and activation inputs; [0070] The cell may also provide the product produced by the multiplication circuitry 410, corresponding to an element-wise multiplication by the cell, to an adjacent cell without summing the product and a value received at the sum in register 406. The result is that the array 306 can provide output vectors corresponding to element-wise multiplication of activation inputs and weights; Note that the element-wise multiplication and sum reduction are the replaced/modified operations).
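The replacement read onto claim 5 (a vector-matrix multiplication rewritten as element-wise multiplication followed by a sum reduction, per Majnemer [0065] and [0070]) can be sketched as follows; shapes and names are illustrative:

```python
import numpy as np

def matvec_as_elementwise(matrix, vector):
    """Illustrative sketch of the replacement discussed above: a
    multiplication between a vector and a set of data is rewritten as
    broadcast element-wise multiplication followed by a sum reduction,
    which can then be fused with downstream reductions in one kernel."""
    products = matrix * vector          # element-wise multiply (broadcast)
    return products.sum(axis=1)         # sum reduction replaces the dot step

m = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([10.0, 100.0])
assert np.array_equal(matvec_as_elementwise(m, v), m @ v)
```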
6. The processor of claim 1, wherein the software kernel is to be performed on a parallel processing unit (Majnemer, see at least [0002] This specification describes technologies relating to processing of datasets in general, and specifically to parallel processing … in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. …The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation; [0015], fusing the operations comprises concatenation of two tensors by multi-dimensional broadcasting; [0037] the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations; [0049], how the operations are combined is dependent on what the processing capabilities are of the GPU, for example which processing resources are available for use and how they are connected. …the intermediate representation 104 can be loaded into the compiler 102 at run time to determine how best to combine operations, the compiler 102 can run as JIT compiler. For example, the compiler 102 can receive bytecode representing tensor flow graphs. The compiler 102 can optimize those graphs (if possible) using target-independent fusions 112; [0038], Each processing unit can be assigned a different subset of the activation values, and their results can be broadcast to other processing units after parallel summing occurs).
Per claim 19, this is the method version of claim 1; therefore it is rejected for the reasons given in the rejection of claim 1 above.
20. The method of claim 19, further comprising generating coordinates for one or more elements of one or more tensors used by one or more of the two or more reduction operations, and combining the two or more reduction operations based, at least in part, on the generated coordinates (Majnemer, see at least [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. …The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation; [0015], fusing the operations comprises concatenation of two tensors by multi-dimensional broadcasting; [0037] the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations; [0049], how the operations are combined is dependent on what the processing capabilities are of the GPU, for example which processing resources are available for use and how they are connected. …the intermediate representation 104 can be loaded into the compiler 102 at run time to determine how best to combine operations, the compiler 102 can run as JIT compiler. For example, the compiler 102 can receive bytecode representing tensor flow graphs. The compiler 102 can optimize those graphs (if possible) using target-independent fusions 112; [0038], A subset of the activation values are stored in memory local to a processing unit and repeatedly summed by that processing unit with portions of the array of parameter values. 
Each processing unit can be assigned a different subset of the activation values, and their results can be broadcast to other processing units after parallel summing occurs; [0042], the compiler 102 is configured to introduce parallel loading of data into the processing units (if the hardware architecture is configured to do so) to further optimize execution of the instruction set. Parallel loading can be introduced for parallel processing or for pipelining the processing of data by the multi-dimensional array of processing units; [0046]; [0066]).
Per claim 21, this is the method version of claim 3, see the rejection of claim 3 above.
22. The method of claim 19, further comprising replacing a first operation with a set of replacement operations, and combining the two or more reduction operations into the software kernel with the set of replacement operations (Majnemer, see at least [0038] In an example of executing a matrix multiplication computation, the compiler 102 optimizes the process of matrix multiplication for a particular configuration of the multi-dimensional array of processing units as specified in the specification 108. The compiler 102 receives the specification detailing a size of the multi-dimensional array of processing units, how any memory associated with the multi-dimensional array of processing units is configured, how the processing units are connected to one another, and so forth. The compiler 102 can determine, for example, how large a portion of a first array can be stored for summing repeatedly with portions of the second array to reduce the computation time from exponential to sub-exponential; [0065] The output can be passed to adjacent cells along the same dimensions as the given weight input and output without being accumulated, e.g., to perform element-wise multiplication of a set of weights and activation inputs; [0070] The cell may also provide the product produced by the multiplication circuitry 410, corresponding to an element-wise multiplication by the cell, to an adjacent cell without summing the product and a value received at the sum in register 406. The result is that the array 306 can provide output vectors corresponding to element-wise multiplication of activation inputs and weights; Note that the element-wise multiplication and sum reduction are the replaced/modified operations).
23. The method of claim 19, further comprising assigning one or more threads to elements of one or more tensors used by one or more of the two or more reduction operations, wherein combining the two or more reduction operations into the software kernel is based, at least in part, on the assigned one or more threads, and the software kernel includes instructions to be performed in parallel using the assigned one or more threads (Majnemer, see at least [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. …The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation; [0015], fusing the operations comprises concatenation of two tensors by multi-dimensional broadcasting; [0037] the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations; [0049], how the operations are combined is dependent on what the processing capabilities are of the GPU, for example which processing resources are available for use and how they are connected. …the intermediate representation 104 can be loaded into the compiler 102 at run time to determine how best to combine operations, the compiler 102 can run as JIT compiler. For example, the compiler 102 can receive bytecode representing tensor flow graphs. The compiler 102 can optimize those graphs (if possible) using target-independent fusions 112.).
24. The method of claim 19, further comprising selecting a reduction algorithm, wherein combining the two or more reduction operations into the software kernel is based, at least in part, on the selected reduction algorithm (Majnemer, see at least [0002], compiler can configure the instructions to optimize the processing of the instructions for the particular configuration of the processing units of the array. For example, the compiler can configure an operation, which may include reuse of portions of the data for repetitive calculations, to store results for those calculations and reuse the results as needed, rather than performing the repetitive calculation. An operation can include a computing step of an algorithm, such as an addition step. In some implementations, an operation includes a sequence of steps, such as a multiplication operation. For example, in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set … The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation. In this example, the compiler recognizes that the intermediate result can be computed; the algorithm that is configured to combine the operations is the selected reduction algorithm).
Per claims 25-26, they are the system versions of claims 1 and 15; therefore, they are rejected in connection with the rejections of claims 1 and 15. Majnemer further teaches a system with one or more processors and memories to store the software kernel (Majnemer, see at least [0010]; [0028]; [0086]).
27. The system of claim 25, wherein the two or more reduction operations include two or more of a mean operation, a sum operation, a product operation, a min operation, or a max operation, and the software kernel includes instructions to be performed in parallel on a parallel processing unit (Majnemer, see at least [0002], An operation can include a computing step of an algorithm, such as an addition step…. an operation includes a sequence of steps, such as a multiplication operation. For example, in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. For multiplying tensors without optimization, some of the intermediate values are repeatedly calculated to be included in the final sum. The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation; [0027], the intermediate representation 104 can represent a cross replica sum operation, which computes a sum across each processing unit and in which the output shape of the data is the same as the input shape. For example, the operations can include concatenation of two values, matrix multiplication, global gather, global scatter (or global broadcast), convolution, transposition of a matrix, retrieving a matrix slice, and so forth. 
These operations are a representative number of operations, and this list is non-exhaustive; [0036], The compiler 102 compiles the fused operations into target-specific instructions 114 comprising executable code specifying that the fused operations be performed together in the multi-dimensional array of processing units; [0038], The compiler 102 can determine, for example, how large a portion of a first array can be stored for summing repeatedly with portions of the second array to reduce the computation time from exponential to sub-exponential. The size of the portion of the first array can be chosen to match the dimensions of the multi-dimensional array of processing units. This results in the compiler 102 combining the operations (such as the addition operations) in different ways depending on the size of the multi-dimensional array of processing units as specified in the specification 108. In this example, the first array can include an array of parameter values, and the second array can include an array of activation values. A subset of the activation values are stored in memory local to a processing unit and repeatedly summed by that processing unit with portions of the array of parameter values. Each processing unit can be assigned a different subset of the activation values, and their results can be broadcast to other processing units after parallel summing occurs; [0042], the compiler 102 is configured to introduce parallel loading of data into the processing units (if the hardware architecture is configured to do so) to further optimize execution of the instruction set. Parallel loading can be introduced for parallel processing or for pipelining the processing of data by the multi-dimensional array of processing units; [0046]; [0066]).
28. The system of claim 25, wherein the software kernel implements a portion of an inferencing operation using a neural network (Majnemer, see at least [0030], The intermediate representation 104 can be a language representing operations that are specified by a computation graph, e.g., operations specified in a software framework that represents neural network computations as computation graphs).
29. The system of claim 25, wherein the software kernel includes instructions to be performed in parallel, the one or more processors are a first one or more processors, the system further comprises a second one or more processors, and the first one or more processors are to launch the software kernel for performance by the second one or more processors (Majnemer, see at least [0030] The compiler 102 is configured to combine (e.g., fuse) multiple operations of the intermediate representation 104 together for any target hardware architecture of the specification 108 and for any application, but can be configured specifically for parallel processing, pipeline processing, and multiprocessing datasets in multi-dimensional arrays of processing units (e.g., systolic arrays). Examples of the multi-dimensional arrays of processing units are described subsequently in further detail. When configuring instructions for executing (e.g., processing) data on a multi-dimensional array of processing units, the compiler 102 can configure the instructions to optimize the processing of the instructions for the particular configuration of the processing units of the array …The compiler 102 references the specification 108, which includes a description of the multi-dimensional array of processing units that are for executing the operations).
30. The system of claim 25, wherein the one or more processors are to generate a schedule based, at least in part, on a representation of a computer program that includes the two or more reduction operations, identify reused data based, at least in part, on the schedule, and combine the two or more reduction operations into the software kernel based, at least in part, on the reused data (Majnemer, see at least [0030] The compiler 102 is configured to combine (e.g., fuse) multiple operations of the intermediate representation 104 together for any target hardware architecture of the specification 108 and for any application, but can be configured specifically for parallel processing, pipeline processing, and multiprocessing datasets in multi-dimensional arrays of processing units (e.g., systolic arrays). …the compiler 102 can configure an operation, which may include reuse of portions of the data for repetitive calculations (e.g., for convolution, matrix multiplication, etc.), to store results for those calculations and reuse the results as needed, rather than performing the repetitive calculation. As previously described, an operation can include a computing step of an algorithm, such as an addition step).
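To illustrate the compiler behavior cited above from Majnemer [0030] (storing the results of repetitive calculations and reusing them rather than recomputing them), the following is a minimal, hypothetical Python sketch; the function name and caching scheme are illustrative assumptions, not code from any reference of record.

```python
# Hypothetical sketch (illustrative only): a 1-D "valid" convolution in
# which the product for each repeated (input value, weight) pair is
# computed once, stored, and reused instead of being recalculated.

def convolve_reusing_products(signal, kernel):
    cache = {}   # (input value, weight) -> stored product
    mults = 0    # number of actual multiplications performed
    out = []
    for i in range(len(signal) - len(kernel) + 1):
        acc = 0
        for j, w in enumerate(kernel):
            key = (signal[i + j], w)
            if key not in cache:
                cache[key] = signal[i + j] * w  # compute once, store
                mults += 1
            acc += cache[key]                   # reuse the stored result
        out.append(acc)
    return out, mults
```

For the signal [1, 2, 1, 2, 1] and kernel [1, 1], only two distinct products are ever computed, whereas a naive implementation would perform eight multiplications, which is the kind of repetition the cited passage describes eliminating.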
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Majnemer in view of Raiman (US10901715).
Per claim 10:
Majnemer teaches wherein the two or more reduction operations have been combined into the software kernel with other operations by a compiler (Majnemer, see at least [0008], fusing at least two operations comprises causing one or more processors of the matrix multiplication circuitry to latch data representing kernel values for the convolution operation; [0049] The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel; [0037] the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations; [0049], how the operations are combined is dependent on what the processing capabilities are of the GPU, for example which processing resources are available for use and how they are connected. …the intermediate representation 104 can be loaded into the compiler 102 at run time to determine how best to combine operations, the compiler 102 can run as JIT compiler; [0031]; [0032], associated memory; [0077], memory of the processing unit; [0065] to perform element-wise multiplication of a set of weights and activation inputs).
Majnemer does not explicitly teach one or more copy operations. Raiman teaches copy operations (Raiman, see at least fig. 4 and associated texts, copy operations). It would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Raiman’s copy operation with Majnemer’s kernel fusion to modify Majnemer’s system to combine the copy operations as taught by Raiman, with a reasonable expectation of success, since they are analogous art from the same field of endeavor related to data processing and execution. Combining Raiman’s functionality with that of Majnemer results in a system that incorporates copy operations into Majnemer’s kernel fusion. The modification would be obvious because one having ordinary skill in the art would be motivated to make this combination in order to combine the copy operations with other operations into a single kernel for efficient processing, reducing memory usage and avoiding repetitive operations (Raiman, see at least fig. 4 and associated texts, copy operations).
Claims 31-33, 35, and 36 are rejected under 35 U.S.C. 103 as being unpatentable over Majnemer (US20210049231, hereafter Majnemer) in view of Kursar (US 2020073969).
Per claim 31:
Majnemer teaches one or more processors to identify one or more objects based, at least in part, on performing one or more inferencing operations using two or more reduction operations combined into a software kernel by a compiler, the two or more reduction operations including a second reduction operation that depends on a first output of a first reduction operation (Majnemer, see at least [0080], The process 600 includes fusing (610) the at least two operations of the set of operations; [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. … fuse (e.g., combine) the repetitive operations of the calculation into a single operation…. The intermediate result can be stored and referenced repeatedly as needed to complete the convolution computation; [0008], fusing at least two operations comprises causing one or more processors of the matrix multiplication circuitry to latch data representing kernel values for the convolution operation; [0009]; [0011]; [0037], the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations. The dependence graph can be updated by the compiler 102 to show where parallelization or pipelining may occur in the dependence graph in response to receiving the specification 108; [0049], The specification 108 can include a description of the GPU. The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. 
In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed; Note that the steps of addition (accumulates products into partial sums producing output with fewer dimensions) and reduction (collapses partial sums into final result depending on the addition step’s output, producing output with even fewer dimensions) of the reduce_sum operation are reduction operations. One dependent on the other’s output combined into a single GPU kernel with intermediates stored in registers);
wherein the first reduction operation generates the first output with fewer dimensions than an input to the first reduction operation, and the second reduction operation generates a second output with fewer dimensions than an input to the second reduction operation (Majnemer, see at least [0041]; [0049] The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed. Reducing memory accesses off-chip reduces memory bandwidth usage, which can be a source of latency for processing these operations (e.g., the cause of a computation bottleneck); [0079] The compiler 102 thus can instruct the intermediate values to be stored and combine steps of the matrix multiplication and summation for the convolution operation, reducing processing time; [0080] fusing two or more operations of the set of operations by the matrix multiplication circuitry indicated by the specification. The process includes determining (608), based on the applying, that the computing cost is reduced by fusing at least two operations of the set of operations when the fused operations are executed by the matrix multiplication circuitry. The process 600 includes fusing (610) the at least two operations of the set of operations. 
The process 600 includes generating (612) executable logic configured to sequence the fused operations into the matrix multiplication circuitry; Note that the steps of addition (accumulates products into partial sums, producing output with fewer dimensions) and reduction (collapses partial sums into the final result depending on the addition step’s output, producing output with even fewer dimensions) of the reduce_sum operation are reduction operations. In the dependence graph, an edge from one operation to another indicates that one operation depends on the output of another. The intermediate values stored in registers create these dependency edges. In the reduce_sum operation, the first reduction (Addition) accumulates products into partial sums and its output has fewer values than its input, and the reduction step (second reduction) depends on Addition’s output and collapses the partial sums into the final result. The two dependent reductions and the multiplication are fused into a single GPU kernel).
wherein performance of the software kernel includes fewer accesses to a global memory than performing the first reduction operation and the second reduction operation in separate software kernels (Majnemer, see at least [0038]; [0049], The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel. In this example, intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed. Reducing memory accesses off-chip reduces memory bandwidth usage, which can be a source of latency for processing these operations (e.g., the cause of a computation bottleneck); [0035]; [0079]; [0080], The process includes determining (608), based on the applying, that the computing cost is reduced by fusing at least two operations of the set of operations when the fused operations are executed by the matrix multiplication circuitry. The process 600 includes fusing (610) the at least two operations of the set of operations; Note that the addition, multiplication, and reduction steps of a reduce_sum operation are combined into the single operation, that is, the two dependent reductions and multiplication are fused into a single GPU kernel; Note that off-chip (global) memory traffic between dependent reductions is reduced by the fusion with the GPU registers for the intermediate values).
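The mapped limitation (fewer global memory accesses when the dependent reductions are fused into one kernel rather than run as separate kernels) can be sketched with a simple, hypothetical model; the `GlobalMemory` class and both kernel functions below are assumptions made for illustration, not code from Majnemer or the claims.

```python
# Illustrative model only: two dependent reductions executed as separate
# kernels round-trip their intermediate partial sums through a simulated
# "global memory", while the fused kernel keeps them in local variables
# (standing in for GPU registers).

class GlobalMemory:
    def __init__(self):
        self.accesses = 0
    def write(self, data):
        self.accesses += len(data) if isinstance(data, list) else 1
        return data
    def read(self, data):
        self.accesses += len(data) if isinstance(data, list) else 1
        return data

def separate_kernels(rows, mem):
    # Kernel 1 (first reduction): sum each row, write partial sums out.
    partial = mem.write([sum(r) for r in rows])
    # Kernel 2 (second reduction): read partial sums back, collapse them.
    return mem.write(sum(mem.read(partial)))

def fused_kernel(rows, mem):
    # Both reductions in one kernel: partial sums never leave "registers";
    # only the final scalar is written to global memory.
    total = 0
    for r in rows:
        total += sum(r)      # first reduction, kept local
    return mem.write(total)  # single write of the final result
```

For a 2x2 input, the separate-kernel path performs five simulated global memory accesses while the fused path performs one, mirroring the reduced off-chip traffic described in Majnemer [0049].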
Majnemer does not explicitly teach "A vehicle, comprising: a computer vision system and one or more of a propulsion system, a directional control system, and a vehicle operator notification system to perform one or more actions based, at least in part, on the identified one or more objects." Kursar teaches a vehicle, comprising: a computer vision system and one or more of a propulsion system, a directional control system, and a vehicle operator notification system to perform one or more actions based, at least in part, on the identified one or more objects (Kursar, see at least [0107], The vehicle 100 can include a propulsion system 141, a braking system 142, a steering system 143, throttle system 144, a transmission system 145, a signaling system 146, and/or a navigation system 147; [0109], processors …the vision system…various vehicle systems…to control the movement…direction). It would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Kursar’s vehicle system environment with Majnemer’s kernel fusion to modify Majnemer’s system to incorporate the vehicle environment as taught by Kursar, with a reasonable expectation of success, since they are analogous art from the same field of endeavor related to data processing and execution. Combining Kursar’s functionality with that of Majnemer results in a system that incorporates a machine vision model of a vehicle, together with its propulsion, signaling, and directional systems, into Majnemer’s kernel fusion. 
The modification would be obvious because one having ordinary skill in the art would be motivated to make this combination in order to execute the kernel fusion system in various target environments, including a vehicle with its associated systems (Kursar, see at least [0107], The vehicle 100 can include a propulsion system 141, a braking system 142, a steering system 143, throttle system 144, a transmission system 145, a signaling system 146, and/or a navigation system 147; [0109], processors …the vision system…various vehicle systems…to control the movement…direction).
Per claims 32 and 33, these are the vehicle versions of claims 8 and 11; therefore, they are rejected in connection with the rejections of claims 8 and 11 above.
35. The vehicle of claim 31, wherein the two or more reduction operations have been combined into the software kernel based, at least in part, on one or more threads assigned to elements of tensors used by the two or more reduction operations (Majnemer, see at least [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. …The compiler is configured to recognize this repetition and fuse (e.g., combine) the repetitive operations of the calculation into a single operation; [0015], fusing the operations comprises concatenation of two tensors by multi-dimensional broadcasting; [0037] the compiler 102 can import or generate a dependence graph representing the dependence of operations of the intermediate representation 104 on one another. The dependence graph specifies which operations that are performed depend on the results of other operations; [0049], how the operations are combined is dependent on what the processing capabilities are of the GPU, for example which processing resources are available for use and how they are connected. …the intermediate representation 104 can be loaded into the compiler 102 at run time to determine how best to combine operations, the compiler 102 can run as JIT compiler. For example, the compiler 102 can receive bytecode representing tensor flow graphs. The compiler 102 can optimize those graphs (if possible) using target-independent fusions 112; [0038], A subset of the activation values are stored in memory local to a processing unit and repeatedly summed by that processing unit with portions of the array of parameter values. 
Each processing unit can be assigned a different subset of the activation values, and their results can be broadcast to other processing units after parallel summing occurs; [0042], the compiler 102 is configured to introduce parallel loading of data into the processing units (if the hardware architecture is configured to do so) to further optimize execution of the instruction set. Parallel loading can be introduced for parallel processing or for pipelining the processing of data by the multi-dimensional array of processing units; [0046]; [0066]).
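The parallel summing mapped above (each processing unit assigned a different subset of values, summing in parallel, then broadcasting results) can be sketched as follows; this is a hedged, illustrative sketch in which the function name, chunking scheme, and thread count are assumptions, not code from Majnemer.

```python
# Hedged sketch: each worker thread is assigned a contiguous subset of a
# tensor's elements, the subsets are summed in parallel, and the
# per-thread partial sums are then combined, in the spirit of the
# parallel summing across processing units in Majnemer [0038]/[0042].
from concurrent.futures import ThreadPoolExecutor

def parallel_reduce_sum(tensor_elements, num_threads=4):
    if not tensor_elements:
        return 0
    # One contiguous chunk of elements per assigned thread.
    chunk = (len(tensor_elements) + num_threads - 1) // num_threads
    chunks = [tensor_elements[i:i + chunk]
              for i in range(0, len(tensor_elements), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partial_sums = list(pool.map(sum, chunks))  # parallel reductions
    return sum(partial_sums)  # combine the partial results
```

The kernel-level instructions to be performed in parallel correspond here to the per-chunk `sum` calls executed concurrently by the assigned threads.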
36. The vehicle of claim 31, wherein the two or more reduction operations have been combined into the software kernel based, at least in part, on an input graph that represents operations using a neural network (Majnemer, see at least [0027], a computation graph…represents neural network computations as computation graphs; [0049], the compiler 102 can receive bytecode representing tensor flow graphs …optimize those graphs … neural network computations).
Claim 34 is rejected under 35 U.S.C. 103 as being unpatentable over Majnemer in view of Kursar and Raiman (US10901715).
Per claim 34:
Majnemer teaches: wherein the two or more reduction operations have been combined with one or more elementwise operations into the software kernel by the compiler (Majnemer, see at least [0049] The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel; [0070] to perform element-wise multiplication, a cell may receive a value at the sum in register 406 and the received value may be output to an adjacent cell, i.e., without the summation circuitry 412 adding the product to the received value. The cell may also provide the product produced by the multiplication circuitry 410, corresponding to an element-wise multiplication by the cell, to an adjacent cell without summing the product and a value received at the sum in register 406. The result is that the array 306 can provide output vectors corresponding to element-wise multiplication of activation inputs and weights).
Majnemer does not explicitly teach one or more copy operations. Raiman teaches copy operations (Raiman, see at least fig. 4 and associated texts, copy operations). It would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Raiman’s copy operation with Majnemer’s kernel fusion, as combined with Kursar, to modify Majnemer’s system to combine the copy operations as taught by Raiman, with a reasonable expectation of success, since they are analogous art from the same field of endeavor related to data processing and execution. Combining Raiman’s functionality with that of Majnemer and Kursar results in a system that incorporates copy operations into Majnemer’s kernel fusion. The modification would be obvious because one having ordinary skill in the art would be motivated to make this combination in order to combine the copy operations with other operations into a single kernel for efficient processing, reducing memory usage and avoiding repetitive operations (Raiman, see at least fig. 4 and associated texts, copy operations).
Examiner’s Note
The Examiner has pointed out particular references contained in the prior art of record within the body of this action for the convenience of the Applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply. Applicant, in preparing the response, should consider fully the entire reference as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.
Response to Arguments
Applicant's arguments filed 3/11/2026 have been fully considered but they are not persuasive.
The applicant states that Majnemer does not teach the recited limitations: nowhere does Majnemer describe how its compiler identifies which operations are to be combined. Majnemer does not describe generating respective coordinates for input tensors and output tensors of a plurality of operations of the graph representation based on backpropagation through the graph representation, and then identifying for combination, based on the generated respective coordinates, two or more reduction operations of the plurality of operations. Moreover, Majnemer fails to teach that the first reduction operation and the second reduction operation that depends on an output of the first reduction operation are both combined into a single software kernel. The only mention of combining operations into a single software kernel in Majnemer is the brief mention in [0049]. Regular addition and multiplication operations do not generate an output with fewer dimensions than an input to the operation. Moreover, just because Majnemer describes a dependence graph does not mean that any two particular operations from that graph have a dependency. For example, dependence graphs can have different branches where an operation on one branch is not dependent on an operation on another branch. The assertion in the Office Action that "combining addition, multiplication, and reduction steps of a reduce_sum operation" described in [0049] of Majnemer involves multiple dependent reduction operations is mere speculation not supported by what is actually described in the reference.
In response to applicant’s statements, Majnemer teaches that each processing unit of the multi-dimensional array of processing units receives data specifying in what order each vector of the matrix is processed. Determining the vectors, the processing units, and the order is coordinate generation, which is fundamentally the assignment of positional coordinates to tensor data across the array. The compiler references the specification, which includes a description of the array, and determines which operations to combine to optimize execution for the array described in the specification. The first dimension of the array corresponds to columns of cells and a second dimension to rows of cells. Value loaders send activation inputs to rows of the array and a weight fetcher interface sends weight inputs to columns of the array (coordinates/indices); this is clearly a coordinate assignment. The graph is a computation graph of the neural network, and the compiler generates coordinates for tensors within that graph through the specification (see at least [0009]; [0037]; [0002]; [0003]; [0038] executing a matrix multiplication computation, the compiler 102 optimizes the process of matrix multiplication for a particular configuration of the multi-dimensional array of processing units as specified in the specification 108. …depending on the size of the multi-dimensional array of processing units as specified in the specification; [0060]; [0061]). The matrix operations (matrix multiplication) are organized reductions in which each output element is a dot product, i.e., a reduction over the k dimension; the multiply step is independent, while the summation step is a reduction with register-level RAW (read-after-write) dependencies. One depends on the other’s output, and the two are combined into a single GPU kernel with intermediates stored in registers. 
Specifically, the reduce_sum operation comprises three steps (operations), multiplication, addition, and reduction, which are combined into the single operation: Multiplication (each element is multiplied; intermediate products are generated and stored in GPU registers); Addition (products are accumulated/summed into partial sums, depending on the Multiplication’s products, a RAW dependence); and finally Reduction (partial sums are collapsed into the final scalar/vector output, depending on the Addition’s sums, a RAW dependence). In a dependence graph for the reduce_sum, an edge from one operation to another indicates that one operation depends on the output of another. The intermediate values stored in registers create these dependency edges. No edge exists between two nodes if they are independent and can therefore run in parallel, while an edge between two nodes indicates that one must wait for the other (a sequential dependency). The dependence graph encodes which node must wait for which node at the register level. The intermediate representation serves as the bridge to determine which operations can be fused into a single kernel pass to eliminate memory round trips. Therefore, the fusion itself precisely indicates dependence among the operations, because the fusion is available only because of the dependence. That is, no dependence means no data flowing between operations, hence nothing to fuse. Fusion keeps the dependent dataflow inside the registers. The intermediate representation reads the dependence graph and determines whether to fuse operations into a single kernel. Therefore, the dependence graph is what makes fusion visible. When two reduction operations are fused (the first and second reductions of the reduce_sum above), each independently reduces its input dimensionality, and the fusion keeps both intermediate results in registers rather than writing them to memory (see at least [0041]; [0049]; [0079]; [0080]). 
Therefore, the first reduction is the Addition step (it accumulates products into partial sums, and its output has fewer values than its input), and the second reduction is the Reduction step, which depends on Addition’s output and collapses the partial sums into the final result. Here, the intermediate values produced during the Addition, Multiplication, and Reduction steps can be stored in GPU registers. The dependence chain runs from the input tensors through Multiplication (intermediate products), whose products are used by Addition (partial sums in registers), whose sums are used by Reduction (final output with fewer dimensions). When the two reduction operations are fused, each independently reduces its input dimensionality, and the fusion keeps both intermediate results in registers rather than writing to off-chip (global) memory, which reduces off-chip memory accesses, reduces memory bandwidth usage, and improves the processing performance of the GPU (Majnemer, see at least [0080], The process 600 includes fusing (610) the at least two operations of the set of operations; [0002], in a convolution computation, data (e.g., a tensor) of a first data set is multiplied with data of a second data set. Generally, perform this multiplication, intermediate values can be generated which are then summed into a final result. … fuse (e.g., combine) the repetitive operations of the calculation into a single operation…. The intermediate result can be stored and referenced repeatedly as needed to complete the convolution computation; [0009]; [0011]; [0037]; [0049], The specification 108 can include a description of the GPU. The target-specific instructions 116 can be configured to run on a single GPU kernel, such as combining addition, multiplication, and reduction steps of a reduce_sum operation into the single operation. In this example, the “single” operation refers to a sequence of operations that are occurring but that are combined for execution by a single GPU kernel).
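The dependence-edge reasoning above can be sketched with a tiny hypothetical dependence graph for the multiply, add, and reduce chain; the operation table and function below are assumptions for illustration, not data or code from any reference of record.

```python
# Hypothetical sketch: an edge u -> v means operation v consumes
# operation u's output (a read-after-write dependence), which is what
# makes the chain a candidate for fusion into one kernel.

def build_dependence_edges(ops):
    """ops maps name -> (input names, output name); edge u -> v exists
    when v reads u's output."""
    edges = set()
    for u, (_, out_u) in ops.items():
        for v, (ins_v, _) in ops.items():
            if out_u in ins_v:
                edges.add((u, v))
    return edges

# Assumed operation table for the reduce_sum chain discussed above.
reduce_sum_ops = {
    "multiply": (("lhs", "rhs"), "products"),
    "add":      (("products",), "partial_sums"),
    "reduce":   (("partial_sums",), "result"),
}
```

Running `build_dependence_edges(reduce_sum_ops)` yields exactly the two RAW edges multiply to add and add to reduce; no edge connects multiply directly to reduce, so the fusible chain is the sequential dependency described above.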
The applicant further states that Majnemer also fails to teach "performance of the software kernel includes fewer accesses to a global memory than performing the first reduction operation and the second reduction operation in separate software kernels," as recited in amended claims 7, 13 and 31.
In response, as stated above, Majnemer teaches the dependence graph for the fusion. The dependence chain runs from the input tensors to Multiplication (intermediate products), which are used by Addition (partial sums in registers), which are in turn used by Reduction (final output with fewer dimensions). The first reduction (Addition) accumulates products into partial sums, and its output has fewer values than its input; the second reduction (Reduction) depends on Addition's output and collapses the partial sums into the final result. The final reduction step depends directly on the output of the first reduction (the addition/accumulation step). The two dependent reductions and the multiplication are fused into a single GPU kernel, keeping all intermediate values in registers rather than writing them to global memory between steps. Therefore, Majnemer clearly eliminates off-chip (global) memory traffic between dependent reductions (Majnemer, see at least [0038]; [0049], intermediate values produced during the addition, multiplication, and reduction steps can be stored in GPU registers associated with the kernel while the computation is being performed. Reducing memory accesses off-chip reduces memory bandwidth usage, which can be a source of latency for processing these operations (e.g., the cause of a computation bottleneck); [0035]; [0079]; [0080], The process includes determining (608), based on the applying, that the computing cost is reduced by fusing at least two operations of the set of operations when the fused operations are executed by the matrix multiplication circuitry. The process 600 includes fusing (610) the at least two operations of the set of operations).
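The reduction in global-memory traffic can be illustrated with a simple accounting sketch (hypothetical counts, stated only for illustration): separate kernels must write each intermediate array to global memory and read it back in the next kernel, whereas a fused kernel touches global memory only for the inputs and the final result.

```python
# Hypothetical global-memory access counts for an n-element reduce_sum.
def global_accesses_separate(n):
    # kernel 1: read 2n inputs,      write n products
    # kernel 2: read n products,     write n partial sums
    # kernel 3: read n partial sums, write 1 final result
    return (2 * n + n) + (n + n) + (n + 1)

def global_accesses_fused(n):
    # one kernel: read 2n inputs, write 1 result; intermediates in registers
    return 2 * n + 1
```

Under these assumptions, fusing the dependent operations removes every intermediate round trip to global memory, which is the bandwidth savings the reference attributes to fusion.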
The applicant states that the claimed approach "reduces accesses to global memory and provides performance improvements in comparison to legacy approaches." Thus, the applicant argues, the claim is directed to an improvement in computer technology and a practical application. Moreover, the applicant argues that the process of generating respective coordinates for input tensors and output tensors based on backpropagation through the graph representation, identifying two or more reduction operations based on the generated respective coordinates including a second reduction operation that depends on a first output of a first reduction operation, combining the identified two or more reduction operations including the first reduction operation and the second reduction operation that depends on the first output of the first reduction operation into a single software kernel, and generating executable program code including executable program code for the single software kernel, is a specific "ordered combination" that is "significantly more" than any conventional approach.
In response, a human (developer) can certainly perform all of the steps manually: generating coordinates for tensors (e.g., [b][i][j] for each tensor element), identifying reduction operations and combining them into a single operation by evaluating/inspecting the code, and fusing the identified operations by merging them by hand on paper. A small executable program for the kernel can likewise be generated manually. Furthermore, the reduced accesses to global memory and the performance improvements are merely an intended result or goal of the mental steps, not active steps.
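For illustration of how readily such coordinates can be enumerated, the [b][i][j] coordinates of a small tensor can be listed exhaustively, exactly as a developer could write them out by hand (a hypothetical sketch; the function name is illustrative):

```python
# Hypothetical enumeration of [b][i][j] coordinates for a small tensor of
# shape (batch, rows, cols); for small shapes this is a pen-and-paper task.
def tensor_coordinates(batch, rows, cols):
    return [(b, i, j)
            for b in range(batch)
            for i in range(rows)
            for j in range(cols)]
```

For a 2 x 2 x 3 tensor this yields twelve coordinates, from (0, 0, 0) through (1, 1, 2), a list short enough to be produced mentally or on paper.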
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to INSUN KANG whose telephone number is (571)272-3724. The examiner can normally be reached M-TR 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chat Do can be reached at 571-272-3721. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/INSUN KANG/Primary Examiner, Art Unit 2193