DETAILED ACTION
This Office action is in response to Application No. 17214723 filed on 02/02/2026. Claims 1-36 are presented for examination and are currently pending. Applicant’s arguments have been carefully and respectfully considered.
Response to Arguments
It is noted that the arguments have been considered but are moot because the Examiner is withdrawing the rejections set forth in the previous Office action; Applicant’s amendment necessitated the new grounds of rejection presented in this Office action.
It is noted that arguments regarding independent claims 1, 10, 19 and 28 have been considered but are moot because new references have been added to remap independent claims 1, 10, 19 and 28.
The dependent claims 2-9, 11-18, 20-27 and 29-36, which depend directly or indirectly from independent claims 1, 10, 19 and 28, are not allowable, and Applicant’s arguments directed to them are moot for reasons similar to those given for the independent claims.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
3. Claims 1, 10, 17, 19 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020).
Regarding claim 1, Badin teaches one or more processors, comprising: circuitry to perform an operation (The method 400 may be implemented in a computing device in software executing in a processor (e.g., the processor 14 in FIGS. 1 and 2), in dedicated hardware or circuitry, or a combination of a processor and dedicated hardware, such as a processor executing software within a machine learning device [0059]; The storage memory 24 may be configured much like an embodiment of the memory 16 in which the storage memory 24 may store the data or processor-executable code for access by one or more of the processors 14 [0038]) using at least one tensor (matrix A 300 with matrix B 302 [0043]. The Examiner notes each matrix is a tensor) by at least:
identifying a first one or more portions of the at least one tensor (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as the portions 306 a-306 d of matrix A 300 [0075]. The Examiner notes matrix 300 is a tensor, and the identified first portions are 306 a, 306 b, 306 c and 306 d in Fig. 3D-3F) from a memory (Raw data may stream from the raw data source device to the memory 16 and be stored by the memory until the raw data can be received and processed by a machine learning accelerator as discussed further herein with reference to FIGS. 3 [0037]),
wherein … one or more first identified portions are dimensioned equally (first portion 306 a has equal dimensions as portion 306 b, 306 b has equal dimensions as portion 306 c and 306 c has equal dimension as portion 306 d);
identifying a second one or more portions of the at least one tensor (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as … portions 308 a-308 d of matrix B 302 [0075]. The Examiner notes matrix 302 is a tensor, and the identified second portions are 308 a, 308 b, 308 c and 308 d in Fig. 3D-3F) from the memory (Raw data may stream from the raw data source device to the memory 16 and be stored by the memory until the raw data can be received and processed by a machine learning accelerator as discussed further herein with reference to FIGS. 3 [0037]),
wherein the first one or more identified portions and the second one or more identified portions correspond to tiles that are differently shaped (tile portion 306 a of matrix A 300 and tile portion 308 a of matrix B 302 are differently shaped); and
generating one or more outputs (Fig. 3A-3F illustrate a non-limiting example of matrix multiplication according to an embodiment. This example matrix multiplication involves the multiplication, or dot product, of matrix A 300 with matrix B 302 to produce a resultant matrix 304 [0043]) by using a combination of the first one or more identified portions and the second one or more identified portions (FIG. 3C illustrates an implementation of a partial matrix multiplication using the blocks 306 a, 308 a of matrices 300, 302, respectively [0049]).
Badin does not explicitly teach wherein each of the one or more first identified portions are dimensioned equally.
Das teaches wherein each of the one or more first identified portions are dimensioned equally (a 16×16 IFM tensor may be stored in a form of four 4×4 IFM tiles. Each 4×4 IFM tile may include 16 IFM pixels of 8 bits each [0049]).
Since Badin discloses portions 306 a-306 d of matrix A 300 [0075] and matrix A represents the input data [0044], and Das teaches the IFM data stored in the memory 202 corresponding to the input … may be stored in a form of IFM tiles with equal dimensions [0049], it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Badin to incorporate the teachings of Das for the benefit of improving the processing of neural networks and, more particularly, reducing execution time and power dissipation in processing of layers in a neural network (Das [0002]).
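For illustration only, the following minimal Python/NumPy sketch shows the kind of tiled matrix multiplication the Examiner reads onto claim 1: equally dimensioned tiles of a first matrix are combined with differently shaped tiles of a second matrix through partial products (cf. Badin Figs. 3A-3F and Das [0049]). All names, tile sizes, and code details are hypothetical and are not taken from either reference.

```python
# Hypothetical sketch; not code from Badin (US20170270073) or Das
# (US20200293858). Illustrates partial (tiled) matrix multiplication in
# which every tile of the first operand has the same dimensions.
import numpy as np

def tiled_matmul(A, B, tile_m=4, tile_k=4, tile_n=8):
    """Multiply A (M x K) by B (K x N) by accumulating partial products of
    equally dimensioned tile_m x tile_k tiles of A against tile_k x tile_n
    tiles of B, which may be shaped differently from A's tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile_m == 0 and K % tile_k == 0 and N % tile_n == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile_m):
        for j in range(0, N, tile_n):
            for k in range(0, K, tile_k):
                # Partial matrix multiplication of one tile pair, then
                # accumulation into the resultant matrix.
                C[i:i+tile_m, j:j+tile_n] += (
                    A[i:i+tile_m, k:k+tile_k] @ B[k:k+tile_k, j:j+tile_n])
    return C

A = np.arange(16 * 16, dtype=np.float32).reshape(16, 16)
B = np.ones((16, 8), dtype=np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B)  # matches the full product
```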
Regarding claim 10, claim 10 is similar to claim 1 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 17, Badin and Das teach the method of claim 10. Badin further teaches where generating the one or more outputs further comprises: generating an output tensor (Fig. 3A-3F illustrate a non-limiting example of matrix multiplication according to an embodiment. This example matrix multiplication involves the multiplication, or dot product, of matrix A 300 with matrix B 302 to produce a resultant matrix 304 [0043]).
Regarding claim 19, Badin teaches a system, comprising a memory (The computing device 10 may include a system-on-chip (SoC) 12 with a processor 14, a memory 16 [0033]); and
one or more processors, the one or more processors operatively coupled with the
memory, to perform an operation (The method 400 may be implemented in a computing device in software executing in a processor (e.g., the processor 14 in FIGS. 1 and 2), in dedicated hardware or circuitry, or a combination of a processor and dedicated hardware, such as a processor executing software within a machine learning device [0059]; The storage memory 24 may be configured much like an embodiment of the memory 16 in which the storage memory 24 may store the data or processor-executable code for access by one or more of the processors 14 [0038])
using at least one tensor (matrix A 300 with matrix B 302 [0043]. The Examiner notes each matrix is a tensor) comprising:
identifying, from a memory (Raw data may stream from the raw data source device to the memory 16 and be stored by the memory until the raw data can be received and processed by a machine learning accelerator as discussed further herein with reference to FIGS. 3 [0037]),
first one or more portions to obtain a first output (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as the portions 306 a-306 d of matrix A 300 [0075]. The Examiner notes matrix 300 is a tensor, and the identified first portions are 306 a, 306 b, 306 c and 306 d in Fig. 3D-3F),
wherein … one or more first identified portions are dimensioned equally (first portion 306 a has equal dimensions as portion 306 b, 306 b has equal dimensions as portion 306 c and 306 c has equal dimension as portion 306 d);
identifying, from the memory (Raw data may stream from the raw data source device to the memory 16 and be stored by the memory until the raw data can be received and processed by a machine learning accelerator as discussed further herein with reference to FIGS. 3 [0037]),
a second one or more portions to obtain a second output (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as … portions 308 a-308 d of matrix B 302 [0075]. The Examiner notes matrix 302 is a tensor, and the identified second portions are 308 a, 308 b, 308 c and 308 d in Fig. 3D-3F)
the second one or more identified portions correspond to tiles that are differently shaped from the first one or more identified portions (tile portion 306 a of matrix A 300 and tile portion 308 a of matrix B 302 are differently shaped); and
generating one or more outputs (Fig. 3A-3F illustrate a non-limiting example of matrix multiplication according to an embodiment. This example matrix multiplication involves the multiplication, or dot product, of matrix A 300 with matrix B 302 to produce a resultant matrix 304 [0043]) by using a combination of the first one or more identified portions and the second one or more identified portions (FIG. 3C illustrates an implementation of a partial matrix multiplication using the blocks 306 a, 308 a of matrices 300, 302, respectively [0049]).
Badin does not explicitly teach wherein each of the one or more first identified portions are dimensioned equally.
Das teaches wherein each of the one or more first identified portions are dimensioned equally (a 16×16 IFM tensor may be stored in a form of four 4×4 IFM tiles. Each 4×4 IFM tile may include 16 IFM pixels of 8 bits each [0049]).
Since Badin discloses portions 306 a-306 d of matrix A 300 [0075] and matrix A represents the input data [0044], and Das teaches the IFM data stored in the memory 202 corresponding to the input … may be stored in a form of IFM tiles with equal dimensions [0049], it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Badin to incorporate the teachings of Das for the benefit of improving the processing of neural networks and, more particularly, reducing execution time and power dissipation in processing of layers in a neural network (Das [0002]).
Regarding claim 26, claim 26 is similar to claim 17 and is rejected in the same manner, with the same reasoning applying.
4. Claims 2, 8, 11, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020) and further in view of Nurvitadhi et al. (US20180189638).
Regarding claim 2, Badin and Das teach the one or more processors of claim 1. Badin teaches wherein the first one or more identified portions are identified (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as the portions 306 a-306 d of matrix A 300 [0075]. The Examiner notes matrix 300 is a tensor, and the identified first portions are 306 a, 306 b, 306 c and 306 d in Fig. 3D-3F)
wherein the first one or more identified portions correspond to one or more tiles of predetermined dimensions (tile portion 306 a of matrix A 300 and tile portion 308 a of matrix B 302 are of predetermined dimensions),
Badin and Das do not explicitly teach a first technique, wherein the first technique is to cause the circuitry to: load data from a plurality of cells corresponding to the first one or more identified portions into a cache, and wherein the cache is accessible by one or more processing threads of the one or more processors.
Nurvitadhi teaches wherein the first one or more identified portions are identified using a first technique (FIG. 15a highlights paths (using dotted lines) for spMspV_csc [0157]; In particular, the processing elements 901-N may include hardware support for column and row-oriented matrix processing [0135]. The Examiner notes that CSC is a compressed sparse column [0139] and spMspV_csc as first technique),
wherein the first technique (The Examiner notes that CSC is a compressed sparse column [0139] and spMspV_csc as first technique) is to cause the circuitry to: load data from a plurality of cells corresponding to the first one or more identified portions into a cache (For example, the cache coherent interface 930 may implement a cache coherency protocol to ensure that data accessed/modified by the accelerator 900 and stored in the accelerator cache 907 [0132]),
and wherein the cache is accessible by one or more processing threads of the one or more processors (In one embodiment, coherency is maintained between one or more cache units 3706 and cores 3702-A-N [0277]; In some embodiments, one or more of the cores 3702A-N are capable of multithreading [0278]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Badin and Das to incorporate the teachings of Nurvitadhi for the benefit of a processor that achieves extreme efficiencies and efficiently converts the available memory bandwidth provided to it into performance for matrix operations (Nurvitadhi [0154]).
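For illustration only, a minimal sketch, assuming SciPy's compressed sparse column (CSC) container, of a column-oriented first technique that loads the cells of one identified portion into a buffer that worker threads could then share; this is hypothetical code and not Nurvitadhi's accelerator hardware.

```python
# Hypothetical sketch; not Nurvitadhi's (US20180189638) accelerator.
# A column-oriented (CSC) technique loads the cells of one identified
# column portion into a buffer, loosely analogous to an accelerator cache.
import numpy as np
from scipy.sparse import csc_matrix  # CSC = compressed sparse column

A = np.array([[0., 2., 0.],
              [1., 0., 3.],
              [0., 4., 0.]])
A_csc = csc_matrix(A)

col = 1
start, end = A_csc.indptr[col], A_csc.indptr[col + 1]
buffer_rows = A_csc.indices[start:end]  # row indices of column 1's nonzeros
buffer_vals = A_csc.data[start:end]     # their values: array([2., 4.])
```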
Regarding claim 8, Badin and Das teach the one or more processors of claim 1. Badin teaches wherein the first one or more identified portions are identified (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as the portions 306 a-306 d of matrix A 300 [0075]. The Examiner notes matrix 300 is a tensor, and the identified first portions are 306 a, 306 b, 306 c and 306 d in Fig. 3D-3F)
wherein the second one or more identified portions are identified (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as … portions 308 a-308 d of matrix B 302 [0075]. The Examiner notes matrix 302 is a tensor, and the identified second portions are 308 a, 308 b, 308 c and 308 d in Fig. 3D-3F)
wherein generating one or more outputs (Fig. 3A-3F illustrate a non-limiting
example of matrix multiplication according to an embodiment. This example matrix multiplication involves the multiplication, or dot product, of matrix A 300 with matrix B 302 to produce a resultant matrix 304 [0043]) by using a combination of the first one or more identified portions and the second one or more identified portions (FIG. 3C illustrates an implementation of a partial matrix multiplication using the blocks 306 a, 308 a of matrices 300, 302, respectively [0049]) comprises:
generating a first result (The Examiner notes matrix 300 is a tensor, and the identified first portions are 306 a, 306 b, 306 c and 306 d in Fig. 3D-3F)
generating a second result (The Examiner notes matrix 302 is a tensor, and the identified second portions are 308 a, 308 b, 308 c and 308 d in Fig. 3D-3F)
and generating an output tensor (Fig. 3A-3F illustrate a non-limiting example of matrix multiplication according to an embodiment. This example matrix multiplication involves the multiplication, or dot product, of matrix A 300 with matrix B 302 to produce a resultant matrix 304 [0043]) by combining the first result and the second result (FIG. 3C illustrates an implementation of a partial matrix multiplication using the blocks 306 a, 308 a of matrices 300, 302, respectively [0049]).
Badin and Das do not explicitly teach using a first technique and a second technique.
Nurvitadhi teaches wherein the first one or more identified portions are identified using a first technique (FIG. 15a highlights paths (using dotted lines) for spMspV_csc [0157]; In particular, the processing elements 901-N may include hardware support for column and row-oriented matrix processing [0135]. The Examiner notes that CSC is a compressed sparse column [0139] and spMspV_csc as first technique),
wherein the second one or more identified portions are identified using a second technique (FIG. 15b illustrates paths for a spMdV_csr operation [0157]; In particular, the processing elements 901-N may include hardware support for column and row-oriented matrix processing [0135]; The Examiner notes that CSR is a compressed sparse row [0139] and spMdV_csr as second technique),
comprises: generating a first result (The Examiner notes the Output Buf as the first result from Fig. 15a)
using the first technique (In particular, FIG. 15a highlights paths (using dotted lines) for spMspV_csc [0157]; The Examiner notes that CSC is a compressed sparse column [0139] and spMspV_csc as first technique);
generating a second result (The Examiner notes the Output Buf as the second result from Fig. 15b)
using the second technique (FIG. 15b illustrates paths for a spMdV_csr operation [0157]. The Examiner notes CSR is a compressed sparse row [0139] and spMdV_csr as second technique); and
generating an output tensor by combining the first result and the second result (The Examiner notes the arrows from output buffers from each PEs are the first and second results which are combined in the Reduction Unit 1404 of accelerator 900, Fig. 14).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Badin and Das to incorporate the teachings of Nurvitadhi for the benefit of a processor that achieves extreme efficiencies and efficiently converts the available memory bandwidth provided to it into performance for matrix operations (Nurvitadhi [0154]).
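For illustration only, a minimal sketch of generating a first result with a column-oriented (CSC) technique and a second result with a row-oriented (CSR) technique, then combining the two results into one output; the row split and all other details are hypothetical and do not reflect the accelerator of Nurvitadhi's Figs. 14-15.

```python
# Hypothetical sketch; not the accelerator of Nurvitadhi's Figs. 14-15.
# A first result is generated with a column-oriented (CSC) technique, a
# second result with a row-oriented (CSR) technique, and the two results
# are combined into one output.
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

A = np.array([[0., 2., 0.],
              [1., 0., 3.],
              [0., 4., 0.]])
x = np.array([1., 1., 1.])

A_top, A_bot = A[:2, :], A[2:, :]
first_result = csc_matrix(A_top) @ x   # first technique (CSC traversal)
second_result = csr_matrix(A_bot) @ x  # second technique (CSR traversal)

# Combine the two partial results into the output tensor.
output = np.concatenate([first_result, second_result])
assert np.allclose(output, A @ x)
```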
Regarding claim 11, claim 11 is similar to claim 2 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 20, claim 20 is similar to claim 2 and is rejected in the same manner, with the same reasoning applying.
5. Claims 3-5, 12-14 and 21-23 are rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020) in view of Nurvitadhi et al. (US20180189638) and further in view of Woolley Jr. et al. (US20160162402, hereinafter “Woolley”).
Regarding claim 3, Badin, Das and Nurvitadhi teach the one or more processors of claim 2. Badin, Das and Nurvitadhi do not explicitly teach wherein the operation comprises one or more convolution operations between a plurality of cell patches in the plurality of cells and a convolutional filter, each portion of the operation comprises the convolution operations for each cell, and each portion of the operation is performed by a separate processing thread of the one or more processing threads.
Woolley teaches wherein the operation comprises one or more convolution operations between a plurality of cell patches in the plurality of cells and a convolutional filter, each portion of the operation comprises the convolution operations for each cell (each of the thread groups configures the floating-point unit to perform matrix multiplication operations between the assigned image tile 542 and the corresponding filter tile 544 [0096]. The Examiner notes image tile 542 includes a plurality of cell patches), and
each portion of the operation is performed by a separate processing thread of the one or more processing threads (The convolution engine divides the virtual image matrix into separate image tiles and then assigns the processing of each image tile to a different thread group [0107]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin, Das and Nurvitadhi to incorporate the teachings of Woolley so that the amount of parallel processing memory used can be dramatically reduced (Woolley [0011]).
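For illustration only, a minimal sketch of dividing an image matrix into separate tiles and assigning the matrix multiplication for each tile to a different worker thread, loosely analogous to the thread groups of Woolley [0107]; the thread-pool mechanism and all shapes are hypothetical.

```python
# Hypothetical sketch; not Woolley's (US20160162402) GPU thread groups.
# An image matrix is divided into separate tiles and the matrix
# multiplication for each tile is assigned to a different thread.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

image = np.random.rand(8, 4)  # virtual image matrix (e.g., im2col form)
filt = np.random.rand(4, 3)   # filter tile
tiles = [image[i:i+2, :] for i in range(0, 8, 2)]  # separate image tiles

with ThreadPoolExecutor() as pool:
    # Each image tile is processed by a separate worker thread.
    results = list(pool.map(lambda tile: tile @ filt, tiles))

output = np.vstack(results)   # per-tile contributions to the output matrix
assert np.allclose(output, image @ filt)
```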
Regarding claim 4, Badin, Das, Nurvitadhi and Woolley teach the one or more processors of claim 3. Woolley teaches wherein the convolutional filter (FIG. 4 illustrates an image batch 410, a filter stack 440, and an output batch 470 associated with a multi-convolution operation [0062])
comprises a plurality of weight values for a convolution layer of a convolutional neural network model (Consequently, the optimized performance of the matrix-based multi-convolution operation is relatively consistent across changes in the values of individual parameters [0072]; The multi-convolution operation corresponds to the predominant calculation involved in executing a particular convolution layer included in a CNN [0062]. The Examiner notes parameters to imply weights).
The same motivation to combine as applied to dependent claim 3 applies here.
Regarding claim 5, Badin, Das and Nurvitadhi teach the processor of claim 2. Badin, Das and Nurvitadhi do not explicitly teach loading padding elements or zero-padding cells corresponding to the one of the first one or more portions into the cache.
Woolley teaches wherein the circuitry is further to: load padding elements or zero-padding cells corresponding to the one of the first one or more portions into the cache (For example, in some embodiments, the parameters 465 may include a padding height and a padding width. The padding height and the padding width append, respectively, rows of zeros and columns of zeros to output images 480 included in the output batch 470 for any technical reason, such as formatting for future operations [0066]; The L1 cache 384 supports, among other things, load and store operations performed by the execution units [0054]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin, Das and Nurvitadhi to incorporate the teachings of Woolley so that the amount of parallel processing memory used can be dramatically reduced (Woolley [0011]).
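For illustration only, a minimal sketch of appending rows and columns of zeros (a padding height and a padding width, cf. Woolley [0066]) to a tile before it is staged into a cache-resident working buffer; the shapes and names are hypothetical.

```python
# Hypothetical sketch; not Woolley's implementation. Rows and columns of
# zeros (padding height and padding width) are appended to a tile before
# it is staged into a working buffer.
import numpy as np

tile = np.arange(6, dtype=np.float32).reshape(2, 3)
pad_h, pad_w = 1, 2  # padding height and padding width

# Append pad_h rows of zeros and pad_w columns of zeros.
padded = np.pad(tile, ((0, pad_h), (0, pad_w)), mode="constant")
assert padded.shape == (2 + pad_h, 3 + pad_w)
```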
Regarding claim 12, claim 12 is similar to claim 3 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 13, claim 13 is similar to claim 4 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 14, claim 14 is similar to claim 5 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 21, Badin, Das and Nurvitadhi teach the system of claim 20. Badin, Das and Nurvitadhi do not explicitly teach wherein the operation comprises: performing convolution operations between a plurality of cell patches in the plurality of cells and a convolutional filter to generate the first output, wherein the convolution operations for each cell in the first output is performed by a separate processing thread of the one or more processing threads.
Woolley teaches wherein the operation comprises: performing convolution operations between a plurality of cell patches in the plurality of cells and a convolutional filter to generate the first output (The pipeline then performs matrix multiplication operations between the image tile and a filter tile to generate a contribution of the image tile to an output matrix (abstract); The Examiner notes the output matrix (or output tensor) is the first tiled technique output of the operation between image tile and a filter tile),
wherein the convolution operations for each cell in the first output is performed by a separate processing thread of the one or more processing threads (The convolution engine divides the virtual image matrix into separate image tiles and then assigns the processing of each image tile to a different thread group [0107]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin, Das and Nurvitadhi to incorporate the teachings of Woolley so that the amount of parallel processing memory used can be dramatically reduced (Woolley [0011]).
Regarding claim 22, claim 22 is similar to claim 4 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 23, claim 23 is similar to claim 5 and is rejected in the same manner, with the same reasoning applying.
6. Claims 6, 15, 24 are rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020) in view of Nurvitadhi et al. (US20180189638) and further in view of Nagy et al. (US20210241082 filed 04/09/2019).
Regarding claim 6, Badin, Das and Nurvitadhi teach the one or more processors of claim 2. Nurvitadhi teaches wherein the circuitry is further to (FIG. 10 illustrates another view of accelerator 900 and other components previously described including a data management unit 905, a plurality of processing elements 901-N [0135], Fig. 9 and 10):
access a second one or more identified portions of the at least one tensor from the memory (For the spMdV_csr, the x vector subset is loaded in to the PE's RAM 1421. DMU 905 streams in matrix row elements (i.e., {A.val,A.idx} pairs) from memory [0159]).
Badin, Das and Nurvitadhi do not explicitly teach: make a determination that the second one or more identified portions have smaller dimensions than the predetermined dimensions; and access the second one or more identified portions of the at least one tensor from the memory responsive to the determination.
Nagy teaches make a determination that the second one or more identified portions have smaller dimensions than the predetermined dimensions (The stride value may also be used to determine whether the result store 439 may store results of more than one output area, … the size of the resulting output area will be smaller than the size of corresponding input areas [0168]); and
access the second one or more identified portions of the at least one tensor from the memory responsive to the determination (For hardware resources, M may determine the amount of parallel processing elements 418 and R may determine the physical capacity of the result store 439 in each processing element 418 [0162]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin, Das, Nurvitadhi to incorporate the teachings of Nagy for the benefit of accessing and processing of data, such as by applying convolution operations, which may be flexibly adapted to various scenarios, which efficiently exploits available resources, and handles bandwidth issues with regard to exchange of input and output data processed by the accelerator apparatus (Nagy [0007]).
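For illustration only, a minimal sketch of determining that portions identified at a tensor's edges have smaller dimensions than the predetermined tile dimensions and accessing them responsive to that determination; the sizes and the contiguous-copy access are hypothetical and are not Nagy's hardware design.

```python
# Hypothetical sketch; not Nagy's (US20210241082) design. Portions at a
# tensor's edges are determined to have smaller dimensions than the
# predetermined tile dimensions and are accessed responsive to that
# determination.
import numpy as np

tensor = np.random.rand(10, 10)
tile_h, tile_w = 4, 4  # predetermined tile dimensions

edge_tiles = []
for i in range(0, tensor.shape[0], tile_h):
    for j in range(0, tensor.shape[1], tile_w):
        tile = tensor[i:i+tile_h, j:j+tile_w]
        # Determination: this identified portion is smaller than the
        # predetermined dimensions (true along the right/bottom edges).
        if tile.shape != (tile_h, tile_w):
            # Access the smaller portion responsive to the determination.
            edge_tiles.append(np.ascontiguousarray(tile))

assert len(edge_tiles) == 5  # a 10x10 tensor yields five partial 4x4 tiles
```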
Regarding claim 15, claim 15 is similar to claim 6 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 24, claim 24 is similar to claim 6 and is rejected in the same manner, with the same reasoning applying.
7. Claims 7, 16, 25 are rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020) in view of Nurvitadhi et al. (US20180189638) and further in view of Abdelaziz et al. (US20210326686 filed 06/12/2020).
Regarding claim 7, Badin and Das teach the one or more processors of claim 1. Badin teaches wherein the second one or more identified portions are identified (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as … portions 308 a-308 d of matrix B 302 [0075]. The Examiner notes matrix 302 is a tensor, and the identified second portions are 308 a, 308 b, 308 c and 308 d in Fig. 3D-3F).
Badin and Das do not explicitly teach using a second technique.
Nurvitadhi teaches wherein the second technique is to cause the circuitry to (FIG. 15b illustrates paths for a spMdV_csr operation [0157]; The Examiner notes that CSR is a compressed sparse row [0139]):
Badin, Das and Nurvitadhi do not explicitly teach: apply a transformation to one or more blocks of the at least one tensor to unroll the one or more blocks into at least one two-dimensional (2D) tensor; load the at least one 2D tensor into a cache; and perform, using data in the cache, a convolution operation between a convolutional filter and the at least one tensor, via general matrix multiply, to obtain a second output.
Abdelaziz teaches apply a transformation to one or more blocks of the at least one tensor to unroll the one or more blocks (In one embodiment, the PE units are configured to unroll in one or more selected dimensions of the input feature map, convolution kernel, and/or output feature map, and perform parallel dot-product computations in the unrolled dimension(s) [0028])
into at least one two-dimensional (2D) tensor (In one embodiment, the 2D array of PE units is invoked for performing dot-product computations for computing a layer of the neural network, such as a convolution layer of the CNN, in parallel [0028]; In one embodiment, the size of register files per PE (RF/PE) is reduced by unrolling in one or two of the kernel dimensions [0032]);
load the at least one 2D tensor into a cache (For example, input a0 is fed to the PE units in row 412 of the tile, input a1 is fed to the PE units in row 414 of the tile, and input a15 is fed to the PE units in row 416 of the tile [0060]; In the weight stationary architecture, given that the weight data may be preloaded into the register files of the various PE units [0032]. The Examiner notes register in the PE unit as cache); and
perform, using data in the cache, a convolution operation between a convolutional filter and the at least one tensor, via general matrix multiply, to obtain a second output (In performing the convolution computation during the first processing cycle, inputs a0-a15 are multiplied with weight B0 of kernel column 406 (in the various input channels), which is stored in the PE units of column 400 of the tile, for generating a partial sum of output pixel a0 [0060]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin, Das and Nurvitadhi to incorporate the teachings of Abdelaziz for the benefit of performing parallel dot-product computations in the unrolled dimension(s), which may thus accelerate the computation of the neural network layer (Abdelaziz [0028]).
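For illustration only, a minimal sketch of unrolling blocks of a tensor into a two-dimensional tensor (a generic im2col transformation) and performing the convolution via a general matrix multiply; the helper name im2col and all shapes are hypothetical and are not Abdelaziz's PE-array implementation.

```python
# Hypothetical sketch; a generic im2col transformation, not Abdelaziz's
# (US20210326686) PE-array implementation. Blocks of the input tensor are
# unrolled into a 2-D tensor and the convolution is performed via a
# general matrix multiply (GEMM).
import numpy as np

def im2col(x, kh, kw):
    """Unroll every kh x kw block of a 2-D input into a row of a 2-D tensor."""
    H, W = x.shape
    rows = [x[i:i+kh, j:j+kw].ravel()
            for i in range(H - kh + 1)
            for j in range(W - kw + 1)]
    return np.stack(rows)  # shape: (out_h * out_w, kh * kw)

x = np.arange(16, dtype=np.float32).reshape(4, 4)  # the at least one tensor
k = np.ones((3, 3), dtype=np.float32)              # convolutional filter

unrolled = im2col(x, 3, 3)                   # the at least one 2-D tensor
out = (unrolled @ k.ravel()).reshape(2, 2)   # convolution via matrix multiply
```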
Regarding claim 16, claim 16 is similar to claim 7 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 25, claim 25 is similar to claim 7 and is rejected in the same manner, with the same reasoning applying.
8. Claims 9, 18, 27 are rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020) and further in view of Woolley Jr. et al. (US20160162402, hereinafter “Woolley”).
Regarding claim 9, Badin and Das teach the one or more processors of claim 1. Badin teaches a tensor (matrix A 300 with matrix B 302 [0043]. The Examiner notes each matrix is a tensor).
Badin and Das do not explicitly teach wherein the tensor represents one of: an input layer of a convolutional neural network or a feature map of an inner layer of the convolutional neural network.
Woolley teaches wherein the tensor (In operation, the SM 310 executes the matrix multiplication operations on sub-matrices, referred to herein as tiles, of the image batch [0061]; One or more portions of the shared memory 382 are shared amongst the threads in a CTA [0054]; Advantageously, only a portion of the image matrix is stored in the shared memory 382 at any given time [0061]. The Examiner notes tensors are matrices)
represents one of: an input layer of a convolutional neural network or a feature map of an inner layer of the convolutional neural network (In the context of FIG. 4, the streaming multiprocessor (SM) 310 is configured to perform a multi-convolution operation between the image batch 410 and the filter stack 440 to produce the output batch 470. The multi-convolution operation corresponds to the predominant calculation involved in executing a particular convolution layer included in a CNN [0062]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin and Das to incorporate the teachings of Woolley so that the amount of parallel processing memory used can be dramatically reduced (Woolley [0011]).
Regarding claim 18, claim 18 is similar to claim 9 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 27, claim 27 is similar to claim 9 and is rejected in the same manner, with the same reasoning applying.
9. Claims 28 and 36 are rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020) and further in view of Theverapperuma et al. (US20220024485 filed 07/24/2020).
Regarding claim 28, Badin teaches an operation using at least one tensor as an operand by at least (matrix A 300 with matrix B 302 [0043]; the data of the matrices 300 … may be formatted as floating point data [0048]. The Examiner notes each matrix is a tensor and floating point data as an operand):
identifying a first one or more portions of the at least one tensor (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as the portions 306 a-306 d of matrix A 300 [0075]. The Examiner notes matrix 300 is a tensor, and the identified first portions are 306 a, 306 b, 306 c and 306 d in Fig. 3D-3F) from a memory (Raw data may stream from the raw data source device to the memory 16 and be stored by the memory until the raw data can be received and processed by a machine learning accelerator as discussed further herein with reference to FIGS. 3 [0037]),
wherein … one or more first identified portions are dimensioned equally (first portion 306 a has equal dimensions as portion 306 b, 306 b has equal dimensions as portion 306 c and 306 c has equal dimension as portion 306 d);
identifying a second one or more portions of the at least one tensor (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as … portions 308 a-308 d of matrix B 302 [0075]. The Examiner notes matrix 302 is a tensor, and the identified second portions are 308 a, 308 b, 308 c and 308 d in Fig. 3D-3F) from the memory (Raw data may stream from the raw data source device to the memory 16 and be stored by the memory until the raw data can be received and processed by a machine learning accelerator as discussed further herein with reference to FIGS. 3 [0037]),
wherein the first one or more identified portions and the second one or more identified portions correspond to tiles that are differently shaped (tile portion 306 a of matrix A 300 and tile portion 308 a of matrix B 302 are differently shaped); and
generating one or more outputs (Fig. 3A-3F illustrate a non-limiting example of matrix multiplication according to an embodiment. This example matrix multiplication involves the multiplication, or dot product, of matrix A 300 with matrix B 302 to produce a resultant matrix 304 [0043]) by using a combination of the first one or more identified portions and the second one or more identified portions (FIG. 3C illustrates an implementation of a partial matrix multiplication using the blocks 306 a, 308 a of matrices 300, 302, respectively [0049]).
Badin does not explicitly teach wherein each of the one or more first identified portions are dimensioned equally, an autonomous vehicle comprising: one or more cameras; and one or more processors operatively coupled to the one or more cameras to capture at least one image, and the one or more processors to use one or more trained neural networks to process at least one tensor corresponding to the at least one image.
Das teaches wherein each of the one or more first identified portions are dimensioned equally (a 16×16 IFM tensor may be stored in a form of four 4×4 IFM tiles. Each 4×4 IFM tile may include 16 IFM pixels of 8 bits each [0049]).
Since Badin discloses portions 306 a-306 d of matrix A 300 [0075] and matrix A represents the input data [0044], and Das teaches the IFM data stored in the memory 202 corresponding to the input … may be stored in a form of IFM tiles with equal dimensions [0049], it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Badin to incorporate the teachings of Das for the benefit of improving the processing of neural networks and, more particularly, reducing execution time and power dissipation in processing of layers in a neural network (Das [0002]).
Badin and Das do not explicitly teach an autonomous vehicle comprising: one or more cameras; and one or more processors operatively coupled to the one or more cameras to capture at least one image, and the one or more processors to use one or more trained neural networks to process at least one tensor corresponding to the at least one image.
Theverapperuma teaches an autonomous vehicle (FIG. 1A is a simplified block diagram of an autonomous vehicle incorporating a controller system (referred to herein as an autonomous vehicle management system (AVMS)) according to certain embodiments [0014]) comprising:
one or more cameras (In certain embodiments, values of a set of features extracted from at least one camera image are processed to identify an edge represented in the at least one camera image, where the identified edge corresponds to an edge of an object in a physical environment or an edge of a drivable surface [0011]); and
one or more processors operatively coupled to the one or more cameras to capture at least one image (Each of the sensors in FIG. 4 is communicatively coupled to a respective pre-processing unit in the pre-processing subsystem 410. For example, camera 402 may be configured to provide image data to a pre-processing unit 412, camera 404 may be configured to provide image data to a pre-processing unit 414 [0091]), and
the one or more processors to use one or more trained neural networks to process at least one tensor corresponding to the at least one image (The feature extractor 422 can be implemented as a neural network that has been trained (e.g., through supervised learning and backpropagation) to generate a vector or multi-dimensional tensor for input to each of the modules 424, 426, and 428. The vector or multi-dimensional tensor is an abstract representation of a 2D image that combines information from the individual camera images [0094]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin and Das to incorporate the teachings of Theverapperuma for the benefit of implementing a neural network that has been trained to generate a vector or multi-dimensional tensor for input (Theverapperuma [0094]).
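For illustration only, a minimal sketch of converting one camera image into a tensor and partitioning it into equally dimensioned tiles, tying together the camera/tensor reading of Theverapperuma [0094] and the tiling of Das [0049]; the normalization and all sizes are hypothetical.

```python
# Hypothetical sketch; not Theverapperuma's (US20220024485) AVMS
# pre-processing. One camera image is converted into a tensor and
# partitioned into equally dimensioned tiles for downstream processing.
import numpy as np

image = np.random.randint(0, 256, (16, 16), dtype=np.uint8)  # camera frame
tensor = image.astype(np.float32) / 255.0  # tensor corresponding to the image
tiles = [tensor[i:i+4, j:j+4]              # equally dimensioned 4x4 tiles
         for i in range(0, 16, 4)
         for j in range(0, 16, 4)]
assert len(tiles) == 16 and all(t.shape == (4, 4) for t in tiles)
```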
Regarding claim 36, Badin, Das and Theverapperuma teach the autonomous vehicle of claim 28. Theverapperuma teaches wherein the tensor (The feature extractor 422 can be implemented as a neural network that has been trained … to generate a vector or multi-dimensional tensor for input to each of the modules 424, 426, and 428. The vector or multi-dimensional tensor is an abstract representation of a 2D image that combines information from the individual camera images. The feature extractor 422 typically includes many layers (e.g., on the order of a hundred) that perform various mathematical operations, including convolution and pooling operations [0094]) represents one of: an input layer of a convolutional neural network or a feature map of an inner layer of the convolutional neural network (Multipurpose CNN 740 may include a separate sub-network for each of its components, where each subnetwork includes at least one convolutional layer. For instance, the feature extractor 722 may correspond to a first set of layers, … and so on [0117]).
The same motivation to combine as applied to independent claim 28 applies here.
10. Claims 29 and 35 are rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020) in view of Theverapperuma et al. (US20220024485 filed 07/24/2020) and further in view of Nurvitadhi et al. (US20180189638).
Regarding claim 29, Badin, Das and Theverapperuma teach the autonomous vehicle of claim 28. Badin teaches wherein the first one or more identified portions are identified (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as the portions 306 a-306 d of matrix A 300 [0075]. The Examiner notes matrix 300 is a tensor, and the identified first portions are 306 a, 306 b, 306 c and 306 d in Fig. 3D-3F)
Badin, Das and Theverapperuma do not explicitly teach using a first technique, wherein the first technique comprises loading data from a plurality of cells corresponding to the first one or more identified portions into a cache, wherein the first one or more identified portions comprises one or more tiles of predetermined dimensions, and wherein the cache is accessible by one or more processing threads of the processor.
Nurvitadhi teaches using a first technique, wherein the first technique (In particular, FIG. 15a highlights paths (using dotted lines) for spMspV_csc [0157]; The Examiner notes that CSC is a compressed sparse column [0139] and spMspV_csc as the first technique) comprises loading data from a plurality of cells corresponding to the first one or more identified portions into a cache (For example, the cache coherent interface 930 may implement a cache coherency protocol to ensure that data accessed/modified by the accelerator 900 and stored in the accelerator cache 907 [0132]),
wherein the first one or more identified portions comprises one or more tiles of predetermined dimensions (The operation of each accelerator tile is summarized in FIG. 19. At 1901, the y vector (vdata) is loaded to the PE RAM 1421. At 1902, the x vector and column pointers are loaded to the aux buffer 1801. At 1903 [0182]), and
wherein the cache is accessible by one or more processing threads of the processor (In one embodiment, coherency is maintained between one or more cache units 3706 and cores 3702-A-N [0277]; In some embodiments, one or more of the cores 3702A-N are capable of multithreading [0278]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Badin, Das and Theverapperuma to incorporate the teachings of Nurvitadhi for the benefit of a processor that achieves extreme efficiencies and efficiently converts the available memory bandwidth provided to it into performance for matrix operations (Nurvitadhi [0154]).
Regarding claim 35, Badin, Das and Theverapperuma teach the autonomous vehicle of claim 28. Badin teaches wherein combining results of at least first and second portions of the operation (FIG. 3C illustrates an implementation of a partial matrix multiplication using the blocks 306 a, 308 a of matrices 300, 302, respectively [0049])
comprises: generating a first result (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as the portions 306 a-306 d of matrix A 300 [0075]. The Examiner notes matrix 300 is a tensor, and the identified first portions are 306 a, 306 b, 306 c and 306 d in Fig. 3D-3F)
generating a second result (The portions of data of the matrices representing all or a portion of a row or column of data of the matrices, such as … portions 308 a-308 d of matrix B 302 [0075]. The Examiner notes matrix 302 is a tensor, and the identified second portions are 308 a, 308 b, 308 c and 308 d in Fig. 3D-3F)
and combining the first result and the second result (FIG. 3C illustrates an implementation of a partial matrix multiplication using the blocks 306 a, 308 a of matrices 300, 302, respectively [0049]) to generate an output tensor (Fig. 3A-3F illustrate a non-limiting example of matrix multiplication according to an embodiment. This example matrix multiplication involves the multiplication, or dot product, of matrix A 300 with matrix B 302 to produce a resultant matrix 304 [0043]).
Badin, Das and Theverapperuma do not explicitly teach using the first technique and the second technique.
Nurvitadhi teaches generating a first result (The Examiner notes the Output Buf as the first result from Fig. 15a)
using the first technique (The Examiner notes that CSC is a compressed sparse column [0139] and spMspV_csc as first technique);
generating a second result (The Examiner notes the Output Buf as the second result from Fig. 15b)
using the second technique (FIG. 15b illustrates paths for a spMdV_csr operation [0157]. The Examiner notes CSR is a compressed sparse row [0139] and spMdV_csr as second technique);
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Badin, Das and Theverapperuma to incorporate the teachings of Nurvitadhi for the benefit of a processor that achieves extreme efficiencies and efficiently converts the available memory bandwidth provided to it into performance for matrix operations (Nurvitadhi [0154]).
11. Claims 30-32 are rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020) in view of Theverapperuma et al. (US20220024485 filed 07/24/2020) in view of Nurvitadhi et al. (US20180189638) and further in view of Woolley Jr. et al. (US20160162402, hereinafter “Woolley”).
Regarding claim 30, Badin, Das, Theverapperuma and Nurvitadhi teach the autonomous vehicle of claim 29. Badin, Das, Theverapperuma and Nurvitadhi do not explicitly teach wherein the operation comprises one or more convolution operations between a plurality of cell patches in the plurality of cells and a convolutional filter, and each portion of the operation comprises the convolution operations for each cell that are performed by a separate processing thread of the one or more processing threads.
Woolley teaches wherein the operation comprises one or more convolution operations between a plurality of cell patches in the plurality of cells and a convolutional filter, and each portion of the operation comprises the convolution operations for each cell (each of the thread groups configures the floating-point unit to perform matrix multiplication operations between the assigned image tile 542 and the corresponding filter tile 544 [0096]. The Examiner notes image tile 542 includes a plurality of cell patches), and
that are performed by a separate processing thread of the one or more processing threads (The convolution engine divides the virtual image matrix into separate image tiles and then assigns the processing of each image tile to a different thread group [0107]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin, Das, Theverapperuma and Nurvitadhi to incorporate the teachings of Woolley so that the amount of parallel processing memory used can be dramatically reduced (Woolley [0011]).
Regarding claim 31, Badin, Das, Theverapperuma, Nurvitadhi and Woolley teach the autonomous vehicle of claim 30. Woolley teaches wherein the convolutional filter (FIG. 4 illustrates an image batch 410, a filter stack 440, and an output batch 470 associated with a multi-convolution operation [0062])
comprises a plurality of weight values for a convolution layer of a convolutional neural network model (Consequently, the optimized performance of the matrix-based multi-convolution operation is relatively consistent across changes in the values of individual parameters [0072]; The multi-convolution operation corresponds to the predominant calculation involved in executing a particular convolution layer included in a CNN [0062]).
The same motivation to combine as applied to dependent claim 30 applies here.
Regarding claim 32, Badin, Das, Theverapperuma and Nurvitadhi teach the autonomous vehicle of claim 29. Badin, Das, Theverapperuma and Nurvitadhi do not explicitly teach loading padding elements corresponding to the one of the first one or more portions into the cache.
Woolley teaches wherein the circuitry is further to: load padding elements corresponding to the one of the first one or more portions into the cache (For example, in some embodiments, the parameters 465 may include a padding height and a padding width. The padding height and the padding width append, respectively, rows of zeros and columns of zeros to output images 480 included in the output batch 470 for any technical reason, such as formatting for future operations [0066]; The L1 cache 384 supports, among other things, load and store operations performed by the execution units [0054]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin, Das, Theverapperuma and Nurvitadhi to incorporate the teachings of Woolley so that the amount of parallel processing memory used can be dramatically reduced (Woolley [0011]).
12. Claim 33 is rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020) in view of Theverapperuma et al. (US20220024485 filed 07/24/2020) in view of Nurvitadhi et al. (US20180189638) and further in view of Nagy et al. (US20210241082 filed 04/09/2019).
Regarding claim 33, Badin, Das, Theverapperuma and Nurvitadhi teach the autonomous vehicle of claim 29. Badin, Das, Theverapperuma and Nurvitadhi do not explicitly teach further comprising determining that the second one or more identified portions have smaller dimensions than the predetermined dimensions; and accessing the second one or more identified portions of the at least one tensor from the memory responsive to the determination.
Nagy teaches further comprising determining that the second one or more identified portions have smaller dimensions than the predetermined dimensions (The stride value may also be used to determine whether the result store 439 may store results of more than one output area, … the size of the resulting output area will be smaller than the size of corresponding input areas [0168]); and
accessing the second one or more identified portions of the at least one tensor from the memory responsive to the determination (For hardware resources, M may determine the amount of parallel processing elements 418 and R may determine the physical capacity of the result store 439 in each processing element 418 [0162]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin, Das, Theverapperuma and Nurvitadhi to incorporate the teachings of Nagy for the benefit of accessing and processing of data, such as by applying convolution operations, which may be flexibly adapted to various scenarios, which efficiently exploits available resources, and handles bandwidth issues with regard to exchange of input and output data processed by the accelerator apparatus (Nagy [0007]).
13. Claim 34 is rejected under 35 U.S.C. 103 as being unpatentable over Badin et al. (US20170270073) in view of Das et al. (US20200293858 filed 03/12/2020) in view of Theverapperuma et al. (US20220024485 filed 07/24/2020) and further in view of Abdelaziz et al. (US20210326686 filed 06/12/2020).
Regarding claim 34, Badin, Das and Theverapperuma teach the autonomous vehicle of claim 28. Badin, Das and Theverapperuma do not explicitly teach further comprising: applying a transformation to one or more blocks of the at least one tensor to unroll the one or more blocks into at least one two-dimensional (2D) tensor; loading the at least one 2D tensor into a cache; and performing, using data in the cache, a convolution operation between a convolutional filter and the at least one tensor, via general matrix multiply, to obtain a second output.
Abdelaziz teaches further comprising: applying a transformation to one or more blocks of the at least one tensor to unroll the one or more blocks (In one embodiment, the PE units are configured to unroll in one or more selected dimensions of the input feature map, convolution kernel, and/or output feature map, and perform parallel dot-product computations in the unrolled dimension(s) [0028])
into at least one two-dimensional (2D) tensor (In one embodiment, the 2D array of PE units is invoked for performing dot-product computations for computing a layer of the neural network, such as a convolution layer of the CNN, in parallel [0028]; In one embodiment, the size of register files per PE (RF/PE) is reduced by unrolling in one or two of the kernel dimensions [0032]);
loading the at least one 2D tensor into a cache (For example, input a0 is fed to the PE units in row 412 of the tile, input a1 is fed to the PE units in row 414 of the tile, and input a15 is fed to the PE units in row 416 of the tile [0060]; In the weight stationary architecture, given that the weight data may be preloaded into the register files of the various PE units [0032]. The Examiner notes register in the PE unit as cache); and
performing, using data in the cache, a convolution operation between a convolutional filter and the at least one tensor, via general matrix multiply, to obtain a second output (In performing the convolution computation during the first processing cycle, inputs a0-a15 are multiplied with weight B0 of kernel column 406 (in the various input channels), which is stored in the PE units of column 400 of the tile, for generating a partial sum of output pixel a0 [0060]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Badin, Das and Theverapperuma to incorporate the teachings of Abdelaziz for the benefit of performing parallel dot-product computations in the unrolled dimension(s), which may thus accelerate the computation of the neural network layer (Abdelaziz [0028]).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 8am-5pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle T Bechtold can be reached on (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.G./Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/Supervisory Patent Examiner, Art Unit 2148