DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 2/3/2026 has been entered.
Response to Arguments
(Submitted 2/3/2026)
The Applicant’s arguments with respect to claims 1, 10 and 16 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The examiner notes that claims 7-8, 11, 14-15, 17 and 20 are CANCELED.
- The applicant on Page 8 argues on “completeness and clarity of examiner’s action,” stating that the examiner has not provided any suggestions in regard to claim allowability and/or amendments that may permit allowability in the first Office Action (Non-Final Rejection dated 2/14/2024).
Examiner’s Response:
The examiner disagrees with the above argument on “completeness”. The examiner did not raise the matters specified in the form because the claims were not found allowable at the time of the first action.
- The applicant on Page 8 argues that, with regard to the amendments, the claims were not rendered obvious under the 103 rejections over the prior arts Vijay, Milton and Das combined. On Page 9, the applicant elaborated these arguments with specific reference to the limitations drawn from claims 7 and 8 (claims 14 and 15) into the independent claims.
Examiner’s Response:
The examiner submits and interprets that the current focus of the invention is: “for each layer, a hash mapping is applied after the sparsification using LSH to activate the weight matrix with a compressed representation.” This is evident from claims 1, 10 and 16, which were amended very extensively; bringing the limitations from the dependent claims into the independent claims may have provided better clarity on the invention to the examiner, enabling new grounds of rejection. The examiner has used new references “Jiang” and “VISHNU” with new grounds of rejection to teach the amendments. The examiner submits that, after careful review of the claims and based on the 103 combination of “Vijay” in view of “Jiang”, further in view of “VISHNU”, the claims do not support allowability.
In CONCLUSION, the examiner rejects independent claims 1, 10 and 16 and dependent claims 3-4, 6, 9, 12-13 and 18-19 under 103 in this NON-FINAL rejection following the RCE.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3-4, 6, 9-10, and 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Sudheendra Vijayanarasimhan et al. (hereinafter Vijay), US 2016/0180200 A1, in view of Peng Jiang et al. (hereinafter Jiang), “A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs,” Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, published Feb. 19, 2020, and further in view of Abhinav Vishnu et al. (hereinafter VISHNU), US 2020/0097822 A1.
In regard to claim 1: (Currently Amended)
Vijay discloses:
- An apparatus comprising: one or more processors; a memory to store data for processing, including a plurality of layers, each layer including a plurality of neurons, and
in [0070]:
Computers suitable for the execution of a computer program include, by way of example, can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data,
in [0006]:
In general, one innovative aspect of the subject matter described in this specification can be embodied in methods for processing an input through each of multiple layers of a neural network to generate an output, wherein each of the multiple layers of the neural network includes respective multiple nodes include the actions of for a particular layer of the multiple layers,
in [0004]:
For instance, the matrix multiplication is a combination of an activation vector, e.g., input for the particular layer, and a weight matrix, e.g., the weights for some of the nodes in the particular layer. The neural network uses a fast locality-sensitive hashing technique to approximate a result of the matrix multiplication to allow the neural network to generate scores for a large number, e.g., millions, of output classes,
in [0008]:
Selecting the nodes in the particular layer using the activation vector and the hash table may include selecting the one or more nodes in the particular layer by using the integers as input to the hash table. The integers may include a first subset and a second, mutually exclusive subset.
- the one or more processors to perform sparsification of at least two of the plurality of the layers of the DNN, including selecting a subset of the plurality of neurons of a first layer of the DNN for activation based at least in part on locality sensitive hashing of inputs to the first layer, the sparsification of one or more layers including:
in [0025]:
The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero.
(BRI: This is a form of “sparsity” known as “activation sparsity”, which is common in ML and reduces the amount of computation performed per layer)
in [0041]:
In some implementations, during training and run-time, the classification system 100 divides the activation vector x that contains n elements into M portions that each contain n/M elements. The classification system 100, during training, creates a hash table 104 for each portion, {T.sub.m: m=1 . . . M}. For instance, the classification system 100 determines the hash code for each portion x.sub.m of the activation vector x, using the method described above or another appropriate method, and uses the resulting hash code as an index to the corresponding hash table T.sub.m,
in [0042]:
During run-time, the classification system 100 similarly determines the hash code for each portion x.sub.m of a corresponding activation vector x and uses the portions x.sub.m as indices to the corresponding hash tables T.sub.m to determine a set of all identifiers of the nodes for which to perform matrix multiplication. In some examples, each hash table T.sub.m has only one entry for each index and the classification system 100 determines the set of all identifiers of the nodes using the corresponding hash codes, e.g., to determine at most M nodes. In some examples, the hash table or hash tables 104 include only one entry for each of the nodes in the particular layer y,
(BRI: the M portions represent hashing of patterns)
In [0004]:
The neural network uses a fast locality-sensitive hashing technique to approximate a result of the matrix multiplication to allow the neural network to generate scores for a large number, e.g., millions, of output classes,
in [0023]:
FIG. 1 is an example of a classification system 100 that uses a hash table to determine for which nodes in a particular layer y to perform matrix multiplication using an activation vector x.
in [0018]:
For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes”
(BRI: the contribution of determined nodes to perform the matrix operation provide patterns in the matrix)
in [0023]:
The matrix multiplication may be a product of the activation vector x from a previous layer in the neural network or an initial input for the neural network, e.g., when the particular layer is an input layer in the neural network, and the weights W. The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer.
In [0018]:
a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication,
In [0018]:
a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function.
In [0018]:
the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes
(BRI: the storage of weight matrix in a hash table using WTA is a “hash bucket”)
in [0006]:
an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer
(BRI: Using a layer and its previous layer to reduce computation represents using at least two layers for sparsification. In this process, the output of the previous layer is the input for the current layer, and the weights connecting the two may be pruned to reduce the number of calculations. Using activations from a previous layer can reduce computation in specific neural network architectures and scenarios, though the savings depend on how the activations are used)
In [0007]:
The method may include creating a modified activation vector by setting the values in the activation vector that correspond to the nodes that were not selected to zero,
In [0023]:
the matrix multiplication may be a product of the activation vector x from a previous layer in the neural network or an initial input for the neural network, e.g., when the particular layer is an input layer in the neural network, and the weights W. The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y.
In [0027]:
For instance, when the particular layer y is an output layer, the classification system 100 only needs output for the top K classes of the output layer and can determine the top K vectors W.sup.(K), from a weight matrix W, that have the largest dot products with the activation vector x. The classification system 100 computes the probabilities for only these K classes, and sets the probabilities of all other classes to zero.
(BRI: Setting node to zero and setting probability to zero can deactivate the node)
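For clarity of the record, the selection-and-zeroing mechanism described in the passages quoted above (Vijay [0025], [0027]) may be sketched as follows. This is an illustrative sketch only, assuming a sign-based (random hyperplane) LSH family; the layer sizes, code length, and all names are hypothetical and are not drawn from Vijay:

```python
import random

random.seed(0)
n_in, n_out, n_bits = 8, 16, 4  # hypothetical layer sizes and hash-code length

# Random hyperplanes defining a sign-based LSH family
planes = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_bits)]
# One weight vector per node of the particular layer y
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def lsh_code(v):
    # One bit per hyperplane: which side of the plane v falls on
    return tuple(1 if dot(p, v) > 0 else 0 for p in planes)

# Hash table built once: bucket (hash code) -> node identifiers
table = {}
for node, w in enumerate(W):
    table.setdefault(lsh_code(w), []).append(node)

# At run time, the activation vector x selects a bucket of candidate nodes
x = [random.gauss(0, 1) for _ in range(n_in)]
selected = table.get(lsh_code(x), [])

# "Activation sparsity": outputs computed only for selected nodes, rest set to zero
y = [dot(W[node], x) if node in selected else 0.0 for node in range(n_out)]
```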
- applying one or more locality sensitive hashing algorithms to the plurality of inputs to the first layer,
In [0023]:
The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y,
In [0006]:
methods for processing an input through each of multiple layers of a neural network to generate an output, wherein each of the multiple layers of the neural network includes respective multiple nodes include the actions of for a particular layer of the multiple layers,
in [0006]:
an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer,
- where in the plurality of inputs to the first layer comes from the sparsification of a prior layer of the plurality of layers of the DNN
In [0052]:
At 210, the classification system processes the activation vector using the selected nodes to generate an output for the particular layer. For example, the classification system performs matrix multiplication using the activation vector x and the determined weight vectors and then applies a function to the result of the matrix multiplication to generate an output for each selected node and sets all other output values for the particular layer y to zero.
In [0025]:
The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero.
- wherein the one or more previous sets of inputs relate to a first subset of the plurality of neurons,
In [0018]:
a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication. For instance, a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function.
In [0018]:
the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations.
(BRI: the activations from previous layer to determine which nodes in the current layers are most likely to be activated provide a relationship)
In [0032]:
The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash
In [0043]:
In some implementations, the classification system 100 may retrieve a corresponding count for each node from the hash table 104. For instance, each count may provide a lower bound for the dot product between the activation vector x and the weight vector for the node. The count may represent the ordinal similarity between the two vectors. The classification system 100 may select the K nodes with the greatest ordinal similarity between the two vectors
- and selecting the first subset of neurons as the subset of neurons for activation based at least in part on the relation of the first subset of neurons to the one or more previous sets of inputs,
In [0006]:
receiving, by a classification system, an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer.
In [0008]:
Selecting the nodes in the particular layer using the activation vector and the hash table may include selecting the one or more nodes in the particular layer by using the integers as input to the hash table. The integers may include a first subset and a second, mutually exclusive subset
In [0032]:
The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes.
In [0017]:
For instance, a neural network may use matrix multiplication W*x during a classification process, where x is the input from a layer and W refers to the weights of the connections to the next layer's outputs.
However, Jiang discloses:
- detecting similarity between a received set of inputs of the plurality of inputs and one or more previous sets of inputs based on results of the application of the one or more locality sensitive hashing algorithms
In [Abstract, Page 376]:
SpMM (multiplication of a sparse matrix and a dense matrix) and SDDMM(sampled dense-dense matrix multiplication) are at the core of many scientific, machine learning, and data mining applications.
In [6, Page 386]:
Several efforts have sought to improve the performance of sparse matrix multiplication by reordering the input data and defining new representations for sparse matrices.
In [Abstract, Page 376]:
We focus on performing the row-reordering efficiently, by using a hierarchical clustering procedure optimized by locality-sensitive hashing.
In [6, Page 386]:
Several efforts have sought to improve the performance of sparse matrix multiplication by reordering the input data and defining new representations for sparse matrices
In [3.2, Page 380]: Row-Reordering by Clustering
We have shown that reordering the rows of the sparse matrix can improve data locality for SpMM and SDDMM.
As shown in Fig 2a, a row of the output matrix Y is the weighted sum of certain rows of the dense input matrix X, which are selected based on the non-zeros in the sparse matrix S.
[Image: media_image1.png]
In [Abstract, Page 377]:
the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality.
In [3.2, Page 380]:
We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns. A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows:
[Image: media_image2.png — the Jaccard similarity |S_i ∩ S_j| / |S_i ∪ S_j|]
For example, row 0 of the sparse matrix in Fig 1a has nonzeros in column 0 and 4 (i.e., S_0 = {0,4}), while row 4 has nonzeros in column 0, 3 and 4 (i.e., S_4 = {0,3,4}).
[Image: media_image3.png — Fig 1]
The similarity between row 0 and 4 is |S_0 ∩ S_4|/|S_0 ∪ S_4| = 2/3. Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters. The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster.
In [3.2, Page 380]:
Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering procedure
In [2.2, Page 378]:
In SDDMM, two dense matrices X and Y are multiplied and the result matrix is then scaled by an input sparse matrix S (Hadamard product). As shown in Fig 2b, an element in the output matrix (O[i][j]) is nonzero only if the corresponding element in the input sparse matrix (S[i][j]) is nonzero,
[Image: media_image4.png — Fig 2]
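The Jaccard similarity arithmetic in the example quoted above (rows 0 and 4 of Jiang's Fig 1a) can be verified in a few lines. This is an illustrative sketch only:

```python
S0 = {0, 4}     # columns with nonzeros in row 0 of Fig 1a (per the quoted example)
S4 = {0, 3, 4}  # columns with nonzeros in row 4

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

sim = jaccard(S0, S4)  # 2/3, matching the quoted example
```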
- mapping the received set of inputs to the first layer to one of a plurality of hash table buckets based on the detected similarity between the received set of inputs and the one or more previous sets of inputs of the first layer,
In [3.2, Pages 380-381]:
The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure. The performance of LSH is controlled by two parameters: siglen (the length of the signature) and bsize (the size of the band). The larger the siglen, the more accurate the hashing. The smaller the bsize, the more likely two nodes will be hashed into the same bucket. The total time complexity is siglen × nnz + siglen/bsize × N + d_max × E, where nnz is the number of nonzeros in the sparse matrix, N is the number of rows, d_max is the number of nonzeros in the longest row, and E is the number of candidate pairs generated
in [3.2, Page 381]:
Alg 3 shows our clustering-based row-reordering procedure. First, we use LSH to get a list of candidate pairs of rows that are likely to have large values of similarity score. These candidate pairs are put in a priority queue (sim_queue) to enable a fast querying of the pair with the largest similarity
In [3.2, Page 381]:
[Image: media_image5.png — Alg 3]
(BRI: LSH that produces candidate pairs of rows in a similarity queue represents the received set of inputs and the previous sets of inputs of the first layer, by hashing the input items and grouping similar items into the same bucket)
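The siglen/bsize banding scheme quoted above ([3.2, Pages 380-381]) can be illustrated with a MinHash-style sketch that hashes rows into buckets and emits candidate pairs. This is a hypothetical toy example; the row contents and the siglen/bsize values are not drawn from Jiang:

```python
import itertools
import random

random.seed(1)
rows = {0: {0, 4}, 1: {2}, 4: {0, 3, 4}, 5: {2, 3}}  # row -> nonzero columns
n_cols, siglen, bsize = 6, 4, 2                      # hypothetical parameters

# MinHash signature: for each random column permutation, the smallest
# permuted index among the row's nonzero columns
perms = [random.sample(range(n_cols), n_cols) for _ in range(siglen)]

def signature(cols):
    return tuple(min(perm[c] for c in cols) for perm in perms)

# Banding: rows whose signatures agree on any band share a bucket
buckets = {}
for r, cols in rows.items():
    sig = signature(cols)
    for b in range(0, siglen, bsize):
        buckets.setdefault((b, sig[b:b + bsize]), []).append(r)

# Pairs of rows within a bucket become candidate pairs for clustering
candidates = {tuple(sorted(p))
              for bucket in buckets.values() if len(bucket) > 1
              for p in itertools.combinations(bucket, 2)}
```

Rows sharing many nonzero columns are likely to agree on a band and thus land in the same bucket, so far fewer than all row pairs need to be compared.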
- wherein the one or more processors are further to perform compression of a weight or activation matrix of one or more layers of the DNN, including detection of sparsity patterns in a weight or activation matrix of the first layer of the DNN based at least in part on locality sensitive hashing of patterns in the matrix;
In [2.1, Page 377]: Compressed Sparse Row Representation
The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices.
In [2.1, Page 377]:
As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr [i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row. The corresponding values of these nonzeros are stored in the array value. For example, in Fig 1b, the second element of rowptr is 2, which indicates that the column index and the value of the first nonzero in the second row of the sparse matrix is in colidx[2] and value[2].
(BRI: As may be known to the POSITA, CSR is a representation of the weight matrix of a layer in a NN that stores and manipulates sparse matrices)
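The rowptr/colidx/value layout described in the quoted passage can be sketched as follows. The matrix here is an illustrative toy, not the matrix of Jiang's Fig 1:

```python
def to_csr(dense):
    # Build the three CSR arrays described in the quoted passage:
    # rowptr[i] is the index of the first element of row i in colidx/value
    rowptr, colidx, value = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                colidx.append(j)  # column indices of nonzeros, row by row
                value.append(v)   # corresponding nonzero values
        rowptr.append(len(colidx))
    return rowptr, colidx, value

dense = [[5, 0, 0],
         [0, 0, 7],
         [0, 2, 0]]
rowptr, colidx, value = to_csr(dense)
# rowptr == [0, 1, 2, 3]; colidx == [0, 2, 1]; value == [5, 7, 2]
```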
- wherein compression of the matrix of the first layer includes:
grouping connections of the first layer of the DNN into hash buckets;
in [1, Page 377]:
The goal of this work is to achieve consistently higher performance for SpMM and SDDMM.
in [1, Page 377]:
We propose to use a row-reordering procedure to group similar rows together. More specifically, we first reorder the rows of the sparse matrix such that rows with non-zeroes at similar locations are put in the same row panel. This brings more non-zeros to the dense tiles and increases data reuse in the shared memory.
In [3.2, Pages 380-381]:
The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering
in [5.4, Page 386]:
The ratios of the preprocessing time to the actual computation time for the 416 matrices that need row-reordering are summarized in Tables 3 and 4 for SpMM and SDDMM, respectively. We can see that, when K = 512, more than 85% of the matrices have preprocessing time that is less than 10x of the actual computation time of SpMM, and more than 90% matrices have preprocessing time that is less than 10x of the actual computation time of SDDMM.
[Image: media_image6.png]
In [6, Page 386]:
Compressed Sparse Blocks (CSB) [3] is another sparse matrix storage format that exploits register blocking. The sparse matrix is partitioned and stored as small rectangular blocks. SpMM implementation with CSB data representation has been demonstrated to achieve high performance when both SpMM and transposed SpMM
(BRI: the smaller rectangular blocks of the sparse matrix represent tiling)
- and combining values of the grouped connections of each hash bucket to generate a group value for the grouped connections;
In [3.2, Page 380]:
We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns. A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows:
[Image: media_image2.png — the Jaccard similarity |S_i ∩ S_j| / |S_i ∪ S_j|]
For example, row 0 of the sparse matrix in Fig 1a has nonzeros in column 0 and 4 (i.e., S_0 = {0,4}), while row 4 has nonzeros in column 0, 3 and 4 (i.e., S_4 = {0,3,4}).
In [3.2, Page 380]:
Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters. The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster. Whenever the size of a cluster reaches a threshold, we output the rows in it and remove the cluster. Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering
(BRI: the similarity measure is the combined value of each bucket)
In [3.2, Pages 380-381]:
The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering
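The representing-row clustering procedure quoted above ([3.2, Page 380]) can be sketched as a greedy merge loop. This is an illustrative toy; the row contents and the cluster-size threshold are hypothetical and not drawn from Jiang:

```python
import itertools

rows = {0: {0, 4}, 1: {2, 5}, 2: {0, 3, 4}, 3: {2}}  # row -> nonzero columns

def jaccard(a, b):
    return len(a & b) / len(a | b)

clusters = {r: [r] for r in rows}  # representing row -> member rows
order = []                         # output row ordering
while len(clusters) > 1:
    # Merge the two clusters whose representing rows are most similar; on a
    # size tie the smaller-index representing row wins, as described above
    a, b = max(itertools.combinations(clusters, 2),
               key=lambda p: jaccard(rows[p[0]], rows[p[1]]))
    rep = a if len(clusters[a]) >= len(clusters[b]) else b
    other = b if rep == a else a
    clusters[rep].extend(clusters.pop(other))
    if len(clusters[rep]) >= 2:    # hypothetical cluster-size threshold
        order.extend(clusters.pop(rep))  # output the rows and drop the cluster
if clusters:
    order.extend(next(iter(clusters.values())))
```

Rows 0 and 2 (similarity 2/3) are grouped and output first, then rows 1 and 3 (similarity 1/2), so similar rows end up adjacent in the output ordering.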
- wherein compression of the weight or activation matrix of the first layer further includes: establishing a hash bucket for each of a plurality of sparsity patterns for the matrix;
In [2.1, Page 377]:
The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices
In [2.1, Page 377]:
As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value
[Image: media_image7.png — Fig 1]
(BRI: Fig 1 represents a sparsity pattern matrix)
In [3.2, Page 381]:
The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure.
- decomposing the matrix into a plurality of tiles;
In [3.1, Page 380]:
Consider again the sparse matrix in Fig 1a. Row 0 has two identical columns (i.e. location of non-zeroes) with row 4, while row 1 has one identical column with row 5.
In [1, Page 377]:
[Image: media_image8.png]
In [3.1, Page 380]:
If we exchange row 1 and row 4, we can get a reordered matrix as shown in Fig 4a. Applying ASpT to the reordered matrix will give us a tiled matrix as shown in Fig 4b. We can see that the number of nonzeros in the dense tiles increases to 9. This means that more data movement from the global memory can be saved during the execution.
[Image: media_image9.png — Fig 4]
- and wherein a compression ratio is dynamically adjusted and optimized by changing a tile size.
In [1, Page 377]:
the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality. We find the poor data locality issue is quite common for real-world data. For 351 of the 1084 matrices from the SuiteSparse collection and the Network Repository [34], the adaptive-tiling method has less than 1% of the nonzeros in the dense tiles
in [2.3, Page 378]: Adaptive Sparse Tiling
As shown in Alg 1 and Alg 2, both SpMM and SDDMM in volve traversing the nonzeros of the sparse matrix S and accessing the data of the input dense matrices.
[Image: media_image10.png]
in [2.3, Page 379]: Adaptive Sparse Tiling
[Image: media_image11.png]
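The tiling limitations above can be illustrated by decomposing a toy sparse matrix into T × T tiles and counting nonzeros per tile; changing the tile size T changes how many nonzeros land in "dense" tiles, which is the knob behind the adaptive-tiling discussion quoted from Jiang. All values here are hypothetical:

```python
rows = {0: {0, 4}, 1: {0, 3, 4}, 2: {2, 5}, 3: {2, 3}}  # row -> nonzero columns
T = 2  # hypothetical tile size; adjusting T trades tile density for tile count

# Count the nonzeros falling in each T x T tile of the matrix
tiles = {}
for r, cols in rows.items():
    for c in cols:
        key = (r // T, c // T)
        tiles[key] = tiles.get(key, 0) + 1

# Tiles holding enough nonzeros are treated as "dense" tiles
dense_tiles = [t for t, nnz in tiles.items() if nnz >= 3]
```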
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang.
Vijay teaches hashing function.
Jiang teaches compression of weight matrix and grouping into hash buckets and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling.
One of ordinary skill would have been motivated to combine Vijay and Jiang because the combination provides reduced memory consumption and computation complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).
In regard to claim 3: (Currently Amended)
Vijay discloses:
wherein sparsification of the at least two of the plurality of layers of the DNN further includes:
- selecting the first subset of neurons based at least in part on which neurons of the first layer have the highest activation values.
In [0027] :
When the number of nodes in a particular layer y of the neural network 102 is large, the classification system 100 only needs output from the K nodes with the highest probabilities of activating based on the activation vector x. For instance, when the particular layer y is an output layer, the classification system 100 only needs output for the top K classes of the output layer and can determine the top K vectors W.sup.(K), from a weight matrix W,
In [0032]:
the classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes.
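Vijay's top-K selection can be illustrated with a minimal sketch (the activation values and K below are hypothetical, chosen only for illustration):

```python
def top_k_sparsify(activations, k):
    # Keep only the k neurons with the highest activation values;
    # the outputs of all other neurons are set to zero (deactivated).
    top = sorted(range(len(activations)),
                 key=lambda i: activations[i], reverse=True)[:k]
    keep = set(top)
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]
```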
In regard to claim 4: (Previously Presented)
Vijay discloses:
- utilizing locality sensitive hashing includes applying one or more locality sensitive hash functions and mapping to one or more hash tables for the first layer.
In [0046]:
the neural network 102 may include multiple layers for which approximate computation of matrix products is performed.
In [0033]:
For instance, when the neural network 102 receives an image of a car as input, the neural network 102 may identify the top K nodes for the input image in the output layer y as nodes that represent a truck (y.sub.0) or a tree (y.sub.3).
In [0033]:
The classification system 100 uses the updated weight vectors to determine new hash codes for these nodes and places identifiers for these nodes in the hash table 104 at the location pointed to by the new hash codes.
In regard to claim 6: (Previously Presented)
Vijay discloses:
- the selected first subset of neurons includes a certain percentage of a total number of neurons of the first layer
In [0018]:
For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes,
In [0019]:
The classification system retrieves the top K of those weight vectors, e.g., from the hash table or another location in memory, with K being much smaller than the number of nodes in the particular neural network layer y.
In regard to claim 9: (Original)
Vijay does not explicitly disclose:
- wherein compression of the weight or activation matrix of the first layer further includes: compressing the matrix including storing an identification for each of the sparsity patterns mapped by the plurality of tiles.
However, Jiang discloses:
- wherein compression of the weight or activation matrix of the first layer further includes: compressing the matrix including storing an identification for each of the sparsity patterns mapped by the plurality of tiles.
In [1, Page 377]:
The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices.
[Image: media_image12.png (Greyscale, reproduced from Jiang)]
In [2.1, Page 377]:
As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr [i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row.
(BRI: the CSR format does store the indices of the nonzero values)
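The CSR layout quoted above can be sketched as follows (a minimal illustration; the example matrix is hypothetical, not Fig 1 of Jiang):

```python
def to_csr(dense):
    # Build the three CSR arrays: rowptr[i] indexes the first nonzero
    # of row i in colidx/value; colidx holds the column indices of the
    # nonzeros row by row; value holds the corresponding nonzero values.
    rowptr, colidx, value = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                colidx.append(j)
                value.append(v)
        rowptr.append(len(colidx))
    return rowptr, colidx, value
```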
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang.
Vijay teaches a hashing function.
Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling.
One of ordinary skill would have been motivated to combine Vijay and Jiang because the combination provides reduced memory consumption and computation complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).
In regard to claim 10 : (Currently Amended)
Vijay discloses:
- One or more non-transitory computer-readable storage mediums having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising:
In [0065].
- receiving data for a deep neural network (DNN), the DNN including a plurality of layers, each layer including a plurality of neurons;
in [0006]:
receiving, by a classification system, an activation vector as input for the particular layer, selecting one or more nodes in the particular layer
- and processing the DNN, including performing sparsification of at least two of the plurality of layers of the DNN, including selecting a subset of the plurality of neurons of a first layer of the DNN for activation based at least in part on locality sensitive hashing of inputs to the first layer, the sparsification of the at least two of the plurality of layers including
in [0025]:
The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero.
(BRI: This is a form of “sparsity” known as “activation sparsity”, which is common in machine learning and reduces both computation and memory traffic)
in [0041]:
In some implementations, during training and run-time, the classification system 100 divides the activation vector x that contains n elements into M portions that each contain n/M elements. The classification system 100, during training, creates a hash table 104 for each portion, {T.sub.m: m=1 . . . M}. For instance, the classification system 100 determines the hash code for each portion x.sub.m of the activation vector x, using the method described above or another appropriate method, and uses the resulting hash code as an index to the corresponding hash table T.sub.m,
in [0042] :
During run-time, the classification system 100 similarly determines the hash code for each portion x.sub.m of a corresponding activation vector x and uses the portions x.sub.m as indices to the corresponding hash tables T.sub.m to determine a set of all identifiers of the nodes for which to perform matrix multiplication. In some examples, each hash table T.sub.m has only one entry for each index and the classification system 100 determines the set of all identifiers of the nodes using the corresponding hash codes, e.g., to determine at most M nodes. In some examples, the hash table or hash tables 104 include only one entry for each of the nodes in the particular layer y,
(BRI: the M portions represent hashing of patterns)
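The M-portion lookup described in [0041]-[0042] can be sketched as follows; the sign-based hash function and the toy hash tables used below are hypothetical stand-ins, not Vijay's actual hash:

```python
def portion_lookup(x, M, tables, hash_fn):
    # Split activation vector x into M equal portions, hash each
    # portion, and use the resulting hash code as an index into the
    # portion's hash table T_m to collect candidate node identifiers.
    step = len(x) // M
    nodes = set()
    for m in range(M):
        code = hash_fn(tuple(x[m * step:(m + 1) * step]))
        nodes.update(tables[m].get(code, []))
    return nodes
```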
In [0004]:
The neural network uses a fast locality-sensitive hashing technique to approximate a result of the matrix multiplication to allow the neural network to generate scores for a large number, e.g., millions, of output classes,
in [0023]:
FIG. 1 is an example of a classification system 100 that uses a hash table to determine for which nodes in a particular layer y to perform matrix multiplication using an activation vector x.
in [0018]:
For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes
(BRI: the contribution of determined nodes to perform the matrix operation provide patterns in the matrix)
in [0023]:
The matrix multiplication may be a product of the activation vector x from a previous layer in the neural network or an initial input for the neural network, e.g., when the particular layer is an input layer in the neural network, and the weights W. The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer.
In [0018]:
a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication,
In [0018]:
a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function.
In [0018]:
the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes
(BRI: the storage of weight matrix in a hash table using WTA is a “hash bucket”)
in [0006]:
an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer
(BRI: Using a layer and its previous layer to reduce computation explicitly represents using at least two layers for sparsification. In this process, the output of the previous layer is the input for the current layer, and the weights connecting the two may be pruned to reduce the number of calculations. Using activations from a previous layer can reduce computation in specific neural network architectures and scenarios, though it depends on how the activations are used.)
In [0007]:
The method may include creating a modified activation vector by setting the values in the activation vector that correspond to the nodes that were not selected to zero,
In [0023]:
the matrix multiplication may be a product of the activation vector x from a previous layer in the neural network or an initial input for the neural network, e.g., when the particular layer is an input layer in the neural network, and the weights W. The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y.
In [0027]:
For instance, when the particular layer y is an output layer, the classification system 100 only needs output for the top K classes of the output layer and can determine the top K vectors W.sup.(K), from a weight matrix W, that have the largest dot products with the activation vector x. The classification system 100 computes the probabilities for only these K classes, and sets the probabilities of all other classes to zero.
(BRI: Setting node to zero and setting probability to zero can deactivate the node)
- applying one or more locality sensitive hashing algorithms to the plurality of inputs to the first layer,
In [0023]:
The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y,
In [0006]:
methods for processing an input through each of multiple layers of a neural network to generate an output, wherein each of the multiple layers of the neural network includes respective multiple nodes include the actions of for a particular layer of the multiple layers,
in [0006]:
an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer,
- wherein the plurality of inputs to the first layer comes from the sparsification of a prior layer of the plurality of layers of the DNN
In [0052]:
At 210, the classification system processes the activation vector using the selected nodes to generate an output for the particular layer. For example, the classification system performs matrix multiplication using the activation vector x and the determined weight vectors and then applies a function to the result of the matrix multiplication to generate an output for each selected node and sets all other output values for the particular layer y to zero.
In [0025]:
The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero.
- wherein the one or more previous sets of inputs relate to a first subset of the plurality of neurons,
In [0018] :
a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication. For instance, a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function.
In [0018] :
the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations.
(BRI: the activations from previous layer to determine which nodes in the current layers are most likely to be activated provide a relationship)
In [0032]:
The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash
In [0043]:
In some implementations, the classification system 100 may retrieve a corresponding count for each node from the hash table 104. For instance, each count may provide a lower bound for the dot product between the activation vector x and the weight vector for the node. The count may represent the ordinal similarity between the two vectors. The classification system 100 may select the K nodes with the greatest ordinal similarity between the two vectors
- and selecting the first subset of neurons as the subset of neurons for activation based at least in part on the relation of the first subset of neurons to the one or more previous sets of inputs,
In [0006]:
receiving, by a classification system, an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer.
In [0008]:
Selecting the nodes in the particular layer using the activation vector and the hash table may include selecting the one or more nodes in the particular layer by using the integers as input to the hash table. The integers may include a first subset and a second, mutually exclusive subset
In [0032]:
The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes.
In [0017]:
For instance, a neural network may use matrix multiplication W*x during a classification process, where x is the input from a layer and W refers to the weights of the connections to the next layer's outputs.
However, Jiang discloses:
- detecting similarity between a received set of inputs of the plurality of inputs and one or more previous sets of inputs based on results of the application of the one or more locality sensitive hashing algorithms
In [Abstract, Page 376]:
SpMM (multiplication of a sparse matrix and a dense matrix) and SDDMM (sampled dense-dense matrix multiplication) are at the core of many scientific, machine learning, and data mining applications.
In [6, Page 386]:
Several efforts have sought to improve the performance of sparse matrix multiplication by reordering the input data and defining new representations for sparse matrices.
In [Abstract, Page 376]:
We focus on performing the row-reordering efficiently, by using a hierarchical clustering procedure optimized by locality-sensitive hashing.
In [3.2, Page 380]: Row-Reordering by Clustering
We have shown that reordering the rows of the sparse matrix can improve data locality for SpMM and SDDMM.
As shown in Fig 2a, a row of the output matrix Y is the weighted sum of certain rows of the dense input matrix X, which are selected based on the non-zeros in the sparse matrix S.
[Image: media_image1.png (Greyscale, reproduced from Jiang)]
In [Abstract, Page 377]:
the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality.
In [3.2, Page 380]:
We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns. A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows:
[Image: media_image2.png, the Jaccard similarity formula (reproduced from Jiang)]
For example, row 0 of the sparse matrix in Fig 1a has nonzeros in columns 0 and 4 (i.e., S_0 = {0,4}), while row 4 has nonzeros in columns 0, 3 and 4 (i.e., S_4 = {0,3,4}).
[Image: media_image3.png (Greyscale, reproduced from Jiang)]
The similarity between row 0 and 4 is |S_0 ∩ S_4|/|S_0 ∪ S_4| = 2/3. Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters. The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster.
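Jiang's row-similarity measure is the standard Jaccard similarity; a direct sketch, checked against the S_0 and S_4 rows of the quoted example:

```python
def jaccard(row_a, row_b):
    # Jaccard similarity of two sparse rows represented as sets of
    # column indices: |intersection| / |union|.
    a, b = set(row_a), set(row_b)
    return len(a & b) / len(a | b)
```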
In [3.2, Page 380]:
Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering procedure
In [2.2, Page 378]:
In SDDMM, two dense matrices X and Y are multiplied and the result matrix is then scaled by an input sparse matrix S (Hadamard product). As shown in Fig 2b, an element in the output matrix (O[i][j]) is nonzero only if the corresponding element in the input sparse matrix (S[i][j]) is nonzero,
[Image: media_image4.png (Greyscale, reproduced from Jiang)]
- mapping the received set of inputs to the first layer to one of a plurality of hash table buckets based on the detected similarity between the received set of inputs and the one or more previous sets of inputs of the first layer,
In [3.2, Pages 380-381]:
The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure. The performance of LSH is controlled by two parameters: siglen (the length of the signature) and bsize (the size of the band). The larger the siglen, the more accurate the hashing. The smaller the bsize, the more likely two nodes will be hashed into the same bucket. The total time complexity is siglen × nnz + siglen/bsize × N + d_max × E, where nnz is the number of nonzeros in the sparse matrix, N is the number of rows, d_max is the number of nonzeros in the longest row, and E is the number of candidate pairs generated.
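The siglen/bsize mechanism quoted above follows the classic MinHash-with-banding scheme; a minimal sketch (the permutation-based MinHash and the toy rows are illustrative assumptions, not Jiang's exact implementation):

```python
import random

def minhash_signature(row_cols, siglen, num_cols, seed=0):
    # siglen random column permutations; the signature entry for a
    # permutation is the smallest permuted index among the row's nonzeros.
    rng = random.Random(seed)
    perms = [rng.sample(range(num_cols), num_cols) for _ in range(siglen)]
    return tuple(min(perm[c] for c in row_cols) for perm in perms)

def band_buckets(signatures, bsize):
    # Hash each band of bsize signature entries into a bucket; rows
    # sharing a bucket in any band become candidate pairs for clustering.
    buckets = {}
    for row_id, sig in enumerate(signatures):
        for b in range(0, len(sig), bsize):
            buckets.setdefault((b, sig[b:b + bsize]), []).append(row_id)
    return buckets
```

With a fixed seed, identical rows get identical signatures and therefore always share a bucket, while dissimilar rows are unlikely to collide in every band.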
- wherein the one or more processors are further to perform compression of a weight or activation matrix of one or more layers of the DNN, including detection of sparsity patterns in a weight or activation matrix of the first layer of the DNN based at least in part on locality sensitive hashing of patterns in the matrix;
In [2.1, Page 377]: Compressed Sparse Row Representation
The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices.
In [2.1, Page 377]:
As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr [i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row. The corresponding values of these nonzeros are stored in the array value. For example, in Fig 1b, the second element of rowptr is 2, which indicates that the column index and the value of the first nonzero in the second row of the sparse matrix is in colidx[2] and value[2].
(BRI: Perhaps as known to the POSITA, CSR is a representation of the weight matrix of a layer in a NN that stores and manipulates a sparse matrix)
- wherein compression of the matrix of the first layer includes:
grouping connections of the first layer of the DNN into hash buckets;
in [1, Page 377]:
The goal of this work is to achieve consistently higher performance for SpMM and SDDMM.
in [1, Page 377]:
We propose to use a row-reordering procedure to group similar rows together. More specifically, we first reorder the rows of the sparse matrix such that rows with non-zeroes at similar locations are put in the same row panel. This brings more non-zeros to the dense tiles and increases data reuse in the shared memory.
In [3.2, Pages 380-381]:
The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering.
In [5.4, Page 386]:
The ratios of the preprocessing time to the actual computation time for the 416 matrices that need row-reordering are summarized in Tables 3 and 4 for SpMM and SDDMM, respectively. We can see that, when K = 512, more than 85% of the matrices have preprocessing time that is less than 10x of the actual computation time of SpMM, and more than 90% matrices have preprocessing time that is less than 10x of the actual computation time of SDDMM.
[Image: media_image6.png (Greyscale, reproduced from Jiang)]
In [6, Page 386]:
Compressed Sparse Blocks (CSB) [3] is another sparse matrix storage format that exploits register blocking. The sparse matrix is partitioned and stored as small rectangular blocks. SpMM implementation with CSB data representation has been demonstrated to achieve high performance when both SpMM and transposed SpMM
(BRI: the smaller rectangular blocks of the sparse matrix represent tiling)
- and combining values of the grouped connections of each hash bucket to generate a group value for the grouped connections;
In [3.2, Page 380]:
We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns. A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows:
[Image: media_image2.png, the Jaccard similarity formula (reproduced from Jiang)]
For example, row 0 of the sparse matrix in Fig 1a has nonzeros in columns 0 and 4 (i.e., S_0 = {0,4}), while row 4 has nonzeros in columns 0, 3 and 4 (i.e., S_4 = {0,3,4}).
In [3.2, Page 380]:
Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters. The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster. Whenever the size of a cluster reaches a threshold, we output the rows in it and remove the cluster. Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering
(BRI: the similarity measure is the combined value of each bucket)
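The hierarchical clustering procedure quoted above can be sketched as follows (a simplified greedy version; the example rows and the output threshold are hypothetical, and Jiang additionally uses LSH to restrict the candidate pairs rather than comparing all cluster pairs):

```python
def jaccard(a, b):
    # Jaccard similarity of two rows given as sets of column indices.
    return len(a & b) / len(a | b)

def cluster_rows(rows, threshold):
    # Greedy agglomerative clustering of sparse-matrix rows; a cluster's
    # rows are emitted (appended to the new row order) once its size
    # reaches `threshold`. Each cluster is (representing_row, members).
    clusters = [(i, [i]) for i in range(len(rows))]
    order = []
    while len(clusters) > 1:
        # Pick the pair of clusters whose representing rows are most similar.
        _, i, j = max(
            ((jaccard(rows[clusters[i][0]], rows[clusters[j][0]]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0])
        (rep_a, mem_a), (rep_b, mem_b) = clusters[i], clusters[j]
        # Representing row of the merged cluster: that of the larger
        # cluster; on a size tie, the row with the smaller index.
        if len(mem_a) != len(mem_b):
            rep = rep_a if len(mem_a) > len(mem_b) else rep_b
        else:
            rep = min(rep_a, rep_b)
        merged = (rep, mem_a + mem_b)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        if len(merged[1]) >= threshold:
            order.extend(merged[1])   # emit the finished cluster
        else:
            clusters.append(merged)
    for _, members in clusters:       # flush any leftover cluster
        order.extend(members)
    return order
```

On a toy matrix, the two identical rows are grouped first, then the two rows sharing columns 0 and 4, matching the intuition in the quoted passage.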
In [3.2, Pages 380-381]:
The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering.
- wherein compression of the weight or activation matrix of the first layer further includes: establishing a hash bucket for each of a plurality of sparsity patterns for the matrix;
In [2.1, Page 377]:
The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices
In [2.1, Page 377]:
As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value
[Image: media_image7.png (Greyscale, reproduced from Jiang)]
(BRI: Fig 1 represents a sparsity pattern matrix)
In [3.2, Page 381 ]:
The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure.
- decomposing the matrix into a plurality of tiles;
In [3.1, Page 380]:
Consider again the sparse matrix in Fig 1a. Row 0 has two identical columns (i.e. location of non-zeroes) with row 4, while row 1 has one identical column with row 5.
In [1, Page 377]:
[Image: media_image8.png (Greyscale, reproduced from Jiang)]
In [3.1, Page 380]:
If we exchange row 1 and row 4, we can get a reordered matrix as shown in Fig 4a. Applying ASpT to the reordered matrix will give us a tiled matrix as shown in Fig 4b. We can see that the number of nonzeros in the dense tiles increases to 9. This means that more data movement from the global memory can be saved during the execution.
[Image: media_image9.png (Greyscale, reproduced from Jiang)]
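The effect of the row exchange described above (more nonzeros landing in dense tiles) can be illustrated with a toy count; the panel height, density rule, and example rows are hypothetical simplifications of ASpT, not Jiang's exact algorithm:

```python
def dense_tile_nonzeros(rows, panel_height, min_rows=2):
    # For each panel of consecutive rows, treat a column as "dense" if
    # at least min_rows rows of the panel have a nonzero there, and
    # count the nonzeros falling in such columns.
    total = 0
    for p in range(0, len(rows), panel_height):
        counts = {}
        for r in rows[p:p + panel_height]:
            for c in r:
                counts[c] = counts.get(c, 0) + 1
        total += sum(n for n in counts.values() if n >= min_rows)
    return total
```

Swapping two rows so that rows with similar nonzero columns share a panel raises the count, which is the data-locality gain the reordering targets.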
- and wherein a compression ratio is dynamically adjusted and optimized by changing a tile size.
In [1, Page 377]:
the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality. We find the poor data locality issue is quite common for real-world data. For 351 of the 1084 matrices from the SuiteSparse collection and the Network Repository [34], the adaptive-tiling method has less than 1% of the nonzeros in the dense tiles
In [2.3, Page 378]: Adaptive Sparse Tiling
As shown in Alg 1 and Alg 2, both SpMM and SDDMM involve traversing the nonzeros of the sparse matrix S and accessing the data of the input dense matrices.
[Image: media_image10.png (Greyscale, reproduced from Jiang)]
In [2.3, Page 379]: Adaptive Sparse Tiling
[Image: media_image11.png (Greyscale, reproduced from Jiang)]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang.
Vijay teaches a hashing function.
Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling.
One of ordinary skill would have been motivated to combine Vijay and Jiang because the combination provides reduced memory consumption and computation complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).
In regard to claim 12: (Currently Amended)
Vijay discloses:
- wherein sparsification of the at least two of the plurality of layers of the DNN further includes:
In [0018]:
a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function. At run-time, the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes
(BRI: a layer and a previous layer are two layers, and the sparsification is achieved by reducing the computation)
- selecting the first subset of neurons based at least in part on which neurons of the first layer have the highest activation values.
In [0027] :
When the number of nodes in a particular layer y of the neural network 102 is large, the classification system 100 only needs output from the K nodes with the highest probabilities of activating based on the activation vector x. For instance, when the particular layer y is an output layer, the classification system 100 only needs output for the top K classes of the output layer and can determine the top K vectors W.sup.(K), from a weight matrix W,
In [0032]:
the classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes.
Claims 5 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over
Sudheendra Vijayanarasimhan et al. (hereinafter Vijay), US 2016/0180200 A1,
[Note: In similarity search - previously searched prior art]
in view of Peng Jiang et al. (hereinafter Jiang), "A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs," Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, published Feb. 19, 2020,
further in view of Abhinav Vishnu et al. (hereinafter VISHNU), US 2020/0097822 A1.
In regard to claim 5: (Currently Amended)
Vijay discloses:
- sparsification of the at least two of the layers of the DNN
In [0025] :
The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero.
(BRI: setting all but certain outputs to zero is a sparsification technique)
- further includes activating the selected first subset of the neurons of the first layer
In [0009] :
Accordingly, FIGS. 1-3 illustrate techniques for improving system performance leveraging heterogeneous system architectures. In some embodiments, a heterogeneous processing system, including at least one central processing unit (CPU) core and at least one graphics processing unit (GPU) core, is configured to compute an activation for each one of a plurality of neurons for a first network layer of a neural network.
In [0009]:
The coalesced neural network, as represented by the set of coalesced activation sub-matrices, improves compute efficiency of heterogeneous CPU-GPU architecture by coalescing the activations from each neural network layer into contiguous space which allows for more efficient execution of operations at the processing system.
- and deactivating all other neurons of the first layer.
In [0016]:
The processor 102 further includes a dropout module 140 for performing dropout operations. As described in further detail below, the dropout module 140 is configured to randomly drop a first subset of the plurality of neurons for a first neural network layer and keep a second subset of the plurality of neurons for the first neural network layer. The dropout module 140 instructs one of the CPU cores 106, 108, 110 to coalesce forwarded activation to generate a set of coalesced activation sub-matrices that are contiguous in a memory of the processing system, thereby leveraging the CPU cores 106, 108, 110 for re-organizing a dropped neural network to use GPU architecture more effectively when training a neural network.
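For illustration only, the dropout-and-coalesce behavior that VISHNU [0016] attributes to the dropout module can be sketched as follows; the array shapes and names are illustrative assumptions:

```python
# Minimal sketch of VISHNU's dropout-and-coalesce: randomly keep a
# subset of neurons and pack the surviving activations into a
# contiguous sub-matrix.
import numpy as np

rng = np.random.default_rng(1)
activations = rng.standard_normal((4, 10))   # batch of 4, layer of 10 neurons

keep = rng.random(10) > 0.5                  # randomly drop ~half the neurons
coalesced = np.ascontiguousarray(activations[:, keep])
# The dropped neurons are deactivated; the kept subset now occupies a
# contiguous block of memory, the layout the GPU executes efficiently.
```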
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay, Jiang and VISHNU.
Vijay teaches hashing function.
Jiang teaches compression of weight matrix and grouping into hash buckets and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling.
VISHNU teaches activation and deactivation.
One of ordinary skill would have been motivated to combine Vijay, Jiang and VISHNU because the combination can improve neural network performance (VISHNU [0007]).
In regard to claim 13 (Currently Amended)
Vijay discloses:
- sparsification of the at least two of the layers of the DNN
In [0025] :
The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero.
(BRI: setting all but certain outputs to zero is a sparsification technique)
- further includes activating the selected first subset of the neurons of the first layer
In [0009] :
Accordingly, FIGS. 1-3 illustrate techniques for improving system performance leveraging heterogeneous system architectures. In some embodiments, a heterogeneous processing system, including at least one central processing unit (CPU) core and at least one graphics processing unit (GPU) core, is configured to compute an activation for each one of a plurality of neurons for a first network layer of a neural network.
In [0009]:
The coalesced neural network, as represented by the set of coalesced activation sub-matrices, improves compute efficiency of heterogeneous CPU-GPU architecture by coalescing the activations from each neural network layer into contiguous space which allows for more efficient execution of operations at the processing system.
- and deactivating all other neurons of the first layer.
In [0016]:
The processor 102 further includes a dropout module 140 for performing dropout operations. As described in further detail below, the dropout module 140 is configured to randomly drop a first subset of the plurality of neurons for a first neural network layer and keep a second subset of the plurality of neurons for the first neural network layer. The dropout module 140 instructs one of the CPU cores 106, 108, 110 to coalesce forwarded activation to generate a set of coalesced activation sub-matrices that are contiguous in a memory of the processing system, thereby leveraging the CPU cores 106, 108, 110 for re-organizing a dropped neural network to use GPU architecture more effectively when training a neural network.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay, Jiang and VISHNU.
Vijay teaches hashing function.
Jiang teaches compression of weight matrix and grouping into hash buckets and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling.
VISHNU teaches activation and deactivation.
One of ordinary skill would have been motivated to combine Vijay, Jiang and VISHNU because the combination can improve neural network performance (VISHNU [0007]).
Claims 16 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over
Sudheendra Vijayanarasimhan et al. (hereinafter Vijay), US 2016/0180200 A1,
[Note: In similarity search- previously searched prior art]
In view of Peng Jiang et al. (hereinafter Jiang), "A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs," Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, published Feb. 19, 2020,
in view of Abhinav VISHNU et al. (hereinafter VISHNU), US 2020/0097822 A1.
In regard to claim 16: (Currently Amended)
Vijay discloses:
- A computing system comprising: one or more processors; and a memory to store data for processing of a deep neural network (DNN) a plurality of layers, each layer including a plurality of neurons
in [0006] :
embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions,
in [0070]:
Computers suitable for the execution of a computer program include, by way of example, can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data”,
in [0017]:
This specification describes a method for neural networks, e.g., deep neural networks, that enables approximate computation of matrix products at various layers so that the number of output dimensions at a particular layer in the neural network can be increased by several orders of magnitude, while keeping the computation cost about the same and with little loss in accuracy,
in [0006] :
In general, one innovative aspect of the subject matter described in this specification can be embodied in methods for processing an input through each of multiple layers of a neural network to generate an output, wherein each of the multiple layers of the neural network includes respective multiple nodes include the actions of for a particular layer of the multiple layers.
- wherein the computing system is operable to perform neural network layer sparsification, including the computer system to:
In [0018]:
a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication,
In [0018]:
a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function. At run-time, the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes
In [0007]:
The method may include creating a modified activation vector by setting the values in the activation vector that correspond to the nodes that were not selected to zero,
In [0023]:
the matrix multiplication may be a product of the activation vector x from a previous layer in the neural network or an initial input for the neural network, e.g., when the particular layer is an input layer in the neural network, and the weights W. The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y.
in [0025]:
The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero.
(BRI: sparsity)
- utilize locality sensitive hashing to map inputs to a first layer of the DNN to a plurality of hash table buckets, including:
In [0018]:
the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes
(BRI: the storage of weight matrix in a hash table using WTA is a “hash bucket”)
In [0044] :
the classification system 100 may use a WTA hash function that defines an ordinal embedding. For instance, as P.fwdarw.∞, the dot product between two WTA hashes tends to the rank correlation between the underlying vectors and WTA hashes are well suited as a basis for locality-sensitive hashing. This may result in a more robust proxy for dot product similarity and may be used to determine a count for each of the nodes that represents the ordinal similarity between the node and the activation vector x.
in [Abstract]:
selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer.
- applying one or more locality sensitive hashing algorithms to the plurality of inputs to the first layer,
In [0023]:
The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y,
In [0006]:
methods for processing an input through each of multiple layers of a neural network to generate an output, wherein each of the multiple layers of the neural network includes respective multiple nodes include the actions of for a particular layer of the multiple layers,
in [0006]:
an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer,
- wherein the plurality of inputs to the first layer comes from the sparsification of a prior layer of the plurality of layers of the DNN
In [0052]:
At 210, the classification system processes the activation vector using the selected nodes to generate an output for the particular layer. For example, the classification system performs matrix multiplication using the activation vector x and the determined weight vectors and then applies a function to the result of the matrix multiplication to generate an output for each selected node and sets all other output values for the particular layer y to zero.
In [0025]:
The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero.
- wherein the one or more previous sets of inputs relate to a first subset of the plurality of neurons,
In [0018] :
a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication. For instance, a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function.
In [0018] :
the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations.
(BRI: the activations from previous layer to determine which nodes in the current layers are most likely to be activated provide a relationship)
In [0032]:
The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes.
In [0043]:
In some implementations, the classification system 100 may retrieve a corresponding count for each node from the hash table 104. For instance, each count may provide a lower bound for the dot product between the activation vector x and the weight vector for the node. The count may represent the ordinal similarity between the two vectors. The classification system 100 may select the K nodes with the greatest ordinal similarity between the two vectors
- select the first subset of neurons as the subset of neurons for activation based at least in part on the relation of the first subset of neurons to the one or more previous sets of inputs,
In [0006]:
receiving, by a classification system, an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer.
In [0008]:
selecting the nodes in the particular layer using the activation vector and the hash table may include selecting the one or more nodes in the particular layer by using the integers as input to the hash table. The integers may include a first subset and a second, mutually exclusive subset
In [0032]:
The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes.
In [0017]:
For instance, a neural network may use matrix multiplication W*x during a classification process, where x is the input from a layer and W refers to the weights of the connections to the next layer's outputs.
However, Jiang discloses:
- and applying a locality sensitive hashing function to map each of the plurality of tiles to the hash buckets based on sparsity patterns found in the tiles:
In [3.2, Page 380]:
The idea of LSH is to hash the nodes to be
In [3.2, Page 381]:
clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure
In [3.2, Page 381]:
We will use LSH as a black-box function in our clustering algorithm.
In [3.2, Page 381]:
Alg 3 shows our clustering-based row-reordering procedure. First, we use LSH to get a list of candidate pairs of rows that are likely to have large values of similarity score.
[media_image13.png: Alg 3, the clustering-based row-reordering procedure]
- detecting similarity between a received set of inputs of the plurality of inputs and one or more previous sets of inputs based on results of the application of the one or more locality sensitive hashing algorithms
In [Abstract, Page 376]:
SpMM (multiplication of a sparse matrix and a dense matrix) and SDDMM(sampled dense-dense matrix multiplication) are at the core of many scientific, machine learning, and data mining applications.
In [6, Page 386]:
Several efforts have sought to improve the performance of sparse matrix multiplication by reordering the input data and defining new representations for sparse matrices.
In [Abstract, Page 376]:
We focus on performing the row-reordering efficiently, by using a hierarchical clustering procedure optimized by locality-sensitive hashing.
In [6, Page 386]:
Several efforts have sought to improve the performance of sparse matrix multiplication by reordering the input data and defining new representations for sparse matrices
In [3.2, Page 380]: Row-Reordering by Clustering
We have shown that reordering the rows of the sparse matrix can improve data locality for SpMM and SDDMM.
As shown in Fig 2a, a row of the output matrix Y is the weighted sum of certain rows of the dense input matrix X, which are selected based on the non-zeros in the sparse matrix S.
[media_image1.png: Fig 2a, SpMM output row as a weighted sum of rows of X]
In [Abstract, Page 377]:
the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality.
In [3.2, Page 380]:
We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns. A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows:
[media_image2.png: Jaccard similarity formula]
For example, row 0 of the sparse matrix in Fig 1a has nonzeros in column 0 and 4 (i.e., S.sub.0 = {0,4}), while row 4 has nonzeros in column 0, 3 and 4 (i.e., S.sub.4 = {0,3,4}).
[media_image3.png: Fig 1a example sparse matrix]
The similarity between row 0 and 4 is |S.sub.0 ∩ S.sub.4|/|S.sub.0 ∪ S.sub.4| = 2/3. Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters. The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster.
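For illustration only, Jiang's Jaccard similarity between sparse rows can be computed directly on the Fig 1a example quoted above:

```python
# Jaccard similarity between two sparse rows, each represented as the
# set of its nonzero column indices (Jiang [3.2]).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

s0 = {0, 4}       # S.sub.0: column indices of nonzeros in row 0
s4 = {0, 3, 4}    # S.sub.4: column indices of nonzeros in row 4
sim = jaccard(s0, s4)   # |{0, 4}| / |{0, 3, 4}| = 2/3
```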
In [3.2, Page 380]:
Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering procedure
In [2.2, Page 378]:
In SDDMM, two dense matrices X and Y are multiplied and the result matrix is then scaled by an input sparse matrix S (Hadamard product). As shown in Fig 2b, an element in the output matrix (O[i][j]) is nonzero only if the corresponding element in the input sparse matrix (S[i][j]) is nonzero,
[media_image4.png: Fig 2b, SDDMM]
- mapping the received set of inputs to the first layer to one of a plurality of hash table buckets based on the detected similarity between the received set of inputs and the one or more previous sets of inputs of the first layer,
In [3.2, Page 380]:
The idea of LSH is to hash the nodes to be
In [3.2, Page 381]:
clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure. The performance of LSH is controlled by two parameters: siglen (the length of the signature) and bsize (the size of the band). The larger the siglen, the more accurate the hashing. The smaller the bsize, the more likely two nodes will be hashed into the same bucket. The total time complexity is siglen × nnz + siglen/bsize × N + d.sub.max × E, where nnz is the number of nonzeros in the sparse matrix, N is the number of rows, d.sub.max is the number of nonzeros in the longest row, and E is the number of candidate pairs generated
in [1, Page 377]:
The goal of this work is to achieve consistently higher performance for SpMM and SDDMM.
in [1, Page 377]:
We propose to use a row-reordering procedure to group similar rows together. More specifically, we first reorder the rows of the sparse matrix such that rows with non-zeroes at similar locations are put in the same row panel. This brings more non-zeros to the dense tiles and increases data reuse in the shared memory.
In [3.2, Page 381]:
Alg 3 shows our clustering-based row-reordering procedure. First, we use LSH to get a list of candidate pairs of rows that are likely to have large values of similarity score.
[media_image13.png: Alg 3, the clustering-based row-reordering procedure]
(BRI: grouping similar rows. The LSH used to provide candidate pairs of rows held in similarity queues does provide similarity between the received set of inputs and the one or more previous sets of inputs of the first layer, as a result of hashing the input items and grouping similar items into the same bucket. See code line 2 in Alg 3 for the similarity queue)
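For illustration only, the banded LSH procedure Jiang describes can be sketched as a minhash scheme; only the parameter names siglen and bsize come from the paper, and the hash construction below is an illustrative assumption:

```python
# Banded minhash LSH: each row's column-index set gets a signature of
# length siglen; signatures are split into bands of bsize values, and
# rows sharing any band land in the same bucket, becoming candidate
# pairs for clustering (Jiang [3.2]).
import random
from collections import defaultdict
from itertools import combinations

def candidate_pairs(rows, siglen=8, bsize=2, seed=0):
    rnd = random.Random(seed)
    # siglen independent hash functions over column indices
    hashes = [lambda c, a=rnd.randrange(1, 1 << 20), b=rnd.randrange(1 << 20):
              (a * c + b) % 104729 for _ in range(siglen)]
    buckets = defaultdict(set)
    for i, cols in enumerate(rows):
        sig = tuple(min(h(c) for c in cols) for h in hashes)  # minhash signature
        for band in range(0, siglen, bsize):   # one bucket key per band
            buckets[(band, sig[band:band + bsize])].add(i)
    pairs = set()                              # rows sharing a bucket pair up
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

rows = [{0, 4}, {1, 3}, {0, 3, 4}, {7}]        # rows as column-index sets
pairs = candidate_pairs(rows)                  # similar rows tend to pair up
```

Larger siglen makes the signatures more faithful to Jaccard similarity; smaller bsize makes bucket collisions, and therefore candidate pairs, more likely, matching the parameter discussion quoted from [3.2].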
- wherein compression of the weight or activation matrix of the first layer further includes: establishing a hash bucket for each of a plurality of sparsity patterns for the matrix;
In [2.1, Page 377]: Compressed Sparse Row Representation
The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices.
In [2.1, Page 377]:
As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr [i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row. The corresponding values of these nonzeros are stored in the array value. For example, in Fig 1b, the second element of rowptr is 2, which indicates that the column index and the value of the first nonzero in the second row of the sparse matrix is in colidx[2] and value[2].
(BRI: as would be known to the POSITA, the CSR is a representation of the weight matrix of a layer in a NN that stores and manipulates the sparse matrix)
In [2.1, Page 377]:
The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices
In [2.1, Page 377]:
As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value
[media_image7.png: Fig 1, CSR representation]
(BRI: Fig 1 represents a sparsity pattern matrix)
In [3.2, Page 381 ]:
The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure.
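For illustration only, the CSR layout described in [2.1] can be reconstructed in a few lines; the matrix values below are illustrative, chosen so that rowptr[1] == 2 as in the quoted Fig 1b example:

```python
# CSR layout (Jiang [2.1]): rowptr[i] is the index of the first nonzero
# of row i within colidx/value, so rowptr[1] == 2 means row 1 starts at
# colidx[2] and value[2].
import numpy as np

dense = np.array([[5., 0., 0., 0., 7.],
                  [0., 3., 0., 9., 0.],
                  [0., 0., 2., 0., 0.]])

rowptr, colidx, value = [0], [], []
for row in dense:
    nz = np.nonzero(row)[0]                  # column indices of nonzeros
    colidx.extend(int(c) for c in nz)
    value.extend(float(row[c]) for c in nz)
    rowptr.append(len(colidx))               # start of the next row
```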
- decomposing the matrix into a plurality of tiles;
In [3.1, Page 380]:
Consider again the sparse matrix in Fig 1a. Row 0 has two identical columns (i.e. location of non-zeroes) with row 4, while row 1 has one identical column with row 5.
In [1, Page 377]:
[media_image8.png: figure from Jiang [1, Page 377]]
In [3.1, Page 380]:
If we exchange row 1 and row 4, we can get a reordered matrix as shown in Fig 4a. Applying ASpT to the reordered matrix will give us a tiled matrix as shown in Fig 4b. We can see that the number of nonzeros in the dense tiles increases to 9. This means that more data movement from the global memory can be saved during the execution.
[media_image9.png: Fig 4, reordered and tiled matrix]
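For illustration only, the reordering effect described in [3.1] and Fig 4 can be sketched by counting the nonzeros covered by dense tiles before and after grouping similar rows; the matrix, tile size, and density threshold below are illustrative assumptions, not Fig 4 itself:

```python
# Decompose the matrix into (panel x tile) blocks and count nonzeros
# covered by "dense" blocks; row reordering that groups rows with
# similar sparsity patterns increases that count (Jiang [3.1]).
import numpy as np

def dense_tile_nonzeros(mat, panel=2, tile=2, threshold=3):
    # A block is dense if it holds >= threshold nonzeros.
    total = 0
    for r in range(0, mat.shape[0], panel):
        for c in range(0, mat.shape[1], tile):
            nnz = np.count_nonzero(mat[r:r + panel, c:c + tile])
            if nnz >= threshold:
                total += nnz
    return total

m = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])

before = dense_tile_nonzeros(m)               # interleaved rows: no dense tiles
after = dense_tile_nonzeros(m[[0, 2, 1, 3]])  # similar rows grouped together
```

Varying the tile and panel parameters changes how many nonzeros are captured by dense tiles, which is the lever behind adjusting the tiling to the sparsity structure of the input.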
- and wherein a compression ratio is dynamically adjusted and optimized by changing a tile size.
In [1, Page 377]:
the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality. We find the poor data locality issue is quite common for real-world data. For 351 of the 1084 matrices from the SuiteSparse collection and the Network Repository [34], the adaptive-tiling method has less than 1% of the nonzeros in the dense tiles
in [2.3 , Page 378]: Adaptive Sparse Tiling
As shown in Alg 1 and Alg 2, both SpMM and SDDMM in volve traversing the nonzeros of the sparse matrix S and accessing the data of the input dense matrices.
[media_image10.png: Alg 1 and Alg 2, SpMM and SDDMM]
in [2.3 , Page 379]: Adaptive Sparse Tiling
[media_image11.png: adaptive sparse tiling figure]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang.
Vijay teaches hashing function.
Jiang teaches compression of weight matrix and grouping into hash buckets and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling.
One of ordinary skill would have been motivated to combine Vijay and Jiang because the combination provides reduced memory consumption and computation complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).
Vijay and Jiang do not explicitly disclose:
- and activate the selected first subset of neurons of the first layer and deactivate all other neurons of the first layer
However, VISHNU discloses:
- and activate the selected first subset of neurons of the first layer and deactivate all other neurons of the first layer
In [0009] :
Accordingly, FIGS. 1-3 illustrate techniques for improving system performance leveraging heterogeneous system architectures. In some embodiments, a heterogeneous processing system, including at least one central processing unit (CPU) core and at least one graphics processing unit (GPU) core, is configured to compute an activation for each one of a plurality of neurons for a first network layer of a neural network.
In [0009]:
the coalesced neural network, as represented by the set of coalesced activation sub-matrices, improves compute efficiency of heterogeneous CPU-GPU architecture by coalescing the activations from each neural network layer into contiguous space which allows for more efficient execution of operations at the processing system.
In [0016]:
The processor 102 further includes a dropout module 140 for performing dropout operations. As described in further detail below, the dropout module 140 is configured to randomly drop a first subset of the plurality of neurons for a first neural network layer and keep a second subset of the plurality of neurons for the first neural network layer. The dropout module 140 instructs one of the CPU cores 106, 108, 110 to coalesce forwarded activation to generate a set of coalesced activation sub-matrices that are contiguous in a memory of the processing system, thereby leveraging the CPU cores 106, 108, 110 for re-organizing a dropped neural network to use GPU architecture more effectively when training a neural network.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay, Jiang and VISHNU.
Vijay teaches hashing function.
Jiang teaches compression of weight matrix and grouping into hash buckets and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling.
VISHNU teaches activation and deactivation.
One of ordinary skill would have been motivated to combine Vijay, Jiang and VISHNU because the combination can improve neural network performance (VISHNU [0007]).
In regard to claim 18: (Original)
Vijay does not explicitly disclose:
- wherein compression of the weight or activation matrix of the first layer further includes: compressing the matrix including storing an identification for each of the sparsity patterns mapped by the plurality of tiles.
However, Jiang discloses:
- wherein compression of the weight or activation matrix of the first layer further includes: compressing the matrix including storing an identification for each of the sparsity patterns mapped by the plurality of tiles.
In [1, Page 377]:
The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices.
[media_image12.png: CSR figure]
In [2.1, Page 377]:
As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr [i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row.
(BRI: the CSR does store the indices of the sparsity values)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang.
Vijay teaches hashing function.
Jiang teaches compression of weight matrix and grouping into hash buckets and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling.
One of ordinary skill in the art would have been motivated to combine Vijay and Jiang because the combination can provide reduced memory consumption and computational complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).
In regard to claim 19: (Original)
Vijay does not explicitly disclose:
wherein compression of the matrix of the first layer includes:
- grouping connections of the first layer of the DNN into hash buckets;
- and combining values of the grouped connections of each hash bucket to generate a group value for the grouped connections.
However, Jiang discloses:
- wherein compression of the matrix of the first layer includes:
grouping connections of the first layer of the DNN into hash buckets;
In [2.1, Page 377]: Compressed Sparse Row Representation
The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices.
In [2.1, Page 377]:
As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row. The corresponding values of these nonzeros are stored in the array value. For example, in Fig 1b, the second element of rowptr is 2, which indicates that the column index and the value of the first nonzero in the second row of the sparse matrix is in colidx[2] and value[2].
(BRI: As known to the POSITA, the CSR is a representation of the weight matrix of a layer in a NN that stores and manipulates a sparse matrix)
In [1, Page 377]:
The goal of this work is to achieve consistently higher performance for SpMM and SDDMM.
In [1, Page 377]:
We propose to use a row-reordering procedure to group similar rows together. More specifically, we first reorder the rows of the sparse matrix such that rows with non-zeroes at similar locations are put in the same row panel. This brings more non-zeros to the dense tiles and increases data reuse in the shared memory.
In [3.2, Pages 380-381]:
The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering.
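For illustration only, the bucketing idea quoted above can be sketched with a simple min-hash style LSH; the hash functions and row sets below are hypothetical, and Jiang's actual scheme may differ:

```python
import random

def minhash_bucket(col_set, hash_funcs):
    """LSH signature for a row (viewed as a set of column indices):
    the tuple of min-hashes; similar rows tend to share a bucket."""
    return tuple(min(h(c) for c in col_set) for h in hash_funcs)

rng = random.Random(42)
P = 101  # small prime modulus for the illustrative hash family
params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(2)]
hash_funcs = [lambda x, a=a, b=b: (a * x + b) % P for a, b in params]

# Hypothetical rows: row id -> set of nonzero column indices
rows = {0: {0, 4}, 1: {1}, 4: {0, 3, 4}}
buckets = {}
for row_id, cols in rows.items():
    buckets.setdefault(minhash_bucket(cols, hash_funcs), []).append(row_id)
```

Rows whose column-index sets overlap heavily are likely (though not guaranteed) to receive the same signature and land in the same bucket, making them candidate pairs for clustering.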
In [5.4, Page 386]:
The ratios of the preprocessing time to the actual computation time for the 416 matrices that need row-reordering are summarized in Tables 3 and 4 for SpMM and SDDMM, respectively. We can see that, when K = 512, more than 85% of the matrices have preprocessing time that is less than 10x of the actual computation time of SpMM, and more than 90% matrices have preprocessing time that is less than 10x of the actual computation time of SDDMM.
[media_image6.png (greyscale): reproduced Tables 3 and 4 from Jiang (preprocessing-to-computation time ratios for SpMM and SDDMM)]
In [6, Page 386]:
Compressed Sparse Blocks (CSB) [3] is another sparse matrix storage format that exploits register blocking. The sparse matrix is partitioned and stored as small rectangular blocks. SpMM implementation with CSB data representation has been demonstrated to achieve high performance when both SpMM and transposed SpMM
(BRI: the smaller rectangular blocks of the sparse matrix represent tiling)
- and combining values of the grouped connections of each hash bucket to generate a group value for the grouped connections;
In [3.2, Page 380]:
We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns. A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows:
[media_image2.png (greyscale): the Jaccard similarity formula, J(S_i, S_j) = |S_i ∩ S_j| / |S_i ∪ S_j|]
For example, row 0 of the sparse matrix in Fig 1a has nonzeros in columns 0 and 4 (i.e., S_0 = {0, 4}), while row 4 has nonzeros in columns 0, 3 and 4 (i.e., S_4 = {0, 3, 4}).
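For illustration only, the Jaccard similarity between the two quoted row sets can be checked directly; the helper name is hypothetical:

```python
def jaccard(s1, s2):
    """Jaccard similarity |S1 ∩ S2| / |S1 ∪ S2| between two rows'
    column-index sets."""
    return len(s1 & s2) / len(s1 | s2)

S0 = {0, 4}      # row 0: nonzeros in columns 0 and 4
S4 = {0, 3, 4}   # row 4: nonzeros in columns 0, 3 and 4
sim = jaccard(S0, S4)  # |{0, 4}| / |{0, 3, 4}| = 2/3
```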
In [3.2, Page 380]:
Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters. The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster. Whenever the size of a cluster reaches a threshold, we output the rows in it and remove the cluster. Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering
(BRI: the similarity measure is the combined value of each bucket)
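For illustration only, the greedy merge-by-similarity procedure quoted above can be sketched as follows; the row sets and size threshold are hypothetical, and this is a simplified rendering rather than Jiang's exact implementation:

```python
def cluster_rows(row_sets, threshold):
    """Greedy hierarchical clustering of rows by the Jaccard similarity
    of their representing rows, per the quoted procedure."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # Each cluster: (representing row id, list of member row ids)
    clusters = [(rid, [rid]) for rid in sorted(row_sets)]
    output = []
    while len(clusters) > 1:
        # Pick the pair of clusters whose representing rows are most similar
        i, j = max(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: jaccard(row_sets[clusters[ij[0]][0]],
                                   row_sets[clusters[ij[1]][0]]),
        )
        a, b = clusters[i], clusters[j]
        # Representing row: that of the larger cluster; ties -> smaller index
        if len(a[1]) != len(b[1]):
            rep = a[0] if len(a[1]) > len(b[1]) else b[0]
        else:
            rep = min(a[0], b[0])
        merged = (rep, a[1] + b[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        # Output a cluster (and remove it) once it reaches the threshold
        if len(merged[1]) >= threshold:
            output.append(sorted(merged[1]))
        else:
            clusters.append(merged)
    output.extend(sorted(c[1]) for c in clusters)
    return output

rows = {0: {0, 4}, 1: {1, 2}, 2: {1, 2, 3}, 4: {0, 3, 4}}
groups = cluster_rows(rows, threshold=2)  # -> [[0, 4], [1, 2]]
```

Rows 0 and 4 merge first (Jaccard 2/3), as do rows 1 and 2, so rows with many common columns end up grouped for data reuse.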
In [3.2, Pages 380-381]:
The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang.
Vijay teaches hashing function.
Jiang teaches compression of weight matrix and grouping into hash buckets and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling.
One of ordinary skill in the art would have been motivated to combine Vijay and Jiang because the combination can provide reduced memory consumption and computational complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).
Conclusion
Any inquiry concerning this communication or earlier communications from the
examiner should be directed to TIRUMALE KRISHNASWAMY RAMESH whose telephone number is (571)272-4605. The examiner can normally be reached by phone.
Examiner interviews are available via telephone, in-person, and video conferencing
using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on phone (571-272-3768). The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be
obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit:
https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for
information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TIRUMALE K RAMESH/Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121