Prosecution Insights
Last updated: April 19, 2026
Application No. 17/129,590

NEURAL NETWORK DENSE LAYER SPARSIFICATION AND MATRIX COMPRESSION

Status: Non-Final OA (§103)
Filed: Dec 21, 2020
Examiner: RAMESH, TIRUMALE K
Art Unit: 2121
Tech Center: 2100 — Computer Architecture & Software
Assignee: Intel Corporation
OA Round: 5 (Non-Final)
Grant Probability: 18% (At Risk)
Projected OA Rounds: 5-6
Projected Time to Grant: 4y 5m
Grant Probability with Interview: 20%

Examiner Intelligence

Career Allowance Rate: 18% (grants only 18% of cases; 7 granted / 40 resolved; -37.5% vs TC average)
Interview Lift: +2.1% (minimal lift across resolved cases with interview)
Average Prosecution Time: 4y 5m (typical timeline)
Career History: 80 total applications across all art units, 40 currently pending

Statute-Specific Performance

§101: 30.7% (-9.3% vs TC avg)
§103: 59.1% (+19.1% vs TC avg)
§102: 3.7% (-36.3% vs TC avg)
§112: 5.4% (-34.6% vs TC avg)
Tech Center averages are estimates. Based on career data from 40 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 2/3/2026 has been entered.

Response to Arguments (Submitted 2/3/2026)

The Applicant's arguments with respect to claims 1, 10, and 16 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The examiner notes that claims 7-8, 11, 14-15, 17, and 20 are CANCELED.

- The Applicant on Page 8 argues on "completeness and clarity of examiner's action," stating that the examiner has not provided any suggestions regarding claim allowability and/or amendments that may permit allowability in the first Office Action (Non-Final Rejection dated 2/14/2024). Examiner's Response: The examiner disagrees with the above argument on completeness. The examiner did not raise the matters specified in the form because the claims were not found allowable at the time of the first action.

- The Applicant on Page 8 argues that, with regard to the amendments, the claims were not rendered obvious under the prior-art combination of Vijay, Milton, and Das. On Page 9, the Applicant elaborated these arguments in the specific context of the limitations drawn from claims 7 and 8 (claims 14 and 15) into the independent claims.
Examiner's Response: The examiner submits and interprets that the current focus of the invention is that, for each layer, a hash mapping is applied after the sparsification using LSH to activate a weight matrix with a compressed representation. This is evident from claims 1, 10, and 16 as amended; the extensive amendments bringing limitations from the dependent claims into the independent claims may have provided better clarity on the invention to the examiner, enabling new grounds of rejection. The examiner has used new references "Jiang" and "VISHNU" with new grounds of rejection to teach the amendments. The examiner submits that, after careful review of the claims and based on the §103 combination of Vijay in view of Jiang, in view of VISHNU, the claims do not support allowability.

In CONCLUSION, the examiner rejects independent claims 1, 10, and 16 and dependent claims 3-6, 9, 12-13, and 18-19 in a NON-FINAL REJECTION under §103 following the RCE.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-4, 6, 9-10, and 12-13 are rejected under 35 U.S.C.
103 as being unpatentable over Sudheendra Vijayanarasimhan et al. (hereinafter Vijay), US 2016/0180200 A1, in view of Peng Jiang et al. (hereinafter Jiang), "A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs," Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, published Feb 19, 2020, in view of Abhinav VISHNU et al. (hereinafter VISHNU), US 2020/0097822 A1.

In regard to claim 1: (Currently Amended) Vijay discloses:

- An apparatus comprising: one or more processors; a memory to store data for processing, including a plurality of layers, each layer including a plurality of neurons, in [0070]: Computers suitable for the execution of a computer program include, by way of example, can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data, in [0006]: In general, one innovative aspect of the subject matter described in this specification can be embodied in methods for processing an input through each of multiple layers of a neural network to generate an output, wherein each of the multiple layers of the neural network includes respective multiple nodes include the actions of for a particular layer of the multiple layers, in [0004]: For instance, the matrix multiplication is a combination of an activation vector, e.g., input for the particular layer, and a weight matrix, e.g., the weights for some of the nodes in the particular layer.
The neural network uses a fast locality-sensitive hashing technique to approximate a result of the matrix multiplication to allow the neural network to generate scores for a large number, e.g., millions, of output classes, in [0008]: Selecting the nodes in the particular layer using the activation vector and the hash table may include selecting the one or more nodes in the particular layer by using the integers as input to the hash table. The integers may include a first subset and a second, mutually exclusive subset.

- the one or more processors to perform sparsification of at least two of the plurality of the layers of the DNN, including selecting a subset of the plurality of neurons of a first layer of the DNN for activation based at least in part on locality sensitive hashing of inputs to the first layer, the sparsification of one or more layers including: in [0025]: The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero. (BRI: This is a form of "sparsity" known as "activation sparsity", which is common in ML and offers many key benefits) in [0041]: In some implementations, during training and run-time, the classification system 100 divides the activation vector x that contains n elements into M portions that each contain n/M elements. The classification system 100, during training, creates a hash table 104 for each portion, {T.sub.m: m=1 . . . M}.
For instance, the classification system 100 determines the hash code for each portion x.sub.m of the activation vector x, using the method described above or another appropriate method, and uses the resulting hash code as an index to the corresponding hash table T.sub.m, in [0042] : During run-time, the classification system 100 similarly determines the hash code for each portion x.sub.m of a corresponding activation vector x and uses the portions x.sub.m as indices to the corresponding hash tables T.sub.m to determine a set of all identifiers of the nodes for which to perform matrix multiplication. In some examples, each hash table T.sub.m has only one entry for each index and the classification system 100 determines the set of all identifiers of the nodes using the corresponding hash codes, e.g., to determine at most M nodes. In some examples, the hash table or hash tables 104 include only one entry for each of the nodes in the particular layer y, (BRI: the M portions represent hashing of patterns) In [0004]: The neural network uses a fast locality-sensitive hashing technique to approximate a result of the matrix multiplication to allow the neural network to generate scores for a large number, e.g., millions, of output classes, in [0023]: FIG. 1 is an example of a classification system 100 that uses a hash table to determine for which nodes in a particular layer y to perform matrix multiplication using an activation vector x. 
in [0018]: For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes” (BRI: the contribution of determined nodes to perform the matrix operation provide patterns in the matrix) in [0023]: The matrix multiplication may be a product of the activation vector x from a previous layer in the neural network or an initial input for the neural network, e.g., when the particular layer is an input layer in the neural network, and the weights W. The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer. In [0018]: a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication, In [0018]: a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function. In [0018]: the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. 
For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes (BRI: the storage of weight matrix in a hash table using WTA is a “hash bucket”) in [0006]: an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer (BRI: Using a layer and its previous layer to reduce computation explicitly represents using at least two layers for sparsification. In this process, the output of the previous layer is the input for the current layer, and the weights connecting the two may be pruned to reduce the number of calculations. Using activations from a previous layer can reduce computation in specific neural network architectures and scenarios, though it depends on how the activations are used) In [0007]: The method may include creating a modified activation vector by setting the values in the activation vector that correspond to the nodes that were not selected to zero, In [0023]: the matrix multiplication may be a product of the activation vector x from a previous layer in the neural network or an initial input for the neural network, e.g., when the particular layer is an input layer in the neural network, and the weights W. The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y. 
In [0027]: For instance, when the particular layer y is an output layer, the classification system 100 only needs output for the top K classes of the output layer and can determine the top K vectors W.sup.(K), from a weight matrix W, that have the largest dot products with the activation vector x. The classification system 100 computes the probabilities for only these K classes, and sets the probabilities of all other classes to zero. (BRI: Setting the node to zero and setting the probability to zero can deactivate the node)

- applying one or more locality sensitive hashing algorithms to the plurality of inputs to the first layer, In [0023]: The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y, In [0006]: methods for processing an input through each of multiple layers of a neural network to generate an output, wherein each of the multiple layers of the neural network includes respective multiple nodes include the actions of for a particular layer of the multiple layers, in [0006]: an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer,

- wherein the plurality of inputs to the first layer comes from the sparsification of a prior layer of the plurality of layers of the DNN. In [0052]: At 210, the classification system processes the activation vector using the selected nodes to generate an output for the particular layer. For example, the classification system performs matrix multiplication using the activation vector x and the determined weight vectors and then applies a function to the result of the matrix multiplication to generate an output for each selected node and sets all other output values for the particular layer y to zero.
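The hash-based node selection the examiner cites from Vijay can be illustrated with a small, hypothetical sketch (not code from the application or the reference; random-hyperplane LSH stands in here for Vijay's WTA hashing, and all names are illustrative): nodes whose weight vectors hash to the same bucket as the activation vector are computed, and every other output is set to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_code(v, planes):
    # Sign-of-projection hash code: nearby vectors tend to share codes.
    return tuple(bool(b) for b in (planes @ v) > 0)

n_in, n_out, n_bits = 8, 16, 3
planes = rng.standard_normal((n_bits, n_in))
W = rng.standard_normal((n_out, n_in))  # weight matrix, one row per node

# Index each node under the hash code of its weight vector
# (the role played by Vijay's "hash table 104").
table = {}
for node, w in enumerate(W):
    table.setdefault(lsh_code(w, planes), []).append(node)

def sparse_layer_output(x):
    # Compute outputs only for nodes in x's bucket; zero all other outputs
    # ("activation sparsity").
    selected = table.get(lsh_code(x, planes), [])
    y = np.zeros(n_out)
    for node in selected:
        y[node] = W[node] @ x
    return y, selected

x = rng.standard_normal(n_in)
y, selected = sparse_layer_output(x)
```

Only the selected subset of nodes incurs a dot product; the remaining entries of the layer output are exactly zero, matching the behavior quoted from [0025] and [0027].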
In [0025]: The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero.

- wherein the one or more previous sets of inputs relate to a first subset of the plurality of neurons, In [0018]: a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication. For instance, a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function. In [0018]: the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. (BRI: using the activations from the previous layer to determine which nodes in the current layer are most likely to be activated provides a relationship) In [0032]: The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes. In [0043]: In some implementations, the classification system 100 may retrieve a corresponding count for each node from the hash table 104. For instance, each count may provide a lower bound for the dot product between the activation vector x and the weight vector for the node. The count may represent the ordinal similarity between the two vectors.
The classification system 100 may select the K nodes with the greatest ordinal similarity between the two vectors.

- and selecting the first subset of neurons as the subset of neurons for activation based at least in part on the relation of the first subset of neurons to the one or more previous sets of inputs, In [0006]: receiving, by a classification system, an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer. In [0008]: Selecting the nodes in the particular layer using the activation vector and the hash table may include selecting the one or more nodes in the particular layer by using the integers as input to the hash table. The integers may include a first subset and a second, mutually exclusive subset. In [0032]: The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes. In [0017]: For instance, a neural network may use matrix multiplication W*x during a classification process, where x is the input from a layer and W refers to the weights of the connections to the next layer's outputs.

However, Jiang discloses:

- detecting similarity between a received set of inputs of the plurality of inputs and one or more previous sets of inputs based on results of the application of the one or more locality sensitive hashing algorithms. In [Abstract, Page 376]: SpMM (multiplication of a sparse matrix and a dense matrix) and SDDMM (sampled dense-dense matrix multiplication) are at the core of many scientific, machine learning, and data mining applications.
In [6, Page 386]: Several efforts have sought to improve the performance of sparse matrix multiplication by reordering the input data and defining new representations for sparse matrices. In [Abstract, Page 376]: We focus on performing the row-reordering efficiently, by using a hierarchical clustering procedure optimized by locality-sensitive hashing. In [3.2, Page 380]: Row-Reordering by Clustering. We have shown that reordering the rows of the sparse matrix can improve data locality for SpMM and SDDMM. As shown in Fig 2a, a row of the output matrix Y is the weighted sum of certain rows of the dense input matrix X, which are selected based on the non-zeros in the sparse matrix S. In [Abstract, Page 377]: the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality. In [3.2, Page 380]: We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns. A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows: sim(S_i, S_j) = |S_i ∩ S_j| / |S_i ∪ S_j|. For example, row 0 of the sparse matrix in Fig 1a has nonzeros in columns 0 and 4 (i.e., S0 = {0,4}), while row 4 has nonzeros in columns 0, 3 and 4 (i.e., S4 = {0,3,4}).
The similarity between rows 0 and 4 is |S0 ∩ S4| / |S0 ∪ S4| = 2/3. Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters. The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster. In [3.2, Page 380]: Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering procedure ... In [2.2, Page 378]: In SDDMM, two dense matrices X and Y are multiplied and the result matrix is then scaled by an input sparse matrix S (Hadamard product).
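The Jaccard row-similarity computation quoted above can be reproduced directly; this is an illustrative sketch, using the example row sets from Jiang's Fig 1a rather than code from either reference.

```python
def jaccard(a, b):
    # Jaccard similarity between two sets of nonzero column indices.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

s0 = {0, 4}     # nonzero columns of row 0 in Jiang's Fig 1a
s4 = {0, 3, 4}  # nonzero columns of row 4

sim = jaccard(s0, s4)  # |S0 ∩ S4| / |S0 ∪ S4| = 2/3, as in the quoted passage
```

Rows with many shared nonzero columns score close to 1 and are candidates to be placed in the same row panel.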
As shown in Fig 2b, an element in the output matrix (O[i][j]) is nonzero only if the corresponding element in the input sparse matrix (S[i][j]) is nonzero.

- mapping the received set of inputs to the first layer to one of a plurality of hash table buckets based on the detected similarity between the received set of inputs and the one or more previous sets of inputs of the first layer, In [3.2, Pages 380-381]: The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure. The performance of LSH is controlled by two parameters: siglen (the length of the signature) and bsize (the size of the band). The larger the siglen, the more accurate the hashing. The smaller the bsize, the more likely two nodes will be hashed into the same bucket. The total time complexity is siglen × nnz + (siglen / bsize) × N + d_max × E, where nnz is the number of nonzeros in the sparse matrix, N is the number of rows, d_max is the number of nonzeros in the longest row, and E is the number of candidate pairs generated. In [3.2, Page 381]: Alg 3 shows our clustering-based row-reordering procedure. First, we use LSH to get a list of candidate pairs of rows that are likely to have large values of similarity score.
These candidate pairs are put in a priority queue (sim_queue) to enable a fast querying of the pair with the largest similarity. (BRI: LSH that provides candidate pairs of rows in similarity queues does represent the received set of inputs and the previous set of inputs of the first layer by hashing the input items and grouping the items into the same bucket)

- wherein the one or more processors are further to perform compression of a weight or activation matrix of one or more layers of the DNN, including detection of sparsity patterns in a weight or activation matrix of the first layer of the DNN based at least in part on locality sensitive hashing of patterns in the matrix; In [2.1, Page 377]: Compressed Sparse Row Representation. The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices. In [2.1, Page 377]: As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row. The corresponding values of these nonzeros are stored in the array value. For example, in Fig 1b, the second element of rowptr is 2, which indicates that the column index and the value of the first nonzero in the second row of the sparse matrix is in colidx[2] and value[2]. (BRI: Perhaps as known to the POSITA, the CSR is a representation of the weight matrix of a layer in a NN that stores and manipulates sparse matrices)

- wherein compression of the matrix of the first layer includes: grouping connections of the first layer of the DNN into hash buckets; in [1, Page 377]: The goal of this work is to achieve consistently higher performance for SpMM and SDDMM. in [1, Page 377]: We propose to use a row-reordering procedure to group similar rows together.
More specifically, we first reorder the rows of the sparse matrix such that rows with non-zeroes at similar locations are put in the same row panel. This brings more non-zeros to the dense tiles and increases data reuse in the shared memory. In [3.2, Pages 380-381]: The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering. In [5.4, Page 386]: The ratios of the preprocessing time to the actual computation time for the 416 matrices that need row-reordering are summarized in Tables 3 and 4 for SpMM and SDDMM, respectively. We can see that, when K = 512, more than 85% of the matrices have preprocessing time that is less than 10x of the actual computation time of SpMM, and more than 90% of the matrices have preprocessing time that is less than 10x of the actual computation time of SDDMM. In [6, Page 386]: Compressed Sparse Blocks (CSB) [3] is another sparse matrix storage format that exploits register blocking. The sparse matrix is partitioned and stored as small rectangular blocks. SpMM implementation with CSB data representation has been demonstrated to achieve high performance when both SpMM and transposed SpMM ... (BRI: smaller rectangular blocks of the sparse matrix represent tiling)

- and combining values of the grouped connections of each hash bucket to generate a group value for the grouped connections; In [3.2, Page 380]: We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns.
A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows: sim(S_i, S_j) = |S_i ∩ S_j| / |S_i ∪ S_j|. For example, row 0 of the sparse matrix in Fig 1a has nonzeros in columns 0 and 4 (i.e., S0 = {0,4}), while row 4 has nonzeros in columns 0, 3 and 4 (i.e., S4 = {0,3,4}). In [3.2, Page 380]: Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters. The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster. Whenever the size of a cluster reaches a threshold, we output the rows in it and remove the cluster. Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering procedure ... (BRI: the similarity measure is the combined value of each bucket) In [3.2, Pages 380-381]: The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar.
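The LSH bucketing step described in the cited passages can be sketched with MinHash signatures and banding. `siglen` and `bsize` follow the parameter names in the quote; everything else (the hash family, the example rows) is a hypothetical illustration, not Jiang's implementation.

```python
import random

random.seed(0)

P = 2_147_483_647  # large prime modulus for the universal hash family

def make_hashers(siglen):
    # Universal hash functions h(x) = (a*x + b) mod P for MinHash.
    return [(random.randrange(1, P), random.randrange(P)) for _ in range(siglen)]

def signature(cols, hashers):
    # MinHash signature: per hash function, the minimum hash over the column set.
    return [min((a * c + b) % P for c in cols) for (a, b) in hashers]

def lsh_candidate_pairs(rows, siglen=8, bsize=2):
    hashers = make_hashers(siglen)
    buckets = {}
    for i, cols in enumerate(rows):
        sig = signature(cols, hashers)
        # Band the signature: rows sharing any full band land in the same bucket.
        for start in range(0, siglen, bsize):
            key = (start, tuple(sig[start:start + bsize]))
            buckets.setdefault(key, []).append(i)
    # Pairs co-located in any bucket become candidate pairs for clustering.
    pairs = set()
    for members in buckets.values():
        for i in members:
            for j in members:
                if i < j:
                    pairs.add((i, j))
    return pairs

rows = [{0, 4}, {1, 5}, {0, 4}, {7}]  # nonzero-column set per sparse-matrix row
pairs = lsh_candidate_pairs(rows)      # identical rows 0 and 2 always collide
```

Only pairs that share a bucket are ever compared, which is the source of the reduced memory consumption and computation complexity the examiner quotes from Jiang [3.2].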
The pairs of nodes within the buckets are considered candidate pairs for clustering.

- wherein compression of the weight or activation matrix of the first layer further includes: establishing a hash bucket for each of a plurality of sparsity patterns for the matrix; In [2.1, Page 377]: The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices. In [2.1, Page 377]: As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value. (BRI: Fig 1 represents a sparsity pattern matrix) In [3.2, Page 381]: The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure.

- decomposing the matrix into a plurality of tiles; In [3.1, Page 380]: Consider again the sparse matrix in Fig 1a. Row 0 has two identical columns (i.e., locations of non-zeroes) with row 4, while row 1 has one identical column with row 5. In [3.1, Page 380]: If we exchange row 1 and row 4, we can get a reordered matrix as shown in Fig 4a. Applying ASpT to the reordered matrix will give us a tiled matrix as shown in Fig 4b. We can see that the number of nonzeros in the dense tiles increases to 9. This means that more data movement from the global memory can be saved during the execution.

- and wherein a compression ratio is dynamically adjusted and optimized by changing a tile size. In [1, Page 377]: the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data.
If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality. We find the poor data locality issue is quite common for real-world data. For 351 of the 1084 matrices from the SuiteSparse collection and the Network Repository [34], the adaptive-tiling method has less than 1% of the nonzeros in the dense tiles. In [2.3, Page 378]: Adaptive Sparse Tiling. As shown in Alg 1 and Alg 2, both SpMM and SDDMM involve traversing the nonzeros of the sparse matrix S and accessing the data of the input dense matrices.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang. Vijay teaches a hashing function. Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling. One of ordinary skill would have been motivated to combine Vijay and Jiang because the combination provides reduced memory consumption and computation complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).

In regard to claim 3: (Currently Amended) Vijay discloses: wherein sparsification of the at least two of the plurality of layers of the DNN further includes:

- selecting the first subset of neurons based at least in part on which neurons of the first layer have the highest activation values. In [0027]: When the number of nodes in a particular layer y of the neural network 102 is large, the classification system 100 only needs output from the K nodes with the highest probabilities of activating based on the activation vector x.
For instance, when the particular layer y is an output layer, the classification system 100 only needs output for the top K classes of the output layer and can determine the top K vectors W.sup.(K), from a weight matrix W. In [0032]: the classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes.

In regard to claim 4: (Previously Presented) Vijay discloses:

- utilizing locality sensitive hashing includes applying one or more locality sensitive hash functions and mapping to one or more hash tables for the first layer.

In [0046]: the neural network 102 may include multiple layers for which approximate computation of matrix products is performed. In [0033]: For instance, when the neural network 102 receives an image of a car as input, the neural network 102 may identify the top K nodes for the input image in the output layer y as nodes that represent a truck (y.sub.0) or a tree (y.sub.3). In [0033]: The classification system 100 uses the updated weight vectors to determine new hash codes for these nodes and places identifiers for these nodes in the hash table 104 at the location pointed to by the new hash codes.

In regard to claim 6: (Previously Presented) Vijay discloses:

- the selected first subset of neurons includes a certain percentage of a total number of neurons of the first layer.

In [0018]: For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes. In [0019]: The classification system retrieves the top K of those weight vectors, e.g., from the hash table or another location in memory, with K being much smaller than the number of nodes in the particular neural network layer y.
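For illustration, the top-K node selection that the examiner reads onto these limitations (Vijay [0027]) can be sketched as follows. This is a hypothetical reconstruction, not code from the reference: the WTA hash-table lookup is replaced by an exact dot-product search for clarity, and all names are illustrative.

```python
def top_k_layer_output(W, x, k):
    """Illustrative top-K output selection (cf. Vijay [0027]):
    compute the dot product of the activation vector x with every
    node's weight vector, keep outputs only for the k nodes with
    the largest scores, and set all other outputs to zero."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    # indices of the k nodes with the largest scores
    top_k = sorted(range(len(scores)), key=scores.__getitem__)[-k:]
    out = [0.0] * len(scores)
    for i in top_k:
        out[i] = scores[i]  # all non-selected nodes remain zero
    return out, set(top_k)

W = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
x = [1.0, 1.0]
out, selected = top_k_layer_output(W, x, 1)
```

The zeroed outputs for non-selected nodes correspond to the "activation sparsity" reading applied elsewhere in this action.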
In regard to claim 9: (Original) Vijay does not explicitly disclose:

- wherein compression of the weight or activation matrix of the first layer further includes: compressing the matrix including storing an identification for each of the sparsity patterns mapped by the plurality of tiles.

However, Jiang discloses:

- wherein compression of the weight or activation matrix of the first layer further includes: compressing the matrix including storing an identification for each of the sparsity patterns mapped by the plurality of tiles.

In [1, Page 377]: The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices. [Figure: Jiang Fig 1] In [2.1, Page 377]: As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row. (BRI: the CSR does store the indices of the sparsity values)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang. Vijay teaches a hashing function. Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling. One of ordinary skill would have been motivated to combine Vijay and Jiang because the combination can provide reduced memory consumption and computation complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).
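The CSR layout quoted from Jiang [2.1] can be sketched directly from the description above; this minimal reconstruction (function name and example matrix are illustrative, not from the reference) shows the three arrays rowptr, colidx and value.

```python
def to_csr(dense):
    """Build the three CSR arrays described in Jiang [2.1]:
    rowptr[i] is the index of the first element of row i in
    colidx/value; colidx holds the column indices of the nonzeros
    row by row; value holds the corresponding nonzero values."""
    rowptr, colidx, value = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                colidx.append(j)
                value.append(v)
        rowptr.append(len(colidx))  # start of the next row
    return rowptr, colidx, value

rowptr, colidx, value = to_csr([[5, 0, 0],
                                [0, 0, 7],
                                [0, 3, 0]])
```

As in the quoted passage, rowptr[i+1] - rowptr[i] gives the number of nonzeros in row i, which is how the format encodes each row's sparsity pattern.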
In regard to claim 10: (Currently Amended) Vijay discloses:

- One or more non-transitory computer-readable storage mediums having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: (in [0065])

- receiving data for a deep neural network (DNN), the DNN including a plurality of layers, each layer including a plurality of neurons;

In [0006]: receiving, by a classification system, an activation vector as input for the particular layer, selecting one or more nodes in the particular layer.

- and processing the DNN, including performing sparsification of at least two of the plurality of layers of the DNN, including selecting a subset of the plurality of neurons of a first layer of the DNN for activation based at least in part on locality sensitive hashing of inputs to the first layer, the sparsification of the at least two of the plurality of layers including

In [0025]: The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero. (BRI: This is a form of “sparsity” known as “activation sparsity”, which is common in ML and offers many key benefits) In [0041]: In some implementations, during training and run-time, the classification system 100 divides the activation vector x that contains n elements into M portions that each contain n/M elements. The classification system 100, during training, creates a hash table 104 for each portion, {T.sub.m: m=1 . . . M}.
For instance, the classification system 100 determines the hash code for each portion x.sub.m of the activation vector x, using the method described above or another appropriate method, and uses the resulting hash code as an index to the corresponding hash table T.sub.m. In [0042]: During run-time, the classification system 100 similarly determines the hash code for each portion x.sub.m of a corresponding activation vector x and uses the portions x.sub.m as indices to the corresponding hash tables T.sub.m to determine a set of all identifiers of the nodes for which to perform matrix multiplication. In some examples, each hash table T.sub.m has only one entry for each index and the classification system 100 determines the set of all identifiers of the nodes using the corresponding hash codes, e.g., to determine at most M nodes. In some examples, the hash table or hash tables 104 include only one entry for each of the nodes in the particular layer y. (BRI: the M portions represent hashing of patterns) In [0004]: The neural network uses a fast locality-sensitive hashing technique to approximate a result of the matrix multiplication to allow the neural network to generate scores for a large number, e.g., millions, of output classes. In [0023]: FIG. 1 is an example of a classification system 100 that uses a hash table to determine for which nodes in a particular layer y to perform matrix multiplication using an activation vector x.
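The M-portion hashing quoted above (Vijay [0041]-[0042]) can be sketched as follows. This is an illustrative stand-in: the reference does not specify this hash, so the sign-pattern code and all names are hypothetical; only the split of an n-element vector into M portions of n/M elements, one hash code per portion, follows the quote.

```python
def portion_hashes(x, M, buckets=8):
    """Split an activation vector of n elements into M portions of
    n/M elements each (cf. Vijay [0041]) and derive one hash code
    per portion, usable as an index into the corresponding hash
    table T_m. The sign-pattern hash below is a toy stand-in."""
    step = len(x) // M
    codes = []
    for m in range(M):
        portion = x[m * step:(m + 1) * step]
        # toy hash: binary sign pattern folded into a bucket index
        bits = sum((1 << i) for i, v in enumerate(portion) if v > 0)
        codes.append(bits % buckets)
    return codes

codes = portion_hashes([0.5, -1.0, 2.0, 3.0], M=2)
```

Each code would then index its own table T_m to collect candidate node identifiers, as in the quoted run-time procedure.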
In [0018]: For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes. (BRI: the contribution of determined nodes to perform the matrix operation provides patterns in the matrix) In [0023]: The matrix multiplication may be a product of the activation vector x from a previous layer in the neural network or an initial input for the neural network, e.g., when the particular layer is an input layer in the neural network, and the weights W. The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer. In [0018]: a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication. In [0018]: a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function. In [0018]: the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations.
For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes. (BRI: the storage of a weight matrix in a hash table using WTA is a “hash bucket”) In [0006]: an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer. (BRI: Using a layer and its previous layer to reduce computation explicitly represents using at least two layers for sparsification. In this process, the output of the previous layer is the input for the current layer, and the weights connecting the two may be pruned to reduce the number of calculations. Using activations from a previous layer can reduce computation in specific neural network architectures and scenarios, though it depends on how the activations are used.) In [0007]: The method may include creating a modified activation vector by setting the values in the activation vector that correspond to the nodes that were not selected to zero. In [0023]: the matrix multiplication may be a product of the activation vector x from a previous layer in the neural network or an initial input for the neural network, e.g., when the particular layer is an input layer in the neural network, and the weights W. The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y.
In [0027]: For instance, when the particular layer y is an output layer, the classification system 100 only needs output for the top K classes of the output layer and can determine the top K vectors W.sup.(K), from a weight matrix W, that have the largest dot products with the activation vector x. The classification system 100 computes the probabilities for only these K classes, and sets the probabilities of all other classes to zero. (BRI: Setting a node to zero and setting its probability to zero can deactivate the node)

- applying one or more locality sensitive hashing algorithms to the plurality of inputs to the first layer,

In [0023]: The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y. In [0006]: methods for processing an input through each of multiple layers of a neural network to generate an output, wherein each of the multiple layers of the neural network includes respective multiple nodes, include the actions of, for a particular layer of the multiple layers, in [0006]: an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer,

- wherein the plurality of inputs to the first layer comes from the sparsification of a prior layer of the plurality of layers of the DNN

In [0052]: At 210, the classification system processes the activation vector using the selected nodes to generate an output for the particular layer. For example, the classification system performs matrix multiplication using the activation vector x and the determined weight vectors and then applies a function to the result of the matrix multiplication to generate an output for each selected node and sets all other output values for the particular layer y to zero.
In [0025]: The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero.

- wherein the one or more previous sets of inputs relate to a first subset of the plurality of neurons,

In [0018]: a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication. For instance, a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function. In [0018]: the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. (BRI: using the activations from the previous layer to determine which nodes in the current layer are most likely to be activated provides a relationship) In [0032]: The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes. In [0043]: In some implementations, the classification system 100 may retrieve a corresponding count for each node from the hash table 104. For instance, each count may provide a lower bound for the dot product between the activation vector x and the weight vector for the node. The count may represent the ordinal similarity between the two vectors.
The classification system 100 may select the K nodes with the greatest ordinal similarity between the two vectors.

- and selecting the first subset of neurons as the subset of neurons for activation based at least in part on the relation of the first subset of neurons to the one or more previous sets of inputs,

In [0006]: receiving, by a classification system, an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer. In [0008]: Selecting the nodes in the particular layer using the activation vector and the hash table may include selecting the one or more nodes in the particular layer by using the integers as input to the hash table. The integers may include a first subset and a second, mutually exclusive subset. In [0032]: The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes. In [0017]: For instance, a neural network may use matrix multiplication W*x during a classification process, where x is the input from a layer and W refers to the weights of the connections to the next layer's outputs.

However, Jiang discloses:

- detecting similarity between a received set of inputs of the plurality of inputs and one or more previous sets of inputs based on results of the application of the one or more locality sensitive hashing algorithms

In [Abstract, Page 376]: SpMM (multiplication of a sparse matrix and a dense matrix) and SDDMM (sampled dense-dense matrix multiplication) are at the core of many scientific, machine learning, and data mining applications.
In [6, Page 386]: Several efforts have sought to improve the performance of sparse matrix multiplication by reordering the input data and defining new representations for sparse matrices. In [Abstract, Page 376]: We focus on performing the row-reordering efficiently, by using a hierarchical clustering procedure optimized by locality-sensitive hashing. In [3.2, Page 380]: Row-Reordering by Clustering. We have shown that reordering the rows of the sparse matrix can improve data locality for SpMM and SDDMM. As shown in Fig 2a, a row of the output matrix Y is the weighted sum of certain rows of the dense input matrix X, which are selected based on the non-zeros in the sparse matrix S. [Figure: Jiang Fig 2a] In [1, Page 377]: the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality. In [3.2, Page 380]: We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns. A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows: sim(S_i, S_j) = |S_i ∩ S_j| / |S_i ∪ S_j|. For example, row 0 of the sparse matrix in Fig 1a has nonzeros in columns 0 and 4 (i.e., S_0 = {0, 4}), while row 4 has nonzeros in columns 0, 3 and 4 (i.e., S_4 = {0, 3, 4}).
[Figure from Jiang] The similarity between rows 0 and 4 is |S_0 ∩ S_4| / |S_0 ∪ S_4| = 2/3. Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters. The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster. In [3.2, Page 380]: Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering procedure ... In [2.2, Page 378]: In SDDMM, two dense matrices X and Y are multiplied and the result matrix is then scaled by an input sparse matrix S (Hadamard product).
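The Jaccard-based row similarity quoted above follows directly from the reference's definition and can be written out as a short sketch (the function name is illustrative):

```python
def jaccard(row_a, row_b):
    """Jaccard similarity between two sparse-matrix rows, each
    represented as the set of its nonzero column indices
    (cf. Jiang [3.2]): |A ∩ B| / |A ∪ B|."""
    a, b = set(row_a), set(row_b)
    return len(a & b) / len(a | b)

# Jiang's worked example: row 0 has nonzeros in columns {0, 4},
# row 4 in columns {0, 3, 4}; the similarity is 2/3.
sim = jaccard({0, 4}, {0, 3, 4})
```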
As shown in Fig 2b, an element in the output matrix (O[i][j]) is nonzero only if the corresponding element in the input sparse matrix (S[i][j]) is nonzero. [Figure: Jiang Fig 2b]

- mapping the received set of inputs to the first layer to one of a plurality of hash table buckets based on the detected similarity between the received set of inputs and the one or more previous sets of inputs of the first layer,

In [3.2, Pages 380-381]: The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure. The performance of LSH is controlled by two parameters: siglen (the length of the signature) and bsize (the size of the band). The larger the siglen, the more accurate the hashing. The smaller the bsize, the more likely two nodes will be hashed into the same bucket. The total time complexity is siglen × nnz + (siglen/bsize) × N + d_max × E, where nnz is the number of nonzeros in the sparse matrix, N is the number of rows, d_max is the number of nonzeros in the longest row, and E is the number of candidate pairs generated.

- wherein the one or more processors are further to perform compression of a weight or activation matrix of one or more layers of the DNN, including detection of sparsity patterns in a weight or activation matrix of the first layer of the DNN based at least in part on locality sensitive hashing of patterns in the matrix;

In [2.1, Page 377]: Compressed Sparse Row Representation. The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices.
In [2.1, Page 377]: As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row. The corresponding values of these nonzeros are stored in the array value. For example, in Fig 1b, the second element of rowptr is 2, which indicates that the column index and the value of the first nonzero in the second row of the sparse matrix is in colidx[2] and value[2]. (BRI: Perhaps as known to the POSITA, the CSR is a representation of the weight matrix of a layer in a NN that stores and manipulates a sparse matrix)

- wherein compression of the matrix of the first layer includes: grouping connections of the first layer of the DNN into hash buckets;

In [1, Page 377]: The goal of this work is to achieve consistently higher performance for SpMM and SDDMM. In [1, Page 377]: We propose to use a row-reordering procedure to group similar rows together. More specifically, we first reorder the rows of the sparse matrix such that rows with non-zeroes at similar locations are put in the same row panel. This brings more non-zeros to the dense tiles and increases data reuse in the shared memory. In [3.2, Pages 380-381]: The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering. In [5.4, Page 386]: The ratios of the preprocessing time to the actual computation time for the 416 matrices that need row-reordering are summarized in Tables 3 and 4 for SpMM and SDDMM, respectively.
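The LSH bucketing cited above, with its siglen and bsize parameters (Jiang [3.2]), can be sketched as a toy minhash-banding pass. This is an assumed reconstruction: the exact hash family is not specified in the quoted text, so the per-position hash below is a stand-in, and all names are illustrative.

```python
import random

def lsh_candidate_pairs(rows, siglen=8, bsize=2, seed=0):
    """Toy LSH banding over sparse rows (sets of column indices),
    following the siglen/bsize parameters quoted from Jiang [3.2]:
    a minhash-style signature of length siglen is split into bands
    of bsize positions, and rows that agree on any whole band land
    in the same bucket and become candidate pairs for clustering."""
    rng = random.Random(seed)
    salts = [rng.randrange(1 << 30) for _ in range(siglen)]
    # one signature per row: min hash over its column indices
    sigs = [[min(hash((s, c)) for c in row) for s in salts] for row in rows]
    buckets = {}
    for i, sig in enumerate(sigs):
        for b in range(0, siglen, bsize):
            key = (b, tuple(sig[b:b + bsize]))  # one bucket per band value
            buckets.setdefault(key, []).append(i)
    pairs = set()
    for members in buckets.values():
        for i in members:
            for j in members:
                if i < j:
                    pairs.add((i, j))
    return pairs

pairs = lsh_candidate_pairs([{0, 4}, {1, 3}, {0, 4}])
```

Rows with identical column sets always share every band, so they are guaranteed to surface as a candidate pair; dissimilar rows usually do not, which is the complexity saving the quoted passage describes.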
We can see that, when K = 512, more than 85% of the matrices have preprocessing time that is less than 10x of the actual computation time of SpMM, and more than 90% of the matrices have preprocessing time that is less than 10x of the actual computation time of SDDMM. [Tables 3 and 4 from Jiang] In [6, Page 386]: Compressed Sparse Blocks (CSB) [3] is another sparse matrix storage format that exploits register blocking. The sparse matrix is partitioned and stored as small rectangular blocks. SpMM implementation with CSB data representation has been demonstrated to achieve high performance when both SpMM and transposed SpMM ... (BRI: smaller rectangular blocks of the sparse matrix represent tiling)

- and combining values of the grouped connections of each hash bucket to generate a group value for the grouped connections;

In [3.2, Page 380]: We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns. A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows: sim(S_i, S_j) = |S_i ∩ S_j| / |S_i ∪ S_j|. For example, row 0 of the sparse matrix in Fig 1a has nonzeros in columns 0 and 4 (i.e., S_0 = {0, 4}), while row 4 has nonzeros in columns 0, 3 and 4 (i.e., S_4 = {0, 3, 4}). In [3.2, Page 380]: Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters.
The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster. Whenever the size of a cluster reaches a threshold, we output the rows in it and remove the cluster. Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering ... (BRI: the similarity measure is the combined value of each bucket) In [3.2, Pages 380-381]: The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering.

- wherein compression of the weight or activation matrix of the first layer further includes: establishing a hash bucket for each of a plurality of sparsity patterns for the matrix;

In [2.1, Page 377]: The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices. In [2.1, Page 377]: As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value. [Figure: Jiang Fig 1] (BRI: Fig 1 represents a sparsity pattern matrix) In [3.2, Page 381]: The pairs of nodes within the buckets are considered candidate pairs for clustering.
Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure.

- decomposing the matrix into a plurality of tiles;

In [3.1, Page 380]: Consider again the sparse matrix in Fig 1a. Row 0 has two identical columns (i.e. location of non-zeroes) with row 4, while row 1 has one identical column with row 5. In [1, Page 377]: [Figure from Jiang] In [3.1, Page 380]: If we exchange row 1 and row 4, we can get a reordered matrix as shown in Fig 4a. Applying ASpT to the reordered matrix will give us a tiled matrix as shown in Fig 4b. We can see that the number of nonzeros in the dense tiles increases to 9. This means that more data movement from the global memory can be saved during the execution. [Figure: Jiang Fig 4]

- and wherein a compression ratio is dynamically adjusted and optimized by changing a tile size.

In [1, Page 377]: the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality. We find the poor data locality issue is quite common for real-world data. For 351 of the 1084 matrices from the SuiteSparse collection and the Network Repository [34], the adaptive-tiling method has less than 1% of the nonzeros in the dense tiles. In [2.3, Page 378]: Adaptive Sparse Tiling. As shown in Alg 1 and Alg 2, both SpMM and SDDMM involve traversing the nonzeros of the sparse matrix S and accessing the data of the input dense matrices.
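The tiling limitation cited above turns on how the tile size changes the fraction of nonzeros captured in dense tiles (Jiang [1], [2.3]). A minimal sketch of that dependence follows; the half-full density threshold and all names are illustrative assumptions, not taken from the reference.

```python
def dense_tile_fraction(matrix, tile_rows, tile_cols):
    """Decompose a dense 2-D list into tile_rows x tile_cols tiles
    and report the fraction of all nonzeros that fall in 'dense'
    tiles (here: tiles at least half full). Changing the tile size
    changes this fraction, which is the knob behind the claimed
    compression-ratio adjustment."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    in_dense = total = 0
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            r1 = min(r0 + tile_rows, n_rows)
            c1 = min(c0 + tile_cols, n_cols)
            nnz = sum(1 for r in range(r0, r1)
                        for c in range(c0, c1) if matrix[r][c] != 0)
            total += nnz
            if nnz * 2 >= (r1 - r0) * (c1 - c0):  # illustrative threshold
                in_dense += nnz
    return in_dense / total if total else 0.0

M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
frac = dense_tile_fraction(M, 2, 2)
```

With 2x2 tiles, four of the five nonzeros in this toy matrix land in a dense tile; a different tile size yields a different fraction, illustrating the tile-size-dependent behavior the rejection relies on.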
[Figure from Jiang §2.3] In [2.3, Page 379]: Adaptive Sparse Tiling. [Figure from Jiang §2.3]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang. Vijay teaches a hashing function. Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling. One of ordinary skill would have been motivated to combine Vijay and Jiang because the combination can provide reduced memory consumption and computation complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).

In regard to claim 12: (Currently Amended) Vijay discloses:

- wherein sparsification of the at least two of the plurality of layers of the DNN further includes:

In [0018]: a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function. At run-time, the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes. (BRI: a layer and a previous layer are two layers, and the sparsification reduces the computation)

- selecting the first subset of neurons based at least in part on which neurons of the first layer have the highest activation values.

In [0027]: When the number of nodes in a particular layer y of the neural network 102 is large, the classification system 100 only needs output from the K nodes with the highest probabilities of activating based on the activation vector x.
For instance, when the particular layer y is an output layer, the classification system 100 only needs output for the top K classes of the output layer and can determine the top K vectors W.sup.(K), from a weight matrix W. In [0032]: the classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes.

Claims 5 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Sudheendra Vijayanarasimhan et al. (hereinafter Vijay), US 2016/0180200 A1 [Note: in similarity search; previously searched prior art], in view of Peng Jiang et al. (hereinafter Jiang), A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs, Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, published Feb 19, 2020, further in view of Abhinav VISHNU et al. (hereinafter VISHNU), US 2020/0097822 A1.

In regard to claim 5: (Currently Amended) Vijay discloses:

- sparsification of the at least two of the layers of the DNN

In [0025]: The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero. (BRI: setting all but certain outputs to zero is a sparsification technique)

- further includes activating the selected first subset of the neurons of the first layer

In [0009]: Accordingly, FIGS.
1-3 illustrate techniques for improving system performance leveraging heterogeneous system architectures. In some embodiments, a heterogeneous processing system, including at least one central processing unit (CPU) core and at least one graphics processing unit (GPU) core, is configured to compute an activation for each one of a plurality of neurons for a first network layer of a neural network. In [0009]: the coalesced neural network, as represented by the set of coalesced activation sub-matrices, improves compute efficiency of heterogeneous CPU-GPU architecture by coalescing the activations from each neural network layer into contiguous space which allows for more efficient execution of operations at the processing system. - and deactivating all other neurons of the first layer. In [0016]: The processor 102 further includes a dropout module 140 for performing dropout operations. As described in further detail below, the dropout module 140 is configured to randomly drop a first subset of the plurality of neurons for a first neural network layer and keep a second subset of the plurality of neurons for the first neural network layer. The dropout module 140 instructs one of the CPU cores 106, 108, 110 to coalesce forwarded activation to generate a set of coalesced activation sub-matrices that are contiguous in a memory of the processing system, thereby leveraging the CPU cores 106, 108, 110 for re-organizing a dropped neural network to use GPU architecture more effectively when training a neural network. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay, Jiang and VISHNU. Vijay teaches a hashing function. Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling. VISHNU teaches activation and deactivation.
One of ordinary skill would have been motivated to combine Vijay, Jiang and VISHNU because the combination can improve neural network performance (VISHNU [0007]). In regard to claim 13: (Currently Amended) Vijay discloses: - sparsification of the at least two of the layers of the DNN In [0025]: The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero. (BRI: setting all but certain outputs to zero is a sparsification technique) - further includes activating the selected first subset of the neurons of the first layer In [0009]: Accordingly, FIGS. 1-3 illustrate techniques for improving system performance leveraging heterogeneous system architectures. In some embodiments, a heterogeneous processing system, including at least one central processing unit (CPU) core and at least one graphics processing unit (GPU) core, is configured to compute an activation for each one of a plurality of neurons for a first network layer of a neural network. In [0009]: the coalesced neural network, as represented by the set of coalesced activation sub-matrices, improves compute efficiency of heterogeneous CPU-GPU architecture by coalescing the activations from each neural network layer into contiguous space which allows for more efficient execution of operations at the processing system. - and deactivating all other neurons of the first layer. In [0016]: The processor 102 further includes a dropout module 140 for performing dropout operations.
As described in further detail below, the dropout module 140 is configured to randomly drop a first subset of the plurality of neurons for a first neural network layer and keep a second subset of the plurality of neurons for the first neural network layer. The dropout module 140 instructs one of the CPU cores 106, 108, 110 to coalesce forwarded activation to generate a set of coalesced activation sub-matrices that are contiguous in a memory of the processing system, thereby leveraging the CPU cores 106, 108, 110 for re-organizing a dropped neural network to use GPU architecture more effectively when training a neural network. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay, Jiang and VISHNU. Vijay teaches a hashing function. Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling. VISHNU teaches activation and deactivation. One of ordinary skill would have been motivated to combine Vijay, Jiang and VISHNU because the combination can improve neural network performance (VISHNU [0007]). Claims 16 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Sudheendra Vijayanarasimhan et al. (hereinafter Vijay), US 2016/0180200 A1 [Note: In similarity search - previously searched prior art], in view of Peng Jiang et al. (hereinafter Jiang), "A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs," Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, published Feb 19, 2020, in view of Abhinav VISHNU et al. (hereinafter VISHNU), US 2020/0097822 A1.
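The activate/deactivate sparsification mapped in claims 5 and 13 above (keep a selected subset of neurons and set all other outputs to zero, per Vijay [0025]) can be illustrated with a short sketch. The function name, array values, and K below are illustrative assumptions, not drawn from either reference:

```python
import numpy as np

def sparsify_layer_output(outputs, selected):
    """Keep only the selected neurons' outputs; zero all others.

    Mirrors Vijay [0025]: output values for all nodes other than the
    selected set y_k are set to zero."""
    sparse = np.zeros_like(outputs)
    sparse[selected] = outputs[selected]
    return sparse

outputs = np.array([0.9, 0.1, 0.7, 0.3, 0.8])
top_k = np.argsort(outputs)[-3:]   # indices of the K=3 highest activations
print(sparsify_layer_output(outputs, top_k))  # non-selected entries become 0
```

Selecting by highest activation also sketches the claim 12 limitation (top-K neurons by activation value).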
In regard to claim 16: (Currently Amended) Vijay discloses: - A computing system comprising: one or more processors; and a memory to store data for processing of a deep neural network (DNN) having a plurality of layers, each layer including a plurality of neurons in [0006]: embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions, in [0070]: Computers suitable for the execution of a computer program include, by way of example, can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both.
The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data, in [0017]: This specification describes a method for neural networks, e.g., deep neural networks, that enables approximate computation of matrix products at various layers so that the number of output dimensions at a particular layer in the neural network can be increased by several orders of magnitude, while keeping the computation cost about the same and with little loss in accuracy, in [0006]: In general, one innovative aspect of the subject matter described in this specification can be embodied in methods for processing an input through each of multiple layers of a neural network to generate an output, wherein each of the multiple layers of the neural network includes respective multiple nodes include the actions of for a particular layer of the multiple layers. - wherein the computing system is operable to perform neural network layer sparsification, including the computer system to: In [0018]: a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication, In [0018]: a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function. At run-time, the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations.
For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes In [0007]: The method may include creating a modified activation vector by setting the values in the activation vector that correspond to the nodes that were not selected to zero, In [0023]: the matrix multiplication may be a product of the activation vector x from a previous layer in the neural network or an initial input for the neural network, e.g., when the particular layer is an input layer in the neural network, and the weights W. The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y. in [0025]: The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y. The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero. (BRI: sparsity) - utilize locality sensitive hashing to map inputs to a first layer of the DNN to a plurality of hash table buckets, including: In [0018]: the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. 
For instance, the classification system uses the hash codes as input to the hash table to determine the nodes and then determines the corresponding weight vectors W.sup.(K) for those nodes (BRI: the storage of weight matrix in a hash table using WTA is a “hash bucket”) In [0044]: the classification system 100 may use a WTA hash function that defines an ordinal embedding. For instance, as P→∞, the dot product between two WTA hashes tends to the rank correlation between the underlying vectors and WTA hashes are well suited as a basis for locality-sensitive hashing. This may result in a more robust proxy for dot product similarity and may be used to determine a count for each of the nodes that represents the ordinal similarity between the node and the activation vector x. in [Abstract]: selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer.
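The WTA (winner-takes-all) hashing Vijay [0044] relies on above lends itself to a compact sketch: each random permutation contributes one code component, the index of the largest value among the first K permuted elements, so vectors with similar rank order tend to collide. This is a minimal NumPy illustration of the general technique, not Vijay's actual embodiment:

```python
import numpy as np

def wta_hash(x, permutations, k):
    """Winner-take-all hash (cf. Vijay [0044]): for each permutation, record
    which of the first k permuted elements holds the largest value."""
    return tuple(int(np.argmax(x[p[:k]])) for p in permutations)

rng = np.random.default_rng(0)
perms = [rng.permutation(6) for _ in range(4)]   # one code component per permutation
a = np.array([0.9, 0.1, 0.7, 0.3, 0.8, 0.2])
b = a * 2.0 + 0.05    # order-preserving transform: identical WTA code
assert wta_hash(a, perms, 3) == wta_hash(b, perms, 3)
```

Because the code depends only on rank order, any monotone rescaling of the activations hashes identically, which is what makes WTA codes usable as LSH keys.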
- applying one or more locality sensitive hashing algorithms to the plurality of inputs to the first layer, In [0023]: The classification system 100 may use a hashing technique, e.g., a fast locality-sensitive hashing technique, to approximate the actual matrix multiplication to determine the output for the particular layer y, In [0006]: methods for processing an input through each of multiple layers of a neural network to generate an output, wherein each of the multiple layers of the neural network includes respective multiple nodes include the actions of for a particular layer of the multiple layers, in [0006]: an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, - wherein the plurality of inputs to the first layer comes from the sparsification of a prior layer of the plurality of layers of the DNN In [0052]: At 210, the classification system processes the activation vector using the selected nodes to generate an output for the particular layer. For example, the classification system performs matrix multiplication using the activation vector x and the determined weight vectors and then applies a function to the result of the matrix multiplication to generate an output for each selected node and sets all other output values for the particular layer y to zero. In [0025]: The classification system 100 uses an input activation vector x to determine one or more hash codes. The classification system 100 uses the hash codes to determine a set of nodes y.sub.k in the particular layer y that are closest to the activation vector in the hash space and computes the matrix product for x and the set of nodes y.sub.k to determine the output for the particular layer y.
The classification system 100 may set the output values for all other nodes in the particular layer y, other than the nodes in the set of nodes y.sub.k, to zero. - wherein the one or more previous sets of inputs relate to a first subset of the plurality of neurons, In [0018]: a neural network uses a winner takes all (WTA) hash method to reduce the computation time for the matrix multiplication. For instance, a classification system stores a weight matrix W of a particular neural network layer y in a hash table using the WTA function. In [0018]: the classification system computes hash codes using the activations from the previous layer x, e.g., the output values from the previous layer, and uses the hash codes to determine which nodes in the current layer y are most likely to be triggered based on the activations. (BRI: the activations from the previous layer that determine which nodes in the current layer are most likely to be activated provide a relationship) In [0032]: The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes. In [0043]: In some implementations, the classification system 100 may retrieve a corresponding count for each node from the hash table 104. For instance, each count may provide a lower bound for the dot product between the activation vector x and the weight vector for the node. The count may represent the ordinal similarity between the two vectors.
The classification system 100 may select the K nodes with the greatest ordinal similarity between the two vectors - select the first subset of neurons as the subset of neurons for activation based at least in part on the relation of the first subset of neurons to the one or more previous sets of inputs, In [0006]: receiving, by a classification system, an activation vector as input for the particular layer, selecting one or more nodes in the particular layer using the activation vector and a hash table that maps numeric values to nodes in the particular layer, and processing the activation vector using the selected nodes to generate an output for the particular layer. In [0008]: selecting the nodes in the particular layer using the activation vector and the hash table may include selecting the one or more nodes in the particular layer by using the integers as input to the hash table. The integers may include a first subset and a second, mutually exclusive subset In [0032]: The classification system uses the updated weight vectors to compute updated hash codes for the top K nodes and moves the identifiers for the top K nodes, or a subset of these nodes, to the locations in the hash table 104 pointed to by the updated hash codes. In [0017]: For instance, a neural network may use matrix multiplication W*x during a classification process, where x is the input from a layer and W refers to the weights of the connections to the next layer's outputs. However, Jiang discloses: - and applying a locality sensitive hashing function to map each of the plurality of tiles to the hash buckets based on sparsity patterns found in the tiles: In [3.2, Page 380]: The idea of LSH is to hash the nodes to be In [3.2, Page 381]: clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering. 
Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure In [3.2, Page 381]: We will use LSH as a black-box function in our clustering algorithm. In [3.2, Page 381]: Alg 3 shows our clustering-based row-reordering procedure. First, we use LSH to get a list of candidate pairs of rows that are likely to have large values of similarity score. [Image: media_image13.png (Alg 3)] - detecting similarity between a received set of inputs of the plurality of inputs and one or more previous sets of inputs based on results of the application of the one or more locality sensitive hashing algorithms In [Abstract, Page 376]: SpMM (multiplication of a sparse matrix and a dense matrix) and SDDMM (sampled dense-dense matrix multiplication) are at the core of many scientific, machine learning, and data mining applications. In [6, Page 386]: Several efforts have sought to improve the performance of sparse matrix multiplication by reordering the input data and defining new representations for sparse matrices. In [Abstract, Page 376]: We focus on performing the row-reordering efficiently, by using a hierarchical clustering procedure optimized by locality-sensitive hashing. In [3.2, Page 380]: Row-Reordering by Clustering We have shown that reordering the rows of the sparse matrix can improve data locality for SpMM and SDDMM. As shown in Fig 2a, a row of the output matrix Y is the weighted sum of certain rows of the dense input matrix X, which are selected based on the non-zeros in the sparse matrix S.
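Jiang's use of LSH to generate candidate row pairs (Alg 3, as quoted above) can be sketched with a banded minhash over each row's set of column indices: only pairs landing in a common bucket become clustering candidates. The hashing constants and bucket layout below are illustrative assumptions, not Jiang's implementation:

```python
import random
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(rows, siglen=8, bsize=2, seed=0):
    """Hash rows (sets of column indices) into buckets with banded minhash;
    only pairs sharing a bucket become clustering candidates (Jiang [3.2])."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(1, 1 << 30) for _ in range(siglen)]
    # one minhash signature per row: min of a linear hash over its columns
    sigs = [[min((a * c + 1) % 104729 for c in cols) for a in coeffs]
            for cols in rows]
    buckets = defaultdict(set)
    for i, sig in enumerate(sigs):
        for b in range(0, siglen, bsize):      # one bucket key per band
            buckets[(b, tuple(sig[b:b + bsize]))].add(i)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

rows = [{0, 4}, {1, 5}, {0, 4}]   # rows 0 and 2 are identical
assert (0, 2) in lsh_candidate_pairs(rows)
```

As Jiang notes, siglen trades accuracy for cost and a smaller bsize makes collisions (and thus candidate pairs) more likely.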
[Image: media_image1.png] In [Abstract, Page 377]: the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality. In [3.2, Page 380]: We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns. A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows: [Image: media_image2.png (Jaccard similarity formula)] For example, row 0 of the sparse matrix in Fig 1a has nonzeros in columns 0 and 4 (i.e., S0 = {0,4}), while row 4 has nonzeros in columns 0, 3 and 4 (i.e., S4 = {0,3,4}). [Image: media_image3.png] The similarity between rows 0 and 4 is |S0 ∩ S4|/|S0 ∪ S4| = 2/3. Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between the two clusters is the similarity between the representing rows of the two clusters.
The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster between the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster. In [3.2, Page 380]: Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering procedure In [2.2, Page 378]: In SDDMM, two dense matrices X and Y are multiplied and the result matrix is then scaled by an input sparse matrix S (Hadamard product). As shown in Fig 2b, an element in the output matrix (O[i][j]) is nonzero only if the corresponding element in the input sparse matrix (S[i][j]) is nonzero, [Image: media_image4.png] - mapping the received set of inputs to the first layer to one of a plurality of hash table buckets based on the detected similarity between the received set of inputs and the one or more previous sets of inputs of the first layer, In [3.2, Page 380]: The idea of LSH is to hash the nodes to be In [3.2, Page 381]: clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure. The performance of LSH is controlled by two parameters: siglen (the length of the signature) and bsize (the size of the band). The larger the siglen, the more accurate the hashing.
The smaller the bsize, the more likely two nodes will be hashed into the same bucket. The total time complexity is siglen × nnz + (siglen/bsize) × N + d_max × E, where nnz is the number of nonzeros in the sparse matrix, N is the number of rows, d_max is the number of nonzeros in the longest row, and E is the number of candidate pairs generated in [1, Page 377]: The goal of this work is to achieve consistently higher performance for SpMM and SDDMM. in [1, Page 377]: We propose to use a row-reordering procedure to group similar rows together. More specifically, we first reorder the rows of the sparse matrix such that rows with non-zeroes at similar locations are put in the same row panel. This brings more non-zeros to the dense tiles and increases data reuse in the shared memory. In [3.2, Page 381]: Alg 3 shows our clustering-based row-reordering procedure. First, we use LSH to get a list of candidate pairs of rows that are likely to have large values of similarity score. [Image: media_image13.png (Alg 3)] (BRI: grouping similar rows. LSH, which is used to provide candidate pairs of rows held in similarity queues, does provide similarity between the received set of inputs and the one or more previous sets of inputs of the first layer as a result of hashing the input items and grouping similar items into the same bucket. See code line 2 in Alg 3 for the similarity queue) - wherein compression of the weight or activation matrix of the first layer further includes: establishing a hash bucket for each of a plurality of sparsity patterns for the matrix; In [2.1, Page 377]: Compressed Sparse Row Representation The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices. In [2.1, Page 377]: As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value.
The array colidx stores the column indices of the nonzeros in the sparse matrix row by row. The corresponding values of these nonzeros are stored in the array value. For example, in Fig 1b, the second element of rowptr is 2, which indicates that the column index and the value of the first nonzero in the second row of the sparse matrix is in colidx[2] and value[2]. (BRI: Perhaps as known to the POSITA, the CSR is a representation of the weight matrix of a layer in a NN that stores and manipulates a sparse matrix) In [2.1, Page 377]: The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices In [2.1, Page 377]: As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value [Image: media_image7.png (Fig 1)] (BRI: Fig 1 represents a sparsity pattern matrix) In [3.2, Page 381]: The pairs of nodes within the buckets are considered candidate pairs for clustering. Because the pairs of nodes within buckets are much fewer than the pairs of all nodes, LSH reduces the memory consumption and computation complexity of the clustering procedure. - decomposing the matrix into a plurality of tiles; In [3.1, Page 380]: Consider again the sparse matrix in Fig 1a. Row 0 has two identical columns (i.e. location of non-zeroes) with row 4, while row 1 has one identical column with row 5. In [1, Page 377]: [Image: media_image8.png] In [3.1, Page 380]: If we exchange row 1 and row 4, we can get a reordered matrix as shown in Fig 4a. Applying ASpT to the reordered matrix will give us a tiled matrix as shown in Fig 4b. We can see that the number of nonzeros in the dense tiles increases to 9. This means that more data movement from the global memory can be saved during the execution.
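Jiang's CSR description quoted above (rowptr, colidx, value) is easy to make concrete. The small matrix below is a made-up example, not the matrix of Jiang's Fig 1:

```python
def to_csr(dense):
    """Build Jiang's three CSR arrays [2.1]: rowptr[i] is the index of row i's
    first nonzero within colidx/value; colidx lists column indices row by row;
    value holds the corresponding nonzero values."""
    rowptr, colidx, value = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                colidx.append(j)
                value.append(v)
        rowptr.append(len(colidx))
    return rowptr, colidx, value

dense = [[5, 0, 0],
         [0, 8, 3],
         [0, 0, 6]]
rowptr, colidx, value = to_csr(dense)
# rowptr == [0, 1, 3, 4]: row 1's first nonzero (the 8) sits at colidx[1], value[1]
```

Reordering rows before building these arrays, as Jiang proposes, changes which nonzeros land next to each other without changing the stored values.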
[Image: media_image9.png] - and wherein a compression ratio is dynamically adjusted and optimized by changing a tile size. In [1, Page 377]: the effectiveness of the adaptive-tiling approach depends on the sparsity structure of the input data. If a sparse matrix already has good data reuse among consecutive rows, adaptive-tiling is an efficient implementation to exploit the data locality. However, if a sparse matrix has poor data reuse among consecutive rows, adaptive-tiling cannot improve the data locality. We find the poor data locality issue is quite common for real-world data. For 351 of the 1084 matrices from the SuiteSparse collection and the Network Repository [34], the adaptive-tiling method has less than 1% of the nonzeros in the dense tiles in [2.3, Page 378]: Adaptive Sparse Tiling As shown in Alg 1 and Alg 2, both SpMM and SDDMM involve traversing the nonzeros of the sparse matrix S and accessing the data of the input dense matrices. [Image: media_image10.png] in [2.3, Page 379]: Adaptive Sparse Tiling [Image: media_image11.png] It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang. Vijay teaches a hashing function. Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling. One of ordinary skill would have been motivated to combine Vijay and Jiang because the combination provides reduced memory consumption and computation complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).
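The tile-size limitation above pairs naturally with a small sketch of how tiling interacts with sparsity: counting nonzeros per column tile shows how changing the tile width changes which tiles look dense. This is a simplified illustration of the tiling idea only, not Jiang's actual ASpT algorithm, and the example matrix is invented:

```python
def tile_nonzero_counts(row_cols, ncols, tile_width):
    """Count nonzeros falling in each column tile; a different tile_width
    changes how nonzeros concentrate, which drives the dense/sparse tile split."""
    ntiles = (ncols + tile_width - 1) // tile_width
    counts = [0] * ntiles
    for cols in row_cols:            # one list of column indices per row
        for c in cols:
            counts[c // tile_width] += 1
    return counts

rows = [[0, 4], [0, 3, 4], [1, 5]]
print(tile_nonzero_counts(rows, ncols=6, tile_width=2))   # [3, 1, 3]
print(tile_nonzero_counts(rows, ncols=6, tile_width=3))   # [3, 4]
```

Sweeping tile_width and keeping the setting whose dense tiles capture the most nonzeros is one plausible reading of "dynamically adjusted ... by changing a tile size."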
Vijay and Jiang do not explicitly disclose: - and activate the selected first subset of neurons of the first layer and deactivate all other neurons of the first layer However, VISHNU discloses: - and activate the selected first subset of neurons of the first layer and deactivate all other neurons of the first layer In [0009]: Accordingly, FIGS. 1-3 illustrate techniques for improving system performance leveraging heterogeneous system architectures. In some embodiments, a heterogeneous processing system, including at least one central processing unit (CPU) core and at least one graphics processing unit (GPU) core, is configured to compute an activation for each one of a plurality of neurons for a first network layer of a neural network. In [0009]: the coalesced neural network, as represented by the set of coalesced activation sub-matrices, improves compute efficiency of heterogeneous CPU-GPU architecture by coalescing the activations from each neural network layer into contiguous space which allows for more efficient execution of operations at the processing system. In [0016]: The processor 102 further includes a dropout module 140 for performing dropout operations. As described in further detail below, the dropout module 140 is configured to randomly drop a first subset of the plurality of neurons for a first neural network layer and keep a second subset of the plurality of neurons for the first neural network layer. The dropout module 140 instructs one of the CPU cores 106, 108, 110 to coalesce forwarded activation to generate a set of coalesced activation sub-matrices that are contiguous in a memory of the processing system, thereby leveraging the CPU cores 106, 108, 110 for re-organizing a dropped neural network to use GPU architecture more effectively when training a neural network. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay, Jiang and VISHNU.
Vijay teaches a hashing function. Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling. VISHNU teaches activation and deactivation. One of ordinary skill would have been motivated to combine Vijay, Jiang and VISHNU because the combination can improve neural network performance (VISHNU [0007]). In regard to claim 18: (Original) Vijay does not explicitly disclose: - wherein compression of the weight or activation matrix of the first layer further includes: compressing the matrix including storing an identification for each of the sparsity patterns mapped by the plurality of tiles. However, Jiang discloses: - wherein compression of the weight or activation matrix of the first layer further includes: compressing the matrix including storing an identification for each of the sparsity patterns mapped by the plurality of tiles. In [1, Page 377]: The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices. [Image: media_image12.png] In [2.1, Page 377]: As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row. (BRI: the CSR does store the indices of the sparsity values) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang. Vijay teaches a hashing function. Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling.
One of ordinary skill would have been motivated to combine Vijay and Jiang because the combination provides reduced memory consumption and computation complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]). In regard to claim 19: (Original) Vijay does not explicitly disclose: wherein compression of the matrix of the first layer includes: - grouping connections of the first layer of the DNN into hash buckets; - and combining values of the grouped connections of each hash bucket to generate a group value for the grouped connections. However, Jiang discloses: - wherein compression of the matrix of the first layer includes: grouping connections of the first layer of the DNN into hash buckets; In [2.1, Page 377]: Compressed Sparse Row Representation The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices. In [2.1, Page 377]: As shown in Fig 1, CSR format contains three arrays: rowptr, colidx and value. The value of rowptr[i] is the index of the first element of row i in array colidx and value. The array colidx stores the column indices of the nonzeros in the sparse matrix row by row. The corresponding values of these nonzeros are stored in the array value. For example, in Fig 1b, the second element of rowptr is 2, which indicates that the column index and the value of the first nonzero in the second row of the sparse matrix is in colidx[2] and value[2]. (BRI: Perhaps as known to the POSITA, the CSR is a representation of the weight matrix of a layer in a NN that stores and manipulates a sparse matrix) in [1, Page 377]: The goal of this work is to achieve consistently higher performance for SpMM and SDDMM. in [1, Page 377]: We propose to use a row-reordering procedure to group similar rows together. More specifically, we first reorder the rows of the sparse matrix such that rows with non-zeroes at similar locations are put in the same row panel.
This brings more non-zeros to the dense tiles and increases data reuse in the shared memory.

In [3.2, Pages 380-381]: The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar. The pairs of nodes within the buckets are considered candidate pairs for clustering.

In [5.4, Page 386]: The ratios of the preprocessing time to the actual computation time for the 416 matrices that need row-reordering are summarized in Tables 3 and 4 for SpMM and SDDMM, respectively. We can see that, when K = 512, more than 85% of the matrices have preprocessing time that is less than 10x of the actual computation time of SpMM, and more than 90% of the matrices have preprocessing time that is less than 10x of the actual computation time of SDDMM.

[Image omitted: Tables 3 and 4, preprocessing-to-computation time ratios]

In [6, Page 386]: Compressed Sparse Blocks (CSB) [3] is another sparse matrix storage format that exploits register blocking. The sparse matrix is partitioned and stored as small rectangular blocks. SpMM implementation with CSB data representation has been demonstrated to achieve high performance when both SpMM and transposed SpMM ... (BRI: smaller rectangular blocks of the sparse matrix represent tiling)

- and combining values of the grouped connections of each hash bucket to generate a group value for the grouped connections;

In [3.2, Page 380]: We achieve this by using a hierarchical clustering procedure. Each row of the sparse matrix can be considered as a set of column indices. Two rows are similar if they have many non-zeros in identical columns.
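The row-similarity notion in the passage just quoted is formalized by Jiang as Jaccard similarity over nonzero-column sets. A minimal sketch, using the example rows from Jiang's Fig 1a:

```python
# Jaccard similarity between two rows of a sparse matrix, where each row
# is represented by the set of its nonzero column indices.
def jaccard(a, b):
    return len(a & b) / len(a | b)

s0 = {0, 4}      # row 0: nonzeros in columns 0 and 4
s4 = {0, 3, 4}   # row 4: nonzeros in columns 0, 3 and 4
print(jaccard(s0, s4))  # 2/3: columns {0, 4} shared, union {0, 3, 4}
```

Rows with many common nonzero columns score close to 1 and are grouped into the same row panel during reordering.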
A natural definition of the similarity between two rows would be the Jaccard similarity between the two sets representing the two rows:

J(S_i, S_j) = |S_i ∩ S_j| / |S_i ∪ S_j|

For example, row 0 of the sparse matrix in Fig 1a has nonzeros in columns 0 and 4 (i.e., S0 = {0,4}), while row 4 has nonzeros in columns 0, 3 and 4 (i.e., S4 = {0,3,4}).

In [3.2, Page 380]: Intuitively, two rows with a large similarity should be put close to each other to allow more data reuse among the rows of the dense matrix. Initially, we consider each row as a cluster by itself. Then, in each iteration, we select the two clusters that have the largest similarity and group them into one. The similarity between two clusters is the similarity between the representing rows of the two clusters. The representing row of a cluster can be defined in the following way: if a cluster has only one row, then it is the representing row; if two clusters are grouped together, then the representing row of the resulting cluster is the representing row of the larger cluster of the two; if the clusters are of the same size, we choose the row with the smaller index as the representing row of the resulting cluster. Whenever the size of a cluster reaches a threshold, we output the rows in it and remove the cluster. Because any two rows in the cluster have a large similarity, which means they have many columns in common, there will be good data reuse for the dense input matrix X in SpMM and SDDMM. The procedure ends when there is no cluster left. Although the clustering ... (BRI: the similarity measure is the combined value of each bucket)

In [3.2, Pages 380-381]: The idea of LSH is to hash the nodes to be clustered into different buckets such that nodes in the same buckets are likely to be very similar while nodes in different buckets are not similar.
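The LSH bucketing idea quoted from Jiang [3.2] can be sketched with minhash signatures. This is a hypothetical illustration of the general technique, not the cited reference's procedure; the row sets and permutation count are made up.

```python
import random

# Minhash-based LSH sketch (illustrative only): each row is the set of
# its nonzero column indices; the signature is the minimum permuted
# index under several random permutations, so rows with high Jaccard
# similarity are likely to share a signature and land in one bucket.
def minhash_signature(cols, perms):
    return tuple(min(p[c] for c in cols) for p in perms)

random.seed(0)
n_cols = 8
perms = []
for _ in range(4):                      # 4 random column permutations
    order = list(range(n_cols))
    random.shuffle(order)
    perms.append({c: order[c] for c in range(n_cols)})

rows = {0: {0, 4}, 1: {1, 5}, 4: {0, 3, 4}}  # row id -> nonzero columns
buckets = {}
for row_id, cols in rows.items():
    buckets.setdefault(minhash_signature(cols, perms), []).append(row_id)
# Rows sharing a bucket are candidate pairs for clustering.
```

Under this scheme the probability that two rows share one signature component equals their Jaccard similarity, which is why similar rows tend to collide.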
The pairs of nodes within the buckets are considered candidate pairs for clustering.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Vijay and Jiang. Vijay teaches a hashing function. Jiang teaches compression of a weight matrix, grouping into hash buckets, and applying a locality-sensitive hashing function to map tiles to the hash buckets based on sparsity patterns and tiling. One of ordinary skill would have had motivation to combine Vijay and Jiang because the combination can provide reduced memory consumption and computational complexity of the clustering procedure with LSH buckets (Jiang [3.2, Page 381]).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TIRUMALE KRISHNASWAMY RAMESH, whose telephone number is (571) 272-4605. The examiner can normally be reached by phone. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Li B. Zhen, can be reached at (571) 272-3768. The fax number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/TIRUMALE K RAMESH/
Examiner, Art Unit 2121

/Li B. Zhen/
Supervisory Patent Examiner, Art Unit 2121

Prosecution Timeline

Dec 21, 2020
Application Filed
May 03, 2021
Response after Non-Final Action
Feb 08, 2024
Non-Final Rejection — §103
May 14, 2024
Response Filed
Jul 31, 2024
Final Rejection — §103
Oct 18, 2024
Response after Non-Final Action
Oct 31, 2024
Response after Non-Final Action
Nov 05, 2024
Request for Continued Examination
Nov 15, 2024
Response after Non-Final Action
Feb 07, 2025
Non-Final Rejection — §103
Jul 14, 2025
Response Filed
Sep 16, 2025
Final Rejection — §103
Dec 29, 2025
Request for Continued Examination
Jan 17, 2026
Response after Non-Final Action
Mar 19, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12518153
TRAINING MACHINE LEARNING SYSTEMS
2y 5m to grant Granted Jan 06, 2026
Patent 12293284
META COOPERATIVE TRAINING PARADIGMS
2y 5m to grant Granted May 06, 2025
Patent 12229651
BLOCK-BASED INFERENCE METHOD FOR MEMORY-EFFICIENT CONVOLUTIONAL NEURAL NETWORK IMPLEMENTATION AND SYSTEM THEREOF
2y 5m to grant Granted Feb 18, 2025
Patent 12131244
HARDWARE-OPTIMIZED NEURAL ARCHITECTURE SEARCH
2y 5m to grant Granted Oct 29, 2024
Patent 11803745
TERMINAL DEVICE AND METHOD FOR ESTIMATING FIREFIGHTING DATA
2y 5m to grant Granted Oct 31, 2023
Based on 5 most recent grants.


Prosecution Projections

5-6
Expected OA Rounds
18%
Grant Probability
20%
With Interview (+2.1%)
4y 5m
Median Time to Grant
High
PTA Risk
Based on 40 resolved cases by this examiner. Grant probability derived from career allow rate.
