Detailed Action
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office action is final and is responsive to the claims filed via amendment on 10/15/2025. Claims 1-4, 6-13, and 15-22 are pending and examined. Claims 1, 6, 8, 10-11, 15-17, and 19-20 are currently amended. Claims 21-22 are newly presented. Claims 2-4, 7, 9, 12-13, and 18 are as originally filed.
Response to Arguments
Claim Interpretation under 35 U.S.C. 112
Applicant’s arguments, see Remarks 10, filed 10/15/2025, with respect to the claim interpretations under 35 U.S.C. 112 have been fully considered and are persuasive. The interpretation of claim 11 has been withdrawn.
Rejections Under 35 U.S.C. 101
Applicant’s arguments regarding the 35 U.S.C. 101 rejections have been fully considered. Regarding the rejection under 35 U.S.C. 101, Applicant argues “Claim 1 as currently amended, when considered in its entirety and as a whole, recites a specific technical solution that seeks to address technical problems associated with utilizing sparsity in neural networks”. See Remarks 13. Applicant further argues “the claims recite limitations that cannot be practically performed in the human mind. As one example, generating an output at the layer of the neural network for an input to the layer based on the resulting block of elements”. See Remarks 16.
Examiner respectfully disagrees with Applicant's arguments. Generating the output merely generally links the use of the judicial exception to a particular field of use. See MPEP 2106.05(h).
Applicant further argues that the claims “involve the use of matrix multiplication in the context of providing a technical solutions that can improve the functioning of computing systems and neural networks implemented by computing systems”. See Remarks 17.
Examiner respectfully disagrees with Applicant's arguments. These purported improvements are not recited in the claim language as written.
Applicant further argues “The pending claims utilize matrix multiplication in a specific manner such that any recited mathematical concepts are integrated into a practical application. In particular, all pending claims provide that the resulting block of elements that is generated has the target level of combined sparsity recited in previous limitations… an output at the layer of the neural network is generated for an input to the layer based on the resulting block of elements”. See Remarks 18.
Examiner respectfully disagrees with Applicant's arguments. The sparsity limitations remain mathematical concepts. The input and output of the layer merely generally link the use of the judicial exception to a particular field of use. See MPEP 2106.05(h).
Applicant further argues “applying or using a judicial exception in a meaningful way, namely using sparse matrix multiplication for a layer of a neural network in manner that achieves a target level of combined sparsity in a resulting block, which is then used within that layer to generate an output… In view of Enfish, Applicant respectfully directs the Examiner's attention to the above remarks comparing the approach of FIG. 7 set forth in Applicant's specification and FIG. 9 set forth in Applicant's specification, which demonstrates that the claims recite a specific implementation of a technical solution to a technical problem”. See Remarks 19.
Examiner respectfully disagrees with Applicant's arguments. The neural network is recited at a high level of generality and merely generally links the use of the judicial exception to a particular field of use. See MPEP 2106.05(h).
Further, it is important to note, “the judicial exception alone cannot provide the improvement. The improvement can be provided by one or more additional elements. See the discussion of Diamond v. Diehr, 450 U.S. 175, 187 and 191-92, 209 USPQ 1, 10 (1981) in subsection II, below. In addition, the improvement can be provided by the additional element(s) in combination with the recited judicial exception... However, it is important to keep in mind that an improvement in the abstract idea itself (e.g. a recited fundamental economic concept) is not an improvement in technology...”. See MPEP 2106.05(a).
Rejections Under 35 U.S.C. 103
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-4, 6-13, and 15-22 are rejected under 35 U.S.C. 101 because the claimed invention is directed to abstract ideas without significantly more.
With regards to claim 1, at step 1, the claim is directed to a method, which is a statutory category of invention.
At Step 2A Prong 1, the examiner notes that the claim is directed to mental processes and/or mathematical concepts. The claim language has been reproduced below:
A method performed by a computing system, the method comprising: (mental process, evaluation)
receiving a first block of elements derived from a first matrix representing an activation matrix or a weight matrix for (mental process, evaluation) a layer of a neural network, the first block of elements having M elements in a first dimension, where M is an integer; (mental process, evaluation)
parsing the first block of elements into a first set of B sub-blocks, where B is an integer <= M, and where each of the first set of B sub-blocks include M/B elements in the first dimension; (mental process, evaluation)
applying a first sparsity mask having S% sparsity over M elements to the first block of elements to obtain a sparsified first block, such that each of the first set of B sub-blocks of the sparsified first block has S% sparsity, (mathematical calculation) wherein S is based on a target level of combined sparsity; (mental process, evaluation)
receiving a second block of elements derived from a second matrix representing an activation matrix or a weight matrix for (mental process, evaluation) the layer of the neural network, the second block of elements having M elements in a second dimension, different than the first dimension; (mental process, evaluation)
parsing the second block of elements into a second set of B sub-blocks, each of the second set of B sub-blocks including M/B elements in the second dimension; (mental process, evaluation)
applying a second sparsity mask having S'% sparsity over M elements to the second block of elements to obtain a sparsified second block, (mathematical calculation) such that S'% of the second set of B sub-blocks of the sparsified second block have 100% sparsity (mental process, evaluation) and (100-S')% of the second set of B sub-blocks have 0% sparsity, (mental process, evaluation) wherein S' is based on the target level of combined sparsity; (mental process, evaluation)
matrix multiplying the sparsified first block and the sparsified second block to generate a resulting block of elements as a matrix-matrix multiplication (matmul) product having the target level of combined sparsity; (mathematical calculation)
generating an output at the layer of the neural network for an input to the layer based on the resulting block of elements. (mental process, evaluation; mathematical calculation)
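For illustration only, the masking and multiplication steps recited above can be sketched as follows. This is a hypothetical example, not part of the claims or the record: the sizes M = 8, B = 4, the inner dimension K, the random element values, the magnitude-based scoring, and S = S' = 50 are all assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, B = 8, 4, 4                 # hypothetical sizes; M elements in each masked dimension
S = S_prime = 50                  # hypothetical sparsity percentages (assumed target-derived)
sub = M // B                      # elements per sub-block (M/B)

first = rng.standard_normal((M, K))    # first block: M elements in the first dimension (rows)
second = rng.standard_normal((K, M))   # second block: M elements in the second dimension (cols)

# First sparsity mask: within each sub-block of `sub` rows, zero the S%
# smallest-magnitude elements of each column, so every sub-block has S% sparsity.
first_mask = np.ones_like(first)
k = sub * S // 100                     # elements pruned per sub-block, per column
for b in range(B):
    rows = np.arange(b * sub, (b + 1) * sub)
    order = np.argsort(np.abs(first[rows]), axis=0)   # ascending by magnitude
    for c in range(K):
        first_mask[rows[order[:k, c]], c] = 0.0

# Second sparsity mask: S'% of the B column sub-blocks become 100% sparse
# (entirely zero); the remaining (100 - S')% stay fully dense (0% sparse).
second_mask = np.ones_like(second)
n_pruned = B * S_prime // 100
scores = [np.abs(second[:, b * sub:(b + 1) * sub]).sum() for b in range(B)]
for b in np.argsort(scores)[:n_pruned]:               # prune the lowest-scoring blocks
    second_mask[:, b * sub:(b + 1) * sub] = 0.0

# Matrix multiplying the sparsified blocks yields the resulting block of elements.
result = (first * first_mask) @ (second * second_mask)
```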
Each of the non-bolded limitations are mental processes and/or mathematical calculations. The “A method performed by a computing system, the method comprising” limitation is an evaluation mental process that can be performed by choosing what the method comprises. The “a first block of elements derived from a first matrix representing an activation matrix or a weight matrix” limitation is an evaluation mental process that can be performed by choosing what the first block is derived from. The “the first block of elements having M elements in a first dimension,” limitation is an evaluation mental process that can be performed by choosing what the first block of data is. The “parsing the first block” limitation is an evaluation mental process that can be performed by choosing how to split the block of data. The “applying a first sparsity mask” limitation is a mathematical calculation that can be performed by applying the mask by hand using pen and paper. The “wherein S is based on a target level of combined sparsity” limitation is an evaluation mental process that can be performed by choosing what S is based on. The “a second block of elements derived from a second matrix representing an activation matrix or a weight matrix” limitation is an evaluation mental process that can be performed by choosing what the second block is derived from. The “the second block of elements having M elements” limitation is an evaluation mental process that can be performed by choosing what the second block of data is. The “parsing the second block” limitation is an evaluation mental process that can be performed by choosing how to split the block of data. The “applying a second sparsity mask” limitation is a mathematical calculation that can be performed by applying the mask by hand using pen and paper. The “such that S'% of the second set of B” limitation is an evaluation mental process that can be performed by choosing how the mask is applied. 
The “and (100-S')% of the second set of B” limitation is an evaluation mental process that can be performed by choosing how the mask is applied. The “wherein S' is based on the target level of combined sparsity” limitation is an evaluation mental process that can be performed by choosing what S' is based on. The “matrix multiplying the sparsified first block and the sparsified second block” limitation is a mathematical calculation that can be performed by matrix multiplying the blocks by hand using pen and paper. The “generating an output at” limitation is an evaluation mental process and mathematical calculation that can be performed by choosing how the output is used and generating the output by hand using pen and paper.
At step 2A Prong 2, the additional elements are bolded above. The “layer of the neural network” limitation generally links the use of the judicial exception to a particular field of use. See MPEP 2106.05(h). The “receiving” limitations, as claimed under BRI, are additional elements that are insignificant extra-solution activity; the “receiving” in the context of the claim encompasses mere data gathering for the claimed mask-applying steps. The remaining additional elements amount to no more than mere instructions to apply the exception and do not integrate the judicial exception into a practical application. See MPEP 2106.05(f).
Under Step 2B, the claim recites “receiving a first block of elements” and “receiving a second block of elements”. Per MPEP 2106.05(d)(II), the courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity:
i. Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610, 118 USPQ2d 1744, 1745 (Fed. Cir. 2016) (using a telephone for image transmission); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network);
iv. Storing and retrieving information in memory, Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015); OIP Techs., 788 F.3d at 1363, 115 USPQ2d at 1092-93.
With regards to claim 11, it recites similar language to claim 1 and is rejected for, at least, the same reasons therein. Claim 11 is directed towards the statutory category of a machine, thus also satisfying step 1. Moreover, under step 2A prong 2, the additional elements are “one or more logic machines” and “one or more storage machines”. These are no more than high-level generic computer components that amount to no more than mere instructions to apply the exception and do not integrate the judicial exception into a practical application. See MPEP 2106.05(f). Under step 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor does it amount to significantly more than the judicial exception.
With regards to claim 20, it recites similar language to claim 1 and is rejected for, at least, the same reasons therein. Claim 20 is directed towards the statutory category of a method, thus also satisfying step 1. The additional limitation is “dynamically recomputing the first sparsity mask and the second sparsity mask for each iteration of training of the neural network”, which is an evaluation mental process and mathematical calculation that can be performed by choosing how to update the masks and then updating them using pen and paper. Moreover, under step 2A prong 2, the additional element is “a hardware configuration of the computing system”. This is no more than a high-level generic computer component that amounts to no more than mere instructions to apply the exception and does not integrate the judicial exception into a practical application. See MPEP 2106.05(f). Under step 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor does it amount to significantly more than the judicial exception.
With regards to claims 2 and 12, they are directed to mental processes and/or mathematical concepts. The “wherein S = S'” limitation is an evaluation mental process that can be performed by choosing the values of S and S’. Under steps 2A prong 2 and 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.
With regards to claims 3 and 13, they are directed to mental processes and/or mathematical concepts. The “wherein one or more of the first sparsity mask and the second sparsity mask” limitation is an evaluation mental process that can be performed by choosing how the sparsity masks are generated. Under steps 2A prong 2 and 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.
With regards to claim 4, it is directed to mental processes and/or mathematical concepts. The “wherein one or more of the first sparsity mask and the second sparsity mask are generated” limitation is an evaluation mental process that can be performed by choosing how the sparsity masks are generated. Under steps 2A prong 2 and 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.
With regards to claims 6 and 15, they are directed to mental processes and/or mathematical concepts. The “wherein matrix multiplying the sparsified first block and the sparsified second block to generate the resulting block of elements occurs” limitation is an evaluation mental process that can be performed by choosing when the matrix multiplication occurs. Under steps 2A prong 2 and 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.
With regards to claims 7 and 16, they are directed to mental processes and/or mathematical concepts. The “wherein the first sparsity mask and second sparsity mask” limitation is an evaluation mental process and mathematical calculation that can be performed by choosing how to update the masks and then updating them using pen and paper. Under steps 2A prong 2 and 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.
With regards to claims 8 and 17, they are directed to mental processes and/or mathematical concepts. The “wherein the neural network is a trained neural network” limitation is an evaluation mental process that can be performed by choosing the type of neural network. The “wherein matrix multiplying the sparsified first block and the sparsified second block to generate the resulting block of elements occurs during an inference operation” limitation is an evaluation mental process that can be performed by choosing when the matrix multiplication occurs. Under steps 2A prong 2 and 2B, the claims do not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.
With regards to claims 9 and 18, they are directed to mental processes and/or mathematical concepts. The “wherein the first sparsity mask is maintained during each iteration of the inference operation” limitation is an evaluation mental process that can be performed by choosing to not update the first mask. The “wherein the second sparsity mask is dynamically recomputed” limitation is an evaluation mental process and mathematical calculation that can be performed by choosing how to update the second mask and then updating it using pen and paper. Under steps 2A prong 2 and 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.
With regards to claims 10 and 19, they are directed to mental processes and/or mathematical concepts. The “wherein matrix multiplying the sparsified first block and the sparsified second block to generate the resulting block of elements occurs within a self-attention layer” limitation is an evaluation mental process that can be performed by choosing when the matrix multiplication occurs. The “wherein the first block of elements and second block of elements are both” limitation is an evaluation mental process that can be performed by choosing what the first and second blocks are derived from. Under steps 2A prong 2 and 2B, the claim does not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.
With regards to claims 21 and 22, they are directed to mental processes and/or mathematical concepts. The “wherein the target level of combined sparsity is based on a hardware configuration of the computing system” limitation is an evaluation mental process that can be performed by choosing what the target level of combined sparsity is based on. Under step 2A Prong 2, none of the additional elements regarding the generic computer components (i.e. a hardware configuration, the computing system, etc.) are more than high level generic computer components that amount to no more than components comprising mere instructions to apply the exception and do not integrate the judicial exception into a practical application. See MPEP 2106.05(f). Under step 2B, the claims do not recite any additional elements that integrate the abstract idea into a practical application, nor do they amount to significantly more than the judicial exception.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 4, 6-8, 11, 12, 15-17, and 20-22 are rejected under 35 U.S.C. 103 as being unpatentable over Zhu et al. (US 20210027156 A1) hereinafter Zhu in view of Mathuriya et al. (US 11836102 B1) hereinafter Mathuriya further in view of Guo et al. (“Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity”) hereinafter Guo further in view of Kampeas et al. (WO 2023088553 A1) hereinafter Kampeas.
With regards to claim 1, Zhu teaches A method performed by a computing system, the method comprising: receiving a first block of elements derived from a first matrix representing an activation matrix or a weight matrix for a layer of a neural network, (Zhu [0040]: For example, generic sparsifying 300 may reduce weight matrix)
the first block of elements having M elements in a first dimension, where M is an integer (Zhu [0040]: For example, generic sparsifying 300 may reduce weight matrix 301 to a sparse weight matrix 305 to reduce a number of calculations required for executing the neural network. Although depicted as a 4×4 weight matrix, weight matrix 301 may be any size; Zhu [0045]: Weight matrix 401 is depicted as an M×N matrix)
parsing the first block of elements into a first set of B sub-blocks, where B is an integer <= M, (Zhu [0045]: FIG. 4 is a representation of a block-wise division 400 of a weight matrix 401 of a neural network, consistent with embodiments of the present disclosure. For example, division 400 may divide weight matrix 401 into blocks of size B.sub.y×B.sub.x)
and where each of the first set of B sub-blocks include M/B elements in the first dimension; (Zhu [0045]: FIG. 4 is a representation of a block-wise division 400 of a weight matrix 401 of a neural network, consistent with embodiments of the present disclosure. For example, division 400 may divide weight matrix 401 into blocks of size B.sub.y×B.sub.x; Zhu Fig. 4: shows that there are N/Bx blocks with Bx elements in the first dimension each)
applying a first sparsity mask [having S% sparsity over M elements] to the first block of elements to obtain a sparsified first block, (Zhu [0048]: Accordingly, as depicted in FIG. 5, block-wise sparsifying 500 may include selecting one or more elements, e.g., elements 503a, 503b, 503c, and 503d from block 501. Although depicted as selecting four elements, block-wise sparsifying 500 may use any predetermined number of elements)
receiving a second block of elements derived from a second matrix representing an activation matrix or a weight matrix for the layer of the neural network (Zhu [0069]: a full input matrix)
applying a second sparsity mask [having S'% sparsity over M elements] to the second block of elements to obtain a sparsified second block, (Zhu [0069]: extract elements from a full input matrix based on offset matrix 603 to obtain sparse input matrix)
matrix multiplying the sparsified first block and the sparsified second block to generate a resulting block of elements as a matrix-matrix multiplication (matmul) product; [having the target level of combined sparsity] (Zhu [0070]: receive sparse weight matrix 601, offset matrix 603, and sparse input matrix 605 for executing the multiply-accumulate operations of the neural network)
and generating an output at the layer of the neural network for an input to the layer based on the resulting block of elements (Zhu [0029]: As further depicted in FIG. 1, neural network 100 may include one or more hidden layers, e.g., hidden layer 130-1, . . . , hidden layer 130-n. Each hidden layer may comprise one or more nodes. For example, in FIG. 1, hidden layer 130-1 comprises node 130-1-1, node 130-1-2, node 130-1-3, . . . , node 130-1-b, and hidden layer 130-n comprises node 130-n-1, node 130-n-2, node 130-n-3, . . . , node 130-n-c. Similar to nodes of input layer 120, nodes of the hidden layers may apply activation functions to output from connected nodes of the previous layer and weight the output from the activation functions by particular weights associated with the nodes).
While Zhu teaches the matrices having dimensions, Zhu fails to teach having M elements in a second dimension, different than the first dimension, parsing the second block of elements into a second set of B sub-blocks, and each of the second set of B sub-blocks including M/B elements in the second dimension.
However, Mathuriya does teach having M elements in a second dimension, different than the first dimension (Mathuriya Col 18 Lines 59-60: input matrix X has M rows and N columns)
parsing the second block of elements into a second set of B sub-blocks (Mathuriya Col 15 Lines 61-63: the input matrix X and weight matrix W.sup.T are blocked or split into chunks of 4 (e.g., C=4). The size of each block is B, where B=N/C)
each of the second set of B sub-blocks including M/B elements in the second dimension (Mathuriya Col 15 Lines 61-63: the input matrix X and weight matrix W.sup.T are blocked or split into chunks of 4 (e.g., C=4). The size of each block is B, where B=N/C).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu with the matrix dimensions and parsing the second block as taught by Mathuriya. One of ordinary skill in the art would be motivated to make this combination because it would allow for the mask to be applied in parallel, speeding up calculations.
Zhu in view of Mathuriya fails to teach [applying a first sparsity mask] having S% sparsity over M elements [to obtain a sparsified first block], such that each of the first set of B sub-blocks of the sparsified first block has S% sparsity, [applying a second sparsity mask] having S'% sparsity over M elements [to obtain a sparsified second block], such that S'% of the second set of B sub-blocks of the sparsified second block have 100% sparsity, and (100-S')% of the second set of B sub-blocks have 0% sparsity.
However, Guo does teach [applying a first sparsity mask] having S% sparsity over M elements [to obtain a sparsified first block] (Guo Page 3 Section III A: The second sparsity pattern shown in the middle of Fig. 2, called vector-wise (VW) [66], [70], divides a column in the weight matrix to multiple vectors. Within each vector, it prunes a fixed portion of elements by the rank of their importance scores; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity: elementwise (EW) pattern prunes individual elements, vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block)
such that each of the first set of B sub-blocks of the sparsified first block has S% sparsity (Guo Page 3 Section III A: The second sparsity pattern shown in the middle of Fig. 2, called vector-wise (VW) [66], [70], divides a column in the weight matrix to multiple vectors. Within each vector, it prunes a fixed portion of elements by the rank of their importance scores; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity: elementwise (EW) pattern prunes individual elements, vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block)
[applying a second sparsity mask] having S'% sparsity over M elements [to obtain a sparsified second block] (Guo Page 4 Section III A: The third pattern, called block-wise (BW) [35], divides the weight matrix to small blocks, and treats a block as the basic pruning unit; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity: elementwise (EW) pattern prunes individual elements, vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block)
such that S'% of the second set of B sub-blocks of the sparsified second block have 100% sparsity (Guo Page 4 Section III A: The third pattern, called block-wise (BW) [35], divides the weight matrix to small blocks, and treats a block as the basic pruning unit; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity… blockwise (BW) pattern prunes an entire 2×2 block; Guo Fig. 2: shows that the blockwise pruning leaves 50% of the blocks as pruned and 50% of the blocks as unpruned)
and (100-S')% of the second set of B sub-blocks have 0% sparsity (Guo Page 4 Section III A: The third pattern, called block-wise (BW) [35], divides the weight matrix to small blocks, and treats a block as the basic pruning unit; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity… blockwise (BW) pattern prunes an entire 2×2 block; Guo Fig. 2: shows that the blockwise pruning leaves 50% of the blocks as pruned and 50% of the blocks as unpruned).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya with the block pruning as taught by Guo. One of ordinary skill in the art would be motivated to make this combination because it maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain the high accuracy (Guo Page 2 Abstract).
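For illustration only, the three pruning patterns Guo's Fig. 2 compares at 50% sparsity can be sketched as follows. The 4×4 matrix, the random values, and the magnitude-based scoring are assumptions for this sketch (Guo ranks elements by importance scores, which magnitudes merely approximate):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4))   # hypothetical 4x4 weight matrix

# Element-wise (EW): prune the 50% smallest-magnitude elements anywhere.
ew = W.copy()
ew[np.abs(W) < np.median(np.abs(W))] = 0.0

# Vector-wise (VW): within each 4-element column vector, prune half the elements.
vw = W.copy()
for c in range(4):
    idx = np.argsort(np.abs(W[:, c]))[:2]   # two smallest-magnitude entries
    vw[idx, c] = 0.0

# Block-wise (BW): treat each 2x2 block as the basic pruning unit; drop the
# two blocks with the smallest total magnitude, zeroing them entirely.
bw = W.copy()
blocks = [(r, c) for r in (0, 2) for c in (0, 2)]
scores = {b: np.abs(W[b[0]:b[0] + 2, b[1]:b[1] + 2]).sum() for b in blocks}
for r, c in sorted(blocks, key=scores.get)[:2]:
    bw[r:r + 2, c:c + 2] = 0.0
```

All three masks reach the same 50% overall sparsity, but only the vector-wise and block-wise patterns keep the regularity that the Guo reference exploits for efficient execution.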
Zhu in view of Mathuriya further in view of Guo fails to teach wherein S is based on a target level of combined sparsity, wherein S' is based on the target level of combined sparsity, and [to generate a resulting block of elements as a matrix-matrix multiplication (matmul) product] having the target level of combined sparsity.
However, Kampeas teaches wherein S is based on a target level of combined sparsity; (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level)
wherein S' is based on the target level of combined sparsity; (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level)
[to generate a resulting block of elements as a matrix-matrix multiplication (matmul) product] having the target level of combined sparsity (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo with the combined sparsity as taught by Kampeas. One of ordinary skill in the art would be motivated to make this combination because it would increase the efficiency of the system, as this would facilitate compressed data formats, such as CSR formatting both in the forward as well as the backward path as taught by Kampeas (Kampeas Page 15 lines 13-14).
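For context only, one way S and S' could be derived from a target level of combined sparsity is sketched below. This is a hypothetical formula: Kampeas describes optimizing the product of weight sparsity with activation sparsity toward a target level, but neither the claims nor the reference fixes a closed form, and the symmetric split shown here is an assumption.

```python
def split_sparsity(target_pct: float) -> tuple[float, float]:
    """Hypothetical symmetric split: choose S = S' so that the product of the
    two surviving densities, (1 - S/100) * (1 - S'/100), equals the target
    combined density (1 - target_pct/100)."""
    density = 1.0 - target_pct / 100.0
    s = (1.0 - density ** 0.5) * 100.0
    return s, s

# Under this assumption, a 75% target combined sparsity corresponds to S = S' = 50%.
s, s_prime = split_sparsity(75.0)
```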
With regards to claim 2, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 1 above. Zhu fails to teach wherein S = S'.
However, Guo does teach wherein S = S' (Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity… vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with S = S’ as taught by Guo. One of ordinary skill in the art would be motivated to make this combination because it maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain the high accuracy (Guo Page 2 Abstract).
With regards to claim 4, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 1 above. Zhu further teaches wherein one or more of the first sparsity mask and the second sparsity mask are generated based on absolute magnitudes for a respective set of M/B elements (Zhu [0041]: generic sparsifying 300 may include selecting one or more elements, e.g., elements 303a, 303b, 303c, and 303d from weight matrix 301. Although depicted as selecting four elements, generic sparsifying 300 may use any predetermined number of elements. Elements 303a, 303b, 303c, and 303d may be selected on account of having the four largest absolute values).
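As an illustrative sketch (function and variable names are hypothetical, not taken from Zhu), the magnitude-based selection described in Zhu [0041] amounts to ranking the M/B elements of a sub-block by absolute value and keeping only the largest:

```python
# Hypothetical sketch of magnitude-based sparsity mask generation in the
# manner of Zhu [0041]: within a set of M/B elements, keep the entries
# with the largest absolute values and zero out the rest.

def magnitude_mask(elements, keep):
    """Return a 0/1 mask over `elements` that keeps the `keep` entries
    with the largest absolute magnitudes."""
    # Rank element indices by |value|, largest first.
    ranked = sorted(range(len(elements)),
                    key=lambda i: abs(elements[i]), reverse=True)
    mask = [0] * len(elements)
    for i in ranked[:keep]:
        mask[i] = 1
    return mask

def apply_mask(elements, mask):
    """Element-wise application of a 0/1 sparsity mask."""
    return [e * m for e, m in zip(elements, mask)]

# Example: keep the 2 largest-magnitude of 4 elements (50% sparsity).
sub_block = [0.1, -0.9, 2.0, -0.05]
mask = magnitude_mask(sub_block, keep=2)  # keeps -0.9 and 2.0
```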
With regards to claim 6, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 1 above. Zhu further teaches matrix multiplying the sparsified first block and the sparsified second block to generate the resulting block of elements occurs during training of the neural network (Zhu [0073]: Additionally with or alternatively to method 750 of FIG. 7B, iterative re-training of the neural network may be performed).
With regards to claim 7, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 6 above. Zhu further teaches wherein the first sparsity mask and second sparsity mask are dynamically recomputed for each iteration of training of the neural network (Zhu [0086]: At step 759, the at least one processor may determine if the re-trained neural network has converged. For example, the at least one processor may determine convergence has occurred when a desired sparsity level has been reached, when an accuracy of the neural network has dropped below a threshold (e.g., as performed by the pseudocode described above), or any other value associated with the neural network has reached or crossed a predetermined threshold. If converged, method 750 may end; if not, method 750 may iterate, as depicted in FIG. 7B).
With regards to claim 8, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 1 above. Zhu further teaches wherein the neural network is a trained neural network (Zhu [0073]: Additionally with or alternatively to method 750 of FIG. 7B, iterative re-training of the neural network may be performed using the example pseudocode below; (this shows that the neural network is a trained neural network)).
While Zhu teaches matrix multiplying after training the neural network, Zhu fails to teach and wherein matrix multiplying the sparsified first block and the sparsified second block to generate the resulting block of elements occurs during an inference operation of the trained neural network.
However, Mathuriya does teach and wherein matrix multiplying the sparsified first block and the sparsified second block to generate the resulting block of elements occurs during an inference operation of the trained neural network (Mathuriya Col. 19 Lines 47-52: In case of inference operation, stationary weights in arrays 401a are also stored in the bottom die 401. Top die 402 includes a plurality of processing elements (PEs) 402a. Each PE 402a may include one or more MMUs. Each MMU includes matrix multiplication logic (MML), logic, temporary buffer, etc).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with the inference operation as taught by Mathuriya. One of ordinary skill in the art would be motivated to make this combination for at least the same reasons listed in claim 1 above.
With regards to claim 11, Zhu teaches A computing system for implementing a deep neural network, comprising: one or more logic machines; (Zhu [0006]: a system for providing block-wise sparsity in a neural network may comprise at least one memory storing instructions and at least one processor configured to execute the instructions to perform operations)
and one or more storage machines, each storage machine holding instructions, that when executed by the one or more logic machines cause the computing system to: (Zhu [0006]: a system for providing block-wise sparsity in a neural network may comprise at least one memory storing instructions and at least one processor configured to execute the instructions to perform operations)
receive a first block of elements derived from a first matrix representing an activation matrix or a weight matrix for a layer of the deep neural network, (Zhu [0040]: For example, generic sparsifying 300 may reduce weight matrix)
the first block of elements having M elements in a first dimension, where M is an integer (Zhu [0040]: For example, generic sparsifying 300 may reduce weight matrix 301 to a sparse weight matrix 305 to reduce a number of calculations required for executing the neural network. Although depicted as a 4×4 weight matrix, weight matrix 301 may be any size; Zhu [0045]: Weight matrix 401 is depicted as an M×N matrix)
parse the first block of elements into a first set of B sub-blocks, where B is an integer <= M, (Zhu [0045]: FIG. 4 is a representation of a block-wise division 400 of a weight matrix 401 of a neural network, consistent with embodiments of the present disclosure. For example, division 400 may divide weight matrix 401 into blocks of size B.sub.y×B.sub.x)
and where each of the first set of B sub-blocks include M/B elements in the first dimension; (Zhu [0045]: FIG. 4 is a representation of a block-wise division 400 of a weight matrix 401 of a neural network, consistent with embodiments of the present disclosure. For example, division 400 may divide weight matrix 401 into blocks of size B.sub.y×B.sub.x; Zhu Fig. 4: shows that there are N/Bx blocks with Bx elements in the first dimension each)
apply a first sparsity mask [having S% sparsity over M elements] to the first block of elements to obtain a sparsified first block, (Zhu [0048]: Accordingly, as depicted in FIG. 5, block-wise sparsifying 500 may include selecting one or more elements, e.g., elements 503a, 503b, 503c, and 503d from block 501. Although depicted as selecting four elements, block-wise sparsifying 500 may use any predetermined number of elements)
receive a second block of elements derived from a second matrix representing an activation matrix or a weight matrix for the layer of the deep neural network, (Zhu [0069]: a full input matrix)
apply a second sparsity mask [having S'% sparsity over M elements] to the second block of elements to obtain a sparsified second block, (Zhu [0069]: extract elements from a full input matrix based on offset matrix 603 to obtain sparse input matrix)
perform sparse matrix multiplication of the sparsified first block and the sparsified second block to generate a resulting block of elements as a matrix-matrix multiplication (matmul) product [having the target level of combined sparsity;] (Zhu [0070]: receive sparse weight matrix 601, offset matrix 603, and sparse input matrix 605 for executing the multiply-accumulate operations of the neural network)
and generate an output at the layer of the deep neural network for an input to the layer based on the resulting block of elements (Zhu [0029]: As further depicted in FIG. 1, neural network 100 may include one or more hidden layers, e.g., hidden layer 130-1, . . . , hidden layer 130-n. Each hidden layer may comprise one or more nodes. For example, in FIG. 1, hidden layer 130-1 comprises node 130-1-1, node 130-1-2, node 130-1-3, . . . , node 130-1-b, and hidden layer 130-n comprises node 130-n-1, node 130-n-2, node 130-n-3, . . . , node 130-n-c. Similar to nodes of input layer 120, nodes of the hidden layers may apply activation functions to output from connected nodes of the previous layer and weight the output from the activation functions by particular weights associated with the nodes).
While Zhu teaches the matrices having dimensions, Zhu fails to teach having M elements in a second dimension, different than the first dimension, parse the second block of elements into a second set of B sub-blocks, and each of the second set of B sub-blocks including M/B elements in the second dimension.
However, Mathuriya does teach having M elements in a second dimension, different than the first dimension (Mathuriya Col 18 Lines 59-60: input matrix X has M rows and N columns)
parse the second block of elements into a second set of B sub-blocks (Mathuriya Col 15 Lines 61-63: the input matrix X and weight matrix W.sup.T are blocked or split into chunks of 4 (e.g., C=4). The size of each block is B, where B=N/C)
each of the second set of B sub-blocks including M/B elements in the second dimension (Mathuriya Col 15 Lines 61-63: the input matrix X and weight matrix W.sup.T are blocked or split into chunks of 4 (e.g., C=4). The size of each block is B, where B=N/C).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu with the matrix dimensions and parsing the second block as taught by Mathuriya. One of ordinary skill in the art would be motivated to make this combination because it would allow for the mask to be applied in parallel, speeding up calculations.
Zhu in view of Mathuriya fails to teach [apply a first sparsity mask] having S% sparsity over M elements [to obtain a sparsified first block], such that each of the first set of B sub-blocks of the sparsified first block has S% sparsity, [apply a second sparsity mask] having S'% sparsity over M elements [to obtain a sparsified second block], such that S'% of the second set of B sub-blocks of the sparsified second block have 100% sparsity, and (100-S')% of the second set of B sub-blocks have 0% sparsity.
However, Guo does teach [apply a first sparsity mask] having S% sparsity over M elements [to obtain a sparsified first block] (Guo Page 3 Section III A: The second sparsity pattern shown in the middle of Fig. 2, called vector-wise (VW) [66], [70], divides a column in the weight matrix to multiple vectors. Within each vector, it prunes a fixed portion of elements by the rank of their importance scores; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity: elementwise (EW) pattern prunes individual elements, vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block)
such that each of the first set of B sub-blocks of the sparsified first block has S% sparsity (Guo Page 3 Section III A: The second sparsity pattern shown in the middle of Fig. 2, called vector-wise (VW) [66], [70], divides a column in the weight matrix to multiple vectors. Within each vector, it prunes a fixed portion of elements by the rank of their importance scores; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity: elementwise (EW) pattern prunes individual elements, vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block)
[apply a second sparsity mask] having S'% sparsity over M elements [to obtain a sparsified second block] (Guo Page 4 Section III A: The third pattern, called block-wise (BW) [35], divides the weight matrix to small blocks, and treats a block as the basic pruning unit; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity: elementwise (EW) pattern prunes individual elements, vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block)
such that S'% of the second set of B sub-blocks of the sparsified second block have 100% sparsity (Guo Page 4 Section III A: The third pattern, called block-wise (BW) [35], divides the weight matrix to small blocks, and treats a block as the basic pruning unit; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity… blockwise (BW) pattern prunes an entire 2×2 block; Guo Fig. 2: shows that the blockwise pruning leaves 50% of the blocks as pruned and 50% of the blocks as unpruned)
and (100-S')% of the second set of B sub-blocks have 0% sparsity (Guo Page 4 Section III A: The third pattern, called block-wise (BW) [35], divides the weight matrix to small blocks, and treats a block as the basic pruning unit; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity… blockwise (BW) pattern prunes an entire 2×2 block; Guo Fig. 2: shows that the blockwise pruning leaves 50% of the blocks as pruned and 50% of the blocks as unpruned).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya with the block pruning as taught by Guo. One of ordinary skill in the art would be motivated to make this combination because it maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain the high accuracy (Guo Page 2 Abstract).
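As an illustrative sketch (hypothetical names, not drawn from Guo), the two mask styles mapped to the claim language above can be contrasted directly: a vector-wise mask imposes S% sparsity inside every sub-block, while a block-wise mask makes S'% of the sub-blocks 100% sparse and leaves the remaining (100-S')% fully dense.

```python
# Hypothetical sketch contrasting the two mask patterns the claims recite,
# in the spirit of Guo's vector-wise (VW) and block-wise (BW) patterns.

def vector_wise_mask(num_sub_blocks, elems_per_sub_block, s_percent):
    """Each sub-block keeps (100 - s_percent)% of its elements, so every
    sub-block individually has s_percent sparsity."""
    zeros = int(elems_per_sub_block * s_percent / 100)
    keeps = elems_per_sub_block - zeros
    return [[1] * keeps + [0] * zeros for _ in range(num_sub_blocks)]

def block_wise_mask(num_sub_blocks, elems_per_sub_block, s_prime_percent):
    """S'% of sub-blocks are entirely zeroed (100% sparsity); the rest
    are left untouched (0% sparsity)."""
    pruned = int(num_sub_blocks * s_prime_percent / 100)
    return ([[0] * elems_per_sub_block] * pruned
            + [[1] * elems_per_sub_block] * (num_sub_blocks - pruned))

# B = 4 sub-blocks of M/B = 4 elements each, with S = S' = 50%:
vw = vector_wise_mask(4, 4, 50)  # every sub-block is half zeros
bw = block_wise_mask(4, 4, 50)   # half the sub-blocks are all zeros
```

Both masks zero the same total fraction of elements here, which is the sense in which Guo's Fig. 2 shows different patterns at an identical 50% sparsity level.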
Zhu in view of Mathuriya further in view of Guo fails to teach wherein S is based on a target level of combined sparsity, wherein S' is based on the target level of combined sparsity, and [to generate a resulting block of elements as a matrix-matrix multiplication (matmul) product] having the target level of combined sparsity.
However, Kampeas teaches wherein S is based on a target level of combined sparsity; (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level)
wherein S' is based on the target level of combined sparsity; (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level)
[to generate a resulting block of elements as a matrix-matrix multiplication (matmul) product] having the target level of combined sparsity (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo with the combined sparsity as taught by Kampeas. One of ordinary skill in the art would be motivated to make this combination because it would increase the efficiency of the system, as this would facilitate compressed data formats, such as CSR formatting both in the forward as well as the backward path as taught by Kampeas (Kampeas Page 15 lines 13-14).
With regards to claim 12, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 11 above. Zhu fails to teach wherein S = S'.
However, Guo does teach wherein S = S' (Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity… vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with S = S’ as taught by Guo. One of ordinary skill in the art would be motivated to make this combination because it maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain the high accuracy (Guo Page 2 Abstract).
With regards to claim 15, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 11 above. Zhu further teaches wherein the sparse matrix multiplication occurs during training of the deep neural network (Zhu [0073]: Additionally with or alternatively to method 750 of FIG. 7B, iterative re-training of the neural network may be performed).
With regards to claim 16, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 15 above. Zhu further teaches wherein the first sparsity mask and second sparsity mask are dynamically recomputed for each iteration of training of the deep neural network (Zhu [0086]: At step 759, the at least one processor may determine if the re-trained neural network has converged. For example, the at least one processor may determine convergence has occurred when a desired sparsity level has been reached, when an accuracy of the neural network has dropped below a threshold (e.g., as performed by the pseudocode described above), or any other value associated with the neural network has reached or crossed a predetermined threshold. If converged, method 750 may end; if not, method 750 may iterate, as depicted in FIG. 7B).
With regards to claim 17, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 11 above. Zhu further teaches wherein the deep neural network is a trained neural network (Zhu [0073]: Additionally with or alternatively to method 750 of FIG. 7B, iterative re-training of the neural network may be performed using the example pseudocode below; (this shows that the neural network is a trained neural network)).
While Zhu teaches matrix multiplying after training the neural network, Zhu fails to teach and wherein the sparse matrix multiplication occurs during an inference operation of the trained neural network.
However, Mathuriya does teach and wherein the sparse matrix multiplication occurs during an inference operation of the trained neural network (Mathuriya Col. 19 Lines 47-52: In case of inference operation, stationary weights in arrays 401a are also stored in the bottom die 401. Top die 402 includes a plurality of processing elements (PEs) 402a. Each PE 402a may include one or more MMUs. Each MMU includes matrix multiplication logic (MML), logic, temporary buffer, etc).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with the inference operation as taught by Mathuriya. One of ordinary skill in the art would be motivated to make this combination for at least the same reasons listed in claim 11 above.
With regards to claim 20, Zhu teaches A method performed by a computing system for training a deep neural network, the method comprising, at the computing system: receiving a first block of elements derived from a weight matrix for a layer of the deep neural network, the first block of elements having M elements in a first dimension, where M is an integer (Zhu [0006]: In some embodiments, a system for providing block-wise sparsity in a neural network; Zhu [0040]: For example, generic sparsifying 300 may reduce weight matrix 301 to a sparse weight matrix 305 to reduce a number of calculations required for executing the neural network. Although depicted as a 4×4 weight matrix, weight matrix 301 may be any size; Zhu [0045]: Weight matrix 401 is depicted as an M×N matrix)
parsing the first block of elements into a first set of B sub-blocks, where B is an integer <= M, (Zhu [0045]: FIG. 4 is a representation of a block-wise division 400 of a weight matrix 401 of a neural network, consistent with embodiments of the present disclosure. For example, division 400 may divide weight matrix 401 into blocks of size B.sub.y×B.sub.x)
and where each of the first set of B sub-blocks include M/B elements in the first dimension; (Zhu [0045]: FIG. 4 is a representation of a block-wise division 400 of a weight matrix 401 of a neural network, consistent with embodiments of the present disclosure. For example, division 400 may divide weight matrix 401 into blocks of size B.sub.y×B.sub.x; Zhu Fig. 4: shows that there are N/Bx blocks with Bx elements in the first dimension each)
applying a first sparsity mask [having S% sparsity over M elements] to the first block of elements to obtain a sparsified first block, (Zhu [0048]: Accordingly, as depicted in FIG. 5, block-wise sparsifying 500 may include selecting one or more elements, e.g., elements 503a, 503b, 503c, and 503d from block 501. Although depicted as selecting four elements, block-wise sparsifying 500 may use any predetermined number of elements)
[wherein S is based on a target level of combined sparsity] for a hardware configuration of the computing system (Zhu [0025]: The disclosed embodiments relate to computer-implemented systems and methods for providing block-wise sparse neural networks. Advantageously, the exemplary embodiments can provide improved speed and power efficiency by reducing both mathematical operations and memory transfers required to execute the neural network)
receiving a second block of elements derived from an activation matrix, (Zhu [0069]: a full input matrix)
applying a second sparsity mask [having S'% sparsity over M elements] to the second block of elements to obtain a sparsified second block, (Zhu [0069]: extract elements from a full input matrix based on offset matrix 603 to obtain sparse input matrix)
matrix multiplying the sparsified first block and the sparsified second block to generate a resulting block of elements as a matrix-matrix multiplication (matmul) product [having the target level of combined sparsity] (Zhu [0070]: receive sparse weight matrix 601, offset matrix 603, and sparse input matrix 605 for executing the multiply-accumulate operations of the neural network)
and dynamically recomputing the first sparsity mask and the second sparsity mask for each iteration of training of the neural network (Zhu [0086]: At step 759, the at least one processor may determine if the re-trained neural network has converged. For example, the at least one processor may determine convergence has occurred when a desired sparsity level has been reached, when an accuracy of the neural network has dropped below a threshold (e.g., as performed by the pseudocode described above), or any other value associated with the neural network has reached or crossed a predetermined threshold. If converged, method 750 may end; if not, method 750 may iterate, as depicted in FIG. 7B.).
While Zhu teaches the matrices having dimensions, Zhu fails to teach having M elements in a second dimension, different than the first dimension, parsing the second block of elements into a second set of B sub-blocks, and each of the second set of B sub-blocks including M/B elements in the second dimension.
However, Mathuriya does teach having M elements in a second dimension, different than the first dimension (Mathuriya Col 18 Lines 59-60: input matrix X has M rows and N columns)
parsing the second block of elements into a second set of B sub-blocks (Mathuriya Col 15 Lines 61-63: the input matrix X and weight matrix W.sup.T are blocked or split into chunks of 4 (e.g., C=4). The size of each block is B, where B=N/C)
each of the second set of B sub-blocks including M/B elements in the second dimension (Mathuriya Col 15 Lines 61-63: the input matrix X and weight matrix W.sup.T are blocked or split into chunks of 4 (e.g., C=4). The size of each block is B, where B=N/C).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu with the matrix dimensions and parsing the second block as taught by Mathuriya. One of ordinary skill in the art would be motivated to make this combination because it would allow for the mask to be applied in parallel, speeding up calculations.
Zhu in view of Mathuriya fails to teach [applying a first sparsity mask] having S% sparsity over M elements [to obtain a sparsified first block], such that each of the first set of B sub-blocks of the sparsified first block has S% sparsity, [applying a second sparsity mask] having S'% sparsity over M elements [to obtain a sparsified second block], such that S'% of the second set of B sub-blocks of the sparsified second block have 100% sparsity, and (100-S')% of the second set of B sub-blocks have 0% sparsity.
However, Guo does teach [applying a first sparsity mask] having S% sparsity over M elements [to obtain a sparsified first block] (Guo Page 3 Section III A: The second sparsity pattern shown in the middle of Fig. 2, called vector-wise (VW) [66], [70], divides a column in the weight matrix to multiple vectors. Within each vector, it prunes a fixed portion of elements by the rank of their importance scores; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity: elementwise (EW) pattern prunes individual elements, vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block)
such that each of the first set of B sub-blocks of the sparsified first block has S% sparsity (Guo Page 3 Section III A: The second sparsity pattern shown in the middle of Fig. 2, called vector-wise (VW) [66], [70], divides a column in the weight matrix to multiple vectors. Within each vector, it prunes a fixed portion of elements by the rank of their importance scores; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity: elementwise (EW) pattern prunes individual elements, vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block)
[applying a second sparsity mask] having S'% sparsity over M elements [to obtain a sparsified second block] (Guo Page 4 Section III A: The third pattern, called block-wise (BW) [35], divides the weight matrix to small blocks, and treats a block as the basic pruning unit; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity: elementwise (EW) pattern prunes individual elements, vector-wise (VW) pattern prunes half elements in a 4-element vector, and blockwise (BW) pattern prunes an entire 2×2 block)
such that S'% of the second set of B sub-blocks of the sparsified second block have 100% sparsity (Guo Page 4 Section III A: The third pattern, called block-wise (BW) [35], divides the weight matrix to small blocks, and treats a block as the basic pruning unit; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity… blockwise (BW) pattern prunes an entire 2×2 block; Guo Fig. 2: shows that the blockwise pruning leaves 50% of the blocks as pruned and 50% of the blocks as unpruned)
and (100-S')% of the second set of B sub-blocks have 0% sparsity (Guo Page 4 Section III A: The third pattern, called block-wise (BW) [35], divides the weight matrix to small blocks, and treats a block as the basic pruning unit; Guo Fig. 2 Description: Comparison of three patterns with 50% sparsity… blockwise (BW) pattern prunes an entire 2×2 block; Guo Fig. 2: shows that the blockwise pruning leaves 50% of the blocks as pruned and 50% of the blocks as unpruned).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya with the block pruning as taught by Guo. One of ordinary skill in the art would be motivated to make this combination because it maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain the high accuracy (Guo Page 2 Abstract).
Zhu in view of Mathuriya further in view of Guo fails to teach wherein S is based on a target level of combined sparsity [for a hardware configuration of the computing system], wherein S' is based on the target level of combined sparsity, and [to generate a resulting block of elements as a matrix-matrix multiplication (matmul) product] having the target level of combined sparsity.
However, Kampeas teaches wherein S is based on a target level of combined sparsity [for a hardware configuration of the computing system;] (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level)
wherein S' is based on the target level of combined sparsity; (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level)
[to generate a resulting block of elements as a matrix-matrix multiplication (matmul) product] having the target level of combined sparsity (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo with the combined sparsity as taught by Kampeas. One of ordinary skill in the art would be motivated to make this combination because it would increase the efficiency of the system, as this would facilitate compressed data formats, such as CSR formatting both in the forward as well as the backward path as taught by Kampeas (Kampeas Page 15 lines 13-14).
With regards to claim 21, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 1 above. Zhu further teaches [wherein the target level of combined sparsity] is based on a hardware configuration of the computing system (Zhu [0025]: The disclosed embodiments relate to computer-implemented systems and methods for providing block-wise sparse neural networks. Advantageously, the exemplary embodiments can provide improved speed and power efficiency by reducing both mathematical operations and memory transfers required to execute the neural network).
Zhu fails to teach wherein the target level of combined sparsity [is based on a hardware configuration of the computing system].
However, Kampeas teaches wherein the target level of combined sparsity [is based on a hardware configuration of the computing system] (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to modify the combination of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with the combined sparsity as further taught by Kampeas. One of ordinary skill in the art would be motivated to make this combination because it would increase the efficiency of the system, as this would facilitate compressed data formats, such as CSR formatting, in both the forward and the backward paths, as taught by Kampeas (Kampeas Page 15 lines 13-14).
With regards to claim 22, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 11 above. Zhu further teaches [wherein the target level of combined sparsity] is based on a hardware configuration of the computing system (Zhu [0025]: The disclosed embodiments relate to computer-implemented systems and methods for providing block-wise sparse neural networks. Advantageously, the exemplary embodiments can provide improved speed and power efficiency by reducing both mathematical operations and memory transfers required to execute the neural network).
Zhu fails to teach wherein the target level of combined sparsity [is based on a hardware configuration of the computing system].
However, Kampeas teaches wherein the target level of combined sparsity [is based on a hardware configuration of the computing system] (Kampeas Page 15 lines 8-16: a combination of both weights and activation sparsity may also be attained by optimizing the product of weight sparsity with the activation sparsity… the resulting sparsity, and in particular, its distance from a target sparsity level).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to modify the combination of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with the combined sparsity as further taught by Kampeas. One of ordinary skill in the art would be motivated to make this combination because it would increase the efficiency of the system, as this would facilitate compressed data formats, such as CSR formatting, in both the forward and the backward paths, as taught by Kampeas (Kampeas Page 15 lines 13-14).
Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Zhu in view of Mathuriya further in view of Guo further in view of Kampeas further in view of Kumar et al. (“Pruning filters with L1-norm and capped L1-norm for CNN compression”) hereinafter Kumar.
With regards to claim 3, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 1 above. Zhu fails to teach wherein one or more of the first sparsity mask and the second sparsity mask are generated based on a set of lowest one-norms for a respective set of M/B elements.
However, Kumar does teach wherein one or more of the first sparsity mask and the second sparsity mask are generated based on a set of lowest one-norms for a respective set of M/B elements (Kumar Page 1153 Section 1 Right Column: we used an L1-norm and Capped L1-norm based filter pruning to tackle the aforementioned issues).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with the lowest one-norm masks as taught by Kumar. One of ordinary skill in the art would be motivated to make this combination because, right after the pruning process, the resulting slimmer network is far more compact in terms of runtime memory, model size, and computational cost in comparison to the original wide architecture, as taught by Kumar (Kumar Page 1153 Section 1 Right Column).
With regards to claim 13, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 11 above. Zhu fails to teach wherein one or more of the first sparsity mask and the second sparsity mask are generated based on a set of lowest one-norms for a respective set of M/B elements.
However, Kumar does teach wherein one or more of the first sparsity mask and the second sparsity mask are generated based on a set of lowest one-norms for a respective set of M/B elements (Kumar Page 1153 Section 1 Right Column: we used an L1-norm and Capped L1-norm based filter pruning to tackle the aforementioned issues).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with the lowest one-norm masks as taught by Kumar. One of ordinary skill in the art would be motivated to make this combination because, right after the pruning process, the resulting slimmer network is far more compact in terms of runtime memory, model size, and computational cost in comparison to the original wide architecture, as taught by Kumar (Kumar Page 1153 Section 1 Right Column).
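As context only, and forming no part of the record: the claim language at issue ties mask generation to a set of lowest one-norms for a respective set of M/B elements, and the cited Kumar passage describes L1-norm-based pruning. A hypothetical numpy sketch, assuming one lowest-one-norm element is masked within each group of B elements (the exact grouping in the claims may differ):

```python
import numpy as np

def l1_mask(x, B):
    """Within each group of B consecutive elements, zero out (mask) the
    element with the lowest one-norm (absolute value), keeping the rest.
    A length-M input yields M/B groups; M % B == 0 is assumed."""
    M = x.size
    groups = x.reshape(M // B, B)
    mask = np.ones_like(groups)
    lowest = np.argmin(np.abs(groups), axis=1)   # index of lowest |.| per group
    mask[np.arange(M // B), lowest] = 0
    return mask.reshape(x.shape)

x = np.array([3.0, -1.0, 4.0, 0.5, -2.0, 6.0])
print(l1_mask(x, B=3))   # zeros at the positions of -1.0 and 0.5
```

Applying such a mask elementwise to weights or activations before the matmul is one way the claimed sparsification could be realized; this sketch is not asserted to match any reference's implementation.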
Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Zhu in view of Mathuriya further in view of Guo further in view of Kampeas further in view of Schlicht et al. (US 20220222537 A1) hereinafter Schlicht.
With regards to claim 9, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 8 above. Zhu further teaches and wherein the second sparsity mask is dynamically recomputed for each forward phase of the inference operation (Zhu [0086]: At step 759, the at least one processor may determine if the re-trained neural network has converged. For example, the at least one processor may determine convergence has occurred when a desired sparsity level has been reached, when an accuracy of the neural network has dropped below a threshold (e.g., as performed by the pseudocode described above), or any other value associated with the neural network has reached or crossed a predetermined threshold. If converged, method 750 may end; if not, method 750 may iterate, as depicted in FIG. 7B).
Zhu fails to teach wherein the first sparsity mask is maintained during each iteration of the inference operation.
However, Schlicht teaches wherein the first sparsity mask is maintained during each iteration of the inference operation (Schlicht [0023]: the classic filter is initialized with fixed filter parameters when the deep neural network is initialized).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with maintaining the first sparsity mask as taught by Schlicht. One of ordinary skill in the art would be motivated to make this combination because overall, this may increase the robustness of the deep neural network with respect to interference as taught by Schlicht (Schlicht [0018]).
With regards to claim 18, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 17 above. Zhu further teaches and wherein the second sparsity mask is dynamically recomputed for each forward phase of the inference operation (Zhu [0086]: At step 759, the at least one processor may determine if the re-trained neural network has converged. For example, the at least one processor may determine convergence has occurred when a desired sparsity level has been reached, when an accuracy of the neural network has dropped below a threshold (e.g., as performed by the pseudocode described above), or any other value associated with the neural network has reached or crossed a predetermined threshold. If converged, method 750 may end; if not, method 750 may iterate, as depicted in FIG. 7B).
Zhu fails to teach wherein the first sparsity mask is maintained during each iteration of the inference operation.
However, Schlicht teaches wherein the first sparsity mask is maintained during each iteration of the inference operation (Schlicht [0023]: the classic filter is initialized with fixed filter parameters when the deep neural network is initialized).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with maintaining the first sparsity mask as taught by Schlicht. One of ordinary skill in the art would be motivated to make this combination because overall, this may increase the robustness of the deep neural network with respect to interference as taught by Schlicht (Schlicht [0018]).
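As background only, and forming no part of the record: the claims 9 and 18 mapping distinguishes a first sparsity mask that is maintained across iterations of the inference operation from a second sparsity mask that is dynamically recomputed for each forward phase. A toy sketch of that control flow (all structure hypothetical; the top-k masking rule is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_mask(x, k):
    """Keep the k largest-magnitude elements of x; zero the rest."""
    mask = np.zeros_like(x)
    mask[np.argsort(np.abs(x))[-k:]] = 1.0
    return mask

weights = rng.standard_normal(8)
first_mask = topk_mask(weights, k=4)   # computed once, then maintained

def forward(activations):
    # The second mask is recomputed on every forward phase because it
    # depends on the current activations; the first mask stays fixed.
    second_mask = topk_mask(activations, k=4)
    return np.dot(weights * first_mask, activations * second_mask)

for _ in range(3):   # first_mask is unchanged across iterations
    print(forward(rng.standard_normal(8)))
```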
Claims 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Zhu in view of Mathuriya further in view of Guo further in view of Kampeas further in view of Li et al. (“Comparing BERT and XLNet from the Perspective of Computational Characteristics”) hereinafter Li further in view of Ahmad et al. (“Optimizing Hardware Accelerated General Matrix-Matrix Multiplication for CNNs on FPGAs”) hereinafter Ahmad.
With regards to claim 10, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 1 above. Zhu fails to teach wherein matrix multiplying the sparsified first block and the sparsified second block to generate the resulting block of elements occurs within a self-attention layer of a transformer language model.
However, Li does teach wherein matrix multiplying the sparsified first block and the sparsified second block to generate the resulting block of elements occurs within a self-attention layer of a transformer language model (Li Page 1 Section 2: BERT is a popular attention-based Transformer language model and auto-encoding model... The encoder uses a self attention layer as a unit structure. As shown in Figure 1, it calculates the final attention output by using a query/key/value (feature), which is obtained by matrix multiplication).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with the self-attention layer as taught by Li. One of ordinary skill in the art would be motivated to make this combination because recently, Transformer [6], which is based solely on attention mechanism, shows superior performance in training time and accuracy compared to CNN and RNN on various NLP (Natural Language Processing) tasks as taught by Li (Li Page 1 Section 1).
Zhu in view of Mathuriya further in view of Guo further in view of Kampeas further in view of Li fails to teach and wherein the first block of elements and second block of elements are both derived from activation matrices.
However, Ahmad does teach and wherein the first block of elements and second block of elements are both derived from activation matrices (Ahmad Page 2695 Section V A: GeMM operations between activation matrices).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas further in view of Li with the activation matrices as taught by Ahmad. One of ordinary skill in the art would be motivated to make this combination because it improves the efficiency of the layers of the neural network as taught by Ahmad (Ahmad Page 2692 Abstract).
With regards to claim 19, Zhu in view of Mathuriya further in view of Guo further in view of Kampeas teaches all of the limitations of claim 11 above. Zhu fails to teach wherein the sparse matrix multiplication occurs within a self-attention layer of a transformer language model.
However, Li does teach wherein the sparse matrix multiplication occurs within a self-attention layer of a transformer language model (Li Page 1 Section 2: BERT is a popular attention-based Transformer language model and auto-encoding model... The encoder uses a self attention layer as a unit structure. As shown in Figure 1, it calculates the final attention output by using a query/key/value (feature), which is obtained by matrix multiplication).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas with the self-attention layer as taught by Li. One of ordinary skill in the art would be motivated to make this combination because recently, Transformer [6], which is based solely on attention mechanism, shows superior performance in training time and accuracy compared to CNN and RNN on various NLP (Natural Language Processing) tasks as taught by Li (Li Page 1 Section 1).
Zhu in view of Mathuriya further in view of Guo further in view of Kampeas further in view of Li fails to teach and wherein the first block of elements and second block of elements are both derived from activation matrices.
However, Ahmad does teach and wherein the first block of elements and second block of elements are both derived from activation matrices (Ahmad Page 2695 Section V A: GeMM operations between activation matrices).
Therefore, it would have been obvious before the effective filing date of the claimed invention for one of ordinary skill in the art to combine the teachings of Zhu in view of Mathuriya further in view of Guo further in view of Kampeas further in view of Li with the activation matrices as taught by Ahmad. One of ordinary skill in the art would be motivated to make this combination because it improves the efficiency of the layers of the neural network as taught by Ahmad (Ahmad Page 2692 Abstract).
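For background only, and forming no part of the record: the cited Li passage describes attention output computed from query/key/value matrices via matrix multiplication, and the cited Ahmad passage describes GeMM operations between activation matrices. A generic single-head scaled dot-product attention sketch in numpy (hypothetical shapes, no sparsification applied):

```python
import numpy as np

def self_attention(Q, K, V):
    """Single-head scaled dot-product attention. Both Q @ K.T and the
    subsequent product with V are matmuls between activation matrices
    (not between a weight matrix and an activation matrix)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # activation-activation matmul
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # second activation-activation matmul

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(self_attention(Q, K, V).shape)   # (4, 8)
```

In this setting, both operands of the Q·Kᵀ product are derived from activations, which is the property the claims 10 and 19 mapping attributes to Ahmad.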
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jakob O Gudas whose telephone number is (571)272-0695. The examiner can normally be reached Monday-Thursday, 7:30AM-5:00PM, and Friday, 7:30AM-4:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, James Trujillo can be reached at (571) 272-3677. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.O.G./Examiner, Art Unit 2151
/James Trujillo/Supervisory Patent Examiner, Art Unit 2151