Prosecution Insights
Last updated: April 19, 2026
Application No. 17/657,112

SPARSITY MASKING METHODS FOR NEURAL NETWORK TRAINING

Status: Non-Final OA (§103)
Filed: Mar 29, 2022
Examiner: BOSTWICK, SIDNEY VINCENT
Art Unit: 2124
Tech Center: 2100 — Computer Architecture & Software
Assignee: Microsoft Technology Licensing, LLC
OA Round: 3 (Non-Final)
Grant Probability: 52% (Moderate)
OA Rounds: 3-4
To Grant: 4y 7m
With Interview: 90%

Examiner Intelligence

Career Allow Rate: 52% (grants 52% of resolved cases; 71 granted / 136 resolved; -2.8% vs TC avg)
Interview Lift: +38.2% (strong lift in resolved cases with an interview vs. without)
Typical Timeline: 4y 7m average prosecution; 68 applications currently pending
Career History: 204 total applications across all art units

Statute-Specific Performance

§101: 24.4% (-15.6% vs TC avg)
§103: 40.9% (+0.9% vs TC avg)
§102: 12.0% (-28.0% vs TC avg)
§112: 21.9% (-18.1% vs TC avg)
Comparisons are against the Tech Center average estimate • Based on career data from 136 resolved cases

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/24/2025 has been entered.

Remarks

This Office Action is responsive to Applicants' Amendment filed on November 24, 2025, in which claims 1, 5, 6, 9, 14, 15, 18, and 20 are currently amended. Claims 1-20 are currently pending.

Specification

Applicant's amendments made to the specification are acknowledged.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on November 8, 2019 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments

The rejections of claims 5-8, 14-17, and 20 under 35 U.S.C. § 112(b) are hereby withdrawn, as necessitated by applicant's amendments and remarks made to the rejections. Applicant's arguments with respect to the rejection of claims 1-20 under 35 U.S.C. 103 based on amendment have been considered; however, they are not persuasive.

With respect to Applicant's arguments on pp. 13-14 of the Remarks submitted 11/24/2025 that Hubara does not disclose "generating, based on a transpose of the weight matrix, a second balanced sparsity mask, different from the first balanced sparsity mask, the second balanced sparsity mask constrained to be an N2 of M2 mask in the second dimension, wherein N2 is an integer and N2 < M2, the second balanced sparsity mask not constrained in the first dimension", Examiner respectfully disagrees. Applicant cites p. 3 of Hubara "We propose a novel N:M transposable-fine-grained sparsity mask", to which Examiner notes that a transpose of the sparsity mask would reasonably be interpreted as a generated second mask. However, this point is moot as Examiner has explicitly mapped the masked weight matrices (W' and W'^T) in Hubara to the claimed first and second balanced sparsity masks (pp. 6-7 of the Final Office Action mailed 9/23/2025). Examiner notes that as shown in FIG. 2 these sparsity masks (W' and W'^T) are notationally and substantially distinct. For at least these reasons Examiner believes the rejection in view of Hubara is reasonable and appropriate.

With respect to Applicant's arguments on p. 15 of the Remarks submitted 9/23/2025 that "Stosic does not appear to teach combining masks", Examiner respectfully disagrees. Stosic explicitly discloses ([p. 21 §H.2] "for each 4 × 4 block in the tensor, we construct all possible combinations of 2:4 2D patterns, compute their 1-norm, and choose the structure that has the largest norm" where each of the 2:4 2D patterns are masks). With respect to Applicant's argument on p. 16 of the Remarks submitted 9/23/2025 that "the structure that is chosen by Stosic is not disclosed to be a transposable mask", while Examiner notes that one of ordinary skill in the art would recognize that any tensor can be transposed, Stosic explicitly discloses ([p. 12] "we can also apply 2:4 on weight transposes wT").
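To make the disputed W' / W'^T terminology concrete, the following is a minimal numpy sketch. It is illustrative only; it is not code from Hubara, Stosic, or the application, and the function and variable names are invented here. It shows a row-wise 2:4 mask computed from W, a second mask computed from the transpose of W, and a check of whether the two happen to coincide:

import numpy as np

def nm_mask_rows(w, n=2, m=4):
    # Balanced N:M mask along rows: in every group of m consecutive
    # entries of a row, keep the n entries with the largest magnitude
    # (a top-K selection with K = n).
    rows, cols = w.shape
    groups = np.abs(w).reshape(rows, cols // m, m)
    top = np.argsort(groups, axis=-1)[..., -n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

mask_fwd = nm_mask_rows(W)      # first mask: 2:4 along the rows of W
mask_bwd = nm_mask_rows(W.T)    # second mask: 2:4 along the rows of W^T
                                # (i.e., along the columns of W)

W_prime   = W * mask_fwd        # W'   used in the forward pass
W_prime_T = W.T * mask_bwd      # W'^T used in the backward pass

# The two masks generally differ; a "transposable" mask is the special
# case where mask_bwd equals mask_fwd.T, so one mask serves both passes.
print(np.array_equal(mask_bwd, mask_fwd.T))

On random weights the final check typically prints False; a transposable mask is designed to close exactly that gap so a single pattern can be used in both passes.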
Examiner also notes that this combination is not in view of Stosic alone but rather in view of the combination of Stosic and Hubara where Hubara repeatedly discloses that the sparsity masks are transpose products. With respect to Applicant's arguments on p. 16 of the Remarks submitted 9/23/2025 that Stosic and Hubara are not analogous, Examiner respectfully disagrees. Both Stosic and Hubara are explicitly directed towards N:M sparsity masks for neural networks and are therefore seen as highly analogous art in the same field of endeavor. As explicitly stated on p. 8 of the Final Office Action mailed 9/23/2025, this combination amounts to using the mask hyperparameter search (combining each mask pattern to arrive at the optimal mask) as the mask selection process for the forward and backward mask process described in Hubara. As Hubara explicitly utilizes N:M masks for neural network acceleration and Stosic explicitly generates N:M masks for accelerating forward and backward neural network processes, this combination would be trivial and obvious to one of ordinary skill in the art before the effective filing date of the claimed invention. For at least these reasons and those further detailed below, Examiner asserts that it is reasonable and appropriate to maintain the prior art rejections.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-6, 18, and 19 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Hubara (“Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks”, 2021) and Stosic (“Search Spaces for Neural Model Training”, 2021).

[Figure: FIG. 2(b) of Hubara]

Regarding claim 1, Hubara teaches A method performed by a computing system for training a neural network, the method comprising: ([p. 1 Abstract] "While N : M fine-grained block sparsity allows acceleration in actual modern hardware, it can be used only to accelerate the inference phase. In order to allow for similar accelerations in the training phase, we suggest a novel transposable fine-grained sparsity mask, where the same mask can be used for both forward and backward passes")

obtaining a weight matrix having an integer dimension M1 in a first dimension and an integer dimension M2 in a second dimension, ([p. 5 §3] "Table 1 shows the MD for a matrix of size 8 × 8" M1 and M2 are both interpreted as 8.)

performing one or more training iterations of the neural network, each training iteration including at least ([p. 7 §4] "In Table 2 we show the running time overhead of ResNet50 training with IP, min cost flow and 2-approximation algorithms over regular training, the algorithms were implemented in a non-optimized way. All experiments were run in a single GPU and the mask was updated every 40 iterations")

generating, based on the weight matrix, a first balanced sparsity mask that is constrained to be an N1 of M1 mask in the first dimension; wherein N1 is an integer and N1 < M1, the first balanced sparsity mask not constrained in the second dimension ([p. 5 §3] "As can be seen in Fig. 3a, the accuracy correlates with the MD metric and the 4:8 transposable fine-grained mask reached a comparable accuracy to the 2:4 fine-grained mask" 4:8 mask W' interpreted as a first balanced N1:M1 sparsity mask in the first dimension where N1<M1. See also FIG. 2(b) and Table 1.)

applying the first balanced sparsity mask to the weight matrix during a forward pass of input values to the neural network to obtain an output result; (See FIG. 2 where the forward pass applies W' to replace W in multiplication against input X)

[Figure: Markup of FIG. 2(b) of Hubara]

generating, based on a transpose of the weight matrix, a second balanced sparsity mask, different from the first balanced sparsity mask, the second balanced sparsity mask constrained to be an N2 of M2 mask in the second dimension; where N2 is an integer and N2<M2, the second balanced sparsity mask not constrained in the first dimension. ([p. 5 §3] "As can be seen in Fig. 3a, the accuracy correlates with the MD metric and the 4:8 transposable fine-grained mask reached a comparable accuracy to the 2:4 fine-grained mask" 4:8 mask W'^T interpreted as a second balanced N2:M2 mask in the second (transposed) dimension where N2<M2 based on transposed weight matrix W^T. See also FIG. 2(b) and Table 1.)

and applying the second balanced sparsity mask to the transpose of the weight matrix during a backwards pass of an error, the error based on the output result, ([p. 3] "The suggested 4:8 transposable structured pruning mask capable of accelerating with sparse tensors core the forward and backward passes" See FIG. 2(b) which shows the backward pass using W'T to accelerate backward propagation and explicitly states the advantage of W'T over previous art is acceleration of the backwards pass).

[Figure: Markup of FIG. 2(b) of Hubara]

to obtain an updated weight matrix ([p. 5] “The first multiplication is required for the forward propagation between the weights and activation. The other two multiplications are used for the backward and update phases. The backward phase calculates the gradients of the loss function with respect to the input of the neural layer. This is done by recursively passing the error from the last layer to the first (Eq. (1)).
Note that the backward phase uses the transposed weight matrix.”) However, Hubara does not explicitly teach determining, based on the error, whether to perform a next training iteration of the neural network using the updated weight matrix. Stosic, in the same field of endeavor, teaches determining, based on the error, whether to perform a next training iteration of the neural network using the updated weight matrix([p. 18 §E] "For most workloads, we adopt learning rates with linear warmups for the first 5 epochs, drop the learning rate by a factor of ten at epochs 30-60-80, and stop training after 90 epochs" See also FIG. 4). Hubara as well as Stosic are directed towards sparsity masking for neural networks. Therefore, Hubara as well as Stosic are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Hubara with the teachings of Stosic by performing the mask hyperparameter search in Stosic to determine the optimal N:M mask to generate forward and transposed backwards pass masks in Hubara. Stosic provides as additional motivation for combination ([p. 1 §1] "deep learning tasks improve with the size of search spaces being explored during training. We find that adding weights provides extra degrees of freedom that form new paths of optimization and facilitate the search for neural models"). This motivation for combination also applies to the remaining claims which depend on this combination. Regarding claim 2, the combination of Hubara and Stosic teaches The method of claim 1, wherein N1 and N2 are the same integer value.(Stosic [p. 21 §H.2] "for each 4 × 4 block in the tensor, we construct all possible combinations of 2:4 2D patterns, compute their 1-norm, and choose the structure that has the largest norm"). Regarding claim 3, the combination of Hubara and Stosic teaches The method of claim 1, wherein the second balanced sparsity mask is a transpose of the first balanced sparsity mask.(Stosic [p. 21 §H.2] "Typically, 2:4 is applied on weights w in the forward pass, y = wx. However, we can also apply 2:4 on weight transposes wT for the backward pass, ∂L/∂x = ∂L/∂y × wT" [p. 21 §2.4] "we construct all possible combinations of 2:4 2D patterns" See also FIG. 16). Regarding claim 4, the combination of Hubara and Stosic teaches The method of claim 1, wherein the first balanced sparsity mask and the second balanced sparsity mask are generated based at least on a top-K function.(Stosic [p. 17 §C] "At each point in time, the top d-proportion of weights in each layer of a neural model participate in training, and the rest do not participate" [p. 20] "Blank cells represent weights that do not participate in training (are assumed zero)" [p. 17 §D] "We determine masks based on weight magnitudes, as described earlier, after the optimizer step and before the next training iteration"). Regarding claim 5, the combination of Hubara and Stosic teaches The method of claim 1, wherein the training iterations are performed over a plurality of sequential phases and wherein a first sequential phase of training the neural network is performed using an initial level of sparsity. (Stosic [p. 17 §D] "We determine masks based on weight magnitudes, as described earlier, after the optimizer step and before the next training iteration" [p. 18 §D] "We construct lottery tickets by training neural models to completion (e.g., k steps) and computing masks based on their trained weights [14,54]. 
We initialize sparse models with the original initialization (t = 0) or after some amount of training (t = E), and train them for k − t steps using the same hyperparameters"). Regarding claim 6, the combination of Hubara and Stosic teaches The method of claim 5, further comprising: determining a measure of training performance following a training iteration (Stosic [p. 4 §2] "training traverses through n-dimensional search spaces to minimize the loss function" See also Algorithm 1 lines 5-8). Regarding claim 18, Hubara teaches A computing system for training a deep neural network, the computing system comprising:([p. 1 Abstract] "While N : M fine-grained block sparsity allows acceleration in actual modern hardware, it can be used only to accelerate the inference phase. In order to allow for similar accelerations in the training phase, we suggest a novel transposable fine-grained sparsity mask, where the same mask can be used for both forward and backward passes") one or more logic machines; and([p. 7 §4.1] "All experiments were run in a single GPU") one or more storage machines, each storage machine holding instructions, that when executed by the one or more logic machines cause the computing system to:([p. 7 §4.1] "All experiments were run in a single GPU" [p. 3 §2] "pruning the weights reduces their memory footprint") obtain a weight matrix having integer dimensions M in a first dimension and M in a second dimension, ([p. 5 §3] "Table 1 shows the MD for a matrix of size 8 × 8" M1 and M2 are both interpreted as 8.) performing one or more training iterations of the neural network, each training iteration including at least([p. 7 §4] "In Table 2 we show the running time overhead of ResNet50 training with IP, min cost flow and 2- approximation algorithms over regular training, the algorithms were implemented in a non-optimized way. All experiments were run in a single GPU and the mask was updated every 40 iterations") generating, based on the weight matrix, a first balanced sparsity mask that is constrained to be an N1 of M1 mask in the first dimension, wherein n1 is an integer and n1<m1, the first balanced sparsity mask not constrained in the second dimension([p. 5 §3] "As can be seen in Fig. 3a, the accuracy correlates with the MD metric and the 4:8 transposable fine-grained mask reached a comparable accuracy to the 2:4 fine-grained mask" 4:8 mask W' interpreted as a first balanced N1:M1 sparsity mask in the first dimension where N1<M1. See also FIG. 2(b) and Table 1.) applying the first balanced sparsity mask to the weight matrix during a forward pass of input values to the deep neural network to obtain an output result (See FIG. 2) generating, based on a transpose of the weight matrix, a second balanced sparsity mask, different from the first balanced sparsity mask, the second balanced sparsity mask constrained to be an N2 of M2 mask in the second dimension, wherein N2 is an integer and N2<M2, the second balanced sparsity mask not constrained in the first dimension([p. 5 §3] "As can be seen in Fig. 3a, the accuracy correlates with the MD metric and the 4:8 transposable fine-grained mask reached a comparable accuracy to the 2:4 fine-grained mask" 4:8 mask W'^T interpreted as a second balanced N2:M2 mask in the second (transposed) dimension where N2<M2 based on transposed weight matrix W^T. See also FIG. 2(b) and Table 1.) 
and applying the second balanced sparsity mask to the transpose of the weight matrix during a backwards pass of an error, the error based on the output result, to obtain an updated weight matrix(See FIG. 2). However, Hubara does not explicitly teach determining, based on the error, whether to perform a next training iteration of the neural network using the updated weight matrix. Stosic, in the same field of endeavor, teaches determining, based on the error, whether to perform a next training iteration of the neural network using the updated weight matrix([p. 18 §E] "For most workloads, we adopt learning rates with linear warmups for the first 5 epochs, drop the learning rate by a factor of ten at epochs 30-60-80, and stop training after 90 epochs" See also FIG. 4). Hubara as well as Stosic are directed towards sparsity masking for neural networks. Therefore, Hubara as well as Stosic are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Hubara with the teachings of Stosic by performing the mask hyperparameter search in Stosic to determine the optimal N:M mask to generate forward and transposed backwards pass masks in Hubara. Stosic provides as additional motivation for combination ([p. 1 §1] "deep learning tasks improve with the size of search spaces being explored during training. We find that adding weights provides extra degrees of freedom that form new paths of optimization and facilitate the search for neural models"). This motivation for combination also applies to the remaining claims which depend on this combination. Regarding claim 19, the combination of Hubara and Stosic teaches The computing system of claim 18, wherein the second balanced sparsity mask is a transpose of the first balanced sparsity mask.(Stosic [p. 21 §H.2] "Typically, 2:4 is applied on weights w in the forward pass, y = wx. However, we can also apply 2:4 on weight transposes wT for the backward pass, ∂L/∂x = ∂L/∂y × wT" [p. 21 §2.4] "we construct all possible combinations of 2:4 2D patterns" See also FIG. 16). Claims 7, 8, and 20 are rejected under U.S.C. §103 as being unpatentable over the combination of Hubara, Stosic, and in further view of Srinivas (US20220245457A1). Regarding claim 7, the combination of Hubara and Stosic teaches The method of claim 6. However, the combination of Hubara and Stosic doesn't explicitly teach, further comprising: in response to the measure of training performance decreasing below a first threshold, progressing the training into a second phase by adjusting the initial sparsity level to a decreased sparsity level. Srinivas, in the same field of endeavor, teaches The method of claim 6, further comprising: in response to the measure of training performance decreasing below a first threshold, progressing the training into a second phase by adjusting the initial sparsity level to a decreased sparsity level.([¶0026] " increased sparsity of the weight tensors for trained neural networks can decrease the accuracy of the computations and inferences of the trained neural networks. Different methods for increasing sparsity of the weight tensors can produce sparse weight tensors that cannot achieve a level of accuracy for a given level of sparsity, needing lower levels of sparsity to achieve the level of accuracy." 
[¶0027] "Embodiments described herein provide methods for neural network pruning with cyclical sparsity for which a given level of accuracy of the neural network may be achieved with a higher level of sparsity" [¶0059] "the sparsity comparator 308 may compare the value of the increase sparsity counter 300 to an increase sparsity counter threshold and the value of the decrease sparsity counter 302 to a decrease sparsity counter threshold, and trigger the mask generator 202 to cease generating masks for the neural network in response to the value of the increase sparsity counter 300 exceeding the increase sparsity counter threshold and/or the value of the decrease sparsity counter 302 exceeding the decrease sparsity counter threshold" Decrease sparsity threshold interpreted as a measure of training performance. Exceeding the decrease sparsity counter threshold interpreted as synonymous with a performance decreasing below a first threshold.). The combination of Hubara and Stosic as well as Srinivas are directed towards neural network pruning. Therefore, the combination of Hubara and Stosic as well as Srinivas are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Hubara and Stosic with the teachings of Srinivas. Srinivas teaches a more fine-grained pruning method than the combination of Hubara and Stosic and provides as additional motivation for combination (the embodiments described herein may improve the implementation of the neural network by reducing the resource cost for such implementation with a dense weight tensor of the neural network or a sparse weight tensor generated by other means for increasing sparsity of a weight tensor.). This motivation for combination also applies to the remaining claims which depend on this combination.

Regarding claim 8, the combination of Hubara, Stosic, and Srinivas teaches The method of claim 7, further comprising: in response to the measure of training performance decreasing below a second threshold, progressing the training into a third phase by adjusting the decreased sparsity level to a further decreased sparsity level. (Srinivas [¶0056] "the sparsity level decrementor 306 may decrease the level of sparsity to a predetermined level of sparsity. As another example, the sparsity level decrementor 306 may decrease the level of sparsity to a level lower than the sparsity level of the sparsity level incrementor 304. As another example, the sparsity level decrementor 306 may decrease the level of sparsity to a level higher than the sparsity level of a dense weight tensor (e.g., input dense weight tensor 210 in FIG. 2) for the trained neural network. As another example, the sparsity level decrementor 306 may decrease the level of sparsity to a level as low as zero. In some embodiments, the sparsity level decrementor 306 may decrease the level of sparsity according to a sparsity profile designating values, such as numerically, of various levels of sparsity. In some embodiments, the sparsity level decrementor 306 may decrease the level of sparsity according to a sparsity profile designating values, such as algorithmically, of various levels of sparsity." [¶0057] "the sparsity level decrementor 306 may use various factors for control the level of sparsity used for generating a mask. For example, the sparsity level decrementor 306 may use a sparsity parameter to control the level of sparsity used for generating a mask.
As another example, the sparsity level decrementor 306 may use a number of cycles of updating, or training, of a neural network by a neural network trainer to control the level of sparsity used for generating a mask. As another example, the sparsity level decrementor 306 may use an accuracy value of the neural network to control the level of sparsity used for generating a mask. As another example, the sparsity level decrementor 306 may use a previously set level of sparsity set by the sparsity level incrementor 304 to control the level of sparsity used for generating a mask. As another example, the sparsity level decrementor 306 may use a previously set level of sparsity set by the sparsity level decrementor 306 to control the level of sparsity used for generating a mask. In some embodiments, the sparsity level decrementor 306 may use any number or combination of these examples to control the level of sparsity used for generating a mask."). Regarding claim 20, the combination of Hubara and Stosic teaches The computing system of claim 18, wherein the storage machine further holds instructions that when executed by the one or more logic machines cause the computing system to: perform the training iterations over a plurality of sequential phases, and wherein a first phase training the deep neural network is performed using an initial level of sparsity;(Stosic [p. 17 §D] "We determine masks based on weight magnitudes, as described earlier, after the optimizer step and before the next training iteration" [p. 18 §D] "We construct lottery tickets by training neural models to completion (e.g., k steps) and computing masks based on their trained weights [14,54]. We initialize sparse models with the original initialization (t = 0) or after some amount of training (t = E), and train them for k − t steps using the same hyperparameters") determine a measure of training performance following a training iteration; and(Stosic [p. 4 §2] "training traverses through n-dimensional search spaces to minimize the loss function" See also Algorithm 1 lines 5-8). However, the combination of Hubara and Stosic doesn't explicitly teach in response to the measure of training performance decreasing below a first threshold, progress the training into a second phase by adjusting the initial sparsity level to a decreased sparsity level.. Srinivas, in the same field of endeavor, teaches in response to the measure of training performance decreasing below a first threshold, progress the training into a second phase by adjusting the initial sparsity level to a decreased sparsity level.([¶0026] " increased sparsity of the weight tensors for trained neural networks can decrease the accuracy of the computations and inferences of the trained neural networks. Different methods for increasing sparsity of the weight tensors can produce sparse weight tensors that cannot achieve a level of accuracy for a given level of sparsity, needing lower levels of sparsity to achieve the level of accuracy." 
[¶0027] "Embodiments described herein provide methods for neural network pruning with cyclical sparsity for which a given level of accuracy of the neural network may be achieved with a higher level of sparsity" [¶0059] "the sparsity comparator 308 may compare the value of the increase sparsity counter 300 to an increase sparsity counter threshold and the value of the decrease sparsity counter 302 to a decrease sparsity counter threshold, and trigger the mask generator 202 to cease generating masks for the neural network in response to the value of the increase sparsity counter 300 exceeding the increase sparsity counter threshold and/or the value of the decrease sparsity counter 302 exceeding the decrease sparsity counter threshold" Decrease sparsity threshold interpreted as a measure of training performance. Exceeding the decrease sparsity counter threshold interpreted as synonymous with a performance decreasing below a first threshold.). The combination of Hubara and Stosic as well as Srinivas are directed towards neural network pruning. Therefore, the combination of Hubara and Stosic as well as Srinivas are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Hubara and Stosic with the teachings of Srinivas. Srinivas teaches a more fine-grained pruning method than the combination of Hubara and Stosic and provides as additional motivation for combination (the embodiments described herein may improve the implementation of the neural network by reducing the resource cost for such implementation with a dense weight tensor of the neural network or a sparse weight tensor generated by other means for increasing sparsity of a weight tensor.). This motivation for combination also applies to the remaining claims which depend on this combination.

Claims 9, 10, and 13-15 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Stosic and Hubara.

Regarding claim 9, Stosic teaches A method performed by a computing system for training a neural network, the method comprising: ([p. 1 §1] "We then show this methodology achieves competitive results on dozens of deep learning workloads, even when satisfying constraints needed to accelerate training and inference using Sparse Tensor Cores [24] in NVIDIA GPUs")

obtaining a weight matrix having an integer dimension J in a first dimension and an integer dimension K in a second dimension, ([p. 20 §H.1] "We construct block sparse structures by (1) partitioning a neural layer into a set of blocks, (2) aggregating elements in each block into a metric, and (3) removing blocks according to some criteria based on their metrics" [p. 21] "we apply 2:4 on a n × k weight tensor along k or n")

performing one or more training iterations of the neural network, each training iteration including at least ([p. 18 §E] "For most workloads, we adopt learning rates with linear warmups for the first 5 epochs, drop the learning rate by a factor of ten at epochs 30-60-80, and stop training after 90 epochs")

reshaping the weight matrix into one or more square weight matrices having an integer dimension M in the first dimension and second dimensions ([p. 21 §H.2] "for each 4 × 4 block in the tensor, we construct all possible combinations of 2:4 2D patterns, compute their 1-norm, and choose the structure that has the largest norm" Stosic explicitly reshapes the n×k weight tensor into 4×4 (m×m where m=4) blocks.)
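As an aside for readers following the block-wise search quoted above, here is a small, self-contained numpy sketch. It is illustrative only and is not drawn from Stosic; the brute-force enumeration and all names are invented for this example. It selects, for one 4 × 4 block, the pattern that keeps two entries per row and per column and maximizes the 1-norm of the surviving weights:

import numpy as np
from itertools import combinations, product

# Every way of keeping exactly 2 of the 4 entries in a single row.
ROW_PATTERNS = [np.array([1.0 if i in keep else 0.0 for i in range(4)])
                for keep in combinations(range(4), 2)]

# Every 4x4 pattern that keeps exactly 2 entries per row AND per column,
# i.e. a 2:4 structure that holds in both dimensions simultaneously.
PATTERNS_2D = [np.stack(rows) for rows in product(ROW_PATTERNS, repeat=4)
               if np.all(np.stack(rows).sum(axis=0) == 2)]

def best_2of4_2d_pattern(block):
    # Score each candidate by the 1-norm of the weights it keeps and
    # return the highest-scoring pattern for this 4x4 block.
    scores = [np.abs(block * p).sum() for p in PATTERNS_2D]
    return PATTERNS_2D[int(np.argmax(scores))]

rng = np.random.default_rng(0)
block = rng.normal(size=(4, 4))      # one 4x4 block of a larger weight tensor
pattern = best_2of4_2d_pattern(block)
print(pattern.sum(axis=0), pattern.sum(axis=1))   # both print [2. 2. 2. 2.]

A pattern chosen this way can be used as the forward-pass mask and, after transposition, as the backward-pass mask, which is the role the examiner assigns to the selected structure in the combination with Hubara.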
generating, based on the square weight matrix a first balanced sparsity mask that is constrained to be an N of M mask in the first dimension of the square weight matrix, where N is an integer and N<M ([p. 20] "Figure 16: Block sparse, 2:4 1D, and 2:4 2D structures for a 4×4 neural layer. Blank cells represent weights that do not participate in training (are assumed zero)" [p. 21 §H.2] "for each 4 × 4 block in the tensor, we construct all possible combinations of 2:4 2D patterns, compute their 1-norm, and choose the structure that has the largest norm" 2:4 (N:M), where 2<4 is synonymous with N<M. 4×4 neural layer interpreted as square weight matrix.)

combining the first balanced sparsity mask and the second balanced sparsity mask to generate a third balanced sparsity mask; ([p. 21 §H.2] "for each 4 × 4 block in the tensor, we construct all possible combinations of 2:4 2D patterns, compute their 1-norm, and choose the structure that has the largest norm")

applying the third balanced sparsity mask to the square weight matrix during a forward pass of input values to the neural network to obtain an output result ([p. 17 §D] "Our functions emulate sparsity using a binary tensor (or mask) that we multiply elementwise with the weights of each layer during forward and backward propagation. We determine masks based on weight magnitudes, as described earlier, after the optimizer step and before the next training iteration" [p. 21 §H.2] "The 2:4 sparsity structure must always be imposed along the inner dimension of dot products. For linear layers, we apply 2:4 on a n × k weight tensor along k […] (for forward […] pass, respectively)." Forward pass interpreted as synonymous with inference.)

and applying a transpose of the third sparsity mask to the transpose of the square weight matrix during a backward pass of an error tensor based on the output result to obtain an updated square weight matrix ([p. 21 §H.2] "we can also apply 2:4 on weight transposes wT for the backward pass, ∂L/∂x = ∂L/∂y × wT. […] For linear layers, we apply 2:4 on a n × k weight tensor along […] n (for […] backward pass, respectively" backpropagation interpreted as synonymous with backwards pass.)

determining, based on the error tensor, whether to perform a next training iteration of the neural network using the updated square weight matrix ([p. 18 §E] "For most workloads, we adopt learning rates with linear warmups for the first 5 epochs, drop the learning rate by a factor of ten at epochs 30-60-80, and stop training after 90 epochs" See also FIG. 4).

However, Stosic does not explicitly teach generating, based on a transpose of the square weight matrix a second balanced sparsity mask that is constrained to be an N of M mask in the second dimension of the square weight matrix the third balanced sparsity mask being a transposable mask with at most N non-zero parameters in both the first dimension and the second dimension. Hubara, in the same field of endeavor, teaches generating, based on a transpose of the square weight matrix a second balanced sparsity mask that is constrained to be an N of M mask in the second dimension of the square weight matrix ([p. 5 §3] "As can be seen in Fig. 3a, the accuracy correlates with the MD metric and the 4:8 transposable fine-grained mask reached a comparable accuracy to the 2:4 fine-grained mask" 4:8 mask W'^T interpreted as a second balanced N2:M2 mask in the second (transposed) dimension where N2<M2 based on transposed weight matrix W^T. See also FIG. 2(b) and Table 1.
W interpreted as the square weight matrix.) the third balanced sparsity mask being a transposable mask with at most N non-zero parameters in both the first dimension and the second dimension([p. 5 §4] "The required mask contains only M − N non-zero elements, for every contiguous M elements, in both W and WT simultaneously" In an N:M 2:4 or 4:8 mask M-N=N). Hubara as well as Stosic are directed towards sparsity masking for neural networks. Therefore, Hubara as well as Stosic are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Hubara with the teachings of Stosic by performing the mask hyperparameter search in Stosic to determine the optimal N:M mask to generate forward and transposed backwards pass masks in Hubara. Stosic provides as additional motivation for combination ([p. 1 §1] "deep learning tasks improve with the size of search spaces being explored during training. We find that adding weights provides extra degrees of freedom that form new paths of optimization and facilitate the search for neural models"). This motivation for combination also applies to the remaining claims which depend on this combination. Regarding claim 10, the combination of Stosic and Hubara teaches The method of claim 9, wherein the first balanced sparsity mask and the second balanced sparsity mask are generated based at least on a top-K function.(Stosic [p. 17 §C] "At each point in time, the top d-proportion of weights in each layer of a neural model participate in training, and the rest do not participate" [p. 20] "Blank cells represent weights that do not participate in training (are assumed zero)" [p. 17 §D] "We determine masks based on weight magnitudes, as described earlier, after the optimizer step and before the next training iteration"). Regarding claim 13, the combination of Stosic and Hubara teaches The method of claim 9, further comprising: applying N of M sparsity to one or more of an activation tensor and an error tensor during the backpropagation.(Stosic [p. 17 §C] "For example, the forward stage computes x`+1 = p`x` recursively for an input x and participating weights p, which is then fed into the loss function L. The backward stage derives two sets of gradients: activation gradients from layer ` are passed to downstream layer `−1"). Regarding claim 14, the combination of Stosic and Hubara teaches The method of claim 9, wherein the training iterations are performed over a plurality of sequential phases, and wherein a first sequential phase of training the deep neural network is performed using an initial level of sparsity.(Stosic [p. 17 §D] "We determine masks based on weight magnitudes, as described earlier, after the optimizer step and before the next training iteration" [p. 18 §D] "We construct lottery tickets by training neural models to completion (e.g., k steps) and computing masks based on their trained weights [14,54]. We initialize sparse models with the original initialization (t = 0) or after some amount of training (t = E), and train them for k − t steps using the same hyperparameters"). Regarding claim 15, the combination of Stosic and Hubara teaches The method of claim 14, further comprising: determining a measure of training performance following a training iteration.(Stosic [p. 4 §2] "training traverses through n-dimensional search spaces to minimize the loss function" See also Algorithm 1 lines 5-8). Claims 11, 16, and 17 are rejected under U.S.C. 
§ 103 as being unpatentable over the combination of Stosic and Hubara and Srinivas.

Regarding claim 11, the combination of Stosic and Hubara teaches The method of claim 9, further comprising: setting a desired output sparsity for the third balanced sparsity mask; and (Stosic [p. 21 §H.2] "The 2:4 sparsity structure must always be imposed"). However, the combination of Stosic and Hubara doesn't explicitly teach generating the first and second balanced sparsity masks with a sparsity greater than the desired output sparsity. Srinivas, in the same field of endeavor, teaches generating the first and second balanced sparsity masks with a sparsity greater than the desired output sparsity. ([¶0009] "increasing the level of sparsity of the weight tensor includes applying a mask to the weight tensor configured to convert a non-zero value of the weight tensor below a mask threshold to a zero value." [¶0026] "increased sparsity of the weight tensors for trained neural networks can decrease the accuracy of the computations and inferences of the trained neural networks. Different methods for increasing sparsity of the weight tensors can produce sparse weight tensors that cannot achieve a level of accuracy for a given level of sparsity, needing lower levels of sparsity to achieve the level of accuracy." [¶0027] "Embodiments described herein provide methods for neural network pruning with cyclical sparsity for which a given level of accuracy of the neural network may be achieved with a higher level of sparsity" [¶0059] "the sparsity comparator 308 may compare the value of the increase sparsity counter 300 to an increase sparsity counter threshold and the value of the decrease sparsity counter 302 to a decrease sparsity counter threshold, and trigger the mask generator 202 to cease generating masks for the neural network in response to the value of the increase sparsity counter 300 exceeding the increase sparsity counter threshold and/or the value of the decrease sparsity counter 302 exceeding the decrease sparsity counter threshold"). The combination of Hubara and Stosic as well as Srinivas are directed towards neural network pruning. Therefore, the combination of Hubara and Stosic as well as Srinivas are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Hubara and Stosic with the teachings of Srinivas. Srinivas teaches a more fine-grained pruning method than the combination of Hubara and Stosic and provides as additional motivation for combination (the embodiments described herein may improve the implementation of the neural network by reducing the resource cost for such implementation with a dense weight tensor of the neural network or a sparse weight tensor generated by other means for increasing sparsity of a weight tensor.). This motivation for combination also applies to the remaining claims which depend on this combination.

Regarding claim 16, the combination of Stosic and Hubara teaches The method of claim 15. However, the combination of Stosic and Hubara doesn't explicitly teach, further comprising: in response to the measure of training performance decreasing below a first threshold, progressing the training into a second phase by adjusting the initial sparsity level to a decreased sparsity level.
Srinivas, in the same field of endeavor, teaches in response to the measure of training performance decreasing below a first threshold, progressing the training into a second phase by adjusting the initial sparsity level to a decreased sparsity level. ([¶0026] "increased sparsity of the weight tensors for trained neural networks can decrease the accuracy of the computations and inferences of the trained neural networks. Different methods for increasing sparsity of the weight tensors can produce sparse weight tensors that cannot achieve a level of accuracy for a given level of sparsity, needing lower levels of sparsity to achieve the level of accuracy." [¶0027] "Embodiments described herein provide methods for neural network pruning with cyclical sparsity for which a given level of accuracy of the neural network may be achieved with a higher level of sparsity" [¶0059] "the sparsity comparator 308 may compare the value of the increase sparsity counter 300 to an increase sparsity counter threshold and the value of the decrease sparsity counter 302 to a decrease sparsity counter threshold, and trigger the mask generator 202 to cease generating masks for the neural network in response to the value of the increase sparsity counter 300 exceeding the increase sparsity counter threshold and/or the value of the decrease sparsity counter 302 exceeding the decrease sparsity counter threshold" Decrease sparsity threshold interpreted as a measure of training performance. Exceeding the decrease sparsity counter threshold interpreted as synonymous with a performance decreasing below a first threshold.). The combination of Hubara and Stosic as well as Srinivas are directed towards neural network pruning. Therefore, the combination of Hubara and Stosic as well as Srinivas are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Hubara and Stosic with the teachings of Srinivas. Srinivas teaches a more fine-grained pruning method than the combination of Hubara and Stosic and provides as additional motivation for combination (the embodiments described herein may improve the implementation of the neural network by reducing the resource cost for such implementation with a dense weight tensor of the neural network or a sparse weight tensor generated by other means for increasing sparsity of a weight tensor.). This motivation for combination also applies to the remaining claims which depend on this combination.

Regarding claim 17, the combination of Stosic, Hubara, and Srinivas teaches The method of claim 16, further comprising: in response to the measure of training performance decreasing below a second threshold, progressing the training into a third phase by adjusting the decreased sparsity level to a further decreased sparsity level. (Srinivas [¶0056] "the sparsity level decrementor 306 may decrease the level of sparsity to a predetermined level of sparsity. As another example, the sparsity level decrementor 306 may decrease the level of sparsity to a level lower than the sparsity level of the sparsity level incrementor 304. As another example, the sparsity level decrementor 306 may decrease the level of sparsity to a level higher than the sparsity level of a dense weight tensor (e.g., input dense weight tensor 210 in FIG. 2) for the trained neural network. As another example, the sparsity level decrementor 306 may decrease the level of sparsity to a level as low as zero.
In some embodiments, the sparsity level decrementor 306 may decrease the level of sparsity according to a sparsity profile designating values, such as numerically, of various levels of sparsity. In some embodiments, the sparsity level decrementor 306 may decrease the level of sparsity according to a sparsity profile designating values, such as algorithmically, of various levels of sparsity." [¶0057] "the sparsity level decrementor 306 may use various factors for control the level of sparsity used for generating a mask. For example, the sparsity level decrementor 306 may use a sparsity parameter to control the level of sparsity used for generating a mask. As another example, the sparsity level decrementor 306 may use a number of cycles of updating, or training, of a neural network by a neural network trainer to control the level of sparsity used for generating a mask. As another example, the sparsity level decrementor 306 may use an accuracy value of the neural network to control the level of sparsity used for generating a mask. As another example, the sparsity level decrementor 306 may use a previously set level of sparsity set by the sparsity level incrementor 304 to control the level of sparsity used for generating a mask. As another example, the sparsity level decrementor 306 may use a previously set level of sparsity set by the sparsity level decrementor 306 to control the level of sparsity used for generating a mask. In some embodiments, the sparsity level decrementor 306 may use any number or combination of these examples to control the level of sparsity used for generating a mask.").

Claim 12 is rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Stosic and Hubara and Zhuo.

Regarding claim 12, the combination of Stosic and Hubara teaches The method of claim 9. However, the combination of Stosic and Hubara doesn't explicitly teach wherein combining the first balanced sparsity mask and the second balanced sparsity mask to generate the third sparsity mask includes combining the first balanced sparsity mask and the second balanced sparsity mask using an elementwise Boolean AND operation. Zhuo, in the same field of endeavor, teaches combining the first balanced sparsity mask and the second balanced sparsity mask to generate the third sparsity mask includes combining the first balanced sparsity mask and the second balanced sparsity mask using an elementwise Boolean AND operation. ([Abstract] "a final pruning mask is obtained by performing a bitwise logic AND operation on pruning masks independently generated by two sparse methods, and then a weight matrix of the neural network after sparsity is obtained"). The combination of Hubara and Stosic as well as Zhuo are directed towards pruning neural networks. Therefore, the combination of Hubara and Stosic as well as Zhuo are analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Hubara and Stosic with the teachings of Zhuo by combining masks using a logical AND operator in a mixed-grain sparsity system. The combination of Hubara and Stosic is directed towards coarse grained pruning, while Zhuo teaches a means of implementing coarse-grained pruning with fine-grained pruning in a mixed-grained pruning system utilizing logical AND operators.
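For readers unfamiliar with the claimed combination step, the following is a minimal numpy illustration. It is not Zhuo's algorithm and not code from the application; the two example masks are arbitrary stand-ins. It merges two independently generated pruning masks with an elementwise Boolean AND, so a weight survives only if both masks retain it:

import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))

# Two independently generated pruning masks, e.g. produced by two
# different sparsity criteria.
mask_a = np.abs(W) > 0.5                 # magnitude-threshold mask
mask_b = rng.random(W.shape) > 0.25      # stand-in for a second criterion

# Elementwise Boolean AND: an entry is kept only if BOTH masks keep it,
# so the combined mask is never less sparse than either input mask.
final_mask = np.logical_and(mask_a, mask_b)
W_sparse = W * final_mask

Because AND can only remove additional entries, combining a row-wise N:M mask with a column-wise N:M mask in this way yields a mask with at most N non-zero parameters per group in both dimensions, which is the property recited for the third, transposable mask.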
Zhuo provides as additional motivation for combination ([¶0007] "the present invention proposes a mixed-granularity-based joint sparse method for a neural network, which is the key to achieve efficient GPU reasoning in a convolutional neural network." [¶0005] "Different from the fine-grained sparse mode, the coarse-grained sparse mode is considered as a beneficial alternative to improve the hardware implementation efficiency."). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Mishra (“Accelerating Sparse Deep Neural Networks”, 2021) is directed towards N:M sparsity masks for neural network training. Chmiel (“Optimal Fine-Grained N:M Sparsity for Activations and Neural Gradients”, 2022) is directed towards N:M sparsity masks for neural network training with an emphasis on transposed masks for backward propagation. Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /SIDNEY VINCENT BOSTWICK/Examiner, Art Unit 2124

Prosecution Timeline

Mar 29, 2022: Application Filed
May 10, 2025: Non-Final Rejection — §103
Jul 16, 2025: Examiner Interview Summary
Jul 16, 2025: Applicant Interview (Telephonic)
Aug 15, 2025: Response Filed
Sep 15, 2025: Final Rejection — §103
Nov 24, 2025: Response after Non-Final Action
Dec 22, 2025: Request for Continued Examination
Jan 15, 2026: Response after Non-Final Action
Jan 29, 2026: Non-Final Rejection — §103
Apr 01, 2026: Applicant Interview (Telephonic)
Apr 01, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561604
SYSTEM AND METHOD FOR ITERATIVE DATA CLUSTERING USING MACHINE LEARNING
2y 5m to grant • Granted Feb 24, 2026
Patent 12547878
Highly Efficient Convolutional Neural Networks
2y 5m to grant • Granted Feb 10, 2026
Patent 12536426
Smooth Continuous Piecewise Constructed Activation Functions
2y 5m to grant • Granted Jan 27, 2026
Patent 12518143
FEEDFORWARD GENERATIVE NEURAL NETWORKS
2y 5m to grant • Granted Jan 06, 2026
Patent 12505340
STASH BALANCING IN MODEL PARALLELISM
2y 5m to grant • Granted Dec 23, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 52%
With Interview: 90% (+38.2%)
Median Time to Grant: 4y 7m
PTA Risk: High
Based on 136 resolved cases by this examiner. Grant probability derived from career allow rate.
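For reference, the baseline figures are internally consistent: 71 granted out of 136 resolved cases is approximately 52.2%, matching the 52% career allow rate and grant probability, and adding the +38.2% interview lift to that baseline gives roughly 90%, the with-interview figure shown.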
