DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/23/2025 has been entered.
Response to Amendment
Claims 1-4 and 7-26 remain pending in the application.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 7-9, 13-19, and 23-24 are rejected under 35 U.S.C. 103 as being unpatentable over Kinley (“Two-Stream Transformer Architecture With Discrete Attention for Better Interpretability and Separation of Model Concerns”), hereafter Kinley, in view of Li et al. ("EDD: Efficient Differentiable DNN Architecture and Implementation Co-search for Embedded AI Solutions"), hereafter Li.
Regarding claim 1, Kinley discloses:
- A system for performing a machine learning task on a network input to generate a network output using an attention neural network with reduced latency, the system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to implement (Kinley, page 22, paragraph 1, lines 1-2 “We use standard transformer architectures in all experiments.” recites a standard transformer architecture comprising one or more computers and one or more storage devices storing instructions),
- an attention neural network configured to perform the machine learning task, the attention neural network comprising a plurality of layers, each layer comprising an attention sub-layer and a feed-forward sub-layer (Kinley, Figure 4.2, equation 3.1, page 14, last paragraph, line 1 “Here l is the layer index, FFN is a feed-forward neural network”, and page 6, paragraph 3, lines 1-2 “To validate this approach, we perform experiments on classification and translation” teaches an attention neural network configured to perform the machine learning task, the attention neural network comprising a plurality of layers, each layer comprising an attention sub-layer and a feed-forward sub-layer, the feed-forward sub-layer indicated by FFN),
- the attention layer configured to: obtain an input sequence for the layer comprising a respective layer input at each of one or more positions (Kinley, Figure 4.2, equation 3.1, page 18, paragraph 2, line 3 “The first layer representations are set to the corresponding word embeddings”, and page 14, last paragraph, lines 1-2 “We begin by encoding tokens xi with a position-specific embedding function e to vectors hi(1). Each layer of the transformer then produces new vectors” teaches obtaining an input sequence comprising a respective layer input xi at each position),
- generate an attended input sequence at least in part by applying one or more attention mechanisms to the input sequence for the layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions (Kinley, equation 3.1, Figure 4.2 and paragraph below the figure teaches generating an attended input sequence by applying attention mechanisms outlined in the model and controller stream),
- the feed-forward layer configured to: receive the attended input sequence (Kinley, page 18, Figure 4.2, paragraph above figure, lines 1-2 “Let the controller representation at timestep i in layer l be gi(l) and the model representation be hi(l)”, and the paragraph below the figure teaches the feed-forward layer FFN receiving the attended input sequence from the previous layer through g/h),
- generate an output sequence for the layer from at least the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions (Kinley, Figure 4.2, equation 3.1, and paragraph below equation 3.1, lines 1-2 “Here l is the layer index… At the final layer L we can make a prediction utilizing h(L),” teaches generating an output sequence h for each layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions),
- the generating the output sequence comprising, for each of the positions: obtaining an attended layer input at the position (Kinley, Figure 4.2 teaches obtaining an attended layer input at the index),
- selecting, based on the attended layer input, a proper subset of elements in an intermediate output that are constrained to have a zero value, wherein the other elements of the attended layer input that are not in the proper subset are not constrained to have a zero value (Kinley, page 15, paragraph 1, lines 4-11 “the Gumbel-Softmax generative process is defined by first sampling Uk ~ Uniform (0, 1), and then returning a K-dimensional sample vector s… samples given by the Gumbel-Softmax distribution conforms to the same distribution as one-hot categorical samples.” teaches generating sample vectors via Gumbel-Softmax to select a proper subset of elements in an intermediate output that are constrained to have a zero value, wherein the other elements of the attended layer input that are not in the proper subset are not constrained to have a zero value, with the one-hot categorical sample distribution being the zero and non-zero constraint imposed),
- processing the attended layer input through a feed-forward neural network layer to generate the intermediate output while constraining the elements in the proper subset to have a zero value (Kinley, page 16, paragraph 5, lines 1-3 “Formally, we modify the above deterministic Eq 3.1 with a single sampling step:
[equation image: Eq. 3.1 modified with a single sampling step]
”, page 17, paragraph 2, line 1 “we use the Gumbel-Softmax estimator”, Figure 4.2, and paragraph below Figure 4.2 teaches processing the attended layer input through a feed-forward neural network layer to generate the intermediate output while constraining the elements in the proper subset to have a zero value, indicated by the sampling step provided under the model stream),
- the processing comprising computing a product between the attended layer input and the weight matrix of the feed-forward neural network layer using only columns of the weight matrix of the feed-forward neural network layer that correspond to the elements of the intermediate output that are not constrained to be zero (Kinley, Figure 4.2, page 6, paragraph 2, lines 2-6 “we use the Gumbel-Softmax approximation … which is differentiable, to approximate discrete attention… Each intermediate representation is built up based on a subset of the attended lower-layer, guaranteeing that hidden states solely depend on attended elements.” and paragraph below Figure 4.2 teaches computing a product between the attended layer input and the weight matrix of the feed-forward neural network layer),
- the computing a product comprising loading only the columns of the weight matrix of the feed-forward neural network layer that correspond to the elements of the intermediate output that are not constrained to be zero … such that the weight matrix of the feed-forward neural network is sparsely loaded … for computing the product with the attended layer input (Examiner’s Note: For prior art purposes, the examiner interprets BRI of loading columns of the weight matrix of the feed-forward neural network layer to include updating attention layer weights, as per specification page 11, line 26 to page 12, line 4) (Kinley, page 15, paragraph 1, lines 6-9 “gk = -log(-log(Uk)) is Gumbel noise and τ is a temperature parameter controlling the entropy of the distribution. As τ → 0, samples given by the Gumbel-Softmax distribution conforms to the same distribution as one-hot categorical samples.” and Figure 4.2, paragraph below the figure, teaches loading only the columns of the weight matrix of the feed-forward neural network layer that correspond to the elements of the intermediate output that are not constrained to be zero, through updating attention layer weights via Gumbel-Softmax sampling within the Model Stream, such that the weight matrix of the feed-forward neural network is sparsely loaded after the sampling to compute the product with the attended layer input),
- applying a linear transformation to the intermediate output to generate a transformed output (Kinley, page 22, paragraph 2, lines 2-3 “We use a learned linear layer to project h0(L) to a 5-dimensional space” teaches applying a linear transformation to generate a transformed output),
- generating the layer output at the position from the transformed output (Kinley, page 22, paragraph 2, lines 2-4 “We use a learned linear layer to project h0(L) to a 5-dimensional space and pass it through softmax to get a distribution y over the classes.” teaches generating the layer output at the position from the transformed output).
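For context, the Gumbel-Softmax sampling step cited throughout this rejection (Kinley, page 15) can be sketched in a few lines of numpy. This is an illustrative sketch only, not code from any cited reference; the function name and example values are hypothetical:

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Differentiable, near-one-hot sample from a categorical
    distribution (the Gumbel-Softmax estimator cited above)."""
    u = rng.uniform(1e-10, 1.0, size=logits.shape)  # Uk ~ Uniform(0, 1)
    g = -np.log(-np.log(u))                         # gk: Gumbel noise
    y = (logits + g) / tau                          # tau: temperature
    e = np.exp(y - y.max())                         # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
s = gumbel_softmax(np.array([2.0, 0.5, -1.0]), tau=0.05, rng=rng)
# As tau approaches 0, the sample approaches a one-hot vector: nearly
# all probability mass lands on one element and the rest are
# numerically constrained toward zero.
```

This is the sense in which, on the examiner's reading, the sampling step selects a proper subset of elements constrained to zero.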
While Kinley discloses loading only the columns of the weight matrix of the feed-forward neural network layer that correspond to the elements of the intermediate output that are not constrained to be zero … such that the weight matrix of the feed-forward neural network is sparsely loaded … for computing the product with the attended layer input, Kinley does not explicitly disclose this loading and sparse loading to be done from memory.
Li discloses:
- loading … elements of … intermediate output that are not constrained to be zero from memory such that … feed-forward neural network is sparsely loaded from memory (Li, page 1, section introduction, paragraph 2, lines 5-7 “Here, for general-purpose computing devices, such as CPUs and GPUs, implementation search means optimizing DNN implementations such as kernel fusion and memory access optimization” and page 2, section 3.1, last 8 lines “in this work, we use the Gumbel-Softmax function in [13] in order to sample only one operation out of M during feedforward propagation, since Gumbel-Softmax function can convert the discrete non-differentiable sampling to continuous differentiable sampling. This greatly reduces the memory requirement and speeds up the feedforward propagation.” teaches Gumbel-Softmax function for DNNs to load elements of intermediate output that are not constrained to be zero from memory such that feed-forward neural network is sparsely loaded from memory for memory access optimization),
Kinley and Li are analogous art because they are from the same field of endeavor, deep neural network models and Gumbel-Softmax sampling.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley to include loading … elements of … intermediate output that are not constrained to be zero from memory such that … feed-forward neural network is sparsely loaded from memory, based on the teachings of Li. One of ordinary skill in the art would have been motivated to make this modification in order to improve speed, as suggested by Li (page 2, section 3.1, last 8 lines).
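The claimed "sparse loading" at issue in claim 1 — computing the feed-forward product using only the weight columns whose output elements are not constrained to zero — can be illustrated with a minimal numpy sketch. This is hypothetical illustration, not code from Kinley or Li:

```python
import numpy as np

def ffn_sparse_columns(x, W, active):
    """Compute x @ W touching only the columns of W indexed by `active`.

    Elements of the intermediate output outside `active` are constrained
    to zero, so their weight columns never need to be read (in a real
    system, never loaded from memory)."""
    out = np.zeros(W.shape[1])
    out[active] = x @ W[:, active]   # only the active columns are read
    return out

rng = np.random.default_rng(0)
x, W = rng.standard_normal(8), rng.standard_normal((8, 32))
active = np.array([3, 17, 25])       # proper subset of output elements
dense = x @ W
sparse = ffn_sparse_columns(x, W, active)
# `sparse` agrees with the dense product on the active elements and is
# exactly zero everywhere else.
```

The latency benefit recited in the claim preamble follows from reading 3 of 32 columns rather than the full matrix.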
Regarding claim 7, Kinley, in view of Li, discloses the system of claim 1. Kinley further discloses:
- wherein applying a linear transformation to the intermediate output to generate a transformed output comprises: computing a product between a weight matrix of the linear transformation and the intermediate output using only rows of the weight matrix of the linear transformation that correspond to the elements of the intermediate output that are not constrained to be zero (Kinley, page 15, penultimate paragraph, lines 1-3 “The Gumbel-Softmax distribution is smooth for τ > 0, and so the sample s is differentiable with respect to l. We can thus use backpropagation to compute gradients during training. This procedure is called Gumbel-Softmax Estimator.” and Table 6.4 caption “All models are trained with a linear projection over the final encoder layer…” teaches applying the linear transformation to include the discrete transformer architecture that computes a product between a weight matrix of the linear transformation and the intermediate output using only rows of the weight matrix of the linear transformation that correspond to the elements of the intermediate output that are not constrained to be zero).
Regarding claim 8, Kinley, in view of Li, discloses the system of claim 7. Kinley further discloses:
- wherein computing a product between a weight matrix of the linear transformation and the intermediate output using only rows of the weight matrix of the linear transformation that correspond to the elements of the intermediate output that are not constrained to be zero comprises: loading only the rows of the weight matrix of the linear transformation that correspond to the elements of the intermediate outputs that are not constrained to be zero from memory (Examiner’s Note: For prior art purposes, the examiner interprets loading data from memory to include updating attention layers) (Kinley, Table 6.4 caption “All models are trained with a linear projection over the final encoder layer…”, and page 15, paragraph 1, lines 6-9 “gk = -log(-log(Uk)) is Gumbel noise and τ is a temperature parameter controlling the entropy of the distribution. As τ → 0, samples given by the Gumbel-Softmax distribution conforms to the same distribution as one-hot categorical samples.” and Figure 4.2, paragraph below the figure, teaches loading only the rows of the weight matrix of the linear transformation that correspond to the elements of the intermediate outputs that are not constrained to be zero from memory within the Model Stream).
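The row-wise analogue recited in claims 7-8 — reading only the rows of the projection matrix that multiply the non-zero intermediate elements — can be sketched similarly (illustrative only; all names are hypothetical, not drawn from the references):

```python
import numpy as np

def project_sparse_rows(h, V, active):
    """Compute h @ V using only the rows of V that multiply the
    non-zero elements of the intermediate output h.

    Rows of V outside `active` multiply zeros of h, so they contribute
    nothing and need not be read."""
    return h[active] @ V[active, :]

rng = np.random.default_rng(1)
h = np.zeros(32)                      # intermediate output, zero off-subset
active = np.array([3, 17, 25])
h[active] = rng.standard_normal(3)    # only the selected subset is non-zero
V = rng.standard_normal((32, 16))     # weight matrix of the linear transform
projected = project_sparse_rows(h, V, active)
```

Because h is exactly zero outside the subset, the reduced product equals the full product h @ V.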
Regarding claim 9, Kinley, in view of Li, discloses the system of claim 1. Kinley further discloses:
- wherein generating the layer output from the transformed output comprises: applying a residual connection, layer normalization, or both to the transformed outputs at the positions to generate the layer outputs in the output sequence (Kinley, page 13, last paragraph, last line “Each sub-layer has a residual connection and layer normalization” and page 14, paragraph 2, lines 1-3 “For the sake of notational simplicity…We elide layernorm, dropout, most residual connections, and multi-headedness, but they are all included in the final models.” teaches applying a residual connection, layer normalization, or both to the transformed outputs at the positions to generate the layer outputs in the output sequence).
Regarding claim 13, Kinley, in view of Li, discloses the system of claim 1. Kinley further discloses:
- wherein the attention neural network comprises an encoder that generates encoded activations that represent the network input (Kinley, Figure 3.1 teaches an encoder in the left half of the figure that generates encoded activations that represent the network input),
- a decoder that includes a first subset of the plurality of attention layers and generates the network output from the encoded activations (Kinley, Figure 3.1 and page 14, paragraph 1, lines 1-2 “The decoder of the transformer uses information from the encoder representation to generate the final output (for instance, translation)” teaches a decoder that includes a first subset of the plurality of attention layers and generates the network output from the encoded activations).
[Kinley, Figure 3.1 reproduced]
Regarding claim 14, Kinley, in view of Li, discloses the system of claim 13. Kinley further discloses:
- wherein the encoder includes a second subset of the plurality of attention layers (Kinley, Figure 3.1, left half of the figure as the encoder, and page 23, penultimate paragraph, line 1 “Since transformers have multiple attention layers” teaches the encoder to include a second subset of the plurality of attention layers).
Regarding claim 15, Kinley, in view of Li, discloses the system of claim 13. Kinley further discloses:
- wherein the decoder generates the network output by generating each element of the network output auto-regressively (Kinley, Figure 3.1 teaches a decoder on the right half of the figure that generates the network output by generating each element of the network output auto-regressively),
- wherein, for each attention layer in the decoder, the input sequence includes a sequence derived from the encoded activations followed by a sequence derived from any elements of the network output that have already been generated (Kinley, Figure 3.1, right half of the figure, teaches an input sequence including a sequence derived from the encoded activations followed by a sequence derived from any elements of the network output that have already been generated).
Regarding claim 16, Kinley, in view of Li, discloses the system of claim 15. Kinley further discloses:
- wherein the one or more attention mechanisms applied by the attention sub-layer of each of the attention layers in the decoder are masked self-attention mechanisms (Kinley, Figure 3.1, and page 14, paragraph 1, lines 5-6 “The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to the future” teaches masked self-attention mechanisms in the decoder).
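The masked self-attention mapped to claim 16 is conventionally implemented with a lower-triangular (causal) mask. A minimal numpy sketch, illustrative only and not taken from Kinley:

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may attend only to j <= i,
    preventing positions from attending to the future."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_self_attention(q, k, v):
    """Scaled dot-product self-attention with future positions masked
    out, as described for the decoder in the passage cited above."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(causal_mask(len(q)), scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 4, 8))
out = masked_self_attention(q, k, v)
```

Masked positions receive -inf scores and therefore zero attention weight; the first position can only attend to itself.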
Regarding claim 17, Kinley, in view of Li, discloses the system of claim 13. Kinley further discloses:
- wherein, for each attention layer in the decoder, the attention sub-layer is configured to: generate, from the input sequence, an initial attended input sequence at least in part by applying a first attention mechanism to at least a portion of the input sequence for the attention layer (Kinley, Figure 3.1 teaches generating, from the input sequence, an initial attended input sequence at least in part by applying a first attention mechanism to at least a portion of the input sequence for the attention layer in the decoder on the right),
- generate, from the initial attended input sequence, the attended input sequence at least in part by applying a second attention mechanism to at least a portion of the initial attended input sequence (Kinley, Figure 3.1 teaches applying a second attention mechanism to at least a portion of the initial attended input sequence to generate an attended input sequence in the decoder on the right).
Regarding claim 18, Kinley, in view of Li, discloses the system of claim 17. Kinley further discloses:
- wherein generating the attended input sequence comprises applying layer normalization to the initial attended input sequence prior to applying the second attention mechanism (Kinley, Figure 3.1, decoder on the right side teaches “Add & Norm” layers as applying layer normalization to the initial attended input sequence prior to applying the second attention mechanism).
Regarding claim 19, Kinley, in view of Li, discloses the system of claim 1. Kinley further discloses:
- wherein obtaining the input sequence for the layer comprising applying layer normalization to an initial input sequence for the layer (Kinley, page 13, last line “Each sub-layer has a residual connection and layer normalization” and Figure 3.1 teaches applying layer normalization to an initial input sequence for the layer).
Claims 23 and 24 are substantially similar to claim 1, and thus are rejected on the same basis as claim 1.
Claims 2-4 are rejected under 35 U.S.C. 103 as being unpatentable over Kinley (“Two-Stream Transformer Architecture With Discrete Attention for Better Interpretability and Separation of Model Concerns”), hereafter Kinley, in view of Li et al. ("EDD: Efficient Differentiable DNN Architecture and Implementation Co-search for Embedded AI Solutions"), hereafter Li, in further view of Yang et al. ("Modeling Point Clouds with Self-Attention and Gumbel Subset Sampling"), hereafter Yang.
Regarding claim 2, Kinley, in view of Li, discloses the system of claim 1.
Kinley teaches selecting, based on the attended layer input, a proper subset of elements in an intermediate output that are constrained to have a zero value and processing the attended layer input through a feed-forward neural network layer to generate the intermediate output while constraining the elements in the proper subset to have a zero value in claim 1, but does not explicitly disclose:
- wherein the elements of the intermediate outputs are partitioned into a plurality of blocks and
- wherein selecting, based on the attended layer input, a proper subset of elements in an intermediate output that are constrained to have a zero value comprises: selecting a respective element from each block,
- for each block, constraining each element in the block other than the respective selected element from the block to have a zero value.
Yang discloses:
- wherein the elements of the intermediate outputs are partitioned into a plurality of blocks (Yang, Figure 2 and Figure 2 caption “In classification, the features alternately pass through Group Shuffle Attention (GSA) blocks and down-sampling blocks … our Gumbel Subset Sampling (GSS)” teaches partitioning elements of an intermediate output into a plurality of blocks),
- wherein selecting, based on the attended layer input, a proper subset of elements in an intermediate output that are constrained to have a zero value comprises: selecting a respective element from each block (Yang, Figure 2, Figure 3 (b), and Figure 3(b) caption lines 2-3 “For Ni+1 rounds, one point is selected from the Ni input points competitively” teaches selecting, based on the attended layer input, a proper subset of elements in an intermediate output that are constrained to have a zero value to comprise selecting a point as a respective element from each block),
- for each block, constraining each element in the block other than the respective selected element from the block to have a zero value (Yang, Figure 2, Figure 3 (b), and Figure 3(b) caption lines 4-6 “every input point produces one selection score (Eq.13), with a max score to be selected. A Gumbel-Softmax (Eq. 3) with high temperature is used for soft selection in training phase” teaches constraining each element in the block other than the respective selected element from the block to have a zero value during soft training, using Gumbel-softmax).
Kinley, Li, and Yang are analogous art because they are from the same field of endeavor, DNNs and Gumbel Sampling.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley, in view of Li, to include wherein the elements of the intermediate outputs are partitioned into a plurality of blocks and wherein selecting, based on the attended layer input, a proper subset of elements in an intermediate output that are constrained to have a zero value comprises: selecting a respective element from each block; and for each block, constraining each element in the block other than the respective selected element from the block to have a zero value, based on the teachings of Yang. One of ordinary skill in the art would have been motivated to make this modification in order to obtain better performed models with lower computation cost, as suggested by Yang (Yang, page 2, left column, paragraph 4, last 2 lines “classification models are better-performed with lower computation cost”).
Regarding claim 3, Kinley, in view of Li, in further view of Yang, discloses the system of claim 2. Yang further discloses:
- wherein selecting the proper subset comprises: projecting the attended layer input using a learned transformation to generate a projected layer input that has the same dimensionality as the intermediate output (Yang, Figure 2 and Figure 3 teaches projecting the attended layer input using learned transformations, downsampling and Group Shuffle Attention, i.e., GSA, to generate a projected layer input which has the same dimensionality as the intermediate output),
- for each block of the projected layer input, selecting the element with the highest value of any element in the block and constraining each element in the corresponding block in the intermediate output other than the element corresponding to the selected element to have a zero value (Yang, Figure 2, Figure 3(b) and page 3, left column, equations 3 and 4, and paragraph below equation 4, lines 1-6 “we are able to draw differentiable samples (Eq. 3) from the distribution Cat… in training phase. In practice, τ starts at a high value (e.g., 1.0), and anneals to a small value (e.g., 0.1). Optimization on the Gumbel Softmax distribution could be interpreted as solving a certain entropy-regularized linear program” teaches for each block of the projected layer input, selecting the element with the highest value of any element in the block and constraining each element in the corresponding block in the intermediate output other than the element corresponding to the selected element to have a zero value).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley, in view of Li, to include wherein selecting the proper subset comprises: projecting the attended layer input using a learned transformation to generate a projected layer input that has the same dimensionality as the intermediate output, and for each block of the projected layer input, selecting the element with the highest value of any element in the block and constraining each element in the corresponding block in the intermediate output other than the element corresponding to the selected element to have a zero value, based on the teachings of Yang. One of ordinary skill in the art would have been motivated to make this modification in order to obtain better performed models with lower computation cost, as suggested by Yang (Yang, page 2, left column, paragraph 4, last 2 lines “classification models are better-performed with lower computation cost”).
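The block-partitioned selection at issue in claims 2-3 — partition the intermediate output into blocks, keep the highest-valued element of each block, and constrain the rest to zero — amounts to a per-block argmax. A hedged numpy sketch (hypothetical names, not Yang's code):

```python
import numpy as np

def select_per_block(projected, block_size):
    """Partition `projected` into contiguous blocks and keep only the
    highest-valued element in each block, constraining every other
    element of the block to zero."""
    blocks = projected.reshape(-1, block_size)
    keep = blocks.argmax(axis=1)                   # one element per block
    mask = np.zeros_like(blocks)
    mask[np.arange(len(blocks)), keep] = 1.0
    return (blocks * mask).reshape(-1)

x = np.array([0.2, 0.9, -0.1, 0.4,   # block 0: 0.9 is the max
              1.5, 0.3, 0.8, 0.1])   # block 1: 1.5 is the max
out = select_per_block(x, block_size=4)
```

During training, Yang's soft selection would replace the hard argmax with a high-temperature Gumbel-Softmax; the hard form above corresponds to the inference-time behavior described in the rejection.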
Regarding claim 4, Kinley, in view of Li, in further view of Yang, discloses the system of claim 3. Yang further discloses:
- wherein the learned transformation is a low-rank bottleneck dense layer (Yang, page 2, left column, paragraph 4, last 2 lines “the sampling operation is a bottleneck of the hierarchical structures”, Figure 2, and Figure 3 teaches the applied sampling operations as a low-rank bottleneck dense layer).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley, in view of Li, to include wherein the learned transformation is a low-rank bottleneck dense layer, based on the teachings of Yang. One of ordinary skill in the art would have been motivated to make this modification in order to obtain better performed models with lower computation cost, as suggested by Yang (Yang, page 2, left column, paragraph 4, last 2 lines “classification models are better-performed with lower computation cost”).
Claims 10-12 are rejected under 35 U.S.C. 103 as being unpatentable over Kinley (“Two-Stream Transformer Architecture With Discrete Attention for Better Interpretability and Separation of Model Concerns”), hereafter Kinley, in view of Li et al. ("EDD: Efficient Differentiable DNN Architecture and Implementation Co-search for Embedded AI Solutions"), hereafter Li, in further view of Zhang et al. ("Convolutional Multi-Head Self-Attention on Memory for Aspect Sentiment Classification"), hereafter Zhang.
Regarding claim 10, Kinley, in view of Li, discloses the system of claim 1. Kinley further discloses:
- wherein a first attention mechanism of the one or more attention mechanisms is a multi-head attention mechanism having a plurality of attention heads that each apply query-key-value attention (Kinley, page 14, paragraph 2, lines 1-3 “For the sake of notational simplicity…We elide layernorm, dropout, most residual connections, and multi-headedness, but they are all included in the final models.” and Figure 3.1 teaches a multi-head attention mechanism having a plurality of attention heads that each apply query-key-value attention),
- wherein the attention sub-layer is configured to: process the respective layer inputs in the input sequence using a multiplicative dense layer to generate, for each respective layer input, a respective split input comprising S modules of size M, where S and M are both integers greater than 1 (Kinley, Figure 3.1, and page 12, last paragraph, lines 2-3 “the multi-head mechanism linearly projects the queries, keys, and values h times with different learned linear projections into dk, dk, and dv dimensions, respectively” teaches processing the respective layer inputs using a multiplicative dense layer, i.e., the learned linear projections, to generate a respective split input for each layer input),
- process a tensor comprising the respective split inputs for the respective layer inputs in the sequence … to generate a respective set of queries for each attention head (Kinley, page 11, paragraph 1, line 5 “The query q…” and paragraph 2, lines 1-2 “Self-attention is used to make a prediction for one part of a sequence x using other parts of the same sequence. In this case, all of q, k, v are generated using x. This type of attention can be used to encode a sequence” and Figure 3.1 teaches processing a tensor comprising the respective split inputs for the respective layer inputs in the sequence to generate a respective set of queries for each attention head),
- for each attention head, apply query-key-value attention over the respective sets of queries, keys, and values for the attention head to generate an attended output (Kinley, page 12, last paragraph, last 4 lines teaches computing “attention(…)” as applying query-key-value attention over the respective sets of queries, keys, and values for the attention head to generate an attended output),
- combine the attended outputs of the attention heads (Kinley, page 12, last paragraph, last 4 lines teaches calculating the concatenation using “MultiHeadAttention(…)” to combine the attended outputs of the attention heads).
Kinley teaches processing a tensor comprising the respective split inputs for the respective layer inputs in the sequence … to generate a respective set of queries for each attention head, but does not teach doing so using a first two-dimensional convolutional layer.
Zhang teaches:
- process a tensor comprising the respective split inputs for the respective layer inputs in the sequence using a first two-dimensional convolutional layer to generate a respective set of queries for each attention head (Zhang, Fig. 1 and Fig. 2 teaches processing a tensor comprising the respective split inputs for the respective layer inputs in the sequence using a first two-dimensional convolutional layer to generate a respective set of queries for each attention head).
Kinley does not teach:
- process the tensor using a second two-dimensional convolutional layer to generate a respective set of values for each attention head,
- process the tensor using a third two-dimensional convolutional layer to generate a respective set of keys for each attention head.
Zhang teaches:
- process the tensor using a second two-dimensional convolutional layer to generate a respective set of values for each attention head (Zhang, Fig. 1-3 teach processing the tensor using a second two-dimensional convolutional layer to generate a respective set of values for each attention head),
- process the tensor using a third two-dimensional convolutional layer to generate a respective set of keys for each attention head (Zhang, Fig. 1-3 teach processing the tensor using a third two-dimensional convolutional layer to generate a respective set of keys for each attention head).
Kinley, Li, and Zhang are analogous art because they are from the same field of endeavor, deep neural networks.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley, in view of Li, to include processing a tensor comprising the respective split inputs for the respective layer inputs in the sequence using a first two-dimensional convolutional layer to generate a respective set of queries for each attention head, processing the tensor using a second two-dimensional convolutional layer to generate a respective set of values for each attention head, and processing the tensor using a third two-dimensional convolutional layer to generate a respective set of keys for each attention head, based on the teachings of Zhang. One of ordinary skill in the art would have been motivated to make this modification in order to extract more complex and richer semantic information from sequences and aspects, as suggested by Zhang (Zhang, page 1039, left column, paragraph 1, lines 11-12 “to extract more complex and richer semantic information from sequences and aspects.”).
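For context only and not as part of the record, the mechanism described by the combination (three separate convolutional layers generating queries, keys, and values, which are attended per head and then concatenated) can be sketched in numpy. All sizes, weights, and function names below are illustrative assumptions and are not taken from Kinley, Li, or Zhang:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_over_length(x, w):
    """Convolve M filters of shape (K, D) over the length dimension of x
    with 'same' zero padding. x: (L, D); w: (M, K, D); returns (L, M)."""
    L, D = x.shape
    M, K, _ = w.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((L, M))
    for i in range(L):
        out[i] = np.einsum('kd,mkd->m', xp[i:i + K], w)
    return out

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

L, D, M, K, H = 6, 8, 4, 3, 2   # illustrative sizes, not from the references
x = rng.normal(size=(L, D))
w_q, w_k, w_v = (rng.normal(size=(M, K, D)) * 0.1 for _ in range(3))

# Three separate convolutional layers generate queries, keys, and values.
q = conv_over_length(x, w_q)
k = conv_over_length(x, w_k)
v = conv_over_length(x, w_v)

# Split each projection across H attention heads, apply query-key-value
# attention per head, then concatenate (combine) the attended outputs.
heads = []
for h in range(H):
    s = slice(h * M // H, (h + 1) * M // H)
    scores = q[:, s] @ k[:, s].T / np.sqrt(M // H)
    heads.append(softmax(scores) @ v[:, s])
attended = np.concatenate(heads, axis=-1)
print(attended.shape)  # (6, 4)
```

The per-head scaled dot-product and the final concatenation mirror the MultiHeadAttention computation cited from Kinley at page 12, while the three separate convolutions correspond to the teaching attributed to Zhang.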
Regarding claim 11, Kinley, in view of Li, in further view of Zhang, discloses the system of claim 10. Kinley further discloses:
- the multiplicative dense layer performs operations that can represent any arbitrary permutation on any given layer input (Kinley, page 14, note 8 “Self attention is permutation invariant. The positional encoding provides order information to the model. This is done by adding a position-specific vector such as a sinusoidal encoding to the word embeddings of each token xi”).
Regarding claim 12, Kinley, in view of Li, in further view of Zhang, discloses the system of claim 10. Kinley further discloses:
- wherein S is equal to the number of attention heads (Kinley, page 12, last 4 lines teaches h as the number of attention heads),
- M is equal to the dimensionality of the queries, keys, and values (Kinley, page 12, last 2 lines teaches the dimensionalities of the queries, keys, and values).
Kinley does not teach:
- the first, second, and third convolutional layers each have M filters with a kernel size of KxK that are convolved over a length dimension of the tensor that corresponds to the number of layer inputs in the input sequence.
Zhang teaches:
M is equal to the dimensionality of the queries, keys, and values … the first, second, and third convolutional layers each have M filters with a kernel size of KxK that are convolved over a length dimension of the tensor that corresponds to the number of layer inputs in the input sequence (Zhang, Figs. 1-3, equation 3, and page 1039, paragraph 3, last 8 lines [reproduced equation image, media_image3.png, omitted] teaches convolutional layers having d filters, where d is equal to the dimensionality of the queries, keys, and values, and these filters have a kernel size and are convolved over a length dimension of the tensor that corresponds to the number of layer inputs in the input sequence).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley to include M is equal to the dimensionality of the queries, keys, and values … the first, second, and third convolutional layers each have M filters with a kernel size of KxK that are convolved over a length dimension of the tensor that corresponds to the number of layer inputs in the input sequence, based on the teachings of Zhang. One of ordinary skill in the art would have been motivated to make this modification in order to extract more complex and richer semantic information from sequences and aspects, as suggested by Zhang (Zhang, page 1039, left column, paragraph 1, lines 11-12 “to extract more complex and richer semantic information from sequences and aspects.”).
Claims 20-21 and 25-26 are rejected under 35 U.S.C. 103 as being unpatentable over Kinley (“Two-Stream Transformer Architecture With Discrete Attention for Better Interpretrability and Separation of Model Concerns”), hereafter Kinley, in view of Li et al. ("EDD: Efficient Differentiable DNN Architecture and Implementation Co-search for Embedded AI Solutions"), hereafter Li, in further view of Kitaev et al. ("REFORMER: THE EFFICIENT TRANSFORMER"), as cited in the IDS dated 11/04/2022, hereafter Kitaev.
Regarding claim 20, Kinley, in view of Li, discloses the system of claim 1. Kinley does not disclose:
wherein the attention layers are implemented as reversible layers.
Kitaev discloses:
wherein the attention layers are implemented as reversible layers (Kitaev, page 6, paragraph Reversible Transformer, lines 1-6 “We apply the RevNet idea to the Transformer by combining the attention and feed-forward layers inside the revnet block…The reversible Transformer does not need to store activations in each layer” teaches attention layers implemented as reversible layers).
Kinley, Li, and Kitaev are analogous art because they are from the same field of endeavor, deep neural network models.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley, in view of Li, to include wherein the attention layers are implemented as reversible layers, based on the teachings of Kitaev. One of ordinary skill in the art would have been motivated to make this modification in order to reduce cost, as suggested by Kitaev (Kitaev, page 6, left column, paragraph 2, lines 1-2 “reduce this cost by … using reversible layers”).
Regarding claim 21, Kinley, in view of Li, discloses the system of claim 1. Kinley does not disclose:
- wherein the attention layers are implemented as reversible layers,
- wherein the residual layer includes a first reversible swap after the first attention mechanism, a second reversible swap after the second attention mechanism, and a third reversible swap after the feed-forward sub-layer.
Kitaev discloses:
- wherein the attention layers are implemented as reversible layers (Kitaev, page 6, paragraph Reversible Transformer, lines 1-6 “We apply the RevNet idea to the Transformer by combining the attention and feed-forward layers inside the revnet block…The reversible Transformer does not need to store activations in each layer” teaches attention layers implemented as reversible layers),
- wherein the residual layer includes a first reversible swap after the first attention mechanism, a second reversible swap after the second attention mechanism, and a third reversible swap after the feed-forward sub-layer (Kitaev, page 6, equations 8 and 10, and final paragraph, line 4 “we can swap layer parameters” teaches reversible swaps for each layer).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley, in view of Li, to include wherein the attention layers are implemented as reversible layers, wherein the residual layer includes a first reversible swap after the first attention mechanism, a second reversible swap after the second attention mechanism, and a third reversible swap after the feed-forward sub-layer, based on the teachings of Kitaev. One of ordinary skill in the art would have been motivated to make this modification in order to reduce cost, as suggested by Kitaev (Kitaev, page 6, left column, paragraph 2, lines 1-2 “reduce this cost by … using reversible layers”).
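For context only and not as part of the record, the RevNet-style coupling and reversible swap cited from Kitaev (page 6, equations 8 and 10) can be sketched as follows. The sub-layer functions are illustrative stand-ins, not Kitaev's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def attn(x):  # stand-in for an attention sub-layer (illustrative only)
    return np.tanh(x)

def ffn(x):   # stand-in for the feed-forward sub-layer (illustrative only)
    return 0.5 * x

def rev_forward(x1, x2):
    # RevNet-style coupling: y1 = x1 + Attn(x2); y2 = x2 + FFN(y1)
    y1 = x1 + attn(x2)
    y2 = x2 + ffn(y1)
    return y1, y2

def rev_backward(y1, y2):
    # Inputs are reconstructed from outputs, so per-layer activations
    # need not be stored during training.
    x2 = y2 - ffn(y1)
    x1 = y1 - attn(x2)
    return x1, x2

x1, x2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
y1, y2 = rev_forward(x1, x2)
# A "reversible swap" exchanges the two residual streams between sub-layers;
# the swap is its own inverse, so the block stays invertible end to end.
s1, s2 = y2, y1
u1, u2 = s2, s1                  # undo the swap
x1r, x2r = rev_backward(u1, u2)
print(np.allclose(x1, x1r) and np.allclose(x2, x2r))  # True
```

Because the inputs are exactly recoverable from the outputs, a reversible layer avoids storing activations, which is the cost reduction Kitaev cites.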
Regarding claim 25, Kinley, in view of Li, discloses the system of claim 1. Kinley does not explicitly disclose:
wherein the attention neural network is implemented across multiple hardware accelerators.
Kitaev discloses:
wherein the attention neural network is implemented across multiple hardware accelerators (Kitaev, page 7 last line – page 8, first line “Training for all experiments was parallelized across 8 devices”).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley, in view of Li, to include wherein the attention neural network is implemented across multiple hardware accelerators, based on the teachings of Kitaev. One of ordinary skill in the art would have been motivated to make this modification in order to parallelize training across multiple devices, as suggested by Kitaev (Kitaev, page 7, last line – page 8, first line “Training for all experiments was parallelized across 8 devices”).
Regarding claim 26, Kinley, in view of Li, in further view of Kitaev, discloses the system of claim 25. Kinley does not explicitly disclose:
wherein the multiple hardware accelerators comprise (i) Tensor Processing Units, (ii) graphics processing units, or (iii) both.
Kitaev discloses:
wherein the multiple hardware accelerators comprise (i) Tensor Processing Units, (ii) graphics processing units, or (iii) both (Kitaev, page 7 last line – page 8, first line “Training for all experiments was parallelized across 8 devices (8 GPUs or 8 TPU v3 cores).”).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley, in view of Li, to include wherein the multiple hardware accelerators comprise (i) Tensor Processing Units, (ii) graphics processing units, or (iii) both, based on the teachings of Kitaev. One of ordinary skill in the art would have been motivated to make this modification in order to parallelize training across multiple devices, as suggested by Kitaev (Kitaev, page 7, last line – page 8, first line “Training for all experiments was parallelized across 8 devices (8 GPUs or 8 TPU v3 cores).”).
Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Kinley (“Two-Stream Transformer Architecture With Discrete Attention for Better Interpretrability and Separation of Model Concerns”), hereafter Kinley, in view of Li et al. ("EDD: Efficient Differentiable DNN Architecture and Implementation Co-search for Embedded AI Solutions"), hereafter Li, in further view of Leng et al. ("Using recurrent neural network structure with Enhanced Multi-Head Self-Attention for sentiment analysis"), hereafter Leng.
Regarding claim 22, Kinley, in view of Li, discloses the system of claim 1. Kinley does not disclose:
- wherein the attention layer further comprises a recurrent block configured to process the attended input sequence to generate an updated attended sequence,
- wherein generating the layer output comprises generating the layer output from the updated attended sequence and the transformed outputs generated by the feed-forward sub-layer.
Leng discloses:
- wherein the attention layer further comprises a recurrent block configured to process the attended input sequence to generate an updated attended sequence (Leng, Fig. 2 and page 12588, paragraph 1, lines 7-8 “The residual mechanism is used between the two Enhanced Multi-Head Self-Attention layers.” teaches a residual mechanism as a recurrent block configured to process the attended input sequence to generate an updated attended sequence),
- wherein generating the layer output comprises generating the layer output from the updated attended sequence and the transformed outputs generated by the feed-forward sub-layer (Leng, Fig. 2 and page 12588, paragraph 1, lines 8-9 “The transformer mechanism is to add the residual after each Multi-Head attention structure.” and paragraph 2, line 1 “Finally, the result is obtained after the layer normalization operation.” teaches generating the layer output from the updated attended sequence and the transformed outputs generated by the feed-forward sub-layer).
Kinley, Li, and Leng are analogous art because they are from the same field of endeavor, deep neural network models.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Kinley, in view of Li, to include wherein the attention layer further comprises a recurrent block configured to process the attended input sequence to generate an updated attended sequence, wherein generating the layer output comprises generating the layer output from the updated attended sequence and the transformed outputs generated by the feed-forward sub-layer, based on the teachings of Leng. One of ordinary skill in the art would have been motivated to make this modification in order to better calculate text information, as suggested by Leng (Leng, page 12588, paragraph 1, lines 2-3 “The bidirectional Self-Attention structure of the Transformer encoder is better than the unidirectional Self-Attention when calculating text information.”).
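For context only and not as part of the record, the arrangement cited from Leng (a residual mechanism between two attention structures, with the layer output formed from the updated attended sequence and the feed-forward outputs, followed by layer normalization) can be sketched with illustrative stand-in functions that are not Leng's implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def attention_block(x):  # stand-in for a multi-head self-attention structure
    return np.tanh(x)

def feed_forward(x):     # stand-in for the feed-forward sub-layer
    return np.maximum(x, 0.0)

x = np.linspace(-1.0, 1.0, 12).reshape(3, 4)

attended = attention_block(x)
# Residual connection between the two attention structures produces the
# updated attended sequence.
updated = attention_block(attended + x)
# The layer output combines the updated attended sequence with the
# feed-forward outputs, followed by layer normalization.
out = layer_norm(updated + feed_forward(updated))
print(out.shape)  # (3, 4)
```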
Response to Arguments
Applicant's arguments filed 12/23/2025 have been fully considered with regard to the 35 U.S.C. 101 rejection, and they are persuasive. The rejection has been withdrawn.
Applicant's arguments filed 12/23/2025 have been fully considered with regard to the 35 U.S.C. 102/103 rejections.
Applicant’s arguments with respect to claim(s) 1, 23, and 24 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUMAIRA ZAHIN MAUNI whose telephone number is (703)756-5654. The examiner can normally be reached Monday - Friday, 9 am - 5 pm (ET).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MATTHEW ELL, can be reached at (571) 270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/H.Z.M./Examiner, Art Unit 2141
/MATTHEW ELL/Supervisory Patent Examiner, Art Unit 2141