DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements submitted on June 13, 2023 and April 23, 2024 have been considered by the Examiner.
Claim Objections
Claims 10-12 and 19 are objected to because of the following informalities. Appropriate correction is required.
In claim 10, the phrase “the normalized, combined global and self-attention vectors” appears to contain a typographical error (e.g. a missing term after “and”).
Claims 11 and 12 depend from claim 10 and thereby include all of the limitations of claim 10. Accordingly, claims 11 and 12 are objected to for the same reasons as noted above for claim 10.
Further regarding claim 12, the phrase “the decoder being configured to decode receive the encoder representation” appears to contain a typographical error.
In claim 19, the phrase “the normalized, combined global and self-attention vectors” appears to contain a typographical error (e.g. a missing term after “and”).
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 1, there is no antecedent basis for “the global self-attention vector for each local input sequence” recited therein. The claim previously recites computing “a global self-attention vector for each of the tokenized embeddings in the local input sequence” but not a global self-attention vector for each local input sequence per se. Also in claim 1, there is no antecedent basis for “the local self-attention vector” recited therein. The claim previously recites computing “local self-attention for the local input sequence” but does not previously recite a local self-attention vector.
Claims 2-12 depend from claim 1 and thereby include all of the limitations of claim 1. Accordingly, claims 2-12 are also considered indefinite for the same reasons as noted above for claim 1.
Further regarding claim 7, there is no antecedent basis for “the local layer” recited therein, as it is unclear whether the recited “local layer” is intended to refer to the “local layer” recited in claim 6 or the “local layer” recited in claim 1, from both of which claim 7 depends.
Claims 8-12 depend from claim 7 and thus include all of the limitations of claim 7, and are therefore considered indefinite for the same reasons as noted above for claim 7.
Further regarding claim 8, there is no antecedent basis for “the global self-attention” recited therein. Claim 8 depends from claim 1, which recites a “global self-attention vector” and “global self-attention values” but not global self-attention per se.
Claims 9-12 depend from claim 8 and thus include all of the limitations of claim 8, and are therefore considered indefinite for the same reasons as noted above for claim 8.
Regarding claim 13, there is no antecedent basis for “the global self-attention vector for each local input sequence” recited therein. The claim previously recites computing “a global self-attention vector for each of the tokenized embeddings in the local input sequence” but not a global self-attention vector for each local input sequence per se. Also in claim 13, there is no antecedent basis for “the local self-attention vector” recited therein. The claim previously recites computing “local self-attention for the local input sequence” but does not previously recite a local self-attention vector.
Claims 14-19 depend from claim 13 and thereby include all of the limitations of claim 13. Accordingly, claims 14-19 are also considered indefinite for the same reasons as noted above for claim 13.
Further regarding claim 16, there is no antecedent basis for “the local layer” recited therein, as it is unclear whether the recited “local layer” is intended to refer to the “local layer” previously recited in claim 16 or to the “local layer” recited in claim 13, from which claim 16 depends.
Claims 17-19 depend from claim 16 and thus include all of the limitations of claim 16, and are therefore also considered indefinite for the same reasons as noted above for claim 16.
Further regarding claim 17, there is no antecedent basis for “the global self-attention” recited therein. Claim 17 depends from claim 13, which recites a “global self-attention vector” and “global self-attention values” but not global self-attention per se.
Claims 18 and 19 depend from claim 17 and thus include all of the limitations of claim 17, and are therefore also considered indefinite for the same reasons as noted above for claim 17.
Regarding claim 20, there is no antecedent basis for “the global self-attention vector for each local input sequence” recited therein. The claim previously recites computing “a global self-attention vector for each of the tokenized embeddings in the local input sequence” but not a global self-attention vector for each local input sequence per se. Also in claim 20, there is no antecedent basis for “the local self-attention vector” recited therein. The claim previously recites computing “local self-attention for the local input sequence” but does not previously recite a local self-attention vector.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-13 and 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter.
In particular, claims 1 and 20 are each directed to a “computing device” comprising a transformer including an encoder. Given its broadest reasonable interpretation, such a transformer can be considered software per se, and as such, the computing device would likewise be considered software per se (there is no recitation of any hardware in the computing device of claim 1 or 20, such as a processor or non-transitory computer readable storage). Under such an interpretation, the computing device of claims 1 and 20 would fail to fall within at least one of the four statutory categories of patent eligible subject matter.
Claims 2-12 depend from claim 1 and thereby include all of the limitations of claim 1. Claims 2-12 fail to fall within at least one of the four categories of patent eligible subject matter for the same reasons as described above for claim 1.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2 and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over the article entitled “Attention is All You Need” by Vaswani et al. (“Vaswani”) in view of the article entitled “Long Range Language Modeling via Gated State Spaces” by Mehta et al. (“Mehta”).
Regarding claim 1, Vaswani describes the transformer architecture (see e.g. the “Abstract,” which recites “[w]e propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”). Vaswani particularly teaches that the transformer includes an encoder having:
a first layer configured to (i) receive tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer, and (ii) compute a first self-attention vector for each of the tokenized embeddings in the input sequence (Vaswani discloses that the transformer architecture comprises an encoder that maps an input sequence to an output sequence, wherein the encoder is composed of a stack of identical layers that each comprise a multi-head self-attention sub-layer:
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]. Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive [9], consuming the previously generated symbols as additional input when generating the next.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
(Section 3 “Model Architecture.” Emphasis added.).
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
(Section 3.1 “Encoder and Decoder Stacks.” Emphasis added.).
Vaswani further teaches that the first encoder layer receives tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer:
Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [24]. In the embedding layers, we multiply those weights by √d_model.
(Section 3.4 “Embeddings and Softmax.” Emphasis added.).
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [8].
(Section 3.5 “Positional Encoding.” Emphasis added.).
[media_image1.png: Figure 1 of Vaswani, depicting the Transformer model architecture, reproduced here in greyscale.]
The multi-head attention sub-layer of the first encoder layer thus receives tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer, and outputs a vector for each of the tokenized embeddings. The output of the first encoder layer, or alternatively, the output of the multi-head attention sub-layer of the first encoder layer, is considered a first self-attention vector.); and
a second layer configured to (i) receive the first self-attention vector from the first layer and compute a second self-attention vector for the input sequence, and (ii) add and normalize the first self-attention vector with the second self-attention vector to produce an encoder representation including a self-attention vector for the input sequence that includes both first self-attention values and second self-attention values (As noted above, Vaswani discloses that the transformer architecture comprises an encoder that is composed of a stack of identical layers, wherein each encoder layer comprises a multi-head self-attention sub-layer. Vaswani suggests that each encoder layer receives the output of a previous encoder layer and produces a subsequent self-attention vector based thereon:
The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
(Section 3.2.3 “Applications of Attention in our Model.”).
A second layer of the encoder layers thus understandably receives the vector output by the first layer of the encoder, and its multi-head self-attention sub-layer produces a subsequent self-attention vector for the input sequence. As demonstrated in Figure 1, which is reproduced above, Vaswani discloses that the self-attention sub-layer of each encoder layer adds and normalizes the self-attention vector produced thereby with the input to the encoder layer:
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
(Section 3.1 “Encoder and Decoder Stacks.” Emphasis added.).
The second encoder layer thus understandably receives the self-attention vector output by the first encoder layer, and the multi-head self-attention sub-layer of the second encoder layer computes a subsequent self-attention vector for the input sequence and adds and normalizes the subsequent self-attention vector produced thereby with the input to the encoder layer, i.e. with the first self-attention vector produced by the first encoder layer.).
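For illustration only (this sketch is not part of any cited disclosure, and all module, parameter, and variable names are the Examiner's own hypothetical choices), the quoted LayerNorm(x + Sublayer(x)) behavior of stacked encoder layers may be sketched in PyTorch-style code roughly as follows:

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # One encoder layer per Vaswani Section 3.1: each sub-layer is wrapped
    # in a residual connection followed by layer normalization.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)       # LayerNorm(x + Sublayer(x))
        return self.norm2(x + self.ff(x))  # feed-forward sub-layer, add & norm

# The second layer attends over the first layer's output, i.e. over the
# "first self-attention vector" in the mapping above.
layer1, layer2 = EncoderLayer(), EncoderLayer()
embeddings = torch.randn(1, 16, 512)  # (batch, tokens, d_model) from an embedding layer
encoder_representation = layer2(layer1(embeddings))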
Vaswani further teaches that the transformer can be configured to output a prediction (e.g. a translation) for the input sequence according to a prediction task (e.g. English-to-German translation) based on the encoder representation of the input sequence:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
(Abstract. Emphasis added.).
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]. Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive [9], consuming the previously generated symbols as additional input when generating the next.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
(Section 3 “Model Architecture.” Emphasis added.).
We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [31]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
(Section 5.1 “Training Data and Batching.” Emphasis added.).
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
(Section 6.1 “Machine Translation.” Emphasis added.).
Moreover, Vaswani teaches that the transformer can be implemented on a computing machine (see e.g. section 5.2 “Hardware and Schedule”). Such a computing machine implementing a transformer as taught by Vaswani is considered a computing device similar to that of claim 1. However, Vaswani does not explicitly disclose that the first encoder layer is a global layer that computes a global self-attention vector for each tokenized embedding in each of a plurality of local input sequences of a global input sequence, as is required by claim 1. Vaswani also does not explicitly disclose that the second encoder layer is a local layer that receives the global self-attention vector for each local input sequence from the global layer, and computes a local self-attention vector for the local input sequence and adds and normalizes the global and local self-attention vectors to produce the encoder representation, as is further required by claim 1.
Mehta generally describes a Gated State Space (GSS) layer for modeling long range dependencies, which is “fairly competitive with…Transformer-based baselines….”:
State space models have shown to be effective at modeling long range dependencies, specially on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, Github source code and ArXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e. DSS) on TPUs, is fairly competitive with several well-tuned Transformer-based baselines and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.
(Abstract. Emphasis added).
In particular, Mehta describes a hybrid model that interleaves transformer layers with GSS layers, wherein the GSS layers model both short and longer range interactions, and the transformer layers allow for a richer modelling of short range interactions:
Going one step further, we also perform an apples-to-apples comparison with well-tuned and performant baselines reported in Block Recurrent Transformers [Hutchins et al., 2022], on several long range language modeling benchmarks over modalities such as English books, raw source code from Github and LaTeX source of ArXiv mathematics articles. As detailed in Table 2, while our GSS model currently lags behind on some tasks when compared in the fixed-parameter setting, it is fairly competitive in the fixed-compute setting where we measure compute as the exact amount of TPUv4 hours spent on a training run and serves as a fairly accurate proxy to the realistic cost of training that model. Furthermore, we also experimented with a hybrid model in which we sparingly interleave Transformer layers (having local attention) in a GSS stack to allow for a richer modeling of short range interactions. To our delight, this further improves performance at (roughly) no extra training cost, both in terms of parameters and compute.
(Section 1 “Introduction.” Emphasis added).
Conceptually, GSS looks fairly different from the current workhorse of machine learning; the Transformer architecture. Given this, it is not immediately clear if one is decidedly better than the other or if they each provide some orthogonal benefits. If its the latter, one might wonder if there are synergies between these architectures which can be exploited to create a hybrid model which is stronger than either one of them individually. To that end, we also consider a conceptually simple hybrid between GSS and Transformer where we sparingly interleave traditional Transformer blocks with GSS layers. Despite its glaring simplicity, as shown in Table 2, we observed that it shows consistent and significant improvements on all tasks.
Chunking long inputs In all our experiments we used sequence lengths large enough to be prohibitive for traditional Transformer layers. To get around this restriction at the Transformer layers used in our hybrid model, we chunk their inputs into non-overlapping chunks of length 512 and run the Transformer layer on each of them independently. While the GSS layers are apt at modeling both short and longer range interactions, the interleaved Transformer layers can potentially allow for a richer modeling of short range interactions.
(Section 3.3 “GSS-Transformer-Hybrid.” Emphasis added.).
As indicated above, the inputs to each transformer layer are segmented into non-overlapping chunks, and the transformer layer is applied to each chunk independently. Mehta further teaches that the first layer can particularly comprise a GSS layer and the second layer (and every 4th layer thereafter) a transformer layer:
GSS models GSS consists of 16 layers and an embedding dimension of 1024. We also consider a larger variant with 32 layers as denoted by GSS-L. For GSS-Hybrid model, we used vanilla Transformer blocks at every 4th layer starting with the 2nd layer. Since GSS layers are inherently position aware, using them for the 1st layer eschews any need of explicit position embeddings typically used with otherwise position invariant Transformer blocks. Thus, barring position aware nature of GSS layers, we don’t use any kind of explicit position embedding or bias in our models. For the Transformer blocks used in hybrid models, we use multi-head self-attention with 8 heads, each with size 128.
(Section 4.2 “Comparison with other baselines.” Emphasis added).
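For illustration only (this sketch is not part of any cited disclosure; the function and variable names are hypothetical), the chunking scheme quoted above, in which a transformer layer is applied independently to non-overlapping chunks of length 512, may be sketched as follows:

import torch
import torch.nn as nn

def chunked_local_attention(x, attn, chunk_len=512):
    # Apply an attention layer independently to non-overlapping chunks of a
    # long sequence, per Mehta Section 3.3 ("Chunking long inputs").
    batch, seq_len, d_model = x.shape
    assert seq_len % chunk_len == 0  # padding omitted for brevity
    # Fold each chunk into the batch dimension so attention never crosses chunks.
    chunks = x.reshape(batch * (seq_len // chunk_len), chunk_len, d_model)
    out, _ = attn(chunks, chunks, chunks)
    return out.reshape(batch, seq_len, d_model)

attn = nn.MultiheadAttention(1024, 8, batch_first=True)  # 8 heads, as in Mehta Section 4.2
y = chunked_local_attention(torch.randn(2, 2048, 1024), attn)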
Mehta discloses that state space models like the GSS layer are beneficial because they reduce the complexity on input sequence length:
Modeling long range dependencies on sequential data is a crucial step towards closing the gap with human-level performance on many tasks. Attention based models like Transformer [Vaswani et al., 2017] have proven to be a strong choice of backbone architecture for a considerable number of tasks across modalities and scale [Devlin et al., 2019, Brown et al., 2020, Dosovitskiy et al., 2021]. Vanilla Multi-Head-Attention famously incurs Ω(L²) penalty in modeling a sequence of length L. This is prohibitive at best for tasks where the model is required to capture long range dependencies from various parts of the input. Over the years, a variety of improvements have been proposed to alleviate this quadratic complexity (cf. [Tay et al., 2020]).
On a somewhat orthogonal direction, attention-free models based on state spaces, such as S4 [Gu et al., 2022a] and DSS [Gupta et al., 2022], have shown remarkable improvements on Long Range Arena (LRA) [Tay et al., 2021], a benchmark designed with long range modeling as its focus and consists of diverse tasks with 1k-16k sequence length across modalities. These models require careful initialization, originally borrowing ideas from the theory of HiPPO matrices [Voelker et al., 2019, Gu et al., 2020], to achieve good results on LRA.
In this work, we explore and extend the use of state space models by focusing solely on the task of autoregressive sequence modeling [Brown et al., 2020, Rae et al., 2021, Chowdhery et al., 2022, Zhang et al., 2022, Hoffmann et al., 2022, Srivastava et al., 2022]. Several key properties endowed by the state space model family makes it particularly attractive, to at least fully explore it, in the context of language modeling. First, it reduces the Ω(L²) complexity on input sequence length to O(L log L). This complexity results from the use of Fast Fourier Transform (FFT) [Cooley and Tukey, 1965] for performing convolutions. We will describe this in detail in later sections. Second, the state space model is fully parallelizable in the length dimension. This is an arguably subtle but an important property at training time. Note that transformers are also fully parallelizable, a worthy advantage over traditional RNNs for modeling sequences, which otherwise incurs only an O(L) penalty. While this parallelism is useful at training time, it may also be a curse at inference time where decoding every token requires attending to the whole past. The ideal model is parallelizable at training time but incurs a small constant cost (per decoded token) at inference time. This brings us to the final point. Due to the inherent convolution-recurrence equivalence of the state space model, it can be made to accumulate state and unroll like an RNN at inference time without any approximations.
(Section 1. “Introduction.”).
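For illustration only (not part of the cited disclosure; the function and variable names are hypothetical), the O(L log L) complexity quoted above arises because the state space layer's long convolution can be computed with the Fast Fourier Transform, e.g.:

import numpy as np

def ssm_long_convolution(u, k):
    # Causal convolution of an input sequence u (length L) with an SSM kernel k
    # via FFT: O(L log L), versus the Omega(L^2) cost of full self-attention.
    L = len(u)
    n = 2 * L  # zero-pad so the circular convolution equals linear (causal) convolution
    y = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(k, n), n)
    return y[:L]

y = ssm_long_convolution(np.random.randn(4096), np.random.randn(4096))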
Accordingly, it would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani and Mehta before the effective filing date of the claimed invention, to modify the transformer taught by Vaswani so as to include state space model layers (e.g. GSS layers) interleaved with the transformer encoder layers as taught by Mehta, e.g. in which a first layer is a state space layer that models short and long range interactions, and a second layer is a transformer encoder layer that acts on chunks of the input sequence and provides a richer modeling of short-range interactions. It would have been advantageous to one of ordinary skill to utilize such a combination because such state space layers reduce the complexity on input sequence length, as is taught by Mehta (see e.g. Section 1 “Introduction”).
In such a configuration comprising a state space layer followed by a transformer encoder layer: (i) the state space layer can be considered a global layer that receives tokenized embeddings for an input sequence that comprises a plurality of chunks, i.e. the state space layer receives, for each of a plurality of local input sequences in a global input sequence (i.e. for each of a plurality of chunks of an input sequence), tokenized embeddings for each of a plurality of tokens in the local input sequence (i.e. in the chunk) from an embedding layer; (ii) the output computed by the state space layer and provided to the subsequent transformer encoder layer can be considered a global self-attention vector for each of the tokenized embeddings in each local input sequence; (iii) the transformer encoder layer can be considered a local layer that receives the output (i.e. the global self-attention vector) for each local input sequence (i.e. chunk) from the state space layer; and (iv) the output from the multi-head self-attention sub-layer in the transformer encoder layer can be considered a local self-attention vector for the local input sequence (i.e. for the chunk).
As noted above, Vaswani discloses that the multi-head self-attention sub-layer of the encoder layer adds and normalizes the input to the sub-layer with the output of the multi-head self-attention sub-layer (see e.g. Section 3.1 “Encoder and Decoder Stacks”). It thus follows that, in the above-noted configuration having a state space layer followed by a transformer encoder layer, the transformer encoder layer (i.e. the local layer) would further add and normalize the input (i.e. the global self-attention vector) with the output (i.e. the local self-attention vector) of the multi-head self-attention sub-layer and thereby produce an encoder representation including a self-attention vector for each local input sequence that includes both global self-attention values and local self-attention values. As noted above, the transformer outputs a prediction for the input sequence (i.e. the global input sequence) according to the prediction task based on the encoder representation (i.e. based on the encoder representation of the local input sequences of the global input sequence). Accordingly, Vaswani and Mehta are considered to teach, to one of ordinary skill in the art, a computing device like that of claim 1. An illustrative sketch of this combined configuration is provided below.
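For illustration only (not part of either cited disclosure; all names are hypothetical, and nn.Identity merely stands in for a GSS/state space layer), the combined configuration articulated above may be sketched as:

import torch
import torch.nn as nn

d_model, chunk_len = 1024, 512
global_layer = nn.Identity()  # stand-in for a GSS/state space layer (Mehta)
local_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
norm = nn.LayerNorm(d_model)

tokens = torch.randn(1, 4 * chunk_len, d_model)  # embeddings of the global input sequence
g = global_layer(tokens)                         # mapped to the "global self-attention vector"
chunks = g.reshape(-1, chunk_len, d_model)       # local input sequences (non-overlapping chunks)
local, _ = local_attn(chunks, chunks, chunks)    # mapped to the "local self-attention vector"
encoder_repr = norm(chunks + local)              # add & normalize per Vaswani Section 3.1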
As per claim 2, it would have been obvious, as is described above, to modify the transformer taught by Vaswani so as to include state space model layers (i.e. global layers) interleaved with the transformer encoder layers (i.e. local layers) as taught by Mehta, and particularly in which a first layer is a state space model layer followed by a transformer encoder layer. Vaswani suggests that the first layer receives tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer (see e.g. Section 3.4 “Embeddings and Softmax,” and Figure 1). It thus follows that, in the configuration taught by Vaswani and Mehta, which comprises a state space model layer followed by a transformer encoder layer, the first layer (i.e. the state space model layer/global layer) would receive tokenized embeddings for each of the plurality of tokens in the local input sequence (i.e. in each chunk of the input sequence) from the embedding layer. Accordingly, the above-described combination of Vaswani and Mehta is further considered to teach a computing device like that of claim 2.
Regarding claim 13, Vaswani describes the transformer architecture (see e.g. the “Abstract,” which recites “[w]e propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”), and particularly teaches:
receiving, at a first layer of a transformer, tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer, and computing, at the first layer, a first self-attention vector for each of the tokenized embeddings in the input sequence (Vaswani discloses that the transformer architecture comprises an encoder that maps an input sequence to an output sequence, wherein the encoder is composed of a stack of identical layers that each comprise a multi-head self-attention sub-layer:
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]. Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive [9], consuming the previously generated symbols as additional input when generating the next.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
(Section 3 “Model Architecture.” Emphasis added.).
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
(Section 3.1 “Encoder and Decoder Stacks.” Emphasis added.).
Vaswani further teaches that the first encoder layer receives tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer:
Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [24]. In the embedding layers, we multiply those weights by √d_model.
(Section 3.4 “Embeddings and Softmax.” Emphasis added.).
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [8].
(Section 3.5 “Positional Encoding.” Emphasis added.).
[media_image1.png: Figure 1 of Vaswani, depicting the Transformer model architecture, reproduced here in greyscale.]
The multi-head attention sub-layer of the first encoder layer thus receives tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer, and outputs a vector for each of the tokenized embeddings. The output computed by the first encoder layer, or alternatively, the output computed by the multi-head attention sub-layer of the first encoder layer, is considered a first self-attention vector.);
receiving, at a second layer, the first self-attention vector from the first layer, and computing, at the second layer, a second self-attention vector for the input sequence (As noted above, Vaswani discloses that the transformer architecture comprises an encoder that is composed of a stack of identical layers, wherein each encoder layer comprises a multi-head self-attention sub-layer. Vaswani suggests that each encoder layer receives the output of a previous encoder layer and produces a subsequent self-attention vector based thereon:
The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
(Section 3.2.3 “Applications of Attention in our Model.”).
A second layer of the encoder layers thus understandably receives the vector output by the first layer of the encoder, and its multi-head self-attention sub-layer produces a subsequent self-attention vector for the input sequence.);
adding and normalizing the first self-attention vector with the second self-attention vector to produce an encoder representation including a self-attention vector for the input sequence that includes both first self-attention values and second self-attention values (As demonstrated in Figure 1, which is reproduced above, Vaswani discloses that the self-attention sub-layer of each encoder layer adds and normalizes the self-attention vector produced thereby with the input to the encoder layer:
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
(Section 3.1 “Encoder and Decoder Stacks.” Emphasis added.).
The second encoder layer thus understandably receives the self-attention vector output by the first encoder layer, and the multi-head self-attention sub-layer of the second encoder layer computes a subsequent self-attention vector for the input sequence and adds and normalizes the subsequent self-attention vector produced thereby with the input to the encoder layer, i.e. with the first self-attention vector produced by the first encoder layer.); and
outputting a prediction for the input sequence according to a prediction task based on the encoder representation of the input sequence (Vaswani teaches that the transformer can be configured to output a prediction, e.g. a translation, for the input sequence according to a prediction task, such as an English-to-German translation task, based on the encoder representation of the input sequence:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
(Abstract. Emphasis added.).
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]. Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive [9], consuming the previously generated symbols as additional input when generating the next.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
(Section 3 “Model Architecture.” Emphasis added.).
We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [31]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
(Section 5.1 “Training Data and Batching.” Emphasis added.).
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
(Section 6.1 “Machine Translation.” Emphasis added.).
Vaswani thus teaches a computerized method similar to that of claim 13. However, Vaswani does not explicitly disclose that the first layer is a global layer that computes a global self-attention vector for each tokenized embedding in each of a plurality of local input sequences of a global input sequence, as is required by claim 13. Vaswani also does not explicitly disclose that the second layer is a local layer that receives the global self-attention vector for each local input sequence from the global layer, and computes a local self-attention vector for the local input sequence, and whereby the global and local self-attention vectors are added and normalized to produce the encoder representation, as is further required by claim 13.
As noted above, Mehta generally describes a Gated State Space (GSS) layer for modeling long range dependencies, which is “fairly competitive with…Transformer-based baselines….”:
State space models have shown to be effective at modeling long range dependencies, specially on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, Github source code and ArXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e. DSS) on TPUs, is fairly competitive with several well-tuned Transformer-based baselines and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.
(Abstract. Emphasis added).
In particular, Mehta describes a hybrid model that interleaves transformer layers with GSS layers, wherein the GSS layers model both short and longer range interactions, and the transformer layers allow for a richer modelling of short range interactions:
Going one step further, we also perform an apples-to-apples comparison with well-tuned and performant baselines reported in Block Recurrent Transformers [Hutchins et al., 2022], on several long range language modeling benchmarks over modalities such as English books, raw source code from Github and LaTeX source of ArXiv mathematics articles. As detailed in Table 2, while our GSS model currently lags behind on some tasks when compared in the fixed-parameter setting, it is fairly competitive in the fixed-compute setting where we measure compute as the exact amount of TPUv4 hours spent on a training run and serves as a fairly accurate proxy to the realistic cost of training that model. Furthermore, we also experimented with a hybrid model in which we sparingly interleave Transformer layers (having local attention) in a GSS stack to allow for a richer modeling of short range interactions. To our delight, this further improves performance at (roughly) no extra training cost, both in terms of parameters and compute.
(Section 1 “Introduction.” Emphasis added).
Conceptually, GSS looks fairly different from the current workhorse of machine learning; the Transformer architecture. Given this, it is not immediately clear if one is decidedly better than the other or if they each provide some orthogonal benefits. If its the latter, one might wonder if there are synergies between these architectures which can be exploited to create a hybrid model which is stronger than either one of them individually. To that end, we also consider a conceptually simple hybrid between GSS and Transformer where we sparingly interleave traditional Transformer blocks with GSS layers. Despite its glaring simplicity, as shown in Table 2, we observed that it shows consistent and significant improvements on all tasks.
Chunking long inputs In all our experiments we used sequence lengths large enough to be prohibitive for traditional Transformer layers. To get around this restriction at the Transformer layers used in our hybrid model, we chunk their inputs into non-overlapping chunks of length 512 and run the Transformer layer on each of them independently. While the GSS layers are apt at modeling both short and longer range interactions, the interleaved Transformer layers can potentially allow for a richer modeling of short range interactions.
(Section 3.3 “GSS-Transformer-Hybrid.” Emphasis added.).
As indicated above, the inputs to each transformer layer are segmented into non-overlapping chunks, and the transformer layer is applied to each chunk independently. Mehta further teaches that the first layer can particularly comprise a GSS layer and the second layer (and every 4th layer thereafter) a transformer layer:
GSS models GSS consists of 16 layers and an embedding dimension of 1024. We also consider a larger variant with 32 layers as denoted by GSS-L. For GSS-Hybrid model, we used vanilla Transformer blocks at every 4th layer starting with the 2nd layer. Since GSS layers are inherently position aware, using them for the 1st layer eschews any need of explicit position embeddings typically used with otherwise position invariant Transformer blocks. Thus, barring position aware nature of GSS layers, we don’t use any kind of explicit position embedding or bias in our models. For the Transformer blocks used in hybrid models, we use multi-head self-attention with 8 heads, each with size 128.
(Section 4.2 “Comparison with other baselines.” Emphasis added).
Mehta discloses that state space models like the GSS layer are beneficial because they reduce the complexity on input sequence length:
Modeling long range dependencies on sequential data is a crucial step towards closing the gap with human-level performance on many tasks. Attention based models like Transformer [Vaswani et al., 2017] have proven to be a strong choice of backbone architecture for a considerable number of tasks across modalities and scale [Devlin et al., 2019, Brown et al., 2020, Dosovitskiy et al., 2021]. Vanilla Multi-Head-Attention famously incurs Ω(L²) penalty in modeling a sequence of length L. This is prohibitive at best for tasks where the model is required to capture long range dependencies from various parts of the input. Over the years, a variety of improvements have been proposed to alleviate this quadratic complexity (cf. [Tay et al., 2020]).
On a somewhat orthogonal direction, attention-free models based on state spaces, such as S4 [Gu et al., 2022a] and DSS [Gupta et al., 2022], have shown remarkable improvements on Long Range Arena (LRA) [Tay et al., 2021], a benchmark designed with long range modeling as its focus and consists of diverse tasks with 1k-16k sequence length across modalities. These models require careful initialization, originally borrowing ideas from the theory of HiPPO matrices [Voelker et al., 2019, Gu et al., 2020], to achieve good results on LRA.
In this work, we explore and extend the use of state space models by focusing solely on the task of autoregressive sequence modeling [Brown et al., 2020, Rae et al., 2021, Chowdhery et al., 2022, Zhang et al., 2022, Hoffmann et al., 2022, Srivastava et al., 2022]. Several key properties endowed by the state space model family makes it particularly attractive, to at least fully explore it, in the context of language modeling. First, it reduces the Ω(L²) complexity on input sequence length to O(L log L). This complexity results from the use of Fast Fourier Transform (FFT) [Cooley and Tukey, 1965] for performing convolutions. We will describe this in detail in later sections. Second, the state space model is fully parallelizable in the length dimension. This is an arguably subtle but an important property at training time. Note that transformers are also fully parallelizable, a worthy advantage over traditional RNNs for modeling sequences, which otherwise incurs only an O(L) penalty. While this parallelism is useful at training time, it may also be a curse at inference time where decoding every token requires attending to the whole past. The ideal model is parallelizable at training time but incurs a small constant cost (per decoded token) at inference time. This brings us to the final point. Due to the inherent convolution-recurrence equivalence of the state space model, it can be made to accumulate state and unroll like an RNN at inference time without any approximations.
(Section 1. “Introduction.”).
Accordingly, it would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani and Mehta before the effective filing date of the claimed invention, to modify the transformer taught by Vaswani so as to include state space model layers (e.g. GSS layers) interleaved with the transformer encoder layers as taught by Mehta, e.g. in which a first layer is a state space layer that models short and long range interactions, and a second layer is a transformer encoder layer that acts on chunks of the input sequence and provides a richer modeling of short-range interactions. It would have been advantageous to one of ordinary skill to utilize such a combination because such state space layers reduce the complexity on input sequence length, as is taught by Mehta (see e.g. Section 1 “Introduction”).
In such a configuration comprising a state space layer followed by a transformer encoder layer: (i) the state space layer can be considered a global layer that receives tokenized embeddings for an input sequence that comprises a plurality of chunks, i.e. the state space layer receives, for each of a plurality of local input sequences in a global input sequence (i.e. for each of a plurality of chunks of an input sequence), tokenized embeddings for each of a plurality of tokens in the local input sequence (i.e. in the chunk) from an embedding layer; (ii) the output computed by the state space layer and provided to the subsequent transformer encoder layer can be considered a global self-attention vector for each of the tokenized embeddings in each local input sequence; (iii) the following transformer encoder layer can be considered a local layer that receives the output (i.e. the global self-attention vector) for each local input sequence (i.e. chunk) from the state space layer; and (iv) the output from the multi-head self-attention sub-layer in the transformer encoder layer can be considered a local self-attention vector for the local input sequence (i.e. for the chunk).
As noted above, Vaswani discloses that the multi-head self-attention sub-layer of the encoder layer adds and normalizes the input to the sub-layer with the output of the multi-head self-attention sub-layer (see e.g. Section 3.1 “Encoder and Decoder Stacks”). It thus follows that, in the above-noted configuration having a state space layer followed by a transformer encoder layer, the transformer encoder layer (i.e. the local layer) would further add and normalize the input (i.e. the global self-attention vector) with the output (i.e. the local self-attention vector) of the multi-head self-attention sub-layer and thereby produce an encoder representation including a self-attention vector for each local input sequence that includes both global self-attention values and local self-attention values. As noted above, the transformer outputs a prediction for the input sequence (i.e. the global input sequence) according to the prediction task based on the encoder representation (i.e. based on the encoder representation of the local input sequences of the global input sequence). Accordingly, Vaswani and Mehta are considered to teach, to one of ordinary skill in the art, a computerized method like that of claim 13.
As per claim 14, it would have been obvious, as is described above, to modify the transformer taught by Vaswani so as to include state space model layers (i.e. global layers) interleaved with the transformer encoder layers (i.e. local layers) as taught by Mehta, and particularly in which a first layer is a state space model layer followed by a transformer encoder layer. Vaswani suggests that the first layer receives tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer (see e.g. Section 3.4 “Embeddings and Softmax,” and Figure 1). It thus follows that, in the configuration taught by Vaswani and Mehta, which comprises a state space model layer followed by a transformer encoder layer, the first layer (i.e. the state space model layer/global layer) would receive tokenized embeddings for each of the plurality of tokens in the local input sequence (i.e. in each chunk of the input sequence) from the embedding layer. Accordingly, the above-described combination of Vaswani and Mehta is further considered to teach a computerized method like that of claim 14.
Claims 3-5, 15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Vaswani and Mehta, which is described above, and further in view of the article entitled “Efficiently Modeling Long Sequences with Structured State Spaces” by Gu et al. (“Gu”).
Regarding claim 3, Vaswani and Mehta teach a computing device like that of claim 2, as is described above, which comprises a global layer that includes a state space model layer. Particularly, as is described above, it would have been obvious to modify the transformer taught by Vaswani so as to include state space model layers (i.e. global layers) interleaved with the transformer encoder layers (i.e. local layers) as taught by Mehta. The particular state space model layer described by Mehta (i.e. GSS) is suggested as an alternative to other types of state space model layers, such as S4 and DSS:
Despite these attractive properties, we found that current state space models (such as S4, DSS) run slower than we expected at training time on TPUs, our accelerator of choice. We take this opportunity to modify the architecture to reduce dimensionality of specific operations which we found to be bottlenecks. Our proposed changes borrow from a well-supported empirical observation around the effectiveness of gating units [Shazeer, 2020]. Specifically, Hua et al. [2022] observed that replacing the typical Feed-Forward layer in the Transformer with gating units allows for a reduced dimensionality when mixing tokens along the length dimension using self-attention. We extend the use of gating units to state space model family and observe that, even in our context, the use of gating units allows for a reduction in dimensionality when performing FFT operations, which we observed to be the main bottleneck behind slow training. Furthermore, somewhat contrary to observations made by S4 and DSS authors, we found the performance of the model on language modeling tasks to be much less sensitive to initialization. We found that only the scale and structural aspects of initialization of state space variables were important and not the exact values. We were able to successfully train the model while initializing the state space variables randomly. This departs significantly, at least in understanding, from the reliance of the design on the theory of HiPPO matrices, which led the S4 model to employ several numerical linear algebra tricks to able to make it work. Combining both of these contributions, we propose a layer named Gated State Space (GSS) (Figure 1), which we empirically verified to be 2-3× faster than DSS while keeping the perplexity on several language modeling benchmarks (Table 1).
(Section 1 “Introduction.” Emphasis added.)
Mehta thus teaches, albeit in a nonpreferred embodiment, using an S4 model or a DSS model as the state space model layer. Vaswani and Mehta, however, do not explicitly teach that the state space model layer includes a “discrete time structured state space sequence model parameterized by normal plus low rank matrices” as is recited in claim 3.
Gu nevertheless suggests that the S4 model is a discrete time structured state space sequence model (i.e. one that defines a sequence-to-sequence map) that is parameterized by normal plus low rank matrices (see e.g. section 2.3 “Discrete-time SSM: the Recurrent Representation”, section 3.2 “The S4 Parameterization: Normal Plus Low-Rank” and section 3.4 “Architecture Details of the Deep S4 Layer”).
Accordingly, it would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Gu before the effective filing date of the claimed invention, to modify the state space model layer (i.e. global layer) taught by Vaswani and Mehta so as to comprise an S4 model like particularly taught by Gu, which is a discrete time structured state space sequence model parameterized by normal plus low rank matrices. It would have been advantageous to one of ordinary skill to utilize such an S4 model because it requires less computation and memory usage while still advancing the state-of-the-art for handling data that contains long range dependencies, as is taught by Gu (see e.g. section 1 “Introduction.”). Accordingly, Vaswani, Mehta and Gu are considered to teach, to one of ordinary skill in the art, a computing device like that of claim 3.
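For clarity, the discrete time state space model and the normal plus low rank (“NPLR”) parameterization cited above may be summarized as follows (a condensed restatement, in Gu’s notation, of the equations in the cited sections of Gu; the full algorithmic details are set forth in Gu):
```latex
\begin{aligned}
&x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)
  && \text{(continuous-time SSM)}\\
&x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = \bar{C}\,x_k
  && \text{(discrete-time SSM, Gu \S 2.3)}\\
&\bar{A} = \bigl(I - \tfrac{\Delta}{2}A\bigr)^{-1}\bigl(I + \tfrac{\Delta}{2}A\bigr), \qquad
 \bar{B} = \bigl(I - \tfrac{\Delta}{2}A\bigr)^{-1}\Delta B
  && \text{(bilinear discretization)}\\
&A = V\,\Lambda\,V^{*} - P\,Q^{\top}
  && \text{(normal plus low rank, Gu \S 3.2)}
\end{aligned}
```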
As per claim 4, it would have been obvious, as is described above, to modify the state space model layer (i.e. global layer) taught by Vaswani and Mehta so as to comprise an S4 model like particularly taught by Gu. Accordingly, the above-described combination of Vaswani, Mehta and Gu is further considered to teach a computing device like that of claim 4.
Regarding claim 5, Vaswani and Mehta teach a computing device like that of claim 2, as is described above, and which comprises a global layer that includes a state space model layer for computing a global self-attention vector from a global input sequence. As also described above (see the rejection for claim 3), Mehta teaches a nonpreferred embodiment in which the state space model layer particularly includes an S4 model. Vaswani and Mehta, however, do not explicitly disclose that computation of the global self-attention using the global layer including the state space model layer is accomplished with linear computational complexity and linear memory complexity relative to the global input sequence, as is required by claim 5.
Gu nevertheless teaches that the S4 model has a linear computational complexity and a linear memory complexity relative to the length of the input sequence (see e.g. section 1 “Introduction”).
Accordingly, it would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Gu before the effective filing date of the claimed invention, to modify the state space model layer (i.e. in the global layer) taught by Vaswani and Mehta so as to comprise an S4 model like particularly taught by Gu, which has a linear computational complexity and a linear memory complexity relative to the length of the input sequence. It thus follows that computation of the global self-attention using such an S4 state space model layer in the global layer would be accomplished with linear computational complexity and linear memory complexity relative to the global input sequence. Like noted above, it would have been advantageous to one of ordinary skill to utilize such an S4 model because it requires less computation and memory usage while still advancing the state-of-the-art for handling data that contains long range dependencies, as is taught by Gu (see e.g. section 1 “Introduction.”). Accordingly, Vaswani, Mehta and Gu are considered to teach, to one of ordinary skill in the art, a computing device like that of claim 5.
Regarding claim 15, Vaswani and Mehta teach a computerized method like that of claim 14, as is described above, and which entails utilizing a global layer that includes a state space model layer. Particularly, as is described above, it would have been obvious to modify the transformer taught by Vaswani so as to include state space model layers (i.e. global layers) interleaved with the transformer encoder layers (i.e. local layers) like taught by Mehta. The particular state space model layer described by Mehta (i.e. GSS) is suggested as an alternative to other types of state space model layers, such as S4 and DSS:
Despite these attractive properties, we found that current state space models (such as S4, DSS) run slower than we expected at training time on TPUs, our accelerator of choice. We take this opportunity to modify the architecture to reduce dimensionality of specific operations which we found to be bottlenecks. Our proposed changes borrow from a well-supported empirical observation around the effectiveness of gating units [Shazeer, 2020]. Specifically, Hua et al. [2022] observed that replacing the typical Feed-Forward layer in the Transformer with gating units allows for a reduced dimensionality when mixing tokens along the length dimension using self-attention. We extend the use of gating units to state space model family and observe that, even in our context, the use of gating units allows for a reduction in dimensionality when performing FFT operations, which we observed to be the main bottleneck behind slow training. Furthermore, somewhat contrary to observations made by S4 and DSS authors, we found the performance of the model on language modeling tasks to be much less sensitive to initialization. We found that only the scale and structural aspects of initialization of state space variables were important and not the exact values. We were able to successfully train the model while initializing the state space variables randomly. This departs significantly, at least in understanding, from the reliance of the design on the theory of HiPPO matrices, which led the S4 model to employ several numerical linear algebra tricks to able to make it work. Combining both of these contributions, we propose a layer named Gated State Space (GSS) (Figure 1), which we empirically verified to be 2-3× faster than DSS while keeping the perplexity on several language modeling benchmarks (Table 1).
(Section 1 “Introduction.” Emphasis added.)
Mehta thus teaches, albeit in a nonpreferred embodiment, using an S4 model or a DSS model as the state space model layer. Vaswani and Mehta, however, do not explicitly teach that the state space model layer includes a “discrete time structured state space sequence model parameterized by normal plus low rank matrices” as is recited in claim 15.
Gu nevertheless suggests that the S4 model is a discrete time structured state space sequence model (i.e. one that defines a sequence-to-sequence map) that is parameterized by normal plus low rank matrices (see e.g. section 2.3 “Discrete-time SSM: the Recurrent Representation”, section 3.2 “The S4 Parameterization: Normal Plus Low-Rank” and section 3.4 “Architecture Details of the Deep S4 Layer”).
Accordingly, it would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Gu before the effective filing date of the claimed invention, to modify the state space model layer (i.e. global layer) taught by Vaswani and Mehta so as to comprise an S4 model like particularly taught by Gu, which is a discrete time structured state space sequence model parameterized by normal plus low rank matrices. It would have been advantageous to one of ordinary skill to utilize such an S4 model because it requires less computation and memory usage while still advancing the state-of-the-art for handling data that contains long range dependencies, as is taught by Gu (see e.g. section 1 “Introduction.”). Accordingly, Vaswani, Mehta and Gu are considered to teach, to one of ordinary skill in the art, a computerized method like that of claim 15.
Regarding claim 20, Vaswani describes the transformer architecture (see e.g. the “Abstract,” which recites “[w]e propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”). Like described above, Vaswani particularly teaches that the transformer includes an encoder having:
a first layer configured to (i) receive tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer, and (ii) compute a first self-attention vector for each of the tokenized embeddings in the input sequence (Vaswani discloses that the transformer architecture comprises an encoder that maps an input sequence to an output sequence, wherein the encoder is composed of a stack of identical layers that each comprise a multi-head self-attention sub-layer:
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]. Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive [9], consuming the previously generated symbols as additional input when generating the next.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
(Section 3 “Model Architecture.” Emphasis added.).
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
(Section 3.1 “Encoder and Decoder Stacks.” Emphasis added.).
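For illustration only, the sub-layer computation quoted above, LayerNorm(x + Sublayer(x)), may be sketched as follows (the class and variable names are the Examiner’s own illustrative constructs, not code from Vaswani):
```python
import torch
import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Computes LayerNorm(x + Sublayer(x)) per Vaswani, Section 3.1."""
    def __init__(self, sublayer: nn.Module, d_model: int = 512):
        super().__init__()
        self.sublayer = sublayer           # e.g. self-attention or feed-forward
        self.norm = nn.LayerNorm(d_model)  # layer normalization [1]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around the sub-layer, then layer normalization.
        return self.norm(x + self.sublayer(x))
```
Per the quoted passage, both sub-layers of each encoder layer (the multi-head self-attention mechanism and the position-wise feed-forward network) are wrapped in this manner.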
Vaswani further teaches that the first encoder layer receives tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer:
Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [24]. In the embedding layers, we multiply those weights by √d_model.
(Section 3.4 “Embeddings and Softmax.” Emphasis added.).
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [8].
(Section 3.5 “Positional Encoding.” Emphasis added.).
[Figure 1 of Vaswani, “The Transformer - model architecture,” reproduced here in greyscale.]
The multi-head attention sub-layer of the first encoder layer thus receives tokenized embeddings for each of a plurality of tokens in an input sequence from an embedding layer, and outputs a vector for each of the tokenized embeddings. The output of the first encoder layer, or alternatively, the output of the multi-head attention sub-layer of the first encoder layer, is considered a first self-attention vector.); and
a second layer configured to (i) receive the first self-attention vector from the first layer and compute a second self-attention vector for the input sequence, and (ii) add and normalize the first self-attention vector with the second self-attention vector to produce an encoder representation including a self-attention vector for the input sequence that includes both first self-attention values and second self-attention values (As noted above, Vaswani discloses that the transformer architecture comprises an encoder that is composed of a stack of identical layers, wherein each encoder layer comprises a multi-head self-attention sub-layer. Vaswani suggests that each encoder layer receives the output of a previous encoder layer and produces a subsequent self-attention vector based thereon:
The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
(Section 3.2.3 “Applications of Attention in our Model.”).
A second layer of the encoder layers thus understandably receives the vector output by the first layer of the encoder, and its multi-head self-attention sub-layer produces a subsequent self-attention vector for the input sequence. Like demonstrated in Figure 1, which is reproduced above, Vaswani discloses that the self-attention sub-layer of each encoder layer adds and normalizes the self-attention vector produced thereby with the input to the encoder layer:
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
(Section 3.1 “Encoder and Decoder Stacks.” Emphasis added.).
The second encoder layer thus understandably receives the self-attention vector output by the first encoder layer, and the multi-head self-attention sub-layer of the second encoder layer computes a subsequent self-attention vector for the input sequence and adds and normalizes the subsequent self-attention vector produced thereby with the input to the encoder layer, i.e. with the first self-attention vector produced by the first encoder layer.).
Vaswani further teaches that the transformer can be configured to output a prediction (e.g. a translation) for the input sequence according to a prediction task (e.g. English-to-German translation) based on the encoder representation of the input sequence:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
(Abstract. Emphasis added.).
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]. Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive [9], consuming the previously generated symbols as additional input when generating the next.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
(Section 3 “Model Architecture.” Emphasis added.).
We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [31]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
(Section 5.1 “Training Data and Batching.” Emphasis added.).
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
(Section 6.1 “Machine Translation.” Emphasis added.).
Moreover, Vaswani teaches that the transformer can be implemented on a computing machine (see e.g. section 5.2 “Hardware and Schedule”). Such a computing machine implementing a transformer like taught by Vaswani is considered a computing device similar to that of claim 20. However, Vaswani does not explicitly disclose that the first encoder layer is a global layer that computes a global self-attention vector for each tokenized embedding in each of a plurality of local input sequences of a global input sequence, as is required by claim 20. Vaswani also does not explicitly disclose that the second encoder layer is a local layer that receives the global self-attention vector for each local input sequence from the global layer, and computes a local self-attention vector for the local input sequence and adds and normalizes the global and local self-attention vectors to produce the encoder representation, as is further required by claim 20. Moreover, Vaswani does not teach that: (i) the global layer includes a state space model layer configured to receive the tokenized embeddings for each of the plurality of tokens in the local input sequence; (ii) the state space model layer includes a discrete time structured state space sequence model parameterized by normal plus low rank matrices; and (iii) computation of the global self-attention using the global layer including the state space model layer is accomplished with linear computational complexity and linear memory complexity relative to the global input sequence, as is further required by claim 20.
Like described above, Mehta generally describes a Gated State Space (GSS) layer for modeling long range dependencies, and which is “fairly competitive with…Transformer-based baselines….” (Abstract). In particular, like further described above, Mehta describes a hybrid model that interleaves transformer layers with GSS layers, wherein the GSS layers model both short and longer range interactions, and the transformer layers allow for a richer modelling of short range interactions:
Going one step further, we also perform an apples-to-apples comparison with well-tuned and performant baselines reported in Block Recurrent Transformers [Hutchins et al., 2022], on several long range language modeling benchmarks over modalities such as English books, raw source code from Github and LaTeX source of ArXiv mathematics articles. As detailed in Table 2, while our GSS model currently lags behind on some tasks when compared in the fixed-parameter setting, it is fairly competitive in the fixed-compute setting where we measure compute as the exact amount of TPUv4 hours spent on a training run and serves as a fairly accurate proxy to the realistic cost of training that model. Furthermore, we also experimented with a hybrid model in which we sparingly interleave Transformer layers (having local attention) in a GSS stack to allow for a richer modeling of short range interactions. To our delight, this further improves performance at (roughly) no extra training cost, both in terms of parameters and compute.
(Section 1 “Introduction.” Emphasis added).
Conceptually, GSS looks fairly different from the current workhorse of machine learning; the Transformer architecture. Given this, it is not immediately clear if one is decidedly better than the other or if they each provide some orthogonal benefits. If its the latter, one might wonder if there are synergies between these architectures which can be exploited to create a hybrid model which is stronger than either one of them individually. To that end, we also consider a conceptually simple hybrid between GSS and Transformer where we sparingly interleave traditional Transformer blocks with GSS layers. Despite its glaring simplicity, as shown in Table 2, we observed that it shows consistent and significant improvements on all tasks.
Chunking long inputs In all our experiments we used sequence lengths large enough to be prohibitive for traditional Transformer layers. To get around this restriction at the Transformer layers used in our hybrid model, we chunk their inputs into non-overlapping chunks of length 512 and run the Transformer layer on each of them independently. While the GSS layers are apt at modeling both short and longer range interactions, the interleaved Transformer layers can potentially allow for a richer modeling of short range interactions.
(Section 3.3 “GSS-Transformer-Hybrid.” Emphasis added.).
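For illustration only, the chunked application of a Transformer layer described in the passage above may be sketched as follows (the function and parameter names are the Examiner’s own illustrative constructs, not code from Mehta):
```python
import torch

def apply_transformer_in_chunks(transformer_layer, x: torch.Tensor,
                                chunk_len: int = 512) -> torch.Tensor:
    """Split a long input x of shape (L, d) into non-overlapping chunks of
    length chunk_len and run the transformer layer on each chunk
    independently, per Mehta, Section 3.3."""
    L, d = x.shape
    pad = (-L) % chunk_len                      # zero-pad so L divides evenly
    if pad:
        x = torch.cat([x, x.new_zeros(pad, d)], dim=0)
    chunks = x.view(-1, chunk_len, d)           # (num_chunks, chunk_len, d)
    out = torch.stack([transformer_layer(c) for c in chunks])
    return out.reshape(-1, d)[:L]               # restore the original length
```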
As indicated above, the inputs to each transformer layer are segmented into non-overlapping chunks, and the transformer layer is applied to each chunk independently. Mehta further teaches that the first layer can particularly comprise a GSS layer and the second layer (and every 4th layer thereafter) a transformer layer:
GSS models GSS consists of 16 layers and an embedding dimension of 1024. We also consider a larger variant with 32 layers as denoted by GSS-L. For GSS-Hybrid model, we used vanilla Transformer blocks at every 4th layer starting with the 2nd layer. Since GSS layers are inherently position aware, using them for the 1st layer eschews any need of explicit position embeddings typically used with otherwise position invariant Transformer blocks. Thus, barring position aware nature of GSS layers, we don’t use any kind of explicit position embedding or bias in our models. For the Transformer blocks used in hybrid models, we use multi-head self-attention with 8 heads, each with size 128.
(Section 4.2 “Comparison with other baselines.” Emphasis added).
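For illustration only, the layer layout quoted above (GSS layers throughout, with a vanilla Transformer block at every 4th layer starting with the 2nd) may be sketched as follows (the placeholder classes are the Examiner’s own illustrative constructs, not code from Mehta):
```python
class GSSLayer:           # placeholder for Mehta's Gated State Space layer
    def __init__(self, d_model: int):
        self.d_model = d_model

class TransformerBlock:   # placeholder for a vanilla Transformer block
    def __init__(self, num_heads: int, head_dim: int):
        self.num_heads, self.head_dim = num_heads, head_dim

def build_gss_hybrid(num_layers: int = 16):
    """Transformer blocks at layers 2, 6, 10, 14 (every 4th layer starting
    with the 2nd); position-aware GSS layers everywhere else, so the 1st
    layer is always a GSS layer."""
    layers = []
    for i in range(1, num_layers + 1):          # 1-indexed layer positions
        if i >= 2 and (i - 2) % 4 == 0:
            layers.append(TransformerBlock(num_heads=8, head_dim=128))
        else:
            layers.append(GSSLayer(d_model=1024))
    return layers
```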
Mehta discloses that state space models like the GSS layer are beneficial because they reduce the complexity on input sequence length:
Modeling long range dependencies on sequential data is a crucial step towards closing the gap with human-level performance on many tasks. Attention based models like Transformer [Vaswani et al., 2017] have proven to be a strong choice of backbone architecture for a considerable number of tasks across modalities and scale [Devlin et al., 2019, Brown et al., 2020, Dosovitskiy et al., 2021]. Vanilla Multi-Head-Attention famously incurs Ω(L²) penalty in modeling a sequence of length L. This is prohibitive at best for tasks where the model is required to capture long range dependencies from various parts of the input. Over the years, a variety of improvements have been proposed to alleviate this quadratic complexity (cf. [Tay et al., 2020]).
On a somewhat orthogonal direction, attention-free models based on state spaces, such as S4 [Gu et al., 2022a] and DSS [Gupta et al., 2022], have shown remarkable improvements on Long Range Arena (LRA) [Tay et al., 2021], a benchmark designed with long range modeling as its focus and consists of diverse tasks with 1k-16k sequence length across modalities. These models require careful initialization, originally borrowing ideas from the theory of HiPPO matrices [Voelker et al., 2019, Gu et al., 2020], to achieve good results on LRA.
In this work, we explore and extend the use of state space models by focusing solely on the task of autoregressive sequence modeling [Brown et al., 2020, Rae et al., 2021, Chowdhery et al., 2022, Zhang et al., 2022, Hoffmann et al., 2022, Srivastava et al., 2022]. Several key properties endowed by the state space model family makes it particularly attractive, to at least fully explore it, in the context of language modeling. First, it reduces the Ω(L²) complexity on input sequence length to O(L log L). This complexity results from the use of Fast Fourier Transform (FFT) [Cooley and Tukey, 1965] for performing convolutions. We will describe this in detail in later sections. Second, the state space model is fully parallelizable in the length dimension. This is an arguably subtle but an important property at training time. Note that transformers are also fully parallelizable, a worthy advantage over traditional RNNs for modeling sequences, which otherwise incurs only an O(L) penalty. While this parallelism is useful at training time, it may also be a curse at inference time where decoding every token requires attending to the whole past. The ideal model is parallelizable at training time but incurs a small constant cost (per decoded token) at inference time. This brings us to the final point. Due to the inherent convolution-recurrence equivalence of the state space model, it can be made to accumulate state and unroll like an RNN at inference time without any approximations.
(Section 1. “Introduction.”).
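For illustration only, the O(L log L) FFT-based convolution that Mehta refers to in the passage above may be sketched as follows (a generic sketch of FFT convolution prepared by the Examiner; not code from Mehta):
```python
import numpy as np

def ssm_convolution(u: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Convolve an input u of length L with a state space model kernel k of
    length L via the FFT, at O(L log L) cost rather than the O(L^2) cost of
    direct convolution."""
    L = len(u)
    n = 2 * L                                # zero-pad to avoid circular wrap
    U = np.fft.rfft(u, n=n)
    K = np.fft.rfft(k, n=n)
    return np.fft.irfft(U * K, n=n)[:L]      # keep the causal prefix
```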
Accordingly, like described above, it would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani and Mehta before the effective filing date of the claimed invention, to modify the transformer taught by Vaswani so as to include state space model layers (e.g. GSS layers) interleaved with the transformer encoder layers like taught by Mehta, e.g. in which a first layer is a state space layer that models short and long range interactions, and a second layer is a transformer encoder layer that acts on chunks of the input sequence and provides a richer modelling of short-range interactions. It would have been advantageous to one of ordinary skill to utilize such a combination because such state space layers reduce the complexity on input sequence length, as is taught by Mehta (see e.g. Section 1 “Introduction”). In such a configuration comprising a state space layer followed by a transformer encoder layer: (i) the state space layer can be considered a global layer that receives tokenized embeddings for an input sequence that comprises a plurality of chunks, i.e. the state space layer receives, for each of a plurality of local input sequences in a global input sequence (i.e. for each of a plurality of chunks of an input sequence), tokenized embeddings for each of a plurality of tokens in the local input sequence (i.e. in the chunk) from an embedding layer; (ii) the output computed by the state space layer and provided to the subsequent transformer encoder layer can be considered a global self-attention vector for each of the tokenized embeddings in each local input sequence; (iii) the transformer encoder layer can be considered a local layer that receives the output (i.e. the global self-attention vector) for each local input sequence (i.e. chunk) from the state space layer; and (iv) the output from the multi-head self-attention sub-layer in the transformer encoder layer can be considered a local self-attention vector for the local input sequence (i.e. for the chunk). Like noted above, Vaswani discloses that the multi-head self-attention sub-layer of the encoder layer adds and normalizes the input to the sub-layer with the output of the multi-head self-attention sub-layer (see e.g. Section 3.1 “Encoder and Decoder Stacks”). It thus follows that, in the above-noted configuration having a state space layer followed by a transformer encoder layer, the transformer encoder layer (i.e. the local layer) would further add and normalize the input (i.e. the global self-attention vector) with the output (i.e. the local self-attention vector) of the multi-head self-attention sub-layer and thereby produce an encoder representation including a self-attention vector for each local input sequence that includes both global self-attention values and local self-attention values. Like noted above, the transformer outputs a prediction for the input sequence (i.e. the global input sequence) according to the prediction task based on the encoder representation (i.e. based on the encoder representation of the local input sequences of the global input sequence). Accordingly, the combination of Vaswani and Mehta is considered to teach a computing device similar to that of claim 20, which includes a state space model layer as a global layer that is configured to receive tokenized embeddings for each of a plurality of tokens in a local input sequence. The particular state space model layer described by Mehta (i.e. GSS) is suggested as an alternative to other types of state space model layers, such as S4 and DSS:
Despite these attractive properties, we found that current state space models (such as S4, DSS) run slower than we expected at training time on TPUs, our accelerator of choice. We take this opportunity to modify the architecture to reduce dimensionality of specific operations which we found to be bottlenecks. Our proposed changes borrow from a well-supported empirical observation around the effectiveness of gating units [Shazeer, 2020]. Specifically, Hua et al. [2022] observed that replacing the typical Feed-Forward layer in the Transformer with gating units allows for a reduced dimensionality when mixing tokens along the length dimension using self-attention. We extend the use of gating units to state space model family and observe that, even in our context, the use of gating units allows for a reduction in dimensionality when performing FFT operations, which we observed to be the main bottleneck behind slow training. Furthermore, somewhat contrary to observations made by S4 and DSS authors, we found the performance of the model on language modeling tasks to be much less sensitive to initialization. We found that only the scale and structural aspects of initialization of state space variables were important and not the exact values. We were able to successfully train the model while initializing the state space variables randomly. This departs significantly, at least in understanding, from the reliance of the design on the theory of HiPPO matrices, which led the S4 model to employ several numerical linear algebra tricks to able to make it work. Combining both of these contributions, we propose a layer named Gated State Space (GSS) (Figure 1), which we empirically verified to be 2-3× faster than DSS while keeping the perplexity on several language modeling benchmarks (Table 1).
(Section 1 “Introduction.” Emphasis added.)
Mehta thus teaches, albeit in a nonpreferred embodiment, using an S4 model or a DSS model as the state space model layer. Vaswani and Mehta, however, do not explicitly teach that the state space model layer includes a discrete time structured state space sequence model parameterized by normal plus low rank matrices, wherein computation of the global self-attention using the global layer including the state space model layer is accomplished with linear computational complexity and linear memory complexity relative to the global input sequence, as is further required by claim 20.
Gu nevertheless suggests that the S4 model is a discrete time structured state space sequence model (i.e. one that defines a sequence-to-sequence map) that is parameterized by normal plus low rank matrices (see e.g. section 2.3 “Discrete-time SSM: the Recurrent Representation”, section 3.2 “The S4 Parameterization: Normal Plus Low-Rank” and section 3.4 “Architecture Details of the Deep S4 Layer”). Gu further teaches that the S4 model has a linear computational complexity and a linear memory complexity relative to the length of the input sequence (see e.g. section 1 “Introduction”).
Accordingly, it would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Gu before the effective filing date of the claimed invention, to modify the state space model layer (i.e. global layer) taught by Vaswani and Mehta so as to comprise an S4 model like particularly taught by Gu, which is a discrete time structured state space sequence model parameterized by normal plus low rank matrices, and which has a linear computational complexity and a linear memory complexity relative to the length of the input sequence. It thus follows that computation of the global self-attention using such an S4 state space model layer in the global layer would be accomplished with linear computational complexity and linear memory complexity relative to the global input sequence. It would have been advantageous to one of ordinary skill to utilize such an S4 model because it requires less computation and memory usage while still advancing the state-of-the-art for handling data that contains long range dependencies, as is taught by Gu (see e.g. section 1 “Introduction.”). Accordingly, Vaswani, Mehta and Gu are considered to teach, to one of ordinary skill in the art, a computing device like that of claim 20.
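For illustration only, the overall configuration mapped onto claim 20 above (a global state space model layer over the full sequence, followed by a local per-chunk attention layer whose output is added to its input and normalized) may be sketched as follows (the module and names are the Examiner’s own illustrative constructs, not code from any cited reference):
```python
import torch
import torch.nn as nn

class GlobalThenLocalEncoder(nn.Module):
    """Sketch: a state space (global) layer over the full input sequence,
    then a chunked self-attention (local) layer, with add-and-normalize
    producing the encoder representation."""
    def __init__(self, ssm_layer: nn.Module, attn_layer: nn.Module,
                 d_model: int, chunk_len: int = 512):
        super().__init__()
        self.ssm = ssm_layer               # global layer (e.g. an S4 model)
        self.attn = attn_layer             # local layer (per-chunk attention)
        self.norm = nn.LayerNorm(d_model)
        self.chunk_len = chunk_len

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        g = self.ssm(emb)                  # global self-attention vectors
        local = torch.cat([self.attn(c)    # local self-attention per chunk
                           for c in g.split(self.chunk_len, dim=0)], dim=0)
        return self.norm(g + local)        # add & normalize -> encoder repr.
```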
Claims 6-12 and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Vaswani and Mehta, which is described above, and also over U.S. Patent Application Publication No. 2024/0249503 to Jiang et al. (“Jiang”).
Regarding claim 6, Vaswani and Mehta teach a computing device like that of claim 2, as is described above, and which comprises a global layer that includes a state space model layer. Vaswani and Mehta, however, do not disclose that the global layer further includes a local layer positioned in a parallel data path to the state space model layer, as is required by claim 6.
Jiang nevertheless generally teaches implementing a local layer, which computes local self-attention for a local input sequence, in parallel with a global layer that computes global self-attention for an input sequence (see e.g. paragraphs 0008, 0037 and 0080-0083, and FIG. 13).
It would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to modify the global layer (i.e. state space model layer) taught by Vaswani and Mehta so as to comprise a local layer positioned in a parallel data path to the global layer, as is taught by Jiang. It would have been advantageous to one of ordinary skill to utilize such a combination, because it can improve the image classification capabilities of the model, as is taught by Jiang (see e.g. paragraphs 0035-0037). Accordingly, Vaswani, Mehta and Jiang are considered to teach, to one of ordinary skill in the art, a computing device like that of claim 6.
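For illustration only, the parallel arrangement described above may be sketched as follows (the Examiner’s own illustrative constructs; not code from Jiang):
```python
import torch
import torch.nn as nn

class ParallelGlobalLocal(nn.Module):
    """Global (state space) and local (attention) layers on parallel data
    paths, each receiving the same tokenized embeddings."""
    def __init__(self, global_layer: nn.Module, local_layer: nn.Module):
        super().__init__()
        self.global_layer = global_layer   # computes global self-attention
        self.local_layer = local_layer     # computes local self-attention

    def forward(self, emb: torch.Tensor):
        g = self.global_layer(emb)         # global path
        l = self.local_layer(emb)          # local path, in parallel
        return g, l                        # fused downstream (see claim 8)
```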
As per claim 7, Vaswani suggests that the first layer of the transformer can be configured to receive tokenized embeddings for each of a plurality of tokens in an input sequence, from an embedding layer, and compute self-attention for the input sequence (see e.g. Section 3.4 “Embeddings and Softmax,” and Figure 1). As described above (see e.g. the rejection for claim 1), it would have been obvious to modify the transformer taught by Vaswani so as to include a state space model layer (i.e. a global layer) like taught by Mehta in the first layer to compute a global self-attention for the input sequence. As further described above, it would have been obvious to modify this global layer so as to comprise a local layer positioned in a parallel data path to the global layer, as is taught by Jiang. It thus follows that the local layer, being in the first layer of the transformer, would receive the tokenized embeddings for each of the plurality of tokens in the local input sequence, from the embedding layer, and compute local self-attention for the local input sequence. Accordingly, the above-described combination of Vaswani, Mehta and Jiang is further considered to teach a computing device like that of claim 7.
Regarding claim 8, it would have been obvious, as is described above, to modify the global layer taught by Vaswani and Mehta so as to comprise a local layer positioned in a parallel data path to the global layer, as is taught by Jiang. Jiang further suggests utilizing a combine layer to fuse the global self-attention with the local self-attention computed by the global and local layers (see e.g. paragraphs 0008 and 0087-0088). Although Jiang discloses that “a plurality of methods in which the feature fusion is performed at the network layer…may be included” (paragraph 0087), Jiang does not explicitly teach that the combine layer concatenates the global self-attention and the local self-attention, as is required by claim 8. Nevertheless, the multi-head self-attention layer taught by Vaswani, which similarly computes multiple attentions in parallel, comprises a combine layer configured to concatenate the multiple computed attentions (see e.g. section 3.2.2 “Multi-Head Attention” and Figure 2). It thus would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to further modify the global layer taught by Vaswani, Mehta and Jiang so as to include a combine layer like taught by Jiang to fuse the global self-attention with the local self-attention computed within the global layer. It would have been advantageous to one of ordinary skill to utilize such a combination, because it can improve classification quality, as is suggested by Jiang (see e.g. paragraph 0011). It would have further been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to particularly configure the combine layer to concatenate the computed attentions (i.e. the global self-attention and the local self-attention computed within the global layer) like taught by Vaswani. It would have been advantageous to one of ordinary skill to utilize concatenation, because it would provide a useful representation of the combined attentions for further processing, as is evident from Vaswani (see e.g. section 3.2.2 “Multi-Head Attention” and Figure 1). Accordingly, Vaswani, Mehta and Jiang are further considered to teach, to one of ordinary skill in the art, a computing device like that of claim 8.
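For illustration only, a combine layer that concatenates the two attention outputs, analogous to the Concat step of Vaswani’s multi-head attention (Section 3.2.2, Figure 2), may be sketched as follows (the Examiner’s own illustrative constructs):
```python
import torch
import torch.nn as nn

class ConcatCombine(nn.Module):
    """Concatenate global and local self-attention along the feature
    dimension and project back to d_model, analogous to the Concat + linear
    step of multi-head attention (Vaswani, Section 3.2.2)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, g: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([g, l], dim=-1))
```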
As per claim 9, it would have been obvious, as is described above, to modify the global layer taught by Vaswani and Mehta so as to comprise a local layer positioned in a parallel data path to the global layer, as is taught by Jiang, and wherein a combine layer is configured to concatenate the global self-attention and the local self-attention computed by the respective layers. Jiang does not explicitly teach that the global layer further includes an add and normalize layer configured to add and normalize the concatenated global self-attention and local self-attention computed within the global layer with the tokenized embeddings output from the embedding layer, as is required by claim 9. Nevertheless, the first transformer encoder layer taught by Vaswani comprises an add and normalize layer configured to add and normalize the concatenated self-attentions computed by the multi-head attention layer with the tokenized embeddings output from an embedding layer (see e.g. section 3.1 “Encoder and Decoder Stacks,” section 3.2.2. “Multi-Head Attention,” section 3.4 “Embeddings and Softmax,” and Figure 1). It thus would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to further modify the global layer taught by Vaswani, Mehta and Jiang so as to include an add and normalize layer configured to add and normalize the concatenated self-attentions (i.e. the concatenated global self-attention and local self-attention computed within the global layer) with the tokenized embeddings output from the embedding layer, as is taught by Vaswani. It would have been advantageous to one of ordinary skill to utilize such a combination, because it would provide a useful representation of the combined attentions for further processing, as is evident from Vaswani (see e.g. section 3.1 “Encoder and Decoder Stacks,” section 3.2.2. “Multi-Head Attention” and Figure 1). Accordingly, Vaswani, Mehta and Jiang are further considered to teach, to one of ordinary skill in the art, a computing device like that of claim 9.
As per claim 10, it would have been obvious, as is described above, to modify the global layer taught by Vaswani and Mehta so as to comprise a local layer positioned in a parallel data path to the global layer, as is taught by Jiang, and wherein an add and normalize layer is configured to add and normalize the self-attentions output from the global and local layers with tokenized embeddings output from an embedding layer. Jiang does not explicitly teach that the global layer further includes a feed forward network that is configured to receive the normalized, combined global and local self-attention vectors computed in the global layer and output a predicted global layer output at inference time, as is required by claim 10. Nevertheless, each transformer encoder layer taught by Vaswani comprises a feed forward network that is configured to receive normalized, combined attention vectors computed by a multi-head self-attention sublayer in the transformer encoder layer and output a predicted layer output at inference time (see e.g. section 3.1 “Encoder and Decoder Stacks,” section 3.2.2. “Multi-Head Attention,” section 3.3 “Position-wise Feed-Forward Networks,” and Figure 1). It thus would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to further modify the global layer taught by Vaswani, Mehta and Jiang so as to include a feed forward network that is configured to receive the normalized, combined attention vectors (i.e. the normalized, combined global and local self-attention vectors computed in the global layer) and output a predicted layer output at inference time, as is taught by Vaswani. It would have been advantageous to one of ordinary skill to utilize such a combination, because it would provide a useful representation of the combined attentions for further processing, as is evident from Vaswani (see e.g. section 3.1 “Encoder and Decoder Stacks,” section 3.2.2. “Multi-Head Attention,” section 3.3 “Position-wise Feed-Forward Networks,” and Figure 1). Accordingly, Vaswani, Mehta and Jiang are further considered to teach, to one of ordinary skill in the art, a computing device like that of claim 10.
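For illustration only, the position-wise feed forward network referenced above is given in Vaswani’s Section 3.3 as FFN(x) = max(0, xW1 + b1)W2 + b2, which may be sketched directly as follows:
```python
import torch.nn as nn

# FFN(x) = max(0, x W1 + b1) W2 + b2 (Vaswani, Section 3.3), applied to the
# normalized, combined attention vectors at each position independently.
feed_forward = nn.Sequential(
    nn.Linear(512, 2048),   # d_model = 512, inner dimension d_ff = 2048
    nn.ReLU(),              # max(0, .)
    nn.Linear(2048, 512),
)
```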
As per claim 11, Vaswani and Mehta do not explicitly teach that the transformer includes a classification layer configured to receive the encoder representation and generate the prediction, the prediction including one or a plurality of predicted classifications, as is required by claim 11. The transformer described by Jiang nevertheless comprises a classification layer configured to receive an encoder representation and generate a prediction that includes one or a plurality of predicted classifications (see e.g. paragraphs 0034-0037). It would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to further modify the transformer taught by Vaswani, Mehta and Jiang so as to include a classification layer configured to receive the encoder representation and generate the prediction, the prediction including one or a plurality of predicted classifications, as is taught by Jiang. It would have been advantageous to one of ordinary skill to utilize such a combination, because it would enable content in an image to be automatically recognized, as is taught by Jiang (see e.g. paragraphs 0034-0037). Accordingly, Vaswani, Mehta and Jiang are further considered to teach, to one of ordinary skill in the art, a computing device like that of claim 11.
As per claim 12, Vaswani teaches that the transformer can be a sequence-to-sequence transformer that includes a decoder configured to receive the encoder representation and generate as the prediction an output sequence of tokens based upon the input sequence (see e.g. the Abstract, section 3 “Model Architecture” on page 2, and section 3.1 “Encoder and Decoder Stacks”). As described above (see the rejection for claim 6), it would have been obvious to modify the layers taught by Vaswani and Mehta so as to comprise a local layer positioned in a parallel data path to a global layer like taught by Jiang. Vaswani further demonstrates that the decoder has the same attention sub-layers as the encoder, in addition to a masked multi-head attention sub-layer (see e.g. section 3.1 “Encoder and Decoder Stacks” and Figure 1). Accordingly, under the same rationale, it would likewise have been obvious to modify the decoder layers to also include both local and global layers. Accordingly, Vaswani, Mehta and Jiang are further considered to teach, to one of ordinary skill in the art, a computing device like that of claim 12.
Regarding claim 16, Vaswani and Mehta teach a computerized method like that of claim 14, as is described above, and which comprises a state space model layer (i.e. a global layer) as a first layer to compute a global self-attention for an input sequence. Vaswani and Mehta, however, do not disclose that the global layer further includes a local layer positioned in a parallel data path to the state space model layer, wherein the local layer receives the tokenized embeddings for each of the plurality of tokens in the local input sequence, from the embedding layer, and computes local self-attention for the local input sequence, as is required by claim 16.
Jiang nevertheless generally teaches implementing a local layer, which computes local self-attention for a local input sequence, in parallel with a global layer that computes global self-attention for an input sequence (see e.g. paragraphs 0008, 0037 and 0080-0083, and FIG. 13).
It would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to modify the global layer (i.e. state space model layer) taught by Vaswani and Mehta so as to comprise a local layer positioned in a parallel data path to the global layer, as is taught by Jiang. It would have been advantageous to one of ordinary skill to utilize such a combination, because it can improve the image classification capabilities of the model, as is taught by Jiang (see e.g. paragraphs 0035-0037). Vaswani suggests that the first layer of the transformer can be configured to receive tokenized embeddings for each of a plurality of tokens in an input sequence, from an embedding layer, and compute self-attention for the input sequence (see e.g. Section 3.4 “Embeddings and Softmax,” and Figure 1). It thus follows that each of the global layer and the local layer, being in the first layer, would likewise receive the tokenized embeddings for each of the plurality of tokens in the local input sequence, from the embedding layer, and compute global and local self-attentions, respectively, for the local input sequence. Vaswani, Mehta and Jiang are thus considered to teach, to one of ordinary skill in the art, a computerized method like that of claim 16.
Regarding claim 17, it would have been obvious, as is described above, to modify the global layer taught by Vaswani and Mehta so as to comprise a local layer positioned in a parallel data path to the global layer, as is taught by Jiang. Jiang further suggests utilizing a combine layer to fuse the global self-attention with the local self-attention computed by the global and local layers (see e.g. paragraphs 0008 and 0087-0088). Although Jiang discloses that “a plurality of methods in which the feature fusion is performed at the network layer…may be included” (paragraph 0087), Jiang does not explicitly teach that the combine layer concatenates the global self-attention and the local self-attention, as is required by claim 17. Nevertheless, the multi-head self-attention layer taught by Vaswani, which similarly computes multiple attentions in parallel, comprises a combine layer configured to concatenate the multiple computed attentions (see e.g. section 3.2.2 “Multi-Head Attention” and Figure 2). It thus would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to further modify the global layer taught by Vaswani, Mehta and Jiang so as to include a combine layer like taught by Jiang to fuse the global self-attention with the local self-attention computed within the global layer. It would have been advantageous to one of ordinary skill to utilize such a combination, because it can improve classification quality, as is suggested by Jiang (see e.g. paragraph 0011). It would have further been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to particularly configure the combine layer to concatenate the computed attentions (i.e. the global self-attention and the local self-attention computed within the global layer) like taught by Vaswani. It would have been advantageous to one of ordinary skill to utilize concatenation, because it would provide a useful representation of the combined attentions for further processing, as is evident from Vaswani (see e.g. section 3.2.2 “Multi-Head Attention” and Figure 1). Accordingly, Vaswani, Mehta and Jiang are further considered to teach, to one of ordinary skill in the art, a computerized method like that of claim 17.
As per claim 18, it would have been obvious, as is described above, to modify the global layer taught by Vaswani and Mehta so as to comprise a local layer positioned in a parallel data path to the global layer, as is taught by Jiang, and wherein a combine layer is configured to concatenate the global self-attention and the local self-attention computed by the respective layers. Jiang does not explicitly teach that the global layer further includes an add and normalize layer configured to add and normalize the concatenated global self-attention and local self-attention computed within the global layer with the tokenized embeddings output from the embedding layer, as is required by claim 18. Nevertheless, the first transformer encoder layer taught by Vaswani comprises an add and normalize layer configured to add and normalize the concatenated self-attentions computed by the multi-head attention layer with the tokenized embeddings output from an embedding layer (see e.g. section 3.1 “Encoder and Decoder Stacks,” section 3.2.2. “Multi-Head Attention,” section 3.4 “Embeddings and Softmax,” and Figure 1). It thus would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to further modify the global layer taught by Vaswani, Mehta and Jiang so as to include an add and normalize layer configured to add and normalize the concatenated self-attentions (i.e. the concatenated global self-attention and local self-attention computed within the global layer) with the tokenized embeddings output from the embedding layer, as is taught by Vaswani. It would have been advantageous to one of ordinary skill to utilize such a combination, because it would provide a useful representation of the combined attentions for further processing, as is evident from Vaswani (see e.g. section 3.1 “Encoder and Decoder Stacks,” section 3.2.2. “Multi-Head Attention” and Figure 1). Accordingly, Vaswani, Mehta and Jiang are further considered to teach, to one of ordinary skill in the art, a computerized method like that of claim 18.
As per claim 19, it would have been obvious, as is described above, to modify the global layer taught by Vaswani and Mehta so as to comprise a local layer positioned in a parallel data path to the global layer, as is taught by Jiang, and wherein an add and normalize layer is configured to add and normalize the self-attentions output from the global and local layers with tokenized embeddings output from an embedding layer. Jiang does not explicitly teach that the global layer further includes a feed forward network that is configured to receive the normalized, combined global and local self-attention vectors computed in the global layer and output a predicted global layer output at inference time, as is required by claim 19. Nevertheless, each transformer encoder layer taught by Vaswani comprises a feed forward network that is configured to receive normalized, combined attention vectors computed by a multi-head self-attention sublayer in the transformer encoder layer and output a predicted layer output at inference time (see e.g. section 3.1 “Encoder and Decoder Stacks,” section 3.2.2. “Multi-Head Attention,” section 3.3 “Position-wise Feed-Forward Networks,” and Figure 1). It thus would have been obvious to one of ordinary skill in the art, having the teachings of Vaswani, Mehta and Jiang before the effective filing date of the claimed invention, to further modify the global layer taught by Vaswani, Mehta and Jiang so as to include a feed forward network that is configured to receive the normalized, combined attention vectors (i.e. the normalized, combined global and local self-attention vectors computed in the global layer) and output a predicted layer output at inference time, as is taught by Vaswani. It would have been advantageous to one of ordinary skill to utilize such a combination, because it would provide a useful representation of the combined attentions for further processing, as is evident from Vaswani (see e.g. section 3.1 “Encoder and Decoder Stacks,” section 3.2.2. “Multi-Head Attention,” section 3.3 “Position-wise Feed-Forward Networks,” and Figure 1). Accordingly, Vaswani, Mehta and Jiang are further considered to teach, to one of ordinary skill in the art, a computerized method like that of claim 19.
Conclusion
The prior art made of record on form PTO-892 and not relied upon is considered pertinent to applicant’s disclosure. The applicant is required under 37 C.F.R. §1.111(C) to consider these references fully when responding to this action. In particular, the U.S. Patent Application Publication to Lee et al. cited therein describes a slice attention layer that comprises both a local attention layer and a global attention layer. The article by Han et al. cited therein (“Transformer in Transformer”) describes a transformer architecture that comprises inner and outer transformer blocks for computing multiple levels of self-attention. The article by Tay et al. cited therein (“Efficient Transformers: A Survey”) provides a survey of transformer models designed to improve the computational and memory efficiency of the original transformer model.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BLAINE T BASOM whose telephone number is (571)272-4044. The examiner can normally be reached Monday-Friday, 9:00 am - 5:30 pm, EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matt Ell can be reached at (571)270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/BTB/
3/7/2026
/MATTHEW ELL/Supervisory Patent Examiner, Art Unit 2141