Prosecution Insights
Last updated: April 19, 2026
Application No. 18/102,985

GENERATING SEQUENCES OF DATA ELEMENTS USING CROSS-ATTENTION OPERATIONS

Non-Final OA: §103, §112
Filed: Jan 30, 2023
Examiner: STANDKE, ADAM C
Art Unit: 2129
Tech Center: 2100 — Computer Architecture & Software
Assignee: DeepMind Technologies Limited
OA Round: 1 (Non-Final)
Grant Probability: 50% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 4y 3m
Grant Probability With Interview: 74%

Examiner Intelligence

Career Allow Rate: 50% (61 granted / 123 resolved; -5.4% vs TC avg)
Interview Lift: +24.8% for resolved cases with interview (strong lift)
Typical Timeline: 4y 3m avg prosecution; 39 currently pending
Career History: 162 total applications across all art units

Statute-Specific Performance

§101: 18.9% (-21.1% vs TC avg)
§103: 55.3% (+15.3% vs TC avg)
§102: 8.7% (-31.3% vs TC avg)
§112: 14.7% (-25.3% vs TC avg)
Tech Center averages are estimates. Based on career data from 123 resolved cases.

Office Action

§103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on 07/31/2024 and 10/03/2024 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Objections

Claim 3 is objected to because of the following informalities: the claim should state "the position corresponding to the latent embedding" rather than "the position correspond to the latent embedding." Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Claims 1, 19 and 20 recite the limitation "the current position." There is insufficient antecedent basis for this limitation in the claims.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Claims 1, 19 and 20 recite the limitation "the data element." There is insufficient antecedent basis for this limitation in the claims.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-16 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Ma, Xuezhe, et al., "Luna: Linear Unified Nested Attention," arXiv preprint arXiv:2106.01540 (2021) ("Ma") in view of Katharopoulos, Angelos, et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention," International Conference on Machine Learning, PMLR, 2020 ("Katharopoulos").

Regarding claim 1, Ma teaches a method performed by one or more computers, the method comprising: generating a sequence of data elements that comprises a respective data element at each position in a sequence of positions, comprising, for each position after a first position in the sequence of positions (Ma, pgs. 5-6: "They collect five tasks in this benchmark which are ListOps... byte-level text classification... byte-level document retrieval... image classification on sequences of pixels... and Pathfinder... [t]hese tasks consist of input sequences ranging from 1K to 8K tokens and span across a variety of data types and modalities [generating a sequence of data elements that comprises a respective data element at each position in a sequence of positions, comprising, for each position after a first position in the sequence of positions]."): obtaining a current sequence of data element embeddings that comprises a respective data element embedding of each data element at a position that precedes the current position (Ma, pgs. 2-3, see also fig. 2: "[T]he query sequence X ∈ R^(n×d) with length n and the context sequence C ∈ R^(m×d) with length m... where X and C are the two input sequences [obtaining a current sequence of data element embeddings that comprises a respective data element embedding of each data element at a position that precedes] and X' is the output of the Transformer layer [the current position]."); obtaining a sequence of latent embeddings (Ma, pgs. 2-3, see also fig. 2: "[B]esides the original query and context input sequences, Luna introduces an extra input that is a sequence with fixed (constant) length. With this extra input as the query sequence... named pack attention, to pack the context sequence into a fixed-length sequence.
Formally, let P ∈ R^(l×d) denote the extra input sequence with fixed length l [obtaining a sequence of latent embeddings]."); processing: (i) the current sequence of data element embeddings, and (ii) the sequence of latent embeddings, using a neural network to generate the data element at the current position, wherein the neural network comprises a sequence of neural network blocks comprising: (i) a cross-attention block, (ii) one or more self-attention blocks, and (iii) an output block (Ma, pgs. 2-4, see also fig. 2: "Note that the [attention] formulation in (1) is applicable to both cross-attention where C and X are the representations from Transformer encoder and decoder [(i) a cross-attention block], respectively, and self-attention where X and C are the same sequence (X = C) [(ii) one or more self-attention blocks]... [t]he pack attention first packs C to Y_p with P as the query sequence: Y_p = Attn(P, C), where Attn(·,·) is the regular attention function in (1)... [t]o unpack the sequence back to the length of the original query sequence X, Luna leverages its second attention, named unpack attention: Y_X = Attn(X, Y_p), where X ∈ R^(n×d) [processing: (i) the current sequence of data element embeddings, and (ii) the sequence of latent embeddings]... [w]e incorporate the position-wise feed-forward network [using a neural network to generate the data element at the current position, wherein the neural network comprises a sequence of neural network blocks comprising:] and layer normalization into Luna layers.
Concretely, layer normalization is applied to both Y_X and Y_P, while FFN only to Y_X [as detailed in equation 6], where X' and P' are the two outputs of the Luna layer [and (iii) an output block]."), wherein the cross-attention block performs operations comprising: updating each latent embedding in the sequence of latent embeddings using attention over the current sequence of data element embeddings (Ma, pgs. 2-4, see also fig. 2: "Formally, let P ∈ R^(l×d) denote the extra input sequence with fixed length l [latent embedding in the sequence of latent embeddings]. The pack attention first packs C to Y_p with P as the query sequence: Y_p = Attn(P, C), where Attn(·,·) is the regular attention function in (1) [updating each using attention over], C ∈ R^(m×d) is the context sequence [the current sequence of data element embeddings]...");

wherein the output block performs operations comprising: after the sequence of latent embeddings are updated using the cross-attention block and the one or more self-attention blocks, processing one or more latent embeddings from the sequence of latent embeddings to generate the data element at the current position (Ma, pg. 4, see also fig. 2: "The Luna attention is used as a drop-in replacement for the regular attention. We incorporate the position-wise feed-forward network and layer normalization into Luna layers. Concretely, layer normalization is applied to both Y_X and Y_p, while FFN only to Y_X: X_A, P_A = LayerNorm(Y_X + X), LayerNorm(Y_p + P); X', P' = LayerNorm(FFN(X_A) + X_A), P_A [after the sequence of latent embeddings are updated using the cross-attention block and the one or more self-attention blocks, processing one or more latent embeddings from the sequence of latent embeddings], where X', P' are the two outputs of the Luna layer [to generate the data element at the current position].").
While Ma does teach latent embeddings, Ma does not teach: wherein each self-attention block performs operations comprising: updating each latent embedding in the sequence of latent embeddings using attention over the sequence of latent embeddings.

However, Katharopoulos teaches: wherein each self-attention block performs operations comprising: updating each latent embedding in the sequence of latent embeddings using attention over the sequence of latent embeddings (Katharopoulos, pgs. 3-4: "Equation 2 implements a specific form of self-attention called softmax attention where the similarity score is the exponential of the dot product between a query and a key. Given that subscripting a matrix with i returns the i-th row as a vector, we can write a generalized attention equation for any similarity function as follows: V'_i = Σ_{j=1..N} sim(Q_i, K_j) V_j / Σ_{j=1..N} sim(Q_i, K_j) [updating each latent embedding in the sequence of latent embeddings using attention over the sequence of latent embeddings].").

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Ma with the teachings of Katharopoulos; the motivation to do so would be to improve self-attention as used in transformers by reducing the computational complexity from polynomial time to linear time, and hence speeding up the prediction time of transformers (Katharopoulos, pg. 1: "The bottleneck is mainly caused by the global receptive field of self-attention, which processes contexts of N inputs with a quadratic memory and time complexity O(N²)... [l]ately, researchers shifted their attention to approaches that increase the context length without sacrificing efficiency... [i]n this paper, we introduce the linear transformer model that significantly reduces the memory footprint and scales linearly with respect to the context length. We achieve this by using a kernel-based formulation of self-attention and the associative property of matrix products to calculate the self-attention weights.").

Regarding claim 2, Ma in view of Katharopoulos teaches the method of claim 1, wherein updating each latent embedding in the sequence of latent embeddings using attention over the current sequence of data element embeddings comprises: updating each latent embedding in the sequence of latent embeddings using masked attention over the current sequence of data element embeddings (Ma, pg. 4: "To design causal attention in Luna, we need to assume that the input P contains no information of X, i.e. P will not leak any future information of X to the history [using masked attention]... we first define a causal function f [as defined in equation (7)]... [f]rom the definition of f in (7), we see that F_t can only access the information of the past and present row of X, Y, and Z... [f]inally, the output Y is computed by Y = f(A_unpack, A_pack^T, X) [updating each latent embedding in the sequence of latent embeddings using masked attention over the current sequence of data element embeddings].").

Regarding claim 3, Ma in view of Katharopoulos teaches the method of claim 2, wherein each latent embedding corresponds to a respective position in the sequence of positions (Ma, pgs. 2-4, see also fig. 2: "Formally, let P ∈ R^(l×d) denote the extra input sequence with fixed length l [latent embedding].
The pack attention first packs C to Y_p with P as the query sequence: Y_p = Attn(P, C), where Attn(·,·) is the regular attention function in (1)... and Y_p ∈ R^(l×d) [latent embedding] is the output of the pack attention [wherein each corresponds to a respective position in the sequence of positions]..."), and wherein updating each latent embedding in the sequence of latent embeddings using masked attention over the current sequence of data element embeddings comprises, for each latent embedding: updating the latent embedding using attention over only: (i) the data element embedding at the position corresponding to the latent embedding, and (ii) any data element embeddings at positions preceding the position correspond to the latent embedding (Ma, pg. 4: "To design causal attention in Luna, we need to assume that the input P [latent embedding] contains no information of X [data element embedding], i.e. P will not leak any future information of X to the history... we first define a causal function f: F ≜ f(X, Y, Z), where F_t = (1/t) X_t Σ_{j=1..t} Y_j^T Z_j... [f]rom the definition of f in (7), we see that F_t can only access the information of the past and present row of X, Y, and Z... we first compute the attention matrix of the pack attention: A_pack = ω(P X^T / √d) [using attention over only: (i) the data element embedding at the position corresponding to the latent embedding]... we [also] compute the attention matrix of the unpack attention: A_unpack = ω(f(X, X, A_pack^T)) [and (ii) any data element embeddings at positions preceding the position correspond to the latent embedding]... the output Y is computed by Y = f(A_unpack, A_pack^T, X) [for each latent embedding: updating the latent embedding].").
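The pack-and-unpack attention quoted repeatedly above for claims 1-3 can be illustrated with a minimal NumPy sketch. This is a simplification for orientation, not Ma's implementation: a single head, no learned query/key/value projections, and the FFN and layer-normalization steps of the full Luna layer are omitted; the shapes (n = 7 data element embeddings, l = 3 latents, d = 16) are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, c):
    # Regular attention, softmax(Q C^T / sqrt(d)) C, with the context
    # serving as both keys and values (learned projections omitted).
    d = q.shape[-1]
    return softmax(q @ c.T / np.sqrt(d)) @ c

def luna_layer(x, p):
    # Pack: the fixed-length latent sequence P attends over the context X.
    y_p = attn(p, x)          # (l, d) packed context
    # Unpack: X attends over the packed, fixed-length sequence.
    y_x = attn(x, y_p)        # (n, d)
    return y_x, y_p

n, l, d = 7, 3, 16
x = np.random.randn(n, d)     # data element embeddings
p = np.random.randn(l, d)     # latent embeddings
y_x, y_p = luna_layer(x, p)
```

Because P has fixed length l, each attention step costs O(l·n) rather than O(n²), which is the efficiency point the combination of references relies on.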
Regarding claim 4, Ma in view of Katharopoulos teaches the method of claim 1, wherein updating each latent embedding in the sequence of latent embeddings using attention over the sequence of latent embeddings comprises: updating each latent embedding in the sequence of latent embeddings using masked attention over the sequence of latent embeddings (Katharopoulos, pgs. 3-4: "The transformer architecture can be used to efficiently train autoregressive models by masking the attention computation such that the i-th position can only be influenced by a position j if and only if j ≤ i, namely a position cannot be influenced by the subsequent positions... we linearize the masked attention as described below... V'_i = φ(Q_i)^T S_i / (φ(Q_i)^T Z_i) [updating each latent embedding in the sequence of latent embeddings using masked attention over the sequence of latent embeddings]."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Ma with the above teachings of Katharopoulos for the same rationale stated at claim 1.

Regarding claim 5, Ma in view of Katharopoulos teaches the method of claim 4, wherein updating each latent embedding in the sequence of latent embeddings using masked attention over the sequence of latent embeddings comprises, for each latent embedding in the sequence of latent embeddings: updating the latent embedding using attention over only: (i) the latent embedding, and (ii) any latent embeddings that precede the latent embedding in the sequence of latent embeddings (Katharopoulos, pgs. 3-4: "The transformer architecture can be used to efficiently train autoregressive models by masking the attention computation such that the i-th position can only be influenced by a position j if and only if j ≤ i, namely a position cannot be influenced by the subsequent positions... we linearize the masked attention as described below... V'_i = φ(Q_i)^T S_i / (φ(Q_i)^T Z_i) [for each latent embedding in the sequence of latent embeddings: updating the latent embedding using attention over only: (i) the latent embedding, and (ii) any latent embeddings that precede the latent embedding in the sequence of latent embeddings]."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Ma with the above teachings of Katharopoulos for the same rationale stated at claim 1.

Regarding claim 6, Ma in view of Katharopoulos teaches the method of claim 1, wherein for one or more positions in the sequence of positions, obtaining the sequence of latent embeddings comprises: identifying a subsequence of the current sequence of data element embeddings; and determining the sequence of latent embeddings based on the subsequence of the current sequence of data element embeddings (Ma, pgs. 2-4, see also fig. 2: "Formally, let P ∈ R^(l×d) denote the extra input sequence with fixed length l. The pack attention first packs C to Y_p with P as the query sequence [and determining the sequence of latent embeddings based on the subsequence of the current sequence of data element embeddings]... [where] C ∈ R^(m×d) is the context sequence [identifying a subsequence of the current sequence of data element embeddings]...").

Regarding claim 7, Ma in view of Katharopoulos teaches the method of claim 6, wherein the subsequence of the current sequence of data element embeddings comprises a predefined number of last data element embeddings in the sequence of data element embeddings (Ma, pgs. 2-4: as fig. 2 details, Q (i.e., the query), which is drawn from the sequence of data element embeddings (i.e., X), is a predefined number of last data element embeddings drawn from the sequence of data element embeddings (i.e., X)).
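The linearized masked attention Katharopoulos is cited for (claims 4-5 above, and the running sums S_i and Z_i quoted later for claim 13) can be sketched as follows. This is a paraphrase under assumptions, not the paper's code: a single head, the paper's φ(x) = elu(x) + 1 feature map, and a plain Python loop in place of a parallel implementation.

```python
import numpy as np

def elu_plus_one(x):
    # Feature map phi(x) = elu(x) + 1; strictly positive, so denominators are safe.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    # V'_i = phi(Q_i)^T S_i / (phi(Q_i)^T Z_i), with running sums
    # S_i = sum_{j<=i} phi(K_j) V_j^T and Z_i = sum_{j<=i} phi(K_j),
    # so position i only ever sees positions j <= i.
    n, d = Q.shape
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    S = np.zeros((d, V.shape[1]))
    Z = np.zeros(d)
    out = np.empty_like(V)
    for i in range(n):
        S += np.outer(phi_k[i], V[i])
        Z += phi_k[i]
        out[i] = (phi_q[i] @ S) / (phi_q[i] @ Z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
out = causal_linear_attention(Q, K, V)
```

The running sums give O(n) time and constant memory per step, which is the linear-complexity motivation the rejection cites for combining the references.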
Regarding claim 8, Ma in view of Katharopoulos teaches the method of claim 6, wherein determining the sequence of latent embeddings based on the subsequence of the current sequence of data element embeddings comprises: setting the sequence of latent embeddings equal to the subsequence of the current sequence of data element embeddings (Ma, pg. 2, see also fig. 2: "The traditional attention mechanism is a function... the query sequence X ∈ R^(n×d) with length n [subsequence of the current sequence of data element embeddings] and the context sequence C ∈ R^(m×d) with length m, and output one sequence Y ∈ R^(n×d) [latent embeddings] with the same length n as the query X [setting the sequence of latent embeddings equal to the subsequence of the current sequence of data element embeddings].").

Regarding claim 9, Ma in view of Katharopoulos teaches the method of claim 1, wherein generating the data element at the current position comprises: processing the one or more latent embeddings from the sequence of latent embeddings to generate a score distribution over a set of possible data elements; and selecting the data element at the current position using the score distribution over the set of possible data elements (Ma, pg. 4, see also fig. 2: "The Luna attention is used as a drop-in replacement for the regular attention. We incorporate the position-wise feed-forward network and layer normalization into Luna layers. Concretely, layer normalization is applied to both Y_X and Y_p, while FFN only to Y_X: X_A, P_A = LayerNorm(Y_X + X), LayerNorm(Y_p + P); X', P' = LayerNorm(FFN(X_A) + X_A), P_A, where X', P' are the outputs of the Luna layer [processing the one or more latent embeddings from the sequence of latent embeddings to generate a score distribution over a set of possible data elements]." and Ma, pg. 8: "For QA tasks, we concatenate each candidate answer with the corresponding question and passage. We then encode every candidate and pass the [CLS] output at the last layer through a fully-connected layer, which is used to predict the correct answer [and selecting the data element at the current position using the score distribution over the set of possible data elements].").

Regarding claim 10, Ma in view of Katharopoulos teaches the method of claim 9, wherein selecting the data element at the current position using the score distribution over the set of possible data elements comprises: sampling the data element at the current position from the set of possible data elements in accordance with the score distribution over the set of possible data elements (Ma, pg. 8: "For QA tasks, we concatenate each candidate answer with the corresponding question and passage. We then encode every candidate and pass the [CLS] output at the last layer through a fully-connected layer [sampling the data element at the current position from the set of possible data elements], which is used to predict the correct answer.").

Regarding claim 11, Ma in view of Katharopoulos teaches the method of claim 1, wherein for each of one or more positions in the sequence of positions: a number of latent embeddings in the sequence of latent embeddings is less than a number of data element embeddings in the current sequence of data element embeddings (Ma, pgs. 2-4: as fig. 2 details, P and P' (i.e., the latent embeddings) are vectors of three elements while X (i.e., the data element embeddings) is a vector of seven elements [a number of latent embeddings in the sequence of latent embeddings is less than a number of data element embeddings in the current sequence of data element embeddings]).
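The operation mapped onto claims 9-10 (generating a score distribution over possible data elements, then sampling one) reduces to something like the following sketch. The projection matrix W and the four-symbol vocabulary are hypothetical stand-ins for illustration, not anything taken from the cited art.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = np.array(["a", "b", "c", "d"])   # hypothetical set of possible data elements

latent = rng.standard_normal(16)           # one latent embedding (illustrative)
W = rng.standard_normal((16, len(vocab)))  # hypothetical output projection
scores = softmax(latent @ W)               # score distribution over possible elements
element = rng.choice(vocab, p=scores)      # sample the element at the current position
```

Claim 9 covers producing `scores`; claim 10 narrows the selection step to sampling in accordance with that distribution, as in the last line.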
Regarding claim 12, Ma in view of Katharopoulos teaches the method of claim 11, wherein the number of latent embeddings is at least an order of magnitude less than the number of data element embeddings (Ma, pg. 7: "Another interesting observation is that Luna with a small projected length (l = 16) [wherein the number of latent embeddings] obtains similar performance to RFA with k = 256 feature maps [is at least an order of magnitude less than the number of data element embeddings].").

Regarding claim 13, Ma in view of Katharopoulos teaches the method of claim 1, wherein for each position after the first position in the sequence of positions, a number of latent embeddings in the sequence of latent embeddings is predefined and independent of a number of data element embeddings in the current sequence of data element embeddings (Katharopoulos, pgs. 3-4: "The transformer architecture can be used to efficiently train autoregressive models by masking the attention computation such that the i-th position can only be influenced by a position j if and only if j ≤ i, namely a position cannot be influenced by the subsequent positions... we linearize the masked attention as described below... S_i = Σ_{j=1..i} φ(K_j) V_j^T, Z_i = Σ_{j=1..i} φ(K_j), V'_i = φ(Q_i)^T S_i / (φ(Q_i)^T Z_i) [wherein for each position after the first position in the sequence of positions, a number of latent embeddings in the sequence of latent embeddings is predefined and independent of a number of data element embeddings in the current sequence of data element embeddings]."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Ma with the above teachings of Katharopoulos for the same rationale stated at claim 1.
Regarding claim 14, Ma in view of Katharopoulos teaches the method of claim 1, wherein generating the sequence of data elements comprises autoregressively generating the sequence of data elements (Ma, pg. 4: "the ability to support causal autoregressive decoding, i.e. attending solely to the past and current tokens, is required when designing efficient self-attention mechanisms... [t]o design causal attention in Luna, we need to assume that the input P contains no information of X, i.e. P will not leak any future information of X to the history... [f]or the encoder-decoder mode in sequence-to-sequence modeling (e.g. for machine translation), we can use packed output from the Luna encoder as P [autoregressively generating the sequence of data elements].").

Regarding claim 15, Ma in view of Katharopoulos teaches the method of claim 1, wherein the sequence of data elements defines an image (Ma, pg. 5: "We evaluate the effectiveness and efficiency of Luna on the Long Range Arena (LRA) benchmark... [which includes] image classification on sequences of pixels... input sequences ranging from 1K to 8K tokens and span across a variety of data types and modalities [wherein the sequence of data elements defines an image].").

Regarding claim 16, Ma in view of Katharopoulos teaches the method of claim 1, wherein the sequence of data elements defines an audio waveform (Katharopoulos, pgs. 7-8: "[W]e evaluate the performance of linear transformers in end-to-end automatic speech recognition... [w]e use the 80 hour WSJ dataset with 40-dimensional mel-scale filterbanks without temporal differences as features. The dataset contains sequences with 800 frames on average and a maximum sequence length of 2,400 frames [wherein the sequence of data elements defines an audio waveform].").
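The autoregressive pattern the examiner reads onto claim 14 is, at its core, a loop in which the element at each position is predicted from the prefix of elements generated so far. A toy sketch (the stand-in `model` is hypothetical and merely returns the prefix length):

```python
def generate(model, first_element, num_positions):
    # Autoregressive generation: each new data element is produced from
    # the embeddings of the elements at all preceding positions.
    seq = [first_element]
    for _ in range(num_positions - 1):
        seq.append(model(seq))   # the model sees only the current prefix
    return seq

# Toy "model": predicts the prefix length as the next element.
out = generate(lambda prefix: len(prefix), 0, 5)
# out is [0, 1, 2, 3, 4]
```

In the claimed method, `model` would be the cross-attention/self-attention/output-block network, and the causal masking of claims 2-5 is what guarantees the prefix-only dependence this loop assumes.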
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Ma with the above teachings of Katharopoulos for the same rationale stated at claim 1.

Regarding claim 19, Ma teaches a system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers (Ma, pg. 7: "For each experiment, we conduct distributed training across eight NVIDIA Tesla V100 GPUs with maximum batch size of 8192 tokens per GPU."); all other claim limitations are rejected on the same basis as independent claim 1 since they are analogous claims.

Regarding claim 20, Ma teaches one or more non-transitory computer storage media (Ma, pg. 7: "For each experiment, we conduct distributed training across eight NVIDIA Tesla V100 GPUs with maximum batch size of 8192 tokens per GPU."); all other claim limitations are rejected on the same basis as independent claim 1 since they are analogous claims.

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Katharopoulos, and further in view of Suzuki, Masahiro, "Score Transformer: Generating Musical Score from Note-level Representation," Proceedings of the 3rd ACM International Conference on Multimedia in Asia, 2021 ("Suzuki").

Regarding claim 17, Ma in view of Katharopoulos teaches the method of claim 1, but does not teach: wherein the sequence of data elements defines a sequence of musical notes. However, Suzuki teaches: wherein the sequence of data elements defines a sequence of musical notes (Suzuki, pg. 3, see also fig. 1: "A single token corresponds to a score symbol (e.g., barline, clef, key signature, and time signature) or a note attribute (e.g., note value, stem direction, and tie). The only exception is a voice symbol, which consists of a pair of tokens... [w]e concatenate time-ordered token sequences of staves (i.e., right- and left-hand parts of piano scores) to form a single sequence [wherein the sequence of data elements defines a sequence of musical notes]."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Ma in view of Katharopoulos with the teachings of Suzuki; the motivation to do so would be to use transformers to solve the unresolved problem of generating music for musical scores (Suzuki, pgs. 2-3: "Deep learning techniques have yielded impressive results in music generation and music transcription. However, their application to the generation of comprehensive musical scores or even effective representations thereof remains unexplored... [i]n this study, we address the generation of comprehensive musical scores using the Transformer model... which complements music generation and music transcription research.").

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Katharopoulos, and further in view of Nambiar, Ananthan, et al., "Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks," Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2020 ("Nambiar").
Regarding claim 18, Ma in view of Katharopoulos teaches the method of claim 1, but does not teach: wherein the sequence of data elements defines a structure of a protein. However, Nambiar teaches: wherein the sequence of data elements defines a structure of a protein (Nambiar, pg. 4: "A tokenized amino acid sequence ⟨u_1, u_2, ..., u_n⟩ is either truncated or padded to a fixed-length sequence of 512 tokens [wherein the sequence of data elements defines a structure of a protein]."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Ma in view of Katharopoulos with the teachings of Nambiar; the motivation to do so would be to discover new proteins that can actually be tested, using pre-trained Transformers to supplement traditional techniques of protein sequencing and protein interaction prediction (Nambiar, pgs. 2-3: "The scientific community is rapidly generating protein sequence information, but only a fraction of these proteins can be experimentally characterized. While promising deep learning approaches for protein prediction tasks have emerged, they have computational limitations or are designed to solve a specific task. We present a Transformer neural network that pre-trains task-agnostic sequence representations. This model is fine-tuned to solve two different protein prediction tasks: protein family classification and protein interaction prediction.").

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20200342316 A1 (details a self-attention decoder network along with three different types of self-attention, such as memory-compressed attention and local attention)
US 12306906 B2 (details a modified attention scheme in which approximate attention is used along with sampling attention based on the generated scores)

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ADAM C STANDKE, whose telephone number is (571) 270-1806. The examiner can normally be reached generally M-F, 9-9 PM EST.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Michael J Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ADAM C STANDKE/
Examiner, Art Unit 2129

Prosecution Timeline

Jan 30, 2023: Application Filed
Jan 07, 2026: Non-Final Rejection (§103, §112)
Mar 31, 2026: Interview Requested
Apr 09, 2026: Applicant Interview (Telephonic)
Apr 10, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596958: APPARATUS AND METHODS FOR MULTIPLE STAGE PROCESS MODELING. Granted Apr 07, 2026 (2y 5m to grant).
Patent 12555026: INTERPRETABLE HIERARCHICAL CLUSTERING. Granted Feb 17, 2026 (2y 5m to grant).
Patent 12547875: AUTOMATED SETUP AND COMMUNICATION COORDINATION FOR TRAINING AND UTILIZING MASSIVELY PARALLEL NEURAL NETWORKS. Granted Feb 10, 2026 (2y 5m to grant).
Patent 12541704: MACHINE-LEARNING PREDICTION OR SUGGESTION BASED ON OBJECT IDENTIFICATION. Granted Feb 03, 2026 (2y 5m to grant).
Patent 12541691: MIXUP DATA AUGMENTATION FOR KNOWLEDGE DISTILLATION FRAMEWORK. Granted Feb 03, 2026 (2y 5m to grant).
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 50%
With Interview: 74% (+24.8%)
Median Time to Grant: 4y 3m
PTA Risk: Low
Based on 123 resolved cases by this examiner. Grant probability derived from career allow rate.
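On how the headline figures relate: assuming the interview lift is applied additively to the career allow rate (an assumption; the site's exact formula is not stated), the numbers are mutually consistent up to rounding:

```python
base_grant_probability = 0.50   # examiner's career allow rate
interview_lift = 0.248          # observed lift for cases with an interview

# Additive composition (assumed): 0.50 + 0.248 = 0.748,
# consistent with the ~74% shown on the "With Interview" card.
with_interview = base_grant_probability + interview_lift
```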
