Prosecution Insights
Last updated: April 19, 2026
Application No. 18/193,234

MULTIDIMENSIONAL SPACE DECOMPOSITION FOR TRANSFORMER NEURAL NETWORKS

Non-Final OA — §101, §103, §112

Filed: Mar 30, 2023
Examiner: SUSSMAN MOSS, JACOB ZACHARY
Art Unit: 2122
Tech Center: 2100 — Computer Architecture & Software
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)

Grant Probability: 14% (At Risk)
Expected OA Rounds: 1-2
Time to Grant: 3y 3m
With Interview: -6%

Examiner Intelligence

Career Allow Rate: 14% (1 granted / 7 resolved; -40.7% vs Tech Center average). This examiner grants only 14% of cases.
Interview Lift: -20.0% (minimal lift, based on resolved cases with interview).
Typical Timeline: 3y 3m average prosecution; 26 applications currently pending.
Career History: 33 total applications across all art units.
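For context on where these figures come from, here is a minimal Python sketch of the arithmetic the cards above appear to summarize. The Tech Center average and the treatment of the interview lift as a relative change in allow rate are assumptions for illustration, not the product's actual methodology; the dashboard's separate "-6% With Interview" figure may use a different basis.

```python
# Hypothetical reconstruction of the dashboard arithmetic shown above.
# The TC average below is implied by 14% sitting "-40.7% vs TC avg";
# treating the interview lift as a relative change is an assumption.

granted = 1
resolved = 7
tc_avg_allow_rate = 0.550                  # assumed: ~14.3% + 40.7 points

allow_rate = granted / resolved                      # 0.1428... -> "14%"
delta_vs_tc = allow_rate - tc_avg_allow_rate         # ~-0.407 -> "-40.7% vs TC avg"

interview_lift = -0.200                              # "-20.0% Interview Lift"
with_interview = allow_rate * (1 + interview_lift)   # ~0.114 if the lift is relative

print(f"Career allow rate: {allow_rate:.0%}")        # 14%
print(f"vs TC average:     {delta_vs_tc:+.1%}")      # -40.7%
print(f"With interview:    {with_interview:.1%}")    # 11.4%
```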

Statute-Specific Performance

§101: 37.3% (-2.7% vs TC avg)
§103: 35.2% (-4.8% vs TC avg)
§102: 11.9% (-28.1% vs TC avg)
§112: 15.5% (-24.5% vs TC avg)

Tech Center averages are estimates. Based on career data from 7 resolved cases.

Office Action

§101, §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is responsive to the application filed on March 30, 2023. Claims 1-35 are pending in the case. Claims 1, 11, 20 and 28 are independent claims.

The information disclosure statements (IDS) submitted on 30 March 2023, 3 April 2023 and 19 June 2024 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Specification

The disclosure is objected to because of the following informalities: “user equipments” in ¶3, ¶26 should be “user equipment”; “system 400 maybe distributed” in ¶68 should be “system 400 may be distributed”. Appropriate correction is required.

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;

(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and

(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-35 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claims 1, 11, 20 and 28 recite the limitations “a projection of tokens in a first two-dimensional subspace” and “a projection of tokens in a second two-dimensional subspace”. It is unclear whether the second projection of tokens uses the same set of tokens as the first projection, or a new second set of tokens. Claims 2-10, 12-19, 21-27 and 29-35 are rejected for being dependent on a rejected base claim without curing any of the deficiencies.

Non-Statutory Subject Matter

Claims 28-35 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to non-statutory subject matter. During examination, the claims must be interpreted as broadly as their terms reasonably allow. In re American Academy of Science Tech Center, 367 F.3d 1359, 1369, 70 U.S.P.Q.2d 1827, 1834 (Fed. Cir. 2004). Independent claim 28 recites a “computer readable medium,” which is not comprehensively defined by the specification. The broadest reasonable interpretation of a claim drawn to a computer readable medium covers forms of non-transitory tangible media and transitory propagating signals per se in view of the ordinary and customary meaning of computer readable media. Transitory propagating signals are non-statutory subject matter. In re Nuijten, 500 F.3d 1346, 1356-57, 84 U.S.P.Q.2d 1495, 1502 (Fed. Cir. 2007) (transitory embodiments are not directed to statutory subject matter). See also Subject Matter Eligibility of Computer Readable Media, 1351 Off. Gaz. Pat. Office 212 (Feb. 23, 2010).

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-35 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1:

Step 1: Claim 1 is directed to a processor-implemented method; therefore it falls under the statutory category of a process.

Step 2A Prong 1: The claim recites, in part, “decomposing a multidimensional input into a plurality of two-dimensional subspaces, wherein the plurality of two-dimensional subspaces share a common dimension.” This encompasses the mental decomposition of an observed input into two-dimensional subspaces; further, this limitation is a mathematical concept. The claim recites “generating a first attention matrix based on a projection of tokens in a first two-dimensional subspace of the plurality of two-dimensional subspaces.” This encompasses the mental creation of an attention matrix based on a projection of tokens; further, this limitation is a mathematical concept. The claim recites “generating a second attention matrix based on a projection of tokens in a second two-dimensional subspace of the plurality of two-dimensional subspaces.” This encompasses the mental creation of an attention matrix based on a projection of tokens; further, this limitation is a mathematical concept. The claim recites “generating an output…based on the first attention matrix and the second attention matrix.” This encompasses the mental creation of an output based on observed attention matrices; further, this limitation is a mathematical concept.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows. “via an attention block of a transformer neural network” (lines 6-7 of claim 1) and “via the attention block of the transformer neural network” (line 10 of claim 1): these limitations are additional elements that amount to adding the words “apply it” (or an equivalent) to the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). “of the transformer neural network”: this limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself, for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claim 2, the rejection of claim 1 is incorporated and further:

Step 2A Prong 1: The claim recites, in part, “the multidimensional input comprises an input having a plurality of spatial dimensions and a time dimension.” This limitation is a mathematical concept.

Step 2A Prong 2: The claim does not recite any additional limitations, and thus does not recite any additional elements that integrate the judicial exception into a practical application or amount to significantly more.
Regarding claim 3, the rejection of claim 2 is incorporated and further:

Step 2A Prong 1: The claim recites, in part, “the plurality of spatial dimensions comprises a width spatial dimension and a height spatial dimension and wherein the common dimension comprises the time dimension, such that computational complexity involved in generating the output of the transformer neural network is reduced relative to decomposing the multidimensional input into a spatial component and a time component.” This limitation is a mathematical concept.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the claim recites no remaining limitations.

Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself, for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claim 4, the rejection of claim 2 is incorporated and further:

Step 2A Prong 1: The claim is a continuation of the abstract idea identified in the parent claim.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows. “the multidimensional input comprises a video input”: this limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself, for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claim 5, the rejection of claim 2 is incorporated and further:

Step 2A Prong 1: The claim recites, in part, “the first two-dimensional subspace comprises a subspace based on a first spatial dimension of the plurality of spatial dimensions and the time dimension.” This limitation is a mathematical concept. The claim recites “the second two-dimensional subspace comprises a subspace based on a second spatial dimension of the plurality of spatial dimensions and the time dimension.” This limitation is a mathematical concept.

Step 2A Prong 2: The claim does not recite any additional limitations, and thus does not recite any additional elements that integrate the judicial exception into a practical application or amount to significantly more.

Regarding claim 6, the rejection of claim 1 is incorporated and further:

Step 2A Prong 1: The claim recites, in part, “computing a first feature based on the first attention matrix and values projected from the tokens in the first two-dimensional subspace.” This encompasses the mental computation of a feature based on an observed attention matrix; further, this limitation is a mathematical concept. The claim recites “computing a second feature based on the second attention matrix and values projected from the tokens in the second two-dimensional subspace.” This encompasses the mental computation of a feature based on an observed attention matrix; further, this limitation is a mathematical concept. The claim recites “combining the first feature and the second feature into a combined feature representing the multidimensional input.” This limitation is a mathematical concept. The claim recites “generating the output…based on the combined feature.” This encompasses the mental creation of an output based on an observed combined feature.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows. “of the transformer neural network”: this limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).

Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself, for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claim 7, the rejection of claim 6 is incorporated and further:

Step 2A Prong 1: The claim recites, in part, “the combined feature comprises a sum of the first feature and the second feature.” This limitation is a mathematical concept.

Step 2A Prong 2: The claim does not recite any additional limitations, and thus does not recite any additional elements that integrate the judicial exception into a practical application or amount to significantly more.

Regarding claim 8, the rejection of claim 6 is incorporated and further:

Step 2A Prong 1: The claim is a continuation of the abstract idea identified in the parent claim.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows. “processing the combined feature through a feed-forward component of the transformer neural network”: this limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) to the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).

Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself, for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claim 9, the rejection of claim 1 is incorporated and further:

Step 2A Prong 1: The claim recites, in part, “projecting the first two-dimensional subspace into query data, key data, and value data.” This limitation is a mathematical concept. The claim recites “generating the first attention matrix based on the query data, the key data, and a number of components from the common dimension in the first two-dimensional subspace.” This encompasses the mental creation of an attention matrix based on observed data.

Step 2A Prong 2: The claim does not recite any additional limitations, and thus does not recite any additional elements that integrate the judicial exception into a practical application or amount to significantly more.

Regarding claim 10, the rejection of claim 1 is incorporated and further:

Step 2A Prong 1: The claim recites, in part, “projecting the second two-dimensional subspace into query data, key data, and value data.” This limitation is a mathematical concept. The claim recites “generating the second attention matrix based on the query data, the key data, and a number of components from the common dimension in the second two-dimensional subspace.” This encompasses the mental creation of an attention matrix based on observed data.

Step 2A Prong 2: The claim does not recite any additional limitations, and thus does not recite any additional elements that integrate the judicial exception into a practical application or amount to significantly more.
Regarding claim 11:

Step 1: Claim 11 is directed to a system; therefore it falls under the statutory category of a machine.

Step 2A Prong 1: The claim recites, in part, “decompose a multidimensional input into a plurality of two-dimensional subspaces, wherein the plurality of two-dimensional subspaces share a common dimension.” This encompasses the mental decomposition of an observed input into two-dimensional subspaces; further, this limitation is a mathematical concept. The claim recites “generate a first attention matrix based on a projection of tokens in a first two-dimensional subspace of the plurality of two-dimensional subspaces.” This encompasses the mental creation of an attention matrix based on a projection of tokens; further, this limitation is a mathematical concept. The claim recites “generate a second attention matrix based on a projection of tokens in a second two-dimensional subspace of the plurality of two-dimensional subspaces.” This encompasses the mental creation of an attention matrix based on a projection of tokens; further, this limitation is a mathematical concept. The claim recites “generate an output…based on the first attention matrix and the second attention matrix.” This encompasses the mental creation of an output based on observed attention matrices; further, this limitation is a mathematical concept.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows. “a processor configured to execute the executable instructions in order to cause the system to”, “via an attention block of a transformer neural network” (line 10 of claim 11), and “via the attention block of the transformer neural network” (line 13 of claim 11): these limitations are additional elements that amount to adding the words “apply it” (or an equivalent) to the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). “a memory having executable instructions stored thereon” and “of the transformer neural network”: these limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself, for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claims 12-19: The rejection of claim 11 is incorporated, and the rejections of claims 2-6 and 8-10 are applicable to claims 12-19, respectively.

Regarding claim 20:

Step 1: Claim 20 is directed to a system; therefore it falls under the statutory category of a machine.

Step 2A Prong 1: The claim recites, in part, “decomposing a multidimensional input into a plurality of two-dimensional subspaces, wherein the plurality of two-dimensional subspaces share a common dimension.” This encompasses the mental decomposition of an observed input into two-dimensional subspaces; further, this limitation is a mathematical concept. The claim recites “generating a first attention matrix based on a projection of tokens in a first two-dimensional subspace of the plurality of two-dimensional subspaces.” This encompasses the mental creation of an attention matrix based on a projection of tokens; further, this limitation is a mathematical concept.
The claim recites “generating a second attention matrix based on a projection of tokens in a second two-dimensional subspace of the plurality of two-dimensional subspaces.” This encompasses the mental creation of an attention matrix based on a projection of tokens; further, this limitation is a mathematical concept. The claim recites “generating an output…based on the first attention matrix and the second attention matrix.” This encompasses the mental creation of an output based on observed attention matrices; further, this limitation is a mathematical concept.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows. “means for…” (lines 2, 5, 8 and 11 of claim 20), “via an attention block of a transformer neural network” (line 7 of claim 20), and “via the attention block of the transformer neural network” (line 10 of claim 20): these limitations are additional elements that amount to adding the words “apply it” (or an equivalent) to the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). “of the transformer neural network”: this limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself, for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claims 21-27: The rejection of claim 20 is incorporated, and the rejections of claims 2-3, 5-6 and 8-10 are applicable to claims 21-27, respectively.

Regarding claim 28:

Step 1: Claim 28 is directed to non-statutory subject matter.

Step 2A Prong 1: The claim recites, in part, “decomposing a multidimensional input into a plurality of two-dimensional subspaces, wherein the plurality of two-dimensional subspaces share a common dimension.” This encompasses the mental decomposition of an observed input into two-dimensional subspaces; further, this limitation is a mathematical concept. The claim recites “generating a first attention matrix based on a projection of tokens in a first two-dimensional subspace of the plurality of two-dimensional subspaces.” This encompasses the mental creation of an attention matrix based on a projection of tokens; further, this limitation is a mathematical concept. The claim recites “generating a second attention matrix based on a projection of tokens in a second two-dimensional subspace of the plurality of two-dimensional subspaces.” This encompasses the mental creation of an attention matrix based on a projection of tokens; further, this limitation is a mathematical concept. The claim recites “generating an output…based on the first attention matrix and the second attention matrix.” This encompasses the mental creation of an output based on observed attention matrices; further, this limitation is a mathematical concept.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows. “via an attention block of a transformer neural network” (lines 7-8 of claim 28) and “via the attention block of the transformer neural network” (line 11 of claim 28): these limitations are additional elements that amount to adding the words “apply it” (or an equivalent) to the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).
“of the transformer neural network”: this limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself, for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claims 29-35: The rejection of claim 28 is incorporated, and the rejections of claims 2-3, 5-6 and 8-10 are applicable to claims 29-35, respectively.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 6-14, 16-22, 24-30 and 32-35 are rejected under 35 U.S.C. 103 as being unpatentable over Selva et al. (“Video Transformers: A Survey”, 13 February 2023), as cited in the IDS, hereinafter Selva, in view of Duke et al. (“SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation”, 29 March 2021), hereinafter Duke.

Regarding claim 1: Selva teaches A processor-implemented method, comprising: decomposing a multidimensional input into a plurality of two-dimensional subspaces (Selva, page 4, col 2, section 3.3, ¶1: “This is done via positional embeddings (PE), which can be either fixed or learned and then absolute or relative: fixed absolute [8], [30], [58], learned absolute [11], [61], [81], fixed relative [42], [83], or learned relative [10], [84], [85]” and Selva, page 4, col 2, section 3.3, ¶2: “For this reason, 2D absolute PE [68], [88] accounting for joint space wh and time t dimensions, and 3D absolute PE [58], [60], [61], [67] for width w, height h, and t have also been proposed, disregarding [5] who found 1D learned absolute PE to suffice – at least for images.” Here, the 2D positional embeddings can be considered a decomposition of a multidimensional input into a plurality of two-dimensional subspaces in light of the specification, ¶20: “For example, the input data 105 may correspond to a multidimensional input, a tokenized version of the multidimensional input (which may optionally include positional embedding(s) and/or learnable token(s)), or the like”); generating a first attention matrix based on a projection of tokens in a first two-dimensional subspace of the plurality of two-dimensional subspaces via an attention block of a transformer neural network (Selva, page 2, col 1, ¶3: “Self-attention (SA). It is the core operation of the Transformer. Given an arbitrary sequence of token embeddings X ∈ R^(N_x×d_m) (e.g., X_0), it augments (contextualizes) each of the embeddings x_i ∈ R^(d_m) with information from the rest of embeddings. For that, the embeddings in X are linearly mapped to the embedding spaces of queries Q = XW_Q ∈ R^(N_x×d_k), keys K = XW_K ∈ R^(N_x×d_k), and values V = XW_V ∈ R^(N_x×d_v), where W_Q, W_K ∈ R^(d_m×d_k), W_V ∈ R^(d_m×d_v), and typically d_k, d_v ≤ d_m. Then, self-attention can be computed as follows: [equation image: scaled dot-product self-attention] The dot-product QK^⊤ ∈ R^(N_x×N_x) is a measure of similarity.” Here the attention mapping made of queries, keys and values can be considered an attention matrix in light of the specification, ¶22: “In some aspects, an attention matrix A 130 (also referred to as an “attention map” or simply “attention” in some aspects) is then generated based on the queries and keys. For example, the self-attention block may, at operation 128, compute the dot product of the query matrix and the transposed key matrix (e.g., Q∙K^T). In some aspects, the self-attention block can apply one or more operations (e.g., a row-wise softmax operation) to the dot product to yield that attention matrix. That is, the attention matrix A 130 may be defined as A = σ(Q∙K^T), where σ is the softmax function (or some other regularizing function usable in a transformer neural network).”); generating a second attention matrix based on a projection of tokens in a second two-dimensional subspace of the plurality of two-dimensional subspaces via the attention block of the transformer neural network (Selva, page 6, col 5, section 4.1.1, ¶1: “In order to approximate the full receptive field (i.e., the whole input sequence), restriction relies on stacking multiple smaller SA (similar to local filters in CNNs). We categorize restricted approaches by how subsets of tokens are selected for each SA. It can be by attending local token neighborhoods, specific axis (i.e., height, width or time) or sparsely sampled subsets of tokens (see Fig. 3a).” Here, each SA (self-attention map) can be considered an attention matrix and the differing sizes of the SA can be considered a second two-dimensional subspace); and generating an output of the transformer neural network based on the first attention matrix and the second attention matrix (Selva, page 6, fig. 3a [figure image: Selva Fig. 3a]. The output T of the attention matrices can be considered an output of the transformer based on the first attention matrix and the second attention matrix).

Selva does not teach “wherein the plurality of two-dimensional subspaces share a common dimension.” However, Duke teaches wherein the plurality of two-dimensional subspaces share a common dimension (Duke, page 3, col 2, ¶2: “The SST architecture (Fig. 2) takes a length T sequence of H × W RGB video frames S ∈ R^(3×T×H×W) as input.” Here, the T-length sequence can be considered the common time dimension).

Selva and Duke are analogous art because both references concern methods for video transformer networks. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Selva’s transformer network to incorporate the common dimension taught by Duke. The motivation for doing so would have been to reduce feature matching FLOPs as stated in Duke, page 2, col 1, ¶4: “SST reduces feature matching FLOPs by an order of magnitude, and achieves an overall score of 81.8 on the official YouTube-VOS 2019 validation set, comparing favourably with prior work.”
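To make the quoted passage easier to follow, here is a minimal Python sketch of the scaled dot-product self-attention it describes. Variable names mirror the N_x × d_m notation above; this is an illustration under those assumptions, not code from Selva, Duke, or the application.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention as described in the quoted passage.

    X: (N_x, d_m) token embeddings; W_Q, W_K: (d_m, d_k); W_V: (d_m, d_v).
    Returns contextualized embeddings of shape (N_x, d_v).
    """
    Q = X @ W_Q                       # queries, (N_x, d_k)
    K = X @ W_K                       # keys,    (N_x, d_k)
    V = X @ W_V                       # values,  (N_x, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # dot-product similarity, (N_x, N_x)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax -> attention matrix
    return A @ V
```

The row-wise softmax of QK^⊤ here corresponds to the attention matrix A = σ(Q∙K^T) recited in the specification's ¶22 quote.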
Regarding claim 2: Selva in view of Duke teaches The method of Claim 1, wherein the multidimensional input comprises an input having a plurality of spatial dimensions and a time dimension (Selva, page 1, abstract: “Specifically, we delve into how videos are handled at the input level first.” A video can be considered to have a plurality of spatial dimensions (height and width of a frame) and a time dimension (the sequence of frames)).

Regarding claim 3: Selva in view of Duke teaches The method of Claim 2, wherein the plurality of spatial dimensions comprises a width spatial dimension and a height spatial dimension and wherein the common dimension comprises the time dimension (Duke, page 3, col 2, ¶2: “The SST architecture (Fig. 2) takes a length T sequence of H × W RGB video frames S ∈ R^(3×T×H×W) as input.” Here, the T-length sequence can be considered the common time dimension), such that computational complexity involved in generating the output of the transformer neural network is reduced relative to decomposing the multidimensional input into a spatial component and a time component (Selva, col 1, section 3.4, ¶1: “Regarding tokenization, it has an impact on two main factors:…(2) it will impact the input sequence length, and consequently the computational complexity of the model”).

Regarding claim 4: Selva in view of Duke teaches The method of Claim 2, wherein the multidimensional input comprises a video input (Selva, page 1, abstract: “Specifically, we delve into how videos are handled at the input level first.”).

Regarding claim 6: Selva in view of Duke teaches The method of Claim 1, wherein generating the output of the transformer neural network comprises: computing a first feature based on the first attention matrix and values projected from the tokens in the first two-dimensional subspace; computing a second feature based on the second attention matrix and values projected from the tokens in the second two-dimensional subspace (Duke, page 5, col 2, ¶4: “For a feature cell at position (x, y, t) to interact with another feature cell at an arbitrary position (i, j, k), interactions must propagate along a “route” composed of pairs of similar feature cells.” Here, feature cell (x, y, t) can be considered a first feature and another feature cell (i, j, k) can be considered a second feature); combining the first feature and the second feature into a combined feature representing the multidimensional input (Duke, page 5, col 2, ¶4: “We can show that composing three applications of grid self-attention on spatiotemporal feature tensor T produces an output tensor where each spatiotemporal feature cell with coordinates (x, y, t) is composed of a weighted sum [equation image: weighted sum over feature cells] over other feature cells in T with coordinates (i, j, k).”); and generating the output of the transformer neural network based on the combined feature (Duke, page 3, col 2, ¶4: “The SST encoder passes two outputs to the SST decoder, the first of which is spatiotemporal features T̃_L ∈ R^(c×τ×H′×W′).”).

Selva and Duke are analogous art because both references concern methods for video transformer networks. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Selva’s transformer network to incorporate the feature computation and combination taught by Duke. The motivation for doing so would have been to reduce feature matching FLOPs as stated in Duke, page 2, col 1, ¶4: “SST reduces feature matching FLOPs by an order of magnitude, and achieves an overall score of 81.8 on the official YouTube-VOS 2019 validation set, comparing favourably with prior work.”

Regarding claim 7: Selva in view of Duke teaches The method of Claim 6, wherein the combined feature comprises a sum of the first feature and the second feature (Duke, page 5, col 2, ¶4: “We can show that composing three applications of grid self-attention on spatiotemporal feature tensor T produces an output tensor where each spatiotemporal feature cell with coordinates (x, y, t) is composed of a weighted sum [equation image: weighted sum over feature cells] over other feature cells in T with coordinates (i, j, k).”). It would have been obvious to combine the teachings of Selva and Duke for the reasons set forth in connection with claim 6 above.

Regarding claim 8: Selva in view of Duke teaches The method of Claim 6, wherein generating the output of the transformer neural network based on the combined feature comprises processing the combined feature through a feed-forward component of the transformer neural network (Duke, page 2, col 2, section 2, ¶5: “We use sparse attention operators to propagate features temporally in a single feedforward operation”). It would have been obvious to combine the teachings of Selva and Duke for the reasons set forth in connection with claim 6 above.

Regarding claim 9: Selva in view of Duke teaches The method of Claim 1, wherein generating the first attention matrix based on the projection of the tokens in the first two-dimensional subspace comprises: projecting the first two-dimensional subspace into query data, key data, and value data (Selva, page 2, col 1, ¶3: “For that, the embeddings in X are linearly mapped to the embedding spaces of queries Q = XW_Q ∈ R^(N_x×d_k), keys K = XW_K ∈ R^(N_x×d_k), and values V = XW_V ∈ R^(N_x×d_v), where W_Q, W_K ∈ R^(d_m×d_k), W_V ∈ R^(d_m×d_v), and typically d_k, d_v ≤ d_m.”); and generating the first attention matrix based on the query data, the key data, and a number of components from the common dimension in the first two-dimensional subspace (Selva, page 2, col 1, ¶3: “Then, self-attention can be computed as follows: [equation image: self-attention formula] The dot-product QK^⊤ ∈ R^(N_x×N_x) is a measure of similarity.”).

Regarding claim 10: Selva in view of Duke teaches The method of Claim 1, wherein generating the second attention matrix based on the projection of the tokens in the second two-dimensional subspace comprises: projecting the second two-dimensional subspace into query data, key data, and value data (Selva, page 2, col 1, ¶3: “For that, the embeddings in X are linearly mapped to the embedding spaces of queries Q = XW_Q ∈ R^(N_x×d_k), keys K = XW_K ∈ R^(N_x×d_k), and values V = XW_V ∈ R^(N_x×d_v), where W_Q, W_K ∈ R^(d_m×d_k), W_V ∈ R^(d_m×d_v), and typically d_k, d_v ≤ d_m.”); and generating the second attention matrix based on the query data, the key data, and a number of components from the common dimension in the second two-dimensional subspace (Selva, page 2, col 1, ¶3: “Then, self-attention can be computed as follows: [equation image: self-attention formula] The dot-product QK^⊤ ∈ R^(N_x×N_x) is a measure of similarity.”).
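For orientation, here is a schematic Python sketch of the decomposition recited in claims 1, 6 and 7: two 2D subspaces (width-time and height-time) sharing the time dimension, attention computed within each subspace, and the resulting features combined by a sum. It reuses the self_attention helper from the sketch above; the shapes, helper names, and weight layout are illustrative assumptions, not the applicant's implementation or code from the cited references.

```python
import numpy as np  # assumes self_attention() from the earlier sketch is in scope

def decomposed_attention(video_tokens, weights_wt, weights_ht):
    """Schematic of claims 1, 6 and 7: decompose a (W, H, T, d_m) token grid
    into width-time and height-time 2D subspaces sharing the time dimension,
    attend within each subspace, and sum the resulting features."""
    W, H, T, d = video_tokens.shape

    # First 2D subspace: width x time (one attention call per height index).
    wt = video_tokens.transpose(1, 0, 2, 3).reshape(H, W * T, d)
    feat1 = np.stack([self_attention(x, *weights_wt) for x in wt])
    feat1 = feat1.reshape(H, W, T, -1).transpose(1, 0, 2, 3)  # back to (W, H, T, d_v)

    # Second 2D subspace: height x time (one attention call per width index).
    ht = video_tokens.reshape(W, H * T, d)
    feat2 = np.stack([self_attention(x, *weights_ht) for x in ht])
    feat2 = feat2.reshape(W, H, T, -1)

    # Claim 7: the combined feature is the sum of the first and second features.
    return feat1 + feat2
```

Each weights argument is a (W_Q, W_K, W_V) tuple, e.g. decomposed_attention(np.random.randn(8, 8, 4, 16), (Wq1, Wk1, Wv1), (Wq2, Wk2, Wv2)); the point is only that each attention call sees W·T or H·T tokens rather than the full W·H·T sequence.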
Regarding claim 11: Selva teaches A system, comprising: a memory having executable instructions stored thereon; and a processor configured to execute the executable instructions in order to cause the system to (Selva, page 7, col 2, section 4.2, ¶1: “In both, portions (i.e., frames/clips) of the videos are processed sequentially in a sliding window fashion to keep manageable compute and GPU memory but still ensure relevant information from past windows is within reach.” Here, the GPU inherently discloses a memory having executable instructions and a processor necessary for its function): decompose a multidimensional input into a plurality of two-dimensional subspaces (Selva, page 4, col 2, section 3.3, ¶1: “This is done via positional embeddings (PE), which can be either fixed or learned and then absolute or relative: fixed absolute [8], [30], [58], learned absolute [11], [61], [81], fixed relative [42], [83], or learned relative [10], [84], [85]” and Selva, page 4, col 2, section 3.3, ¶2: “For this reason, 2D absolute PE [68], [88] accounting for joint space wh and time t dimensions, and 3D absolute PE [58], [60], [61], [67] for width w, height h, and t have also been proposed, disregarding [5] who found 1D learned absolute PE to suffice – at least for images.” Here, the 2D positional embeddings can be considered a decomposition of a multidimensional input into a plurality of two-dimensional subspaces in light of the specification, ¶20: “For example, the input data 105 may correspond to a multidimensional input, a tokenized version of the multidimensional input (which may optionally include positional embedding(s) and/or learnable token(s)), or the like”); generate a first attention matrix based on a projection of tokens in a first two-dimensional subspace of the plurality of two-dimensional subspaces via an attention block of a transformer neural network (Selva, page 2, col 1, ¶3: “Self-attention (SA). It is the core operation of the Transformer. Given an arbitrary sequence of token embeddings X ∈ R^(N_x×d_m) (e.g., X_0), it augments (contextualizes) each of the embeddings x_i ∈ R^(d_m) with information from the rest of embeddings. For that, the embeddings in X are linearly mapped to the embedding spaces of queries Q = XW_Q ∈ R^(N_x×d_k), keys K = XW_K ∈ R^(N_x×d_k), and values V = XW_V ∈ R^(N_x×d_v), where W_Q, W_K ∈ R^(d_m×d_k), W_V ∈ R^(d_m×d_v), and typically d_k, d_v ≤ d_m. Then, self-attention can be computed as follows: [equation image: scaled dot-product self-attention] The dot-product QK^⊤ ∈ R^(N_x×N_x) is a measure of similarity.” Here the attention mapping made of queries, keys and values can be considered an attention matrix in light of the specification, ¶22: “In some aspects, an attention matrix A 130 (also referred to as an “attention map” or simply “attention” in some aspects) is then generated based on the queries and keys. For example, the self-attention block may, at operation 128, compute the dot product of the query matrix and the transposed key matrix (e.g., Q∙K^T). In some aspects, the self-attention block can apply one or more operations (e.g., a row-wise softmax operation) to the dot product to yield that attention matrix. That is, the attention matrix A 130 may be defined as A = σ(Q∙K^T), where σ is the softmax function (or some other regularizing function usable in a transformer neural network).”); generate a second attention matrix based on a projection of tokens in a second two-dimensional subspace of the plurality of two-dimensional subspaces via the attention block of the transformer neural network (Selva, page 6, col 5, section 4.1.1, ¶1: “In order to approximate the full receptive field (i.e., the whole input sequence), restriction relies on stacking multiple smaller SA (similar to local filters in CNNs). We categorize restricted approaches by how subsets of tokens are selected for each SA. It can be by attending local token neighborhoods, specific axis (i.e., height, width or time) or sparsely sampled subsets of tokens (see Fig. 3a).” Here, each SA (self-attention map) can be considered an attention matrix and the differing sizes of the SA can be considered a second two-dimensional subspace); and generate an output of the transformer neural network based on the first attention matrix and the second attention matrix (Selva, page 6, fig. 3a [figure image: Selva Fig. 3a]. The output T of the attention matrices can be considered an output of the transformer based on the first attention matrix and the second attention matrix).

Selva does not teach “wherein the plurality of two-dimensional subspaces share a common dimension.” However, Duke teaches wherein the plurality of two-dimensional subspaces share a common dimension (Duke, page 3, col 2, ¶2: “The SST architecture (Fig. 2) takes a length T sequence of H × W RGB video frames S ∈ R^(3×T×H×W) as input.” Here, the T-length sequence can be considered the common time dimension).

Selva and Duke are analogous art because both references concern methods for video transformer networks. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Selva’s transformer network to incorporate the common dimension taught by Duke. The motivation for doing so would have been to reduce feature matching FLOPs as stated in Duke, page 2, col 1, ¶4: “SST reduces feature matching FLOPs by an order of magnitude, and achieves an overall score of 81.8 on the official YouTube-VOS 2019 validation set, comparing favourably with prior work.”
Regarding claim 20: Selva teaches A system, comprising: means for decomposing a multidimensional input into a plurality of two-dimensional subspaces (Selva, page 4, col 2, section 3.3, ¶1: “This is done via positional embeddings (PE), which can be either fixed or learned and then absolute or relative: fixed absolute [8], [30], [58], learned absolute [11], [61], [81], fixed relative [42], [83], or learned relative [10], [84], [85]” and Selva, page 4, col 2, section 3.3, ¶2: “For this reason, 2D absolute PE [68], [88] accounting for joint space wh and time t dimensions, and 3D absolute PE [58], [60], [61], [67] for width w, height h, and t have also been proposed, disregarding [5] who found 1D learned absolute PE to suffice – at least for images.” Here, the 2D positional embeddings can be considered a decomposition of a multidimensional input into a plurality of two-dimensional subspaces in light of the specification, ¶20: “For example, the input data 105 may correspond to a multidimensional input, a tokenized version of the multidimensional input (which may optionally include positional embedding(s) and/or learnable token(s)), or the like”); means for generating a first attention matrix based on a projection of tokens in a first two-dimensional subspace of the plurality of two-dimensional subspaces via an attention block of a transformer neural network (Selva, page 2, col 1, ¶3: “Self-attention (SA). It is the core operation of the Transformer. Given an arbitrary sequence of token embeddings X ∈ R^(N_x×d_m) (e.g., X_0), it augments (contextualizes) each of the embeddings x_i ∈ R^(d_m) with information from the rest of embeddings. For that, the embeddings in X are linearly mapped to the embedding spaces of queries Q = XW_Q ∈ R^(N_x×d_k), keys K = XW_K ∈ R^(N_x×d_k), and values V = XW_V ∈ R^(N_x×d_v), where W_Q, W_K ∈ R^(d_m×d_k), W_V ∈ R^(d_m×d_v), and typically d_k, d_v ≤ d_m. Then, self-attention can be computed as follows: [equation image: scaled dot-product self-attention] The dot-product QK^⊤ ∈ R^(N_x×N_x) is a measure of similarity.” Here the attention mapping made of queries, keys and values can be considered an attention matrix in light of the specification, ¶22: “In some aspects, an attention matrix A 130 (also referred to as an “attention map” or simply “attention” in some aspects) is then generated based on the queries and keys. For example, the self-attention block may, at operation 128, compute the dot product of the query matrix and the transposed key matrix (e.g., Q∙K^T). In some aspects, the self-attention block can apply one or more operations (e.g., a row-wise softmax operation) to the dot product to yield that attention matrix. That is, the attention matrix A 130 may be defined as A = σ(Q∙K^T), where σ is the softmax function (or some other regularizing function usable in a transformer neural network).”); means for generating a second attention matrix based on a projection of tokens in a second two-dimensional subspace of the plurality of two-dimensional subspaces via the attention block of the transformer neural network (Selva, page 6, col 5, section 4.1.1, ¶1: “In order to approximate the full receptive field (i.e., the whole input sequence), restriction relies on stacking multiple smaller SA (similar to local filters in CNNs). We categorize restricted approaches by how subsets of tokens are selected for each SA. It can be by attending local token neighborhoods, specific axis (i.e., height, width or time) or sparsely sampled subsets of tokens (see Fig. 3a).” Here, each SA (self-attention map) can be considered an attention matrix and the differing sizes of the SA can be considered a second two-dimensional subspace); and means for generating an output of the transformer neural network based on the first attention matrix and the second attention matrix (Selva, page 6, fig. 3a [figure image: Selva Fig. 3a]. The output T of the attention matrices can be considered an output of the transformer based on the first attention matrix and the second attention matrix).

Selva does not teach “wherein the plurality of two-dimensional subspaces share a common dimension.” However, Duke teaches wherein the plurality of two-dimensional subspaces share a common dimension (Duke, page 3, col 2, ¶2: “The SST architecture (Fig. 2) takes a length T sequence of H × W RGB video frames S ∈ R^(3×T×H×W) as input.” Here, the T-length sequence can be considered the common time dimension).

Selva and Duke are analogous art because both references concern methods for video transformer networks. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Selva’s transformer network to incorporate the common dimension taught by Duke. The motivation for doing so would have been to reduce feature matching FLOPs as stated in Duke, page 2, col 1, ¶4: “SST reduces feature matching FLOPs by an order of magnitude, and achieves an overall score of 81.8 on the official YouTube-VOS 2019 validation set, comparing favourably with prior work.”

Regarding claim 28: Selva teaches A computer-readable medium having executable instructions stored thereon which, when executed by a processor, perform an operation comprising: decomposing a multidimensional input into a plurality of two-dimensional subspaces (Selva, page 4, col 2, section 3.3, ¶1: “This is done via positional embeddings (PE), which can be either fixed or learned and then absolute or relative: fixed absolute [8], [30], [58], learned absolute [11], [61], [81], fixed relative [42], [83], or learned relative [10], [84], [85]” and Selva, page 4, col 2, section 3.3, ¶2: “For this reason, 2D absolute PE [68], [88] accounting for joint space wh and time t dimensions, and 3D absolute PE [58], [60], [61], [67] for width w, height h, and t have also been proposed, disregarding [5] who found 1D learned absolute PE to suffice – at least for images.” Here, the 2D positional embeddings can be considered a decomposition of a multidimensional input into a plurality of two-dimensional subspaces in light of the specification, ¶20: “For example, the input data 105 may correspond to a multidimensional input, a tokenized version of the multidimensional input (which may optionally include positional embedding(s) and/or learnable token(s)), or the like”); generating a first attention matrix based on a projection of tokens in a first two-dimensional subspace of the plurality of two-dimensional subspaces via an attention block of a transformer neural network (Selva, page 2, col 1, ¶3: “Self-attention (SA). It is the core operation of the Transformer. Given an arbitrary sequence of token embeddings X ∈ R^(N_x×d_m) (e.g., X_0), it augments (contextualizes) each of the embeddings x_i ∈ R^(d_m) with information from the rest of embeddings. For that, the embeddings in X are linearly mapped to the embedding spaces of queries Q = XW_Q ∈ R^(N_x×d_k), keys K = XW_K ∈ R^(N_x×d_k), and values V = XW_V ∈ R^(N_x×d_v), where W_Q, W_K ∈ R^(d_m×d_k), W_V ∈ R^(d_m×d_v), and typically d_k, d_v ≤ d_m. Then, self-attention can be computed as follows: [equation image: scaled dot-product self-attention] The dot-product QK^⊤ ∈ R^(N_x×N_x) is a measure of similarity.” Here the attention mapping made of queries, keys and values can be considered an attention matrix in light of the specification, ¶22: “In some aspects, an attention matrix A 130 (also referred to as an “attention map” or simply “attention” in some aspects) is then generated based on the queries and keys. For example, the self-attention block may, at operation 128, compute the dot product of the query matrix and the transposed key matrix (e.g., Q∙K^T). In some aspects, the self-attention block can apply one or more operations (e.g., a row-wise softmax operation) to the dot product to yield that attention matrix. That is, the attention matrix A 130 may be defined as A = σ(Q∙K^T), where σ is the softmax function (or some other regularizing function usable in a transformer neural network).”); generating a second attention matrix based on a projection of tokens in a second two-dimensional subspace of the plurality of two-dimensional subspaces via the attention block of the transformer neural network (Selva, page 6, col 5, section 4.1.1, ¶1: “In order to approximate the full receptive field (i.e., the whole input sequence), restriction relies on stacking multiple smaller SA (similar to local filters in CNNs). We categorize restricted approaches by how subsets of tokens are selected for each SA. It can be by attending local token neighborhoods, specific axis (i.e., height, width or time) or sparsely sampled subsets of tokens (see Fig. 3a).” Here, each SA (self-attention map) can be considered an attention matrix and the differing sizes of the SA can be considered a second two-dimensional subspace); and generating an output of the transformer neural network based on the first attention matrix and the second attention matrix (Selva, page 6, fig. 3a [figure image: Selva Fig. 3a]. The output T of the attention matrices can be considered an output of the transformer based on the first attention matrix and the second attention matrix).

Selva does not teach “wherein the plurality of two-dimensional subspaces share a common dimension.” However, Duke teaches wherein the plurality of two-dimensional subspaces share a common dimension (Duke, page 3, col 2, ¶2: “The SST architecture (Fig. 2) takes a length T sequence of H × W RGB video frames S ∈ R^(3×T×H×W) as input.” Here, the T-length sequence can be considered the common time dimension).

Selva and Duke are analogous art because both references concern methods for video transformer networks. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Selva’s transformer network to incorporate the common dimension taught by Duke. The motivation for doing so would have been to reduce feature matching FLOPs as stated in Duke, page 2, col 1, ¶4: “SST reduces feature matching FLOPs by an order of magnitude, and achieves an overall score of 81.8 on the official YouTube-VOS 2019 validation set, comparing favourably with prior work.”

Claims 5, 15, 23 and 31 are rejected under 35 U.S.C. 103 as being unpatentable over Selva in view of Duke, in further view of Ho et al. (“Axial Attention In Multidimensional Transformers”, 20 December 2019), hereinafter Ho.

Regarding claim 5: Selva in view of Duke teaches The method of Claim 2, wherein: Selva in view of Duke does not teach “the first two-dimensional subspace comprises a subspace based on a first spatial dimension of the plurality of spatial dimensions and the time dimension, and the second two-dimensional subspace comprises a subspace based on a second spatial dimension of the plurality of spatial dimensions and the time dimension.” However, Ho teaches the first two-dimensional subspace comprises a subspace based on a first spatial dimension of the plurality of spatial dimensions and the time dimension (Ho, page 4, section 3.1, ¶3: “When the data is an image, we call Attention1 column attention, as it mixes information within columns while keeping separate columns independent. We call Attention2 row attention for analogous reasons.” Further, Ho, page 4, section 3.1, ¶3: “Of course, a single layer of axial attention along some axis k does not have the full receptive field since it covers a single axis, but we will see in section 3.2 that stacking two axial attention layers allows the model to obtain a global receptive field.” Here, Attention1 column attention can be considered a first spatial dimension and the stacking of the axes can be considered the time dimension), and the second two-dimensional subspace comprises a subspace based on a second spatial dimension of the plurality of spatial dimensions and the time dimension (Ho, page 4, section 3.1, ¶3: “When the data is an image, we call Attention1 column attention, as it mixes information within columns while keeping separate columns independent. We call Attention2 row attention for analogous reasons.” Further, Ho, page 4, section 3.1, ¶3: “Of course, a single layer of axial attention along some axis k does not have the full receptive field since it covers a single axis, but we will see in section 3.2 that stacking two axial attention layers allows the model to obtain a global receptive field.” Here, Attention2 row attention can be considered a second spatial dimension and the stacking of the axes can be considered the time dimension).

Selva in view of Duke and Ho are analogous art because the references concern methods for multidimensional transformers. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Selva/Duke’s transformer network to incorporate the spatiotemporal subspaces taught by Ho. The motivation for doing so would have been to obtain a simple, fast and partly parallel sampling procedure as stated in Ho, page 5, section 3.2, ¶1: “Decomposing the model in this manner also leads to a simple fast and partly parallel sampling procedure (section 3.2.1).”

Regarding claims 15, 23 and 31: Claims 15, 23 and 31 are rejected under the same rationale as claim 5.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JACOB Z SUSSMAN MOSS whose telephone number is (571) 272-1579. The examiner can normally be reached Monday - Friday, 9 a.m. - 5 p.m. ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached on (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J.S.M./ Examiner, Art Unit 2122
/KAKALI CHAKI/ Supervisory Patent Examiner, Art Unit 2122

Prosecution Timeline

Mar 30, 2023
Application Filed
Dec 29, 2025
Non-Final Rejection — §101, §103, §112 (current)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 14%
With Interview: -6% (-20.0% lift)
Median Time to Grant: 3y 3m
PTA Risk: Low
Based on 7 resolved cases by this examiner. Grant probability derived from career allow rate.
