DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 19 and 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 19 recites “the one or more processors” in line 9. There is insufficient antecedent basis for this limitation in the claim.
Claim 20 inherits the same deficiency as it depends on claim 19.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1, 4, 7, 11-14 and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Bain et al. (“Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval”).
Regarding claim 1, Bain et al. discloses a computer system for performing video processing tasks with improved computational efficiency, the computer system comprising:
one or more processors (processor of implied computer); and
one or more non-transitory computer-readable media (memory of implied computer) that collectively store:
a machine-learned model (“We build on this flexibility by designing a curriculum learning schedule that begins with images and then gradually learns to attend to increasing temporal context when trained on video datasets through temporal embedding interpolation” at section 1, last paragraph, line 9) comprising:
a video kernel configured to be applied to a plurality of data samples from a set of video data to respectively generate a plurality of video tokens (“The patches x ∈ R^{M×N×3×P×P} are fed through a 2D convolutional layer and the output is flattened, forming a sequence of embeddings z ∈ R^{MN×D} for input to the transformer, where D depends on the number of kernels in the convolutional layer” at section 3.1, Transformer input, line 1), wherein each data sample comprises at least a portion of multiple image frames included in the set of video data (“Given a video containing L frames, we subdivide it into M equal segments where M is the desired number of frames for the video encoder” at section 3.2, last paragraph, line 1); and
a visual transformer configured to process the plurality of video tokens to generate a model output (“Learned temporal and spatial positional embeddings, E^s ∈ R^{N×D}, E^t ∈ R^{M×D}, are added to each input token: z^{(0)}_{p,m} = z_{p,m} + E^s_p + E^t_m, (1) such that all patches within a given frame m (but different spatial locations) are given the same temporal positional embedding E^t_m, and all patches in the same spatial location (but different frames) are given the same spatial positional embedding E^s_p. Thus enabling the model to ascertain the temporal and spatial position of patches” at section 3.1, Transformer input, paragraph 2, line 1); and
instructions that, when executed by the one or more processors, cause the computer system to perform operations, the operations comprising:
processing the set of video data with the machine-learned model to generate the model output (“The video sequence is fed into a stack of space-time transformer blocks. We make a minor modification to the Divided Space-Time attention introduced by [6], by replacing the residual connection between the block input and the temporal attention output with a residual connection between the block input and the spatial attention output, see the appendix for details. Each block sequentially performs temporal self-attention and then spatial self-attention on the output of the previous block. The video clip embedding is obtained from the [CLS] token of the final block.” at section 3.1, Space-time self-attention blocks, line 1);
wherein processing the set of video data with the machine-learned model comprises sparsely applying the video kernel to the set of video data (“At test time, we sample the ith frame in every segment, to get a video embedding vi. The values for i are determined using a stride S, resulting in an array of video embeddings v = [v0, vS, v2S, vM]” at section 3.2, last paragraph, line 5; the stride therefore samples a subset of the total number of image frames, thereby making it sparser than sampling every frame).
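For illustration only, the stride-based sampling cited above can be sketched as follows; this is a minimal hypothetical sketch of the quoted passage, not code from Bain et al., and all names are invented:

```python
# Hypothetical sketch of the cited test-time sampling: a video of L frames
# is divided into M equal segments, and the frame at offset i is taken from
# each segment; offsets advance by a stride S, so only a sparse subset of
# the frames is ever sampled.
def sample_frame_offsets(segment_length: int, stride: int) -> list[int]:
    """Offsets i = 0, S, 2S, ... that fit within one segment."""
    return list(range(0, segment_length, stride))

def sparse_frame_indices(num_frames: int, num_segments: int, offset: int) -> list[int]:
    """Index of the frame at position `offset` inside each of the segments."""
    segment_length = num_frames // num_segments
    return [m * segment_length + offset for m in range(num_segments)]

# A 32-frame video, M = 4 segments, offset i = 0:
print(sparse_frame_indices(32, 4, 0))  # [0, 8, 16, 24]
```

Sampling 4 of 32 frames in this way is the sparseness the rejection relies on: the kernel is applied only to the sampled frames rather than to every frame.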
Regarding claim 4, Bain et al. discloses a system wherein sparsely applying the video kernel to the set of video data comprises directly applying the video kernel to pixel values included in the set of video data (“(i) we propose a new end-to-end model for video retrieval that does not rely on ‘expert’ features, but instead, inspired by [6] employs a transformer architecture with a modified divided spacetime attention applied directly to pixels” at section 1, last paragraph, line 1).
Regarding claim 7, Bain et al. discloses a system wherein sparsely applying the video kernel to the set of video data comprises applying the video kernel starting at a predefined offset point that differs from an origin point of the set of video data (“At test time, we sample the ith frame in every segment, to get a video embedding vi. The values for i are determined using a stride S, resulting in an array of video embeddings v = [v0, vS, v2S, vM]” at section 3.2, last paragraph, line 5; the stride therefore samples frames offset from the beginning frame).
Regarding claim 11, Bain et al. discloses a system wherein the machine-learned model comprises a pre-trained vision encoder (“We jointly pretrain our model on image and video data” at section 4.1, line 1) that has been fine-tuned using a set of video training data (“large motivation for using preextracted expert models for video retrieval is to save computational cost. Finetuning our 4-frame model for 50 epochs on MSR-VTT takes 10 hours on 2 Quadro RTX 6000k GPUs (with 24GB RAM each), which is similar to other works using pre-extracted expert features” at section 4.3, last paragraph).
Regarding claim 12, Bain et al. discloses a system wherein the model output comprises a video classification output (the output constitutes a characterization of the input video; see also the Table 4 description, which discusses classification uses).
Regarding claim 13, Bain et al. discloses a computer-implemented method, the method comprising:
obtaining, by a computing system comprising one or more computing devices, a set of video data and a video label (“DiDeMo [3] contains 10K Flickr videos annotated with 40K sentences” at section 4.2, line 10);
processing, by the computing system, the set of video data with a machine-learned model to generate the model output (“We build on this flexibility by designing a curriculum learning schedule that begins with images and then gradually learns to attend to increasing temporal context when trained on video datasets through temporal embedding interpolation” at section 1, last paragraph, line 9), wherein processing the set of video data with the machine-learned model comprises:
sparsely applying, by the computing system, a video kernel of the machine-learned model to the set of video data to generate a plurality of video tokens (“At test time, we sample the ith frame in every segment, to get a video embedding vi. The values for i are determined using a stride S, resulting in an array of video embeddings v = [v0, vS, v2S, vM]” at section 3.2, last paragraph, line 5; the stride therefore samples a subset of the total number of image frames, thereby making it sparser than sampling every frame), the video kernel having a temporal dimension size of greater than one (“Given a video containing L frames, we subdivide it into M equal segments where M is the desired number of frames for the video encoder” at section 3.2, last paragraph, line 1); and
processing, by the computing system, the plurality of video tokens with a visual transformer of the machine-learned model to generate the model output (“Learned temporal and spatial positional embeddings, E^s ∈ R^{N×D}, E^t ∈ R^{M×D}, are added to each input token: z^{(0)}_{p,m} = z_{p,m} + E^s_p + E^t_m, (1) such that all patches within a given frame m (but different spatial locations) are given the same temporal positional embedding E^t_m, and all patches in the same spatial location (but different frames) are given the same spatial positional embedding E^s_p. Thus enabling the model to ascertain the temporal and spatial position of patches” at section 3.1, Transformer input, paragraph 2, line 1);
evaluating, by the computing system, a loss function that generates a loss value based on the model output and the video label (“We minimise the sum of two losses, video-to-text and text-to-video” at section 3.2, line 4); and
modifying, by the computing system, one or more values of one or more parameters of the machine-learned model based on the loss function (parameters of the learner are adjusted during minimization of the training loss).
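For illustration, Eq. (1) as quoted in the mappings above, which adds a per-frame temporal embedding and a per-location spatial embedding to every patch token, can be sketched as follows; the shapes and values are hypothetical, and this is not code from Bain et al.:

```python
import numpy as np

# Hypothetical sketch of Eq. (1): z^{(0)}_{p,m} = z_{p,m} + E^s_p + E^t_m.
# M frames, N patches per frame, embedding dimension D (illustrative sizes).
M, N, D = 4, 16, 8
rng = np.random.default_rng(0)
z = rng.standard_normal((M, N, D))    # patch tokens z_{p,m}
E_t = rng.standard_normal((M, D))     # temporal embeddings E^t, one per frame m
E_s = rng.standard_normal((N, D))     # spatial embeddings E^s, one per location p
# Broadcasting gives every patch of frame m the same E^t_m, and every patch
# at spatial location p (across all frames) the same E^s_p.
z0 = z + E_t[:, None, :] + E_s[None, :, :]
assert z0.shape == (M, N, D)
```

This matches the quoted description: patches sharing a frame share a temporal embedding, and patches sharing a spatial location share a spatial embedding.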
Regarding claim 14, Bain et al. discloses a method wherein modifying, by the computing system, the one or more values of the one or more parameters of the machine-learned model based on the loss function comprises updating parameter values of the video kernel based on the loss function (parameters of the learner are adjusted during minimization of the training loss).
Regarding claim 19, Bain et al. discloses one or more non-transitory computer-readable media (memory of implied computer) that collectively store:
a machine-learned model (“We build on this flexibility by designing a curriculum learning schedule that begins with images and then gradually learns to attend to increasing temporal context when trained on video datasets through temporal embedding interpolation” at section 1, last paragraph, line 9) comprising:
a video kernel configured to be applied to a plurality of data samples from a set of video data to respectively generate a plurality of video tokens (“The patches x ∈ R^{M×N×3×P×P} are fed through a 2D convolutional layer and the output is flattened, forming a sequence of embeddings z ∈ R^{MN×D} for input to the transformer, where D depends on the number of kernels in the convolutional layer” at section 3.1, Transformer input, line 1), wherein each data sample comprises at least a portion of multiple image frames included in the set of video data (“Given a video containing L frames, we subdivide it into M equal segments where M is the desired number of frames for the video encoder” at section 3.2, last paragraph, line 1); and
a visual transformer configured to process the plurality of video tokens to generate a model output (“Learned temporal and spatial positional embeddings, E^s ∈ R^{N×D}, E^t ∈ R^{M×D}, are added to each input token: z^{(0)}_{p,m} = z_{p,m} + E^s_p + E^t_m, (1) such that all patches within a given frame m (but different spatial locations) are given the same temporal positional embedding E^t_m, and all patches in the same spatial location (but different frames) are given the same spatial positional embedding E^s_p. Thus enabling the model to ascertain the temporal and spatial position of patches” at section 3.1, Transformer input, paragraph 2, line 1); and
instructions that, when executed by the one or more processors, cause the computer system to perform operations, the operations comprising:
processing the set of video data with the machine-learned model to generate the model output (“The video sequence is fed into a stack of space-time transformer blocks. We make a minor modification to the Divided Space-Time attention introduced by [6], by replacing the residual connection between the block input and the temporal attention output with a residual connection between the block input and the spatial attention output, see the appendix for details. Each block sequentially performs temporal self-attention and then spatial self-attention on the output of the previous block. The video clip embedding is obtained from the [CLS] token of the final block.” at section 3.1, Space-time self-attention blocks, line 1);
wherein processing the set of video data with the machine-learned model comprises sparsely applying the video kernel to the set of video data (“At test time, we sample the ith frame in every segment, to get a video embedding vi. The values for i are determined using a stride S, resulting in an array of video embeddings v = [v0, vS, v2S, vM]” at section 3.2, last paragraph, line 5; the stride therefore samples a subset of the total number of image frames, thereby making it sparser than sampling every frame).
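For illustration, the patch-to-token step quoted in the mappings above (a 2D convolution over M×N patches, flattened into an MN×D token sequence) can be sketched as follows; a plain linear map stands in for the convolutional layer, and all shapes are hypothetical, not taken from Bain et al.:

```python
import numpy as np

# Hypothetical sketch: M×N patches of size 3×P×P are projected to D
# dimensions and flattened into a sequence z of MN tokens (a linear map
# stands in for the cited 2D convolutional layer, whose kernel count sets D).
M, N, P, D = 2, 4, 16, 32
rng = np.random.default_rng(0)
patches = rng.standard_normal((M, N, 3, P, P))   # x ∈ R^{M×N×3×P×P}
W = rng.standard_normal((3 * P * P, D))          # projection to D dimensions
z = patches.reshape(M * N, 3 * P * P) @ W        # token sequence z ∈ R^{MN×D}
assert z.shape == (M * N, D)
```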
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Bain et al. and Wang et al. (“Transformers Meet Visual Learning Understanding: A Comprehensive Review”).
Bain et al. discloses a system as described in claim 1 above.
Bain et al. does not explicitly disclose generating a plurality of fixed sine positional embeddings respectively for the plurality of video tokens, wherein the fixed sine positional embedding for each token indicates a center of the video kernel relative to the set of video data.
Wang et al. teaches a system in the same field of endeavor, wherein processing the set of video data with the machine-learned model further comprises generating a plurality of fixed sine positional embeddings respectively for the plurality of video tokens, wherein the fixed sine positional embedding for each token indicates a center of the video kernel relative to the set of video data (“In Transformer, sine and cosine functions are mainly used for position encoding. The specific coding method is formulated as Eq. 5. PE(pos, 2i) = sin(pos/10000^{2i/d_model}); PE(pos, 2i+1) = cos(pos/10000^{2i/d_model}); (5) where pos represents the position, and i means the dimension. Each dimension of the position code corresponds to a sine curve.” at section III.C).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to utilize the position encoding as taught by Wang et al. for the embedding of Bain et al. as a way to represent the patch locations.
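For illustration, the sine/cosine encoding of Eq. (5) as quoted above can be sketched as follows (a minimal hypothetical sketch, not code from Wang et al.):

```python
import math

# Hypothetical sketch of Eq. (5): PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
# and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
def positional_encoding(pos: int, d_model: int) -> list[float]:
    pe = [0.0] * d_model
    for dim in range(0, d_model, 2):          # dim plays the role of 2i
        angle = pos / (10000 ** (dim / d_model))
        pe[dim] = math.sin(angle)
        if dim + 1 < d_model:
            pe[dim + 1] = math.cos(angle)
    return pe

# At pos = 0 the sine entries are 0 and the cosine entries are 1:
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

These embeddings contain no learned parameters, which is the "fixed" property the rejection maps to this teaching.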
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Bain et al. and Liu et al. (“TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval”).
Bain et al. discloses a system as described in claim 1 above.
Bain et al. does not explicitly disclose that each of the plurality of data samples comprises data for only a subset of a number of channels in a channel dimension of the set of video data, and wherein at least one of the plurality of tokens is generated by concatenation along a channel dimension for two temporally-displaced data samples.
Liu et al. teaches a system in the same field of endeavor of transformer-based classification, wherein each of the plurality of data samples comprises data for only a subset of a number of channels in a channel dimension of the set of video data, and wherein at least one of the plurality of tokens is generated by concatenation along a channel dimension for two temporally-displaced data samples (“In this work, we propose the token selection transformer by inserting a token selection module, which aims to select informative tokens per frame, especially those tokens containing salient semantics of objects, for video feature aggregation. As shown in Fig. 4, top-K informative tokens are selected via the trainable token selection module every frame. The input of the token selection module is a sequence of tokens of each frame I = {p_cls, p_0, p_1, . . . , p_{n−1}} ∈ R^{(N+1)×C}. We first apply an MLP over I for channel dimension reduction and output I′ = {p′_cls, p′_0, p′_1, . . . , p′_{n−1}} ∈ R^{(N+1)×C/2}. We then use p′_cls as a global frame feature and concatenate it with each local token p′_i, p̂_i = [p′_cls, p′_i], 0 ≤ i < N. We finally feed all the concatenated token features to another MLP followed by a Softmax layer to predict the importance scores” at section 3.2, paragraph 2).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate token selection as taught by Liu et al. in the system of Bain et al. to avoid token redundancy and preserve the most relevant information (see Liu et al. at section 3.2).
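For illustration, the token selection module quoted above (channel reduction, concatenation of the global cls feature with each local token, and softmax scoring for top-K selection) can be sketched as follows; the weights are random and all names are hypothetical, not code from Liu et al.:

```python
import numpy as np

# Hypothetical sketch of the cited token selection: reduce channels, pair
# the global cls feature with each local token along the channel dimension,
# score the pairs with a softmax, and keep the top-K tokens.
rng = np.random.default_rng(0)
N, C, K = 6, 8, 3
tokens = rng.standard_normal((N + 1, C))         # [p_cls, p_0, ..., p_{N-1}]
W_reduce = rng.standard_normal((C, C // 2))      # stands in for the first MLP
reduced = tokens @ W_reduce                      # channel dimension reduction
cls_feat, local = reduced[0], reduced[1:]
# Concatenate p'_cls with each local token p'_i along the channel dimension.
paired = np.concatenate(
    [np.broadcast_to(cls_feat, local.shape), local], axis=1
)                                                # shape (N, C)
W_score = rng.standard_normal((C, 1))            # stands in for the scoring MLP
logits = (paired @ W_score).ravel()
scores = np.exp(logits) / np.exp(logits).sum()   # softmax importance scores
top_k = np.argsort(scores)[-K:]                  # K most informative tokens
assert top_k.shape == (K,)
```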
Claim(s) 15 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Bain et al. and Bao et al. (“BEIT: BERT Pre-Training of Image Transformers”).
Regarding claim 15, Bain et al. discloses a system as described in claim 1 above.
Bain et al. does not explicitly disclose importing the video kernel to a larger pre-trained image transformer.
Bao et al. teaches a method in the same field of endeavor of transformer-based image classification, comprising importing the video kernel to a larger pre-trained image transformer (“Overview of BEIT pre-training. Before pre-training, we learn an “image tokenizer” via autoencoding-style reconstruction, where an image is tokenized into discrete visual tokens according to the learned vocabulary. During pre-training, each image has two views, i.e., image patches, and visual tokens. We randomly mask some proportion of image patches (gray patches in the figure) and replace them with a special mask embedding [M]. Then the patches are fed to a backbone vision Transformer” at Figure 1 description).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate the kernels of Bain et al. into a pre-trained transformer as taught by Bao et al. to speed up the overall training of the system.
Regarding claim 16, the Bain et al. and Bao et al. combination discloses a method wherein modifying, by the computing system, the one or more values of the one or more parameters of the machine-learned model based on the loss function comprises finetuning one or more layers of a pre-trained image transformer while holding one or more other layers of the pre-trained image transformer fixed (“After pre-training BEIT, we append a task layer upon the Transformer, and fine-tune the parameters on downstream tasks, like BERT” Bao et al. at section 2.6, line 1).
Allowable Subject Matter
Claims 2, 3, 5, 6, 8, 17 and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim 20 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter: the prior art does not disclose applying the video kernel with a spatial stride greater than the spatial dimension size of the video kernel to achieve spatial sparseness as required by claim 2; applying the video kernel with a temporal stride greater than the temporal dimension size of the video kernel to achieve temporal sparseness as required by claim 3; one or more image kernels configured to be applied to an individual image frame of the set of video data to generate a plurality of image tokens from the individual image frame as required by claim 5; a second kernel configured to be applied to a second set of data samples from the set of video data wherein at least one of the second set of data samples is overlapping with at least one of the plurality of data samples to which the video kernel is applied as required by claim 8; one or more image kernels configured to be applied to an individual image frame of the set of video data to generate a plurality of image tokens from the individual image frame, and wherein the machine-learned model comprises a single visual transformer configured to jointly process both the plurality of video tokens and the plurality of image tokens to generate the model output as required by claim 17; applying the video kernel with a spatial stride greater than the spatial dimension size of the video kernel to achieve spatial sparseness; or applying the video kernel with a temporal stride greater than the temporal dimension size of the video kernel to achieve temporal sparseness as required by claim 20.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KATRINA R FUJITA whose telephone number is (571) 270-1574. The examiner can normally be reached Monday - Friday, 9:30 am - 5:30 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz, can be reached at (571) 272-3638. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KATRINA R FUJITA/Primary Examiner, Art Unit 2672