Prosecution Insights
Last updated: April 19, 2026
Application No. 18/694,604

VIDEO-TEXT MODELING WITH ZERO-SHOT TRANSFER FROM CONTRASTIVE CAPTIONERS

Status: Non-Final OA (§103)
Filed: Mar 22, 2024
Examiner: PATEL, JAYESH A
Art Unit: 2677
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 1 (Non-Final)

Grant Probability: 83% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 0m
With Interview: 88%

Examiner Intelligence

Career Allow Rate: 83%, above average (739 granted / 887 resolved; +21.3% vs TC avg)
Interview Lift: +5.2% for resolved cases with interview (moderate lift)
Typical Timeline: 3y 0m average prosecution; 33 applications currently pending
Career History: 920 total applications across all art units
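The headline allow-rate figure follows directly from the raw counts in this panel. A minimal sketch of the arithmetic, as a plausible reconstruction of how the dashboard derives it (the rounding convention is an assumption):

```python
# Hypothetical reconstruction of the dashboard's career allow rate.
granted = 739                     # career grants (from the panel above)
resolved = 887                    # career resolved cases
allow_rate = granted / resolved   # ~0.833, displayed as 83%
print(f"Career allow rate: {allow_rate:.1%}")
```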

Statute-Specific Performance

§101: 11.1% (-28.9% vs TC avg)
§103: 40.9% (+0.9% vs TC avg)
§102: 14.5% (-25.5% vs TC avg)
§112: 25.0% (-15.0% vs TC avg)

Deltas are measured against a Tech Center average estimate. Based on career data from 887 resolved cases.
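Backing the Tech Center average out of each row shows that all four deltas imply the same baseline, which suggests the tool compares every statute against a single TC-wide estimate rather than per-statute averages. A small check of that reading (the interpretation is an assumption, not stated by the tool):

```python
# Each "vs TC avg" delta should equal examiner rate minus the TC baseline,
# so the implied baseline per statute is: examiner rate - delta.
examiner_rate = {"101": 11.1, "103": 40.9, "102": 14.5, "112": 25.0}
vs_tc_delta   = {"101": -28.9, "103": 0.9, "102": -25.5, "112": -15.0}

implied_tc_avg = {s: round(examiner_rate[s] - vs_tc_delta[s], 1)
                  for s in examiner_rate}
print(implied_tc_avg)  # every statute backs out the same 40.0% baseline
```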

Office Action

§103

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 U.S.C. § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

    A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 5-9, 11 and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over NPL1 (CoCa: Contrastive Captioners are Image-Text Foundation Models, Jiahui Yu et al., arXiv, 14 Jun 2022, pages 1-19), hereafter NPL1. (This is a single-reference §103: a computing system comprising one or more computing devices would be obvious in view of the abstract, which discloses "By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead", and page 5, section "Pretraining efficiency".)
Regarding claim 1, NPL1 discloses a computer-implemented method for performing a video understanding task with improved computational efficiency (see introduction and Fig. 1), the method comprising:

accessing, by a computing system comprising one or more computing devices, a pre-trained image-text processing model (Fig. 1, sections 3.2, 3.3 and 4.1; a computing system comprising one or more computing devices would be obvious in view of the abstract, which discloses "By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead", and page 5, section "Pretraining efficiency"), wherein the pre-trained image-text processing model comprises one or more pre-trained attentional pooling layers having a number of parameters (Figs. 1-2, sections 3.2 and 3.3, where λCon and λCap are loss-weighting hyper-parameters), and wherein the pre-trained image-text processing model has been pre-trained on a joint contrastive and generative image captioning loss function (see abstract, Figs. 1-2, and sections 1, 4.1 and 4.2.3: "We propose a simple model family named Contrastive Captioners (CoCa) with a modified encoder-decoder architecture trained with both contrastive loss and captioning (generative image captioning) loss");

obtaining, by the computing system, an input video that comprises a plurality of image frames (Fig. 3 shows an input video that comprises a plurality of image frames);

processing, by the computing system, the input video with the pre-trained image-text processing model having the one or more pre-trained attentional pooling layers (Figs. 1-2 show a pre-trained model with one or more pre-trained attentional pooling layers) having the same number of parameters to generate, as an output of the pre-trained image-text processing model, a prediction for the video understanding task (Figs. 1-3, sections 1, 3.2, 3.3 and 4.2.3 disclose an output, i.e., a prediction, for a video recognition task: "a learned CoCa model for video action recognition tasks"); and

providing, by the computing system, the prediction for the video understanding task as an output (Figs. 1-3, sections 1, 3.2, 3.3 and 4.2.3: "a learned CoCa model for video action recognition tasks").

Before the effective filing date of the claimed invention, a computing system comprising one or more computing devices would have been obvious and within the level of ordinary skill in the art in view of the disclosure in NPL1 ("By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead" in the abstract; page 5, section "Pretraining efficiency"). The suggestion/motivation would be an efficient, minimal-overhead method/system (see page 1, abstract).
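The rejection repeatedly leans on NPL1's joint objective and its loss-weighting hyper-parameters λCon and λCap. For readers mapping the claim language to the reference, below is a minimal sketch of that weighted-sum objective, L = λCon·L_contrastive + λCap·L_captioning; the symmetric InfoNCE formulation and the default weight values are illustrative assumptions, not quotations from NPL1:

```python
import numpy as np

def joint_coca_loss(z_img, z_txt, caption_nll, lam_con=1.0, lam_cap=1.0):
    """CoCa-style joint loss: lam_con * L_contrastive + lam_cap * L_captioning.

    z_img, z_txt: (batch, d) L2-normalized image/text embeddings
    caption_nll:  precomputed captioning (autoregressive NLL) loss value
    """
    logits = z_img @ z_txt.T            # (batch, batch) pairwise similarities
    labels = np.arange(len(z_img))      # matched pairs sit on the diagonal

    def xent(lg):
        # cross-entropy of each row's softmax against the diagonal target
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # symmetric InfoNCE: image-to-text plus text-to-image directions
    l_con = 0.5 * (xent(logits) + xent(logits.T))
    return lam_con * l_con + lam_cap * caption_nll
```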
Regarding claim 2, NPL1 discloses the computer-implemented method of claim 1, wherein: the pre-trained image-text processing model comprises a pre-trained unimodal image encoder configured to process an input image to generate one or more frame embeddings (abstract, Figs. 1-2); the one or more pre-trained attentional pooling layers are configured to process the one or more frame embeddings to generate one or more contrastive embeddings and one or more generative embeddings (abstract, Fig. 2 and page 12); and processing, by the computing system, the input video with the pre-trained image-text processing model comprises: separately processing each of the plurality of image frames with the pre-trained unimodal image encoder to generate a plurality of frame embeddings respectively for the plurality of image frames (Figs. 1-3); combining the plurality of frame embeddings to form a set of combined frame embeddings; and processing the set of combined frame embeddings with the one or more attentional layers to generate one or more generative embeddings and one or more contrastive embeddings (Figs. 1-3 and pages 2-4, 6).

Regarding claim 5, NPL1 discloses the computer-implemented method of claim 1, wherein the parameters of the one or more pre-trained attentional pooling layers of the pre-trained image-text processing model have been held fixed after said pre-training on the joint contrastive and generative image captioning loss function (Fig. 2, Algorithm 1, page 3, section 3.1).

Regarding claim 6, NPL1 discloses the computer-implemented method of claim 5, wherein the video understanding task comprises a zero-shot video understanding task (abstract, Figs. 1-3, pages 1-4 and 6, sections 1, 3 and 3.3).

Regarding claim 7, NPL1 discloses the computer-implemented method of claim 5, wherein the pre-trained image-text processing model has been trained only on training data comprising only still images (Figs. 1 and 3 and section 3.3 show pretraining using still images, meeting the claim limitations).
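Claim 2, as mapped above, recites per-frame encoding followed by attentional pooling of the combined frame embeddings into contrastive and generative embeddings. A minimal single-head sketch of attentional pooling with learned queries is given below; the shapes, names, and single-head simplification are illustrative assumptions, not taken from NPL1 or the Office Action:

```python
import numpy as np

def attentional_pool(frame_emb, queries):
    """Pool a set of frame embeddings with learned query vectors.

    frame_emb: (n_frames, d) combined per-frame embeddings
    queries:   (n_queries, d) learned pooler queries (e.g. one for a
               contrastive embedding, several for generative embeddings)
    Returns:   (n_queries, d) pooled embeddings.
    """
    d = frame_emb.shape[-1]
    scores = queries @ frame_emb.T / np.sqrt(d)           # (n_q, n_frames)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ frame_emb                            # (n_q, d)

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))   # 8 frames, embedding dim 16
q = rng.standard_normal((3, 16))        # e.g. 1 contrastive + 2 generative queries
pooled = attentional_pool(frames, q)
print(pooled.shape)  # (3, 16)
```

Because the pooler's parameters live entirely in the queries (and, in practice, projection layers), freezing or finetuning them maps directly onto the claim 5 / claim 9 distinction discussed above.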
Regarding claim 8, NPL1 discloses the computer-implemented method of claim 1, wherein an entirety of parameters of the pre-trained image-text processing model have been held fixed after said pre-training on the joint contrastive and generative image captioning loss function (Fig. 2, Algorithm 1, page 3, section 3.1).

Regarding claim 9, NPL1 discloses the computer-implemented method of claim 1, wherein the parameters of the one or more pre-trained attentional pooling layers have been further finetuned after said pre-training on the joint contrastive and generative image captioning loss function (Fig. 1 shows fine-tuning after the pretraining; Figs. 1-2, sections 2 and 3.3).
Regarding claim 11, NPL1 discloses the computer-implemented method of claim 1, wherein an entirety of parameters of the pre-trained image-text processing model have been further finetuned after said pre-training on the joint contrastive and generative image captioning loss function (Figs. 1 and 3 show fine-tuning after the pretraining; Figs. 1-2, sections 2 and 3.3).

Regarding claim 15, NPL1 discloses the computer-implemented method of claim 1, wherein the pre-trained image-text processing model comprises a decoder configured to process embeddings generated by the one or more attentional layers to generate a text output (Figs. 1-2).

Regarding claim 16, NPL1 discloses the computer-implemented method of claim 1, wherein the pre-trained image-text processing model further comprises a unimodal text decoder and wherein processing the input video comprises processing a set of input text associated with the input video using the unimodal text decoder (Figs. 1-3).

Regarding claim 17, NPL1 discloses the computer-implemented method of claim 1, wherein the video understanding task comprises a video classification task (Figs. 1-3, Table 5: zero-shot video-text retrieval, i.e., a video classification task).
Regarding claim 18, NPL1 discloses the computer-implemented method of claim 1, wherein the video understanding task comprises a video question answering task (section 4.2.3 discloses VQA (visual/video question answering), meeting the claim limitations).

Regarding claim 19, NPL1 discloses the computer-implemented method of claim 1, wherein the video understanding task comprises a video captioning task (Figs. 1 and 3 disclose and show an image/video captioning task, meeting the claim limitations).

Claim 20 is the corresponding one-or-more-non-transitory-computer-readable-media claim of claim 1; see the corresponding explanation of claim 1. Examiner notes that one or more non-transitory computer-readable media as claimed in claim 20 would be obvious and within the level of ordinary skill in the art from the disclosure of NPL1 in the abstract ("By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead"), page 5, section "Pretraining efficiency", and "Algorithm 1" in Fig. 2.

Examiner's Note: Examiner has cited figures and paragraphs in the references as applied to the claims above for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claims, other passages and figures may apply as well. Applicant is respectfully requested, in preparing the responses, to fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passages as taught by the prior art or cited by the examiner. Examiner has also cited references in the PTO-892 that are not relied upon but are relevant and pertinent to the applicant's disclosure, and may also read (as anticipatory or rendering obvious) on the claims and claimed limitations.
Applicant is advised to consider the references in preparing the response/amendments in order to expedite the prosecution.

Allowable Subject Matter

Claims 3-4, 10 and 12-14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAYESH PATEL, whose telephone number is (571) 270-1227. The examiner can normally be reached Mon-Fri. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Bee, can be reached at 571-270-5183. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/JAYESH A PATEL/
Primary Examiner, Art Unit 2677

Prosecution Timeline

Mar 22, 2024
Application Filed
Jan 24, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597170: METHOD AND APPARATUS FOR IMMERSIVE VIDEO ENCODING AND DECODING, AND METHOD FOR TRANSMITTING A BITSTREAM GENERATED BY THE IMMERSIVE VIDEO ENCODING METHOD
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12579770: DETECTION SYSTEM, DETECTION METHOD, AND NON-TRANSITORY STORAGE MEDIUM
Granted Mar 17, 2026 (2y 5m to grant)

Patent 12561949: CONDITIONAL PROCEDURAL MODEL GENERATION
Granted Feb 24, 2026 (2y 5m to grant)

Patent 12555346: Automatic Working System, Automatic Walking Device and Control Method Therefor, and Computer-Readable Storage Medium
Granted Feb 17, 2026 (2y 5m to grant)

Patent 12536636: METHOD AND SYSTEM FOR EVALUATING QUALITY OF A DOCUMENT
Granted Jan 27, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
With Interview (+5.2%): 88%
Median Time to Grant: 3y 0m
PTA Risk: Low
Based on 887 resolved cases by this examiner. Grant probability derived from career allow rate.
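The "with interview" figure appears to be the base grant probability plus the examiner's observed interview lift. A sketch of that arithmetic, as a plausible reconstruction of how the projection is displayed (the additive model is an assumption):

```python
# Hypothetical reconstruction of the "with interview" projection.
base_grant_prob = 83.0   # % grant probability (career allow rate)
interview_lift = 5.2     # % observed interview lift for this examiner
with_interview = base_grant_prob + interview_lift  # 88.2, displayed as 88%
print(f"{with_interview:.0f}%")
```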
