Prosecution Insights
Last updated: April 19, 2026
Application No. 18/694,604

VIDEO-TEXT MODELING WITH ZERO-SHOT TRANSFER FROM CONTRASTIVE CAPTIONERS

Status: Non-Final OA (§103)
Filed: Mar 22, 2024
Examiner: PATEL, JAYESH A
Art Unit: 2677
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 1 (Non-Final)

Grant Probability: 83% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 0m
With Interview: 88%

Examiner Intelligence

Career Allow Rate: 83%, above average (739 granted / 887 resolved; +21.3% vs TC avg)
Interview Lift: +5.2% for resolved cases with interview (moderate lift)
Typical Timeline: 3y 0m average prosecution; 33 applications currently pending
Career History: 920 total applications across all art units
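The headline allow-rate figure follows directly from the raw counts in this panel. A minimal sketch of the arithmetic, as a plausible reconstruction of how the dashboard derives it (the rounding convention is an assumption):

```python
# Hypothetical reconstruction of the dashboard's career allow rate.
granted = 739                     # career grants (from the panel above)
resolved = 887                    # career resolved cases
allow_rate = granted / resolved   # ~0.833, displayed as 83%
print(f"Career allow rate: {allow_rate:.1%}")
```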

Statute-Specific Performance

§101: 11.1% (-28.9% vs TC avg)
§103: 40.9% (+0.9% vs TC avg)
§102: 14.5% (-25.5% vs TC avg)
§112: 25.0% (-15.0% vs TC avg)

Deltas are measured against a Tech Center average estimate. Based on career data from 887 resolved cases.
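Backing the Tech Center average out of each row shows that all four deltas imply the same baseline, which suggests the tool compares every statute against a single TC-wide estimate rather than per-statute averages. A small check of that reading (the interpretation is an assumption, not stated by the tool):

```python
# Each "vs TC avg" delta should equal examiner rate minus the TC baseline,
# so the implied baseline per statute is: examiner rate - delta.
examiner_rate = {"101": 11.1, "103": 40.9, "102": 14.5, "112": 25.0}
vs_tc_delta   = {"101": -28.9, "103": 0.9, "102": -25.5, "112": -15.0}

implied_tc_avg = {s: round(examiner_rate[s] - vs_tc_delta[s], 1)
                  for s in examiner_rate}
print(implied_tc_avg)  # every statute backs out the same 40.0% baseline
```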

Office Action

§103

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 U.S.C. § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

    A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 5-9, 11 and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over NPL1 (CoCa: Contrastive Captioners are Image-Text Foundation Models, Jiahui Yu et al., arXiv, 14 Jun 2022, pages 1-19), hereafter NPL1. (This is a single-reference §103: a computing system comprising one or more computing devices would be obvious in view of the abstract, which discloses "By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead", and page 5, section "Pretraining efficiency".)
Regarding claim 1, NPL1 discloses a computer-implemented method for performing a video understanding task with improved computational efficiency (see introduction and Fig. 1), the method comprising:

accessing, by a computing system comprising one or more computing devices, a pre-trained image-text processing model (Fig. 1, sections 3.2, 3.3 and 4.1; a computing system comprising one or more computing devices would be obvious in view of the abstract, which discloses "By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead", and page 5, section "Pretraining efficiency"), wherein the pre-trained image-text processing model comprises one or more pre-trained attentional pooling layers having a number of parameters (Figs. 1-2, sections 3.2 and 3.3, where λCon and λCap are loss-weighting hyper-parameters), and wherein the pre-trained image-text processing model has been pre-trained on a joint contrastive and generative image captioning loss function (see abstract, Figs. 1-2, and sections 1, 4.1 and 4.2.3: "We propose a simple model family named Contrastive Captioners (CoCa) with a modified encoder-decoder architecture trained with both contrastive loss and captioning (generative image captioning) loss");

obtaining, by the computing system, an input video that comprises a plurality of image frames (Fig. 3 shows an input video that comprises a plurality of image frames);

processing, by the computing system, the input video with the pre-trained image-text processing model having the one or more pre-trained attentional pooling layers (Figs. 1-2 show a pre-trained model with one or more pre-trained attentional pooling layers) having the same number of parameters to generate, as an output of the pre-trained image-text processing model, a prediction for the video understanding task (Figs. 1-3, sections 1, 3.2, 3.3 and 4.2.3 disclose an output, i.e., a prediction, for a video recognition task: "a learned CoCa model for video action recognition tasks"); and

providing, by the computing system, the prediction for the video understanding task as an output (Figs. 1-3, sections 1, 3.2, 3.3 and 4.2.3: "a learned CoCa model for video action recognition tasks").

Before the effective filing date of the claimed invention, a computing system comprising one or more computing devices would have been obvious and within the level of ordinary skill in the art in view of the disclosure in NPL1 ("By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead" in the abstract; page 5, section "Pretraining efficiency"). The suggestion/motivation would be an efficient, minimal-overhead method/system (see page 1, abstract).
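The rejection repeatedly leans on NPL1's joint objective and its loss-weighting hyper-parameters λCon and λCap. For readers mapping the claim language to the reference, below is a minimal sketch of that weighted-sum objective, L = λCon·L_contrastive + λCap·L_captioning; the symmetric InfoNCE formulation and the default weight values are illustrative assumptions, not quotations from NPL1:

```python
import numpy as np

def joint_coca_loss(z_img, z_txt, caption_nll, lam_con=1.0, lam_cap=1.0):
    """CoCa-style joint loss: lam_con * L_contrastive + lam_cap * L_captioning.

    z_img, z_txt: (batch, d) L2-normalized image/text embeddings
    caption_nll:  precomputed captioning (autoregressive NLL) loss value
    """
    logits = z_img @ z_txt.T            # (batch, batch) pairwise similarities
    labels = np.arange(len(z_img))      # matched pairs sit on the diagonal

    def xent(lg):
        # cross-entropy of each row's softmax against the diagonal target
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # symmetric InfoNCE: image-to-text plus text-to-image directions
    l_con = 0.5 * (xent(logits) + xent(logits.T))
    return lam_con * l_con + lam_cap * caption_nll
```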
Regarding claim 2, NPL1 discloses the computer-implemented method of claim 1, wherein: the pre-trained image-text processing model comprises a pre-trained unimodal image encoder configured to process an input image to generate one or more frame embeddings (abstract, Figs. 1-2); the one or more pre-trained attentional pooling layers are configured to process the one or more frame embeddings to generate one or more contrastive embeddings and one or more generative embeddings (abstract, Fig. 2 and page 12); and processing, by the computing system, the input video with the pre-trained image-text processing model comprises: separately processing each of the plurality of image frames with the pre-trained unimodal image encoder to generate a plurality of frame embeddings respectively for the plurality of image frames (Figs. 1-3); combining the plurality of frame embeddings to form a set of combined frame embeddings; and processing the set of combined frame embeddings with the one or more attentional layers to generate one or more generative embeddings and one or more contrastive embeddings (Figs. 1-3 and pages 2-4, 6).

Regarding claim 5, NPL1 discloses the computer-implemented method of claim 1, wherein the parameters of the one or more pre-trained attentional pooling layers of the pre-trained image-text processing model have been held fixed after said pre-training on the joint contrastive and generative image captioning loss function (Fig. 2, Algorithm 1, page 3, section 3.1).

Regarding claim 6, NPL1 discloses the computer-implemented method of claim 5, wherein the video understanding task comprises a zero-shot video understanding task (abstract, Figs. 1-3, pages 1-4 and 6, sections 1, 3 and 3.3).

Regarding claim 7, NPL1 discloses the computer-implemented method of claim 5, wherein the pre-trained image-text processing model has been trained only on training data comprising only still images (Figs. 1 and 3 and section 3.3 show pretraining using still images, meeting the claim limitations).
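Claim 2, as mapped above, recites per-frame encoding followed by attentional pooling of the combined frame embeddings into contrastive and generative embeddings. A minimal single-head sketch of attentional pooling with learned queries is given below; the shapes, names, and single-head simplification are illustrative assumptions, not taken from NPL1 or the Office Action:

```python
import numpy as np

def attentional_pool(frame_emb, queries):
    """Pool a set of frame embeddings with learned query vectors.

    frame_emb: (n_frames, d) combined per-frame embeddings
    queries:   (n_queries, d) learned pooler queries (e.g. one for a
               contrastive embedding, several for generative embeddings)
    Returns:   (n_queries, d) pooled embeddings.
    """
    d = frame_emb.shape[-1]
    scores = queries @ frame_emb.T / np.sqrt(d)           # (n_q, n_frames)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ frame_emb                            # (n_q, d)

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))   # 8 frames, embedding dim 16
q = rng.standard_normal((3, 16))        # e.g. 1 contrastive + 2 generative queries
pooled = attentional_pool(frames, q)
print(pooled.shape)  # (3, 16)
```

Because the pooler's parameters live entirely in the queries (and, in practice, projection layers), freezing or finetuning them maps directly onto the claim 5 / claim 9 distinction discussed above.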
Regarding claim 8, NPL1 discloses the computer-implemented method of claim 1, wherein an entirety of parameters of the pre-trained image-text processing model have been held fixed after said pre-training on the joint contrastive and generative image captioning loss function (Fig. 2, Algorithm 1, page 3, section 3.1).

Regarding claim 9, NPL1 discloses the computer-implemented method of claim 1, wherein the parameters of the one or more pre-trained attentional pooling layers have been further finetuned after said pre-training on the joint contrastive and generative image captioning loss function (Fig. 1 shows fine-tuning after the pretraining; Figs. 1-2, sections 2 and 3.3).
Regarding claim 11, NPL1 discloses the computer-implemented method of claim 1, wherein an entirety of parameters of the pre-trained image-text processing model have been further finetuned after said pre-training on the joint contrastive and generative image captioning loss function (Figs. 1 and 3 show fine-tuning after the pretraining; Figs. 1-2, sections 2 and 3.3).

Regarding claim 15, NPL1 discloses the computer-implemented method of claim 1, wherein the pre-trained image-text processing model comprises a decoder configured to process embeddings generated by the one or more attentional layers to generate a text output (Figs. 1-2).

Regarding claim 16, NPL1 discloses the computer-implemented method of claim 1, wherein the pre-trained image-text processing model further comprises a unimodal text decoder and wherein processing the input video comprises processing a set of input text associated with the input video using the unimodal text decoder (Figs. 1-3).

Regarding claim 17, NPL1 discloses the computer-implemented method of claim 1, wherein the video understanding task comprises a video classification task (Figs. 1-3, Table 5: zero-shot video-text retrieval, i.e., a video classification task).
Regarding claim 18, NPL1 discloses the computer-implemented method of claim 1, wherein the video understanding task comprises a video question answering task (section 4.2.3 discloses VQA (visual/video question answering), meeting the claim limitations).

Regarding claim 19, NPL1 discloses the computer-implemented method of claim 1, wherein the video understanding task comprises a video captioning task (Figs. 1 and 3 disclose and show an image/video captioning task, meeting the claim limitations).

Claim 20 is the corresponding one-or-more-non-transitory-computer-readable-media claim of claim 1; see the corresponding explanation of claim 1. Examiner notes that one or more non-transitory computer-readable media as claimed in claim 20 would be obvious and within the level of ordinary skill in the art from the disclosure of NPL1 in the abstract ("By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead"), page 5, section "Pretraining efficiency", and "Algorithm 1" in Fig. 2.

Examiner's Note: Examiner has cited figures and paragraphs in the references as applied to the claims above for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claims, other passages and figures may apply as well. Applicant is respectfully requested, in preparing the responses, to fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passages as taught by the prior art or cited by the examiner. Examiner has also cited references in the PTO-892 that are not relied upon but are relevant and pertinent to the applicant's disclosure, and may also read (as anticipatory or rendering obvious) on the claims and claimed limitations.
Applicant is advised to consider the references in preparing the response/amendments in order to expedite the prosecution.

Allowable Subject Matter

Claims 3-4, 10 and 12-14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAYESH PATEL, whose telephone number is (571) 270-1227. The examiner can normally be reached Mon-Fri. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Bee, can be reached at 571-270-5183. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/JAYESH A PATEL/
Primary Examiner, Art Unit 2677

Prosecution Timeline

Mar 22, 2024
Application Filed
Jan 24, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597170: METHOD AND APPARATUS FOR IMMERSIVE VIDEO ENCODING AND DECODING, AND METHOD FOR TRANSMITTING A BITSTREAM GENERATED BY THE IMMERSIVE VIDEO ENCODING METHOD
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12579770: DETECTION SYSTEM, DETECTION METHOD, AND NON-TRANSITORY STORAGE MEDIUM
Granted Mar 17, 2026 (2y 5m to grant)

Patent 12561949: CONDITIONAL PROCEDURAL MODEL GENERATION
Granted Feb 24, 2026 (2y 5m to grant)

Patent 12555346: Automatic Working System, Automatic Walking Device and Control Method Therefor, and Computer-Readable Storage Medium
Granted Feb 17, 2026 (2y 5m to grant)

Patent 12536636: METHOD AND SYSTEM FOR EVALUATING QUALITY OF A DOCUMENT
Granted Jan 27, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
With Interview (+5.2%): 88%
Median Time to Grant: 3y 0m
PTA Risk: Low
Based on 887 resolved cases by this examiner. Grant probability derived from career allow rate.
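The "with interview" figure appears to be the base grant probability plus the examiner's observed interview lift. A sketch of that arithmetic, as a plausible reconstruction of how the projection is displayed (the additive model is an assumption):

```python
# Hypothetical reconstruction of the "with interview" projection.
base_grant_prob = 83.0   # % grant probability (career allow rate)
interview_lift = 5.2     # % observed interview lift for this examiner
with_interview = base_grant_prob + interview_lift  # 88.2, displayed as 88%
print(f"{with_interview:.0f}%")
```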
