Prosecution Insights
Last updated: April 19, 2026
Application No. 18/980,001

METHOD AND SYSTEM USING DIVERSE CAPTIONS FOR IMPROVING LONG VIDEO RETRIEVAL

Status: Non-Final OA (§103)
Filed: Dec 13, 2024
Examiner: PEREZ-ARROYO, RAQUEL
Art Unit: 2169
Tech Center: 2100 — Computer Architecture & Software
Assignee: SRI International
OA Round: 3 (Non-Final)
Grant Probability: 58% (Moderate)
OA Rounds: 3-4
To Grant: 3y 5m
With Interview: 90%

Examiner Intelligence

Career Allow Rate: 58% (171 granted / 296 resolved; +2.8% vs TC avg)
Interview Lift: +32.3% (strong), comparing resolved cases with an interview to those without
Avg Prosecution: 3y 5m typical timeline; 28 applications currently pending
Career History: 324 total applications across all art units
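
A minimal sketch of how the headline figures above appear to fit together. The allow-rate division comes straight from the counts shown; the additive interview adjustment is an assumption inferred from the displayed numbers (58% + 32.3% ≈ 90%), not a documented formula.

```python
# Minimal sketch of how the headline examiner stats appear to be derived.
# The additive interview adjustment is an assumption, not a documented formula.

granted, resolved = 171, 296

allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")                 # 57.8%, shown as 58%

interview_lift = 0.323                                        # "+32.3% Interview Lift"
print(f"With interview: {allow_rate + interview_lift:.0%}")   # ~90%
```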

Statute-Specific Performance

§101: 21.9% (-18.1% vs TC avg)
§103: 47.6% (+7.6% vs TC avg)
§102: 8.7% (-31.3% vs TC avg)
§112: 15.0% (-25.0% vs TC avg)
"vs TC avg" comparisons are against a Tech Center average estimate. Based on career data from 296 resolved cases.
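
A short sketch, assuming each "vs TC avg" delta is a simple difference between the examiner's rate and the Tech Center average estimate. Under that assumption, all four rows imply the same ~40% baseline, which suggests the dashboard uses a single TC-wide estimate rather than per-statute averages.

```python
# Sketch: back out the implied Tech Center average from each row, assuming
# delta = examiner_rate - tc_avg. All figures are from the table above.

examiner_rate = {"101": 21.9, "103": 47.6, "102": 8.7, "112": 15.0}
delta_vs_tc   = {"101": -18.1, "103": 7.6, "102": -31.3, "112": -25.0}

tc_avg = {s: round(examiner_rate[s] - delta_vs_tc[s], 1) for s in examiner_rate}
print(tc_avg)  # {'101': 40.0, '103': 40.0, '102': 40.0, '112': 40.0}
```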

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on February 5, 2026 has been entered.

Response to Amendment

This Office Action has been issued in response to Applicant's Communication of amended application S/N 18/980,001 filed on February 5, 2026. Claims 1 to 3, 5 to 13, and 15 to 22 are currently pending in the application.

Specification

The specification is objected to as failing to provide proper antecedent basis for the claimed subject matter (see 37 CFR 1.75(d)(1) and MPEP § 608.01(o)). Correction of the following is required: claims 21 and 22 recite the limitation "ground truth length"; however, the specification lacks antecedent basis for the claim terminology, and more specifically, for the term "ground truth length".

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

Use of the word "means" (or "step") in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.

Absence of the word "means" (or "step") in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material, or acts to entirely perform the recited function.

Claim limitations in this application that use the word "means" (or "step") are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word "means" (or "step") are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

This application includes one or more claim limitations that do not use the word "means," but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function, and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: "a synthetic caption generation unit configured to", "a video language model finetuning unit configured to", and "an enhanced video language model configured to", recited in claims 11 to 13 and 15 to 19.

Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. (See Specification Paras [0074], [0081]: "The illustrative computing device 510 includes at least one processor 512 (e.g. a microprocessor, microcontroller, digital signal processor, etc.), memory 514, and an input/output (I/O) subsystem 516. The computing device 510 may be embodied as any type of computing device such as a personal computer (e.g., a desktop, laptop, tablet, smart phone, wearable or body-mounted device, etc.), a server, an enterprise computer system, a network of computers, a combination of computers and other electronic devices, or other electronic devices"; "Embodiments in accordance with the disclosure may be implemented in hardware, firmware, software, or any combination thereof. Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors".)

If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 to 3, 5 to 13, and 15 to 22 are rejected under 35 U.S.C. 103 as being unpatentable over TORABI et al. (U.S. Publication No. 2017/0357720), hereinafter Torabi, in view of Hu et al. (U.S. Publication No. 2025/0190719), hereinafter Hu, and further in view of KIM (U.S. Publication No. 2023/0154159), hereinafter Kim.

As to claim 1: Torabi discloses:

A method for improved long video retrieval by training video language models (VLM) using diverse captions [Paragraph 0041 teaches providing improved natural language image/video annotation and search, enabling the effective training of the proposed model from heterogeneous linguistic descriptions, including full sentences and noun/verb phrases], the method comprising:

generating a plurality of captions of varying dimensions [Paragraph 0046 teaches using a recurrent neural network to produce a representation of video data, including a representation of linguistic descriptions; Paragraph 0047 teaches a given video can have multiple descriptions of varying length, where example natural language tags of a video can include: "Bear climbing up the ladder to get honey which he intends to eat", "bear climbing", "intending to eat honey", and "bear on ladder"; therefore, varying dimensions];

associating the plurality of captions of varying dimensions to one or more videos in one or more video data sets to generate one or more enhanced video data sets [Paragraph 0009 teaches for each of the plurality of training samples, encoding each of the plurality of phrases for the training sample as a matrix, wherein each word within the one or more phrases is encoded as a vector, determining a weighted ranking between the plurality of phrases, encoding the instance of video content for the training sample as a sequence of frames, extracting frame features from the sequence of frames, performing an object classification analysis on the extracted frame features, and generating a matrix representing the instance of video content, based on the extracted frame features and the object classification analysis; Paragraph 0049 teaches the textual descriptions comprise natural language description tags of variable length; Paragraph 0052 teaches textual descriptions are of variable length, where the LVSM component encodes each word within the textual descriptions, to generate a matrix representing each textual description; Paragraph 0054 teaches generating video content matrix representations for the videos];

generating an enhanced VLM by finetuning a pretrained video language model using the generated one or more enhanced video data sets [Paragraph 0008 teaches the data model that is jointly trained with a language Long Short-term Memory (LSTM) neural network module and a video LSTM neural network module; Paragraph 0009 teaches training the data model based on a plurality of training samples, where each of the plurality of training samples includes (i) a respective instance of video content and (ii) a respective plurality of phrases describing the instance of video content; Paragraph 0049 teaches training the data model using a plurality of training samples including natural language words describing the instance of video content]; and

retrieving one or more videos with a query using the enhanced VLM [Paragraph 0008 teaches determining one or more instances of video content from the video library that correspond to the textual query, by analyzing the textual query using the data model; Paragraph 0009 teaches processing the textual query using the trained data model to identify one or more instances of video content from the plurality of instances of video content, and returning at least an indication of the one or more instances of video content];

wherein the varying dimensions include summarization level, wherein each caption of the plurality of captions differs by at least one of a duration level, summarization level, and simplification level [Paragraph 0047 teaches a given video can have multiple descriptions of varying length, where example natural language tags of a video can include: "Bear climbing up the ladder to get honey which he intends to eat", "bear climbing", "intending to eat honey", and "bear on ladder"; therefore, captions differing at least by a duration level].

Torabi does not appear to expressly disclose generating captions using one or more Large Language Models (LLM); having a R@K rank; or wherein the varying dimensions include varying duration level, summarization level, and simplification level.

Hu discloses:

generating captions using one or more Large Language Models (LLM) [Paragraph 0004 teaches using a large language model (LLM) or another generative model capable of performing a summarization task; Paragraph 0047 teaches using a large language model (LLM) to summarize the content based on degree of summarization]; and

wherein the varying dimensions include varying duration level, summarization level, and simplification level [Paragraph 0045 teaches summarization criteria, degree of summarization, including temporal duration, textual length, etc.; Paragraph 0046 teaches degree of summarization can be determined based on level of expertise; Paragraph 0047 teaches using a large language model (LLM) to summarize the content based on degree of summarization; Paragraph 0118 teaches summarization criteria includes temporal duration, textual length, level of expertise, etc.].

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the cited references and modify the invention as taught by Torabi, by generating captions using one or more Large Language Models (LLM), wherein the varying dimensions include varying duration level, summarization level, and simplification level, as taught by Hu [Paragraphs 0004, 0045-0047, 0118], because both applications are directed to efficient generation of textual content descriptions; generating the captions using an LLM reduces the computing resources utilized to manually review a video, while using varying criteria enables personalization of the generated text, improving the user's experience (see Hu Para [0004]).

Neither Torabi nor Hu appears to expressly disclose having a R@K rank.

Kim discloses:

having a R@K rank [Paragraph 0134 teaches validation recall-at-1 (R@1) performance is evaluated at every training epoch, and the model at the epoch with the best validation performance is selected as the final model; Table 1 teaches retrieval results on the synthetic data, including R@1, R@5, and R@10 rank; Paragraph 0138 teaches R@K is the % of a true item found in the model's top-K retrieved items; Paragraph 0150 teaches retrieval performance metrics are recall-at-k (R@k) with k = 1, 5, 10].

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the cited references and modify the invention as taught by Torabi, by having a R@K rank, as taught by Kim [Paragraphs 0134, 0138, 0150, Table 1], because the applications are directed to improvements in content retrieval, including video content; employing a R@K rank improves the performance of the learning of the models (see Kim Para [0153]).

As to claim 2: Torabi and Hu disclose: each of the one or more video data sets include a plurality of long videos greater than 60 seconds or contain multiple events [Torabi - Paragraph 0049 teaches textual descriptions comprise natural language descriptions of variable length, i.e., "Bear climbing up the ladder to get honey which he intends to eat"; therefore, the videos contain multiple events; Hu - Paragraph 0004 teaches selecting a plurality of sources to generate the summaries, hence, multiple events].

As to claim 3: Torabi discloses: each of the plurality of captions associated with each video is a different description of the video [Paragraph 0047 teaches example natural language tags of a video can include: "Bear climbing up the ladder to get honey which he intends to eat", "bear climbing", "intending to eat honey", and "bear on ladder"; therefore, different descriptions; Paragraph 0049 teaches the textual descriptions comprise natural language description tags of variable length].

As to claim 5: Torabi as modified by Hu discloses: one or more of the plurality of captions are natural language descriptions generated by the one or more LLMs [Hu - Paragraph 0004 teaches generating a summarization of content using a large language model].

As to claim 6: Torabi as modified by Kim discloses: the enhanced VLM is further finetuned using one or more contrastive loss functions [Kim - Paragraph 0022 teaches using contrastive loss functions].

As to claim 7: Torabi as modified by Kim discloses: the contrastive loss functions are standard bi-directional contrastive loss functions that push the relationship between caption embeddings generated for the same video closer together [Kim - Paragraph 0021 teaches the contrastive learning approach aims to learn the cross-modal similarity measure by the intuitive criteria that pull together relevant pairs and push away irrelevant ones].

As to claim 8: Torabi discloses: the same video is retrieved from the enhanced VLM using a plurality of different search queries having varying dimensions which are variations on how to query the enhanced VLM [Paragraph 0009 teaches processing the textual query using the trained data model to identify one or more instances of video content from the plurality of instances of video content; Paragraph 0045 teaches retrieving a ranked list of the images/videos ranked by the distance to the query in the embedding space; Paragraph 0074 teaches training the data model such that a video depicting this scene would appear closest to the phrase "Bear climbing up the ladder to get honey which he intends to eat" in the semantic embedding space, followed by the tags "intending to eat honey", "bear on ladder", and "bear climbing"; therefore, different queries can be used to retrieve a same video].

As to claim 9: Torabi as modified by Kim discloses: wherein K = 1 [Kim - Paragraph 0134 teaches validation recall-at-1 (R@1) performance is evaluated; Paragraph 0150 teaches retrieval performance metrics are recall-at-k (R@k) with k = 1].

As to claim 10: Torabi as modified by Kim discloses: wherein K <= 5 [Kim - Paragraph 0134 teaches validation recall-at-1 (R@1) performance is evaluated; Paragraph 0150 teaches retrieval performance metrics are recall-at-k (R@k) with k = 1, 5, 10].

As to claim 21: Torabi discloses: the duration level includes ground truth length [Paragraph 0044 teaches incorporating visual information and heterogeneous linguistic information, including complete AD sentences (hence, ground truth length), noun phrases (NPs) and verb phrases (VPs); Paragraph 0047 teaches a given video can have multiple descriptions of varying length, including the most specific (longest) description (ground truth length), and less specific (shorter) descriptions; Paragraph 0070 teaches the original captioned sentence could be ranked higher compared to phrases (including NPs and VPs) which are part of the complete caption; therefore, including ground truth length, i.e., original or complete captions].

The same rationale applies to claims 11 to 13, 15 to 20, and 22, since they recite similar limitations, and they are therefore similarly rejected.

Response to Arguments

The following is in response to arguments filed on February 5, 2026. Arguments have been carefully and respectfully considered, but are moot in view of the new grounds of rejection, as necessitated by the amendments.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to RAQUEL PEREZ-ARROYO, whose telephone number is (571) 272-8969. The examiner can normally be reached Monday - Friday, 8:00am - 5:30pm, Alt Friday, EST.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Sherief Badawi, can be reached at 571-272-9782. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RAQUEL PEREZ-ARROYO/
Primary Examiner, Art Unit 2169
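
The rejection above turns on three machine-learning techniques: LLM-generated captions at varying dimensions (claim 1), bi-directional contrastive finetuning (claims 6-7), and the R@K retrieval metric (claims 9-10). The sketches below are editorial illustrations of the generic techniques only; none of this code comes from the application or the cited references. First, caption generation at varying duration, summarization, and simplification levels: `call_llm`, the level names, and the prompt wording are all hypothetical.

```python
# Illustrative sketch of the "captions of varying dimensions" step discussed
# in the claim 1 rejection. `call_llm` is a hypothetical stand-in for an LLM
# client; the level names and prompt wording are invented for illustration.

from itertools import product

DURATION = ["one short clause", "one sentence", "a full paragraph"]
SUMMARIZATION = ["high-level gist", "key events only", "detailed play-by-play"]
SIMPLIFICATION = ["plain everyday language", "precise technical language"]

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real client here."""
    raise NotImplementedError

def diverse_captions(video_description: str) -> list[str]:
    """One caption per (duration, summarization, simplification) combination."""
    captions = []
    for dur, summ, simp in product(DURATION, SUMMARIZATION, SIMPLIFICATION):
        prompt = (f"Rewrite this video description as {dur}, covering the "
                  f"{summ}, in {simp}:\n{video_description}")
        captions.append(call_llm(prompt))
    return captions
```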
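
Next, a minimal sketch of a standard bi-directional contrastive loss of the kind recited in claims 6-7, assuming a symmetric CLIP-style formulation over a batch of matched (video, caption) embedding pairs. In finetuning, diverse captions of the same video would each form a positive pair with that video, which is what pulls their embeddings together.

```python
# Sketch of a standard bi-directional (symmetric, CLIP-style) contrastive
# loss. Generic technique, not code from the application or cited references.

import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(video_emb: torch.Tensor,
                                   caption_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """Row i of video_emb and caption_emb form a positive pair; all other
    rows in the batch serve as negatives in both directions."""
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    logits = v @ c.T / temperature                  # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2c = F.cross_entropy(logits, targets)     # video -> caption direction
    loss_c2v = F.cross_entropy(logits.T, targets)   # caption -> video direction
    return (loss_v2c + loss_c2v) / 2
```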
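
Finally, recall-at-K as Kim defines it in the passage quoted above ("the % of a true item found in model's top-K retrieved items"). This is the generic metric; the function and argument names are hypothetical.

```python
# Sketch of recall-at-K per Kim's quoted definition. Generic implementation.

def recall_at_k(rankings: list[list[str]], truths: list[str], k: int) -> float:
    """rankings[i] is the ranked retrieval list for query i; truths[i] is the
    ground-truth item for that query."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(rankings, truths))
    return hits / len(truths)

# R@1 counts a hit only when the true video is ranked first (claim 9);
# R@5 counts it anywhere in the top five (claim 10, K <= 5).
```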

Prosecution Timeline

Dec 13, 2024: Application Filed
Aug 09, 2025: Non-Final Rejection — §103
Nov 12, 2025: Response Filed
Dec 13, 2025: Final Rejection — §103
Feb 05, 2026: Request for Continued Examination
Feb 15, 2026: Response after Non-Final Action
Feb 21, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12566786: NATURAL LANGUAGE PROCESSING WORKFLOW FOR RESPONDING TO CLIENT QUERIES (Granted Mar 03, 2026; 2y 5m to grant)
Patent 12566726: ENABLING EXCLUSION OF ASSETS IN IMAGE BACKUPS (Granted Mar 03, 2026; 2y 5m to grant)
Patent 12555109: DETERMINISTIC CONCURRENCY CONTROL FOR PRIVATE BLOCKCHAINS (Granted Feb 17, 2026; 2y 5m to grant)
Patent 12547602: LOG ENTRY REPRESENTATION OF DATABASE CATALOG (Granted Feb 10, 2026; 2y 5m to grant)
Patent 12517948: INFORMATION PROCESSING METHOD AND DEVICE FOR SORTING MUSIC IN A PLAYLIST (Granted Jan 06, 2026; 2y 5m to grant)
Study what changed to get these cases past this examiner; based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 58%
With Interview: 90% (+32.3%)
Median Time to Grant: 3y 5m
PTA Risk: High
Based on 296 resolved cases by this examiner. Grant probability derived from career allow rate.
