Last updated: May 29, 2026

Application No. 18/614,205

VERSATILE ACTION MODELS (VAMOS) FOR VIDEO UNDERSTANDING

Non-Final OA §103

Filed

Mar 22, 2024

Priority

Nov 20, 2023 — provisional 63/601,124

Examiner

YANG, JIANXUN

Art Unit

2662

Tech Center

2600 — Communications

Assignee

Brown University

OA Round

1 (Non-Final)

Interview Optional

— +19.0% interview lift. Examiner has a relatively high allowance rate (74%); +19.0% interview lift. A written response may suffice.

Based on 645 resolved cases, 2023–2026

Examiner Intelligence

YANG, JIANXUN View full profile →

Grants 74% — above average

Career Allowance Rate

479 granted / 645 resolved

+12.3% vs TC avg

Strong +19% interview lift

Without

With

+19.0%

Interview Lift

resolved cases with interview

Typical timeline

2y 7m

Avg Prosecution

27 currently pending

Career history

686

Total Applications

across all art units

Statute-Specific Performance

§101

0.5%

-39.5% vs TC avg

§103

91.9%

+51.9% vs TC avg

§102

3.0%

-37.0% vs TC avg

§112

3.6%

-36.4% vs TC avg

Black line = Tech Center average estimate • Based on career data from 645 resolved cases

Office Action

§103

DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claims 1-20 are pending.


Claim Rejections - 35 USC § 103
The following is a quotation of pre-AIA  35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains.  Patentability shall not be negatived by the manner in which the invention was made.

Claim(s) 1-6, 10-15 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zeng et al (Socratic Models, 2022) in view of Chen et al (VideoLLM, May 2023).

Regarding claim 1, Zeng teaches a method for forming versatile action models for video understanding, comprising:
(Zeng, “Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning”, [abstract]; a method ("framework") for video understanding that is versatile (“answering free-form questions about egocentric video”, “engaging in multimodal assistive dialogue with people (e.g., for cooking recipes)”, “robot perception and planning", etc.)
gathering data from a video,
(Zeng, “activity = f_LM (f_VLM (f_LM (f_ALM (f_LM (f_VLM (video)))))”, sec. 3, p3; “visual-language models (VLMs) ... large language models (LMs)”, [abstract], “ALM (speech-to-text)”, sec. 4.3, p6)
wherein the date comprises textual video representations and other task specific language inputs; and
(Zeng, in the above expression, “(i) the VLM detects visual entities, (ii) the LM suggests sounds that may be heard, (iii) the ALM chooses the most likely sound, (iv) the LM suggests possible activities, (v) the VLM ranks the most likely activity, (vi) the LM generates a summary of the Socratic interaction”, sec. 3, p3; “input event log”, sec. 5.1, p7; “(ii) the LM suggests sounds that may be heard” or “input event log” => “other task specific language inputs”)
using a pre-trained large language model (LLM) next 
(Zeng, “(iii) Forecasting of future events can be formulated as language-based world-state completion. Our system prompts the LM to complete the rest of an input event log. Timestamps of the predictions can be preemptively specified depending on the application needs. The completion results (example below on the right) are generative, and are more broad than binary event classification”, sec. 5.1, p7; using a pre-trained LLM ("LM" like GPT-3, see Sec 4) for "forecasting of future events" (action anticipation) based on the textual video data ("input event log"). The "completion" of the log by the LM is inherently next prediction)
	Zeng does not expressly disclose but Chen teaches:
	... next token prediction for action anticipation ...;
(Chen, “We use two linear layers to predict the category of each token s_v^I and its next token s_v^(i+1). Thanks to the causal structure of decoder-only LLM, we do not need to calculate the context of the entire sequence ... We only make s_v^0 cross-attend to the historical context to calculate new states c^0”, sec 4.3, “Online Reasoning”, p6; “Future Prediction. Given a sequence of seen tokens m = ... as the working memory, model need predict the next N_f tokens or events ...”, sec. 4.3, “Future Prediction”, p6; predicting the "next token" using the "causal structure of decoder-only LLM" for "Future Prediction" (action anticipation))
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to incorporate the teachings of Chen into the system or method of Zeng because both use pre-trained, decoder-only LLMs to predict future video events. While Zeng teaches high-level "forecasting" via text completion, Chen explicitly optimizes this as "next token prediction" using the LLM’s causal structure. A skilled artisan would incorporate Chen’s token-based method into Zeng to standardize the action anticipation step, ensuring the LLM predicts future actions efficiently as the next tokens in the sequence. The combination of Zeng and Chen also teaches other enhanced capabilities.

Regarding claim 2, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 1,
wherein using a pre-trained LLM next token prediction for action anticipation based on the data from the video comprises unifying video dynamic modeling tasks,
wherein the video dynamic modeling tasks comprises comprehending historical content and future prediction.
(Chen, see comments on claim 15)

Regarding claim 3, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 1, wherein gathering data from a video comprises converting visual inputs from the video into discrete action labels and free-form descriptions.
(Zeng, Chen, see comments on claim 15)

Regarding claim 4, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 3, wherein converting visual inputs into discrete action label comprises condensing the video into sequences of discrete action labels through application of action recognition models that operate in a predefined action space.
(Zeng, Chen, see comments on claim 15)

Regarding claim 5, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 3, wherein converting visual inputs from the video into free-form descriptions comprises:
processing sampling frames to produce frame level captions; and
concatenating the frame level captions to form a comprehensive video-level captions.
(Zeng, see comments on claim 15)

Regarding claim 6, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 1, comprising generating a corresponding text token by inputting the textual video representations and other task specific language inputs into a frozen word embedding layer.
(Zeng, see comments on claim 1; “A key component in SMs is multi-model multimodal prompting, in which information from a non-language domain is substituted into a language prompt, which can be used by an LM for reasoning. One way to multimodal prompt is to variable-substitute language-described entities from other modalities into a prompt”, “activity = f_LM (f_VLM (f_LM (f_ALM (f_LM (f_VLM (video)))))”, sec. 3, p3; regarding “a frozen word embedding layer”, it is well defined in the art that a frozen word embedding layer means using pre-trained word vectors that are kept static (not updated) during the training of a neural network; the standard LLMs used in Zeng/Chen often have frozen embeddings to reduce computational cost)

Regarding claim 10, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 5, comprising incorporating a learnable linear projection layer to align visual features with a language space.
(Chen, “We use one linear projector ϕ to learn translation from the visual to language semantics to attain translated semantics s_v = ... where d is the hidden dimension of the used LLM”, sec. 4.2, p5; using a learnable linear projector to map visual features to the language dimension)

Regarding claim 11, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 1, comprising compressing input sequences above a desired length and extracting elements from the generated textual video representations to determine downstream video understanding tasks.
(Zeng, “extract a set of “key moments” throughout the video (e.g., via importance sampling, or video/audio search based on the input query, ...) ... recursively summarize ... them into a language-based record of events, which we term a language-based world-state history”, sec. 5.1, p7; summarizing/compressing video data into a concise history for downstream tasks like Q&A)

Regarding claim 12, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 1, comprising:
providing a token selector, wherein the token selector takes in a sequence of textual video tokens; and
selecting a single token from the sequence of textual video tokens for downstream video understanding.
(Chen, “We use two linear layers to predict the category of each token s_v^I and its next token s_v^(i+1). Thanks to the causal structure of decoder-only LLM, we do not need to calculate the context of the entire sequence ... We only make s_v^0 cross-attend to the historical context to calculate new states c^0”, sec 4.3, “Online Reasoning”, p6; “implement action anticipation tasks to predict the action category that occurs after a certain time gap”, sec. 2.1, p3; a particular video anticipation token may be selected for t = t0+gap, which is a mechanism that takes a sequence of tokens and selects a "single token" (the next token/category) for the understanding task)

Regarding claim 13, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 12, comprising:
selecting a condensed token sequence;
dividing the condensed token sequence into a plurality of uniform segments, each uniform segment containing a unique textual video token forming the sequence of textual video tokens; and
feeding the sequence of textual video tokens into the token selector.
(Chen, “adopt a temporal-wise unitization method to process unit-wise visual (or audio and other modality) information for utilizing LLMs to understand video streams comprehensively ... all frames are divided into N_v = ... space-time visual unit (=> “uniform segments) ... outputs a sequence of space-time visual units x_v = ...”, sec. 4.1, p4-5; “pool each visual unit of x_v to the temporal token (=> “unique token per segment”)”, sec. 4.2, p5; dividing the sequence into uniform segments (units) and converting them into tokens fed into the model)

Regarding claim 14, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 13, comprising providing manual intervention to generate intervened tokens to fix incorrect downstream video understanding.
(Zeng, “Our example application here helps the user search for a recipe, then guides them through it step by step. The system allows the user to navigate recipe steps with casual dialogue, provides ingredient replacements or advice (using LM priors), and searches for visual references (in the form of images or videos) on user request. This is a case study in (i) prompting a dialogue LM ... to produce key phrase tokens”, sec. 5.2, p8; casual dialogue between a user and the system involves token generation)

Regarding claim 15, the combination of Zeng and Chen teaches a method for forming versatile action models for video understanding, the method implemented using a computer system including a processor communicatively coupled to a memory device, the method comprising:
(Zeng, see comments on claim 1)
gathering data from a video by converting visual inputs from the video into discrete action labels and free-form descriptions by condensing the video into sequences of discrete action labels through application of action recognition models that operate in a predefined action space, and
(Zeng, Chen, see comments on claim 1; Zeng, “VLM is used to zero-shot detect different place categories ... object categories ... image type ({photo, cartoon, sketch, painting}) and the number of people {are no people, is one person, are two people, are three people, are several people, are many people}”, sec. 4.1, p4; {...} => discrete labels; converting inputs to discrete labels; “SMs can be prompted to perform various perceptual tasks on egocentric video: (i) summarizing content, (ii) answering free-form reasoning questions, (iii) and forecasting”, sec. 5.1, p7; Chen, “Timestamp-level tasks aim to predict closed-set properties at each time step or filter suitable time steps based on textual conditions... implement online action detection or action segmentation tasks to predict the category of each time step in a video stream ... implement action anticipation tasks to predict the action category that occurs after a certain time gap”, sec. 2.1, p3; converting video inputs into sequences of closed-set properties and action categories at each time step)
processing sampling frames to produce frame level captions, which are concatenated to form a comprehensive video-level caption; and
(Zeng, “we first extract a set of “key moments” throughout the video (e.g., via importance sampling, or video/audio search based on the input query, discussed in Appendix). We then caption the key frames indexed by these moments ... and recursively summarize ... them into a language-based record of events, which we term a language-based world-state history”, sec. 5.1, p7; processing sampled frames to create frame captions and concatenating/summarizing them into a video-level history)
using a pre-trained large language model (LLM) next token prediction for action anticipation based on the data from the video by unifying video dynamic modeling tasks,
(Zeng, Chen, see comments on claim 1; Chen, “VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence ... VideoLLM yields an effective unified framework for different kinds of video understanding tasks”, [abstract]; a unified framework using multi-modal predictions)
wherein the video dynamic modeling tasks comprises comprehending historical content and future prediction.
(Chen, “Online Reasoning... We only make s_v^0 cross-attend to the historical context to calculate new states c^0”, sec 4.3, p6; predicting future states based on the historic events)

Regarding claim 18, the combination of Zeng and Chen teaches its/their respective base claim(s).
The combination further teaches the method of Claim 15, comprising:
selecting a condensed token sequence;
dividing the condensed token sequence into a plurality of uniform segments, each uniform segment containing a unique textual video token forming the sequence of textual video tokens; and
feeding the sequence of textual video tokens into a token selector; and
(Chen, see comments on claim 13)
selecting a single video token from the sequence of textual video tokens for downstream video understanding.
(Chen, see comments on claim 12)


Allowable Subject Matter
Claim(s) 19-20 is/are allowed.

The following is an examiner’s statement of reasons for allowance claim 19:

Claim 19 is directed to a method for forming versatile action models for video understanding that involves gathering data from a video by converting visual inputs into both discrete action labels (via action recognition models in a predefined space) and free-form descriptions (by captioning and concatenating sampled frames). The claim specifically requires generating a corresponding text token by inputting these labels and descriptions into a frozen word embedding layer, and uniquely specifies that this same frozen word embedding layer samples a predetermined number of frames from the video and generates visual features. These visual features are then inputted into a projection layer to produce vision tokens, which are concatenated and fed into a pre-trained LLM for next token prediction to unify historical comprehension and future prediction tasks.
 	Prior art Zeng (Socratic Models, 2022) is directed to "Socratic Models," a framework that composes pre-trained models (like VLMs and LLMs) via language prompts to perform video understanding tasks without fine-tuning. Zeng teaches converting video into a "language-based world-state history" (free-form descriptions) by captioning key frames and using an LLM to "forecast future events" (action anticipation) based on this text log. It uses standard pre-trained models where the word embedding layer processes text, while a separate visual model (VLM) handles video frame sampling and feature generation.
Prior art Chen (VideoLLM, 2023) is directed to "VideoLLM," a unified framework for video sequence understanding that leverages the causal reasoning of decoder-only LLMs. Chen teaches converting video into a sequence of tokens using a "Modality Encoder" (visual encoder) and a "Semantic Translator" (projection layer) to map visual features to the language space, alongside a text tokenizer for language inputs. It explicitly uses the LLM for "next token prediction" to perform tasks like "Online Action Detection" (discrete labels) and "Future Prediction" (action anticipation).
However, either Zeng alone or in combination with Chen fails to teach all limitations of claim 19.

Claim 20 depends on claim 19 and therefore is allowable.

Claim(s) 7-8 and 16 is/are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening Claim(s).
The following is a statement of reasons for the indication of allowable subject matter:

Claim(s) 7-8 and 16 recite(s) limitation(s) related to a frozen word embedding layer which samples video frames and generates visual features for projection into tokens, and which generates both text tokens from descriptions and visual features from sampled frames. There are no explicit teachings to the above limitation(s) found in the prior art cited in this office action and from the prior art search.

Claim(s) 9 and 17 depend on claims 8 and 16, respectively.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JIANXUN YANG whose telephone number is (571)272-9874. The examiner can normally be reached on MON-FRI: 8AM-5PM Pacific Time.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Amandeep Saini can be reached on (571)272-3382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/JIANXUN YANG/
Primary Examiner, Art Unit 2662				1/18/2026

Read full office action

Prosecution Timeline

Mar 22, 2024

Application Filed

Jan 22, 2026

Non-Final Rejection mailed — §103

Apr 22, 2026

Response Filed

Precedent Cases

Applications granted by this same examiner with similar technology

18/132,015

Patent 12639797

IMAGE INPAINTING METHOD AND DEVICE

3y 1m to grant Granted May 26, 2026

18/210,608

Patent 12639800

DEFECT DETECTION FOR DENTAL APPLIANCES

2y 11m to grant Granted May 26, 2026

18/284,494

Patent 12633104

IMAGE FEATURE TRANSMISSION METHOD, DEVICE AND SYSTEM

2y 7m to grant Granted May 19, 2026

18/318,679

Patent 12633082

METHOD FOR GENERATING CONTOUR DATA, COMPUTER DEVICE AND COMPUTER-READABLE STORAGE MEDIUM

3y 0m to grant Granted May 19, 2026

18/253,120

Patent 12626368

IMAGE ANALYSIS FOR AERIAL IMAGES

2y 12m to grant Granted May 12, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.

Typically takes 5-10 seconds — AI-generated, attorney review required before filing

Prosecution Projections

1-2

Expected OA Rounds

74%

Grant Probability

93%

With Interview (+19.0%)

2y 7m (~5m remaining)

Median Time to Grant

Low

PTA Risk

Based on 645 resolved cases by this examiner. Grant probability derived from career allowance rate.