DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments with respect to the rejections of claims 1-22 have been considered but are moot in view of new ground(s) of rejection.
Response to Amendment
Claim Rejections - 35 USC § 103
3. The text of those sections of Title 35, U.S. Code not included in this section can be found in a prior Office action.
4. Claims 1-9 and 15-22 are rejected under 35 U.S.C. 103 as being unpatentable over Sridhar et al. (US Publication 2025/0078818) in view of Yang et al. (US Publication 2024/0380949), and further in view of Zhou et al. (English translation of Chinese Publication CN 117152669, published 02-2024).
Regarding claim 1, Sridhar discloses a computer-implemented method for language instructed temporal localization in videos, comprising:
receiving multimodal input comprising natural language input and video input comprising a plurality of frames (Sridhar, fig. 8A, para’s 0075-0076, receiving text prompt 802 and video frames 804 which represent multimodal input);
pre-processing the video input by:
sampling the plurality of frames to obtain a second subset of frames (Sridhar, fig. 8A, para’s 0075-0076, the video frames 804 can be down-sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816);
generating a plurality of image tokens for the second subset of frames (Sridhar, fig. 8A, para’s 0075-0076, generate image embeddings 816 from down-sampled video);
pre-processing the natural language input to generate a plurality of language tokens (Sridhar, fig. 8A, para’s 0075-0076, the text prompt 802 can be encoded by an encoder (not shown) that generates a text embedding 814);
providing, to a pre-trained multimodal large language model (LLM), the plurality of image tokens, and the plurality of language tokens (Sridhar, fig. 8A, para’s 0075-0076, the text prompt 802 can be encoded by an encoder that generates a text embedding 814. The encoder for example could be a contrastive language-image pretraining (CLIP) encoder. The video frames 804 can be down sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816. The image embedding 816 may also be generated from the CLIP encoder; a weighted average and normalization process includes receiving the text embedding 814, the image embedding 816 and the audio embedding 818 and generating a weighted average value that is provided to a cosine similarity component 822. The intermediate beams 612 are provided to the greedy rollout engine 704 to generate greedily rollout intermediate beams. The greedily rollout intermediate beams (or complete sentences) can be encoded by an encoder 824 to generate a second text embedding 826 (e.g., a CLIP embedding from a CLIP encoder such as encoder 824). The second text embedding 826 is provided to the cosine similarity component 822 to generate a faithfulness score 707 based on the weighted average and the second text embedding 826. The cosine similarity component 822 may also represent a more generic similarity component that uses other approaches besides a cosine similarity. The alignments from the input embeddings (e.g., one or more of the text embeddings 814, the image embedding 816, and/or the audio embedding 818) are used as a check on the text embedding associated with a caption); and
processing, by the pre-trained multimodal LLM, the plurality of image tokens, and the plurality of language tokens to generate output responsive to the natural language input, wherein the output indicates a natural language caption corresponding to content in the video input (Sridhar, para. 0071, FIG. 6 is a block diagram of a decoder 600 (or decoder system) that uses a faithfulness guidance engine 608 to improve the output caption 618. The faithfulness guidance engine 608 is an example of the faithfulness guidance engine of the decoder 310 of FIG. 3B. The decoder 600 introduces an embedding space faithfulness score during the decoding process. For instance, if a condition is met, at the generation step, the decoder 600 can compute a faithfulness score for intermediate beams 612 along with a model probability score to re-rank the intermediate beams and to generate reranked intermediate beams 614. Unimodal or multimodal encoded representations 602 are input into the decoder 600. As shown, the decoder 600 includes transformer blocks 604, which can identify sets of possible tokens. A sampler 606 (e.g., a beam search or a greedy search algorithm to rank tokens based on probability of use) can be used to generate the intermediate beams 612 that are provided to the faithfulness guidance engine 608. The faithfulness guidance engine 608, given the intermediate beams 612 and the encoded representations 602, can generate reranked intermediate beams 614 provided to the sampler 606 to produce a finalized beam 616 that is ultimately used to generate the output caption 618. The faithfulness guidance engine 608 is introduced into the sampler 606 decoding process. At one or more token generation steps (e.g., at every token generation step in some cases), the decoder 600 considers the reranking of the intermediate beams 614 from faithfulness guidance engine 608 along with the prediction score (intermediate beams 612) from the sampler 606).
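For illustration only, the faithfulness-scoring mechanism described in the cited Sridhar passages (a weighted average of input embeddings compared against a candidate caption embedding by cosine similarity) can be sketched as follows. This sketch is not code from any cited reference; the function name, weights, and shapes are hypothetical:

```python
import numpy as np

def faithfulness_score(text_emb, image_emb, caption_emb, weights=(0.5, 0.5)):
    """Weighted-average the input embeddings, then score a candidate caption
    embedding by cosine similarity against that average (higher = more faithful)."""
    avg = weights[0] * text_emb + weights[1] * image_emb
    avg = avg / np.linalg.norm(avg)                   # normalize fused input embedding
    cap = caption_emb / np.linalg.norm(caption_emb)   # normalize caption embedding
    return float(np.dot(avg, cap))                    # cosine similarity of unit vectors
```

A decoder along these lines could compute such a score for each intermediate beam and re-rank the beams using a mix of this score and the model probability score.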
Sridhar discloses processing, by the pre-trained multimodal LLM, the plurality of image tokens and the plurality of language tokens to generate output responsive to the natural language input, wherein the output indicates a natural language caption corresponding to content in the video input, but does not explicitly disclose:
obtaining the second subset of frames by sampling the plurality of frames using a fixed frame count, wherein the second subset of frames is independent of the video length of the video input;
sampling the plurality of frames using a first downsampling ratio to obtain a first subset of frames, wherein the first subset of frames is based on a video length of the video input;
generating a plurality of video tokens using a plurality of tokens associated with the first subset of frames;
processing the plurality of image tokens and the plurality of video tokens to generate the output.
Yang discloses:
sampling the plurality of frames using a first downsampling ratio to obtain a first subset of frames; generating a plurality of video tokens using a plurality of tokens associated with the first subset of frames; processing the plurality of image tokens and the plurality of video tokens to generate the output (Yang, para. 0029, para’s 0046-0055, the video MHA transformer 110ba receives the local video tokens 304b and global video tokens 304c as queries, the local video tokens 304b as keys, and the global audio tokens 306c as values, and subsequently outputs fused local video tokens 308c and fused global video tokens 308e. Accordingly, with the global cross fusion module 110b, cross-modal attention flow is funneled into the global video tokens 304c and the global audio tokens 306c, and the local video tokens 304b and local audio tokens 306b are used for intra-modal attention. By using the global video tokens 304c as context, as opposed to using local video tokens 304b as context, high-level concepts can be used in the generation of video captions; it is noted that high-level concept detection would require down-sampling having a low frame rate, while local video tokens for refinement of feature representation in a frame and neighboring frames would require down-sampling having a high frame rate; para’s 0081-0085, the cross-modal encoder may comprise a merged fusion module configured to concatenate the local video tokens and local audio tokens, input the concatenated local video tokens and the concatenated local audio tokens into a first transformer and a second transformer, respectively, and output merged local video tokens and merged local audio tokens; since Yang discloses sampling the video and generating local video tokens and global video tokens according to the sampled video frames, either the local video tokens or the global video tokens can be considered image tokens or video tokens);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the plurality of local video tokens and the plurality of global video tokens disclosed by Yang with Sridhar’s invention in order to enhance multimodal video caption generation by using both video tokens and image tokens.
Sridhar-Yang does not explicitly disclose but Zhou discloses:
obtaining the second subset of frames by sampling the plurality of frames using a fixed frame count, wherein the second subset of frames is independent of the video length of the video input; wherein the first subset of frames is based on a video length of the video input (Zhou, para’s 0088-0089, performing sparse sampling and uniform sampling. For a video with a number of frames, performing sparse sampling forms a shorter sampled video, wherein, the starting and ending time corresponding to the original video is scaled according to the sampling ratio; for example, a 1/10 sampling ratio means only one-tenth of the original data points are kept; in sparse sampling, the number of frames in the sampled video is based on the length of the original video; uniform sampling, i.e., fixed frame count sampling, is a video sampling technique that extracts a predetermined, constant number of frames from a video, regardless of its original length or frame rate).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Zhou’s sampling methods into Sridhar-Yang’s invention in order to enhance multimodal video caption generation by generating tokens from reduced versions of the original video.
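The two sampling schemes at issue (Zhou's ratio-based sparse sampling, whose output length tracks the video length, and fixed-frame-count uniform sampling, whose output length does not) can be sketched as follows. This sketch is illustrative only; it is not taken from any cited reference, and the function names are hypothetical:

```python
import numpy as np

def sparse_sample(frames, ratio):
    """Ratio-based (sparse) sampling: the number of kept frames
    depends on the length of the original video."""
    step = int(round(1 / ratio))      # e.g., a 1/10 ratio keeps every 10th frame
    return frames[::step]

def uniform_sample(frames, fixed_count):
    """Fixed-frame-count (uniform) sampling: the number of kept frames
    is constant regardless of the video's original length."""
    idx = np.linspace(0, len(frames) - 1, num=fixed_count).round().astype(int)
    return [frames[i] for i in idx]
```

With a 1/10 ratio, a 100-frame video yields 10 frames while a 50-frame video yields 5 (length-dependent); uniform sampling with a count of 8 yields 8 frames from either video (length-independent).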
Regarding claim 2, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 1, further comprising: prior to generating the output responsive to the natural language input, training the multimodal LLM using training data comprising a plurality of training videos and a plurality of training images (Yang, para’s 0017-0018, training the model using training video).
The motivation to combine the references and obviousness arguments are the same as claim 1.
Regarding claim 3, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 2, wherein training the multimodal LLM using the training data comprises: determining a sample length for a training video from the plurality of training videos based on dividing a plurality of training frames of the training video with the first downsampling ratio; sampling the plurality of training frames of the training video using the determined sample length to obtain a first training subset of the plurality of training frames; generating one or more training video tokens for the training video based on the first training subset of the plurality of training frames; and training the multimodal LLM using the one or more training video tokens (Sridhar, para’s 0066-0077 and 0107-0114, training machine learning models, Sridhar, fig. 8A, para’s 0075-0076, the video frames 804 can be down-sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816; generate image embeddings 816 from down-sampled video; fig. 8A, para’s 0075-0076, the text prompt 802 can be encoded by an encoder (not shown) that generates a text embedding 814; the text prompt 802 can be encoded by an encoder that generates a text embedding 814. The encoder for example could be a contrastive language-image pretraining (CLIP) encoder. The video frames 804 can be down sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816. The image embedding 816 may also be generated from the CLIP encoder; a weighted average and normalization process includes receiving the text embedding 814, the image embedding 816 and the audio embedding 818 and generating a weighted average value that is provided to a cosine similarity component 822. The intermediate beams 612 are provided to the greedy rollout engine 704 to generate greedily rollout intermediate beams.
The greedily rollout intermediate beams (or complete sentences) can be encoded by an encoder 824 to generate a second text embedding 826 (e.g., a CLIP embedding from a CLIP encoder such as encoder 824). The second text embedding 826 is provided to the cosine similarity component 822 to generate a faithfulness score 707 based on the weighted average and the second text embedding 826. The cosine similarity component 822 may also represent a more generic similarity component that uses other approaches besides a cosine similarity. The alignments from the input embeddings (e.g., one or more of the text embedding 814, the image embedding 816, and/or the audio embedding 818) are used as a check on the text embedding associated with a caption; Yang, para’s 0017-0018, training the model using training video).
The motivation to combine the references and obviousness arguments are the same as claim 1.
Regarding claim 4, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 3, wherein generating the one or more training video tokens comprises: generating a plurality of training tokens for each frame from the first training subset; and generating the one or more training video tokens for the first training subset by: for each frame from the first training subset, averaging the plurality of training tokens associated with the frame to generate a single training video token for the frame (Sridhar, para’s 0066-0077 and 0107-0114, training machine learning models, Sridhar, fig. 8A, para’s 0075-0076, the video frames 804 can be down-sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816; generate image embeddings 816 from down-sampled video; fig. 8A, para’s 0075-0076, the text prompt 802 can be encoded by an encoder (not shown) that generates a text embedding 814; the text prompt 802 can be encoded by an encoder that generates a text embedding 814. The encoder for example could be a contrastive language-image pretraining (CLIP) encoder. The video frames 804 can be down sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816. The image embedding 816 may also be generated from the CLIP encoder; a weighted average and normalization process includes receiving the text embedding 814, the image embedding 816 and the audio embedding 818 and generating a weighted average value that is provided to a cosine similarity component 822. The intermediate beams 612 are provided to the greedy rollout engine 704 to generate greedily rollout intermediate beams. The greedily rollout intermediate beams (or complete sentences) can be encoded by an encoder 824 to generate a second text embedding 826 (e.g., a CLIP embedding from a CLIP encoder such as encoder 824).
The second text embedding 826 is provided to the cosine similarity component 822 to generate a faithfulness score 707 based on the weighted average and the second text embedding 826. The cosine similarity component 822 may also represent a more generic similarity component that uses other approaches besides a cosine similarity. The alignments from the input embeddings (e.g., one or more of the text embedding 814, the image embedding 816, and/or the audio embedding 818) are used as a check on the text embedding associated with a caption; Yang, para’s 0017-0018, training the model using training video).
The motivation to combine the references and obviousness arguments are the same as claim 1.
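Claim 4's per-frame averaging step (collapsing a frame's plurality of tokens into a single video token for that frame) amounts to a mean over the token axis. A minimal illustrative sketch, with hypothetical names and shapes not drawn from any cited reference:

```python
import numpy as np

def frame_video_token(frame_tokens):
    """Average a frame's tokens (shape: num_tokens x dim) into
    a single video token (shape: dim) for that frame."""
    return np.mean(frame_tokens, axis=0)
```

Applied to each frame of the first training subset, this yields one training video token per frame.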
Regarding claim 5, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 3, wherein determining the sample length for the training video comprises: dividing the plurality of training frames of the training video with the first downsampling ratio to obtain an intermediate sample length; and determining the sample length based on comparing the intermediate sample length with a maximum threshold of sample lengths (Sridhar, para’s 0066-0077 and 0107-0114, training machine learning models, Sridhar, fig. 8A, para’s 0075-0076, the video frames 804 can be down-sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816; generate image embeddings 816 from down-sampled video; fig. 8A, para’s 0075-0076, the text prompt 802 can be encoded by an encoder (not shown) that generates a text embedding 814; the text prompt 802 can be encoded by an encoder that generates a text embedding 814. The encoder for example could be a contrastive language-image pretraining (CLIP) encoder. The video frames 804 can be down sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816. The image embedding 816 may also be generated from the CLIP encoder; a weighted average and normalization process includes receiving the text embedding 814, the image embedding 816 and the audio embedding 818 and generating a weighted average value that is provided to a cosine similarity component 822. The intermediate beams 612 are provided to the greedy rollout engine 704 to generate greedily rollout intermediate beams. The greedily rollout intermediate beams (or complete sentences) can be encoded by an encoder 824 to generate a second text embedding 826 (e.g., a CLIP embedding from a CLIP encoder such as encoder 824).
The second text embedding 826 is provided to the cosine similarity component 822 to generate a faithfulness score 707 based on the weighted average and the second text embedding 826. The cosine similarity component 822 may also represent a more generic similarity component that uses other approaches besides a cosine similarity. The alignments from the input embeddings (e.g., one or more of the text embedding 814, the image embedding 816, and/or the audio embedding 818) are used as a check on the text embedding associated with a caption; Yang, para’s 0017-0018, training the model using training video).
The motivation to combine the references and obviousness arguments are the same as claim 1.
Regarding claim 6, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 3, wherein training the multimodal LLM using the training data further comprises:
sampling the plurality of training frames of the training video to obtain a second training subset of the plurality of training frames; and
generating a set of training image tokens for each of the second training subset of the plurality of training frames, wherein training the multimodal LLM comprises training the multimodal LLM using the one or more training video tokens and the sets of training image tokens (Sridhar, para’s 0066-0077 and 0107-0114, training machine learning models, Sridhar, fig. 8A, para’s 0075-0076, the video frames 804 can be down-sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816; generate image embeddings 816 from down-sampled video; fig. 8A, para’s 0075-0076, the text prompt 802 can be encoded by an encoder (not shown) that generates a text embedding 814; the text prompt 802 can be encoded by an encoder that generates a text embedding 814. The encoder for example could be a contrastive language-image pretraining (CLIP) encoder. The video frames 804 can be down sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816. The image embedding 816 may also be generated from the CLIP encoder; a weighted average and normalization process includes receiving the text embedding 814, the image embedding 816 and the audio embedding 818 and generating a weighted average value that is provided to a cosine similarity component 822. The intermediate beams 612 are provided to the greedy rollout engine 704 to generate greedily rollout intermediate beams. The greedily rollout intermediate beams (or complete sentences) can be encoded by an encoder 824 to generate a second text embedding 826 (e.g., a CLIP embedding from a CLIP encoder such as encoder 824). The second text embedding 826 is provided to the cosine similarity component 822 to generate a faithfulness score 707 based on the weighted average and the second text embedding 826.
The cosine similarity component 822 may also represent a more generic similarity component that uses other approaches besides a cosine similarity. The alignments from the input embeddings (e.g., one or more of the text embedding 814, the image embedding 816, and/or the audio embedding 818) are used as a check on the text embedding associated with a caption; Yang, para’s 0017-0018, training the model using training video).
The motivation to combine the references and obviousness arguments are the same as claim 1.
Regarding claim 7, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 6, wherein sampling the plurality of training frames of the training video to obtain the second training subset of the plurality of training frames is based on using a second downsampling ratio that is greater in magnitude than the first downsampling ratio (Sridhar, fig. 8A, para’s 0075-0076, the video frames 804 can be down-sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816; Yang, para. 0046, an input video is received from a user, and video frames are sampled from the input video with different sampling ratio; Zhou, para’s 0088-0089, performing sparse sampling and uniform sampling. For a video with a number of frames, performing sparse sampling forms a shorter sampled video, wherein, the starting and ending time corresponding to the original video is scaled according to the sampling ratio; for example, a 1/10 sampling ratio means only one-tenth of the original data points are kept; in sparse sampling, the number of frames in the sampled video is based on the length of the original video; uniform sampling, i.e., fixed frame count sampling, is a video sampling technique that extracts a predetermined, constant number of frames from a video, regardless of its original length or frame rate; it would have been obvious that the second downsampling ratio can be greater in magnitude than the first downsampling ratio).
The motivation to combine the references and obviousness arguments are the same as claim 1.
Regarding claim 8, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 6, wherein sampling the plurality of training frames of the training video to obtain the second training subset of the plurality of training frames is based on using the fixed frame count (Sridhar, fig. 8A, para’s 0075-0076, the video frames 804 can be down-sampled by a down sampler 808 that provides an output to a feature fusion network 810 to generate image embeddings 816; Yang, para. 0046, an input video is received from a user, and video frames are sampled from the input video with different sampling rate; Zhou, para’s 0088-0089, performing sparse sampling and uniform sampling. For a video with a number of frames, performing sparse sampling forms a shorter sampled video, wherein, the starting and ending time corresponding to the original video is scaled according to the sampling ratio; for example, a 1/10 sampling ratio means only one-tenth of the original data points are kept; in sparse sampling, the number of frames in the sampled video is based on the length of the original video; uniform sampling, i.e., fixed frame count sampling, is a video sampling technique that extracts a predetermined, constant number of frames from a video, regardless of its original length or frame rate).
The motivation to combine the references and obviousness arguments are the same as claim 1.
Regarding claim 9, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 6, wherein training the multimodal LLM comprises: concatenating the one or more training video tokens and the sets of training image tokens to generate a concatenated token; inputting the concatenated token into the multimodal LLM to generate a training output; and training the multimodal LLM using the training output (Sridhar, para. 0169, fusing encoded representations of the plurality of frames of the video data to generate a fused representation of the video data, wherein the encoded representations of the input data include the fused representation of the video data; Yang, para. 0048, cross-modal encoding is performed on the tokens of the extracted embeddings. Step 510 may include a step 510a of performing merged fusion of the video tokens and audio tokens, and a step 510b of performing global cross fusion of the video tokens and audio tokens. Step 510a includes a step 510aa of concatenating the local video tokens and the local audio tokens together, and a step 510ab of inputting the concatenated video tokens and concatenated audio tokens into transformer blocks which merge the concatenated video tokens and concatenated audio tokens).
The motivation to combine the references and obviousness arguments are the same as claim 1.
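Claim 9's concatenation step (joining the training video tokens and the sets of training image tokens into one concatenated token sequence that is input to the LLM) reduces to concatenation along the sequence axis. An illustrative sketch under assumed shapes, not drawn from any cited reference:

```python
import numpy as np

def build_llm_input(video_tokens, image_tokens):
    """Concatenate per-modality token sequences (each: seq_len x dim)
    along the sequence axis into one input sequence for a multimodal LLM."""
    return np.concatenate([video_tokens, image_tokens], axis=0)
```

For example, 4 video tokens and 16 image tokens of dimension 8 would produce a single 20-token input sequence.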
Regarding claim 15, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 1, wherein at least one of the steps of receiving, pre-processing, providing, and processing are performed on a server or in a data center to generate the output, and the output is streamed to a user device (Yang, fig. 2, processing steps are performed at computing device 200, output is sent to client computing device 214).
The motivation to combine the references and obviousness arguments would have been to efficiently manage the resources of a thin client device.
Regarding claim 16, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 1, wherein at least one of the steps of receiving, pre-processing, providing, and processing are performed within a cloud computing environment (Sridhar, para’s 0010 and 0044; Yang, para. 0069, cloud computing device).
The motivation to combine the references and obviousness arguments would have been to efficiently manage the resources of a thin client device.
Regarding claim 17, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 1, wherein at least one of the steps of receiving, pre-processing, providing, and processing are performed for training, testing, or certifying a neural network employed in a machine, robot, or autonomous vehicle (Sridhar, para. 0100, computing device of the vehicle, a robotic device).
Regarding claim 18, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 1, wherein at least one of the steps of receiving, pre-processing, providing, and processing is performed on a virtual machine comprising a portion of a graphics processing unit (Sridhar, para. 0117, virtual devices).
Claims 19-22 are rejected for the same reasons set forth for claims 1-2. Sridhar-Yang-Zhou further discloses processors and memory modules (see Sridhar, fig. 12) and a computer readable medium (see Sridhar, para. 0038).
5. Claims 10 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Sridhar-Yang-Zhou, as applied to claims 9 and 2 above, respectively, in view of Bryan et al. (US Publication 2023/0129350).
Regarding claim 10, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 9.
Sridhar-Yang-Zhou does not explicitly disclose but Bryan discloses wherein the concatenated token comprises the sets of training image tokens, the one or more training video tokens, and a plurality of identifiers, wherein the plurality of identifiers indicate a start and an end of each of the sets of training image tokens and the one or more training video tokens (Bryan, para. 0032, in one or more embodiments, for a nearest neighbor search without product quantization, the audio embeddings comparator 114 uses two main data structures per time resolution: 1) a large, flattened matrix of all audio embeddings for all catalog audio sequences for a given resolution concatenated together (e.g., one column of the matrix corresponds to one embedding); and 2) a hash map structure where the keys are the column indices of the audio embedding in the flattened embedding matrix and the values stored are a catalog audio sequence identifier, a start time, and an end time (e.g., identifier, start time within catalog audio sequence, end time within catalog audio sequence) within the catalog audio sequence associated with the audio embedding. Then, given audio embedding 112 generated from input 100 (averaged across time for a specified length), the audio embeddings comparator 114 computes the similarity between the audio embedding 112 and each catalog audio embedding using a metric or score function (e.g., Euclidean distance, cosine distance, etc.). For example, the audio embeddings comparator 114 computes the squared Euclidean distance (proportional to cosine distance with L2 normalized embeddings) between the audio embedding 112 and the catalog audio embeddings, sorts the distances from smallest to largest, and returns ranked audio sequences 118 listing the most similar results (e.g., the comparisons with the smallest distances)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Bryan’s features into Sridhar-Yang-Zhou’s invention in order to enhance multimodal video caption generation.
Regarding claim 11, Sridhar-Yang-Zhou discloses the computer-implemented method of claim 2, wherein training the multimodal LLM using the training data comprises: generating a plurality of training tokens for a training image from the plurality of training images (see Yang, para’s 0017-0018, training the model using training video).
Sridhar-Yang-Zhou does not explicitly disclose but Bryan discloses generating a single training video token for the training image by averaging the plurality of training tokens; generating a plurality of training image tokens, wherein the plurality of training image tokens are the plurality of training tokens; concatenating the single training video token and the plurality of training image tokens into a concatenated token; and training the multimodal LLM using the concatenated token (Bryan, para. 0040, after generating the short length audio embeddings, the audio model 111 combines neighboring audio embeddings to generate audio embeddings corresponding to larger time resolutions. Continuing the example of FIG. 3, the audio model 111 generates six-second long audio embeddings 304A-304E from three-second long audio embeddings 302A-302F. Combining the neighboring audio embeddings can include averaging the audio embeddings or concatenating the audio embeddings. In one or more embodiments, combining the neighboring audio embeddings is performed by a model trained to take multiple audio embeddings as an input and output a single audio embedding. For example, audio embeddings 302A and 302B are averaged to generate six-second long audio embedding 304A, audio embeddings 302B and 302C are averaged to generate six-second long audio embedding 304B, audio embeddings 302C and 302D are averaged to generate six-second long audio embedding 304C, audio embeddings 302D and 302E are averaged to generate six-second long audio embedding 304D, and audio embeddings 302E and 302F are averaged to generate six-second long audio embedding 304E. The six-second long audio embeddings 304A-304E can be similarly averaged together to generate 12-second long audio embeddings, and so on until a song-level audio embedding can be generated).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Bryan’s features into Sridhar-Yang-Zhou’s invention in order to enhance multimodal video caption generation.
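Bryan's combining of neighboring embeddings into coarser time resolutions (pairs of three-second embeddings 302A-302F averaged into six-second embeddings 304A-304E) can be sketched as a sliding pairwise average. This is for illustration only; the function name is hypothetical and the code is not from the cited reference:

```python
import numpy as np

def combine_neighbors(embs):
    """Average each pair of neighboring embeddings to produce the
    next (coarser) time resolution; n inputs yield n - 1 outputs."""
    return [(embs[k] + embs[k + 1]) / 2 for k in range(len(embs) - 1)]
```

Applying the function repeatedly would yield progressively longer time resolutions, down to a single sequence-level embedding.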
Allowable Subject Matter
6. Claim 12 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all the limitations of the base claim and any intervening claims.
Claims 13-14, which depend from claim 12, are objected to for the same reasons.
Conclusion
7. Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
8. Any inquiry concerning this communication or earlier communications from the examiner should be directed to LOI H TRAN whose telephone number is (571)270-5645. The examiner can normally be reached 8:00AM-5:00PM PST (first Friday of biweek off).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, THAI TRAN can be reached at 571-272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LOI H TRAN/ Primary Examiner, Art Unit 2484