Prosecution Insights
Last updated: April 19, 2026
Application No. 18/737,444

SYSTEM AND METHOD TO REVIEW ONLINE VIOLENCE AND EDUCATION

Final Rejection §103
Filed
Jun 07, 2024
Examiner
TRAN, LOI H
Art Unit
2484
Tech Center
2400 — Computer Networks
Assignee
SRI International
OA Round
2 (Final)
64% Grant Probability (Moderate)
3-4 OA Rounds
2y 10m To Grant
88% With Interview

Examiner Intelligence

Grants 64% of resolved cases
64%
Career Allow Rate
394 granted / 611 resolved
+6.5% vs TC avg
Strong +24% interview lift
+23.6%
Interview Lift
resolved cases with vs. without an interview
Typical timeline
2y 10m
Avg Prosecution
25 currently pending
Career history
636
Total Applications
across all art units

Statute-Specific Performance

§101
6.3%
-33.7% vs TC avg
§103
54.9%
+14.9% vs TC avg
§102
14.8%
-25.2% vs TC avg
§112
12.5%
-27.5% vs TC avg
Tech Center average shown for comparison is an estimate • Based on career data from 611 resolved cases

Office Action

§103
DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments with respect to the rejections of claims 1-18 and 20 under AIA 35 U.S.C. 103 have been fully considered but they are not persuasive. Therefore, the rejections of claims 1-18 and 20 are hereby maintained. New claim 21 is rejected as described below.

Regarding claim 1, Applicant argues that Li in view of Jain does not disclose or suggest "wherein generating the multi-modal feature includes applying cross-attention to the text tokens and the frame tokens". Examiner respectfully disagrees. Li discloses generating, by the computing system and based on the text elements and the visual elements, a plurality of text tokens representative of text in the video and a plurality of frame tokens representative of one or more frames of the video (Li, Fig. 3, col. 4, lines 14-29, the output of visual encoder 220 is a sequence of visual embeddings {v_cls, v_1, …, v_K}, with v_i ∈ R^d and v_cls the embedding 315 of the video [CLS] token); generating, by the computing system using a machine learning model, a set of features that includes a text feature, a frame feature, and a multi-modal feature, wherein the multi-modal feature is representative of multi-modal elements of the video, and wherein generating the set of features is based on the plurality of text tokens and the plurality of frame tokens (Li, col. 3, line 63 through col. 5, line 7, FIG. 3 is a simplified block diagram illustrating a video-text contrastive learning framework. The video encoder 220 and the text encoder 222 in the pretraining network 225 may be trained by a video-text contrastive (VTC) loss; the video representation 315 and text representation 316 are fed to a video-text contrastive (VTC) loss module 330 to align features from the unimodal encoders 220 and 222 before sending them into the multimodal encoder 230. Specifically, given the embeddings of video [CLS] token 315 and the embedding 316 of text [CLS] tokens, a similarity score is computed between video V and text T). Li discloses generating the multi-modal feature based on a plurality of the text tokens and the frame tokens as described above, but does not explicitly disclose that generating the multi-modal feature includes applying cross-attention to the multi-modal elements. Jain discloses generating the multi-modal feature includes applying cross-attention to the multi-modal elements (see Jain, col. 3, lines 46-60, in the encoder(s), the input modality pairs can have cross attention, such as visual-caption/transcript, visual-query, and/or transcript-query attention. In some implementations, the associated text can be encoded (e.g., by a text encoder model, such as a text transformer); col. 16, lines 25-42, as illustrated in FIG. 5, the frame-level encoder model 510 and the segment-level encoder model 530 can be a multimodal encoder configured to produce a plurality of representations (e.g., 520, 540) based at least in part on associated text 504 (e.g., a user query). For instance, in addition to encoding the video data 502 and/or representations thereof, the encoder(s) 510, 530 can be cross-modal encoders that additionally fuse the video data 502 and/or representations thereof with associated text data 504, such as, for example, captioning data for the video and/or query data descriptive of a user query representing a user's search for videos and/or, more particularly, content depicted within the videos. For instance, in the encoder(s) 510, 530, the input modality pairs can have cross attention, such as visual-caption/transcript, visual-query, and/or transcript-query attention. In some implementations, the associated text can be encoded by a text encoder model 516, 536, such as a text transformer, prior to being fused with the video data 502). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Jain's features into Li's invention for enhancing the user's search or playback experience by providing a more effective cross attention feature model that links different representations of multi-modal elements.

Response to Amendment

Claim Rejections - 35 USC § 103

3. The text of those sections of Title 35, U.S. Code not included in this section can be found in a prior Office action.

4. Claims 1-7, 11-17, and 20 are rejected under AIA 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 12,198,432) in view of Jain et al. (US Patent 11,533,495).

Regarding claim 1, Li discloses a method comprising: obtaining, by a computing system, a video that includes text elements and visual elements (Li, col. 2, line 39 through col. 3, line 7, obtaining a video stream by a system; sparsely sampling video frames from the video; sampled frames and texts are independently encoded using a transformer-based video encoder and a text encoder, respectively); generating, by the computing system and based on the text elements and the visual elements, a plurality of text tokens representative of text in the video and a plurality of frame tokens representative of one or more frames of the video (Li, Fig. 3, col. 4, lines 14-29, the output of visual encoder 220 is a sequence of visual embeddings {v_cls, v_1, …, v_K}, with v_i ∈ R^d and v_cls the embedding 315 of the video [CLS] token); generating, by the computing system using a machine learning model, a set of features that includes a text feature, a frame feature, and a multi-modal feature, wherein the multi-modal feature is representative of multi-modal elements of the video, and wherein generating the set of features is based on the plurality of text tokens and the plurality of frame tokens (Li, col. 3, line 63 through col. 5, line 7, FIG. 3 is a simplified block diagram illustrating a video-text contrastive learning framework. The video encoder 220 and the text encoder 222 in the pretraining network 225 may be trained by a video-text contrastive (VTC) loss; the video representation 315 and text representation 316 are fed to a video-text contrastive (VTC) loss module 330 to align features from the unimodal encoders 220 and 222 before sending them into the multimodal encoder 230. Specifically, given the embeddings of video [CLS] token 315 and the embedding 316 of text [CLS] tokens, a similarity score is computed between video V and text T); associating, by the computing system, the set of features with one or more labels to generate a multi-label classification of the video (Li, col. 5, line 7 through col. 6, line 35, generating multimodal labels); and outputting, by the computing system, an indication of the multi-label classification of the video (Li, col. 9, lines 3-19, a video-language model may generate an entity prediction in response to an input of the video frame. The multi-modal video-text encoder may encode the video feature representation and the text feature representation into a set of multimodal embeddings. A classifier may be used to generate the entity prediction from the set of multimodal embeddings. At step 812, a first loss may be computed based on a cross-entropy between the entity prediction and the entity pseudo label; see also claims 1 and 4, computing and outputting label for the video). Li discloses generating the multi-modal feature based on a plurality of the text tokens and the frame tokens as described above, but does not explicitly disclose: the plurality of text tokens being representative of audio spoken in the video; and generating the multi-modal feature includes applying cross-attention to the multi-modal elements. Jain discloses: the plurality of text tokens being representative of audio spoken in the video (Jain, col. 14, lines 5-20, the machine-learned model(s) can be configured to perform a task that generates an embedding for input data (e.g. input audio or visual data). In some cases, the input includes audio data representing a spoken utterance and the task is a speech recognition task. The output may comprise a text output which is mapped to the spoken utterance. In some cases, the task comprises encrypting or decrypting input data); and generating the multi-modal feature includes applying cross-attention to the multi-modal elements (see Jain, col. 3, lines 46-60, in the encoder(s), the input modality pairs can have cross attention, such as visual-caption/transcript, visual-query, and/or transcript-query attention. In some implementations, the associated text can be encoded (e.g., by a text encoder model, such as a text transformer); col. 16, lines 25-42, as illustrated in FIG. 5, the frame-level encoder model 510 and the segment-level encoder model 530 can be a multimodal encoder configured to produce a plurality of representations (e.g., 520, 540) based at least in part on associated text 504 (e.g., a user query). For instance, in addition to encoding the video data 502 and/or representations thereof, the encoder(s) 510, 530 can be cross-modal encoders that additionally fuse the video data 502 and/or representations thereof with associated text data 504, such as, for example, captioning data for the video and/or query data descriptive of a user query representing a user's search for videos and/or, more particularly, content depicted within the videos. For instance, in the encoder(s) 510, 530, the input modality pairs can have cross attention, such as visual-caption/transcript, visual-query, and/or transcript-query attention. In some implementations, the associated text can be encoded by a text encoder model 516, 536, such as a text transformer, prior to being fused with the video data 502). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Jain's features into Li's invention for enhancing the user's search or playback experience by providing a more effective cross attention feature model that links different representations of multi-modal elements.
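For context on the disputed limitation, the sketch below shows one minimal way cross-attention can fuse text tokens and frame tokens into a multi-modal feature. It is illustrative only: the module name, dimensions, pooling step, and use of PyTorch are assumptions, not the architecture of Li, Jain, or the application.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: text tokens act as queries over frame tokens,
# and the attended output is pooled into a single multi-modal feature.
# Names and dimensions are hypothetical.
class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, frame_tokens):
        # Queries come from the text tokens; keys/values come from the frame tokens.
        fused, _ = self.cross_attn(text_tokens, frame_tokens, frame_tokens)
        fused = self.norm(fused + text_tokens)   # residual connection over the text stream
        return fused.mean(dim=1)                 # pooled multi-modal feature per video

# Example: a batch of 2 videos with 16 text tokens and 32 frame tokens of width 256.
text = torch.randn(2, 16, 256)
frames = torch.randn(2, 32, 256)
print(CrossModalFusion()(text, frames).shape)    # torch.Size([2, 256])
```

The reverse direction, with frame tokens querying the text tokens, follows by swapping the query and key/value arguments, which is the alternative recited in new claim 21.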
Regarding claim 2, Li-Jain discloses the method of claim 1, wherein generating the set of features includes: generating, using a transformer model and the text tokens and the frame tokens, a multi-modal classification token for the video, and wherein generating the set of features is based on the multi-modal classification token (Li, col. 2, lines 39-52, provide a sparse video-text pretraining based on sparsely sampled video frames and texts. Specifically, video frames are sparsely sampled from a video, such as a live stream. Sampled frames and texts are independently encoded using a transformer-based video encoder and a text encoder, respectively. A video-text contrastive loss is computed by comparing the outputs from the video encoder and the text encoder. The video encoder and the text encoder may then be jointly updated by at least the video-text contrastive loss. In this way, instance-level alignment is learned by applying the video-text contrastive loss on the unimodal features, which encourages paired video-text instances to have similar representations; col. 4, lines 23-29, the text encoder 222 may be a 6-layer transformer model to represent text tokens in the text input 304. Given an input text description 304 of N_t tokens, the text encoder 222 outputs an embedding sequence {t_cls, t_1, …, t_Nt}, with t_i ∈ R^d and t_cls the embedding 316 of the text [CLS] token. Similar to video encoder 220, positional embeddings are added to the text tokens; Jain, col. 3, lines 46-60, in the encoder(s), the input modality pairs can have cross attention, such as visual-caption/transcript, visual-query, and/or transcript-query attention. In some implementations, the associated text can be encoded (e.g., by a text encoder model, such as a text transformer). The motivation to combine the references and obviousness arguments are the same as in claim 1.

Regarding claim 3, Li-Jain discloses the method of claim 2, wherein the transformer model includes one or more attention layers, and wherein generating the corresponding multi-modal classification token for the video comprises applying the one or more attention layers to the text tokens and the frame tokens (Jain, col. 10, lines 29-31, some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models (e.g., transformer models having one or more attention layers); Li, col. 4, lines 1-29, the video encoder 220 may be a 12-layer TimeSformer to extract video features, with the height and width of input frames being 224. For example, the video input 302 may include N_v frames that are sparsely sampled from each input video. The video encoder 220, the TimeSformer, may first partition each frame into K non-overlapping patches, which are flattened and fed to a linear projection layer 305 to produce a sequence of patch tokens. Learnable positional embeddings are also added to the patch tokens from the linear projection layer 305. Then the TimeSformer applies self-attention along the temporal and spatial dimensions separately in order, leading to per-frame features ṽ ∈ R^(N_v × K × d), with d the feature dimension; the text encoder 222 may be a 6-layer transformer model to represent text tokens in the text input 304. Given an input text description 304 of N_t tokens, the text encoder 222 outputs an embedding sequence {t_cls, t_1, …, t_Nt}, with t_i ∈ R^d and t_cls the embedding 316 of the text [CLS] token. Similar to video encoder 220, positional embeddings are added to the text tokens). The motivation to combine the references and obviousness arguments are the same as in claim 1.

Regarding claim 4, Li-Jain discloses the method of claim 2, wherein generating the set of features includes: generating the text feature using a first neural network, the frame feature using a second neural network, and the multi-modal feature using a third neural network and based on the multi-modal classification token (Li, col. 3, line 63 through col. 5, line 7, FIG. 3 is a simplified block diagram illustrating a video-text contrastive learning framework. The video encoder 220 and the text encoder 222 in the pretraining network 225 may be trained by a video-text contrastive (VTC) loss; the video encoder 220 may be a 12-layer TimeSformer to extract video features, with the height and width of input frames being 224. For example, the video input 302 may include N_v frames that are sparsely sampled from each input video. The video encoder 220, the TimeSformer, may first partition each frame into K non-overlapping patches, which are flattened and fed to a linear projection layer 305 to produce a sequence of patch tokens. Learnable positional embeddings are also added to the patch tokens from the linear projection layer 305. Then the TimeSformer applies self-attention along the temporal and spatial dimensions separately in order, leading to per-frame features ṽ ∈ R^(N_v × K × d), with d the feature dimension. The text encoder 222 may be a 6-layer transformer model to represent text tokens in the text input 304. Given an input text description 304 of N_t tokens, the text encoder 222 outputs an embedding sequence {t_cls, t_1, …, t_Nt}, with t_i ∈ R^d and t_cls the embedding 316 of the text [CLS] token. Similar to video encoder 220, positional embeddings are added to the text tokens; col. 3, lines 58-62, the output of the multimodal encoder 230 may then be compared with the soft entity labels 216 to generate a training loss objective for the pre-training model 225. In this way, the prompter 205 serves to generate soft entity labels to supervise the pretraining of the video-language model; col. 7, lines 45-57, FIG. 7 is a simplified logic flow diagram illustrating a method of training and using a video-text entity prompt network to generate soft labels, according to some embodiments described herein. One or more of the processes of method 700 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes. In some embodiments, method 700 corresponds to the operation of the video-and-language alignment module 630 (FIG. 6) to perform video-and-language alignment contrastive pretraining).

Regarding claim 5, Li-Jain discloses the method of claim 2, wherein the transformer model comprises an encoder that applies multi-head cross attention to the text tokens and the frame tokens to generate the multi-modal feature (Li, col. 4, lines 23-41, the text encoder 222 may be a 6-layer transformer model to represent text tokens in the text input 304. Given an input text description 304 of N_t tokens, the text encoder 222 outputs an embedding sequence {t_cls, t_1, …, t_Nt}, with t_i ∈ R^d and t_cls the embedding 316 of the text [CLS] token. Similar to video encoder 220, positional embeddings are added to the text tokens; the video representation 315 and text representation 316 are fed to a video-text contrastive (VTC) loss module 330 to align features from the unimodal encoders 220 and 222 before sending them into the multimodal encoder 230. Specifically, given the embeddings of video [CLS] token 315 and the embedding 316 of text [CLS] tokens, a similarity score is computed between video V and text T; col. 7, line 45 through col. 8, line 36, FIG. 7 is a simplified logic flow diagram illustrating a method of training and using a video-text entity prompt network to generate soft labels, according to some embodiments described herein. One or more of the processes of method 700 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes. In some embodiments, method 700 corresponds to the operation of the video-and-language alignment module 630 (FIG. 6) to perform video-and-language alignment contrastive pretraining; a first contrastive loss indicative of video-to-text classification may be computed based on the computed similarity scores; a second contrastive loss indicative of text-to-video classification may be computed based on the computed similarity scores; and a video-text contrastive loss may be computed by taking a weighted sum of the first and the second contrastive losses; the video encoder and the text encoder may be updated based at least in part on the video-text contrastive loss; Jain, col. 3, lines 46-60, in the encoder(s), the input modality pairs can have cross attention, such as visual-caption/transcript, visual-query, and/or transcript-query attention. In some implementations, the associated text can be encoded (e.g., by a text encoder model, such as a text transformer); col. 16, lines 25-42, as illustrated in FIG. 5, the frame-level encoder model 510 and the segment-level encoder model 530 can be a multimodal encoder configured to produce a plurality of representations (e.g., 520, 540) based at least in part on associated text 504 (e.g., a user query). For instance, in addition to encoding the video data 502 and/or representations thereof, the encoder(s) 510, 530 can be cross-modal encoders that additionally fuse the video data 502 and/or representations thereof with associated text data 504, such as, for example, captioning data for the video and/or query data descriptive of a user query representing a user's search for videos and/or, more particularly, content depicted within the videos. For instance, in the encoder(s) 510, 530, the input modality pairs can have cross attention, such as visual-caption/transcript, visual-query, and/or transcript-query attention. In some implementations, the associated text can be encoded by a text encoder model 516, 536, such as a text transformer, prior to being fused with the video data 502; Jain, col. 10, lines 29-31, some example machine-learned models can include multi-headed structure). The motivation to combine the references and obviousness arguments are the same as in claim 1.
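The Li passages cited above describe a video-text contrastive (VTC) loss computed from similarity scores between video [CLS] and text [CLS] embeddings, with video-to-text and text-to-video terms combined as a weighted sum. A hedged sketch of that general idea follows; the temperature value and the equal weighting are assumptions, not values taken from Li.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a symmetric video-text contrastive loss. The temperature
# and the 0.5/0.5 weighting are assumptions; only the overall structure
# (similarity of [CLS] embeddings, two directions, weighted sum) follows
# the passages quoted above.
def vtc_loss(v_cls, t_cls, temperature=0.07):
    v = F.normalize(v_cls, dim=-1)              # (B, d) video [CLS] embeddings
    t = F.normalize(t_cls, dim=-1)              # (B, d) text [CLS] embeddings
    sim = v @ t.T / temperature                 # pairwise similarity scores
    targets = torch.arange(v.size(0))           # matched pairs sit on the diagonal
    loss_v2t = F.cross_entropy(sim, targets)    # video-to-text direction
    loss_t2v = F.cross_entropy(sim.T, targets)  # text-to-video direction
    return 0.5 * loss_v2t + 0.5 * loss_t2v      # weighted sum of the two losses

print(vtc_loss(torch.randn(8, 256), torch.randn(8, 256)))
```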
Regarding claim 6, Li-Jain discloses the method of claim 1, wherein associating the set of features with one or more labels comprises: applying a contrastive loss function to the set of features; and determining, using the contrastive loss function, a distance between each class prototype of one or more class prototypes and a corresponding feature of the set of features, wherein each class prototype of the one or more class prototypes are representative of a corresponding classification (Jain, col. 7, lines 8-30, the hierarchical video encoder may output a plurality of segment representations associated with a plurality of video segments of the highest scoring videos, each of which has an associated compatibility score with the user query. The highest score of these compatibility scores can be used as representative of the entire video. In some implementations, the one or more highest likelihood videos can be selected based at least in part on a negative log-likelihood of the one or more highest likelihood videos containing the moment described by the user query. For instance, the videos can be selected to minimize the negative log-likelihood; a modeling objective for the video retrieval task can select a matching video most likely to have a moment to be localized by employing a contrastive loss that contrasts a compatibility score of positive (e.g., matching) pairs of video representation and query against negative (e.g., not matching) pairs of video representation and query. The negative pairs can be randomly sampled. One example compatibility score is computed as f(v, h) = max_k(W_VR^T * Ψ(φ_k; v, h)), where W_VR is a linear regressor; Li, col. 2, line 39 through col. 3, line 7, obtaining a video stream by a system; sparsely sampling video frames from the video; sampled frames and texts are independently encoded using a transformer-based video encoder and a text encoder, respectively; computing a video-text contrastive loss by comparing the outputs from the video encoder and the text encoder. The video encoder and the text encoder may then be jointly updated by at least the video-text contrastive loss. In this way, instance-level alignment is learned by applying the video-text contrastive loss on the unimodal features, which encourages paired video-text instances to have similar representations; Li, col. 4, lines 23-41, the text encoder 222 may be a 6-layer transformer model to represent text tokens in the text input 304. Given an input text description 304 of N_t tokens, the text encoder 222 outputs an embedding sequence {t_cls, t_1, …, t_Nt}, with t_i ∈ R^d and t_cls the embedding 316 of the text [CLS] token. Similar to video encoder 220, positional embeddings are added to the text tokens; the video representation 315 and text representation 316 are fed to a video-text contrastive (VTC) loss module 330 to align features from the unimodal encoders 220 and 222 before sending them into the multimodal encoder 230. Specifically, given the embeddings of video [CLS] token 315 and the embedding 316 of text [CLS] tokens, a similarity score is computed between video V and text T; col. 7, line 45 through col. 8, line 36, FIG. 7 is a simplified logic flow diagram illustrating a method of training and using a video-text entity prompt network to generate soft labels, according to some embodiments described herein. One or more of the processes of method 700 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes. In some embodiments, method 700 corresponds to the operation of the video-and-language alignment module 630 (FIG. 6) to perform video-and-language alignment contrastive pretraining; a first contrastive loss indicative of video-to-text classification may be computed based on the computed similarity scores; a second contrastive loss indicative of text-to-video classification may be computed based on the computed similarity scores; and a video-text contrastive loss may be computed by taking a weighted sum of the first and the second contrastive losses; the video encoder and the text encoder may be updated based at least in part on the video-text contrastive loss). The motivation to combine the references and obviousness arguments are the same as in claim 1.

Regarding claim 7, Li-Jain discloses the method of claim 6, further comprising: learning, by the computing system, the one or more class prototypes based on maximized distances between each of the one or more class prototypes and a corresponding feature of the set of features (Jain, col. 7, lines 8-30, the hierarchical video encoder may output a plurality of segment representations associated with a plurality of video segments of the highest scoring videos, each of which has an associated compatibility score with the user query. The highest score of these compatibility scores can be used as representative of the entire video. In some implementations, the one or more highest likelihood videos can be selected based at least in part on a negative log-likelihood of the one or more highest likelihood videos containing the moment described by the user query. For instance, the videos can be selected to minimize the negative log-likelihood; a modeling objective for the video retrieval task can select a matching video most likely to have a moment to be localized by employing a contrastive loss that contrasts a compatibility score of positive (e.g., matching) pairs of video representation and query against negative (e.g., not matching) pairs of video representation and query. The negative pairs can be randomly sampled. One example compatibility score is computed as f(v, h) = max_k(W_VR^T * Ψ(φ_k; v, h)), where W_VR is a linear regressor; Li, col. 2, line 39 through col. 3, line 7, obtaining a video stream by a system; sparsely sampling video frames from the video; sampled frames and texts are independently encoded using a transformer-based video encoder and a text encoder, respectively; computing a video-text contrastive loss by comparing the outputs from the video encoder and the text encoder. The video encoder and the text encoder may then be jointly updated by at least the video-text contrastive loss. In this way, instance-level alignment is learned by applying the video-text contrastive loss on the unimodal features, which encourages paired video-text instances to have similar representations; Li, col. 4, lines 23-41, the text encoder 222 may be a 6-layer transformer model to represent text tokens in the text input 304. Given an input text description 304 of N_t tokens, the text encoder 222 outputs an embedding sequence {t_cls, t_1, …, t_Nt}, with t_i ∈ R^d and t_cls the embedding 316 of the text [CLS] token. Similar to video encoder 220, positional embeddings are added to the text tokens; the video representation 315 and text representation 316 are fed to a video-text contrastive (VTC) loss module 330 to align features from the unimodal encoders 220 and 222 before sending them into the multimodal encoder 230. Specifically, given the embeddings of video [CLS] token 315 and the embedding 316 of text [CLS] tokens, a similarity score is computed between video V and text T; col. 7, line 45 through col. 8, line 36, FIG. 7 is a simplified logic flow diagram illustrating a method of training and using a video-text entity prompt network to generate soft labels, according to some embodiments described herein. One or more of the processes of method 700 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes. In some embodiments, method 700 corresponds to the operation of the video-and-language alignment module 630 (FIG. 6) to perform video-and-language alignment contrastive pretraining; a first contrastive loss indicative of video-to-text classification may be computed based on the computed similarity scores; a second contrastive loss indicative of text-to-video classification may be computed based on the computed similarity scores; and a video-text contrastive loss may be computed by taking a weighted sum of the first and the second contrastive losses; the video encoder and the text encoder may be updated based at least in part on the video-text contrastive loss). The motivation to combine the references and obviousness arguments are the same as in claim 1.

Claims 11-17 and 20 are rejected for the same reasons set forth in claims 1-7. Li further discloses processor(s), memory module(s), and computer readable medium (see Li, col. 6, line 36 through col. 7, line 19).
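Claims 6 and 7 above turn on measuring distances between learned class prototypes and the generated features to produce a multi-label classification. The sketch below illustrates that general idea only; the distance metric, threshold, and tensor shapes are assumptions rather than the applicant's or the cited references' actual method.

```python
import torch

# Hedged sketch: one learned prototype vector per class; a label is assigned
# when some feature (text, frame, or multi-modal) is close enough to that
# prototype. The threshold and Euclidean distance are assumptions.
def multi_label_from_prototypes(features, prototypes, threshold=1.0):
    # features:   (num_features, d), e.g. the text, frame, and multi-modal features
    # prototypes: (num_classes, d), one prototype per candidate label
    dists = torch.cdist(features, prototypes)   # pairwise distances, (num_features, num_classes)
    closest = dists.min(dim=0).values           # best-matching feature for each class
    return (closest < threshold).int()          # multi-label indicator vector

labels = multi_label_from_prototypes(torch.randn(3, 64), torch.randn(5, 64))
print(labels)  # one 0/1 entry per class
```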
Regarding claim 21, Li-Jain discloses the method of claim 1, wherein generating the multi-modal feature includes: applying the cross-attention using the text tokens to query the frame tokens or using the frame tokens to query the text tokens (Li, col. 3, line 63 through col. 5, line 7, FIG. 3 is a simplified block diagram illustrating a video-text contrastive learning framework. The video encoder 220 and the text encoder 222 in the pretraining network 225 may be trained by a video-text contrastive (VTC) loss; the video representation 315 and text representation 316 are fed to a video-text contrastive (VTC) loss module 330 to align features from the unimodal encoders 220 and 222 before sending them into the multimodal encoder 230. Specifically, given the embeddings of video [CLS] token 315 and the embedding 316 of text [CLS] tokens, a similarity score is computed between video V and text T; Jain, col. 3, lines 46-60, in the encoder(s), the input modality pairs can have cross attention, such as visual-caption/transcript, visual-query, and/or transcript-query attention. In some implementations, the associated text can be encoded (e.g., by a text encoder model, such as a text transformer); col. 16, lines 25-42, as illustrated in FIG. 5, the frame-level encoder model 510 and the segment-level encoder model 530 can be a multimodal encoder configured to produce a plurality of representations (e.g., 520, 540) based at least in part on associated text 504 (e.g., a user query). For instance, in addition to encoding the video data 502 and/or representations thereof, the encoder(s) 510, 530 can be cross-modal encoders that additionally fuse the video data 502 and/or representations thereof with associated text data 504, such as, for example, captioning data for the video and/or query data descriptive of a user query representing a user's search for videos and/or, more particularly, content depicted within the videos. For instance, in the encoder(s) 510, 530, the input modality pairs can have cross attention, such as visual-caption/transcript, visual-query, and/or transcript-query attention. In some implementations, the associated text can be encoded by a text encoder model 516, 536, such as a text transformer, prior to being fused with the video data 502; Jain, col. 10, lines 29-31, some example machine-learned models can include multi-headed structure). The motivation to combine the references and obviousness arguments are the same as in claim 1.

5. Claims 8-9 and 18 are rejected under AIA 35 U.S.C. 103 as being unpatentable over Li-Jain, as applied to claims 1 and 16 above, in view of Mo et al. (English translation of Chinese Publication CN113704466, 11-2021).

Regarding claim 8, Li-Jain discloses the method of claim 1. Li-Jain does not explicitly disclose, but Mo discloses, determining, by the computing system and based on the multi-label classification, whether a particular video meets one or more content requirements for viewing by a user; and outputting an indication of whether the particular video meets one or more content requirements for viewing by the user (Mo, para. 0055, the server 102 determines the classification label corresponding to each data in the database through a text multi-label classification method. For audio and video data, the classification label corresponding to the audio and video data can be determined based on the descriptive text corresponding to the audio and video data (such as introduction, plot introduction and other text information); then the text data is classified and stored according to the label to improve the efficiency of data storage and retrieval. Different categories can also be displayed on the terminal device 101 to facilitate users to retrieve corresponding data under different categories. In a data retrieval scenario, a user may send retrieval conditions to the server 102 via the terminal device 101. The server 102 may quickly retrieve data that meets the retrieval conditions from the database based on the retrieval conditions and classification tags, and feed the data back to the terminal device 101. In the data push scenario, the server 102 can determine the user preferences based on the user information, and determine at least one classification label that matches the user preferences, select the data that needs to be pushed to the user from the data under the classification label, and push the data to the user's terminal device 101. Outputting an indication or notification of whether the particular video meets the user's requirements is well known in the art). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Mo's features into Li-Jain's invention for enhancing the user's video search experience by providing feedback as to the category or type of the video.

Regarding claim 9, Li-Jain-Mo discloses the method of claim 8, wherein outputting the indication of whether the particular video meets one or more content requirements for viewing by the user comprises filtering or permitting the particular video (Mo, para. 0002, filtering text; para. 0095, text removal). The motivation to combine the references and obviousness arguments are the same as in claim 8. Claim 18 is rejected for the same reasons set forth in claim 8.

6. Claim 10 is rejected under AIA 35 U.S.C. 103 as being unpatentable over Li-Jain, as applied to claim 1 above, in view of Peng (English translation of Chinese Publication CN112735195, 04-2021).

Regarding claim 10, Li-Jain discloses the method of claim 1, comprising generating the multi-label classification of the video according to content requirements associated with the user, as described above. Li-Jain does not explicitly disclose, but Peng discloses, receiving, by the computing device, an indication of educational content requirements associated with a user; generating, by the computing device and based on the multi-label classification of the video and the educational content requirements associated with the user, a recommendation for the user to view the video; and outputting an indication of the recommendation (Peng, para. 0017, create video tutorial units under the video resource module according to the requirements of efficient learning and practice; the video resource module includes a general video tutorial unit and a multi-label video tutorial unit. The general video tutorial unit contains several video tutorial materials recorded using general methods; the multi-label video tutorial unit contains several multi-label video tutorial materials. The method for creating multi-label video tutorial materials is: (a) for each software function and case to be explained, separate the video into multiple shorter explanation files according to the settings, options, results of different options, operation steps, etc., and set up label names; (b) the recorded and labeled video explanation files are labeled and combined according to the user's efficient learning method, and a new video tutorial file containing multiple knowledge point labels is generated after the combination. The labeling or label combination mentioned here refers to a video tutorial file that is recorded and combined. The file uses intuitive text descriptions as labels that can be easily selected during playback according to the different subdivided contents. For example, a video tutorial file tells the story of three people, Zhang San, Li Si and Wang Wu. When playing the video, when you click the screen with the mouse, three optional menus will appear: "Zhang San's Story", "Li Si's Story" and "Wang Wu's Story". Each menu item is a subdivided content label of the video tutorial file. If you only want to watch "Wang Wu's Story" during playback, click this label, and the content will jump directly to "Wang Wu's Story" for playback. The label combination mentioned in the manual is equivalent to combining the three short video files "Zhang San's Story", "Li Si's Story" and "Wang Wu's Story" together to form a new video file with a new label "The Story of Three People"; generating, by the computing device and based on the multi-label classification of the video and the educational content requirements associated with the user, a recommendation for the user to view the video; and outputting an indication of the recommendation is well known in the art). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Peng's features and this well-known technique into Li-Jain's invention for enhancing the educational video search experience by providing video recommendations as feedback to users.

7. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: US Publication 2023/0154213 by Gao et al.

Conclusion

8. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

9. Any inquiry concerning this communication or earlier communications from the examiner should be directed to LOI H TRAN, whose telephone number is (571) 270-5645. The examiner can normally be reached 8:00 AM-5:00 PM PST, first Friday of the biweek off. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, THAI TRAN, can be reached at 571-272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/LOI H TRAN/
Primary Examiner, Art Unit 2484

Prosecution Timeline

Jun 07, 2024
Application Filed
Sep 30, 2025
Non-Final Rejection — §103
Dec 19, 2025
Interview Requested
Dec 30, 2025
Applicant Interview (Telephonic)
Jan 10, 2026
Examiner Interview Summary
Jan 30, 2026
Response Filed
Feb 20, 2026
Final Rejection — §103
Apr 09, 2026
Interview Requested

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12598366
CONTENT DATA PROCESSING METHOD AND CONTENT DATA PROCESSING APPARATUS
2y 5m to grant Granted Apr 07, 2026
Patent 12593112
METHOD, DEVICE, AND COMPUTER PROGRAM FOR ENCAPSULATING REGION ANNOTATIONS IN MEDIA TRACKS
2y 5m to grant Granted Mar 31, 2026
Patent 12592261
VIDEO EDITING METHOD AND APPARATUS, AND DEVICE AND STORAGE MEDIUM
2y 5m to grant Granted Mar 31, 2026
Patent 12576798
CAMERA SYSTEM AND ASSISTANCE SYSTEM FOR A VEHICLE AND A METHOD FOR OPERATING A CAMERA SYSTEM
2y 5m to grant Granted Mar 17, 2026
Patent 12579810
SYSTEM AND METHOD FOR AUTOMATIC EVENTS IDENTIFICATION ON VIDEO
2y 5m to grant Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

3-4
Expected OA Rounds
64%
Grant Probability
88%
With Interview (+23.6%)
2y 10m
Median Time to Grant
Moderate
PTA Risk
Based on 611 resolved cases by this examiner. Grant probability derived from career allow rate.
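
For readers checking the arithmetic, the sketch below reproduces how these headline figures appear to be derived from the examiner statistics shown earlier on this page. Treating the interview lift as additive percentage points is an assumption about the tool's methodology.

```python
# Hedged sketch: reproducing the dashboard's headline numbers from the
# examiner statistics above (394 granted of 611 resolved, +23.6% interview lift).
# The rounding and the additive treatment of the lift are assumptions.
granted, resolved = 394, 611
career_allow_rate = granted / resolved                 # ≈ 0.645, shown as 64%
with_interview = career_allow_rate + 0.236             # ≈ 0.881, shown as 88%

print(f"Grant probability: {career_allow_rate:.0%}")   # 64%
print(f"With interview:    {with_interview:.0%}")      # 88%
```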
