Last updated: May 29, 2026

Application No. 18/237,083

Systems and Methods for Video Representation Learning Using Triplet Training

Non-Final OA §103

Filed: Aug 23, 2023

Priority: Aug 24, 2022 (provisional 63/400,551)

Examiner: CHEN, JOSHUA NMN

Art Unit: 2665

Tech Center: 2600 (Communications)

Assignee: Vionlabs AB

OA Round: 1 (Non-Final)

Interview Optional

Examiner has a relatively high allowance rate (83%) and a +29.2% interview lift. A written response may suffice.

Based on 42 resolved cases, 2023–2026

Examiner Intelligence

CHEN, JOSHUA NMN

Career Allowance Rate: 83% (35 granted / 42 resolved), above average; +21.3% vs TC avg

Interview Lift: +29.2% (resolved cases with interview)

Avg Prosecution: 2y 9m (typical timeline); 13 currently pending


Statute-Specific Performance

§101: 2.2% (-37.8% vs TC avg)

§103: 93.6% (+53.6% vs TC avg)

§102: 2.2% (-37.8% vs TC avg)

§112: 2.2% (-37.8% vs TC avg)

Based on career data from 42 resolved cases

Office Action

§103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 11/17/2023, 05/13/2024, and 05/28/2024 were filed, and the submissions are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Objections
Claims 5 and 17 are objected to because of the following informalities:
Regarding claims 5 and 17, both claims state: "applies a time-distributed attention process to chunked data and applies a time-distributed attention process to the chunked data." Although no particular grammatical mistake or interpretation issue is present, the examiner raises this objection to notify the applicant that "chunked data" is recited twice in each claim and that both recitations are currently interpreted as having the same meaning. Appropriate correction or clarification is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 4-5 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Duncan et al. (US 2021/0352380 A1, hereinafter Duncan) in view of Omote et al. (US 2019/0341025 A1, hereinafter Omote).

Regarding claims 1 and 13, Duncan discloses
Claim 1: A system for video representation learning, comprising: a processor configured to receive a video file (Para [0048]: “As used in the instant disclosure, the terms "multimedia data," "media data," and "audio-video data" are used interchangeably to refer to information (e.g., digitized and analog information) that encodes or represents audio, video and/or audio-video content. Media data may include information not corresponding to audio or video”, Para [0060]: “A system that is configured to produce text from audio-video data may include a component that receives audio-video data, and a component that provides speech-to-text conversion”); and  system code executed by the processor and causing the processor to:
Claim 13: A method for video representation learning, comprising the steps of:
extract at least one video feature, at least one audio feature, and at least one valence-arousal-dominance (VAD) feature from the video file (Para [0090]: “Context indicating data may include, for example, character speaking, number of characters in a scene, character singing, time code, scene location, dialogue, and so forth… For example, an indicator may be designed as a multi-dimensional vector with values representing intensity of psychological qualities such as arousal, and valence… Examples of emotional states or vectors in a two-dimensional valence-arousal space is shown in FIG. 7A, while 7B shows emotional states or vectors in a three-dimensional valence-arousal-dominance space.”);
process the at least one video feature, the at least one audio feature, and the at least one VAD feature to generate a video embedding, an audio embedding, and a VAD embedding (Para [0093]: “The processor parameterizes the non-linear features by any useful model for representing an emotional state of the character/vocal instance.”, Para [0095]: “Calculation of evaluation values of the six basic facial expressions such as joy, anger, sadness, and pleasure may be implemented by a known technique in the art… Alternatively, and as another non-limiting example, determining the facial expression of the detected face in the digital image has three stages: (a) face detection, (b) feature extraction and (c) facial expression recognition… The third stage, automatic facial expression recognition, may involve simple Euclidean Distance method. In this method, the Euclidean distance between the feature points of the training images and that of the query image is compared. Based on minimum Euclidean distance, output image expression is decided. Alternatively, the proposed method may be further modified by using Artificial Neuro-Fuzzy Inference System (ANFIS) to give a better recognition rate compared to other known methods. For further example, a deep neural network may perform facial recognition and expression recognition after being trained on a training set.”, Para [0119]: “An engine as described herein may compute emotional content quantitatively using one or more emotional spaces as illustrated by FIGS. 7A-B and use quantitative scoring to classify emotional elements for use in dubbing lists or interpreting semantic messages.”); 
process the fingerprint to generate at least one of a mood prediction, a genre prediction, or a keyword prediction for the video file (Para [0089]: “As each interaction will have multiple recognized behaviors occurring simultaneously, the analysis application 550 weighs any combination of inputs and associated behaviors at a given time and derives a conclusion about an emotional state to report feedback on the given inputs.”).
However, Duncan does not explicitly disclose
concatenate the video embedding, the audio embedding, and the VAD embedding to create a concatenated embedding; 
process the concatenated embedding to generate a fingerprint associated with the video file.
Omote teaches
concatenate the video embedding, the audio embedding, and the VAD embedding to create a concatenated embedding (Fig. 1A, Para [0034]: “As seen in FIG.1A and FIG. 2 Sentence Level Feature Fusion according to aspects of the present disclosure takes multiple different feature vectors 101 generated on a per sentence basis 201 and concatenates 202 them into a single vector 102 before performing classification 203 with a multimodal Neural Network 103. That is, each feature vector 101 of the multiple different types of feature vectors is generated on a per sentence level. After generation, the feature vectors are concatenated to create a single feature vector 103 herein referred to as a fusion vector.”); 
process the concatenated embedding to generate a fingerprint associated with the video file (Para [0034]: “This fusion vector is then provided to a multimodal neural network configured to classify the features.”).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Duncan with Omote's concatenation of feature vectors from different sources and provision of the concatenated vector to a classification model, in order to increase the accuracy of the model.
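
As a rough illustration of the combination this rejection relies on, the sketch below joins three per-modality vectors into one fusion vector and scores it with a single linear layer. The function names, vector sizes, weights, and class count are hypothetical placeholders chosen for illustration; they are not taken from Duncan, Omote, or the application.

```python
from typing import List, Sequence


def concatenate_embeddings(
    video: Sequence[float], audio: Sequence[float], vad: Sequence[float]
) -> List[float]:
    """Join the per-modality vectors end to end into one fusion vector."""
    return [*video, *audio, *vad]


def linear_scores(
    fusion: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """Score the fusion vector with one linear layer (one weight row per class)."""
    return [
        sum(w * x for w, x in zip(row, fusion)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical toy values; real embeddings would come from trained encoders.
video_emb = [0.2, 0.7]
audio_emb = [0.1]
vad_emb = [0.5, -0.3, 0.9]

fusion = concatenate_embeddings(video_emb, audio_emb, vad_emb)

# Two made-up classes, each reading a single dimension of the fusion vector.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
]
scores = linear_scores(fusion, weights, biases=[0.0, 0.0])
```

The fusion vector preserves modality order (video, then audio, then VAD), so any downstream layer has to be sized to the summed dimensionality of the three inputs.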

Regarding claim 4, dependent upon claim 1, Duncan in view of Omote teaches everything regarding claim 1.
Omote further teaches
the system code processes the at least one video feature, the at least one audio feature, and the at least one VAD feature by processing the at least one video feature, the at least one audio feature, and the least one VAD feature using a recurrent neural network (RNN) and chunking output data from the RNN (Para [0035]: “The network configured to map feature vectors to an emotional subspace vector may be any type known in the art but are preferably of the recurrent type, such as, plain RNN, long-short term memory, etc.”; Long short term memory (LSTM) model implies that the model will forget features after some time interval, which is a form of chunking data.).

Regarding claim 5, dependent upon claim 4, Duncan in view of Omote teaches everything regarding claim 4.
Omote further teaches
the system code applies a time-distributed attention process to chunked data and applies a time-distributed attention process to the chunked data (Para [0040]: “For example and without limitation, an attention mechanism may be used to determine which parts of a temporal sequence are more important or to determine which modality ( e.g., audio, video or text) is more important and give higher weights to the more important modality or modalities. The system may correlate audio and video information by vector operations, such as concatenation or element-wise product of audio and video features to create a reorganized fusion vector.”).
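
The attention mechanism in the cited passage gives higher weights to the more important parts of a temporal sequence. One common way to do that, sketched here under assumptions, is to turn per-chunk importance scores into softmax weights and take a weighted sum. The scores are supplied by hand in this sketch, whereas a trained network would normally produce them. All names and values are hypothetical and are not drawn from Omote or the application.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw importance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    chunks: Sequence[Sequence[float]], scores: Sequence[float]
) -> List[float]:
    """Weighted sum of chunk vectors, using softmax(scores) as the weights."""
    if len(chunks) != len(scores):
        raise ValueError("need exactly one score per chunk")
    weights = softmax(scores)
    dim = len(chunks[0])
    return [
        sum(w * chunk[i] for w, chunk in zip(weights, chunks))
        for i in range(dim)
    ]


# Hypothetical chunked sequence: three time steps, two features each.
chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give equal weights, so pooling reduces to a plain average.
uniform = attention_pool(chunks, scores=[0.0, 0.0, 0.0])

# A much higher score on the last chunk makes it dominate the result.
focused = attention_pool(chunks, scores=[0.0, 0.0, 10.0])
```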

Claims 11 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Duncan et al. (US 2021/0352380 A1, hereinafter Duncan) in view of Omote et al. (US 2019/0341025 A1, hereinafter Omote) and Wang et al. (Video Affective Content Analysis: A Survey of State-of-the-Art Methods, hereinafter Wang).

Regarding claims 11 and 23, dependent upon claims 1 and 13, Duncan in view of Omote teaches everything regarding claims 1 and 13.
However, Duncan in view of Omote does not explicitly teach
the system code determines video features and audio features for the video file, concatenates the video features and the audio features to create a concatenated feature, inputs the concatenated feature into a VAD model, and determines the at least one VAD feature using the VAD model.
Wang teaches
the system code determines video features and audio features for the video file (P. 5 Section 3.1: “The video content can be captured by various visual and audio features. Specifically, the affective content of a video consists of two main categories of data: visual data and auditory data. The visual data can be further divided into visual image, print, and other graphics, while the auditory signal can be divided into speech, music, and environmental sound.”), 
concatenates the video features and the audio features to create a concatenated feature, inputs the concatenated feature into a VAD model (P. 10 Section 3.4: “The two modalities in a video, i.e., visual and audio, can be fused for video affective content analysis. Data fusion can be performed in two levels: feature level and decision level. Feature-level fusion combines audio and video features and feeds them jointly to a classifier or regressor for video affective content analysis.”), and 
determines the at least one VAD feature using the VAD model (P. 2 Section 2: “Dimensional views of emotion have been advocated and applied by a several researchers. Most agree that three dimensions are enough to describe a subjective response. However, a consensus has not been reached on the labels of the dimensions. Valence-arousal (VA)-dominance is one set of labels.”).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Duncan in view of Omote with Wang's concatenation of video and audio features to determine emotion, in order to increase accuracy when determining the emotion of a video.

Claims 12 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Duncan et al. (US 2021/0352380 A1, hereinafter Duncan) in view of Omote et al. (US 2019/0341025 A1, hereinafter Omote) and Cheong et al. (Affective Understanding in Film, hereinafter Cheong).

Regarding claims 12 and 24, dependent upon claims 1 and 13, Duncan in view of Omote teaches everything regarding claims 1 and 13. 
However, Duncan in view of Omote does not explicitly teach
the system code determines a training VAD dataset comprising VAD labels, extracts training video features and training audio features from the VAD dataset, concatenates the training video features and the training audio features to create a training concatenated feature, trains a VAD model based at least in part on the training concatenated feature to generate a trained VAD model, and deploys the trained VAD model.
Cheong teaches
the system code determines a training VAD dataset comprising VAD labels (P. 11 Section VII A: “To obtain the ground truth for experimentation, we attempt to manually match the affective content of a scene to one of the output emotions. If ambiguities arise, we resort to the VA diagram (see text and Fig. 8 in the Appendix). Three persons are employed to independently label each scene. To prevent fatigue and systematic bias, an individual labels only one random movie daily, of a genre different from the previously labeled movie. Except for unanimous decisions that stand, all scenes with dissenting views are reviewed using Fig. 8 as a guide, which usually result in common agreement. Scenes where no agreement can be reached have dual labels; the main label that received two votes, and an alternate label that received one vote. Dual label scenes comprise of 14.08% of all scenes; there are no cases with three differing votes.”), 
extracts training video features and training audio features from the VAD dataset (P. 6 Section IV: “We show how effective low-level audio cues may be derived based on considerations (particularly in relation to the seven output emotions chosen) discussed in Sections II and III.”, P. 9 Section V: “We describe several visual cues and show their relationships with respect to the perspectives laid out in Section III. As a preliminary, unless otherwise stated, the visual cues are computed exclusively in the hue, lightness, and saturation (HLS) color space.”), 
concatenates the training video features and the training audio features to create a training concatenated feature (P. 11 Section VI: “The features as described by Sections IV and V are extracted and concatenated into row vectors to form the data points characterizing every scene.”), 
trains a VAD model based at least in part on the training concatenated feature to generate a trained VAD model (P. 11 Section VI: “Then K-fold cross validation is used with grid search to obtain the optimal penalty and margin parameters. Subsequently, radial basis kernel SVMs are individually trained for each class pair, so that only features with discriminative value are used.”), and 
deploys the trained VAD model (P. 11 Section VII B: “Using a take-one-movie-out approach, we reserve the scenes of one movie for testing while using the rest for training. This approach is repeated for every movie in Table IV, where every testing scene is classified into one of the output emotions.”, P. 13 Section VII C: “Machine understanding of the affective aspect of Hollywood multimedia can enhance and complement existing classification systems at several levels of resolution. Here we demonstrate applications at two levels: the more generalized movie genre level, and the more refined movie affective vector level (Fig. 7). Other possible applications include using scene-level affective results for story unit extraction.”).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Duncan in view of Omote with Cheong's training and deployment of an emotion classification model, in order to increase accuracy when determining the emotion of a scene.
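
The "take-one-movie-out" evaluation quoted from Cheong holds out every scene of one movie for testing and trains on the scenes of all remaining movies, repeating once per movie. A minimal sketch of just that splitting step follows. The movie names, scene ids, and labels are made-up placeholders, and the SVM training that Cheong describes is deliberately left out.

```python
from typing import Dict, Iterator, List, Tuple

Scene = Tuple[str, str]  # (scene_id, emotion_label)


def take_one_movie_out(
    scenes_by_movie: Dict[str, List[Scene]],
) -> Iterator[Tuple[str, List[Scene], List[Scene]]]:
    """Yield (held_out_movie, train_scenes, test_scenes) once per movie."""
    for held_out, test_scenes in scenes_by_movie.items():
        # Training data is every scene from every other movie.
        train_scenes = [
            scene
            for movie, scenes in scenes_by_movie.items()
            if movie != held_out
            for scene in scenes
        ]
        yield held_out, train_scenes, list(test_scenes)


# Hypothetical movies, scene ids, and labels, for illustration only.
scenes_by_movie = {
    "movie_a": [("a1", "joy"), ("a2", "fear")],
    "movie_b": [("b1", "anger")],
    "movie_c": [("c1", "joy"), ("c2", "sadness"), ("c3", "fear")],
}

folds = list(take_one_movie_out(scenes_by_movie))
```

Splitting by movie rather than by scene keeps scenes from the same film out of both sides of a fold, so test results are not inflated by film-specific cues.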

Allowable Subject Matter
Claims 2-3, 6-10 and 14-22 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Relevant Prior Art Directed to State of Art
Ray et al. (Multi-level Attention network using text, audio and video for Depression Prediction, hereinafter Ray) is prior art not applied in the rejection(s) above. Ray discloses a multi-level attention based network for multimodal depression prediction that fuses features from audio, video and text modalities while learning the intra and inter modality relevance.

Shi et al. (Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification, hereinafter Shi) is prior art not applied in the rejection(s) above. Shi discloses a hierarchical attention network proposed to solve a weakly labelled speaker identification problem.

Girshick et al. (Non-local Neural Networks, hereinafter Girshick) is prior art not applied in the rejection(s) above. Girshick discloses a non-local operation that computes the response at a position as a weighted sum of the features at all positions.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOSHUA CHEN whose telephone number is (703)756-5394. The examiner can normally be reached M-Th: 9:30 am - 4:30 pm ET; F: 9:30 am - 2:30 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, STEPHEN R KOZIOL, can be reached at (408)918-7630. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/J. C./ Examiner, Art Unit 2665
/Stephen R Koziol/ Supervisory Patent Examiner, Art Unit 2665


Prosecution Timeline

Aug 23, 2023: Application Filed

Dec 02, 2025: Non-Final Rejection mailed, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Application | Patent | Title | Time to grant | Granted

18/318,475 | 12626334 | METHOD AND DEVICE WITH IMAGE PROCESSING | 2y 12m | May 12, 2026

17/852,884 | 12614378 | SYSTEMS AND METHODS TO PROCESS ELECTRONIC IMAGES TO DETERMINE HISTOPATHOLOGY QUALITY | 3y 10m | Apr 28, 2026

18/026,081 | 12602747 | METHOD AND APPARATUS FOR DENOISING A LOW-LIGHT IMAGE | 3y 1m | Apr 14, 2026

17/904,842 | 12592090 | COMPENSATION OF INTENSITY VARIANCES IN IMAGES USED FOR COLONY ENUMERATION | 3y 7m | Mar 31, 2026

17/978,489 | 12579614 | IMAGING DEVICE | 3y 4m | Mar 17, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2

Grant Probability: 83%

With Interview (+29.2%): 99%

Median Time to Grant: 2y 9m (~0m remaining)

PTA Risk: Low

Based on 42 resolved cases by this examiner. Grant probability derived from career allowance rate.
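
The 83% figure is consistent with the examiner's reported counts (35 granted out of 42 resolved). The sketch below reproduces only that division. The function name is a hypothetical helper, and because the page does not say how the 99% with-interview figure is derived, the sketch makes no attempt to reproduce it.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Fraction of resolved cases that ended in a grant."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return granted / resolved


# Counts reported above for this examiner.
rate = allowance_rate(granted=35, resolved=42)
print(f"{rate:.1%}")  # prints 83.3%
```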
