Prosecution Insights
Last updated: April 19, 2026
Application No. 18/057,643

MULTI-MODAL UNDERSTANDING OF EMOTIONS IN VIDEO CONTENT

Status: Non-Final Office Action (§103)
Filed: Nov 21, 2022
Examiner: AZARIAN, SEYED H
Art Unit: 2675
Tech Center: 2600 (Communications)
Assignee: Samsung Electronics Co., Ltd.
OA Round: 3 (Non-Final)

Outlook: Favorable
Grant Probability: 90% (99% with interview)
Expected OA Rounds: 3-4
Time to Grant: 2y 3m

Examiner Intelligence

Career Allow Rate: 90% (807 granted / 901 resolved), +27.6% vs TC avg; above average
Interview Lift: +11.7% among resolved cases with an interview (moderate lift)
Typical Timeline: 2y 3m average prosecution; 9 applications currently pending
Career History: 910 total applications across all art units
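
For context on how the figures above fit together, here is a minimal arithmetic sketch in Python. It is an illustration only, not the analytics vendor's actual pipeline, and the interview subgroup counts behind the lift figure are not shown on this page.

granted = 807            # career grants by this examiner
resolved = 901           # resolved cases (granted plus abandoned)
pending = 9              # currently pending

career_allow_rate = granted / resolved    # 0.8957, reported above as "90%"
total_applications = resolved + pending   # 910, "across all art units"

# The "+11.7% Interview Lift" compares the allow rate among resolved cases
# that had an interview with the rate among those that did not; those
# subgroup counts are not displayed here, so the lift is taken as reported.

print(f"Career allow rate: {career_allow_rate:.1%}")    # 89.6%
print(f"Total applications: {total_applications}")      # 910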

Statute-Specific Performance

§101: 17.0% (-23.0% vs TC avg)
§103: 21.5% (-18.5% vs TC avg)
§102: 31.4% (-8.6% vs TC avg)
§112: 13.9% (-26.1% vs TC avg)
Tech Center averages are estimates. Based on career data from 901 resolved cases.
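
Assuming "vs TC avg" is a simple arithmetic difference (an assumption, since the underlying chart is not reproduced here), the implied Tech Center baseline can be recovered for each statute; all four deltas point to roughly the same 40% baseline.

examiner_rate = {"101": 17.0, "103": 21.5, "102": 31.4, "112": 13.9}   # percent
delta_vs_tc   = {"101": -23.0, "103": -18.5, "102": -8.6, "112": -26.1}

for statute, rate in examiner_rate.items():
    tc_avg = rate - delta_vs_tc[statute]    # e.g. 17.0 - (-23.0) = 40.0
    print(f"§{statute}: examiner {rate:.1f}% vs TC avg ~{tc_avg:.1f}%")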

Office Action

§103
Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Continued Examination Under 37 CFR 1.114 A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/19/2025 has been entered. DETAILED ACTION Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 1-2, 4-9, 11-16 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al (Pub. No.: U.S. 2025/0078569 A1) in view of Chen et al (Self-Cure Network for Facial Expression Recognition). Regarding claim 1, Zhang discloses a method comprising: obtaining a video sequence comprising multiple video frames and audio data (see page 7, paragraph, [0069] in some examples, the multimedia data above is video data with video pictures containing the target object: the video frames in the video data are each taken as a current video frame one by one, and the following operations are performed on the current video frame: acquiring the static facial feature from the current video frame; acquiring the expression change feature from the “video frame sequence” containing the current video frame: acquiring the sound feature from the “audio data” corresponding to the video frame sequence: acquiring the language content feature from the audio data and/or subtitle data corresponding to the video frame sequence); extracting video features associated with at least one face in the video frames and audio features associated with the audio data by splitting the video frames into multiple collections of video frames and processing the collections of video frames using a self-cure network (SCN) to identify the video features; (see abstract, the method includes: extracting a static “facial feature” and a dynamic feature from multimedia data associated with a target object, wherein the dynamic feature includes one or more of an expression change feature, a “sound feature” and a language content (audio), feature: inputting the static facial feature and the dynamic feature into a pre-trained object emotion analysis model, fusing the static facial feature and the dynamic feature by the object emotion analysis model, and outputting an emotion analysis result. Also, page 3, paragraph, [0027] in the above, the expression change feature above can be obtained from video data, and in the video data, for the same target object, when the expression changes, a shape of the face, and shapes and positions of five sense organs of the face, etc. 
all change, and the expression change feature characterizing the change of the expression of the target object in video frames can be extracted from the video data. The sound feature can be extracted from “audio data”, and under different emotions, the sound features emitted by the same target object may be different. For example, under a calm emotion, the sound is soft, under a pleasantly surprised emotion, the sound is possibly sharp, and under an anger emotion, the sound is possibly deep, and therefore, the sound feature contains feature information characterizing the emotion. In some embodiments, the sound feature may include frequency feature, tone feature, pitch feature, energy features, or the like. In some embodiments, the language content feature above may be extracted from audio data, subtitle data, or a text typed and output by the target object. The speaking contents under different emotions can be different for the same target object. For example, under a happy emotion, the speaking content tends to be positive and sunny, and under a sad emotion, the speaking content tends to be depressed and dark. Therefore, the language content feature also contains feature information characterizing the emotion. Finally, page 7, paragraphs, [0063-0064] in the above, in the feature mapping mode of performing linear combination mapping based on the preset facial action unit, expressions of the face are divided into (splitting), a plurality of action units in advance according to muscle distribution of the face, and when the face expresses the emotion through the expression, the expression is represented by a linear combination of the action units. In some embodiments, after the branch network receives the spliced feature, the feature mapping mode thereof includes calculating a linear weight of each action unit according to the spliced feature, and performing linear combination on the action units using the linear weights, so as to obtain the emotion analysis result. In the feature mapping mode of performing linear combination mapping based on the plural preset basic emotion types, emotions are “divided” in advance into plural basic emotions, such as neutrality, happiness, sadness, surprise, fears, anger, aversions, or the like. 
In some embodiments, after the branch network receives the spliced features, the feature mapping mode thereof includes calculating a linear weight of each basic emotion according to the spliced feature, and performing linear combination on the basic emotions using the linear weights, so as to obtain the emotion analysis result); processing the video features and the audio features using a trained machine learning model, the trained machine learning model performing a multi-tiered fusion of the video features and different subsets of the audio features in order to identify at least one emotion expressed by at least one person in the video sequence (see page 1, paragraph, [0005] the present disclosure further provides an electronic device, including a processor and a memory, where the memory stores machine executable instructions executable by the processor, and the processor is configured to execute the machine executable instructions to implement an object emotion analysis method, the object emotion analysis method including: acquiring multimedia data associated with a target object, and extracting a static facial feature and a dynamic feature of the target object from the multimedia data, where the dynamic feature includes at least one of an expression change feature, a “sound feature” or a language content feature of the target object; and inputting the static facial feature and the dynamic feature into an object emotion analysis model, wherein the object emotion analysis model is pre-trained, fusing the static facial feature and the dynamic feature by the object emotion analysis model to obtain a fusion feature, and outputting an emotion analysis result of the target object based on the fusion feature. Also, page 2, paragraphs, [0024-0025] the multimedia data may include data in a variety of formats, such as a video, an image, an “audio”, a text, or the like. The present embodiment is intended to analyze the target object, and thus, the multimedia data is usually associated with the target object, for example, the target object being included in a video, the target object being included in an image, a sound emitted by the target object being included in an audio, speaking content being included in a text or content being output in other forms. The target object here may be a “human, an animal”, a biomimetic robot or other objects with emotion fluctuations. The static facial feature of the above-mentioned target object may be extracted from image data containing the target object, and the image data may also be a video frame image. The static facial feature data can be extracted by a pre-trained facial feature extraction model, and the facial feature extraction model can be specifically composed of a “convolutional neural network”, a residual network, or the like. The static facial feature may characterize an appearance feature, action and posture features, an expression feature, or the like, of the target object, and may be understood as a mixed feature. 
If the model is trained only based on the static facial feature, it is difficult for the model to learn only the expression feature therein, but the model may also learn the appearance feature of the target object, such that the model is influenced by the appearance feature of the target object when analyzing an expression: the model may also learn the action and posture features of the target object, such that the model is influenced by an action and a posture of the target object when analyzing the expression, thereby reducing expression analysis accuracy of the model. Also, page 4, paragraph, [0039] the subtitle data is usually in a text format and records words spoken by the target object, such that the language content text of the target object can be directly obtained from the subtitle data. For the audio data, the words spoken by the target object in the audio data can be “recognized by a voice” recognition tool, so as to obtain the language content text in the text format. In some examples, the language content text of the target object may be extracted from the subtitle data or audio data corresponding to the video frame sequence from which the expression change feature is extracted. In one example, the language content text is “Oh, my god”, and the language content text typically contains a feature characterizing a surprised emotion. Also, page 6, paragraph, [0058] further, the output result of the cross-attention network is further required to be processed as follows to obtain the fusion feature: performing second fusion processing on the output result and the first fusion result corresponding to the second input parameter to obtain a second fusion result, where the second input parameter is obtained by transforming the first fusion result; inputting the second fusion result into a preset first multilayer perceptron, and performing mapping on the second fusion result by the first multilayer perceptron to obtain a mapping result; and performing third fusion processing on the mapping result and the second fusion result to obtain the fusion feature. Also, page 7, paragraphs, [0067-0070], in some examples, the multimedia data above is video data with video pictures containing the target object: the “video frames” in the video data are each taken as a current video frame one by one, and the following operations are performed on the current video frame: acquiring the static “facial feature” from the current video frame; acquiring the expression change feature from the video frame sequence containing the current video frame: acquiring the sound feature from the “audio data” corresponding to the “video frame sequence”: acquiring the language content feature from the audio data and/or subtitle data corresponding to the video frame sequence; and obtaining the emotion analysis result of the target object in the current video frame using the object emotion analysis method described in the above-mentioned embodiments. For example, the afore-mentioned video data includes N video frames, and emotion analysis result i can be obtained for video frame i among them: the emotion analysis results of the video frames are arranged according to an arrangement sequence of the video frames to obtain emotion analysis result 1, emotion analysis result 2, . . . and emotion analysis result N. 
In some examples, the emotion analysis result corresponding to the video data may be an arrangement combination of a series of emotions, such as peace, peace, surprise, surprise, surprise, happiness, happiness, happiness, happiness, etc.)” Finally, page 11, paragraph, [0112] By executing the machine executable instructions, the processor in the electronic device above may implement the following operations in the object emotion analysis method above: acquiring a specified audio sequence from audio data in the multimedia data if the dynamic feature includes the sound feature, where the audio sequence includes a sound signal emitted by the target object; and extracting the sound feature of the target object from the audio sequence by a pre-trained sound feature extraction model, where the sound feature includes one or more of a frequency feature, a tone feature, a pitch feature, and an energy feature. However, regarding claim 1, Zhang clearly disclosed (see abstract, an object emotion analysis method and apparatus and an electronic device are provided. The method includes: extracting a static facial feature and a dynamic feature from multimedia data associated with a target object, wherein the dynamic feature includes one or more of an “expression change” feature, a sound feature and a language content feature: inputting the static facial feature and the dynamic feature into a pre-trained object emotion analysis model, fusing the static facial feature and the dynamic feature by the object emotion analysis model, and outputting an emotion analysis result. Also pages 2 and 5, paragraphs, [0025] and [0049], the static facial feature of the above-mentioned target object may be extracted from image data containing the target object, and the image data may also be a video frame image. The static facial feature data can be extracted by a pre-trained facial feature extraction model, and the facial feature extraction model can be specifically composed of a “convolutional neural network”, a residual network, or the like. The static facial feature may characterize an appearance feature, action and posture features, an “expression feature”, or the like, of the target object, and may be understood as a mixed feature. If the model is trained only based on the static facial feature, it is difficult for the model to learn only the expression feature therein, but the model may also learn the appearance feature of the target object, such that the model is influenced by the appearance feature of the target object when analyzing an expression: the model may also learn the action and posture features of the target object, such that the model is influenced by an action and a posture of the target object when analyzing the expression, thereby reducing expression analysis accuracy of the model. 
Firstly, the dynamic feature is transformed to obtain a first input parameter of the self-attention network, the first input parameter is input into the self-attention network, and an intermediate feature of the dynamic feature is output, where the intermediate feature is used for characterizing autocorrelation of the dynamic feature: a second input parameter of the cross-attention network is determined based on the intermediate feature, a third input parameter of the cross-attention network is determined based on the static facial feature, the second input parameter and the third input parameter are input to the cross-attention network to obtain an output result, and the fusion feature is determined based on the output result. However, Zhang does not explicitly state limitation of amended claim, “self-cure network”, (machine learning model). On the other hand, Chen in the same field of “facial expression recognition”, teaches (see abstract, therefore this article proposes a self-cure network with two-stage method (SCN-TSM) which prevents deep networks from over-fitting ambiguous images. First, base on SCN-TSM, a two-stage training scheme is designed, taking full advantage of the gendered information. Furthermore, a self-attention mechanism to highlight the essential images, and to weight each sample with a weighting regularization. Finally, a relabeling module to modify the labels of these samples in inconsistent labels. Therefore, it would have been obvious to one having ordinary skill in the art at the time the invention was made to modify Zhang invention according to the teaching of Chen because to combine, Zhang, multi-module, multi-layer perception starting with a self-attention network for image and audio emotion labeling and analysis with further teaching of Chen emotion analysis using self-cure network for correcting and the labels would provide for an image detailed and accurate system and method for video image and audio analysis in detecting and modeling emotion images and corresponding audio in videos. and generating CT data and using wireless connection for measuring the generated data and identify the transmission rate for a more efficient transmission of the data. Regarding claim 2, Zhang discloses the method of Claim 1, wherein: extracting the video features comprises, performing face detection in the collections of video frames, and (ii) processing the collections of video frames based on results of the face detection in order to identify the video features associated with the at least one face (see claim 1, also abstract, an object emotion analysis method and apparatus and an electronic device are provided. The method includes: extracting a static facial feature and a dynamic feature from multimedia data associated with a target object, wherein the dynamic feature includes one or more of an expression change feature, a sound feature and a language content feature: inputting the static facial feature and the dynamic feature into a pre-trained object emotion analysis model, fusing the static facial feature and the dynamic feature by the object emotion analysis model, and outputting an emotion analysis result. 
Also, page 7, paragraph, [0063] In the above, in the feature mapping mode of performing linear combination mapping based on the preset facial action unit, expressions of the face are “divided into” a plurality of action units in advance according to muscle distribution of the face, and when the face expresses the emotion through the expression, the expression is represented by a linear combination of the action units. In some embodiments, after the branch network receives the spliced feature, the feature mapping mode thereof includes calculating a linear weight of each action unit according to the spliced feature, and performing linear combination on the action units using the linear weights, so as to obtain the emotion analysis result); and extracting the audio features comprises (i) processing the audio data in order to identify a first subset of the audio features associated with waveforms of the audio data and (ii) processing the audio data using a pre-trained audio model in order to identify a second subset of the audio features (see, abstract, the method includes: extracting a static “facial feature” and a dynamic feature from multimedia data associated with a target object, wherein the dynamic feature includes one or more of an expression change feature, a “sound feature” and a language content feature: inputting the static facial feature and the dynamic feature into a pre-trained object emotion analysis model, fusing the static facial feature and the dynamic feature by the object emotion analysis model, and outputting an emotion analysis result. Also, page 1, paragraph, [0006] in some aspects, the present disclosure further provides a non-transitory machine-readable storage medium, where the machine-readable storage medium stores machine executable instructions, and when invoked and executed by a processor, the machine executable instructions cause the processor to implement an object emotion analysis method, the object emotion analysis method including: acquiring multimedia data associated with a target object, and extracting a static facial feature and a dynamic feature of the target object from the multimedia data, where the dynamic feature includes at least one of an expression change feature, a sound feature or a language content feature of the target object; and inputting the static facial feature and the dynamic feature into an object emotion analysis model, wherein the object emotion analysis model is “pre-trained”, fusing the static facial feature and the dynamic feature by the object emotion analysis model to obtain a fusion feature, and outputting an emotion analysis result of the target object based on the fusion feature. Also page 4, paragraph, [0036] and [0039], if the afore-mentioned dynamic feature includes the sound feature, a specified audio sequence is acquired from the audio data in the multimedia data, where the audio sequence includes a sound signal emitted by the target object: the sound feature of the target object is extracted from the audio sequence by a “pre-trained sound” feature extraction model, where the sound feature includes one or more of a frequency feature, a tone feature, a pitch feature, and an energy feature. The subtitle data is usually in a text format and records words spoken by the target object, such that the language content text of the target object can be directly obtained from the subtitle data. 
For the audio data, the words spoken by the target object in the audio data can be recognized by a voice recognition tool, so as to obtain the language content text in the text format. In some examples, the language content text of the target object may be extracted from the subtitle data or audio data corresponding to the video frame sequence from which the expression change feature is extracted. In one example, the language content text is “Oh, my god”, and the language content text typically contains a feature characterizing a surprised emotion). Regarding claim 4, Zhang discloses the method of Claim 1, wherein the trained machine learning model comprises: at least one cross-modal transformer encoder layer configured to receive and fuse the video features and a first subset of the audio features and generate multi-modal features; at least one fusion encoder layer configured to combine the multi-modal features; and a multi-layer perceptron (MLP) decoder layer configured to decode outputs of the at least one fusion encoder layer as fused with a second subset of the audio features (see claim 1, also page 4, paragraph, [0040] the language content feature extraction model is mainly configured to identify the semantic feature of the language content text above, and can be implemented by a text feature model bidirectional “encoder” representation from transformers (BERT) or by other text semantic feature extraction models. The language content feature extraction model can be trained using a corpus with a large data volume and can extract the feature between text words of adjacent texts. Since the semantic feature is extracted by the language content feature extraction model and characterize the language meaning of the language emitted by the target object. Also page 6, paragraphs, [0058-0059] further, the output result of the “cross-attention network” is further required to be processed as follows to obtain the fusion feature: performing second fusion processing on the output result and the first fusion result corresponding to the second input parameter to obtain a second fusion result, where the second input parameter is obtained by “transforming” the first fusion result; inputting the second fusion result into a preset first multilayer perceptron (MLP), and performing mapping on the second fusion result by the first multilayer perceptron to obtain a mapping result; and performing third fusion processing on the mapping result and the second fusion result to obtain the fusion feature. In some embodiments, the first fusion result above is the first fusion result after performing the first fusion processing on the intermediate feature output from the self-attention network and the dynamic feature. In some embodiments, the second fusion processing may include performing feature addition on the output result and the first fusion result to obtain an addition result, and then performing normalization processing on the addition result to obtain the second fusion result. In some embodiments, the feature addition may be feature splicing, or addition of feature data located at same position points, or the like. In some embodiments, the third fusion processing above may include performing feature addition on the mapping result and the second fusion result to obtain an addition result, and then performing normalization processing on the addition result to obtain the fusion feature. 
In some embodiments, the feature addition may be feature splicing, or addition of feature data located at same position points, or the like. The first multilayer perceptron above may be implemented by a multilayer perceptron (MLP) network. Finally, page 10, paragraphs, [0099-0101] the result output module is further configured to: perform second fusion processing on the output result and the first fusion result corresponding to the second input parameter to obtain a second fusion result, where the second input parameter is obtained by transforming the first fusion result: input the second fusion result into a preset first multilayer perceptron (MLP), and performing mapping on the second fusion result by the first multilayer perceptron to obtain a mapping result; and perform third fusion processing on the mapping result and the second fusion result to obtain the fusion feature. The second multilayer perceptron above includes a plurality of branch networks; and the result output module is further configured to: input the spliced features into the plurality of branch networks of the second multilayer perceptron respectively, where each of the branch networks is preset with a feature mapping mode corresponding to the branch network, and the feature mapping mode includes more of the follows: performing linear “combination” mapping based on a preset facial action unit, performing linear combination mapping based on a plurality of preset basic emotion types, and performing linear characterization mapping based on a positive-negative degree and an intense degree of the emotion; and map the spliced features by the branch networks according to the feature mapping modes corresponding to the branch networks, so as to obtain the emotion analysis results output by the branch networks). Regarding claim 5, Zhang discloses the method of Claim 1, wherein: the trained machine learning model comprises a multi-modal transformer, the multi- modal transformer comprising one or more cross-modal transformer encoder layers and one or more fusion encoder layers; outputs of the multi-modal transformer are fused with a second subset of the audio features; and the video features and a first subset of the audio features are fused by one of: an earlier layer in the multi-modal transformer to support an early-late fusion of the video features and the audio features; a later layer in the multi-modal transformer to support a late-late fusion of the video features and the audio features; or a layer between the earlier and later layers in the multi-modal transformer to support a mid-late fusion of the video features and the audio features (see claim 1, also page 4, paragraph, [0040] the language content feature extraction model is mainly configured to identify the semantic feature of the language content text above, and can be implemented by a text feature model bidirectional “encoder” representation from transformers (BERT) or by other text semantic feature extraction models. The language content feature extraction model can be trained using a corpus with a large data volume and can extract the feature between text words of adjacent texts. Since the semantic feature is extracted by the language content feature extraction model and characterize the language meaning of the language emitted by the target object. 
Also page 6, paragraphs, [0058-0059] further, the output result of the “cross-attention network” is further required to be processed as follows to obtain the fusion feature: performing second fusion processing on the output result and the first fusion result corresponding to the second input parameter to obtain a second fusion result, where the second input parameter is obtained by “transforming” the first fusion result; inputting the second fusion result into a preset first multilayer perceptron (MLP), and performing mapping on the second fusion result by the first multilayer perceptron to obtain a mapping result; and performing third fusion processing on the mapping result and the second fusion result to obtain the fusion feature. In some embodiments, the first fusion result above is the first fusion result after performing the first fusion processing on the intermediate feature output from the self-attention network and the dynamic feature. In some embodiments, the second fusion processing may include performing feature addition on the output result and the first fusion result to obtain an addition result, and then performing normalization processing on the addition result to obtain the second fusion result. In some embodiments, the feature addition may be feature splicing, or addition of feature data located at same position points, or the like. In some embodiments, the third fusion processing above may include performing feature addition on the mapping result and the second fusion result to obtain an addition result, and then performing normalization processing on the addition result to obtain the fusion feature. In some embodiments, the feature addition may be feature splicing, or addition of feature data located at same position points, or the like. The first multilayer perceptron above may be implemented by a multilayer perceptron (MLP) network. Finally, page 10, paragraphs, [0099-0101] the result output module is further configured to: perform second fusion processing on the output result and the first fusion result corresponding to the second input parameter to obtain a second fusion result, where the second input parameter is obtained by transforming the first fusion result: input the second fusion result into a preset first multilayer perceptron (MLP), and performing mapping on the second fusion result by the first multilayer perceptron to obtain a mapping result; and perform third fusion processing on the mapping result and the second fusion result to obtain the fusion feature. The second multilayer perceptron above includes a plurality of branch networks; and the result output module is further configured to: input the spliced features into the plurality of branch networks of the second multilayer perceptron respectively, where each of the branch networks is preset with a feature mapping mode corresponding to the branch network, and the feature mapping mode includes more of the follows: performing linear “combination” mapping based on a preset facial action unit, performing linear combination mapping based on a plurality of preset basic emotion types, and performing linear characterization mapping based on a positive-negative degree and an intense degree of the emotion; and map the spliced features by the branch networks according to the feature mapping modes corresponding to the branch networks, so as to obtain the emotion analysis results output by the branch networks). 
Regarding claim 7, Zhang discloses the method of Claim 1, wherein the trained machine learning model is trained to recognize multiple emotions arranged in a hierarchy, two root categories of the hierarchy comprising positive emotions and negative emotions (see claim 1, also pages 6-7, paragraphs, [0062-0066] further, in order to make the emotion analysis result more accurate and reasonable, in the present embodiment, the object emotion analysis model outputs analysis results of a plurality of emotion analysis modes. Based on this, the second multilayer perceptron above includes a plurality of branch networks; and in the “training process”, each branch network learns a feature mapping mode corresponding to one “emotion analysis mode”. The spliced features are input into the plurality of branch networks of the second multilayer perceptron respectively, where each of the branch networks is preset with a feature mapping mode corresponding to the branch network; the feature mapping mode includes more of the follows: performing linear combination mapping based on a preset facial action unit, performing linear combination mapping based on plural preset basic emotion types, and performing linear characterization mapping based on a “positive-negative” degree and an intense degree of the emotion; and the spliced features are mapped by the branch networks according to the feature mapping modes corresponding to the branch networks, so as to obtain the emotion analysis results output by the branch networks. In the above, in the feature mapping mode of performing linear combination mapping based on the preset facial action unit, expressions of the face are divided into a plurality of action units in advance according to muscle distribution of the face, and when the face expresses the emotion through the expression, the expression is represented by a linear combination of the action units. In the feature mapping mode of performing linear combination mapping based on the plural preset basic emotion types, emotions are divided in advance into plural basic emotions, such as neutrality, “happiness, sadness, surprise, fears, anger”, (categories), aversions, or the like. In some embodiments, after the branch network receives the spliced features, the feature mapping mode thereof includes calculating a linear weight of each basic emotion according to the spliced feature, and performing linear combination on the basic emotions using the linear weights, so as to obtain the emotion analysis result. In some embodiments, in the feature mapping mode of performing linear characterization mapping based on the positive-negative degree and the intense degree of the emotion, after the branch network receives the spliced features, the feature mapping mode thereof includes calculating a parameter of the positive-negative degree and a parameter of the intense degree according to the spliced features, and characterizing the emotion based on the two parameters, so as to obtain the emotion analysis result. 
In practical, the second multilayer perceptron above includes three branch networks which correspond to three feature mapping modes respectively of: performing the linear combination mapping based on the preset facial action unit, performing the linear combination mapping based on the plural preset basic emotion types, and performing the linear characterization mapping based on the positive-negative degree and the intense degree of the emotion, such that the obtained emotion analysis result includes the emotion analysis result obtained according to each of the feature mapping modes). Regarding claim 9, Zhang discloses the electronic device of Claim 8, wherein: to extract the video features, the at least one processing device is configured to (i) split the video frames into multiple collections of video frames, (ii) perform face detection in the collections of video frames, and (iii) process the collections of video frames based on results of the face detection in order to identify the video features associated with the at least one face (see claim 1, also, abstract, an object emotion analysis method and apparatus and an electronic device are provided. The method includes: extracting a static facial feature and a dynamic feature from multimedia data associated with a target object, wherein the dynamic feature includes one or more of an expression change feature, a sound feature and a language content feature: inputting the static facial feature and the dynamic feature into a pre-trained object emotion analysis model, fusing the static facial feature and the dynamic feature by the object emotion analysis model, and outputting an emotion analysis result. Also, page 7, paragraph, [0063] In the above, in the feature mapping mode of performing linear combination mapping based on the preset facial action unit, expressions of the face are “divided into” a plurality of action units in advance according to muscle distribution of the face, and when the face expresses the emotion through the expression, the expression is represented by a linear combination of the action units. In some embodiments, after the branch network receives the spliced feature, the feature mapping mode thereof includes calculating a linear weight of each action unit according to the spliced feature, and performing linear combination on the action units using the linear weights, so as to obtain the emotion analysis result); to extract the audio features, the at least one processing device is configured to (i) process the audio data in order to identify a first subset of the audio features associated with waveforms of the audio data and (ii) process the audio data using a pre-trained audio model in order to identify a second subset of the audio features (see, abstract, the method includes: extracting a static “facial feature” and a dynamic feature from multimedia data associated with a target object, wherein the dynamic feature includes one or more of an expression change feature, a “sound feature” and a language content feature: inputting the static facial feature and the dynamic feature into a pre-trained object emotion analysis model, fusing the static facial feature and the dynamic feature by the object emotion analysis model, and outputting an emotion analysis result. 
Also, page 1, paragraph, [0006] in some aspects, the present disclosure further provides a non-transitory machine-readable storage medium, where the machine-readable storage medium stores machine executable instructions, and when invoked and executed by a processor, the machine executable instructions cause the processor to implement an object emotion analysis method, the object emotion analysis method including: acquiring multimedia data associated with a target object, and extracting a static facial feature and a dynamic feature of the target object from the multimedia data, where the dynamic feature includes at least one of an expression change feature, a sound feature or a language content feature of the target object; and inputting the static facial feature and the dynamic feature into an object emotion analysis model, wherein the object emotion analysis model is “pre-trained”, fusing the static facial feature and the dynamic feature by the object emotion analysis model to obtain a fusion feature, and outputting an emotion analysis result of the target object based on the fusion feature. Also page 4, paragraphs, [0036] and [0039], if the afore-mentioned dynamic feature includes the sound feature, a specified audio sequence is acquired from the audio data in the multimedia data, where the audio sequence includes a sound signal emitted by the target object: the sound feature of the target object is extracted from the audio sequence by a “pre-trained sound” feature extraction model, where the sound feature includes one or more of a frequency feature, a tone feature, a pitch feature, and an energy feature. The subtitle data is usually in a text format and records words spoken by the target object, such that the language content text of the target object can be directly obtained from the subtitle data. For the audio data, the words spoken by the target object in the audio data can be recognized by a voice recognition tool, so as to obtain the language content text in the text format. In some examples, the language content text of the target object may be extracted from the subtitle data or audio data corresponding to the video frame sequence from which the expression change feature is extracted. In one example, the language content text is “Oh, my god”, and the language content text typically contains a feature characterizing a surprised emotion). Regarding claim 11, Zhang discloses the electronic device of Claim 8, wherein the trained machine learning model comprises: at least one cross-modal transformer encoder layer configured to receive and fuse the video features and a first subset of the audio features and generate multi-modal features; at least one fusion encoder layer configured to combine the multi-modal features; and a multi-layer perceptron (MLP) decoder layer configured to decode outputs of the at least one fusion encoder layer as fused with a second subset of the audio features (see claim 1, also page 4, paragraph, [0040] the language content feature extraction model is mainly configured to identify the semantic feature of the language content text above, and can be implemented by a text feature model bidirectional “encoder” representation from transformers (BERT) or by other text semantic feature extraction models. The language content feature extraction model can be trained using a corpus with a large data volume and can extract the feature between text words of adjacent texts. 
Since the semantic feature is extracted by the language content feature extraction model and characterize the language meaning of the language emitted by the target object. Also page 6, paragraphs, [0058-0059] further, the output result of the “cross-attention network” is further required to be processed as follows to obtain the fusion feature: performing second fusion processing on the output result and the first fusion result corresponding to the second input parameter to obtain a second fusion result, where the second input parameter is obtained by “transforming” the first fusion result; inputting the second fusion result into a preset first multilayer perceptron (MLP), and performing mapping on the second fusion result by the first multilayer perceptron to obtain a mapping result; and performing third fusion processing on the mapping result and the second fusion result to obtain the fusion feature. In some embodiments, the first fusion result above is the first fusion result after performing the first fusion processing on the intermediate feature output from the self-attention network and the dynamic feature. In some embodiments, the second fusion processing may include performing feature addition on the output result and the first fusion result to obtain an addition result, and then performing normalization processing on the addition result to obtain the second fusion result. In some embodiments, the feature addition may be feature splicing, or addition of feature data located at same position points, or the like. In some embodiments, the third fusion processing above may include performing feature addition on the mapping result and the second fusion result to obtain an addition result, and then performing normalization processing on the addition result to obtain the fusion feature. In some embodiments, the feature addition may be feature splicing, or addition of feature data located at same position points, or the like. The first multilayer perceptron above may be implemented by a multilayer perceptron (MLP) network. Finally, page 10, paragraphs, [0099-0101] the result output module is further configured to: perform second fusion processing on the output result and the first fusion result corresponding to the second input parameter to obtain a second fusion result, where the second input parameter is obtained by transforming the first fusion result: input the second fusion result into a preset first multilayer perceptron (MLP), and performing mapping on the second fusion result by the first multilayer perceptron to obtain a mapping result; and perform third fusion processing on the mapping result and the second fusion result to obtain the fusion feature. 
The second multilayer perceptron above includes a plurality of branch networks; and the result output module is further configured to: input the spliced features into the plurality of branch networks of the second multilayer perceptron respectively, where each of the branch networks is preset with a feature mapping mode corresponding to the branch network, and the feature mapping mode includes more of the follows: performing linear “combination” mapping based on a preset facial action unit, performing linear combination mapping based on a plurality of preset basic emotion types, and performing linear characterization mapping based on a positive-negative degree and an intense degree of the emotion; and map the spliced features by the branch networks according to the feature mapping modes corresponding to the branch networks, so as to obtain the emotion analysis results output by the branch networks). Regarding claim 12, Zhang discloses the electronic device of Claim 8, wherein: the trained machine learning model comprises a multi-modal transformer, the multi- modal transformer comprising one or more cross-modal transformer encoder layers and one or more fusion encoder layers; the trained machine learning model is further configured to fuse outputs of the multi- modal transformer with a second subset of the audio features; and the trained machine learning model is configured to fuse the video features and a first subset of the audio features in one of: an earlier layer in the multi-modal transformer to support an early-late fusion of the video features and the audio features; a later layer in the multi-modal transformer to support a late-late fusion of the video features and the audio features; or a layer between the earlier and later layers in the multi-modal transformer to support a mid-late fusion of the video features and the audio features (see claim 1, also page 4, paragraph, [0040] the language content feature extraction model is mainly configured to identify the semantic feature of the language content text above, and can be implemented by a text feature model bidirectional “encoder” representation from transformers (BERT) or by other text semantic feature extraction models. The language content feature extraction model can be trained using a corpus with a large data volume and can extract the feature between text words of adjacent texts. Since the semantic feature is extracted by the language content feature extraction model and characterize the language meaning of the language emitted by the target object. Also page 6, paragraphs, [0058-0059] further, the output result of the “cross-attention network” is further required to be processed as follows to obtain the fusion feature: performing second fusion processing on the output result and the first fusion result corresponding to the second input parameter to obtain a second fusion result, where the second input parameter is obtained by “transforming” the first fusion result; inputting the second fusion result into a preset first multilayer perceptron (MLP), and performing mapping on the second fusion result by the first multilayer perceptron to obtain a mapping result; and performing third fusion processing on the mapping result and the second fusion result to obtain the fusion feature. In some embodiments, the first fusion result above is the first fusion result after performing the first fusion processing on the intermediate feature output from the self-attention network and the dynamic feature. 
In some embodiments, the second fusion processing may include performing feature addition on the output result and the first fusion result to obtain an addition result, and then performing normalization processing on the addition result to obtain the second fusion result. In some embodiments, the feature addition may be feature splicing, or addition of feature data located at same position points, or the like. In some embodiments, the third fusion processing above may include performing feature addition on the mapping result and the second fusion result to obtain an addition result, and then performing normalization processing on the addition result to obtain the fusion feature. In some embodiments, the feature addition may be feature splicing, or addition of feature data located at same position points, or the like. The first multilayer perceptron above may be implemented by a multilayer perceptron (MLP) network. Finally, page 10, paragraphs, [0099-0101] the result output module is further configured to: perform second fusion processing on the output result and the first fusion result corresponding to the second input parameter to obtain a second fusion result, where the second input parameter is obtained by transforming the first fusion result: input the second fusion result into a preset first multilayer perceptron (MLP), and performing mapping on the second fusion result by the first multilayer perceptron to obtain a mapping result; and perform third fusion processing on the mapping result and the second fusion result to obtain the fusion feature. The second multilayer perceptron above includes a plurality of branch networks; and the result output module is further configured to: input the spliced features into the plurality of branch networks of the second multilayer perceptron respectively, where each of the branch networks is preset with a feature mapping mode corresponding to the branch network, and the feature mapping mode includes more of the follows: performing linear “combination” mapping based on a preset facial action unit, performing linear combination mapping based on a plurality of preset basic emotion types, and performing linear characterization mapping based on a positive-negative degree and an intense degree of the emotion; and map the spliced features by the branch networks according to the feature mapping modes corresponding to the branch networks, so as to obtain the emotion analysis results output by the branch networks). Regarding claim 14, Zhang discloses the electronic device of Claim 8, wherein the trained machine learning model is trained to recognize multiple emotions arranged in a hierarchy, two root categories of the hierarchy comprising positive emotions and negative emotions (see claim 1, also pages 6-7, paragraphs, [0062-0066] further, in order to make the emotion analysis result more accurate and reasonable, in the present embodiment, the object emotion analysis model outputs analysis results of a plurality of emotion analysis modes. Based on this, the second multilayer perceptron above includes a plurality of branch networks; and in the “training process”, each branch network learns a feature mapping mode corresponding to one “emotion analysis mode”. 
The spliced features are input into the plurality of branch networks of the second multilayer perceptron respectively, where each of the branch networks is preset with a feature mapping mode corresponding to the branch network; the feature mapping mode includes more of the follows: performing linear combination mapping based on a preset facial action unit, performing linear combination mapping based on plural preset basic emotion types, and performing linear characterization mapping based on a “positive-negative” degree and an intense degree of the emotion; and the spliced features are mapped by the branch networks according to the feature mapping modes corresponding to the branch networks, so as to obtain the emotion analysis results output by the branch networks. In the above, in the feature mapping mode of performing linear combination mapping based on the preset facial action unit, expressions of the face are divided into a plurality of action units in advance according to muscle distribution of the face, and when the face expresses the emotion through the expression, the expression is represented by a linear combination of the action units. In the feature mapping mode of performing linear combination mapping based on the plural preset basic emotion types, emotions are divided in advance into plural basic emotions, such as neutrality, “happiness, sadness, surprise, fears, anger”, (categories), aversions, or the like. In some embodiments, after the branch network receives the spliced features, the feature mapping mode thereof includes calculating a linear weight of each basic emotion according to the spliced feature, and performing linear combination on the basic emotions using the linear weights, so as to obtain the emotion analysis result. In some embodiments, in the feature mapping mode of performing linear characterization mapping based on the positive-negative degree and the intense degree of the emotion, after the branch network receives the spliced features, the feature mapping mode thereof includes calculating a parameter of the positive-negative degree and a parameter of the intense degree according to the spliced features, and characterizing the emotion based on the two parameters, so as to obtain the emotion analysis result. In practical, the second multilayer perceptron above includes three branch networks which correspond to three feature mapping modes respectively of: performing the linear combination mapping based on the preset facial action unit, performing the linear combination mapping based on the plural preset basic emotion types, and performing the linear characterization mapping based on the positive-negative degree and the intense degree of the emotion, such that the obtained emotion analysis result includes the emotion analysis result obtained according to each of the feature mapping modes). With regard to claims 6, 8, 13, 15, 16, and 18-20 the arguments analogous to those presented above for claims 1, 2, 4, 5, 7, 9, 11, 12 and 14, are respectively applicable to claims 6, 8, 13, 15, 16, and 18-20. Allowable Subject Matter Claims 3, 10 and 17 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Contact Information Any inquiry concerning this communication or earlier communications from the examiner should be directed to Seyed Azarian whose telephone number is (571) 272-7443. 
The examiner can normally be reached Monday through Thursday from 6:00 a.m. to 7:30 p.m. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Moyer, can be reached at (571) 272-9523. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). /SEYED H AZARIAN/ Primary Examiner, Art Unit 2667, January 20, 2026
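
The claim 4 / claim 11 architecture that the examiner maps onto Zhang's cross-attention fusion can be hard to picture from the block quotation above. The following is a rough structural sketch of that claim language using standard PyTorch modules; it is not the applicant's implementation, Zhang's model, or Chen's SCN, and all module choices, dimensions, and the mean-pooling step are assumptions made for illustration.

import torch
import torch.nn as nn

class MultiTieredFusion(nn.Module):
    """Illustrative only: one reading of the claim 4 / claim 11 structure."""
    def __init__(self, d_model=256, n_heads=4, d_audio2=64, n_emotions=7):
        super().__init__()
        # "At least one cross-modal transformer encoder layer": video features
        # attend to a first subset of the audio features.
        self.cross_modal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # "At least one fusion encoder layer" combining the multi-modal features.
        self.fusion = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # "MLP decoder layer" over the fusion output concatenated with a second
        # subset of the audio features (the late step of the multi-tiered fusion).
        self.mlp = nn.Sequential(
            nn.Linear(d_model + d_audio2, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_emotions),
        )

    def forward(self, video_feats, audio_feats_1, audio_feats_2):
        # video_feats:   (batch, T_v, d_model), e.g. per-collection face features
        # audio_feats_1: (batch, T_a, d_model), first audio subset (waveform-based)
        # audio_feats_2: (batch, d_audio2), second audio subset (pretrained audio model)
        multimodal, _ = self.cross_modal(video_feats, audio_feats_1, audio_feats_1)
        fused = self.fusion(multimodal).mean(dim=1)   # pool over time
        return self.mlp(torch.cat([fused, audio_feats_2], dim=-1))   # emotion logits

# Hypothetical shapes, for illustration:
model = MultiTieredFusion()
logits = model(torch.randn(2, 16, 256), torch.randn(2, 50, 256), torch.randn(2, 64))
print(logits.shape)   # torch.Size([2, 7])

Under this reading, claim 5's early-late, mid-late, and late-late variants would simply place the cross-modal layer at a different depth within a stack of such encoder layers before the final fusion with the second audio subset.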

Prosecution Timeline

Nov 21, 2022: Application Filed
May 13, 2025: Examiner Interview (Telephonic)
May 15, 2025: Non-Final Rejection (§103)
Jul 09, 2025: Applicant Interview (Telephonic)
Jul 15, 2025: Examiner Interview Summary
Aug 19, 2025: Response Filed
Sep 18, 2025: Examiner Interview (Telephonic)
Sep 21, 2025: Final Rejection (§103)
Nov 04, 2025: Response after Non-Final Action
Nov 12, 2025: Examiner Interview (Telephonic)
Dec 19, 2025: Request for Continued Examination
Jan 16, 2026: Response after Non-Final Action
Jan 23, 2026: Non-Final Rejection (§103), current

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602783: SYSTEM AND METHODS FOR AUTOMATIC IMAGE ALIGNMENT OF THREE-DIMENSIONAL IMAGE VOLUMES (granted Apr 14, 2026; 2y 5m to grant)
Patent 12597134: IMAGE PROCESSING DEVICE, METHOD, AND PROGRAM (granted Apr 07, 2026; 2y 5m to grant)
Patent 12598264: Color Correction for Electronic Device with Immersive Viewing (granted Apr 07, 2026; 2y 5m to grant)
Patent 12586206: METHOD FOR IDENTIFYING A MATERIAL BOUNDARY IN VOLUMETRIC IMAGE DATA (granted Mar 24, 2026; 2y 5m to grant)
Patent 12573039: IMAGING SYSTEMS AND METHODS USEFUL FOR PATTERNED STRUCTURES (granted Mar 10, 2026; 2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 90% (99% with interview, +11.7% lift)
Median Time to Grant: 2y 3m
PTA Risk: High
Based on 901 resolved cases by this examiner. Grant probability is derived from the career allow rate.
