DETAILED ACTION
This communication is in response to the Application filed on August 7, 2024.
Claims 1-12 are pending and have been examined.
Claims 1, 11, and 12 are independent.
Foreign priority: February 28, 2022.
PCT/KR2023/001005 was filed on January 20, 2023.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on August 7, 2024 and March 28, 2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Drawings
The drawings filed on August 7, 2024 have been accepted and considered by the Examiner.
Double Patenting Note
The Examiner notes that U.S. Patent Application Publication No. 2025/0201018 was analyzed for double patenting. However, based on the current claim scope, no double patenting was found.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 4-7, and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Popovic et al. ("Automatic Speech Recognition and Natural Language Understanding for Emotion Detection in Multi-party Conversations," Proceedings of the 1st International Workshop on Multimodal Conversational AI, 2020), hereinafter referred to as Popovic, in view of Krishna et al. ("Using large pre-trained models with cross-modal attention for multi-modal emotion recognition," arXiv:2108.09669v2, 2021), hereinafter referred to as Krishna.
Regarding Claim 1, Popovic teaches:
1. An emotion recognition method using an audio stream performed by an emotion recognition apparatus, the emotion recognition method comprising:
receiving an audio signal having a preset unit length to generate the audio stream corresponding to the audio signal; [Popovic, Figure 1, “Audio (i.e., the claimed “audio signal”) was extracted as mono, 16 kHz (i.e., the claimed “audio stream corresponding to the audio signal”),” Pg. 34]
[Image: media_image1.png]
converting the audio stream into a text stream corresponding to the audio stream; and [Popovic, Figure 1 clearly shows converting the audio stream into a text stream corresponding to the audio stream: “textual data (i.e., the claimed “text stream”) and features extracted from audio (i.e., the claimed “converting the audio stream into text stream corresponding to the audio stream”),” Pg. 33; “transcript (i.e., the claimed “text stream corresponding to the audio stream”) is generated using ASR (i.e., the claimed “converting the audio stream”),” Pg. 33; “Our experiments were performed on audio files extracted from the video files in the MELD dataset. Audio was extracted as mono, 16 kHz (i.e., the claimed “audio stream”),” Pg. 34; “Once the ASR transcripts for the MELD data set were generated (i.e., the claimed “converting the audio stream into text stream corresponding to the audio stream”),” Pg. 34]
inputting the audio stream and the converted text stream to a pre-trained emotion recognition model to output a multi-modal emotion corresponding to the audio signal. [Popovic, Figure 1, “Multimodal emotion detection”; Figure 1 clearly shows inputting in parallel the audio stream and the converted text stream into a pre-trained emotion recognition model: “In the pre-processing phase they extract mel-frequency cepstrum (MFCC) features for audio (i.e., the claimed “audio stream”) and a pre-trained model of Bidirectional Encoder Representations from Transformers (BERT) for embedding text information (i.e., the claimed “text stream”). These features (i.e., the claimed “audio stream and the converted text stream”) are fed in parallel (i.e., the claimed “audio stream and the converted text stream”) to the self-attention mechanism base RNNs (i.e., the claimed “emotion recognition model”),” Pg. 33; “achieve the task of speech-to-text conversion,” Pg. 34; “Pre-trained models (i.e., the claimed “pre-trained emotion recognition model”) available within “Transformers” (bert-base-uncased for BERT, distilbert-base-uncased for DistilBERT and roberta-base for RoBERTa) were used as starting points to create classifiers for the 7 emotion classes in MELD (anger, disgust, fear, joy, neutral, sadness, surprise) (i.e., the claimed “output a multi-modal emotion corresponding to the audio signal”).” Pg. 34]
Popovic fails to explicitly teach an audio signal having a preset unit length.
However Krishna teaches:
[Image: media_image2.png]
receiving an audio signal having a preset unit length to generate the audio stream corresponding to the audio signal; [Krishna, “The audio encoder base takes raw waveform signal as input (i.e., the claimed “receiving an audio signal”) and generates a 1024 dimensional feature representation of the speech signal every 20 ms (i.e., the claimed “audio signal having a preset unit length”),” Pg. 3; “The model has a two-streams (i.e., the claimed “audio stream” and “text stream”) in the network called the audio network (i.e., the claimed “audio stream”) and text network.” Pg. 2]
inputting the audio stream and the converted text stream to a pre-trained emotion recognition model to output a multi-modal emotion corresponding to the audio signal. [Krishna, Figure 1 clearly shows inputting the audio stream and the text stream into a pre-trained emotion recognition model. “The model has a two-streams (i.e., the claimed “audio stream” and “text stream”) in the network called the audio network (i.e., the claimed “audio stream”) and text network.” Pg. 2; “We propose using large self-supervised pre-trained models (i.e., the claimed “pre-trained emotion recognition model”) for both audio and text modality (i.e., the claimed “audio stream and text stream”) with cross-modality attention for multimodal emotion recognition.” Pg. 1; “In this work, propose to use large self-supervised pre-trained models (i.e., the claimed “pre-trained emotion recognition model”) along with cross-modal attention for the multi-modal emotion recognition task.” Pg. 1]
Popovic and Krishna pertain to multimodal emotion recognition systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the multimodal emotion recognition systems art to modify Popovic’s teachings of “textual data (i.e., the claimed “text stream”) and features extracted from audio (i.e., the claimed “converting the audio stream into text stream corresponding to the audio stream”)” that are “fed in parallel to the self-attention mechanism base RNNs (i.e., the claimed “emotion recognition model”)” (Popovic, Pg. 33, Pg. 34) with the explicit teachings of “speech signal every 20 ms (i.e., the claimed “audio signal having a preset unit length”)” (Krishna, Pg. 3) taught by Krishna in order to enable “pre-trained models (i.e., the claimed “pre-trained emotion recognition model”) for both audio and text modality (i.e., the claimed “audio stream and text stream”) with cross-modality attention for multimodal emotion recognition.” (Krishna, Pg. 1).
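For orientation only, and not as part of the rejection, the parallel two-stream arrangement discussed above (acoustic features and BERT text embeddings fed in parallel to a classifier over the 7 MELD emotion classes) can be sketched as follows. All dimensions, the random stand-in features, and the untrained linear classifier are illustrative assumptions; this is not the actual architecture of either Popovic or Krishna:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the references these would be MFCC features
# extracted from mono 16 kHz audio and a BERT embedding of the ASR transcript.
audio_features = rng.standard_normal(40)    # assumed 40-dim acoustic feature vector
text_embedding = rng.standard_normal(768)   # assumed 768-dim BERT sentence embedding

# The two modality features are "fed in parallel": here simply concatenated
# and passed through an untrained, randomly initialized linear classifier
# over the 7 MELD emotion classes.
MELD_CLASSES = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]
fused = np.concatenate([audio_features, text_embedding])           # (808,)
W = rng.standard_normal((len(MELD_CLASSES), fused.shape[0])) * 0.01
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                               # softmax over 7 classes
print(MELD_CLASSES[int(np.argmax(probs))])
```

The output is meaningless here (random weights); the sketch only illustrates the parallel-input, single-multimodal-output data flow recited in claim 1.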
Regarding Claim 4, Popovic in view of Krishna has been discussed above. The combination further teaches:
wherein the outputting includes a pre-feature extraction step of extracting a first feature from the audio stream and extracting a second feature from the text stream; [Popovic, see mapping applied to claim 1; Krishna, see mapping applied to claim 1; Popovic, Figure 1, “Audio features” (i.e., the claimed “first feature from the audio stream”), “Textual features” (i.e., the claimed “second feature from the text stream”); “textual data and features (i.e., the claimed “second feature from the text stream”) extracted from audio,” Pg. 33; “In the pre-processing phase (i.e., the claimed “pre-feature extraction step”), they extract mel-frequency cepstrum (MFCC) features (i.e., the claimed “first feature from the audio stream”) for audio,” Pg. 33; Krishna, “We use cross-modal attention layers to learn interactive information between the two modalities by aligning speech features (i.e., the claimed “first feature from the audio stream”) and text features (i.e., the claimed “second feature from the text stream”) in both directions.” Pg. 1]
[Image: media_image3.png]
a uni-modal feature extraction step of extracting a first embedding vector from the first feature and extracting a second embedding vector from the second feature; [Popovic clearly shows in Figure 1 unimodal feature extraction of textual feature and unimodal feature extraction of audio feature: Figure 1, “Audio features Unimodal emotion detection”, “Textual features Unimodal emotion detection”; Popovic, see mapping applied to claim 1; Krishna, see mapping applied to claim 1; Krishna, “We also conduct unimodal experiments for both audio and text modalities (i.e., the claimed “uni-modal feature extraction step of extracting a first embedding vector from the first feature and extracting a second embedding vector from the second feature”) and compare them with previous best methods.” Pg. 1]
[Image: media_image4.png]
a multi-modal feature extraction step of correlating the first embedding vector with the second embedding vector to extract a first multi-modal feature and a second multi-modal feature; and [Popovic clearly shows in Figure 1 multi-modal correlation of the first multi-modal feature (audio feature) and the second multi-modal feature (textual feature): Figure 1, “Audio feature” (i.e., the claimed “first multi-modal feature”) and “Textual feature” (i.e., the claimed “second multi-modal feature”) fed in parallel into “Multi-modal emotion detection”; Popovic, see mapping applied to claim 1; Krishna, see mapping applied to claim 1; Krishna, “This work shows that we can jointly fine-tune the pre-trained model in a multi-modal setting for better emotion recognition performance. We use the cross-modal attention mechanism to learn the alignment between audio (i.e., the claimed “first multi-modal feature”) and text (i.e., the claimed “second multi-modal feature”) features,” Pg. 2; “The feature vectors from the text encoder base,” Pg. 2; “embedding vectors extracted from text (i.e., the claimed “second embedding vector”),” Pg. 1; “speech embeddings (i.e., the claimed “first embedding vector”),” Pg. 1]
concatenating the first multi-modal feature with the second multi-modal feature in a channel direction. [Popovic, see mapping applied to claim 1; Krishna, Eq. 5 clearly shows concatenating the first multi-modal feature with the second multi-modal feature in a channel direction; Krishna, see mapping applied to claim 1; Krishna, “We perform an early fusion of audio (i.e., the claimed “first multi-modal feature”) and text (i.e., the claimed “second multi-modal feature”) features to combine both modalities. In this case, we use simple concatenation as the early fusion operator.” Pg. 3; “The feature encoder contains seven 1D convolution blocks and each block have 512 channels,” Pg. 4; “This bidirectional attention from both modalities helps capture interactive information from both directions and thus improves emotion classification accuracy.” Pg. 3]
[Image: media_image5.png]
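As a purely illustrative aside, the "simple concatenation" early-fusion operator quoted above — joining the two multi-modal features along the channel axis — amounts to the following; the (time, channel) shapes are assumed for illustration only:

```python
import numpy as np

# Assumed (time, channels) feature maps, one per modality.
audio_mm = np.zeros((50, 256))   # stand-in for the first multi-modal feature
text_mm = np.ones((50, 256))     # stand-in for the second multi-modal feature

# Concatenation in the channel direction (axis=1): the early-fusion operator.
fused = np.concatenate([audio_mm, text_mm], axis=1)
print(fused.shape)  # (50, 512)
```

The channel count doubles (256 + 256 = 512) while the time dimension is unchanged, which is what "concatenating ... in a channel direction" requires.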
Regarding Claim 5, Popovic in view of Krishna has been discussed above. The combination further teaches:
wherein the uni-modal feature extraction step includes inputting the first feature to a first convolutional layer to extract a third feature having a preset dimension; [Popovic, see mapping applied to claims 1, 4; Krishna, see mapping applied to claims 1, 4; Popovic, “Each block consists of one or more modules with 1D time-channel separable convolutional layers (i.e., the claimed “first convolutional layer”), batch normalization, and ReLU layers.” Pg. 234; “The audio feature (i.e., the claimed “first feature”) encoder consists of 2 layers of 1-D convolution (i.e., the claimed “first convolutional layer”) followed by a single BLSTM layer.” Pg. 2; “These feature sequences (i.e., the claimed “third feature”) from the audio feature encoder and text encoder base are fed into a cross-modal attention module,” Pg. 2; Krishna, “The audio encoder base takes raw waveform signal as input and generates a 1024 dimensional feature representation (i.e., the claimed “third feature having preset dimension”) of the speech signal every 20 ms.” Pg. 3; Krishna, “The Audio feature encoder takes a sequence of contextual features C (i.e., the claimed “first feature”) from the last layer of the wav2vec2.0 model as input and performs a 1D convolution operation on the feature sequences.” Pg. 2]
inputting the third feature to a first self-attention layer to acquire a first embedding vector including correlation information between words in a sentence corresponding to the audio stream; [Popovic, see mapping applied to claims 1, 4; Krishna, see mapping applied to claims 1, 4; Krishna, “It consists of multilayer self-attention (i.e., multilayer consists of the claimed “first self-attention layer”) blocks similar to transformers, but instead of fixed positional encoding, the model uses relative positional encoding.” Pg. 2; “speech embeddings (i.e., the claimed “first embedding vector including correlation information between words in a sentence corresponding to the audio stream”) along with acoustic features for improved emotion recognition,” Pg. 1; “The wav2vec2.0 is one such model trained on ~53k hours of unlabelled audio data in a self-supervised fashion using CPC objective.” Pg. 1; “The audio encoder base uses wav2vec2.0 model, and the text encoder uses BERT architecture. We use cross-modal attention layers to learn interactive information between the two modalities by aligning speech features (i.e., the claimed “first embedding vector including correlation information between words in a sentence corresponding to the audio stream”) and text features in both directions. These aligned features are pooled using statistics pooling and concatenated to get utterance level feature representation.” Pg. 1]
inputting the second feature to a second convolutional layer to extract a fourth feature having the dimension; and [Popovic, see mapping applied to claims 1, 4; Krishna, see mapping applied to claims 1, 4; Krishna, “Text encoder takes sequence of N feature vectors (i.e., the claimed “second feature”) from BERT W=[w1, w2…wN] as input to 1D convolutional layer (i.e., the claimed “second convolutional layer”), where wi is a 768 dimensional feature vector (i.e., the claimed “fourth feature having the dimension”) for the ith token.” Pg. 3]
inputting the fourth feature to a second self-attention layer to acquire a second embedding vector including correlation information between words in a sentence corresponding to the text stream. [Popovic, see mapping applied to claims 1, 4; Krishna, see mapping applied to claims 1, 4; Krishna, “It consists of multilayer self-attention (i.e., multilayer consists of the claimed “second self-attention layer”) blocks similar to transformers, but instead of fixed positional encoding, the model uses relative positional encoding.” Pg. 2; “The convolutional layer acts as a projection layer where it projects the word embedding (i.e., the claimed “second embedding vector including correlation information between words in a sentence corresponding to the text stream”) dimension from 768 to 256 and does not alter the time dimension.” Pg. 3; “We use BERT (Bidirectional Encoder Representation from Transformer) as the text encoder in our architecture.” Pg. 3; “This model takes a sequence of tokenized words as input (i.e., the claimed “words in a sentence corresponding to the text stream”) and produces feature representations (i.e., the claimed “second embedding vector including correlation information between words in a sentence corresponding to the text stream”) that contain rich contextual information.” Pg. 3]
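For illustration only, the projection-then-self-attention pattern quoted above (a convolutional projection of 768-dim token vectors down to 256, followed by self-attention that relates tokens to one another) can be sketched as below. The width-1 convolution is modeled as a plain matrix projection, and all weights and token counts are illustrative assumptions:

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention (queries = keys = values = x).
    The score matrix captures token-to-token (word-to-word) correlation."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(1)
tokens = rng.standard_normal((10, 768))   # assumed: 10 BERT token vectors (second feature)

# A width-1 convolution is equivalent to a per-token linear projection 768 -> 256
# (mirroring the quoted projection to a preset dimension).
W_proj = rng.standard_normal((768, 256)) * 0.01
fourth_feature = tokens @ W_proj          # (10, 256)

# Self-attention over the projected tokens yields an embedding sequence whose
# rows mix information across words in the sentence.
second_embedding = self_attention(fourth_feature)  # (10, 256)
print(second_embedding.shape)
```

The same pattern, with audio frames in place of text tokens, corresponds to the first-feature branch recited in the claim.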
Regarding Claim 6, Popovic in view of Krishna has been discussed above. The combination further teaches:
[Image: media_image6.png]
wherein the multi-modal feature extraction step includes inputting a query embedding vector generated based on the first embedding vector to a first cross-modal transformer and inputting a key embedding vector and a value embedding vector generated based on the second embedding vector to extract the first multi-modal feature; and [Krishna clearly shows in Figure 2 “CMA1” (i.e., the claimed “first cross-modal transformer”); Popovic, see mapping applied to claims 1, 5; Krishna, see mapping applied to claims 1, 5; Krishna, “These feature sequences from the audio feature encoder and text encoder base are fed into a cross-modal attention module (i.e., the claimed “cross-modal transformer”),” Pg. 2; “The cross-modal attention (CMA-1) module (i.e., the claimed “first cross-modal transformer”) in the audio network takes the audio feature encoder’s output from the last BLSTM layer as the query vectors (i.e., the claimed “query embedding vector generated based on the first embedding vector”) and output of the text encoder base as key and value vectors (i.e., the claimed “key embedding vector and a value embedding vector generated based on the second embedding vector”) and applies multi-head scaled dot product attention.” Pg. 2]
[Image: media_image7.png]
inputting a query embedding vector generated based on the second embedding vector to a second cross-modal transformer and inputting a key embedding vector and a value embedding vector generated based on the first embedding vector to extract the second multi-modal feature. [Krishna clearly shows in Figure 2 “CMA2” (i.e., the claimed “second cross-modal transformer”); Popovic, see mapping applied to claims 1, 5; Krishna, see mapping applied to claims 1, 5; Krishna, “These feature sequences from the audio feature encoder and text encoder base are fed into a cross-modal attention module (i.e., the claimed “cross-modal transformer”),” Pg. 2; Similarly, “The cross-modal attention (CMA-2) module (i.e., the claimed “second cross-modal transformer”) in the text network (CMA-2) takes the output from the text encoder base after projection as query vectors (i.e., the claimed “query embedding vector generated based on the second embedding vector”) and output from BLSTM of the audio feature encoder as key and value vectors (i.e., the claimed “key embedding vector and a value embedding vector generated based on the first embedding vector”) and applies multi-head scaled dot product attention.” Pg. 2]
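As a non-limiting sketch of the quoted CMA-1/CMA-2 mechanism — queries drawn from one modality, keys and values from the other, in both directions — the following uses single-head scaled dot-product attention; sequence lengths, the 256-dim width, and the single head are illustrative assumptions (the reference describes multi-head attention):

```python
import numpy as np

def cross_modal_attention(query, key_value):
    """Scaled dot-product attention with queries from one modality and
    keys/values from the other (cf. the quoted CMA-1 / CMA-2 modules)."""
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ key_value

rng = np.random.default_rng(2)
audio_emb = rng.standard_normal((50, 256))  # stand-in first embedding vector sequence
text_emb = rng.standard_normal((10, 256))   # stand-in second embedding vector sequence

# CMA-1 analogue: audio queries attend over text keys/values -> first multi-modal feature.
first_mm = cross_modal_attention(audio_emb, text_emb)   # (50, 256)
# CMA-2 analogue: text queries attend over audio keys/values -> second multi-modal feature.
second_mm = cross_modal_attention(text_emb, audio_emb)  # (10, 256)
print(first_mm.shape, second_mm.shape)
```

Each output keeps the query modality's time dimension while mixing in content from the other modality, which is the bidirectional alignment the claim maps onto.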
Regarding Claim 7, Popovic in view of Krishna has been discussed above. The combination further teaches:
outputting an audio emotion corresponding to the audio stream based on the first embedding vector; and [Popovic, see mapping applied to claims 1, 4 - 6; Krishna, see mapping applied to claims 1, 4 - 6; Popovic clearly shows in Figure 1 outputting an audio emotion (“Audio features Unimodal emotion detection”) corresponding to the audio (i.e., the claimed “audio stream”) based on the audio features (i.e., the claimed “first embedding vector”)]
outputting a text emotion corresponding to the text stream based on the second embedding vector. [Popovic, see mapping applied to claims 1, 4 - 6; Krishna, see mapping applied to claims 1, 4 - 6; Popovic clearly shows in Figure 1 outputting text emotion (“Textual features Unimodal emotion detection”) corresponding to the textual data (i.e., the claimed “text stream”) based on the textual features (i.e., the claimed “second embedding vector”)]
Regarding Claim 9, Popovic in view of Krishna has been discussed above. The combination further teaches:
wherein the outputting includes acquiring embedding vectors including correlation information between modalities; [Popovic, see mapping applied to claims 1, 4 - 6; Krishna, see mapping applied to claims 1, 4 - 6]
inputting each of the embedding vectors to a self-attention layer to extract multi-modal features including temporal correlation information; and [Popovic, see mapping applied to claims 1, 4 - 6; Krishna, see mapping applied to claims 1, 4 - 6; Popovic, “the model is trained with Connectionist Temporal Classification (CTC) loss,” Pg. 34; Krishna, “Since speech is a temporal (i.e., the claimed “temporal correlation information”) sequence, Recurrent neural networks are the best fit for processing speech signals.” Pg. 1; “The feature encoder module contains temporal (i.e., the claimed “temporal correlation information”) convolution blocks followed by layer normalization and GELU activations.” Pg. 2]
concatenating the multi-modal features in a channel direction. [Popovic, see mapping applied to claims 1, 4 - 6; Krishna, see mapping applied to claims 1, 4 - 6]
Regarding Claim 10, Popovic in view of Krishna has been discussed above. The combination further teaches:
wherein the acquiring of the embedding vectors includes acquiring a first embedding vector including correlation information between the audio stream and the text stream, [Popovic, see mapping applied to claims 1, 4 - 9; Krishna, see mapping applied to claims 1, 4 - 6, 9]
based on a weighted sum of a feature for the audio stream and a feature for the text stream; and [Popovic, see mapping applied to claims 1, 4 - 9; Krishna, see mapping applied to claims 1, 4 - 6, 9; Krishna, “The audio encoder base is initialized with pre-trained wav2vec2.0 model weights. Similarly, the text encoder base is initialized with BERT model weights.” Pg. 2; Popovic, “weighted average (i.e., the claimed “weighted sum”)” Pg. 33]
acquiring a second embedding vector including correlation information between the text stream and the audio stream, [Popovic, see mapping applied to claims 1, 4 - 9; Krishna, see mapping applied to claims 1, 4 - 6, 9]
based on the weighted sum of the feature for the text stream and the feature for the audio stream. [Popovic, see mapping applied to claims 1, 4 - 9; Krishna, see mapping applied to claims 1, 4 - 6, 9; Krishna, “The audio encoder base is initialized with pre-trained wav2vec2.0 model weights. Similarly, the text encoder base is initialized with BERT model weights.” Pg. 2; Popovic, “weighted average (i.e., the claimed “weighted sum”)” Pg. 33]
Claims 2 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Popovic in view of Krishna and Koretzky et al., (U.S. Patent Application Publication 2017/0236531), hereinafter referred to as Koretzky.
Regarding Claim 2, Popovic in view of Krishna has been discussed above. The combination further teaches:
wherein the generating includes concatenating an audio signal pre-stored in an audio buffer with the audio signal to generate the audio stream. [Popovic, see mapping applied to claim 1; Krishna, see mapping applied to claim 1]
The combination fails to teach concatenating an audio signal pre-stored in an audio buffer.
However, Koretzky teaches:
wherein the generating includes concatenating an audio signal pre-stored in an audio buffer with the audio signal to generate the audio stream. [Koretzky, “CCB logic 118 receives each fragment for a given component signal (i.e., the claimed “audio signal”) and concatenates them into a dedicated buffer (i.e., the claimed “audio buffer”),” Par. 0087; “corresponding component audio signals stored (i.e., the claimed “pre-stored”) in the local buffers (i.e., the claimed “audio buffer”).” Par. 0091; “[A]udio source 105 may be any audio stream or audio file accessible to server 102.” Par. 0049]
Popovic, Krishna and Koretzky pertain to audio systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the audio systems art to modify Popovic’s teachings of “textual data (i.e., the claimed “text stream”) and features extracted from audio (i.e., the claimed “converting the audio stream into text stream corresponding to the audio stream”)” that are “fed in parallel to the self-attention mechanism base RNNs (i.e., the claimed “emotion recognition model”)” (Popovic, Pg. 33, Pg. 34) with the explicit teachings of “speech signal every 20 ms (i.e., the claimed “audio signal having a preset unit length”)” (Krishna, Pg. 3) taught by Krishna and the teachings of “audio signals stored (i.e., the claimed “pre-stored”) in the local buffers (i.e., the claimed “audio buffer”)” and “concatenation” (Koretzky, Par. 0087, Par. 0091) taught by Koretzky in order to enable “pre-trained models (i.e., the claimed “pre-trained emotion recognition model”) for both audio and text modality (i.e., the claimed “audio stream and text stream”) with cross-modality attention for multimodal emotion recognition.” (Krishna, Pg. 1) and enable “combined signal to component signals in real-time” and “audio streaming” (Koretzky, Par. 0043, Par. 0097).
Regarding Claim 11, Popovic in view of Krishna and Koretzky has been discussed above. The combination further teaches:
11. An emotion recognition apparatus using an audio stream, the emotion recognition apparatus comprising:
an audio buffer configured to receive an audio signal with a preset unit length and generate the audio stream corresponding to the audio signal; [Popovic, see mapping applied to claims 1 - 2; Krishna, see mapping applied to claims 1 - 2; Koretzky, see mapping applied to claim 2]
a speech-to-text (STT) model configured to convert the audio stream into a text stream corresponding to the audio stream; and [Popovic, see mapping applied to claims 1 - 2; Krishna, see mapping applied to claims 1 - 2; Koretzky, see mapping applied to claim 2; Popovic, Figure 1, “ASR” (i.e., the claimed “speech-to-text (STT) model”); Popovic, “Speech-to-Text (STT),” Pg. 34]
an emotion recognition model configured to receive the audio stream and the converted text stream and output a multi-modal emotion corresponding to the audio signal. [Popovic, see mapping applied to claims 1 - 2; Krishna, see mapping applied to claims 1 - 2; Koretzky, see mapping applied to claim 2]
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Popovic in view of Krishna, Koretzky, and Kim et al. (KR20210081166A), hereinafter referred to as Kim.
Regarding Claim 3, Popovic in view of Krishna and Koretzky has been discussed above. The combination further teaches:
resetting the audio buffer when a length of the audio signal stored in the audio buffer exceeds a preset reference length. [Popovic, see mapping applied to claims 1 - 2; Krishna, see mapping applied to claims 1 - 2; Koretzky, see mapping applied to claim 2; Koretzky, “In one embodiment, a fragment size (i.e., the claimed “length of the audio signal stored in the audio buffer”) is selected that represents greater than or equal to 1.5 second and less than or equal to 3.0 seconds of audio.” Par. 0056]
The combination fails to teach resetting the audio buffer when a preset reference length is exceeded.
However, Kim teaches:
resetting the audio buffer when a length of the audio signal stored in the audio buffer exceeds a preset reference length. [Kim, “ The voice chunk splitting unit (210) resets the frame count and buffer (i.e., the claimed “audio buffer”) (S101).” Par. 0059; “exceeds a preset reference length,” Par. 0022; “First, the voice chunk splitting unit (210) checks whether the frame count exceeds a reference value (S108), Since the frame count represents the length of the frame stored in the buffer, a voice chunk is generated only when the content temporarily stored in the buffer is larger than the preset minimum voice chunk size.” Par. 0069]
Popovic, Krishna, Koretzky and Kim pertain to audio systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the audio systems art to modify Popovic’s teachings of “textual data (i.e., the claimed “text stream”) and features extracted from audio (i.e., the claimed “converting the audio stream into text stream corresponding to the audio stream”)” that are “fed in parallel to the self-attention mechanism base RNNs (i.e., the claimed “emotion recognition model”)” (Popovic, Pg. 33, Pg. 34) with the explicit teachings of “speech signal every 20 ms (i.e., the claimed “audio signal having a preset unit length”)” (Krishna, Pg. 3) taught by Krishna and the teachings of “audio signals stored (i.e., the claimed “pre-stored”) in the local buffers (i.e., the claimed “audio buffer”)” and “concatenation” (Koretzky, Par. 0087, Par. 0091) taught by Koretzky and the teachings of “reset” the “buffer” and “exceeds a preset reference length” (Kim, Par. 0022, Par. 0059) taught by Kim in order to enable “pre-trained models (i.e., the claimed “pre-trained emotion recognition model”) for both audio and text modality (i.e., the claimed “audio stream and text stream”) with cross-modality attention for multimodal emotion recognition.” (Krishna, Pg. 1) and enable “combined signal to component signals in real-time” and “audio streaming” (Koretzky, Par. 0043, Par. 0097) and improve “voice conversion systems” (Kim, Par. 0007).
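For orientation only, the buffering behavior mapped in claims 2-3 — concatenating pre-stored audio with each incoming unit-length chunk, and resetting once a preset reference length is exceeded — can be sketched as follows. The class, the 16 kHz / 1-second / 3-second figures, and the reset-to-empty policy are illustrative assumptions, not the disclosures of Koretzky or Kim:

```python
import numpy as np

class AudioBuffer:
    """Illustrative buffer: concatenates fixed-length chunks and resets when
    the stored audio exceeds a preset reference length."""
    def __init__(self, reference_len_samples):
        self.reference_len = reference_len_samples
        self.data = np.empty(0, dtype=np.float32)

    def push(self, chunk):
        # Concatenate the pre-stored audio with the incoming unit-length chunk.
        self.data = np.concatenate([self.data, chunk])
        # Reset once the stored length exceeds the preset reference length.
        if len(self.data) > self.reference_len:
            self.reset()

    def reset(self):
        self.data = np.empty(0, dtype=np.float32)

buf = AudioBuffer(reference_len_samples=16000 * 3)  # assumed 3 s reference at 16 kHz
chunk = np.zeros(16000, dtype=np.float32)           # assumed 1 s "preset unit length"
for _ in range(3):
    buf.push(chunk)
print(len(buf.data))  # 48000: three chunks accumulated, exactly at the limit
buf.push(chunk)       # 64000 > 48000, so the buffer resets
print(len(buf.data))  # 0
```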
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Popovic in view of Krishna and Liu et al., (U.S. Patent 12,431,155), hereinafter referred to as Liu.
Regarding Claim 8, Popovic in view of Krishna has been discussed above. The combination further teaches:
wherein the pre-feature extraction step includes inputting the audio stream to a Problem-Agnostic Speech Encoder+ (PASE+) to extract the first feature. [Popovic, see mapping applied to claims 1, 4 - 6; Krishna, see mapping applied to claims 1, 4 - 6; Krishna, “audio encoder” (i.e., the claimed “Speech Encoder”), Pg. 1]
The combination fails to teach PASE.
However, Liu teaches:
wherein the pre-feature extraction step includes inputting the audio stream to a Problem-Agnostic Speech Encoder+ (PASE+) to extract the first feature. [Liu, “Problem Agnostic Speech Encoder, PASE,” Col. 6:5-6]
Popovic, Krishna, and Liu pertain to audio systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the audio systems art to modify Popovic’s teachings of “textual data (i.e., the claimed “text stream”) and features extracted from audio (i.e., the claimed “converting the audio stream into text stream corresponding to the audio stream”)” that are “fed in parallel to the self-attention mechanism base RNNs (i.e., the claimed “emotion recognition model”)” (Popovic, Pg. 33, Pg. 34) with the explicit teachings of “speech signal every 20 ms (i.e., the claimed “audio signal having a preset unit length”)” (Krishna, Pg. 3) taught by Krishna and the teachings of “PASE” (Liu, Col. 6:5 -6) taught by Liu in order to enable “pre-trained models (i.e., the claimed “pre-trained emotion recognition model”) for both audio and text modality (i.e., the claimed “audio stream and text stream”) with cross-modality attention for multimodal emotion recognition.” (Krishna, Pg. 1) and “allowing for unambiguous attribution of extracted sound source signals to actual sound sources” (Liu, Col. 1:47-48).
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Popovic in view of Krishna and Ji et al., (U.S. Patent Application Publication 2020/0365148), hereinafter referred to as Ji.
Regarding Claim 12, Popovic in view of Krishna has been discussed above. The combination further teaches:
12. A computer program stored in one or more computer-readable recording media to cause each step included in the emotion recognition method according to claim 1 to be executed. [Popovic, see mapping applied to claim 1; Krishna, see mapping applied to claim 1]
The combination fails to explicitly teach computer-readable recording media.
However, Ji teaches:
12. A computer program stored in one or more computer-readable recording media to cause each step included in the emotion recognition method according to claim 1 to be executed. [Ji, “one or more of the method processes may be embodied in a computer readable device (i.e., the claimed “computer-readable recording media”) containing computer readable code such that a series of steps are performed when the computer readable code is executed on a computing device.” Par. 0016; “Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable storage device(s) (i.e., the claimed “computer-readable recording media”) having computer readable program code embodied thereon.” Par. 0118]
Popovic, Krishna, and Ji pertain to audio systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the audio systems art to modify Popovic’s teachings of “textual data (i.e., the claimed “text stream”) and features extracted from audio (i.e., the claimed “converting the audio stream into text stream corresponding to the audio stream”)” that are “fed in parallel to the self-attention mechanism base RNNs (i.e., the claimed “emotion recognition model”)” (Popovic, Pg. 33, Pg. 34) with the explicit teachings of “speech signal every 20 ms (i.e., the claimed “audio signal having a preset unit length”)” (Krishna, Pg. 3) taught by Krishna and the teachings of “one or more computer readable storage device(s) (i.e., the claimed “computer-readable recording media”)” (Ji, Par. 0118) taught by Ji in order to enable “pre-trained models (i.e., the claimed “pre-trained emotion recognition model”) for both audio and text modality (i.e., the claimed “audio stream and text stream”) with cross-modality attention for multimodal emotion recognition” (Krishna, Pg. 1) and because an “ASR engine allow[s] listening and processing of speech input” (Ji, Par. 0003).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Ho et al., ("Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network," IEEE Access, 2020) teaches extracting mel-frequency cepstral coefficient (MFCC) features for audio (i.e., the claimed “audio stream”) and a pre-trained model of Bidirectional Encoder Representations from Transformers (BERT) for embedding text information (i.e., the claimed “text stream”). These features (i.e., the claimed “audio stream and the converted text stream”) are fed in parallel to the self-attention mechanism base RNNs (i.e., the claimed “emotion recognition model”).
Tao, et al., (CN112329604A) teaches multi-modal emotion recognition.
Liu, et al., (CN116052291A) teaches multi-modal emotion recognition.
Nam, et al., (KR102500073B1) teaches multi-modal emotion recognition.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EUNICE LEE whose telephone number is 571-272-1886. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/EUNICE LEE/Examiner, Art Unit 2656
/BHAVESH M MEHTA/ Supervisory Patent Examiner, Art Unit 2656