Last updated: May 29, 2026

Application No. 18/792,131

Apparatus For Estimating Emotion Using Multimodal Model And Method Of Training The Same

Non-Final OA §103

Filed

Aug 01, 2024

Priority

Mar 08, 2024 — RE 10-2024-0033415 +1 more

Examiner

GODBOLD, DOUGLAS

Art Unit

2655

Tech Center

2600 — Communications

Assignee

Emma Healthcare Co. Ltd.

OA Round

1 (Non-Final)

Interview Optional

Interview lift (+10.6%) is below the 15.0% threshold. A written response is recommended.

Based on 1089 resolved cases, 2023–2026

Examiner Intelligence

GODBOLD, DOUGLAS

Grants 83% — above average

Career Allowance Rate

906 granted / 1089 resolved

+21.2% vs TC avg

+10.6%

Interview Lift (moderate)

resolved cases with interview vs. without

Typical timeline

2y 9m

Avg Prosecution

18 currently pending

Career history

1106

Total Applications

across all art units

Statute-Specific Performance

§101

6.5%

-33.5% vs TC avg

§103

76.6%

+36.6% vs TC avg

§102

7.0%

-33.0% vs TC avg

§112

4.7%

-35.3% vs TC avg

Based on career data from 1089 resolved cases

Office Action

§103

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This Office Action is in response to correspondence filed 30 September 2024 in reference to application 18/782,131.  Claims 1-9 and 21 are pending and have been examined.

Response to Amendment
The preliminary amendment filed 30 September 2024 has been accepted and considered in this office action. Claims 10-20 were cancelled and new claim 21 was added.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-9 and 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over Makuichi et al. (Multimodal Emotional Recognition with High-level Speech and Text Features) in view of Kahn et al. (MSER: Multimodal speech emotion recognition using cross-attention with deep fusion).

Consider claim 1, Makuichi teaches an apparatus for estimating emotion (abstract), the apparatus comprising:
a speech emotion representation (SER) model processing unit configured to input an acquired speaker identity feature and an acquired speaker speech data feature to a predetermined SER model (figure 1, input of features and speaker identity embeddings, s), encode the speaker identity feature and the speaker speech data feature through a speech encoder in the SER model (section 3.1.1 wav2vec feature extraction), and encode the encoded speaker identity feature and speaker speech data feature through a disentangled feature encoder to output phonetic logits (figure 1, and section 3.1.3, disentangled speech representation and logits generated); 
a text emotion representation (TER) model processing unit configured to input a text feature acquired from transferred text corresponding to the speaker identity feature and the speaker speech data feature to a predetermined TER model, and encode the text feature through a text encoder in the TER model to output a text feature logits (figure 1, text input into text encoder, section 3.2, TER generates output logits); and 
a representation fusion (RF) model processing unit configured to input the output phonetic logits of the speaker and the output text logits to a predetermined RF model, and output a probability for each emotion category of the speaker (figure 1, score fusion, section 3.3 outputs fused to generate final probabilities for emotional categories.).
Makuichi does not specifically teach:
the SER outputs a phonetic feature query, a phonetic feature key, and a phonetic feature value of a speaker
the TER outputs a text feature query, a text feature key, and a text feature value of the text feature
a representation fusion (RF) model processing unit configured to input the output phonetic feature query, phonetic feature key, and phonetic feature value of the speaker and the output text feature query, text feature key, and text feature value to a predetermined RF model, and output a probability for each emotion category of the speaker by applying a cross-attention mechanism.
In the same field of multimodal emotion recognition, Kahn teaches:
the SER outputs a phonetic feature query, a phonetic feature key, and a phonetic feature value of a speaker (section 3, page 3, SER outputs query, key, and value vectors (Q, H and V))
the TER outputs a text feature query, a text feature key, and a text feature value of the text feature (sections 3.2 and 3.3, TER also generates query, key, and value vectors (Q, H and V))
a representation fusion (RF) model processing unit configured to input the output phonetic feature query, phonetic feature key, and phonetic feature value of the speaker and the output text feature query, text feature key, and text feature value to a predetermined RF model, and output a probability for each emotion category of the speaker by applying a cross-attention mechanism (sections 3.3-3.4, cross-attention is used to fuse the SER and TER vectors Q, H and V to generate final output probabilities for the classes).
It would have been obvious to one of ordinary skill in the art at the time of effective filing to use cross-attention and query, key, and value vectors as taught by Kahn in the system of Makuichi in order to improve the accuracy of emotion classification (Kahn abstract).
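For orientation, the score-level fusion that the rejection attributes to Makuichi (per-modality logits fused into final class probabilities) might look roughly like the sketch below. This is an illustration only: the equal weighting, the function names `softmax` and `score_fusion`, and the toy logits are assumptions, not details taken from either reference or from the application.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_fusion(speech_logits, text_logits, speech_weight=0.5):
    # Assumed fusion rule: weighted average of the per-modality probabilities.
    p_speech = softmax(speech_logits)
    p_text = softmax(text_logits)
    return [
        speech_weight * ps + (1.0 - speech_weight) * pt
        for ps, pt in zip(p_speech, p_text)
    ]

# Hypothetical logits for four emotion categories.
fused = score_fusion([2.0, 0.5, 0.1, -1.0], [1.0, 1.5, 0.0, -0.5])
print([round(p, 3) for p in fused])
```

In a scheme like this the two modalities interact only at the output scores, whereas the claimed RF model exchanges query, key, and value features between modalities before classification. That difference is the gap the rejection relies on Kahn to fill.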

Consider claim 2, Makuichi teaches the apparatus for estimating emotion according to claim 1, wherein the speaker identity feature and the speaker speech data feature encoded by the speech encoder in the SER model processing unit correspond to speech representation encoded before down-sampling is applied (figure 1, section 3.1.1, down-sampling is applied at the end of encoding).

Consider claim 3, Kahn teaches the apparatus for estimating emotion according to claim 1, wherein the cross-attention mechanism in the RF model processing unit comprises at least one of a first cross-attention mechanism to which the acquired phonetic feature query, text feature key, and text feature value are applied or a second cross-attention mechanism to which the acquired text feature query, phonetic feature key, and phonetic feature value are applied (figure 1 and section 3.3, cross-attention applied to the key, query, and value outputs from the SER and TER).
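The two cross-attention directions recited in claim 3 can be illustrated with a minimal sketch of standard scaled dot-product attention. The toy feature values, the variable names, and the use of plain scaled dot-product attention are assumptions made for illustration; neither the application nor Kahn is confirmed to use exactly this formulation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query row attends over all key rows
    # and returns the matching weighted average of the value rows.
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended

# Hypothetical toy features: 2 speech frames and 3 text tokens, width 2.
phonetic_q = [[1.0, 0.0], [0.0, 1.0]]
phonetic_k = [[0.5, 0.5], [1.0, -1.0]]
phonetic_v = [[0.2, 0.8], [0.9, 0.1]]
text_q = [[1.0, 1.0], [0.0, 2.0], [-1.0, 0.5]]
text_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_v = [[0.1, 0.3], [0.7, 0.2], [0.4, 0.6]]

# First mechanism: phonetic query with text key and text value.
first = cross_attention(phonetic_q, text_k, text_v)
# Second mechanism: text query with phonetic key and phonetic value.
second = cross_attention(text_q, phonetic_k, phonetic_v)
print(len(first), len(second))  # 2 3
```

Each output has one row per query, so the first mechanism yields speech-aligned features informed by text, and the second yields text-aligned features informed by speech.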

Consider claim 4, Makuichi and Kahn teach the apparatus for estimating emotion according to claim 1, wherein the RF model processing unit outputs, to the user, an emotion category having a highest probability among probability output values for respective emotion categories of the speaker (Makuichi, figure 1, output class probabilities; section 6, applying model to label data. Kahn, figure 1, outputting classified emotion; also see figure 3, predicted labels).
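The limitation of claim 4 amounts to selecting the arg-max over the per-category probabilities. A minimal sketch follows; the category labels and probability values are hypothetical and are not taken from the application or the cited references.

```python
def top_emotion(probabilities):
    # Return the category label whose probability is largest.
    return max(probabilities, key=probabilities.get)

# Hypothetical RF model output for one utterance.
probs = {"angry": 0.10, "happy": 0.62, "neutral": 0.20, "sad": 0.08}
print(top_emotion(probs))  # happy
```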

Consider claim 5, Makuichi teaches the apparatus for estimating emotion according to claim 1, wherein the speaker speech data feature in the SER model processing unit is a wav2vec feature or a wav2vec2.0 feature extracted from a predetermined speech recognition pre-learning model (section 3.1.1 wav2vec feature extraction).

Consider claim 6, Makuichi teaches a method of estimating emotion (abstract), the method comprising:
inputting an acquired speaker identity feature and an acquired speaker speech data feature to a predetermined SER model (figure 1, input of features and speaker identity embeddings), 
encoding the speaker identity feature and the speaker speech data feature through a speech encoder in the SER model (section 3.1.1 wav2vec feature extraction), 
encoding the encoded speaker identity feature and speaker speech data feature through a disentangled feature encoder to output phonetic logits (figure 1, and section 3.1.3, disentangled speech representation and logits generated); 
inputting a text feature acquired from transferred text corresponding to the speaker identity feature and the speaker speech data feature to a predetermined TER model (figure 1, text feature input into text encoder), and 
encoding the text feature through a text encoder in the TER model to output a text feature logits (figure 1, text input into text encoder, section 3.2, TER generates output logits); and 
inputting the output phonetic logits of the speaker and the output text logits to a predetermined RF model, and output a probability for each emotion category of the speaker (figure 1, score fusion, section 3.3 outputs fused to generate final probabilities for emotional categories.).
Makuichi does not specifically teach:
outputting a phonetic feature query, a phonetic feature key, and a phonetic feature value of a speaker
outputting a text feature query, a text feature key, and a text feature value of the text feature
inputting the output phonetic feature query, phonetic feature key, and phonetic feature value of the speaker and the outputting text feature query, text feature key, and text feature value to a predetermined RF model, and output a probability for each emotion category of the speaker by applying a cross-attention mechanism.
In the same field of multimodal emotion recognition, Kahn teaches:
outputting a phonetic feature query, a phonetic feature key, and a phonetic feature value of a speaker (section 3, page 3, SER outputs query, key, and value vectors (Q, H and V))
outputting a text feature query, a text feature key, and a text feature value of the text feature (sections 3.2 and 3.3, TER also generates query, key, and value vectors (Q, H and V))
inputting the output phonetic feature query, phonetic feature key, and phonetic feature value of the speaker and the outputting text feature query, text feature key, and text feature value to a predetermined RF model, and output a probability for each emotion category of the speaker by applying a cross-attention mechanism (sections 3.3-3.4, cross-attention is used to fuse the SER and TER vectors Q, H and V to generate final output probabilities for the classes).
It would have been obvious to one of ordinary skill in the art at the time of effective filing to use cross-attention and query, key, and value vectors as taught by Kahn in the system of Makuichi in order to improve the accuracy of emotion classification (Kahn abstract).

Claim 7 contains similar limitations as claim 2 and is therefore rejected for the same reasons.

Claim 8 contains similar limitations as claim 3 and is therefore rejected for the same reasons.

Claim 9 contains similar limitations as claim 4 and is therefore rejected for the same reasons.

Consider claim 21, Makuichi teaches the apparatus for estimating emotion according to claim 1, wherein the speaker identity feature in the SER model processing unit is extracted from Resemblyzer (section 3.1.2, speaker identity features using Resemblyzer).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Alisamir et al. (US PAP 2023/197106) teaches a similar multimodal emotion recognition system.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DOUGLAS C GODBOLD whose telephone number is (571)270-1451. The examiner can normally be reached 6:30am-5pm Monday-Thursday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

DOUGLAS GODBOLD
Examiner
Art Unit 2655



/DOUGLAS GODBOLD/Primary Examiner, Art Unit 2655

Prosecution Timeline

Aug 01, 2024

Application Filed

Mar 20, 2026

Non-Final Rejection mailed — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/384,009

Patent 12640138

REAL-TIME VOICE RECOGNITION METHOD, MODEL TRAINING METHOD, APPARATUSES, DEVICE, AND STORAGE MEDIUM

2y 7m to grant Granted May 26, 2026

18/449,237

Patent 12626690

SYSTEMS, METHODS, AND DEVICES FOR LOW-POWER AUDIO SIGNAL DETECTION

2y 9m to grant Granted May 12, 2026

18/365,765

Patent 12614553

METHOD, APPARATUS, ELECTRONIC DEVICE, AND MEDIUM FOR SPEECH PROCESSING

2y 8m to grant Granted Apr 28, 2026

18/429,150

Patent 12614037

LARGE LANGUAGE MODEL INTERFACE FOR COMPLEX DATABASES

2y 2m to grant Granted Apr 28, 2026

18/739,304

Patent 12613919

Error Correcting of Programming Code Generated Through Integration with Generative Artificial Intelligence

1y 10m to grant Granted Apr 28, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2

Expected OA Rounds

83%

Grant Probability

94%

With Interview (+10.6%)

2y 9m (~11m remaining)

Median Time to Grant

Low

PTA Risk

Based on 1089 resolved cases by this examiner. Grant probability derived from career allowance rate.
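The projected figures can be checked against the examiner's career counts with a short calculation. The sketch assumes the interview lift is added as percentage points to the career allowance rate, which matches the numbers shown on this page but is not stated as the tool's actual method.

```python
granted = 906
resolved = 1089
interview_lift_points = 10.6  # assumed to be additive percentage points

# Career allowance rate, used here as the grant probability.
allowance_rate = 100 * granted / resolved
with_interview = allowance_rate + interview_lift_points

print(f"{allowance_rate:.1f}%")   # 83.2%
print(f"{with_interview:.1f}%")   # 93.8%
```

Rounded to whole percentages, these give the 83% and 94% shown under Prosecution Projections.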