Prosecution Insights
Last updated: April 19, 2026
Application No. 18/644,124

METHOD FOR GENERATING AUDIO-BASED ANIMATION WITH CONTROLLABLE EMOTION VALUES AND ELECTRONIC DEVICE FOR PERFORMING THE SAME.

Status: Final Rejection (§103)
Filed: Apr 24, 2024
Examiner: WELCH, DAVID T
Art Unit: 2613
Tech Center: 2600 — Communications
Assignee: Fluentt Inc.
OA Round: 2 (Final)
Grant Probability: 82% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 2m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 82% (247 granted / 303 resolved), +19.5% vs TC avg — above average
Interview Lift: +27.2% across resolved cases with an interview vs. without
Typical Timeline: 3y 2m average prosecution, 29 applications currently pending
Career History: 332 total applications across all art units
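As a quick sanity check, the headline allow rate follows directly from the displayed counts, and the "+19.5% vs TC avg" delta implies a Tech Center baseline of roughly 62%. A minimal Python sketch (the baseline is inferred from the displayed delta, not reported directly):

```python
# Reproduce the Examiner Intelligence headline figures from the displayed counts.
granted, resolved = 247, 303

allow_rate = granted / resolved           # ~0.815, shown rounded as 82%
tc_delta = 0.195                          # "+19.5% vs TC avg" as displayed
implied_tc_avg = allow_rate - tc_delta    # ~0.62 -- inferred, not reported directly

print(f"Career allow rate: {allow_rate:.1%}")        # 81.5%
print(f"Implied TC average: {implied_tc_avg:.1%}")   # 62.0%
```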

Statute-Specific Performance

§101: 11.6% (-28.4% vs TC avg)
§102: 20.6% (-19.4% vs TC avg)
§103: 47.4% (+7.4% vs TC avg)
§112: 12.2% (-27.8% vs TC avg)
Tech Center averages are estimates • Based on career data from 303 resolved cases
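The report does not say how these per-statute figures are computed; one plausible reading is the share of this examiner's rejections that cite each statute. A hedged sketch of that kind of tally (the office_actions records and values are hypothetical, not taken from the report):

```python
from collections import Counter

# Hypothetical per-office-action records; each entry lists the statutes cited in one action.
office_actions = [
    {"app": "18/644,124", "statutes": ["103"]},          # sample data for illustration only
    {"app": "17/111,111", "statutes": ["101", "103"]},
    {"app": "17/222,222", "statutes": ["102", "112"]},
]

counts = Counter(s for oa in office_actions for s in oa["statutes"])
total = sum(counts.values())
for statute, n in sorted(counts.items()):
    print(f"§{statute}: {n / total:.1%} of cited rejection statutes")
```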

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-12 are rejected under 35 U.S.C. 103 as being unpatentable over Seol et al. (U.S. Patent Application Publication No. 2024/0013462), referred to herein as Seol, in view of Kumar et al. (U.S. Patent Application Publication No. 2024/0424398), referred to herein as Kumar.

Regarding claim 1, Seol teaches a device for generating an audio-based and emotion-adjustable animation, the device comprising: a memory storing one or more instructions; and at least one processor, wherein the at least one processor performs operations of (figs 4 and 9; processor 902, memory 920; paragraphs 118 and 120) receiving an audio source by executing the stored instructions, and inputting the audio source into a pre-training feature extractor to extract at least one audio-based control function for generating the emotion-adjustable animation (figs 4; paragraph 31, lines 6-26; paragraph 41, lines 1-11; paragraph 46, lines 12-17; audio is received and input into the pre-training feature extractor to extract an audio-based control function for generating an emotion-adjustable animation); determining conditional features through at least one first feature extracted based on the audio-based control function (paragraph 31, lines 10-26; paragraph 47, lines 5-8; voice-based conditional features are determined), at least one second feature extracted based on reference data (paragraph 32, lines 1-14; paragraph 47, lines 1-5 and 18-21; the style data and ground truth data are extracted), and at least one third feature extracted based on animation data (paragraph 33, lines 1-15; paragraph 46, the last 11 lines; animation data is also extracted); training a training module to generate the emotion-adjustable animation based on the conditional features, and generating the emotion-adjustable animation through an inference module based on a target audio source and a target image input value (fig 2, inference module 206, as one example; figs 4; paragraphs 34 and 38; paragraph 43, lines 1-9 and 24-38; paragraph 46, lines 1-3, 12-17, and the last 11 lines; paragraph 47, lines 1-18 and the last 13 lines; the training module generates the emotion-adjustable animation based on the features and generates an emotion animation through an inference module based on target audio and target input image data), wherein the inference module includes an animation model and an emotion controller, the animation model being configured to generate a synthetic animation based on the target audio source and the target image input value, and the emotion controller being configured to generate the emotion-adjustable animation by adjusting an emotion value of the synthetic animation based on emotion data extracted from the target audio source (fig 2, the animation model 212 generating a synthetic animation based on the audio and image inputs, as one example, and emotion controller 210 receiving adjusted emotion values for the synthetic animation from components 201, 220, and 222, as one example; paragraph 30, lines 1-21; paragraphs 31 and 32; paragraph 43, lines 1-9 and 24-55; paragraph 46, lines 1-3, 12-17, and the last 11 lines; paragraph 47, lines 1-18 and the last 13 lines), and wherein the inference module adjusts a degree of emotion of the emotion-adjustable animation based on the audio-based control function (figs 4; paragraph 31, lines 1-21; paragraph 43, lines 31-55; the animation is dynamically emotion-adjusted to varying degrees as emotion changes from frame to frame, based on the audio control function and image information).

Seol does not explicitly teach a diffusion model to generate the animation. However, in a similar field of endeavor, Kumar teaches a device for generating an audio-based and emotion-adjustable animation by receiving audio and image data and extracting features to generate the animation (figs 3 and 11; abstract; paragraphs 26 and 28), wherein the device comprises a diffusion model to generate the animation (fig 11; paragraph 27, lines 1-13; paragraph 49). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the diffusion model of Kumar with the animation model of Seol because this helps to create realistic and contextually accurate animations, thereby producing high-fidelity, higher quality animation output across the various modalities of the animation such as text, images, and audio (see, for example, Kumar, paragraph 27, the last 11 lines).

Regarding claim 2, Seol in view of Kumar teaches the device of claim 1, wherein the pre-training feature extractor is pre-trained through an operation including extracting feature vector values from the audio source using an audio encoder and classifying the audio source into at least one emotion based on the extracted feature vector values through a style block (Seol, fig 3; paragraph 31, lines 6-35; paragraph 41, lines 1-11 and 14-19).

Regarding claim 3, Seol in view of Kumar teaches the device of claim 2, wherein the style block includes a first block operating to classify a first emotion, a second block operating to classify a second emotion, a third block operating to classify a third emotion, and a fourth block operating to classify a fourth emotion (Seol, figs 3 and 4, vector 310 and the four emotions in figs 3 and 4; paragraph 31, lines 6-35; paragraph 41, lines 1-11 and 14-19).
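To make the claim 1 architecture at issue easier to follow, here is a purely illustrative Python sketch of an inference module that pairs an animation model with an emotion controller driven by audio-derived emotion data. Every name and the per-frame scaling rule are hypothetical placeholders, not the applicant's, Seol's, or Kumar's implementation.

```python
from dataclasses import dataclass

# Illustrative stand-ins for the claim 1 inference module: an animation model
# produces a synthetic animation from target audio + image, and an emotion
# controller adjusts each frame's emotion value using emotion data extracted
# from the same audio. All logic here is hypothetical.

@dataclass
class Frame:
    mouth_open: float
    emotion_value: float   # 0.0 = neutral, 1.0 = maximum expression

def animation_model(audio: list[float], image: bytes) -> list[Frame]:
    # Hypothetical: one frame per audio sample, lip motion tied to amplitude.
    return [Frame(mouth_open=abs(a), emotion_value=0.5) for a in audio]

def emotion_controller(frames: list[Frame], audio_emotion: list[float]) -> list[Frame]:
    # Adjust the degree of emotion frame by frame based on the audio-derived values.
    return [
        Frame(f.mouth_open, min(1.0, f.emotion_value * (1.0 + e)))
        for f, e in zip(frames, audio_emotion)
    ]

target_audio = [0.1, 0.4, 0.9]            # hypothetical audio samples
audio_emotion = [0.0, 0.2, 0.8]           # hypothetical per-frame emotion intensities
synthetic = animation_model(target_audio, image=b"")
adjusted = emotion_controller(synthetic, audio_emotion)
print([round(f.emotion_value, 2) for f in adjusted])   # [0.5, 0.6, 0.9]
```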
Regarding claim 4, Seol in view of Kumar teaches the device of claim 3, wherein the at least one processor is pre-trained: by placing a first weight value, to determine the audio source as the first emotion through the first block based on the extracted feature vector values; by placing a second weight value, to determine the audio source as the second emotion through the second block based on the extracted feature vector values; by placing a third weight value, to determine the audio source as the third emotion through the third block based on the extracted feature vector values; and by placing a fourth weight value, to determine the audio source as the fourth emotion through the fourth block based on the extracted feature vector values (Seol, fig 3, the weight values assigned to each emotion; fig 4, the weight values assigned to each emotion in the right column; paragraph 31, the last 10 lines; paragraph 41, lines 14-35; paragraph 43, lines 1-9 and 24-45).

Regarding claim 5, Seol in view of Kumar teaches the device of claim 1, wherein the at least one processor trains the training module through an operation of extracting audio features from the audio source (Seol, figs 4; paragraph 31, lines 6-26; paragraph 41, lines 1-11; paragraph 46, lines 12-17), an operation of extracting facial expression features from the animation data (Seol, paragraph 32, lines 1-14; paragraph 34, lines 1-9; paragraph 47, lines 5-18), and an operation of generating the emotion-adjustable animation based on at least one of the first feature and the second feature and based on the audio features and the facial expression features (Seol, figs 4; paragraph 34, lines 1-9; paragraph 43, lines 1-9 and 24-38; paragraph 46, lines 1-3; paragraph 47, lines 1-18).

Regarding claim 6, Seol in view of Kumar teaches the device of claim 5, wherein the at least one processor extracts a plurality of frames from the audio source, maps the audio features to each of the plurality of frames, extracts an emotion indicator based on the audio features mapped to each of the plurality of frames, and determines emotion values based on the emotion indicator (Seol, paragraph 31, lines 6-26; paragraph 32; paragraph 33, lines 1-15; paragraph 34; paragraph 43, lines 1-9 and 24-38; paragraph 47, lines 1-21).

Regarding claim 7, Seol in view of Kumar teaches the device of claim 6, wherein the emotion indicator is determined to have emotion if the audio feature is equal to or greater than a predetermined threshold value, and the emotion indicator is determined to have no emotion if the audio feature is less than the predetermined threshold value (Seol, figs 3 and 4; paragraph 23, the last 13 lines; paragraph 24, lines 1-5 and the last 9 lines; paragraph 26, lines 1-10; paragraph 32, lines 1-14; paragraph 41, lines 1-11 and the last 11 lines; if the audio-based “neutral” emotion amount is set and other audio-based emotion amounts are zero [below a threshold], there is no emotion determined for the emotion style; conversely, if the audio-based emotion amounts are set [above a threshold] then the emotion style is determined to have the emotion [or blend of emotions]).
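Claims 3 through 7, as characterized above, describe a style block with four per-emotion blocks, per-emotion weight values, and a threshold test that decides whether the audio carries emotion at all. The following Python sketch illustrates that general pattern; the emotion labels, weights, threshold, and prototype scoring are hypothetical and are not taken from the application or the cited references.

```python
import numpy as np

EMOTIONS = ["joy", "anger", "sadness", "neutral"]   # hypothetical four-emotion set
WEIGHTS = np.array([1.0, 1.2, 0.9, 0.5])            # hypothetical per-block weight values
THRESHOLD = 0.3                                      # hypothetical emotion threshold

def classify_emotion(feature_vector: np.ndarray) -> tuple[str, bool]:
    """Weighted four-block classification plus a threshold-based emotion indicator."""
    # Each "block" scores the audio feature vector for one emotion; a dot product
    # against a per-emotion prototype stands in for a learned block here.
    prototypes = np.eye(len(EMOTIONS), feature_vector.shape[0])   # hypothetical prototypes
    scores = WEIGHTS * (prototypes @ feature_vector)

    strongest = EMOTIONS[int(np.argmax(scores))]
    has_emotion = float(np.max(scores)) >= THRESHOLD   # below threshold -> "no emotion"
    return strongest, has_emotion

features = np.array([0.6, 0.1, 0.05, 0.2])            # hypothetical audio feature vector
print(classify_emotion(features))                      # ('joy', True)
```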
Regarding claim 8, Seol in view of Kumar teaches the device of claim 1, wherein the generating of the emotion-adjustable animation includes extracting a target audio feature by using the target audio source as an input value and determining a target emotion indicator, determining a target emotion value based on the target emotion indicator, and generating the emotion-adjustable animation reflecting the target audio feature and the target emotion value based on the target image input value (Seol, figs 4; paragraph 34, lines 1-9; paragraph 43, lines 1-9 and 24-55; paragraph 46, lines 1-3; paragraph 47, lines 1-18).

Regarding claim 9, Seol in view of Kumar teaches the device of claim 8, wherein the at least one processor generates the emotion-adjustable animation by controlling the target emotion value, and the target emotion value is controlled by assigning different weights to multiple emotion indicators (Seol, figs 3 and 4; paragraph 31, lines 6-35; paragraph 41, lines 1-11 and 14-19; paragraph 43, lines 1-9 and 24-55).

Regarding claim 10, Seol in view of Kumar teaches the device of claim 8, wherein the at least one processor obtains at least one prompt, determines an emotion variable based on the at least one prompt, and determines the target emotion value based on the emotion variable, and the reference data is determined based on the at least one prompt (Seol, paragraph 31, lines 6-26; paragraph 32, lines 1-14; paragraph 33, lines 1-15; paragraph 46, lines 12-17 and the last 11 lines).

Regarding claim 11, Seol in view of Kumar teaches the device of claim 10, wherein the prompt is provided in at least one of a text form, an image form, a video form, an animation form, or any combination thereof (Seol, paragraph 31, lines 6-13; paragraph 32, lines 1-7; paragraph 43, lines 1-9 and 24-40; image, video, and/or animation prompts, for example), and the target emotion value is determined by considering different weights based on the emotion variable (Seol, figs 3 and 4; paragraph 31, the last 10 lines; paragraph 41, lines 14-35; paragraph 43, lines 1-9 and 24-45; different weights are assigned to each emotion and are used to determine the target emotion).

Regarding claim 12, the limitations of this claim substantially correspond to the limitations of claim 1; thus they are rejected on similar grounds.

Response to Arguments

Applicant’s arguments with respect to the claim objections have been fully considered, and are persuasive. The amendments have resolved this issue; thus the claim objection is withdrawn.

Applicant’s arguments with respect to the claim rejections have been fully considered, but are not persuasive. On pages 8 and 9 of the Applicant’s Remarks, with respect to the amended claim 1, the Applicant argues, in summary, that Seol does not teach the amended limitations regarding the diffusion model. While the Examiner agrees that Seol does not teach the diffusion model, it is respectfully submitted that this argument is moot in view of the new grounds of rejection presented above.

On page 9 of the Applicant’s Remarks, with respect to amended claim 1, the Applicant argues, in summary, that the Office Action asserts that Seol’s emotion animation may be adjusted manually or automatically, but this does not teach adjusting the emotion based on the extracted audio control function. The Examiner respectfully disagrees with this argument.
The citations to Seol clearly describe that the emotion-adjustable animation dynamically adjusts the emotion animation based, in part, on changes in emotion within the audio control function. As just one example among many, paragraph 43 of Seol states, in relevant part:

“…the emotion and/or style may change during such a clip or segment, such as at various points in time or for/at specific frames of animation, which can be referred to herein as emotional keyframes. An emotional keyframe can indicate when one or more values for an emotion and/or style is to change…As illustrated in FIG. 4B, a different time point 452 in the same audio clip 402 is associated with very different emotion and style values. This may occur in response to something that triggers a change in the state of the character at a point in the audio file. At the first time point, the character was animated with an emotional state that was a combination of joy and neutral emotional state. As for the style, the character was to convey these emotions with a style that is both relatively professional and focused. At the second time point 452 as illustrated in FIG. 4B, the emotional state of the user has changed significantly, as illustrated by the updated character reconstruction. In this example, the character now has an emotional state that is primarily anger with a little disgust. With respect to style, this character now conveys these emotions with very high intensity and focus. As illustrated, this can have drastic impact on the motion of the facial components during these different points in the audio.”

The Examiner respectfully submits that Seol is replete with such disclosure, and that this particular citation, among many others, explicitly teaches adjusting the emotion based on the extracted audio control function. Any differences between the Applicant’s audio control function and that of the prior art are not currently reflected in the claims.

On page 9 of the Applicant’s Remarks, with respect to claim 12, the Applicant argues that this claim is not taught by Seol for the same reasons as those discussed in regard to claim 1. The Examiner respectfully disagrees with this argument, for the reasons discussed above.

Conclusion

The following prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

Karras et al., “Audio-driven facial animation by joint end-to-end learning of pose and emotion,” ACM Transactions on Graphics, 2017.
Peng et al., “EmoTalk: Speech-driven emotional disentanglement for 3D face animation,” arXiv preprint, March 20, 2023.
Sadiq et al., “Emotion Dependent Facial Animation from Affective Speech,” IEEE 22nd International Workshop on Multimedia Signal Processing, 2020.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID T WELCH whose telephone number is (571)270-5364. The examiner can normally be reached Monday-Thursday, 8:30-5:30 EST, and alternate Fridays, 9:00-2:30 EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao Wu, can be reached at 571-272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

DAVID T. WELCH
Primary Examiner
Art Unit 2613

/DAVID T WELCH/Primary Examiner, Art Unit 2613
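For reference, the reply-period language above maps to concrete dates. A hedged worked example using the third-party python-dateutil package, assuming the Feb 28, 2026 date from the prosecution timeline below is the mailing date and ignoring weekend, holiday, and end-of-month adjustments:

```python
from datetime import date
from dateutil.relativedelta import relativedelta   # third-party: python-dateutil

# Assumed mailing date of this final action, taken from the prosecution timeline below.
mailed = date(2026, 2, 28)

shortened_period = mailed + relativedelta(months=3)    # THREE-MONTH shortened statutory period
absolute_deadline = mailed + relativedelta(months=6)   # SIX-MONTH statutory maximum with extensions

print(f"Shortened statutory period expires: {shortened_period}")        # 2026-05-28
print(f"Latest possible reply (with extensions): {absolute_deadline}")  # 2026-08-28
```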

Prosecution Timeline

Apr 24, 2024: Application Filed
Nov 24, 2025: Non-Final Rejection — §103
Feb 11, 2026: Response Filed
Feb 28, 2026: Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602742: IMAGE PROCESSING APPARATUS, BINARIZATION METHOD, AND NON-TRANSITORY RECORDING MEDIUM (2y 5m to grant; granted Apr 14, 2026)
Patent 12602842: TEXTURE GENERATION USING MULTIMODAL EMBEDDINGS (2y 5m to grant; granted Apr 14, 2026)
Patent 12592048: System and Method for Creating Anchors in Augmented or Mixed Reality (2y 5m to grant; granted Mar 31, 2026)
Patent 12579734: METHOD FOR RENDERING VIEWPOINTS AND ELECTRONIC DEVICE (2y 5m to grant; granted Mar 17, 2026)
Patent 12573119: APPARATUS AND METHOD FOR GENERATING SPEECH SYNTHESIS IMAGE (2y 5m to grant; granted Mar 10, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 82%
With Interview: 99% (+27.2%)
Median Time to Grant: 3y 2m
PTA Risk: Moderate
Based on 303 resolved cases by this examiner. Grant probability derived from career allow rate.
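The footnote ties grant probability to the career allow rate, and the with-interview figure sits 27.2 points higher. One plausible reading, sketched below, is that the with-interview and without-interview allow rates bracket the 82% career figure; the implied without-interview rate and interview share are inferences, not reported values.

```python
# Back out the figures implied by the displayed projections (a plausible reading,
# not the report's documented methodology).
career_allow_rate = 0.82    # displayed grant probability / career allow rate
with_interview = 0.99       # displayed "With Interview" probability
interview_lift = 0.272      # displayed interview lift

without_interview = with_interview - interview_lift   # ~0.718, implied

# If p is the share of resolved cases that had an interview, the career rate is a blend:
# career = p * with_interview + (1 - p) * without_interview.  Solve for p.
p_interview = (career_allow_rate - without_interview) / (with_interview - without_interview)

print(f"Implied without-interview allow rate: {without_interview:.1%}")   # 71.8%
print(f"Implied share of cases with an interview: {p_interview:.0%}")     # 38%
```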
