DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
This Office Action is in response to Applicant's amendment filed 12/16/2025, which has been entered and made of record. Claims 1, 4, 5, 10, 12, 13, 15, 17 and 19 have been amended. No claims have been newly added. Claims 1-20 are pending in the application. Applicant's amendments to claim 10 have overcome each and every objection previously set forth in the Non-Final Office Action mailed 09/17/2025.
Response to Arguments
Applicant’s arguments, filed 12/16/2025, with respect to the rejection(s) under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Liu, Rey and Hussen as fully explained below.
Applicant argues that Liu, Rey, Zhuang, Zhuang2 and Wu, taken individually or in combination, do not teach the newly amended limitation of “user input comprising a user instruction for animating [an] avatar”.
Examiner agrees Liu, Rey, Zhuang, Zhuang2 and Wu do not teach the newly amended independent claims. However, a new ground of rejection is made in view of Liu, Rey and Hussen.
Conclusions: The rejections set forth in the previous Office Action are shown to have been proper, and the claims are rejected below. New citations and parenthetical remarks may be considered new grounds of rejection, and such new grounds of rejection are necessitated by Applicant's amendments to the claims. Therefore, the present Office Action is made final.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 9, 13 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over NPL Liu et al. (“MusicFace: Music-driven Expressive Singing Face Synthesis”), hereinafter as Liu, in view of de la Rey et al. (US 20230130844 A1), hereinafter as Rey, further in view of Hussen Abdelaziz et al. (US 20210248804 A1), hereinafter as Hussen.
Regarding claim 1, Liu teaches A method (Liu Page 1, Left column, Abstract, “we present a method for this task with natural motions of the lip, facial expression, head pose, and eye states.”) comprising: accessing an audio input comprising a mixture of vocal sounds and non-vocal sounds (Liu Page 1, Right Column, Figure 1, “Our goal is to synthesize a vivid dynamic singing face coherent with the input music audio, which is mixed with human voice and background music.”); separating, …… , the audio input into a first audio output representing the vocal sounds and a second audio output representing the non-vocal sounds (Liu Page 3, Right Column, last paragraph, “firstly adopts an audio source separation model O to decompose music into human voice A_v and background music A_b, then gets encoded lyric feature L and melody feature M respectively using an attention-assisted two-stream encoder E”); determining, by one or more trained avatar animation models and by separately encoding the first audio output representing the vocal sounds and the second audio output representing the non-vocal sounds, an avatar animation temporally corresponding to the audio input (Liu teaches a two-stream encoder for vocal and non-vocal sounds, and further teaches an expression generation model, a pose generation model and an eye state generation model. Page 3, Figure 2, “the Generator module generates facial driving parameters …… the Renderer module aims to synthesize a photo-realistic video. Specifically, eye state parameters are encoded into eye attention maps, and other parameters provide a 3D model guidance to render faces. Finally, an expressive and rhythmic singing face video is rendered by combining rendered faces with eye attention maps”; Page 4, Figure 3, “Our generator contains an Encoder and a Decoder. The Encoder consists of a Two-stream Audio Encoder (TSAE) and an Attention-based Modulator (ATM). The Decoder contains three downstream generators, including Expression Generation Network (EGN), Pose Generation Network (PGN), and Eye State Generation Network (ESGN).”; and Page 4, Left column, second paragraph, “The Encoder (Sec. 3.2) consists of a Two-stream Audio Encoder (TSAE) to encode lyric and melody separately”); ……and rendering, in real time and temporally coincident with the audio input, the determined avatar animation (Liu teaches temporally matching the input audio with the video sequence, such that the avatar animation is displayed synchronously with the audio input. Page 5, Left Column, first paragraph, “Furthermore, in order to incorporate temporal information and match the frequency of video frames (30 fps), the feature sequence are converted to overlapping windows of size 39 (corresponding to 390ms) at 30 fps.”; Page 5, Figure 5, Right Column, “our generated head pose dynamics are smoother than others. And the turn of dominant varying angle (shown as green curve) occurs nearly at the same time with ground truth, meaning that our generated head dynamics have more similar rhythm to the ground truth recorded by a performer.”; and Page 5, right column, last paragraph, “our two-stream design greatly reduces the complexity of the lip synchronization task, thus leading to a better synchronization result.”).
Liu does not explicitly teach by a trained audio source-separation model……receiving a user input comprising a user instruction for animating the avatar: determining, based on the user input, one or more of (1) an animation mode for the avatar or (2) an animation for the avatar;…… Rey teaches by a trained audio source-separation model (Rey paragraph [0070] “In one method a crude, general model 410 is trained on a general training dataset. General training dataset may comprise labeled source audio data and labeled noise audio data. General model 410 may be referred to as a general source separation model 410 or trained audio source separation model 410.” And paragraph [0021] “train an audio source separation model using, at least in part, the received single-track audio input stream, wherein the audio separation model is trained to receive the single-track audio input stream and generate a plurality of audio stems corresponding to one or more audio sources of the plurality of sources, and separate audio sources, using the audio source separation model, from the audio input stream in accordance with one or more processing recipes to generate a plurality of source separated output stems.”).
Liu and Rey are in the same field of endeavor, namely computer graphics, especially in the field of audio driven avatar animation. Rey teaches an improved audio source separation model to improve the accuracy and quality of source separation (Rey paragraph [0074] “This process 420 can be repeated iteratively, each improving the model's separation quality (e.g., fine tuning to improve the accuracy and/or quality of the source separation).”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to substitute the audio source separation model of Liu with the teaching of Rey to achieve better accuracy and quality with audio source separation, and eventually achieve better avatar animation result.
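For illustration only, the following is a minimal, runnable sketch of a mask-based source-separation step of the general kind at issue (a model that takes a single-track mixture and emits per-source stems). All class names, layer sizes, and the soft-mask design are hypothetical choices for the sketch and are not taken from Rey's disclosure.

```python
import torch
import torch.nn as nn

class SourceSeparator(nn.Module):
    """Toy mask-based separator: predicts one soft mask per stem over a
    magnitude spectrogram of the single-track mixture."""
    def __init__(self, n_bins: int = 257, n_stems: int = 2):
        super().__init__()
        self.n_bins, self.n_stems = n_bins, n_stems
        self.net = nn.Sequential(
            nn.Linear(n_bins, 512), nn.ReLU(),
            nn.Linear(512, n_bins * n_stems), nn.Sigmoid(),
        )

    def forward(self, mix_mag: torch.Tensor) -> torch.Tensor:
        # mix_mag: (batch, time, n_bins) magnitude spectrogram of the mixture
        b, t, _ = mix_mag.shape
        masks = self.net(mix_mag).view(b, t, self.n_stems, self.n_bins)
        # Each stem is the mixture weighted by its predicted soft mask
        return masks * mix_mag.unsqueeze(2)  # (batch, time, n_stems, n_bins)

separator = SourceSeparator()
stems = separator(torch.rand(1, 100, 257))
vocal_stem, accompaniment_stem = stems[:, :, 0], stems[:, :, 1]  # hypothetical stem roles
```

A practical system would typically operate on complex spectrograms and invert each masked stem back to a waveform; the sketch keeps only the stem-producing step.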
Liu in view of Rey fail to teach ……receiving a user input comprising a user instruction for animating the avatar: determining, based on the user input, one or more of (1) an animation mode for the avatar or (2) an animation for the avatar…… Hussen teaches ……receiving a user input comprising a user instruction for animating the avatar (Hussen paragraph [0301] “avatar animator 820 determines the type of avatar to generate based on a selection by a user……a user may select a desired type of avatar on a user interface of a user device or with a voice input that is then provided to system 800 and provided to avatar animator 820.”): determining, based on the user input, one or more of (1) an animation mode for the avatar or (2) an animation for the avatar…… (Hussen teaches a neural network that processes user input into animation parameters, and further teaches using the animation parameters in avatar animator 820 to generate an animation for the avatar, paragraphs [0307]-[0308] “After pre-processing the input, text 802 is provided to system 800 and thus to neural network 810. Neural network 810 then determines based on “Happy Birthday!” that the emotional state is “happy” based on the use of the word “Happy” in text 802, as well as the use of an exclamation point. Neural network 810 then determines speech data set 812 representing “Happy Birthday!” and animation parameters 814 as a set of parameters representing one or more movements of avatar 906. Based on the determined emotional state of “happy” animation parameters 814 …… the parameters may cause the head pose of avatar 906 to look up, which is generally associated with a person being happy, or the parameters may cause the lips of avatar 906 to make a movement similar to a smile. Animation parameters 814 are then received by avatar animator 820, which generates avatar data 822 for animating avatar 906.”).
Liu, Rey and Hussen are in the same field of endeavor, namely computer graphics, especially in the field of audio and text driven avatar animation. Hussen teaches a method for animating avatar based on user input to achieve better flexibility and accuracy (Hussen paragraph [0009] “For example, if the determined emotional state is sad the generated avatar may include movements that convey that the user is sad, resulting in a more realistic avatar that more accurately conveys the desired message to the recipient. Accordingly, generating the speech data set and the set of parameters representing one or more movements of the avatar in this manner provides for greater flexibility and accuracy, resulting in a more enjoyable user experience.”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Hussen with the method of Liu in view of Rey to achieve better flexibility and accuracy.
Regarding claim 9, Liu in view of Rey and Hussen teach The method of Claim 1, and further teach wherein the one or more trained avatar animation models comprise a trained facial expression model (Liu Page 5, right column, fourth paragraph, “3.3.1 Expression Generation Network We employ a simple MLP consisting of two fully connected layers and one ReLU activation layer to regress facial expression (including lip motion) parameters from the encoded lyric and melody features.”), the method further comprising: generating, by a trained vocal encoder of the trained facial expression model, a set of encoded vocal features (Liu teaches a vocal encoder AE_v, Page 5, Left column, second paragraph, “3.2.2 Two-stream Audio Encoder (TSAE) Given the separated human voice feature A_v and background music feature A_b, we adopt a Two-stream Audio Encoder (TSAE) that consists of two networks AE_v and AE_b to encode the MFCC features of human voice a_t^v and background music a_t^b, separately: f_t^v = AE_v(a_t^v)”); generating, by a trained non-vocal encoder of the trained facial expression model, a set of encoded non-vocal features (Liu teaches a non-vocal encoder AE_b, with the background music feature as the non-vocal feature and a separate network to encode it: f_t^b = AE_b(a_t^b)); and generating, by a decoder of the trained facial expression model, a facial-expression animation for the avatar based on the set of encoded vocal features and the set of encoded non-vocal features (Liu, Page 4, Figure 3, “The Encoder consists of a Two-stream Audio Encoder (TSAE) and an Attention-based Modulator (ATM). The Decoder contains three downstream generators, including Expression Generation Network (EGN), Pose Generation Network (PGN), and Eye State Generation Network (ESGN)” and Page 3, Figure 2, “the Generator module generates facial driving parameters (expressions, head poses and eye states)…… the Renderer module aims to synthesize a photo-realistic video.”).
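For illustration of the quoted two-stream formulas f_t^v = AE_v(a_t^v) and f_t^b = AE_b(a_t^b), a minimal sketch follows; the GRU encoders, the layer sizes, and the reuse of 39 as a feature dimension are hypothetical stand-ins and do not reproduce Liu's implementation.

```python
import torch
import torch.nn as nn

feat_dim = 39  # borrowed from Liu's window size of 39; used here only as a feature dimension

AE_v = nn.GRU(input_size=feat_dim, hidden_size=128, batch_first=True)  # vocal-stream encoder
AE_b = nn.GRU(input_size=feat_dim, hidden_size=128, batch_first=True)  # background-music encoder
expression_decoder = nn.Sequential(  # simple MLP regressor, loosely mirroring the quoted EGN
    nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64),
)

a_v = torch.rand(1, 30, feat_dim)  # MFCC-like frames of the separated human voice
a_b = torch.rand(1, 30, feat_dim)  # MFCC-like frames of the background music
f_v, _ = AE_v(a_v)                 # f_t^v = AE_v(a_t^v): encoded vocal features
f_b, _ = AE_b(a_b)                 # f_t^b = AE_b(a_t^b): encoded non-vocal features
expr = expression_decoder(torch.cat([f_v, f_b], dim=-1))  # per-frame facial-expression parameters
```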
Regarding claim 13, it recites similar limitations of claim 1 but in a non-transitory computer readable storage media form. The rationale of claim 1 rejection is applied to reject claim 13. In addition, Rey teaches One or more non-transitory computer readable storage media storing instructions and coupled to one or more processors that are operable to execute the instructions to (Rey paragraph [0021] “a system includes a memory component storing machine-readable instructions, and a logic device and/or processor configured to execute the machine-executable instructions” and paragraph [0180] “Software in accordance with the present disclosure, such as non-transitory instructions, program code, and/or data, can be stored on one or more non-transitory machine-readable mediums.”):
Liu, Rey and Hussen are in the same field of endeavor, namely computer graphics, especially in the field of audio driven avatar animation. Rey teaches an improved audio source separation model to improve the accuracy and quality of source separation (Rey paragraph [0074] “This process 420 can be repeated iteratively, each improving the model's separation quality (e.g., fine tuning to improve the accuracy and/or quality of the source separation).”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to substitute the audio source separation model of Liu in view of Hussen with the teaching of Rey to achieve better accuracy and quality with audio source separation, and eventually achieve better avatar animation result.
Regarding claim 17, it recites similar limitations of claim 1 but in an apparatus form. The rationale of claim 1 rejection is applied to reject claim 17. In addition, Rey teaches An apparatus comprising: one or more non-transitory computer readable storage media storing instructions; and one or more processors coupled to the non-transitory computer readable storage media, the one or more processors operable to execute the instructions to (Rey paragraph [0048] “The systems and methods disclosed herein may be implemented on at least one computer-readable medium carrying instructions that, when executed by at least one processor causes the at least one processor to perform any of the method steps disclosed herein. Some implementations relate to a computer system including at least one processor and a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform any method steps disclosed herein. In various implementations, models described herein may be implemented as stored data and software modules and/or code acting on the stored data.”):
Liu, Rey and Hussen are in the same field of endeavor, namely computer graphics, especially in the field of audio driven avatar animation. Rey teaches an improved audio source separation model to improve the accuracy and quality of source separation (Rey paragraph [0074] “This process 420 can be repeated iteratively, each improving the model's separation quality (e.g., fine tuning to improve the accuracy and/or quality of the source separation).”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to substitute the audio source separation model of Liu in view of Hussen with the teaching of Rey to achieve better accuracy and quality with audio source separation, and eventually achieve better avatar animation result.
Claim(s) 2, 14 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over NPL Liu et al. (“MusicFace: Music-driven Expressive Singing Face Synthesis”), hereinafter as Liu, in view of de la Rey et al. (US 20230130844 A1), hereinafter as Rey, further in view of Hussen Abdelaziz et al. (US 20210248804 A1), hereinafter as Hussen, and Xiong et al. (CN 111091800 A), hereinafter as Xiong. The original and a machine translation of Xiong are provided by the examiner.
Regarding claim 2, Liu in view of Rey and Hussen teach The method of Claim 1, and further teach wherein the trained audio source-separation model is defined by a self-supervised training process (Rey paragraph [0123] “the neural network 1500 may be trained using supervised learning where combinations of training data that include a combination of input data and a ground truth (e.g., expected) output data.”) comprising: providing, to a source-separation model, a plurality of training audio inputs (Rey paragraph [0070] “The training dataset may comprise a plurality of datasets, each of the plurality of datasets comprising labeled audio samples configured to train the system to address a source separation problem. The plurality of datasets may comprise a speech training dataset comprising a plurality of labeled speech samples, and/or a non-speech training dataset comprising a plurality of labeled music and/or noise data samples.”); for each of the training audio inputs (Rey paragraph [0117] “A neural network 1500 is implemented as a recurrent neural network, deep neural network, convolutional neural network or other suitable neural network that receives a labeled training dataset 1510 to produce audio output 1512 (e.g., one or more audio stems) for each input audio sample.”): separating, by the source-separation model, each of the plurality of training audio inputs into a first training audio output representing vocal sounds and a second training audio output representing non-vocal sounds (Rey paragraph [0095] “The process 1000 begins with a general model 1002, which may be implemented as a pretrained model as previously discussed. An input mixture 1004, such as a single-track audio signal, or a plurality of single-track audio signals, with an unseen source mixture, is processed through the general model 1002 in step 1006 to generate separated audio signals from the mixture, including machine learning separated source signals 1008 and machine learning separated noise signals 1010.”); …… classifying, by a sound classifier, (1) the encoded first training audio output as vocal or non-vocal sounds and (2) the encoded second training audio output as vocal or non-vocal sounds (Rey Figures 10 and 13D, paragraph [0095] “including machine learning separated source signals 1008 and machine learning separated noise signals 1010. In step 1012, the results are evaluated to confirm the separated audio sources have a sufficient quality (e.g., comparing an estimated MOS and threshold as previously described and/or other quality measurement). If the results are determined to be good, then the separated sources are output in step 1014.” and paragraph [0108] “output fidelity may be measured use an MOS algorithm that is able to measure divergence of the stem's output from a particular labeled dataset, such as a collection of speech samples from an individual. In some implementations, such an algorithm may be implemented as a neural network that has been pre-trained to either classify a source or to measure divergence of the source from a given dataset.”); and updating the source separation model based on (1) a similarity between the composite audio output and the respective training audio input and (2) the classifications made by the sound classifier (Rey teaches updating network parameters based on misclassification and on the difference between the output audio and the input audio mix, paragraph [0063] “If the network mislabels the input audio sample, then a backward pass through the network may be used to adjust parameters of the network to correct for the misclassification.” and paragraph [0118] “The training process to generate a trained neural network model includes a forward pass through the neural network 1500 to produce an audio stem or other desired audio output 1512. Each data sample is labeled with the desired output of the neural network 1500, which is compared to the audio output 1512. In some implementations, a cost function is applied to quantify an error in the audio output 1512 and a backward pass through the neural network 1500 may then be used to adjust the neural network coefficients to minimize the output error.”).
Liu, Rey and Hussen are in the same field of endeavor, namely computer graphics, especially in the field of audio driven avatar animation. Rey teaches an improved audio source separation model to improve the accuracy and quality of source separation (Rey paragraph [0074] “This process 420 can be repeated iteratively, each improving the model's separation quality (e.g., fine tuning to improve the accuracy and/or quality of the source separation).”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to substitute the audio source separation model of Liu in view of Hussen with the teaching of Rey to achieve better accuracy and quality with audio source separation, and eventually achieve better avatar animation result.
Liu in view of Rey and Hussen fail to teach encoding, by a vocal encoder, the first training audio output; encoding, by a music encoder, the second training audio output; constructing, by an audio decoder and based on the encoded first training audio output and the encoded second training audio output, a composite audio output; Xiong teaches encoding, by a vocal encoder, the first training audio output (Xiong paragraph [0013] “an encoding unit, configured to use a speaker voiceprint encoder in a trained singing optimization model to encode the user singing signal”); encoding, by a music encoder, the second training audio output (Xiong paragraph [0013] “and use a music encoder in the trained singing optimization model to encode the reference singing signal and the accompaniment signal”); constructing, by an audio decoder and based on the encoded first training audio output and the encoded second training audio output, a composite audio output (Xiong paragraph [0013] “a decoding unit, configured to use a spectrum decoder in a trained singing optimization model to decode based on the encoding of the user singing signal, the encoding of the reference singing signal, and the encoding of the accompaniment signal, to obtain a spectrum signal of an optimized song; a conversion unit, configured to convert the spectrum signal of the optimized song into the audio of the optimized song.”);
Liu, Rey, Hussen and Xiong are in the same field of endeavor, namely computer graphics, especially audio-related processing. Xiong teaches a neural-network-based song generation method based on singing voice and accompaniment music to enrich audio generation and improve the optimization effect (Xiong paragraph [0061] “Therefore, the song generation method of this embodiment can optimize the user's singing voice differently based on different reference songs, effectively enriching the song generation method and improving the optimization effect.”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Xiong with the method of Liu in view of Rey and Hussen to improve the optimization effect of the audio source separation model.
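For illustration only, a hypothetical training step combining the pieces recited in claim 2 (separate, encode each stream, decode a composite, classify each encoding) with a loss built from reconstruction similarity plus the classifier's outputs. Every module and size below is invented for the sketch and is not taken from Rey or Xiong.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

separator = nn.Linear(64, 128)   # stand-in: mixture -> [vocal | non-vocal] features
vocal_enc = nn.Linear(64, 32)    # vocal encoder
music_enc = nn.Linear(64, 32)    # music (non-vocal) encoder
audio_dec = nn.Linear(64, 64)    # audio decoder: composite output from both encodings
sound_clf = nn.Linear(32, 2)     # sound classifier: 0 = vocal, 1 = non-vocal
modules = (separator, vocal_enc, music_enc, audio_dec, sound_clf)
opt = torch.optim.Adam([p for m in modules for p in m.parameters()], lr=1e-4)

mix = torch.rand(8, 64)                     # a batch of training audio inputs
voc, non = separator(mix).chunk(2, dim=-1)  # first / second training audio outputs
ev, eb = vocal_enc(voc), music_enc(non)     # encode each stream
recon = audio_dec(torch.cat([ev, eb], dim=-1))  # composite audio output
loss = (F.mse_loss(recon, mix)                                              # similarity to input
        + F.cross_entropy(sound_clf(ev), torch.zeros(8, dtype=torch.long))  # vocal label
        + F.cross_entropy(sound_clf(eb), torch.ones(8, dtype=torch.long)))  # non-vocal label
opt.zero_grad(); loss.backward(); opt.step()  # update on both signals, per the claim
```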
Regarding claim 14, claim 14 has similar limitations as claim 2, therefore it is rejected under the same rationale as claim 2.
Regarding claim 18, claim 18 has similar limitations as claim 2, therefore it is rejected under the same rationale as claim 2.
Claim(s) 4-5, 15 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over NPL Liu et al. (“MusicFace: Music-driven Expressive Singing Face Synthesis”), hereinafter as Liu, in view of de la Rey et al. (US 20230130844 A1), hereinafter as Rey, further in view of Hussen Abdelaziz et al. (US 20210248804 A1), hereinafter as Hussen, and Wu et al. (US 20240304177 A1), hereinafter as Wu.
Regarding claim 4, Liu in view of Rey and Hussen teach The method of Claim 1, further comprising determining, based on the user input, the animation for the avatar, but fail to teach: encoding, by a text-based encoder, the user input comprising the user instruction for animating the avatar; determining, based on one or more encoded features of the encoded user input and by a trained classifier, at least one animation classification for animating the avatar; determining, based on the at least one animation classification, a subsequent encoding of encoded user input comprising the user instruction for animating the avatar; and determining the avatar animation further based on the subsequent encoding of the user input comprising the user instruction. Wu teaches encoding, by a text-based encoder, the user input comprising the user instruction for animating the avatar (Wu teaches a text-based encoder in Figure 3C, paragraph [0060] “an input 302 is provided for processing by the network. The input 302 may correspond to the input 106 and/or may be a modified version of the input 106. For example, the input 302 may be a textual input that has been converted from an original audio input”, and paragraphs [0066]-[0068] “the character model 344 may apply a similar quantified value to the character associated with the user of the input 306. As a result, each of the models may generate respective emotion labels 356 and character labels 358, which may be provided as an input to the text encoder 346……the text encoder 346 may generate emotion and character predictions 360, 362, which may be used by the length regulator 352.”); determining, based on one or more encoded features of the encoded user input and by a trained classifier, at least one animation classification for animating the avatar (Wu teaches tokenizing and embedding the text input, and further teaches determining emotional features as the animation classification for the avatar, Figure 3A, paragraph [0060] “A tokenizer 304 is used to tokenize each BPE in the sentence and then an embedding layer 306 converts each BPE token into dense vector spaces, similar to a bag of words neural language model. Next, the embeddings are fed into layers 308 of a CNN with various kernel sizes.”, paragraph [0087] “one or more emotional features are determined from an input text 522, such as by using a trained classifier to extract emotional characteristics from the words used in the text.”); determining, based on the at least one animation classification, a subsequent encoding of encoded user input comprising the user instruction for animating the avatar (Wu teaches the synthesized audio as a subsequent encoding of the character feature, paragraph [0087] “Additionally, embodiments also determine one or more character features based on the input text 524. This features may then be provided to a trained network to generate synthesized audio associated with the text 526, where the audio may incorporate the features identified.”); and determining the avatar animation further based on the subsequent encoding of the user input comprising the user instruction (Wu paragraph [0087] “Thereafter, using the features and synthesized audio, a graphical representation of an avatar may be generated 528, such as by providing the features and audio as inputs to a trained A2F model.”).
Liu, Rey, Hussen and Wu are in the same field of endeavor, namely computer graphics, especially in the field of audio driven avatar animation. Wu teaches using a machine learning method to generate an avatar based on the emotion and character traits identified from user input (Wu paragraph [0046] “These differences in emotion and character traits need to be incorporated into the avatars or characters in order to provide more realistic environments, characters” and paragraph [0074] “different elements of the user's characteristics and emotions may be carried through to the generation of the facial and body expressions, thereby providing a more realistic and accurate portrayal of the user.”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Wu with the method of Liu, Rey and Hussen to achieve realistic avatar animation.
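For illustration only, a hypothetical sketch of the claim-4 flow: encode the textual instruction, classify it into an animation classification, then form a subsequent encoding conditioned on the predicted class. The bag-of-tokens encoder, the class set, and the parameter head are invented for the sketch and are not Wu's architecture.

```python
import torch
import torch.nn as nn

vocab, emb_dim, n_classes = 1000, 64, 4          # hypothetical sizes
text_encoder = nn.EmbeddingBag(vocab, emb_dim)   # toy bag-of-tokens text-based encoder
animation_clf = nn.Linear(emb_dim, n_classes)    # e.g., {happy, sad, angry, neutral}
class_emb = nn.Embedding(n_classes, emb_dim)
subsequent_enc = nn.Linear(2 * emb_dim, emb_dim) # re-encode text conditioned on the class
anim_head = nn.Linear(emb_dim, 52)               # blendshape-like animation parameters

tokens = torch.randint(0, vocab, (1, 6))         # stand-in for a tokenized user instruction
feat = text_encoder(tokens)                      # encoded user input
cls = animation_clf(feat).argmax(dim=-1)         # at least one animation classification
feat2 = subsequent_enc(torch.cat([feat, class_emb(cls)], dim=-1))  # subsequent encoding
animation = anim_head(feat2)                     # parameters for animating the avatar
```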
Regarding claim 5, Liu in view of Rey, Hussen and Wu teach The method of Claim 4, and further teach wherein the at least one animation classification comprises at least one of an emotion classification or a dance classification based on the one or more encoded features of the encoded user input (Wu teaches determining emotional features as the animation classification for the avatar, paragraph [0087] “one or more emotional features are determined from an input text 522, such as by using a trained classifier to extract emotional characteristics from the words used in the text.”).
Liu, Rey, Hussen and Wu are in the same field of endeavor, namely computer graphics, especially in the field of audio driven avatar animation. Wu teaches using a machine learning method to generate an avatar based on the emotion and character traits identified from user input (Wu paragraph [0046] “These differences in emotion and character traits need to be incorporated into the avatars or characters in order to provide more realistic environments, characters” and paragraph [0074] “different elements of the user's characteristics and emotions may be carried through to the generation of the facial and body expressions, thereby providing a more realistic and accurate portrayal of the user.”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Wu with the method of Liu, Rey and Hussen to achieve realistic avatar animation.
Regarding claim 15, claim 15 has similar limitations as claim 4, therefore it is rejected under the same rationale as claim 4.
Regarding claim 19, claim 19 has similar limitations as claim 4, therefore it is rejected under the same rationale as claim 4.
Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over NPL Liu et al. (“MusicFace: Music-driven Expressive Singing Face Synthesis”), hereinafter as Liu, in view of de la Rey et al. (US 20230130844 A1), hereinafter as Rey, further in view of Hussen Abdelaziz et al. (US 20210248804 A1), hereinafter as Hussen, and NPL Zhuang et al. (“Music2Dance: DanceNet for Music-driven Dance Generation”), hereinafter as Zhuang.
Regarding claim 10, Liu in view of Rey and Hussen teach The method of Claim 1, but fail to teach wherein the one or more trained avatar animation models comprise a trained dance model, the method further comprising: generating, by a trained non-vocal encoder of the trained dance model, a set of encoded non-vocal features; generating, by a trained motion encoder of the trained dance model, a set of encoded motion features; and generating, by a trained dance-style classifier of the trained dance model, a dance classification based on the encoded non-vocal features; generating, by a motion decoder of the trained dance model, a dance animation for the avatar based on the set of encoded motion features and the dance classification. Zhuang teaches wherein the one or more trained avatar animation models comprise a trained dance model (Zhuang Page 1, abstract, “we propose a novel autoregressive generative model, DanceNet, to take the style, rhythm and melody of music as the control signals to generate 3D dance motions with high realism and diversity.”), the method further comprising: generating, by a trained non-vocal encoder of the trained dance model, a set of encoded non-vocal features (Zhuang teaches a musical encoder as the non-vocal encoder, Page 6, Figure 2, “musical context-aware encoder”); generating, by a trained motion encoder of the trained dance model, a set of encoded motion features (Zhuang Page 6, Figure 2, “motion encoder” and Page 9, first paragraph, “Motion encoder. We stack two ”Conv1D+Relu” module as motion encoder to encode the past k frames. The convolution kernel is set to 1, which ensures that each frame motion code(512 channels) is independent.”); and generating, by a trained dance-style classifier of the trained dance model, a dance classification based on the encoded non-vocal features (Zhuang teaches a musical style classifier, which takes the music features as input, i.e., encoded non-vocal features. Page 6, Figure 2, “First we extract the musical rhythm and melody (music features) and classify the musical style by the musical style classifier.”); generating, by a motion decoder of the trained dance model, a dance animation for the avatar based on the set of encoded motion features and the dance classification (Zhuang Page 6, Figure 2, “DanceNet takes the musical style, rhythm and melody as control signals to generate dance motion. DanceNet consists of four parts: musical context-aware encoder, motion encoder, residual motion control stacked module, motion decoder.”).
Liu, Rey, Hussen and Zhuang are in the same field of endeavor, namely computer graphics, especially in the field of audio driven avatar animation. Zhuang teaches music guided DanceNet to achieve realistic and diverse avatar animation (Zhuang Page 3, third paragraph, “The results show that our method can achieve SOTA result, and the dance motions generated by our method are not only realistic and diverse, but also are music-consistent.”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Zhuang with the method of Liu, Rey and Hussen to achieve realistic and diverse avatar animation.
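For illustration only, a hypothetical sketch of the four claim-10 components as mapped to Zhuang (music encoder, motion encoder, style classifier, motion decoder). Channel counts and layers are invented, except that the kernel-size-1 motion encoder loosely mirrors the quoted passage.

```python
import torch
import torch.nn as nn

T = 60  # frames in the clip
music_enc = nn.Conv1d(80, 128, kernel_size=3, padding=1)  # non-vocal (music) encoder
motion_enc = nn.Conv1d(72, 128, kernel_size=1)            # kernel 1, cf. the quoted passage
style_clf = nn.Linear(128, 8)                             # dance-style classifier
motion_dec = nn.Conv1d(128 + 128 + 8, 72, kernel_size=1)  # motion decoder -> pose parameters

music = torch.rand(1, 80, T)        # mel-like music features (non-vocal input)
past_motion = torch.rand(1, 72, T)  # past pose parameters (e.g., joint rotations)
fm = torch.relu(music_enc(music))                # encoded non-vocal features
fp = torch.relu(motion_enc(past_motion))         # encoded motion features
style = style_clf(fm.mean(dim=-1))               # dance classification from music features
style_t = style.unsqueeze(-1).expand(-1, -1, T)  # broadcast the class over time
dance = motion_dec(torch.cat([fm, fp, style_t], dim=1))  # dance animation for the avatar
```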
Claim(s) 11, 16 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over NPL Liu et al. (“MusicFace: Music-driven Expressive Singing Face Synthesis”), hereinafter as Liu, in view of de la Rey et al. (US 20230130844 A1), hereinafter as Rey, further in view of Hussen Abdelaziz et al. (US 20210248804 A1), hereinafter as Hussen, NPL Zhuang et al. (“Text/Speech-Driven Full-Body Animation”), hereinafter as Zhuang2, and NPL Zhuang et al. (“Music2Dance: DanceNet for Music-driven Dance Generation”), hereinafter as Zhuang.
Regarding claim 11, Liu in view of Rey and Hussen teach The method of Claim 1, and further teach wherein the trained one or more animation models comprise: ……a facial-expression model for animating a face of the avatar (Liu, Page 4, Figure 3, “The Decoder contains three downstream generators, including Expression Generation Network (EGN)”)……
Liu in view of Rey and Hussen fail to explicitly teach a lip-sync model for animating a mouth of the avatar;…… and a dance model for animating a body of the avatar. Zhuang2 teaches a lip-sync model for animating a mouth of the avatar (Zhuang2 Page 2, Figure 1, “Given a section of text and speech, the human face and the body are synthesized through two branches respectively. One branch adopts a learning-based method to synthesize lip motions and expressions”);…… and a dance model for animating a body of the avatar (Zhuang2 Page 2, Figure 1, “while the other branch uses a database retrieval-based method to synthesize skeleton motion. Full-body animation is then obtained through skinning and rendering.”).
Liu, Rey, Hussen and Zhuang2 are in the same field of endeavor, namely computer graphics, especially in the field of audio driven avatar animation. Zhuang2 teaches a method to simultaneously synthesize face and body animation to achieve high quality avatar animation (Zhuang2 Page 1, left column, first paragraph, “We adopt a learning-based approach for synthesizing facial animation and a graph-based approach to animate the body, which generates high-quality avatar animation efficiently and robustly. Our results demonstrate the generated avatar animations are realistic, diverse and highly text/speech-correlated.”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Zhuang2 with the method of Liu, Rey and Hussen to achieve realistic and diverse avatar animation.
Liu in view of Rey, Hussen and Zhuang2 fail to teach the trained one or more animation models… dance model. Zhuang teaches the trained one or more animation models… dance model (Zhuang Page 1, abstract, “we propose a novel autoregressive generative model, DanceNet, to take the style, rhythm and melody of music as the control signals to generate 3D dance motions with high realism and diversity.”).
Liu, Rey, Hussen, Zhuang2 and Zhuang are in the same field of endeavor, namely computer graphics, especially in the field of audio driven avatar animation. Zhuang teaches music-guided DanceNet to achieve realistic and diverse avatar animation (Zhuang Page 3, third paragraph, “The results show that our method can achieve SOTA result, and the dance motions generated by our method are not only realistic and diverse, but also are music-consistent.”). Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to substitute the dance model of Zhuang2 with the DanceNet of Zhuang in the method of Liu, Rey, Hussen and Zhuang2 to achieve realistic and diverse avatar animation.
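For illustration only, a hypothetical dispatch over the three animation models recited in claim 11; the stub functions stand in for trained models and are not the cited references' code.

```python
import torch
from typing import Callable, Dict

def lip_sync_model(feat: torch.Tensor) -> torch.Tensor:
    return feat * 0.5          # stand-in: mouth parameters

def facial_expression_model(feat: torch.Tensor) -> torch.Tensor:
    return feat + 1.0          # stand-in: facial-expression parameters

def dance_model(feat: torch.Tensor) -> torch.Tensor:
    return feat.repeat(1, 2)   # stand-in: body-motion parameters

MODELS: Dict[str, Callable[[torch.Tensor], torch.Tensor]] = {
    "lip_sync": lip_sync_model,
    "facial_expression": facial_expression_model,
    "dance": dance_model,
}

def animate(mode: str, audio_feat: torch.Tensor) -> torch.Tensor:
    """Select an animation model based on a chosen animation mode."""
    return MODELS[mode](audio_feat)

body_motion = animate("dance", torch.rand(1, 16))
```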
Regarding claim 16, claim 16 has similar limitations as claim 11, therefore it is rejected under the same rationale as claim 11.
Regarding claim 20, claim 20 has similar limitations as claim 11, therefore it is rejected under the same rationale as claim 11.
Allowable Subject Matter
Claims 3, 6-8 and 12 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
Regarding claim 3, the closest prior art of Rey teaches a self-iterative supervised audio source separation model. However, Rey fails to teach the combined limitation as a whole “wherein the trained audio source separation model is further defined by a supervised learning training process comprising, for each of the plurality of training audio inputs: providing a predetermined training avatar animation; determining, by a pretrained instance of the one or more trained avatar animation models and from the first training audio output and the second training output, a corresponding training avatar animation output; and updating the source separation model based on a similarity between the predetermined training avatar animation and the corresponding training avatar animation output.” Furthermore, no prior art of record, either alone or in combination, teaches the above limitation as a whole. Therefore, claim 3 is considered to be allowable.
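For illustration only, a hypothetical sketch of the training loop recited in claim 3: the separation model is updated from the mismatch between a predetermined training animation and the animation produced by a frozen, pretrained animation model. All modules are invented stand-ins, not any reference's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

separator = nn.Linear(64, 128)        # trainable: mixture -> separated outputs
animation_model = nn.Linear(128, 32)  # pretrained avatar-animation model, kept frozen
for p in animation_model.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(separator.parameters(), lr=1e-4)

mix = torch.rand(8, 64)                      # training audio inputs
target_anim = torch.rand(8, 32)              # predetermined training avatar animation
pred_anim = animation_model(separator(mix))  # corresponding training animation output
loss = F.mse_loss(pred_anim, target_anim)    # similarity between the two animations
opt.zero_grad(); loss.backward(); opt.step() # only the separation model is updated
```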
Regarding claim 6, the closest prior art of Wu teaches an emotion classification based on text input, and further teaches the animation of an avatar based on emotion features. However, Wu fails to teach the combined limitations as a whole “when the at least one animation classification comprises an emotion classification, then determining the subsequent encoding using an emotion encoder to output one or more encoded emotion features for animating a facial expression of the avatar; and when the at least one animation classification comprises a dance classification, then determining the subsequent encoding using a dance encoder to output or more encoded dance features for animating a body movement of the avatar.” Furthermore, no prior art of record, either alone or in combination, teaches the above limitation as a whole. Therefore, claim 6 is considered to be allowable.
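For illustration only, a hypothetical sketch of the claim-6 branching, routing the subsequent encoding through an emotion encoder or a dance encoder depending on the classification; all names and sizes are invented.

```python
import torch
import torch.nn as nn

emotion_encoder = nn.Linear(64, 32)  # -> encoded emotion features (facial expression)
dance_encoder = nn.Linear(64, 32)    # -> encoded dance features (body movement)

def subsequent_encoding(classification: str, feat: torch.Tensor) -> torch.Tensor:
    if classification == "emotion":
        return emotion_encoder(feat)  # animates the avatar's facial expression
    if classification == "dance":
        return dance_encoder(feat)    # animates the avatar's body movement
    raise ValueError(f"unknown classification: {classification}")

encoded = subsequent_encoding("emotion", torch.rand(1, 64))
```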
Claims 7-8 contain allowable subject matter because they depend from claim 6, which contains allowable subject matter.
Regarding claim 12, the closest prior art of Hussen teaches a user interface to generate avatar animation using a neural network. However, Hussen fails to teach the combined limitation below as a whole “wherein the user input comprising the user instruction for animating the avatar comprises an identification of an avatar animation mode for animating the avatar; and the method further comprises selecting, based on the avatar animation mode, one or more of the trained one or more animation models for animating the avatar.” The prior art of Zhuang2 teaches a text- and audio-based full-body animation with lip movement generation, facial expression generation and motion-graph-based body animation. However, Zhuang2 fails to teach the combined limitation above as a whole. Furthermore, no prior art of record, either alone or in combination, teaches the above limitation as a whole. Therefore, claim 12 is considered to be allowable.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIAOMING WEI whose telephone number is (571)272-3831. The examiner can normally be reached M-F 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung can be reached at (571)272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/XIAOMING WEI/Examiner, Art Unit 2611
/KEE M TUNG/Supervisory Patent Examiner, Art Unit 2611