Last updated: July 17, 2026
Application No. 18/457,251
AUDIO-DRIVEN FACIAL ANIMATION USING MACHINE LEARNING

Non-Final OA §103
Filed
Aug 28, 2023
Priority
Aug 16, 2023 — CN 202311036144.8
Examiner
MINKO, DENIS VASILIY
Art Unit
2612
Tech Center
2600 — Communications
Assignee
NVIDIA Corporation
OA Round
3 (Non-Final)
Interview Optional

— +13.9% interview lift. Interview lift (+13.9%) is below the 15.0% threshold. A written response is recommended.
Based on 26 resolved cases, 2023–2026
Examiner Intelligence

MINKO, DENIS VASILIY View full profile →
Grants 65% — above average
Career Allowance Rate
17 granted / 26 resolved
+3.4% vs TC avg
Moderate +14% lift
Without
With
+13.9%
Interview Lift
resolved cases with interview
Typical timeline
2y 5m
Avg Prosecution
12 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101
1.2%
-38.8% vs TC avg
§103
93.0%
+53.0% vs TC avg
§102
4.7%
-35.3% vs TC avg
§112
1.2%
-38.8% vs TC avg
Black line = Tech Center average estimate • Based on career data from 26 resolved cases
Office Action

§103
DETAILED ACTION

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 4/17/2026 has been entered.
 
Status of the Claims
Claims 1-20 are pending.
Claim 1, 11, 12, and 17 is amended.
Claims 1-10 and 12-20 are rejected.
Claim 11 is objected.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 1 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), and Yang et al. (US 20230230303 ).

Regarding claim 1: 
Parbhavalkar teaches:
A computer-implemented method (Parbhavalkar [0095] The computer program product contains instructions that, when executed, perform one or more methods, such as those described above.), comprising:
receiving audio data corresponding to an utterance of speech (Parbhavalkar [0010] One aspect of the disclosure provides a method for biasing speech recognition that includes receiving, at data processing hardware, audio data encoding an utterance, and obtaining, by the data processing hardware, a set of one or more biasing phrases corresponding to a context of the utterance, each biasing phrase in the set of one or more biasing phrases includes one or more words.);
computing, using a transformer-based audio encoder and a decoder, a weighted vector indicative of a plurality of features associated with the audio data (Parbhavalkar [0042] Accordingly, concatenating the contextual biasing vector 138 with the weighted audio encoding vector 136 into a weighted vector “injects” contextual biasing into the speech recognition model 300. The weighted vector 140 collectively represents the audio, grapheme, and phoneme information. The weighted vector 140 is input to the decoder 142.);

Parbhavalkar does not teach:
	computing, using the weighted vector and one or more component vectors, an animation vector corresponding to one or more positions for one or more feature points associated with a digital character representation;
	rendering the digital character representation based, at least, on the animation vector.
	and a volume level of the audio data;

Yu teaches:
computing, using the weighted vector and one or more component vectors, an animation vector corresponding to one or more positions for one or more feature points associated with a digital character representation (Yu [Abstract] The invention relates to the field of two-dimensional animation and provides an automatic match method of two-dimensional animation characters in the two-dimensional animation production environments. The method includes: respectively extracting feature points in characters according to character information in two key frames; allocating the dimension and the direction for each feature point in each character by means of a feature description algorithm and generating a high-dimensional feature vector; constructing a Markov random field satisfying the adjacency relation based on the obtained feature points; and computing the maximum posterior probability and seeking a minimum point of an energy function according to the obtained Markov random field and by combining the obtained high-dimensional feature vectors so as to establish match relations of the animation characters.); and

Buddemeiser teaches:
rendering the digital character representation based, at least, on the animation vector (Buddemeiser [0005] The present invention provides a technique for translating facial animation values to head mesh positions for rendering facial features of an animated avatar.):

Yang teaches:
	and a volume level of the audio data (Yang [0170] When the type of the input of the speaker volume adjuster 1150 is a volume down input and, as a result, a volume level is less than or equal to a preset first threshold value, the utterance mode may be determined as a ‘whisper’ mode, and the avatar animation module 1350 may change the avatar animation to express the facial expression of the avatar in a smaller size and express the tone of the avatar's uttered voice more smoothly.);

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Yu, Buddemeiser, and Yang. Having vectors correspond positions, features, and sound of an animation and then rendering the character, as in Yu and Buddemeiser and Yang, would benefit the Parbhavalkar teachings by allowing for the audio to be added with a visual representation. Additionally, this is the application of a known technique, combining audio and character positions, to yield predictable results.

Claim(s) 2 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), Yang et al.(US 20230230303 ), Liu (US 20190130562), and Weston et al. (EP 3937170).

Regarding claim 2: 
Parbhavalkar, Yu, and Buddemeiser and Yang teach:
The computer-implemented method of claim 1, 

Parbhavalkar, Yu, and Buddemeiser and Yang do not teach:
wherein the transformer-based audio encoder is a pre-trained audio encoder, further comprising:
training the decoder based, at least, on the transformer-based audio encoder,
wherein parameters for the transformer-based audio encoder are locked while the decoder is trained.

Weston teaches:
wherein the transformer-based audio encoder is a pre-trained audio encoder, further comprising (Weston [pg12 par2] Firstly, the trained Transformer encoder stack could 412 simply be frozen i.e. not part of the ongoing task specific training but simply left in its state following pre-training in which it translates a sequence of audio-linguistic tokens to a combined audio-linguistic representation sequence which can be classified to provide the diagnosis.):
training the decoder based, at least, on the transformer-based audio encoder (Weston [pg4 par8] Preferably, in these examples, training comprises training the encoder and decoder together to map the input sequence of audio and linguistic representations to the target output.), 

Liu teaches:
wherein parameters for the transformer-based audio encoder are locked while the decoder is trained (Liu [0024] The network training is performed in two stages: the encoder is learned; then the 3D decoder is added and fine-tuned with the encoder parameters locked.).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser and Yang with Weston and Liu. Having the transformer encoder pre-trained and training the encoders together, as in Weston and Liu, would benefit the Parbhavalkar, Yu, and Buddemeiser and Yang teachings by allowing a way for the coders to be trained beforehand. Additionally, this is the application of a known technique, having a pre-trained encoder, to yield predictable results.

Claim(s) 3 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), Yang et al.(US 20230230303 ), and Zhou et al. (US 20180137857).

Regarding claim 3:
Parbhavalkar, Yu, and Buddemeiser and Yang teach:
The computer-implemented method of claim 1, 

Parbhavalkar, Yu, and Buddemeiser and Yang do not teach:
	wherein the plurality of features are selected during training.
Zhou teaches:
wherein the plurality of features are selected during training (Zhou [0006] The method further includes performing, with the processor, a training process for a neural network ranker using the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser and Yang with Zhou. Selecting features while training, as in Zhou, would benefit the Parbhavalkar, Yu, and Buddemeiser and Yang teachings by customizing the training. Additionally, this is the application of a known technique, choosing what features to be trained on, to yield predictable results.

Claim(s) 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), Yang et al.(US 20230230303 ), and Yang et al. (CN 112822068).

Regarding claim 4: 
Parbhavalkar, Yu, and Buddemeiser and Yang teach:
The computer-implemented method of claim 1, further comprising:

Parbhavalkar, Yu, and Buddemeiser and Yang do not teach:
	receiving a respective layer vector associated with the plurality of features for layers of the transformer-based audio encoder; determining, for individual layers, a layer weight; applying the layer weight to the respective individual layer; and determining the weighted vector.

Yang teaches: 
receiving a respective layer vector associated with the plurality of features for layers of the transformer-based audio encoder; determining, for individual layers, a layer weight; applying the layer weight to the respective individual layer; and determining the weighted vector (Yang [pg15 par2] obtaining an initial model comprising a feature extraction layer, a weight calculation layer and a prediction layer. structure of the initial model as shown in FIG. 8, the initial model is a comprising a feature extraction layer; a weight calculation layer and a prediction layer three-layer structure model, wherein the feature extraction layer is used for obtaining each sentence in the communication sentence vector, feature extraction layer input can be a communication data, as shown in FIG. 8 Si represents a communication data, such as feature extraction layer can receive a plurality of communication data at the same time; extracting each of the sentence vector of communication data by the feature extraction layer, feature extraction layer can be through but not limited to through BERT (Encoder Representation from Transformers are used to sentence vector the communication data. weight calculating layer to sentence vector input, calculating the weight of each sentence vector, such as weight calculating layer can be obtained by but not limited to BiLSTM (Bi-directional Long Short-Term Memory) ).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser and Yang with Yang. Having weighted vectors, as in Yang, would benefit the Parbhavalkar, Yu, and Buddemeiser and Yang teachings by having different types of vectors. Additionally, this is the application of a known technique, having different weights to vectors, to yield predictable results.

Claim(s) 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), and Liu et al. (US 20210266274).

Regarding claim 5: 
Parbhavalkar, Yu, and Buddemeiser and Yang teach:
The computer-implemented method.of claim 1, 
Parbhavalkar, Yu, and Buddemeiser and Yang do not teach:
	wherein the audio data has a duration less than a threshold duration.

Liu teaches:
wherein the audio data has a duration less than a threshold duration (Liu [0067] When the audio duration corresponding to the audio data is less than or equal to the duration threshold).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser and Yang with Liu. Having audio duration less than a certain threshold, as in Liu, would benefit the Parbhavalkar, Yu, and Buddemeiser and Yang teachings by having a limit to audio. Additionally, this is the application of a known technique, having an audio limit threshold, to yield predictable results.

Claim(s) 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), Yang et al.(US 20230230303 ), and Deng et al. (CN 110837738).

Regarding claim 6: 
Parbhavalkar, Yu, and Buddemeiser and Yang teach:
The computer-implemented method of claim 5, 

Parbhavalkar, Yu, and Buddemeiser and Yang do not teach:
wherein the plurality of features are determined using transformer layers.

Deng teaches:
wherein the plurality of features are determined using transformer layers (Deng [pg3 par11] the vector matrix using a first transformer layer extracted by the characteristic extraction to obtain first feature matrix).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser and Yang with Deng. Using a feature matrix to store features extracted, as in Deng, would benefit the Parbhavalkar, Yu, and Buddemeiser and Yang teachings by having a way to extract features. Additionally, this is the application of a known technique, extracting features using transformer layers, to yield predictable results.

Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), Yang et al.(US 20230230303 ), and Gareth et al. (GB 2568475).

Regarding claim 7:
Parbhavalkar, Yu, and Buddemeiser and Yang teach:
The computer-implemented method of claim 1,

Parbhavalkar, Yu, and Buddemeiser and Yang do not teach:
wherein the one or more feature points corresponds to at least one of facial features, a tongue position, an eye position, or an extremity position.

Gareth teaches:
wherein the one or more feature points corresponds to at least one of facial features, a tongue position, an eye position, or an extremity position (Gareth [0034] The feature points correspond to facial features such as key points on eyes, nose, lips etc).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser and Yang with Gareth. Having feature points correspond to facial features, as in Gareth, would benefit the Parbhavalkar, Yu, and Buddemeiser and Yang teachings by having a way to have feature points correspond to something. Additionally, this is the application of a known technique, having feature points correspond to facial features, to yield predictable results.

Claim(s) 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), Yang et al.(US 20230230303 ), and Davis et al. (US 9705904).

Regarding claim 8: 
Parbhavalkar, Yu, and Buddemeiser and Yang teach:
The computer-implemented method of claim 1, 

Parbhavalkar, Yu, and Buddemeiser and Yang do not teach:
wherein.the plurality of features are extracted from the audio data via processing using a convolutional neural network (CNN).

Davis teaches:
wherein.the plurality of features are extracted from the audio data via processing using a convolutional neural network (CNN) (Davis [0016] Approaches such as convolutional neural networks can yield classifiers that can learn to extract features that are at least as effective as human-engineered features. While such models are currently applied to image and audio data).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser and Yang with Davis. Using a CNN to extract features, as in Davis, would benefit the Parbhavalkar, Yu, and Buddemeiser and Yang teachings by having a way to extract features using a neural network. Additionally, this is the application of a known technique, extracting features using a CNN, to yield predictable results.

Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), Yang et al.(US 20230230303 ), and Hewage et al. (WO 2019092459).

Regarding claim 9: 
Parbhavalkar, Yu, and Buddemeiser and Yang teach:
The computer-implemented method of claim 1, 

Parbhavalkar, Yu, and Buddemeiser and Yang do not teach:
	wherein the component vector includes at least one of an emotion vector or a style vector.

Hewage teaches:
wherein the component vector includes at least one of an emotion vector or a style vector (Hewage [0020] The present disclosure provides methods and apparatus for a machine learning technique that uses a latent vector from a latent vector space in classifying input data, where the latent vector includes a label vector y and a style vector z).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser and Yang with Hewage. Having a style vector, as in Hewage, would benefit the Parbhavalkar, Yu, and Buddemeiser and Yang teachings by having a vector associated with the style. Additionally, this is the application of a known technique, using a style vector, to yield predictable results.

Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), Yang et al.(US 20230230303 ),and Ludwig et al. (CN 104349074).

Regarding claim 10: 
Parbhavalkar, Yu, and Buddemeiser and Yang teach:
The computer-implemented method of claim 1, 

Parbhavalkar, Yu, and Buddemeiser and Yang do not teach:
	wherein the decoder disregards information from one or more previous frames.

Ludwig teaches:
wherein the decoder disregards information from one or more previous frames (Ludwig [pg6 par11] In addition, in the inter-frame mode, it can use the ignore (skip). to ignore blocks for coding, and not sending residual or motion vector. encoder only records it is ignored. the decoder from other blocks omitted image information block has been decoded. block according to the invention, preferably, from the same frame of the digital video data or image information of the previous frame block derivation ignored.).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser and Yang with Ludwig. Ignoring previous frames, as in Ludwig, would benefit the Parbhavalkar, Yu, and Buddemeiser and Yang teachings by not having redundant data go through. Additionally, this is the application of a known technique, ignoring previous frames for coding, to yield predictable results.

Claim(s) 12 and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), Kudo et al. (JP 2003346181), and Yang et al.(US 20230230303 ).

Regarding claim 12:
Parbhavalkar teaches:
A processor comprising:  one or more processing units to (Parbhavalkar [0088] The method 500 may be executed on data processing hardware 610 (FIG. 6) residing on a user device 106 associated with a user 102 that spoke the utterance 104.):
compute, using a transformer-based audio encoder and based, at least, on audio data corresponding to speech (Parbhavalkar [0010] One aspect of the disclosure provides a method for biasing speech recognition that includes receiving, at data processing hardware, audio data encoding an utterance, and obtaining, by the data processing hardware, a set of one or more biasing phrases corresponding to a context of the utterance, each biasing phrase in the set of one or more biasing phrases includes one or more words.), 
a weighted feature vector associated with the audio data (Parbhavalkar [0042] Accordingly, concatenating the contextual biasing vector 138 with the weighted audio encoding vector 136 into a weighted vector “injects” contextual biasing into the speech recognition model 300. The weighted vector 140 collectively represents the audio, grapheme, and phoneme information. The weighted vector 140 is input to the decoder 142.);

Parbhavalkar does not teach:
	compute, using the weighted feature vector and a component vector indicative of one or more properties associated with the speech, position data for one or more feature points of one or more deformable bodily components of a virtual character; and
	render, for one or more time points 'in a sequence of time points of the audio data, image data representative of the virtual character based, at least, on the position data to generate an animation of the character appearing to utter the speech.
	and a volume level of the audio data;

Seo and Park teach:
compute, using the weighted feature vector and a component vector indicative of one or more properties associated with the speech, position data for one or more feature points of one or more deformable bodily components of a virtual character (Seo [pg 7 par11] As described above, the costume image of the wearer's clothing 60 is a stored standard image, and the feature points are composed of feature vectors for identifying the positions of the feature points. The detail area is a subdivision of the whole area of the costume image according to the body part reference. Park [0032] One or more components, each also vector representations, make up the overall vector representation. One embodiment uses, as a component, a vector representation of the unit of narrative text. Another embodiment uses, as a component, a vector representation of a part of speech (e.g. a noun, verb, adjective, adverb, or another part of speech) corresponding to the unit of narrative text.); and

Kudo teaches:
render, for one or more time points 'in a sequence of time points of the audio data, image data representative of the virtual character based, at least, on the position data to generate an animation of the character appearing to utter the speech (Kudo [0002] In recent years, a three-dimensional (3D) animation image is created and displayed using computer graphic (CG) technology. [0003] Here, as one form of animation, a motion image of the state of the mouth of the face and the state of the facial expression corresponding to the utterance and emotion of the character is displayed. [0004] When creating a facial animation of utterances and facial expressions in CG in this way, a facial model animation pattern image is created for each frame of the video. They were arranged along the time axis and played continuously.).

Yang teaches:
	and a volume level of the audio data (Yang [0170] When the type of the input of the speaker volume adjuster 1150 is a volume down input and, as a result, a volume level is less than or equal to a preset first threshold value, the utterance mode may be determined as a ‘whisper’ mode, and the avatar animation module 1350 may change the avatar animation to express the facial expression of the avatar in a smaller size and express the tone of the avatar's uttered voice more smoothly.);

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Seo, Park and Kudo and Yang. Having vectors correspond with speech, sound, position, and body features of an animation and then rendering the character, as in Seo, Park and Kudo and Yang, would benefit the Parbhavalkar teachings by allowing a way to create a visual representation. Additionally, this is the application of a known technique, combining speech, position, and body parts to create an animation, to yield predictable results.

Regarding claim 16: 
Parbhavalkar, Seo, Park, and Kudo and Yang teach:
The processor of claim 12, 
wherein the processor is comprised in at least one of: a system for performing simulation operations; a system for performing simulation operations to test or validate autonomous machine applications; a system for performing digital twin operations; a system for performing light transport simulation; a system for rendering graphical output; a system for performing deep learning operations; a system implemented using an edge device; a system for generating or presenting virtual reality (VR) content; a system for generating or presenting augmented reality (AR) content; a system for generating or presenting mixed reality (MR) content; a system incorporating one or more Virtual Machines (VMs);a system for performing operations for a conversational AI application; a system for performing operations for a generative AI application; a system for performing operations using a language model; a system for performing one or more generative content operations using a large language model (LLM);a system implemented at least partially in a data center; a system for performing hardware testing using simulation; a system for performing one or more generative content operations using a language model; a system for synthetic data generation; a collaborative content creation platform for 3D assets; or.*a system implemented at least partially using cloud computing resources (Parbhavalkar [0088] Optionally, the data processing hardware 610 may reside on a remote device (e.g., server of a cloud-based computing environment) in communication with the user device 106, e.g., over a network.).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Seo, Park and Kudo and Yang. Using a cloud computer network, as in Parbhavalkar, would benefit the Seo, Park and Kudo and Yang teachings by allowing to reallocate the process load. Additionally, this is the application of a known technique, having a cloud computing, to yield predictable results.


Claim(s) 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), Kudo et al. (JP 2003346181), Yang et al.(US 20230230303 ), and Yang (CN 112822068).

Regarding claim 13: 
Parbhavalkar, Seo, Park, and Kudo and Yang teach:
The processor of claim 12, 

Parbhavalkar, Seo, Park, and Kudo and Yang do not teach:
	wherein the weighted feature vector is based, at least, on respective layer vectors for individual layers of the transformer-based audio encoder, wherein -individual layer vectors are associated with a plurality of features extracted from the audio data.

Yang teaches:
wherein the weighted feature vector is based, at least, on respective layer vectors for individual layers of the transformer-based audio encoder, wherein -individual layer vectors are associated with a plurality of features extracted from the audio data (Yang [pg15 par2] obtaining an initial model comprising a feature extraction layer, a weight calculation layer and a prediction layer. structure of the initial model as shown in FIG. 8, the initial model is a comprising a feature extraction layer; a weight calculation layer and a prediction layer three-layer structure model, wherein the feature extraction layer is used for obtaining each sentence in the communication sentence vector, feature extraction layer input can be a communication data, as shown in FIG. 8 Si represents a communication data, such as feature extraction layer can receive a plurality of communication data at the same time; extracting each of the sentence vector of communication data by the feature extraction layer, feature extraction layer can be through but not limited to through BERT (Encoder Representation from Transformers are used to sentence vector the communication data. weight calculating layer to sentence vector input, calculating the weight of each sentence vector, such as weight calculating layer can be obtained by but not limited to BiLSTM (Bi-directional Long Short-Term Memory) ).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Seo, Park, and Kudo and Yang with Yang. Having weighted vectors, as in Yang, would benefit the Parbhavalkar, Seo, Park, and Kudo and Yang teachings by having different types of vectors. Additionally, this is the application of a known technique, having different weights to vectors, to yield predictable results.

Claim(s) 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), Kudo et al. (JP 2003346181), Yang et al.(US 20230230303 ), and Liu (US 20190130562).

Regarding claim 14: 
Parbhavalkar, Seo, Park, and Kudo teach:
The processor of claim 12, 

Parbhavalkar, Seo, Park, and Kudo do not teach:
	wherein parameters of the transformer-based audio encoder are locked during a training process for an associated decoder.

Zhou teaches:
wherein parameters of the transformer-based audio encoder are locked during a training process for an associated decoder (Liu [0024] The network training is performed in two stages: the encoder is learned; then the 3D decoder is added and fine-tuned with the encoder parameters locked.).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Seo, Park, and Kudo and Yang with Liu. Having weighted vectors, as in Parbhavalkar, Seo, Park, and Kudo and Yang, would benefit the Liu, Yu, and Buddemeiser teachings by having different types of vectors. Additionally, this is the application of a known technique, having different weights to vectors, to yield predictable results.

Claim(s) 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), Kudo et al. (JP 2003346181), Yang et al.(US 20230230303 ), and Hewage (WO 2019092459).

Regarding claim 15: 
Parbhavalkar, Seo, Park, and Kudo and Yang teach:
The processor of claim 12, 

Parbhavalkar, Seo, Park, and Kudo and Yang teach:
	wherein the component vector includes at least one of an emotion vector or a style vector.

Hewage teaches:
wherein the component vector includes at least one of an emotion vector or a style vector (Hewage [0020] The present disclosure provides methods and apparatus for a machine learning technique that uses a latent vector from a latent vector space in classifying input data, where the latent vector includes a label vector y and a style vector z).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Seo, Park, and Kudo and Yang with Hewage. Having a style vector, as in Hewage, would benefit the Parbhavalkar, Seo, Park, and Kudo and Yang teachings by having a vector associated with the style. Additionally, this is the application of a known technique, using a style vector, to yield predictable results.

Claim(s) 17,18, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), Hewage et al. (WO 2019092459), and Yang et al.(US 20230230303 ).

Regarding claim 17: 
Parbhavalkar, Seo, and Park teach:
A system, comprising: one or more processing units to generate an animation of a character using position data representative of one-or more positions of one or more feature points of the character (Parbhavalkar [0088] The method 500 may be executed on data processing hardware 610 (FIG. 6) residing on a user device 106 associated with a user 102 that spoke the utterance 104. [0010] One aspect of the disclosure provides a method for biasing speech recognition that includes receiving, at data processing hardware, audio data encoding an utterance, and obtaining, by the data processing hardware, a set of one or more biasing phrases corresponding to a context of the utterance, each biasing phrase in the set of one or more biasing phrases includes one or more words.) (Seo [pg7 par3] As described above, the costume image of the wearer's clothing 60 is a stored standard image, and the feature points are composed of feature vectors for identifying the positions of the feature points. The detail area is a subdivision of the whole area of the costume image according to the body part reference.) (Park [0032] One or more components, each also vector representations, make up the overall vector representation. One embodiment uses, as a component, a vector representation of the unit of narrative text. Another embodiment uses, as a component, a vector representation of a part of speech (e.g. a noun, verb, adjective, adverb, or another part of speech) corresponding to the unit of narrative text.), 

Parbhavalkar, Seo, and Park do not teach:
the position data computed based at least in part on a transformer-based audio encoder processing audio data representative of speech and component data indicative of one or more values corresponding to at least one of a style parameter or an emotion parameter associated with the speech.
	and a volume level of the audio data;

Hewage teaches:
the position data computed based at least in part on a transformer-based audio encoder processing audio data representative of the speech and component data indicative of one or more values corresponding to at least one of a style parameter or an emotion parameter associated with the speech (Hewage [0020] The present disclosure provides methods and apparatus for a machine learning technique that uses a latent vector from a latent vector space in classifying input data, where the latent vector includes a label vector y and a style vector z).

Yang teaches:
	and a volume level of the audio data (Yang [0170] When the type of the input of the speaker volume adjuster 1150 is a volume down input and, as a result, a volume level is less than or equal to a preset first threshold value, the utterance mode may be determined as a ‘whisper’ mode, and the avatar animation module 1350 may change the avatar animation to express the facial expression of the avatar in a smaller size and express the tone of the avatar's uttered voice more smoothly.);

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Seo, Park and Hewage and Yang. Having vectors correspond with speech, sound, position, and body features of an animation and having a style vector, as in Seo, Park and Hewage and Yang, would benefit the Parbhavalkar teachings by allowing a way to create a visual representation. Additionally, this is the application of a known technique, combining speech, position, and body parts to create an animation, to yield predictable results.

Regarding claim 18: 
Parbhavalkar, Seo, Park, and Hewage and Yang teach:
The system of claim 17; 
wherein the transformer-based audio encoder computes a weighted feature vector based, at least, on respective layer vectors for individual layers of the transformer-based audio encoder (Seo [pg 7 par11] As described above, the costume image of the wearer's clothing 60 is a stored standard image, and the feature points are composed of feature vectors for identifying the positions of the feature points. The detail area is a subdivision of the whole area of the costume image according to the body part reference. Park [0032] One or more components, each also vector representations, make up the overall vector representation. One embodiment uses, as a component, a vector representation of the unit of narrative text. Another embodiment uses, as a component, a vector representation of a part of speech (e.g. a noun, verb, adjective, adverb, or another part of speech) corresponding to the unit of narrative text.).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Seo, Park and Hewage and Yang. Having a feature vector, as in Seo, Park and Hewage and Yang, would benefit the Parbhavalkar teachings by allowing a vector to store feature information. Additionally, this is the application of a known technique, having a feature vector in an audio coder, to yield predictable results.

Regarding claim 20: 
Parbhavalkar, Seo, Park, and Hewage and Yang teach:
The system of claim 17; 
wherein the system comprises at least one of: a system for performing simulation operations; a system for performing simulation operations to test or validate autonomous machine applications; a system for performing digital twin operations; a system for performing light transport simulation; a system for rendering. graphical output; a system for performing deep learning operations; a system implemented using an edge device; a system for generating or presenting virtual reality (VR) content; a system for generating or presenting augmented reality (AR) content; a system for generating or presenting mixed reality (MR) content; a system incorporating one or more Virtual Machines (VMs);a system for performing operations for a conversational AI application; a system for performing operations for a generative AI application;-a system for performing operations using a language model; a system for performing one or more generative content operations using a large language model (LLM);a system implemented at least partially in a data center; a system for performing hardware testing using simulation; a system for performing one or more generative content operations using, a language model; a system for synthetic data generation; a collaborative content creation platform for 3D assets; or a system implemented at least partially using cloud computing resources (Parbhavalkar [0088] Optionally, the data processing hardware 610 may reside on a remote device (e.g., server of a cloud-based computing environment) in communication with the user device 106, e.g., over a network.).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Seo, Park and Hewage and Yang. Using a cloud computer network, as in Parbhavalkar, would benefit the Seo, Park and Hewage and Yang teachings by allowing to reallocate the process load. Additionally, this is the application of a known technique, having a cloud computing, to yield predictable results.

Claim(s) 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), Hewage et al. (WO 2019092459), Yang et al.(US 20230230303 ), and Liu et al (US 20190130562).

Regarding claim 19:
Parbhavalkar, Seo, Park, and Hewage and Yang teach:
The system of claim 17; 

Parbhavalkar, Seo, Park, and Hewage do not teach:
wherein parameters of the transformer-based audio encoder are locked during a training process for an associated decoder.

Liu teaches:
wherein parameters of the transformer-based audio encoder are locked during a training process for an associated decoder (Liu [0024] The network training is performed in two stages: the encoder is learned; then the 3D decoder is added and fine-tuned with the encoder parameters locked.).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Seo, Park and Hewage and Yang with Yang. Locking the parameter, as in Liu, would benefit the Parbhavalkar, Seo, Park and Hewage and Yang teachings by the parameters to not change. Additionally, locking parameters during training, to yield predictable results.

Allowable Subject Matter
In regards to claim 11, the cited prior art fails to teach the following limitations in that claim: “…penalizing motion between neighboring frames when-a volume of the audio data is less than a volume threshold.” Therefore, claim 11 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Response to Arguments
Applicant's arguments filed 10/23/2025 have been fully considered but they are not persuasive. 
Applicant claims to have overcome the rejection by including “volume level.”
Yang has been added to teach this limitation. Yang mentions: “When the type of the input of the speaker volume adjuster 1150 is a volume down input and, as a result, a volume level is less than or equal to a preset first threshold value, the utterance mode may be determined as a ‘whisper’ mode, and the avatar animation module 1350 may change the avatar animation to express the facial expression of the avatar in a smaller size and express the tone of the avatar's uttered voice more smoothly.” In paragraph [0170].
Yang also has animation regarding the volume level. When it is below a threshold or above it can change the animation to show a different type of animation.
This same reasoning is applied to all independent and dependent claims.
Therefore, the rejection has not been withdrawn.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DENIS VASILIY MINKO whose telephone number is (571)270-5226. The examiner can normally be reached Monday-Thursday 8:30-6:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Said Broome can be reached at 571-272-2931. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Denis Minko/Examiner, Art Unit 2612                                                                                                                                                                                                        
/Said Broome/Supervisory Patent Examiner, Art Unit 2612
Read full office action
Prosecution Timeline

Aug 28, 2023
Application Filed
Apr 28, 2025
Non-Final Rejection mailed — §103
Oct 23, 2025
Response Filed
Jan 28, 2026
Final Rejection mailed — §103
Apr 17, 2026
Request for Continued Examination
Apr 20, 2026
Response after Non-Final Action
Jun 18, 2026
Non-Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/653,592
Patent 12639942
ARTIFACT PROCESSING IN VIDEO USING TEXTURE INFORMATION
2y 0m to grant Granted May 26, 2026
18/163,187
Patent 12622757
USER INTERFACE FOR THREE DIMENSIONAL IMAGING AND TREATMENT
3y 3m to grant Granted May 12, 2026
18/465,862
Patent 12608854
SYSTEMS AND METHODS FOR TEETH WHITENING SIMULATION
2y 7m to grant Granted Apr 21, 2026
18/577,155
Patent 12597195
METHOD FOR GENERATING PHOTOGRAPHED IMAGE DATA USING VIRTUAL ORGANOID
2y 0m to grant Granted Apr 07, 2026
18/247,682
Patent 12579732
Face-Oriented Geometry Streaming
2y 11m to grant Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

3-4
Expected OA Rounds
65%
Grant Probability
79%
With Interview (+13.9%)
2y 5m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 26 resolved cases by this examiner. Grant probability derived from career allowance rate.