DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on October 9, 2025 has been entered.
Response to Amendment
This action is in response to the amendment filed on November 12, 2025. Claims 1, 15, and 29-30 have been amended. Claims 10 and 24 have been cancelled. Claims 1-9, 11-23, and 25-30 are pending and remain rejected.
Response to Arguments
Applicant's arguments filed on November 12, 2025 with respect to the rejection of Claims 1, 15, and 29-30 under 35 U.S.C. § 103, asserting that the prior art does not teach the limitations "generate image features based on the one or more images of the one or both eyes of the user using one or more image-encoder machine-learning models, wherein the image features comprise a vector of values representative of the one or more images and wherein the one or more image-encoder machine-learning models are trained to generate image-based features based on training images", "generate audio features based on the audio data using an audio-encoder machine-learning model, wherein the audio features comprise a vector of values representative of the audio data and wherein the audio-encoder machine-learning model is trained to generate audio-based features based on training audio data", "combine the image features and the audio features", and "generate a three-dimensional model of the face based on the combined image and audio features using a model-generator machine-learning model, wherein the model-generator machine-learning model is trained to generate three-dimensional models based on training image and audio features", have been fully considered but are moot in view of the new grounds of rejection. These limitations are now taught by the combination of Richard and Sahu, as set forth below.
Regarding the arguments directed to Claims 2-9, 11-14, 16-23, and 25-28, these claims depend directly or indirectly from independent Claims 1, 15, and 29-30, respectively. Applicant presents no arguments beyond those directed to the independent claims; accordingly, these claims remain rejected based on the combination applied to the independent claims, as explained above.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 7-9, 15-18, 21-23, and 29-30 are rejected under 35 U.S.C. 103 as being unpatentable over Richard et al. (US 20220309724 A1), hereinafter referenced as Richard, in view of Sahu et al. (US 20220300740 A1), hereinafter referenced as Sahu.
Regarding Claim 1, Richard discloses an apparatus for generating models of faces, the apparatus (Richard, [0089]: teaches a computer system 1200 <read on apparatus> used for generating 3D facial models as shown in FIG. 12) comprising:
at least one memory (Richard, [0090]: teaches the computer system 1200 including memory 1204); and
at least one processor coupled to the at least one memory and configured to (Richard, [0089]: teaches the computer system 1200 including processor 1202):
obtain one or more images of one or both eyes of a face of a user (Richard, [0082]: teaches receiving audio and image capture of the face of a subject <read on obtain images of both eyes of face of user>);
generate image features based on the one or more images of the one or both eyes of the user using one or more image-encoder machine-learning models (Richard, [0033]: teaches a facial expression encoder <read on image-encoder machine-learning models> that identifies an expression-like facial feature of the subject to generate a second mesh for an upper portion of the face of the subject <read on generate image features>, where sampling of multiple facial expressions of the subject is performed), wherein
[[the image features comprise a vector of values representative of the one or more images and wherein]]
[[the one or more image-encoder machine-learning models are trained to generate image-based features based on training images;]]
obtain audio data based on utterances of the user (Richard, FIG. 9 teaches obtaining speech audio input <read on obtained audio data> of a subject <read on utterance of user> and modifying a mesh template to match facial features to the audio input); and
generate audio features based on the audio data using an audio-encoder machine-learning model (Richard, [0039]: teaches recording a speech signal 328 <read on audio data>, where "for each tracked mesh, a Mel spectrogram is generated <read on generated audio features>, including a 600 ms audio snippet starting 500 ms before and ending 100 ms after the respective visual frame"; [0033]: teaches using an audio encoder <read on audio-encoder machine-learning model> that "identifies audio-correlated facial features to generate a first mesh for a lower portion of a face of a subject, according to a classification scheme that is learned by training"), wherein
[[the audio features comprise a vector of values representative of the audio data and wherein]]
[[the audio-encoder machine-learning model is trained to generate audio-based features based on training audio data;]]
[[combine the image features and the audio features; and]]
generate a three-dimensional model of the face [[based on the combined image and audio features]] using a model-generator machine-learning model (Richard, FIG. 10 teaches generating a synthesized mesh <read on generate 3D model of face> based on the first and second meshes, where the first mesh is generated for a lower portion of a face of the subject based on audio-correlated facial feature, and the second mesh is generated for an upper portion of a face of the subject based on expression-like facial feature, which uses a multimodal encoder), wherein
the model-generator machine-learning model is trained to generate three-dimensional models based on training image [[and audio features]] (Richard, [0083]: teaches "training a 3D model to create real-time 3D speech animation <read on generate 3D models> of a subject"; [0085]: teaches "generating a first mesh for a lower portion of a human face, based on the facial feature <read on training image> and the first correlation value"; [0033]: teaches the facial expression encoder 244 selects "the expression-like facial feature based on a prior sampling of multiple subject's facial expressions").
However, Richard does not expressly disclose
the image features comprise a vector of values representative of the one or more images and wherein
the one or more image-encoder machine-learning models are trained to generate image-based features based on training images;
the audio features comprise a vector of values representative of the audio data and wherein
the audio-encoder machine-learning model is trained to generate audio-based features based on training audio data;
combine the image features and the audio features; and
generate a three-dimensional model of the face based on the combined image and audio features using a model-generator machine-learning model, wherein
the model-generator machine-learning model is trained to generate three-dimensional models based on training image and audio features.
Sahu discloses
the image features comprise a vector of values representative of the one or more images (Sahu, [0064]: teaches the system using video and audio feature extractors 206 and 210 to extract video and audio features respectively and generate a representation X, where X is an input feature matrix <read on vector of values>, such as one containing audio/video features <read on image features>; [0087]: teaches "multiple matrices are generated based on the identified features at step 506," where processor 120 uses "the linear transformation functions 304a-304c to convert an input matrix containing the features 208, 212 into transformed outputs 306a-306c, such as query, key, and value matrices") and wherein
the one or more image-encoder machine-learning models are trained to generate image-based features based on training images (Sahu, [0094]: teaches "processor 120 passing the training samples through the machine learning model 214 and using the results to calculate a cross-entropy loss value associated with the training samples," where "this may be performed for both audio and video training samples <read on training images>"; [0094]: further teaches processor 120 obtaining input data 202 and 204 (or the extracted features 208 <read on generate image-based features> and 212) representing known training samples and associated ground truth labels 222);
the audio features comprise a vector of values representative of the audio data (Sahu, [0064]: teaches the system using the video and audio feature extractors 206 and 210 to generate a representation X, where X is an input feature matrix <read on vector of values>, such as one containing audio/video features <read on audio features>; [0087]: teaches "multiple matrices are generated based on the identified features at step 506," where processor 120 uses "the linear transformation functions 304a-304c to convert an input matrix containing the features 208, 212 into transformed outputs 306a-306c, such as query, key, and value matrices") and wherein
the audio-encoder machine-learning model is trained to generate audio-based features based on training audio data (Sahu, [0087]: teaches an audio feature extractor 210 extracting features of the input audio to generate audio features 212 that represent the input audio data 204; [0051]: teaches the audio feature extractor being an encoder <read on audio-encoder machine-learning model>);
combine the image features and the audio features (Sahu, [0073]: teaches outputting representations 370 <read on combine features> of the audio/video input <read on image and audio features>); and
generate a three-dimensional model of the face based on the combined image and audio features using a model-generator machine-learning model (Sahu, [0073]: teaches outputting representations 370 <read on combined features> of the audio/video input <read on image and audio features>), wherein
the model-generator machine-learning model is trained to generate three-dimensional models based on training image and audio features (Sahu, [0094]: teaches "processor 120 passing the training samples through the machine learning model 214 and using the results to calculate a cross-entropy loss value associated with the training samples," where "this may be performed for both audio and video training samples <read on training images and audio features>").
Sahu is analogous art with respect to Richard because they are from the same field of endeavor, namely analyzing audio and image inputs for machine learning model training applications. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement audio and video extractors to generate audio and video features as taught by Sahu into the teaching of Richard. The suggestion for doing so would allow the neural network to better understand video and audio correlations, thereby improving the overall system. Therefore, it would have been obvious to combine Sahu with Richard.
Regarding Claim 15, it recites the limitations that are similar in scope to Claim 1, but in a method. As shown in the rejection, the combination of Richard and Sahu discloses the limitations of Claim 1. Additionally, Richard discloses a method for generating models of faces (Richard, [0074]: teaches method 1000 for embedding a 3D speech animation model <read on generating face models> in a VR environment), the method comprising:…
Thus, Claim 15 is met by the combination of Richard and Sahu according to the mapping presented in the rejection of Claim 1, with the apparatus limitations corresponding to the recited method steps.
Regarding Claim 29, it recites the limitations that are similar in scope to Claim 1, but in a non-transitory computer-readable storage medium. As shown in the rejection, the combination of Richard and Sahu discloses the limitations of Claim 1. Additionally, Richard discloses a non-transitory computer-readable storage medium having stored thereon instructions that, when executed by at least one processor, cause the at least one processor to (Richard, [0020]: teaches "a non-transitory, computer-readable medium stores instructions which, when executed by a processor, cause a computer to perform a method"):…
Thus, Claim 29 is met by the combination of Richard and Sahu according to the mapping presented in the rejection of Claim 1, with the apparatus limitations corresponding to the instructions stored on the non-transitory computer-readable storage medium.
Regarding Claim 30, it recites the limitations that are similar in scope to Claim 1. As shown in the rejection, the combination of Richard and Sahu discloses the limitations of Claim 1. Thus, Claim 30 is rejected under the same rationale as in Claim 1.
Regarding Claims 2 and 16, the combination of Richard and Sahu discloses the apparatus and the method of Claims 1 and 15 respectively. Additionally, Richard further discloses wherein
a mouth portion of the three-dimensional model of the face is based on the audio data (Richard, [0025]: teaches "the latent space is trained based on a novel cross-modality loss that encourages the model to have an accurate upper face reconstruction independent of the audio input and accurate mouth area <read on mouth portion of 3D model of face> that only depends on the provided audio input").
Regarding Claims 3 and 17, the combination of Richard and Sahu discloses the apparatus and the method of Claims 1 and 15 respectively. Additionally, Richard further discloses wherein the three-dimensional model comprises
a three-dimensional morphable model (3DMM) of the face (Richard, FIG. 9 teaches obtaining speech audio input and modifying a mesh template <read on 3DMM of face> to match facial features to the audio input).
Regarding Claims 4 and 18, the combination of Richard and Sahu discloses the apparatus and the method of Claims 1 and 15 respectively. Additionally, Richard further discloses wherein the three-dimensional model comprises
a plurality of vertices corresponding to points of the face (Richard, [0038]: teaches face meshes including 6,172 vertices <read on points of face> with a high level of detail including eyelids, upper face structure, and different hair styles).
Regarding Claims 7 and 21, the combination of Richard and Sahu discloses the apparatus and the method of Claims 1 and 15 respectively. Additionally, Richard further discloses wherein the audio data comprises
perception-based representation of the utterances of the user (Richard, [0061]: teaches model 500 generating two different sets of latent representations <read on perception-based representation>, S_audio and S_expr, where "S_audio contains latent codes (lower face meshes 521A) obtained by fixing the expression input to facial expression encoder (e.g., facial expression encoders 244 and 344) and varying the audio signal <read on utterances of user>").
Regarding Claims 8 and 22, the combination of Richard and Sahu discloses the apparatus and the method of Claims 7 and 21 respectively. Additionally, Richard further discloses wherein the perception-based representation of the utterances comprises
a representation of the audio data based on perceptually-relevant frequencies and perceptually-relevant amplitudes (Richard, [0075]: teaches "identifying an intensity and a frequency of the audio capture <read on audio data representation> from the subject and correlating an amplitude <read on perceptually-relevant amplitudes> and a frequency <read on perceptually-relevant frequencies> of an audio waveform with a geometry of the lower portion of the face of the subject"; Note: amplitudes are used to distinguish sounds from one another and can also be used for detecting and predicting a current emotion; correlating the amplitude and frequency of the audio capture is therefore being interpreted as correlating perceptually-relevant amplitudes and perceptually-relevant frequencies, respectively).
Regarding Claims 9 and 23, the combination of Richard and Sahu discloses the apparatus and the method of Claims 1 and 15 respectively. Additionally, Richard further discloses wherein the audio data comprises
a Mel spectrogram representative of the utterances of the user (Richard, [0065]: teaches using Deep-Speech features, which include Mel spectrograms).
Claims 5-6, 11-12, 19-20, and 25-26 are rejected under 35 U.S.C. 103 as being unpatentable over Richard et al. (US 20220309724 A1), hereinafter referenced as Richard, in view of Sahu et al. (US 20220300740 A1), hereinafter referenced as Sahu as applied to Claims 1 and 15 above respectively, and further in view of Cao et al. (US 20230245365 A1, previously cited), hereinafter referenced as Cao.
Regarding Claims 5 and 19, the combination of Richard and Sahu discloses the apparatus and the method of Claims 1 and 15 respectively. The combination of Richard and Sahu does not expressly disclose the limitations of Claims 5 and 19; however, Cao discloses wherein the at least one processor is further configured to
obtain a view for the three-dimensional model of the face (Cao, FIG. 7 teaches obtaining a plurality of views of the user's face), wherein
the three-dimensional model of the face is generated based on the view (Cao, [0042]: teaches "a universal prior model that is trained on high resolution multi-view video captures of facial performances of hundreds of human subjects," where the 3D head avatar output <read on generated 3D model of face> matches the facial shape and appearance <read on view features> of the user from a plurality of angles).
Cao is analogous art with respect to Richard, in view of Sahu because they are from the same field of endeavor, namely utilizing machine learning models to generate 3D face meshes based on training data. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement a universal prior model to capture and update viewing angles of the captured 3D face mesh as taught by Cao into the teaching of Richard, in view of Sahu. The suggestion for doing so would allow the system to capture the facial shape and appearance of the user's face, thereby capturing and displaying unique facial attributes at multiple viewing angles. Therefore, it would have been obvious to combine Cao with Richard, in view of Sahu.
Regarding Claims 6 and 20, the combination of Richard, Sahu, and Cao discloses the apparatus and the method of Claims 5 and 19 respectively. The combination of Richard and Sahu does not expressly disclose the limitations of Claims 6 and 20; however, Cao discloses wherein the view for the three-dimensional model of the face is based on
an angle from which the three-dimensional model of the face is to be viewed (Cao, [0073]: teaches generating tracked meshes, where "a high-coverage landmark detector runs multiple views <read on angle> of each frame," in which "the detected landmarks are then used to initialize a Principal Component Analysis (PCA) model-based tracking method to produce the final tracked mesh").
Cao is analogous art with respect to Richard, in view of Sahu because they are from the same field of endeavor, namely utilizing machine learning models to generate 3D face meshes based on training data. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement a universal prior model to capture and update viewing angles of the captured 3D face mesh as taught by Cao into the teaching of Richard, in view of Sahu. The suggestion for doing so would allow the system to capture the facial shape and appearance of the user's face, thereby capturing and displaying unique facial attributes at multiple viewing angles. Therefore, it would have been obvious to combine Cao with Richard, in view of Sahu.
Regarding Claims 11 and 25, the combination of Richard and Sahu discloses the apparatus and the method of Claims 1 and 15 respectively. The combination of Richard and Sahu does not expressly disclose the limitations of Claims 11 and 25; however, Cao discloses wherein the at least one processor is further configured to:
obtain a view for the three-dimensional model of the face (Cao, FIG. 7 teaches obtaining a plurality of views of the user's face); and
generate view features based on the view using a view-encoder machine-learning model (Cao, [0042]: teaches "a universal prior model <read on view-encoder machine-learning model> that is trained on high resolution multi-view video captures of facial performances of hundreds of human subjects," where the 3D head avatar output matches the facial shape and appearance <read on generate view features> of the user from a plurality of angles), wherein
the view-encoder machine-learning model is trained to generate view-based features based on training views (Cao, [0042]: teaches "a universal prior model <read on view-encoder machine-learning model> that is trained on high resolution multi-view video captures <read on training views> of facial performances of hundreds of human subjects," where the 3D head avatar output matches the facial shape and appearance <read on generate view-based features> of the user from a plurality of angles); wherein
the three-dimensional model of the face is generated based on the view features (Cao, [0042]: teaches "a universal prior model that is trained on high resolution multi-view video captures of facial performances of hundreds of human subjects," where the 3D head avatar output <read on generated 3D model of face> matches the facial shape and appearance <read on view features> of the user from a plurality of angles).
Cao is analogous art with respect to Richard, in view of Sahu because they are from the same field of endeavor, namely utilizing machine learning models to generate 3D face meshes based on training data. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement a universal prior model to capture and update viewing angles of the captured 3D face mesh as taught by Cao into the teaching of Richard, in view of Sahu. The suggestion for doing so would allow the system to capture the facial shape and appearance of the user's face, thereby capturing and displaying unique facial attributes at multiple viewing angles. Therefore, it would have been obvious to combine Cao with Richard, in view of Sahu.
Regarding Claims 12 and 26, the combination of Richard, Sahu, and Cao discloses the apparatus and the method of Claims 11 and 25 respectively. The combination of Richard and Sahu does not expressly disclose the limitations of Claims 12 and 26; however, Cao discloses wherein the view for the three-dimensional model of the face is based on
an angle from which the three-dimensional model of the face is to be viewed (Cao, [0073]: teaches generating tracked meshes, where "a high-coverage landmark detector runs multiple views <read on angle> of each frame," in which "the detected landmarks are then used to initialize a Principal Component Analysis (PCA) model-based tracking method to produce the final tracked mesh").
Cao is analogous art with respect to Richard, in view of Sahu because they are from the same field of endeavor, namely utilizing machine learning models to generate 3D face meshes based on training data. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement a universal prior model to capture and update viewing angles of the captured 3D face mesh as taught by Cao into the teaching of Richard, in view of Sahu. The suggestion for doing so would allow the system to capture the facial shape and appearance of the user's face, thereby capturing and displaying unique facial attributes at multiple viewing angles. Therefore, it would have been obvious to combine Cao with Richard, in view of Sahu.
Claims 13 and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Richard et al. (US 20220309724 A1), hereinafter referenced as Richard, in view of Sahu et al. (US 20220300740 A1), hereinafter referenced as Sahu as applied to Claims 1 and 15 above respectively, and further in view of Kwatra et al. (US 20230343010 A1, previously cited), hereinafter referenced as Kwatra.
Regarding Claims 13 and 27, the combination of Richard and Sahu discloses the apparatus and the method of Claims 1 and 15 respectively. The combination of Richard and Sahu does not expressly disclose the limitations of Claims 13 and 27; however, Kwatra discloses wherein the at least one processor is further configured to:
generate a UV map of the face based on the three-dimensional model of the face using a first renderer (Kwatra, [0068]: teaches the system predicting both texture and geometry of the user's face by "projecting the corresponding predicted vertices onto the reference cylinder, and using their 2D location as new texture coordinates <read on generating UV map>" to be used to render <read on first renderer> a 3D head model);
generate a texture map based on the UV map of the face using a machine-learning encoder-decoder (Kwatra, [0030]: teaches using an encoder-decoder framework "that computes embeddings from audio spectrograms, and decodes them into 3D geometry and texture <read on generating texture map>"); and
render the three-dimensional model of the face based on the three-dimensional model of the face and the texture map using a second renderer (Kwatra, [0082]: teaches "the computing system can combine the face geometry and the face texture <read on texture map> to generate a three-dimensional face mesh model"; [0084]: teaches "rendering <read on second renderer> the three-dimensional face mesh within the two-dimensional target video at the target position to generate the synthesized video").
Kwatra is analogous art with respect to Richard, in view of Sahu because they are from the same field of endeavor, namely generating 3D face models based on audio input data. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to have the system analyze audio input data for phonemes and visemes as taught by Kwatra into the teaching of Richard, in view of Sahu. The suggestion for doing so would allow the system to predict the user's facial expression based on spectrogram audio data and facial image data, thereby resulting in a more accurate and realistic 3D model of the user's face when combined together. Therefore, it would have been obvious to combine Kwatra with Richard, in view of Sahu.
Claims 14 and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Richard et al. (US 20220309724 A1), hereinafter referenced as Richard, in view of Sahu et al. (US 20220300740 A1), hereinafter referenced as Sahu as applied to Claims 1 and 15 above respectively, and further in view of Li et al. (US 20170243387 A1, previously cited), hereinafter referenced as Li.
Regarding Claims 14 and 28, the combination of Richard and Sahu discloses the apparatus and the method of Claims 1 and 15 respectively. Additionally, Richard further discloses wherein the at least one processor is further configured to
obtain an image of at least a portion of a mouth of the face of the user (Richard, [0025]: teaches "the latent space is trained based on a novel cross-modality loss that encourages the model to have an accurate upper face reconstruction independent of the audio input and accurate mouth area <read on image of mouth portion of face> that only depends on the provided audio input"); and
[[generate mouth-image features using a mouth-image-encoder machine-learning model, wherein]]
[[the mouth-image-encoder machine-learning model is trained to generate mouth-image features based on training mouth images, wherein]]
[[the three-dimensional model of the face is generated based on the mouth-image features.]]
However, the combination of Richard and Sahu does not expressly disclose
generate mouth-image features using a mouth-image-encoder machine-learning model, wherein
the mouth-image-encoder machine-learning model is trained to generate mouth-image features based on training mouth images, wherein
the three-dimensional model of the face is generated based on the mouth-image features.
Li discloses
generate mouth-image features using a mouth-image-encoder machine-learning model (Li, [0045]: teaches a training system 200 <read on mouth-image-encoder machine-learning model> generating a mouth regression model 208 based on mouth image training data, where the mouth regression model contains mouth animation parameters <read on generate mouth-image features> derived from the training process), wherein
the mouth-image-encoder machine-learning model is trained to generate mouth-image features based on training mouth images (Li, [0045]: teaches the training system 200 <read on mouth-image-encoder machine-learning model> generating a mouth regression model 208 based on mouth image training data <read on training mouth images>, where the mouth regression model contains mouth animation parameters <read on generate mouth-image features> derived from the training process), wherein
the three-dimensional model of the face is generated based on the mouth-image features (Li, [0045]: teaches the training system 200 generating the mouth regression model 208 <read on 3D model> based on mouth image training data, which contains mouth animation parameters <read on mouth-image features>).
Li is analogous art with respect to Richard, in view of Sahu because they are from the same field of endeavor, namely utilizing machine learning models to generate 3D face meshes. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement a training system to generate a mouth regression model as taught by Li into the teaching of Richard, in view of Sahu. The suggestion for doing so would be to include mouth animation parameters derived from mouth image training data, allowing the neural networks to learn the limits of the mouth mesh and thereby yield more desirable results. Therefore, it would have been obvious to combine Li with Richard, in view of Sahu.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Bouaziz et al. (US 20140362091 A1) discloses real-time facial animation that uses a dynamic expression model;
Mittal et al. (US 20200135226 A1) discloses animating a visual representation of a face based on spoken words of a speaker; and
Nefian et al. (US 20050047664 A1) discloses modeling an audio-visual observation of a subject using a coupled Markov model to obtain an audio-visual model.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KARL TRUONG, whose telephone number is (703) 756-5915. The examiner can normally be reached from 7:30 AM to 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang can be reached at (571) 272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/K.D.T./Examiner, Art Unit 2614
/KENT W CHANG/Supervisory Patent Examiner, Art Unit 2614