Last updated: May 29, 2026

Application No. 18/747,627

MOUTH SHAPE-BASED METHOD AND APPARATUS FOR GENERATING FACE IMAGE, METHOD AND APPARATUS FOR TRAINING MODEL, AND STORAGE MEDIUM

Non-Final OA §102

Filed

Jun 19, 2024

Priority

Aug 17, 2023 — CN 202311040269.8

Examiner

LETT, THOMAS J

Art Unit

2611

Tech Center

2600 — Communications

Assignee

BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD.

OA Round

1 (Non-Final)

Interview Optional

Interview lift is -35.9%, below the 15.0% threshold. A written response is recommended.

Based on 725 resolved cases, 2023–2026

Examiner Intelligence

LETT, THOMAS J

Grants 84% — above average

Career Allowance Rate

606 granted / 725 resolved

+21.6% vs TC avg

Interview Lift: -35.9% (rated minimal), comparing resolved cases with and without an interview

Typical timeline

2y 10m

Avg Prosecution

21 currently pending

Career history

748

Total Applications

across all art units

Statute-Specific Performance

§101

5.3%

-34.7% vs TC avg

§103

41.0%

+1.0% vs TC avg

§102

51.0%

+11.0% vs TC avg

§112

2.4%

-37.6% vs TC avg

Tech Center average values are estimates. Based on career data from 725 resolved cases.
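The "vs TC avg" figures above can be cross-checked with a little arithmetic. The sketch below assumes each difference is the examiner's displayed percentage minus the Tech Center average, in percentage points; the page does not state this, so treat it as an assumption. The function and variable names are illustrative only.

```python
# (statute, examiner percentage as displayed, displayed difference vs TC avg)
STATUTE_STATS = [
    ("101", 5.3, -34.7),
    ("103", 41.0, 1.0),
    ("102", 51.0, 11.0),
    ("112", 2.4, -37.6),
]


def implied_tc_average(rate_pct: float, delta_points: float) -> float:
    """Back out the Tech Center baseline.

    Assumption (not stated on the page): delta = examiner rate - TC average,
    measured in percentage points.
    """
    return round(rate_pct - delta_points, 1)


implied = {
    statute: implied_tc_average(rate, delta)
    for statute, rate, delta in STATUTE_STATS
}

for statute, baseline in implied.items():
    print(f"Section {statute}: implied TC average {baseline}%")
```

Under that assumption, all four rows imply the same 40.0% baseline, which would be consistent with the Tech Center average being a single estimate rather than a per-statute figure.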

Office Action

§102

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-3, 5-9, 18 and 36 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Kang et al. (US 20220044463 A1).
Regarding claim 1, Kang et al. discloses a mouth shape-based method for generating a face image (to enable, according to the expression parameter, an animation character to make an expression corresponding to the first speech, para. 0009),
comprising:
acquiring audio data to be recognized (obtaining a first speech, the first speech comprising a plurality of speech frames, para. 0025) and a preset face image (For a set animation character whose expression (for example, a mouth shape) may be adjusted by adjusting the expression parameter, the animation character may be adjusted by using the determined expression parameter, to generate an animation character corresponding to the speech, para. 0028); 
determining an audio feature of the audio data to be recognized (a first speech including a plurality of speech frames is obtained, linguistics information corresponding to a speech frame in the first speech may be determined, para. 0030); 
wherein the audio feature comprises a speech speed feature and a semantic feature (the PPG (that is, the linguistics information) is spliced with a pre-annotated sentiment vector to obtain a final feature, so that the expression parameter corresponding to the speech frame in the first speech is determined according to the PPG and the sentiment vector corresponding to the first speech, para. 0063); and
performing, according to the speech speed feature and the semantic feature, processing on the preset face image, to generate a face image having a mouth shape (a set animation character whose expression (for example, a mouth shape) may be adjusted by adjusting the expression parameter, the animation character may be adjusted by using the determined expression parameter, to generate an animation character corresponding to the speech, para. 0028).
Regarding claim 2, Kang et al. discloses the method according to claim 1, wherein the determining the audio feature of the audio data to be recognized comprises: 
determining, according to a preset first feature extraction model, a speech speed feature of the audio data to be recognized (the sentiment is represented by using the sentiment vector, para. 0064); 
wherein the first feature extraction model is used for extracting the speech speed feature from the audio data to be recognized (Speech2Face system includes four parts. A first part is to train an ASR model for PPG extraction, para. 0053); and 
determining, according to a preset second feature extraction model, a semantic feature of the audio data to be recognized (the sentiment is represented by using the sentiment vector, para. 0064); 
wherein the second feature extraction model is used for extracting the semantic feature from the audio data to be recognized (after a PPG of the speech frame is obtained by using the trained ASR model, the PPG (that is, the linguistics information) is spliced with a pre-annotated sentiment vector to obtain a final feature, so that the expression parameter corresponding to the speech frame in the first speech is determined according to the PPG and the sentiment vector corresponding to the first speech, para. 0063).
Regarding claim 3, Kang et al. discloses the method according to claim 2, wherein the determining, according to the preset first feature extraction model, the speech speed feature of the audio data to be recognized comprises: 
inputting the audio data to be recognized into the preset first feature extraction model for feature extraction, to obtain a phonetic posteriorgram feature of the audio data to be recognized (linguistics information being used for identifying a distribution possibility that the speech frame in the first speech pertains to phonemes, para. 0008);
wherein the phonetic posteriorgram feature represents information about a phoneme category of the audio data to be recognized (a combination of two or more of a phonetic posterior gram (PPG), para. 0052); and
determining, according to the phonetic posteriorgram feature of the audio data to be recognized, the speech speed feature of the audio data to be recognized (Examiner articulates that a PPG is a time-varying representation of a speech signal that shows the probability distribution over different phonemes (speech sounds) at each moment in time).
Regarding claim 5, Kang et al. discloses the method according to claim 2, wherein the determining, according to the preset second feature extraction model, the semantic feature of the audio data to be recognized comprises: 
inputting the audio data to be recognized into the preset second feature extraction model for feature extraction, to obtain output semantic feature of the audio data to be recognized (a vector having 218 dimensions is obtained, and is spliced with a 4-dimensional sentiment vector of the first speech to obtain a feature vector having 222 dimensions, which is subsequently used as an input to the neural network mapping model, para. 0064).
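The splicing step quoted from Kang para. 0064 amounts to concatenating two vectors. The following is a minimal sketch of that step only, assuming plain Python lists; the 218, 4 and 222 dimensions come from the quoted passage, while the function name, the placeholder values, and the idea that the sentiment vector is supplied once per utterance are illustrative assumptions, not Kang's implementation.

```python
from typing import List

LINGUISTIC_DIM = 218  # per-frame vector size quoted from para. 0064
SENTIMENT_DIM = 4     # sentiment vector size quoted from para. 0064


def splice_features(linguistic: List[float], sentiment: List[float]) -> List[float]:
    """Concatenate a frame's linguistic vector with the sentiment vector.

    Hypothetical helper: it only illustrates the dimension bookkeeping
    (218 + 4 = 222) described in the cited paragraph.
    """
    if len(linguistic) != LINGUISTIC_DIM:
        raise ValueError(f"expected {LINGUISTIC_DIM} linguistic values, got {len(linguistic)}")
    if len(sentiment) != SENTIMENT_DIM:
        raise ValueError(f"expected {SENTIMENT_DIM} sentiment values, got {len(sentiment)}")
    return list(linguistic) + list(sentiment)


# Placeholder inputs; real values would come from trained models.
linguistic_frame = [0.0] * LINGUISTIC_DIM
sentiment_vector = [1.0, 0.0, 0.0, 0.0]

spliced = splice_features(linguistic_frame, sentiment_vector)
print(len(spliced))  # 222
```

Note that the quoted passage describes splicing a linguistic vector with a sentiment vector; whether either of these corresponds to the claimed speech speed feature or semantic feature is a question of claim mapping that the sketch does not address.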
Regarding claim 6, Kang et al. discloses the method according to claim 1, wherein the performing, according to the speech speed feature and the semantic feature, the processing on the preset face image, to generate the face image having the mouth shape comprises: inputting the speech speed feature and the semantic feature into a preset model for determining a mouth shape of a face for processing (a feature vector having 222 dimensions, which is subsequently used as an input to the neural network mapping model, para. 0064), and generating, according to a result obtained from the processing and the preset face image, the face image having the mouth shape (to drive the animation character to make an expression (a mouth shape), para. 0067).
Regarding claim 7, Kang et al. discloses the method according to claim 6, wherein the inputting the speech speed feature and the semantic feature into the preset model for determining the mouth shape of the face for the processing, and the generating, according to the result obtained from the processing and the preset face image, the face image having the mouth shape comprise: 
performing, based on the preset model for determining the mouth shape of the face, splicing processing on the speech speed feature and the semantic feature, to obtain a spliced feature of the audio data to be recognized; wherein the spliced feature represents the speech speed feature and the semantic feature (the PPG (that is, the linguistics information) is spliced with a pre-annotated sentiment vector to obtain a final feature, so that the expression parameter corresponding to the speech frame in the first speech is determined according to the PPG and the sentiment vector corresponding to the first speech, para. 0063); 
performing, according to a convolutional layer in the preset model for determining the mouth shape of the face, feature extraction on the spliced feature, to obtain a face driving parameter (determined by using the neural network mapping model, manners of determining the speech frame set are different and quantities of speech frames in the speech frame set are different according to different neural network mapping models used, para. 0071); 
wherein the face driving parameter is used for representing a parameter required to drive a mouth shape change in a face image; and performing, according to the face driving parameter, image rendering on the preset face image, to generate the face image having the mouth shape (an expression parameter may be determined according to the first speech, to drive the animation character to make an expression (a mouth shape), para. 0067).
Regarding claim 8, Kang et al. discloses the method according to claim 7, wherein the face driving parameter is a weight parameter of a blend shape (an expression parameter may be determined according to linguistics information, to accurately drive an animation character to make an expression corresponding to the first speech, para. 0030); and
the performing, according to the face driving parameter, the image rendering on the preset face image, to generate the face image having the mouth shape comprises: 
determining, according to the weight parameter of the blend shape, facial three-dimensional mesh data corresponding to the preset face image (a designed 3D animation character to make an expression corresponding to the first speech (for example, S304), para. 0053); 
wherein the facial three-dimensional mesh data is data representing a three-dimensional mesh model of a facial surface on a face image (see figure 2); and performing, according to the facial three-dimensional mesh data, image rendering on the preset face image, to generate the face image having the mouth shape (a phoneme is a minimum phonetic unit obtained through division according to a natural attribute of a speech, analysis is performed according to a pronunciation action in a syllable, and an action (for example, a mouth shape) forms a phoneme, para. 0045).
Regarding claim 9, Kang et al. discloses the method according to any one of claims 1 to 8, further comprising: if it is determined that a value represented by the speech speed feature of the audio data to be recognized is less than a preset speech speed threshold value, performing, according to the semantic feature, processing on the preset face image, to generate the face image having the mouth shape (the expressions may include a facial expression and a body posture expression. The facial expression may include, for example, a mouth shape, para. 0060).
Claim 18, a mouth shape-based apparatus claim, is rejected for the same reason as claim 1.
Claim 36, a non-transitory computer readable storage medium claim, is rejected for the same reason as claim 1.

Claims 10, 11, 27 and 38 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Chae et al. (US 20220358703 A1).
Regarding claim 10, Chae et al. discloses a method for training a model for determining a mouth shape of a face, comprising: 
acquiring image data to be trained and a preset face image (the person background image may be an image used during a training process, para. 0045); 
wherein the image data to be trained comprises audio data to be trained and a face image to be trained, and the face image to be trained has a mouth shape corresponding to the audio data to be trained (extract a voice feature vector from the speech audio signal, para. 0008);
determining an audio feature of the audio data to be trained (extract a voice feature vector, para. 0008); 
wherein the audio feature comprises a speech speed feature and a semantic feature; performing, according to the speech speed feature, the semantic feature, and the preset face image, training on an initial model for determining a mouth shape of a face, and obtaining a face image having a mouth shape (adjust learning parameters (e.g., the loss function or the Softmax function) so that the generated speech video (i.e., a video in which the portion related to the speech are reconstructed through the audio part) is similar to the original speech video, para. 0044); and if the face image having the mouth shape and the face image to be trained are consistent, determining that a trained model for determining a mouth shape of a face is obtained (decoder 108 may compare the generated speech video with the original speech video (i.e., an answer value) and thus adjust learning parameters (e.g., the loss function or the Softmax function), para. 0044).
Regarding claim 11, Chae et al. discloses the method according to claim 10, wherein the determining the audio feature of the audio data to be trained comprises: determining, according to a preset first feature extraction model, a speech speed feature of the audio data to be trained; wherein the first feature extraction model is used for extracting the speech speed feature from the audio data to be trained; and determining, according to a preset second feature extraction model (second encoder 104 is a machine learning model that is trained to extract a voice feature vector using a speech audio signal as an input. Here, the speech audio signal corresponds to an audio part of the person background image (i.e., an image in which a person is speaking) input to the first encoder 102. In other words, in a video in which a person speaks (hereinafter, referred to as a “speech video”), a video part thereof may be input to the first encoder 102, and an audio part thereof may be input to the second encoder 104. The second encoder 104 may include at least one convolutional layer, para. 0040), a semantic feature of the audio data to be trained; wherein the second feature extraction model is used for extracting the semantic feature from the audio data to be trained.
Claim 27, an apparatus claim, is rejected for the same reason as claim 10.
Claim 38, a non-transitory computer readable storage medium claim, is rejected for the same reason as claim 10.

Allowable Subject Matter
Claims 4, 12, 13 and 15-17 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to THOMAS J LETT whose telephone number is (571)272-7464. The examiner can normally be reached Mon-Fri 9-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard, can be reached at (571) 272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/THOMAS J LETT/Primary Examiner, Art Unit 2611

Prosecution Timeline

Jun 19, 2024

Application Filed

Jan 15, 2026

Non-Final Rejection mailed — §102

Apr 13, 2026

Response Filed

Precedent Cases

Applications granted by this same examiner with similar technology

18/566,523

Patent 12633014

GENERATING IMAGE METHOD AND APPARATUS, DEVICE, AND MEDIUM

2y 5m to grant Granted May 19, 2026

18/468,301

Patent 12627947

APPARATUSES, COMPUTER-IMPLEMENTED METHODS, AND COMPUTER PROGRAM PRODUCTS FOR IMPROVED DATA TRANSMISSION AND TRACKING

2y 8m to grant Granted May 12, 2026

17/935,077

Patent 12620181

DETERMINING AN ASSIGNMENT OF VIRTUAL OBJECTS TO POSITIONS IN A USER FIELD OF VIEW TO RENDER IN A MIXED REALITY DISPLAY

3y 7m to grant Granted May 05, 2026

18/529,268

Patent 12619774

CONTROLLED EXPOSURE TO LOCATION-BASED VIRTUAL CONTENT

2y 5m to grant Granted May 05, 2026

18/382,917

Patent 12602714

LIGHTING AND INTERNET OF THINGS DESIGN USING AUGMENTED REALITY

2y 5m to grant Granted Apr 14, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2

Expected OA Rounds

84%

Grant Probability

48%

With Interview (-35.9%)

2y 10m (~10m remaining)

Median Time to Grant

Low

PTA Risk

Based on 725 resolved cases by this examiner. Grant probability derived from career allowance rate.
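The projection figures can be reproduced from the counts shown earlier on this page. The sketch below assumes the grant probability is granted divided by resolved cases, and that the "With Interview" figure is that rate plus the interview lift in percentage points; the second assumption is inferred from the displayed numbers and is not stated by the page. All names are illustrative.

```python
GRANTED = 606                  # granted cases, as displayed
RESOLVED = 725                 # resolved cases, as displayed
INTERVIEW_LIFT_POINTS = -35.9  # interview lift, as displayed


def allowance_rate_pct(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage of resolved cases."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return 100.0 * granted / resolved


def with_interview_pct(base_pct: float, lift_points: float) -> float:
    """Assumption: the lift is added to the base rate in percentage points."""
    return base_pct + lift_points


base = allowance_rate_pct(GRANTED, RESOLVED)
projected = with_interview_pct(base, INTERVIEW_LIFT_POINTS)

print(round(base), round(projected))  # 84 48
```

Both rounded values match the 84% and 48% shown above, which is consistent with, though does not prove, the additive reading of the interview lift.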