Prosecution Insights
Last updated: April 19, 2026
Application No. 18/655,580

REAL-TIME EXTRACTION OF 3D ANIMATION INFORMATION FROM PREDICTED SPEECH

Status: Non-Final Office Action (§103), OA Round 1
Filed: May 06, 2024
Examiner: ZHAI, KYLE
Art Unit: 2611
Tech Center: 2600 — Communications
Assignee: Charter Communications Operating LLC
Grant Probability: 75% (Favorable)
Expected OA Rounds: 1-2
Median Time to Grant: 3y 0m
Grant Probability With Interview: 93%

Examiner Intelligence

Career Allow Rate: 75% (above average; +12.6% vs Tech Center average); 353 granted / 473 resolved
Interview Lift: +18.6% on resolved cases with an interview
Typical Timeline: 3y 0m average prosecution; 31 applications currently pending
Career History: 504 total applications across all art units
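The headline probabilities appear to be straightforward arithmetic on these career counts. A minimal sketch of that reading, assuming simple rounding and an additive interview lift (both are assumptions about the tool, not documented):

```python
# Sketch of how the dashboard's figures appear to derive from the career
# counts above; the additive lift and the rounding are assumptions.
granted, resolved = 353, 473

allow_rate = granted / resolved            # 0.746... -> displayed as 75%
with_interview = allow_rate + 0.186        # +18.6% interview lift -> 0.932...

print(f"Career allow rate: {allow_rate:.0%}")      # 75%
print(f"With interview:    {with_interview:.0%}")  # 93%
```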

Statute-Specific Performance

§101: 10.6% (-29.4% vs TC avg)
§103: 61.2% (+21.2% vs TC avg)
§102: 7.9% (-32.1% vs TC avg)
§112: 15.1% (-24.9% vs TC avg)

Tech Center averages are estimates; based on career data from 473 resolved cases.

Office Action (Non-Final, §103; mailed Nov 26, 2025)

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 15, 17, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Skov (US 2025/0218440) in view of Strietzel et al. (US 2009/0132371).

Regarding claim 1, Skov discloses a computer-implemented method (Skov, [0074], “FIG. 5 illustrates a flowchart of an example method 500 for context-based speech assistance”) comprising: processing, by a computing system comprising one or more computing devices (Skov, [0017], “the device 110 may include or be any electronic or digital computing device or system”), one or more inputs with a machine-learned language model to obtain a prediction output (Skov, [0029], “an artificial intelligence system that has been trained on a large amount of textual data to understand and generate human-like language prompts and responses”; in addition, [0030], “the prediction system 130 may generate the prediction based on the word or words mostly likely to follow the words previously spoken”), wherein the one or more inputs comprises speech information descriptive of one or more first words spoken by a user (Skov, [0018], “obtain the speech in real-time as the user 112 speaks”), and wherein the prediction output comprises one or more second words predicted to follow the one or more first words (Skov, [0026], “generate a prediction based on the transcription of the audio. The prediction may include one or more words that are predicted to follow a last word in the transcription obtained by the prediction system”).

Skov does not expressly disclose “determining, a sequence of visemes formed to produce the one or more second words.” Strietzel et al. (hereinafter Strietzel) discloses determining a sequence of visemes formed to produce one or more words (Strietzel, [0030], “detecting from the at least one audio track a plurality of phonemes and creating at least one viseme track that associates the plurality of phonemes with a plurality of visemes”), and, based on the sequence of visemes, generating facial animation information descriptive of a facial animation that animates a three-dimensional representation of a mouth of a user forming the sequence of visemes to speak the one or more words (Strietzel, [0031], “each of the plurality of visemes comprising instructions for a corresponding animated mouth movement of the individualized 3D head model; and compositing the media content, the individualized 3D head model, the at least one audio track and the at least one viseme track such that the individualized 3D head model is associated with the character and such that the at least one audio track and the at least one viseme track are associated with the individualized 3D head model to cause the animated mouth movement of the individualized 3D head model to correspond to the at least one audio track during playback of the personalized media content”).

It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Skov's prediction system to incorporate Strietzel's animated mouth movement technique for the individualized 3D head model. The motivation for doing so would have been providing the ability to generate a 3D head model that lip-syncs and exhibits facial expression for a human-like presence.
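Neither the claims nor the cited art includes code, but the prediction step mapped to Skov is a standard next-word-generation call. A minimal sketch, assuming an off-the-shelf model (gpt2) and the Hugging Face transformers pipeline; both are illustrative choices, not from the record:

```python
# Illustrative sketch of the "one or more second words predicted to follow
# the one or more first words" step; the model and library are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def predict_second_words(first_words: str, max_new_tokens: int = 5) -> str:
    """Return a short continuation predicted to follow the spoken words."""
    result = generator(first_words, max_new_tokens=max_new_tokens,
                       do_sample=False)  # greedy: most likely continuation
    return result[0]["generated_text"][len(first_words):].strip()

print(predict_second_words("Thanks for joining the call, today we will"))
```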
Regarding claim 2, Skov teaches the one or more second words predicted to follow the one or more first words (Skov, [0030], “the prediction system 130 may generate the prediction based on the word or words mostly likely to follow the words previously spoken”). Skov as modified by Strietzel, with the same motivation from claim 1, discloses extracting a sequence of phonemes from one or more words (Strietzel, [0210], “the phoneme module 1814 can be configured to convert the audio track into a phoneme track, consisting of a plurality of phonemes”); and, for each phoneme of the sequence of phonemes, mapping the phoneme to one or more visemes of the sequence of visemes formed to produce the phoneme (Strietzel, [0211], “convert the phoneme track into a viseme track. As used in this disclosure, a viseme is the visual counterpart of a phoneme and represents the basic unit of speech in the visual domain. In particular, a viseme can represent the particular facial and oral positions and movements that occur alongside the voicing of phoneme”).

Regarding claim 15, Skov discloses obtaining, by the computing system from a user device associated with the user, the speech information descriptive of the one or more first words spoken by the user (Skov, [0019], “the device 110 may be configured to obtain a transcription of the speech. The transcription may include a written form of words, for example text, that may be included in the speech of the audio obtained by the device 110”).
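The claim 2 pipeline the rejection describes (words, then phonemes, then visemes) can be sketched in a few lines. The phoneme and viseme tables below are hypothetical stand-ins, not taken from Strietzel:

```python
# Illustrative sketch of claim 2: extract phonemes from the predicted words,
# then map each phoneme to a viseme. Both lookup tables are hypothetical.
PHONEMES = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
PHONEME_TO_VISEME = {
    "HH": "viseme_open", "AH": "viseme_open", "L": "viseme_tongue_up",
    "OW": "viseme_round", "W": "viseme_round", "ER": "viseme_mid",
    "D": "viseme_tongue_up",
}

def visemes_for_words(words: list[str]) -> list[str]:
    """Map predicted words -> phoneme sequence -> viseme sequence."""
    sequence = []
    for word in words:
        for phoneme in PHONEMES.get(word, []):
            sequence.append(PHONEME_TO_VISEME[phoneme])
    return sequence

print(visemes_for_words(["hello", "world"]))
```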
Regarding claim 17, Skov discloses a computing system (Skov, [0013], “systems and methods that may prompt a speaker when the speaker encounters a speech disfluency”), comprising: a memory (Skov, [0017], “the device 110 may include memory”); and one or more processor devices coupled to the memory (Skov, [0017], “the device 110 may include memory and at least one processor, which are configured to perform operations as described in this disclosure”). The remaining limitations recited in claim 17 are similar in scope to the method recited in claim 1 and therefore are rejected under the same rationale.

Regarding claim 18, claim 18 recites functions that are similar in scope to the method steps recited in claim 2 and therefore are rejected under the same rationale.

Regarding claim 20, Skov discloses a non-transitory computer-readable storage medium that includes executable instructions to cause one or more processor devices to (Skov, [0089], “The memory 612 may include computer-readable storage media or one or more computer-readable storage mediums for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable storage media may be any available media that may be accessed by a general-purpose or special-purpose computer, such as the processor 610”). The limitations recited in claim 20 are similar in scope to the method recited in claim 1 and therefore are rejected under the same rationale.

Claims 3, 4, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Skov (US 2025/0218440) in view of Strietzel et al. (US 2009/0132371), as applied to claim 1, in further view of Sarkis et al. (US 2024/0062467).

Regarding claim 3, Skov as modified by Strietzel, with the same motivation from claim 1, discloses providing, by the computing system, the facial animation information (Strietzel, [0074], “Animation of video data can include portrayal of such events as turning or tilting of the head, speaking, blinking, and/or different facial expressions”) to the user (Strietzel, [0074], “create and display personalized media content starring the user”). Skov as modified by Strietzel does not expressly disclose “a user computing device associated with a second user different than the user.” Sarkis et al. (hereinafter Sarkis) discloses a user computing device associated with a second user different than a user (Sarkis, [0092], “a first user can transmit (e.g., directly or via the server), for receipt by a second device of a second user, mesh information defining a virtual representation or avatar for the first user for use in participating in a virtual session (e.g., a 3D collaborative virtual meeting in a metaverse environment, a computer or virtual game, or other virtual session) between the first and second users”). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Sarkis's feature of transmitting a first user's virtual avatar to a second user into the device described in Skov. The motivation for doing so would have been enabling the setup of a call with a high-quality avatar representing a person.
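As a rough illustration of the claim 3 delivery step the rejection maps to Sarkis (sending the generated animation to a second user's device), here is a hypothetical sketch; the endpoint URL and the payload shape are invented for the example:

```python
# Hypothetical sketch: deliver facial animation information to a second
# user's device. The URL and the JSON payload shape are assumptions.
import json
import urllib.request

def send_animation(frames: list[dict], peer_url: str) -> None:
    """POST viseme-driven animation frames to the second user's device."""
    payload = json.dumps({"type": "facial_animation", "frames": frames}).encode()
    request = urllib.request.Request(
        peer_url, data=payload, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request)

# send_animation([{"t": 0.0, "viseme": "viseme_round", "weight": 1.0}],
#                "http://peer.example/animation")  # hypothetical device URL
```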
Regarding claim 4, Skov as modified by Strietzel, with the same motivation from claim 1, discloses using the facial animation information to render at least some of the facial animation of the three-dimensional representation of the mouth of the user forming the sequence of visemes to speak the one or more second words (Strietzel, [0225], “These expression morph targets can comprise a plurality of instructions for manipulating and/or animating both a mouth portion and facial portions of a 2D or 3D model”). Skov as modified by Strietzel and Sarkis, with the same motivation from claim 3, discloses providing the at least some of the facial animation to a user computing device associated with a second user different than the user (Sarkis, [0092], “a first user can transmit (e.g., directly or via the server), for receipt by a second device of a second user, mesh information defining a virtual representation or avatar for the first user for use in participating in a virtual session (e.g., a 3D collaborative virtual meeting in a metaverse environment, a computer or virtual game, or other virtual session) between the first and second users”; the mesh information reads on at least some of the facial animation).

Regarding claim 19, claim 19 recites a function that is similar in scope to the method step recited in claim 3 and therefore is rejected under the same rationale.

Claims 5-8 are rejected under 35 U.S.C. 103 as being unpatentable over Skov (US 2025/0218440) in view of Strietzel et al. (US 2009/0132371), as applied to claim 1, in further view of Beith et al. (US 2024/0078731).

Regarding claim 5, Skov discloses processing, by the computing system, the speech information with the machine-learned language model to obtain the prediction output descriptive of the one or more second words predicted to follow the one or more first words (Skov, [0029], “an artificial intelligence system that has been trained on a large amount of textual data to understand and generate human-like language prompts and responses”; in addition, [0030], “the prediction system 130 may generate the prediction based on the word or words mostly likely to follow the words previously spoken”). Skov as modified by Strietzel does not expressly disclose “a plurality of contextual information elements.” Beith et al. (hereinafter Beith) discloses a plurality of contextual information elements (Beith, [0061], “emotional states which involve multiple parameters of the face to be in concert to convey the accurate emotion”), including an emotional state information element indicative of a predicted emotional state of a user (Beith, [0095], “detecting emotion associated with the meanings of words, phrases, and sentences of the user's speech 258, the audio unit 222 can include one or more machine learning models that are configured to detect audible emotions, such as happy, sad, angry, playful, romantic, serious, frustrated, etc., based on the speaking characteristics of the user”; in addition, [0095], “process the audio data 204 to predict the emotion 270”). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to use Beith's emotion detection to generate predictions for the following words in Skov. The motivation for doing so would have been enabling the system to anticipate the following words more effectively by providing emotional state in addition to the literal words.
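To make the claim 5 combination concrete: the contextual information elements (here, a predicted emotional state per Beith) would be folded into the language-model input alongside the transcribed first words. A hypothetical sketch; the prompt format is an assumption:

```python
# Hypothetical sketch: fold contextual information elements (e.g., the
# predicted emotional state per Beith [0095]) into the language-model
# input. The prompt format below is an assumption, not from the record.
def build_prediction_input(first_words: str, context: dict[str, str]) -> str:
    """Combine contextual elements with the transcribed speech so the
    predicted second words reflect the user's context."""
    context_lines = "\n".join(f"{key}: {value}" for key, value in context.items())
    return f"{context_lines}\nSpeech so far: {first_words}\nLikely next words:"

print(build_prediction_input("I can't believe we", {"emotional_state": "happy"}))
```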
Regarding claim 6, Skov as modified by Strietzel and Beith, with the same motivation from claim 5, discloses the emotional state information element indicative of the predicted emotional state of the user (Beith, [0095], “detecting emotion associated with the meanings of words, phrases, and sentences of the user's speech 258, the audio unit 222 can include one or more machine learning models that are configured to detect audible emotions, such as happy, sad, angry, playful, romantic, serious, frustrated, etc., based on the speaking characteristics of the user”; in addition, [0095], “process the audio data 204 to predict the emotion 270”); determining one or more facial movements indicative of the predicted emotional state of the user (Beith, [0095], “the adjusted face data 134 causes the avatar facial expression 156 to represent the emotion 270 (e.g., smiling to express happiness, eyes narrowed to express anger, eyes widened to express surprise, etc.)”); and, based on the predicted emotional state of the user, generating a first portion of the facial animation information (Beith, [0095], cited above; eyes narrowed to express anger or eyes widened to express surprise are considered a first portion of the facial animation information), wherein the first portion of the facial animation information is descriptive of a first portion of the facial animation that animates a three-dimensional representation of an upper facial region of a face of the user performing the one or more facial movements (Beith, [0095], cited above; the eyes are considered an upper facial region of the face; in addition, [0089], “the face data generator 230 includes a three-dimensional morphable model (3DMM) encoder configured to input the image data 208 and generate the face data 132 as a rough mesh representation of the user's face”).

Regarding claim 7, Skov discloses generating the one or more second words (Skov, [0059], “The prediction may be of one or more words that may follow the last word in the transcription”). Skov as modified by Strietzel, with the same motivation from claim 1, discloses, based on the sequence of visemes, generating, by the computing system, a second portion of the facial animation information descriptive of a second portion of the facial animation that animates the three-dimensional representation of the mouth of the user forming the sequence of visemes to speak the one or more second words (Strietzel, [0030], “each of the plurality of visemes being indicative of an animated mouth movement of the individualized 3D head model”; the animated mouth is considered the second portion of the facial animation).
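Claims 6-7 split the animation into an emotion-driven upper-face portion and a viseme-driven mouth portion. A hypothetical sketch of how the two portions might be assembled (all parameter names are invented for the example):

```python
# Hypothetical sketch of claims 6-7 as the rejection reads them: an
# upper-face portion driven by the predicted emotional state (Beith [0095])
# plus a mouth portion driven by the viseme sequence (Strietzel [0030]).
EMOTION_TO_UPPER_FACE = {
    "angry": {"brow_lower": 0.8, "eye_narrow": 0.7},
    "surprised": {"brow_raise": 0.9, "eye_widen": 0.8},
    "happy": {"brow_raise": 0.3, "eye_smile": 0.6},
}

def facial_animation_info(emotion: str, visemes: list[str]) -> dict:
    """Assemble the two portions of the facial animation information."""
    return {
        "upper_face": EMOTION_TO_UPPER_FACE.get(emotion, {}),  # first portion
        "mouth": [{"viseme": v} for v in visemes],             # second portion
    }

print(facial_animation_info("surprised", ["viseme_open", "viseme_round"]))
```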
Regarding claim 8, Skov as modified by Strietzel and Beith, with the same motivation from claim 5, discloses processing the speech information with a machine-learned sentiment analysis model to obtain the emotional state information element, wherein the machine-learned sentiment analysis model is trained to evaluate a tone of the user (Beith, [0095], “detecting emotion associated with the meanings of words, phrases, and sentences of the user's speech 258, the audio unit 222 can include one or more machine learning models that are configured to detect audible emotions, such as happy, sad, angry, playful, romantic, serious, frustrated, etc., based on the speaking characteristics of the user 108 (e.g., based on tone, pitch, cadence, volume, etc.)”).

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Skov (US 2025/0218440) in view of Strietzel et al. (US 2009/0132371) in view of Beith et al. (US 2024/0078731), as applied to claim 5, in view of Coccaro et al. (US 2014/0278379) in further view of Kadam et al. (US 2022/0171939).

Regarding claim 14, Skov as modified by Strietzel and Beith does not expressly disclose “geographic context information element descriptive of the geographic area that the user is associated with.” Coccaro et al. (hereinafter Coccaro) discloses a geographic context information element descriptive of a geographic area that the user is associated with (Coccaro, [0048], “demographic information associated with the speaker (e.g., language, age, gender, geographic location)”). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Coccaro's determination of the user's geographic location to predict the following words in Skov. The motivation for doing so would have been enabling the prediction of words that are more relevant to the user's environment.

Skov as modified by Strietzel and Coccaro does not expressly disclose “identifying, by the computing system, a synonym for a particular word of the one or more second words, wherein the synonym is associated with the geographic area that the user is associated with, and wherein the particular word is associated with a second geographic area different than the first geographic area.” Kadam et al. (hereinafter Kadam) discloses identifying a synonym for a particular word, wherein the synonym is associated with a geographic area, and wherein the particular word is associated with a second geographic area different than the geographic area (Kadam, [0038], “For example, couple entity-meaning “chum-befriend” has a global context for the entity (which means that throughout the English speaking world “chum” means “befriend”) and couple entity-meaning “chum-menstruate” has a local context for the entity defined as India (which means that, in India, “chum” may mean “menstruate”)”; Fig. 1 illustrates that the USA and India represent different geographic areas), and replacing, by the computing system, the particular word with the synonym (Kadam, [0026], “The output content item 110 corresponds to the input content item 102 in which some entities are replaced each with its respective meaning that is used in the selected interpretation”). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to apply Kadam's method of geography-based word replacement to modify the predicted words of Skov as modified by Strietzel and Coccaro. The motivation for doing so would have been ensuring that words are appropriate for the user's local culture.
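The claim 14 technique mapped to Kadam amounts to a region-keyed substitution over the predicted words. A hypothetical sketch built around Kadam's own "chum" example; the replacement word and the table layout are invented for illustration:

```python
# Hypothetical sketch of region-aware synonym replacement over the
# predicted words, modeled on Kadam's "chum" example ([0038]).
REGIONAL_SYNONYMS = {
    # (predicted word, user's geographic area) -> locally appropriate synonym
    ("chum", "IN"): "friend",  # "chum" carries a different local meaning in India
}

def localize_words(words: list[str], area: str) -> list[str]:
    """Replace words tied to another region with area-appropriate synonyms."""
    return [REGIONAL_SYNONYMS.get((word, area), word) for word in words]

print(localize_words(["my", "chum", "called"], area="IN"))
```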
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Skov (US 2025/0218440) in view of Strietzel et al. (US 2009/0132371), as applied to claim 5, in further view of Weisz et al. (US 2025/0329317).

Regarding claim 16, Skov discloses receiving, by the computing system, streaming audio data from the user device, wherein the streaming audio data comprises audio of the user speaking the one or more first words (Skov, [0018], “the device 110 may be configured to obtain audio of the user 112. The audio may be part of a video format or only audio. The audio may include speech of the user 112”), and processing, by the computing system, the streaming audio data (Skov, [0029], “an artificial intelligence system that has been trained on a large amount of textual data to understand and generate human-like language prompts and responses”; in addition, [0030], “the prediction system 130 may generate the prediction based on the word or words mostly likely to follow the words previously spoken”). Skov as modified by Strietzel does not expressly disclose “a machine-learned speech recognition model to obtain a speech-to-text output comprising the speech information.” Weisz et al. (hereinafter Weisz) discloses a machine-learned speech recognition model to obtain a speech-to-text output comprising the speech information (Weisz, [0025], “the user input engine 111 can process, using automatic speech recognition (ASR) model(s) stored in the ML model(s) database 180…audio data that capture the spoken utterance and that is generated by microphone(s) of the client device 110 to generate ASR output. The ASR output can include, for example, speech hypotheses (e.g., term hypotheses and/or transcription hypotheses) that are predicted to correspond to the spoken utterance captured in the audio data…the user input engine 111 can select one or more of the speech hypotheses as recognized text that corresponds to the spoken utterance”). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to predict the words of Skov by incorporating the speech recognition model of Weisz. The motivation for doing so would have been enabling the generation of a high-quality transcription that improves the accuracy of the predicted next words.
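Claim 16's front end (streaming audio through an ASR model into speech information for the predictor) can be sketched with any off-the-shelf recognizer. The SpeechRecognition library and its Google Web Speech backend below are stand-ins, not from the record:

```python
# Illustrative sketch of the claim 16 front end: run captured audio through
# a machine-learned speech recognition model to obtain the speech-to-text
# output. Library and backend are assumptions standing in for Weisz's ASR.
import speech_recognition as sr

recognizer = sr.Recognizer()

def speech_information_from_microphone() -> str:
    """Capture the user speaking the first words and return recognized text."""
    with sr.Microphone() as source:            # streaming audio data
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)  # ASR -> speech information

# first_words = speech_information_from_microphone()
# second_words = predict_second_words(first_words)  # see the earlier sketch
```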
Allowable Subject Matter

Claims 9-13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KYLE ZHAI, whose telephone number is (571) 270-3740. The examiner can normally be reached 9AM-5PM. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Ke Xiao, can be reached at (571) 272-7776. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KYLE ZHAI/
Primary Examiner, Art Unit 2611

Prosecution Timeline

May 06, 2024
Application Filed
Nov 26, 2025
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology:

Patent 12602879: METHOD AND DEVICE FOR PROVIDING SURGICAL GUIDE USING AUGMENTED REALITY (granted Apr 14, 2026; 2y 5m to grant)
Patent 12594123: VIRTUAL REALITY SYSTEM WITH CUSTOMIZABLE OPERATION ROOM (granted Apr 07, 2026; 2y 5m to grant)
Patent 12590811: METHOD, APPARATUS, AND PROGRAM FOR PROVIDING IMAGE-BASED DRIVING ASSISTANCE GUIDANCE IN WEARABLE HELMET (granted Mar 31, 2026; 2y 5m to grant)
Patent 12573162: MODELLING METHOD FOR MAKING A VIRTUAL MODEL OF A USER'S HEAD (granted Mar 10, 2026; 2y 5m to grant)
Patent 12566580: HOLOGRAPHIC PROJECTION SYSTEM, METHOD FOR PROCESSING HOLOGRAPHIC PROJECTION IMAGE, AND RELATED APPARATUS (granted Mar 03, 2026; 2y 5m to grant)

Study what changed to get past this examiner; based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 75%
With Interview: 93% (+18.6%)
Median Time to Grant: 3y 0m
PTA Risk: Low

Based on 473 resolved cases by this examiner; grant probability derived from the career allow rate.
