Prosecution Insights
Last updated: April 19, 2026
Application No. 18/633,750

SYSTEM AND METHOD FOR GENERATING VIDEOS DEPICTING VIRTUAL CHARACTERS

Non-Final OA §103
Filed: Apr 12, 2024
Examiner: LE, MICHAEL
Art Unit: 2614
Tech Center: 2600 — Communications
Assignee: UNIVERSITY OF ROCHESTER
OA Round: 1 (Non-Final)
Grant Probability: 66% (Favorable)
OA Rounds: 1-2
To Grant: 3y 3m
With Interview: 88%

Examiner Intelligence

Career Allow Rate: 66% — above average (568 granted / 864 resolved; +3.7% vs TC avg)
Interview Lift: +22.1% (strong; allow rate with vs. without an interview, among resolved cases with an interview)
Avg Prosecution: 3y 3m typical timeline; 61 applications currently pending
Total Applications: 925 across all art units (career history)
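
The headline figures above are mutually consistent, and the derivation is simple arithmetic. A minimal sketch (Python; the variable names and the additive-lift assumption are ours for illustration, not the tool's documented methodology):

# Illustrative only: names and the additive-lift assumption are not from the tool.
granted, resolved = 568, 864
career_allow_rate = granted / resolved                  # 0.657 -> shown as 66%
interview_lift = 0.221                                  # +22.1 percentage points
with_interview = career_allow_rate + interview_lift     # 0.878 -> shown as 88%
print(f"base {career_allow_rate:.1%}, with interview {with_interview:.1%}")

Read this way, the lift is applied as an absolute (percentage-point) addition, which matches the 88% "With Interview" figure in the projections below; a relative lift would give a different number.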

Statute-Specific Performance

§101: 12.4% (-27.6% vs TC avg)
§103: 52.7% (+12.7% vs TC avg)
§102: 13.4% (-26.6% vs TC avg)
§112: 15.9% (-24.1% vs TC avg)
Tech Center averages are estimates • Based on career data from 864 resolved cases
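
One sanity check worth noting: subtracting each reported delta from the corresponding examiner rate gives 40.0% in every row, so the Tech Center average estimate appears to be a single flat baseline rather than a per-statute figure. A short check (Python; illustrative only):

# Illustrative check: assumes each delta is a simple difference from the TC estimate.
examiner = {"§101": 12.4, "§103": 52.7, "§102": 13.4, "§112": 15.9}    # percent
delta    = {"§101": -27.6, "§103": 12.7, "§102": -26.6, "§112": -24.1}  # vs TC avg
implied_tc_avg = {s: round(examiner[s] - delta[s], 1) for s in examiner}
print(implied_tc_avg)  # {'§101': 40.0, '§103': 40.0, '§102': 40.0, '§112': 40.0}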

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Information Disclosure Statement

2. The information disclosure statements (IDS) submitted on the following dates are in compliance with the provisions of 37 CFR 1.97 and are being considered by the Examiner: 04/12/2024.

Claim Rejections - 35 USC § 103

3. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

4. Claims 1-3, 7 and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Bai et al., [machine translation of CN-115225829-A with citation below, hereinafter “Bai”] in view of “Eye blinks are perceived as communicative signals in human face-to-face interaction” by Paul Homke (“Homke”).

Regarding claim 1, Bai discloses a computer-implemented method (Bai- ¶0001, at least discloses field of human-computer interaction, and more particularly to a video generation method and apparatus, and a computer-readable storage medium) comprising: accessing, by a processor (Bai- Fig. 12 shows a processor 1201; ¶0048, at least discloses A processor is configured to execute executable instructions stored in the memory, and when the executable instructions are executed, the processor executes the video generation method), a first video depicting a first subject, wherein the first video includes an audio component that corresponds to speech spoken by the first subject (Bai- Fig.
1 and ¶0067, at least disclose The terminal can extract features from the audio and video of the real object through the speaker encoder (equivalent to an encoder) [first video depicting a first subject, wherein the first video includes an audio component that corresponds to speech spoken by the first subject], and input the extracted features, attitude (equivalent to the first feature), and reference image (equivalent to a preset standard image) into the listener decoder (equivalent to the virtual prediction network) for prediction, thereby generating the listener's head movement and facial expression changes arranged in timeline, thus obtaining a virtual object video sequence; Fig. 2 and ¶0069, at least disclose S101. Collect audio and video sequences of real objects; Fig. 3a, 3b, 3c and ¶0072, at least disclose Figures 3a, 3b, and 3c are respectively an optional speaker video generation schematic diagram 1, a speaker video generation schematic diagram 2, and a speaker video generation schematic diagram 3 […] As shown in Figure 3a, the speaker video generation task includes generating the speaker's body posture; as shown in Figure 3b, the speaker video generation task includes generating the speaker's lip movements; and as shown in Figure 3c, the speaker video generation task includes generating the movement of the speaker's head (including the face); Fig. 4 and ¶0081, at least disclose S1021. Extract features from the video sequence of the real object using an encoder to obtain multiple video features); accessing, by the processor (As discussed above), an image depicting a second subject (Bai- Fig. 1 and ¶0067, at least disclose input the extracted features, attitude (equivalent to the first feature), and reference image (equivalent to a preset standard image) into the listener decoder [image depicting a second subject] (equivalent to the virtual prediction network) for prediction, thereby generating the listener's head movement and facial expression changes arranged in timeline, thus obtaining a virtual object video sequence; ¶0070-0071, at least disclose It requires the listener [second subject] to focus entirely on what a person is saying, listen carefully, and at the same time show some visual response to the speaker. These responses can provide the speaker with information about whether the listeners are interested, understand, and agree with the content of the speech, thus adjusting the pace and progress of the conversation and facilitating smooth communication. For active listening, there are common visual patterns when listeners express their opinions; ¶0107, at least discloses Where is the listener's reference image (equivalent to a standard image), e is the listener's attitude, and the entirety of the generated listener videos can be represented a); providing, by the processor (As discussed above), the first video and the image to one or more machine learning models (Bai- Fig. 
10 and ¶0200, at least disclose the terminal acquires the speaker video (equivalent to a video sequence of a real object), preprocesses the speaker video through a speaker encoder (equivalent to an encoder) to obtain continuous video frames [the first video and the image] […] The terminal acquires the speaker audio (equivalent to an audio sequence of a real object), processes the speaker audio through an encoder to obtain audio at multiple moments […] The listener decoder (equivalent to a virtual prediction network) in the terminal acquires the listener's reference image (equivalent to a standard image) [the image], and extracts features from the reference image through a face reconstruction model to obtain identity identifier, material, lighting, expression, and posture (expression and posture are shown in Figure 10) […] The first pose and expression features are input into the Long Short-Term Memory network encoder [machine learning models] (equivalent to the first processing module) […] The first identity feature is shared to the decoder. The decoder (equivalent to the second processing module) fuses each frame of pose and expression features in the multi-frame pose and expression features with the first identity feature to obtain a virtual object video sequence; ¶0204, at least discloses the predicted facial expression, represents the predicted pose (rotation and translation); Dm and LSTM are both components in the listener decoder used to generate multi-frame pose and facial expression features; represents the fused features; ht represents the predicted video frame; and ct represents the stored predicted video frame); generating, by the processor and using the one or more machine learning models (As discussed above), a second video depicting the second subject, wherein the second video depicts the second subject performing a blinking motion (Bai- Fig. 1 and ¶0067, at least disclose the listener decoder (equivalent to the virtual prediction network) for prediction, thereby generating the listener's head movement and facial expression changes [second subject performing a motion] arranged in timeline, thus obtaining a virtual object video sequence [a second video depicting the second subject]; ¶0071, at least discloses For active listening, there are common visual patterns when listeners express their opinions. For example, symmetrical and cyclical movements are used to indicate "yes," "no," or similar signals […] In face-to-face human interactions, even the blink of an eye can be considered a communication signal. Therefore, generating virtual objects that can hear video sequences based on audio and video sequences is of great significance), and wherein the motion performed by the second subject is responsive to at least one of the speech spoken by the first subject, a facial expression of the first subject, and a head pose motion of the first subject (Bai- Fig. 
1 and ¶0067, at least disclose The terminal can extract features from the audio and video of the real object [the speech spoken by the first subject] through the speaker encoder (equivalent to an encoder), and input the extracted features, attitude (equivalent to the first feature), and reference image (equivalent to a preset standard image) into the listener decoder (equivalent to the virtual prediction network) for prediction, thereby generating the listener's head movement and facial expression changes [motion performed by the second subject] arranged in timeline, thus obtaining a virtual object video sequence; ¶0070-0071, at least disclose It requires the listener [second subject] to focus entirely on what a person is saying, listen carefully, and at the same time show some visual response to the speaker [the second subject is responsive to at least one of the speech spoken by the first subject]; ¶0083-0086, at least disclose the terminal can extract features from each video frame of the video sequence of the real object using a face reconstruction model to obtain multiple video frame features; and use all the second pose expression features corresponding to the video sequence of the real object as video features […] S10211. Feature extraction is performed on each video frame of the video sequence of the real object using a face reconstruction model to obtain multiple video frame features […] the video frame features are obtained by recording the head rotation, facial expressions, and various shooting factors of the person in each video frame of the video sequence. Video frame features include secondary identity features and secondary pose and facial expression features […] The second type of postural facial features are obtained by recording the changes in a real person's head movements and facial expressions while they are speaking [the speech spoken by the first subject, a facial expression of the first subject, and a head pose motion of the first subject]); and storing, by the processor (As discussed above), the second video on a storage device (Bai- ¶0202-0204, at least disclose the listener decoder is used to decode the predicted video frame into a vector containing two feature vectors: representing the expression and representing the pose (rotation and translation) […] the predicted facial expression, represents the predicted pose (rotation and translation); Dm and LSTM are both components in the listener decoder used to generate multi-frame pose and facial expression features; represents the fused features; ht represents the predicted video frame; and ct represents the stored predicted video frame)). Bai does not explicitly disclose, but Homke discloses the blinking motion performed by the second subject (Homke- page 1, 3rd - 4th paragraphs, at least disclose experimentally testing whether listener blink behavior has any measurable effect on speakers’ speech production […] a novel experimental paradigm using Virtual Reality technology enabling us to selectively manipulate blink duration in a virtual listener […] The nods accompanied the avatar’s blinking behavior to mimic the typical natural environment of blinks that occur in feedback slots in conversation; page 2, 4th paragraph, at least discloses The present study demonstrates, for the first time, a sensitivity of speakers to listener blink behavior as a communicative signal in interactive face-to-face communication; page 3, section Supporting information, at least discloses S1 Video. Long listener blink. 
Example of a long listener blink as used in face-to-face conversation […] S2 Video. Example of a trial (short blink). Example of a trial in the nod with short blink condition, including the avatar’s question, the avatar’s nods with short blinks during the participant’s answer, and the avatar’s response following answer completion […] S3 Video. Example of a trial (long blink). Example of a trial in the nod with long blink condition, including the avatar’s question, the avatar’s nods with long blinks during the participant’s answer, and the avatar’s response following answer completion; page 5, section Measures, Questionnaires, at least discloses we used a questionnaire assessing any explicit awareness of the different feedback types, that is, whether participants had noticed nodding and/or blinking in the virtual listeners at all, and if so, if they had noticed any variation in these behaviors across conditions). It would have been obvious to one of ordinary in the art before the effective filing date of the claimed invention to have modified Bai to incorporate the teachings of Homke, and apply the avatar’s blinking behavior into the Bai’s teachings for generating, by the processor and using the one or more machine learning models, a second video depicting the second subject, wherein the second video depicts the second subject performing a blinking motion, and wherein the blinking motion performed by the second subject is responsive to at least one of the speech spoken by the first subject, a facial expression of the first subject, and a head pose motion of the first subject. Doing so would provide a sensitivity of speakers to listener blink behavior as a communicative signal in interactive face-to-face communication. Regarding claim 2, Bai in view of Homke, discloses the computer-implemented method of claim 1, and further discloses wherein generating the second video comprises: generating, by the processor and based on the first video (see Claim 1 rejection for detailed analysis), a plurality of feature vectors representing visual features and speech features of the first subject (Bai- ¶0050, at least discloses acquiring an audio-visual sequence of a real object; extracting features from the audio-visual sequence [speech features] to determine anthropomorphic features; ¶0082, at least discloses the video features are those obtained by recording the head rotation and facial expression changes that occur during human communication. Video features are the features obtained after extracting features from a video sequence; video features include pose and facial expression; ¶0202, at least discloses the listener decoder is used to decode the predicted video frame into a vector containing two feature vectors: representing the expression and representing the pose (rotation and translation)). 
Regarding claim 3, Bai in view of Homke, discloses the computer-implemented method of claim 2, and further discloses wherein generating the second video further comprises: generating, by the processor and based on the plurality of feature vectors (see Claim 2 rejection for detailed analysis), an emotion vector representing one or more emotional characteristics of the first subject (Bai- ¶0018, at least discloses Based on the first anthropomorphic feature, the first posture and expression feature, and the first feature in the multi-frame anthropomorphic features, the first processing module performs prediction to obtain the next predicted video frame; the first feature is one of a positive attitude, a negative attitude, and a neutral attitude [emotional characteristics]; ¶0202, at least discloses the listener decoder is used to decode the predicted video frame into a vector containing two feature vectors: representing the expression and representing the pose (rotation and translation)). Regarding claim 7, Bai in view of Homke, discloses the computer-implemented method of claim 1, and discloses the method further comprising: retrieving, by the processor (As discussed above), the second video from the storage device (Bai- ¶0107, at least discloses the listener's reference image (equivalent to a standard image), e is the listener's attitude, and the entirety of the generated listener videos can be represented as; ¶0171, at least discloses Given a sequence of generated virtual object videos, volunteers are required to determine its emotion (positive, negative, natural)); and displaying, by the processor (As discussed above), the second video on a display (Bai- ¶0168, at least discloses The virtual object with a positive attitude smiles in frames 6-14, while the listener with a negative attitude frowns and displays a negative mouth shape throughout. Virtual objects with negative attitudes exhibit minimal changes in movement and have wandering eyes, while neutral virtual objects maintain a relatively calm expression, accompanied by regular head movements). Regarding claims 16-18, all claim limitations are set forth as claims 1-3 in One or more non-transitory computer-readable media storing computer-readable instructions that, when executed by a processing system comprising a processor, and rejected as per discussion for claim 1-3. Regarding claim 16, Bai in view of Homke, discloses One or more non-transitory computer-readable media storing computer-readable instructions that, when executed by a processing system comprising a processor (Bai- ¶0228-0229, at least disclose the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code […] each block of a flowchart and/or block diagram, as well as combinations of blocks in a flowchart and/or block diagram, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flowcharts and/or one or more block diagrams.), cause a system to perform operations comprising the method of claim 1. 5. 
Claims 4 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Bai in view of Homke, further in view of Datta et al. ("Datta") [US-12,254,548-B1] Regarding claim 4, Bai in view of Homke, discloses the computer-implemented method of claim 3, and further discloses wherein generating the second video further comprises: generating, by the processor and based on the plurality of feature vectors and the emotion vector (see Claim 3 rejection for detailed analysis)one or more motion characteristics of the second subject (Bai- ¶0018, at least discloses the first feature is one of a positive attitude, a negative attitude, and a neutral attitude [emotional characteristics]; ¶0152, at least discloses The facial images of the real listeners were collected under different primary features, which included positive attitude, negative attitude, and neutral attitude [motion characteristics of the second subject]; ¶0202, at least discloses the listener decoder is used to decode the predicted video frame into a vector containing two feature vectors: representing the expression and representing the pose (rotation and translation)). The prior art does not explicitly disclose, but Datta discloses a discrete latent space, the discrete latent space representing one or more motion characteristics (Datta- col 47, lines 8-15, at least discloses To represent the manifold of realistic listener facial motion, the encoder component 1220 may be implemented as a vector quantized variational auto-encoder (VQ-VAE). For example, the system 100 may extend the VQ-VAE to the domain of motion synthesis and learn a codebook of a discrete latent space (e.g., codebook data). This discrete representation enables the system 100 to predict a multinomial distribution over a next timestep of motion data). It would have been obvious to one of ordinary in the art before the effective filing date of the claimed invention to have modified Bai/Homke to incorporate the teachings of Datta, and apply the discrete latent space into the Bai/Homke’s teachings for generating, by the processor and based on the plurality of feature vectors and the emotion vector, a discrete latent space, the discrete latent space representing one or more motion characteristics of the second subject. Doing so would improve a user experience and/or an interaction with the user, a system may be configured to enable listener animation to mimic listening behavior for a virtual assistant, virtual avatar, and/or the like. Regarding claim 19, all claim limitations are set forth as claim 4 in One or more non-transitory computer-readable media storing computer-readable instructions that, when executed by a processing system comprising a processor, and rejected as per discussion for claim 4. 6. Claims 5-6 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Bai in view of Homke, further in view of Datta, still further in view of Song ("Song") [CN-113986015-B], cited paragraphs refer to the English version of Song, ("Song") [US-2025/0014296-A1]) Regarding claim 5, Bai in view of Homke and Datta, discloses the computer-implemented method of claim 4, and further discloses wherein generating the second video further comprises: generating, by the processor and based on the discrete latent space (see Claim 4 rejection for detailed analysis), blinking performed by the first subject (Homke- page 3, 3rd paragraph, at least discloses Speaking behavior, like any other social behavior, varies from individual to individual. 
One particular individual difference measure of dispositional social sensitivity—the Empathy Quotient [22]—may modulate the perception of eye blinks; page 7, section Discussion, at least discloses Speakers produced shorter answers when talking to a listener providing feedback in the form of nods with long blinks instead of short blinks). The prior art does not explicitly disclose, but Song discloses a sequence of blink coefficients representing blinking performed by the first subject (Song- ¶0075, at least discloses the blink degree can be used to reflect the change in the postures of the eyes, and the blink degree can be quantified by, for example, the blink coefficient B. FIG. 2 is a schematic diagram of an eye posture provided by the present disclosure, FIG. 3 is a schematic diagram of another eye posture provided by the present disclosure, FIG. 4 is a schematic diagram of yet another eye posture provided by the present disclosure, when the eyes are fully open, the blink coefficient B=1, and at this time, the postures of the eyes are as shown in FIG. 2 ; when the eyes are half-open, the blink coefficient B=0.5, at this time, the postures of the eyes are as shown in FIG. 3 ; when the user closes his eyes, the blink coefficient B=0, at this time, the postures of the eyes are as shown in FIG. 4; ¶0134, at least discloses the first posture change parameter may be a blink coefficient B in the current frame. For example, the blink coefficient B may be determined based on difference between key point coordinates Vup of the upper eyelid and key point coordinate Vdown of the lower eyelid in the user's three-dimensional face vertex data; ¶0151, at least discloses The normalization parameter S is a preset parameter. The smaller value between |Vup−Vdown|/S and 1 is the blink coefficient B, generally, the larger the eyes are, the larger the value of the normalization parameter S is, so that in a state that the eyes are incompletely open, the blink coefficient B should be less than 1 as much as possible to ensure that the value of the blink coefficient B is relatively close to the real eye posture, in this way, the value of the blink coefficient B ranges from 0 to 1, which can achieve the purpose of normalizing the blink coefficient, it can determine more accurate blink coefficients for eyes of different sizes, thereby making the virtual prop to better fit the target object and improving the display effect of the virtual prop). It would have been obvious to one of ordinary in the art before the effective filing date of the claimed invention to have modified Bai/Homke/Datta to incorporate the teachings of Song, and apply the blink coefficient into the Bai/Homke/Datta’s teachings for generating, by the processor and based on the discrete latent space, a sequence of blink coefficients representing blinking performed by the first subject. Doing so would improve the display effects of virtual props. Regarding claim 6, Bai in view of Homke, Datta and Song, discloses the computer-implemented method of claim 5, and further discloses wherein generating the second video further comprises: generating, by the processor and based on the image (see Claim 1 rejection for detailed analysis), a mesh of the second subject (Datta- Figs. 
17-18 show examples of generating a mesh model representing a facial animation; col 43, lines 53-58, at least discloses The third facial landmarks correspond to fourth facial landmarks 1040 associated with a second mesh model representing the second user, enabling the system 100 to mirror facial expressions, facial motion, and/or other nonverbal gestures generated by the second user.), and the sequence of blink coefficients (see Claim 5 rejection for detailed analysis), the second video (Bai- Fig. 1 and ¶0067, at least disclose the listener decoder (equivalent to the virtual prediction network) for prediction, thereby generating the listener's head movement and facial expression changes arranged in timeline, thus obtaining a virtual object video sequence [second video]). Regarding claims 20, all claim limitations are set forth as claims 5-6 in One or more non-transitory computer-readable media storing computer-readable instructions that, when executed by a processing system comprising a processor, and rejected as per discussion for claim 5-6. 7. Claims 8-13 are rejected under 35 U.S.C. 103 as being unpatentable over Bai et al., [machine translation of CN-115225829-A with citation below, hereinafter “Bai”] in view of Saito et al. (“Saito”) [US-2023/0260182-A1] Regarding claim 8, Bai discloses a computer-implemented method (Bai- ¶0001, at least discloses field of human-computer interaction, and more particularly to a video generation method and apparatus, and a computer-readable storage medium) comprising: accessing, by a processor (Bai- Fig. 12 shows a processor 1201; ¶0048, at least discloses A processor is configured to execute executable instructions stored in the memory, and when the executable instructions are executed, the processor executes the video generation method), plurality of videos and a plurality of images (Bai- ¶0032, at least discloses Collect audio and video sequence samples of real people who are confiding in others, and their corresponding facial images of the real people listening to them [plurality of videos and a plurality of images]; Fig. 
1 and ¶0067, at least disclose The terminal can extract features from the audio and video of the real object through the speaker encoder (equivalent to an encoder), and input the extracted features, attitude (equivalent to the first feature), and reference image (equivalent to a preset standard image) into the listener decoder (equivalent to the virtual prediction network) for prediction, thereby generating the listener's head movement and facial expression changes arranged in timeline, thus obtaining a virtual object video sequence; ¶0075, at least discloses the present invention is applied to occasions requiring human-computer interaction, such as intelligent consultation devices in shopping malls, which can generate corresponding videos based on videos shown by shoppers to guide them; ¶0107, at least discloses Where is the listener's reference image (equivalent to a standard image), e is the listener's attitude, and the entirety of the generated listener videos can be represented as; ¶0171, at least discloses Given a sequence of generated virtual object videos, volunteers are required to determine its emotion (positive, negative, natural); ¶0178, at least discloses As shown in Table 2, for each attitude, the mean and variance of the classification accuracy of all volunteers are calculated, and the model can generate videos with a specified attitude to a certain extent.); generating, by the processor (As discussed above), a first feature vector from at least one video of the plurality of videos (Bai- ¶0202, at least discloses the listener decoder is used to decode the predicted video frame into a vector containing two feature vectors: representing the expression and representing the pose (rotation and translation)), the first feature vector representing one or more visual features of the at least one video (Bai- ¶0202, at least discloses the listener decoder is used to decode the predicted video frame into a vector containing two feature vectors: representing the expression [one or more visual features] and representing the pose (rotation and translation)); generating, by the processor (As discussed above), a second feature vector from the at least one video (Bai- ¶0202, at least discloses the listener decoder is used to decode the predicted video frame into a vector containing two feature vectors: representing the expression and representing the pose (rotation and translation)), the second feature vector representing one or more audio features of the at least one video (Bai- ¶0070, at least discloses For active listening, there are common visual patterns when listeners express their opinions. For example, symmetrical and cyclical movements are used to indicate "yes," "no," or similar signals; small linear movements are used in conjunction with emphasized syllables in the other person's speech; and larger linear movements often occur during pauses in the other person's speech [audio features]. In face-to-face human interactions, even the blink of an eye can be considered a communication signal. Therefore, generating virtual objects that can hear video sequences based on audio and video sequences is of great significance; ¶0094, at least discloses audio features [audio features] refer to certain characteristics that accompany a speaker's speech during human communication. 
Audio features [audio features] are features obtained after feature extraction from an audio sequence); combining, by the processor (As discussed above), the first feature vector with the second feature vector (Bai- ¶0202, at least discloses the listener decoder is used to decode the predicted video frame into a vector containing two feature vectors); one or more motion characteristics of a subject (Bai- Fig. 1 and ¶0067, at least disclose); generating, by the processor (As discussed above), the avatar comprising a sequence of frames depicting the subject and an emotional reaction of the subject (Bai- Fig. 1 and ¶0067, at least disclose the terminal includes an encoder, a virtual prediction network, and a virtual human interface (not shown in the figure). The terminal can extract features from the audio and video of the real object through the speaker encoder (equivalent to an encoder), and input the extracted features, attitude (equivalent to the first feature), and reference image (equivalent to a preset standard image) into the listener decoder (equivalent to the virtual prediction network) for prediction, thereby generating the listener's head movement and facial expression changes arranged in timeline, thus obtaining a virtual object video sequence; ¶0166, at least discloses Figures 8a and 8b show the video sequence results of a virtual object generated by a video generation method according to an embodiment of the present invention, which are respectively Figure 1 and Figure 2 of the video sequence results of a virtual object generated by a video generation method […] As shown in Figures 8a and 8b, the horizontal axis represents continuous video frames [sequence of frames], including frames 0-32. Figure 8a shows the results of testing the video sequence results of the generated virtual object within the domain (meaning that the training data includes the speaker or listener in the test dataset), and Figure 8b shows the results of testing the video sequence results of the generated virtual object outside the domain (meaning that the face data of the speaker and listener have never appeared in the training set, mainly testing the model's generalization ability on unseen faces); ¶0171, at least discloses Given a sequence of generated virtual object videos, volunteers are required to determine its emotion (positive, negative, natural) [emotional reaction]). Bai does not explicitly disclose the combination of the first feature vector and the second feature vector representing a continuous latent space for the at least one video; mapping, by the processor, the continuous latent space to a discrete latent space, the discrete latent space representing one or more motion characteristics of a subject; decoding, by the processor, the discrete latent space into a plurality of coefficients; and generating, by the processor, an avatar based on the plurality of coefficients. However, Saito discloses a continuous latent space for the at least one video (Saito- ¶0016, at least discloses the human motion generation system utilizes a neural network encoder including convolutional layers or transformer layers to generate a sequence of latent feature representations in a continuous latent space based on the digital scene; ¶0093, at least discloses generating, utilizing a plurality of convolutional neural network layers of the encoder, the sequence of latent feature representations in a continuous latent space. 
Act 1002 can involve generating, utilizing a plurality of transformer neural network layers of the encoder, the sequence of latent feature representations in a continuous latent space); mapping, by the processor (Saito- ¶0085, at least discloses one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices), the continuous latent space to a discrete latent space, the discrete latent space representing one or more motion characteristics of a subject (Saito- ¶0044, at least discloses the latent feature representations include hidden feature vectors representing attributes of the human motions mapped to a continuous latent feature space; ¶0074, at least discloses comparing the discretized latent space to a continuous latent space, experimenters trained a variational autoencoder (“VAE”) model with a reconstruction loss and a KL divergence loss on the prior; ¶0096-0097, at least disclose converting, utilizing the codebook of the discretized motion model, the sequence of latent feature representations into a sequence of discretized feature representations by mapping the sequence latent feature representations to a plurality of learned latent feature representations corresponding to human motions according to the plurality of sampling probabilities); decoding, by the processor (As discussed above), the discrete latent space into a plurality of coefficients (Saito- ¶0043, at least discloses the human motion generation system 102 learns a discrete latent space (e.g., latent feature representations within the discrete latent space) utilizing an encoder-decoder neural network architecture. For example, the human motion generation system 102 utilizes unsupervised learning to learn a discrete latent space by reconstructing human motion sequences from a digital scene; ¶0059-0061, at least disclose the human motion generation system 102 utilizes a discretized motion model (e.g., a discrete variational autoencoder) that includes a discrete latent space codebook and three blocks: 1) an encoder, 2) a discrete sampler, and 3) a decoder […] the discrete sampler includes a Gumbel-softmax function. The sampling probabilities allow the human motion generation system 102 to sample the latent code z from the codebook E as z= ({tilde over (z)})·E. The human motion generation system 102 feeds the latent code z to the decoder ϕ, with weights ϕ, to obtain the reconstructed human motion sequence {tilde over (x)} as {tilde over (x)}= ϕ(z).); and generating, by the processor (As discussed above), an avatar based on the plurality of coefficients (Saito- Fig. 4 shows images with avatars; ¶0047, at least discloses the discretized motion model utilizes the discretized features to generate a reconstructed human motion sequence 402; ¶0058, at least discloses the human motion generation system 102 generates a three-dimensional model based on the reconstructed human motion sequence 402 including a sequence of three-dimensional objects in a three-dimensional environment […] the human motion generation system 102 generates a digital video including the reconstructed human motion sequence 402. In some embodiments, the human motion generation system 102 utilizes the reconstructed human motion sequence 402 to generate a neural network-based motion graph with discrete motions mapped to a discrete latent feature space for use in a number of different applications). 
It would have been obvious to one of ordinary in the art before the effective filing date of the claimed invention to have modified Bai to incorporate the teachings of Saito, and apply the continuous latent space and the discrete latent space into the Bai’s teachings for combining the first feature vector with the second feature vector, the combination of the first feature vector and the second feature vector representing a continuous latent space for the at least one video; mapping the continuous latent space to a discrete latent space, the discrete latent space representing one or more motion characteristics of a subject; decoding the discrete latent space into a plurality of coefficients; and generating an avatar based on the plurality of coefficients, the avatar comprising a sequence of frames depicting the subject and an emotional reaction of the subject. Doing so would efficiently, flexibly, and accurately generating and reconstructing human motion sequences. Regarding claim 9, Bai in view of Saito, discloses the computer-implemented method of claim 8, and further discloses wherein the one or more visual features of the at least one video (see Claim 8 rejection for detailed analysis) comprises a facial expression or motion of a subject of the at least one video (Bai- ¶0072, at least discloses In Figure 3c, the motion generation of the speaker's head (including face) is mainly achieved by processing a time-varying signal and a reference image of the speaker and emotion input by the dashed box through a head motion generation model. The processed result is then rendered by a head rendering model, outputting a motion image frame of the speaker's head (including face) shown by the dotted dashed box ; ¶0083-0086, at least disclose the terminal can extract features from each video frame of the video sequence of the real object using a face reconstruction model to obtain multiple video frame features; and use all the second pose expression features corresponding to the video sequence of the real object as video features […] S10211. Feature extraction is performed on each video frame of the video sequence of the real object using a face reconstruction model to obtain multiple video frame features […] the video frame features are obtained by recording the head rotation, facial expressions, and various shooting factors of the person in each video frame of the video sequence. Video frame features include secondary identity features and secondary pose and facial expression features […] The second type of postural facial features are obtained by recording the changes in a real person's head movements and facial expressions while they are speaking). Regarding claim 10, Bai in view of Saito, discloses the computer-implemented method of claim 8, and further discloses wherein the one or more audio features of the at least one video (see Claim 8 rejection for detailed analysis) comprises speech made by a subject of the at least one video (Bai- ¶0070, at least discloses responses can provide the speaker with information about whether the listeners are interested, understand, and agree with the content of the speech, thus adjusting the pace and progress of the conversation and facilitating smooth communication). 
Regarding claim 11, Bai in view of Saito, discloses the computer-implemented method of claim 8, and further discloses wherein mapping the continuous latent space to the discrete latent space (see Claim 8 rejection for detailed analysis) comprises dividing the continuous latent space into a plurality of segments, encoding each segment of the plurality of segments, and mapping each encoded segment into a discrete representation of the discrete latent space (Saito- ¶0022, at least discloses some conventional image generation systems utilize motion graphs including discrete motion segments from captured data labeled as nodes and transitions as edges. While these conventional systems provide intuitive and practical utility for character animation in various industries once the motion graphs are constructed, the conventional systems lack scalability. Specifically, the conventional systems require manual labeling of motion segments and transition parameters, which requires significant time and expertise. Accordingly, the conventional systems lack efficiency, because they are limited to only specific motions segments (and corresponding transitions) that have previously been labeled without significant additional time and effort). It would have been obvious to one of ordinary in the art before the effective filing date of the claimed invention to have modified Bai to incorporate the teachings of Saito, and apply the motion segments into the Bai’s teachings for dividing the continuous latent space into a plurality of segments, encoding each segment of the plurality of segments, and mapping each encoded segment into a discrete representation of the discrete latent space. The same motivation that was utilized in the rejection of claim 8 applies equally to this claim. Regarding claim 12, Bai in view of Saito, discloses the computer-implemented method of claim 8, and discloses the method further comprising: combining, by the processor, a third feature vector with the discrete latent space, wherein the third feature vector represents an emotional characteristic of a subject of the at least one video (Bai- ¶0018, at least discloses the first feature is one of a positive attitude, a negative attitude, and a neutral attitude [emotional characteristics]; ¶0152, at least discloses The facial images of the real listeners were collected under different primary features, which included positive attitude, negative attitude, and neutral attitude; ¶0202, at least discloses the listener decoder is used to decode the predicted video frame into a vector containing two feature vectors). 
Regarding claim 13, Bai in view of Saito, discloses the computer-implemented method of claim 8, and further discloses wherein decoding the discrete latent space into the plurality of coefficients (see Claim 8 rejection for detailed analysis) comprises decoding one or more geometrical features of at least one image of the plurality of images (Bai- ¶0117, at least discloses Based on the first pose expression features, anthropomorphic features, and the first feature, a virtual prediction network is used to predict and decode to determine the multi-frame pose expression features of the virtual object […] the next predicted video frame is decoded by the second processing module to determine the next posture and expression feature of the virtual object corresponding to the next predicted video frame; prediction and decoding are continued using the next posture and expression feature and the next frame anthropomorphic feature in the multi-frame anthropomorphic features until the last posture and expression feature of the virtual object corresponding to the last predicted video frame is obtained, thereby obtaining the multi-frame posture and expression features of the virtual object; ¶0202-0204, at least discloses the listener decoder is used to decode the predicted video frame into a vector containing two feature vectors: representing the expression and representing the pose (rotation and translation) […] Where represents the predicted facial expression, represents the predicted pose (rotation and translation); Dm and LSTM are both components in the listener decoder used to generate multi-frame pose and facial expression features). 8. Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Bai in view of Saito, further in view of Nayar et al. (“Nayar”) [US-2011/0115798-A1] Regarding claim 14, Bai in view of Saito, discloses the computer-implemented method of claim 8, and further discloses wherein generating the avatar based on the plurality of coefficients (see Claim 8 rejection for detailed analysis) and does not explicitly disclose, but Nayar discloses comprises warping at least one image of the plurality of images (Nayar- ¶0061, at least discloses Similar to FIGS. 2-4, corresponding points are used to warp the prototype surface to create a facial surface that corresponds to the stereo image. For example, a dense mesh can be generated by warping the prototype facial surface to match the set of reconstructed points […] similar to FIGS. 2-4, a number of corresponding points can be manually marked between points on the generic mesh and points on the stereo image. These corresponding points are then used to obtain an initial estimate of the rigid pose and warping of the generic mesh). It would have been obvious to one of ordinary in the art before the effective filing date of the claimed invention to have modified Bai to incorporate the teachings of Nayar, and apply the warping of the generic mesh into the Bai’s teachings for generating the avatar based on the plurality of coefficients comprises warping at least one image of the plurality of images. Doing so would create speech-enabled avatars of faces that provide realistic facial motion from text or speech inputs. 9. Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Bai in view of Saito, further in view of Biswas et al. 
(“Biswas”) [US-2022/0036617-A1] Regarding claim 15, Bai in view of Saito, discloses the computer-implemented method of claim 8, and further discloses wherein generating the avatar (see Claim 8 rejection for detailed analysis) comprises and does not explicitly disclose, but Biswas discloses controlling a blinking rate of the subject (Biswas- ¶0082, at least discloses The method of the present disclosure produced a blink rate of 0.3 blink(s) and 0.38 blink(s) (refer above Table 1 and 2) for TCD-TIMIT and GRID datasets respectively which is similar to the average human blink rate of 0.28-0.4 blink(s). It would have been obvious to one of ordinary in the art before the effective filing date of the claimed invention to have modified Bai to incorporate the teachings of Biswas, and apply the blink rate into the Bai’s teachings for generating the avatar comprises controlling a blinking rate of the subject. Doing so would generate realistic animation from audio on any unknown faces and cannot be easily generalized to different facial characteristics and voice accents. Conclusion 10. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. They are as recited in the attached PTO-892 form. 11. Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL LE whose telephone number is (571)272-5330. The examiner can normally be reached 9am-5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang can be reached at (571) 272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /MICHAEL LE/Primary Examiner, Art Unit 2614

Prosecution Timeline

Apr 12, 2024: Application Filed
Jan 24, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12579211
AUTOMATED SHIFTING OF WEB PAGES BETWEEN DIFFERENT USER DEVICES
2y 5m to grant • Granted Mar 17, 2026
Patent 12579738
INFORMATION PRESENTING METHOD, SYSTEM THEREOF, ELECTRONIC DEVICE, AND COMPUTER-READABLE STORAGE MEDIUM
2y 5m to grant • Granted Mar 17, 2026
Patent 12579072
GRAPHICS PROCESSOR REGISTER FILE INCLUDING A LOW ENERGY PORTION AND A HIGH CAPACITY PORTION
2y 5m to grant • Granted Mar 17, 2026
Patent 12573094
COMPRESSION AND DECOMPRESSION OF SUB-PRIMITIVE PRESENCE INDICATIONS FOR USE IN A RENDERING SYSTEM
2y 5m to grant • Granted Mar 10, 2026
Patent 12558788
SYSTEM AND METHOD FOR REAL-TIME ANIMATION INTERACTIVE EDITING
2y 5m to grant • Granted Feb 24, 2026
Study what changed to get past this examiner, based on the 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 66%
With Interview: 88% (+22.1%)
Median Time to Grant: 3y 3m
PTA Risk: Low
Based on 864 resolved cases by this examiner. Grant probability derived from career allow rate.
