DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed January 5, 2026 has been entered. Claims 1, 3-8, 10-14, and 16-20 are pending in the application. Applicant’s amendments to claims 1, 8, and 14 have overcome the rejections previously set forth in the Final Office Action mailed November 5, 2025. A further search has been performed to address the subject matter amended into those claims. The newly found reference Ji (NPL: Audio-Driven Emotional Video Portraits) is applied against the amended independent claims.
Response to Arguments
Applicant’s arguments with respect to claims 1, 7, 8, 14, and 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3-6, 8, 10-14, and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Lee (US 20220398794 A1) in view of Steptoe (US 11270487 B1, see attached document for paragraph numbers), Seol (US 20240013462 A1), and Ji (NPL: Audio-Driven Emotional Video Portraits).
Regarding claim 1:
Lee teaches:
A computer-implemented method, comprising:
capturing supplemental data (Lee: audio/video (A/V) input [0024]) generated by a transducer (Lee: the input module may include a camera and a microphone [0024]), wherein the supplemental data specifies one or more attributes of a user (Lee: In the case of video input, the assessment unit can be part of the computer vision unit for analyzing the video of the user to determine the satisfaction factor of the session. The satisfaction factor may be based on emotional indicators of the user, such as facial expressions, hand gestures and speech, such as loudness, speed and intonation [0034]), and wherein the capturing is performed in substantially real-time with the user providing an input to a conversational platform (Lee: a user may speak into the user device configured with a camera and microphone and a display displaying the 3D AI avatar chatbot conversing with the user [0024]);
generating, by a behavior determiner (Lee: chatbot system [0006]), behavioral data (Lee: satisfaction factor [0019]; the avatar generation module is configured to cooperate with the audio to face module and the avatar animation module to generate an animated 3D lifelike avatar speaking the audio response with facial movement [0006]; see Note 1A) based on the supplemental data (Lee: the video of the user [0019]) and an audio response generated by the conversational platform (Lee: the MMC module converts the text response to an audio response [0006]) in response to the input to the conversational platform (Lee: a multi-modal conversational (MMC) module configured to process user input into a processed input for the chatbot system to respond [0006]); and
generating, by a rendering network (Lee: rendering unit 276 [0030]), based on the behavioral data (Lee: From the analysis of the assessment unit, the chatbot can understand misunderstandings in the user's question, learning from the misunderstandings and mistakes to improve the chatbot's performance [0019]; see Note 1A) and the audio response, a video rendering of a virtual human engaging in a conversation with the user, wherein the video rendering is synchronized with the audio response (Lee: The avatar output module, in one embodiment, cooperates with the audio to face module, the avatar animation module and the text to speech unit of the MMC to produce a lifelike 3D avatar to speak the response on the display of the user device [0029]).
wherein the generating the video rendering comprises combining the audio response and the behavioral data (see Note 1A) to generate one or more head poses of the virtual human during the conversation (Lee: the audio to face module and text to speech unit are configured to animate the 3D model of the virtual human which includes the animated face with the voice from the audio to face module and the text to speech [0031]) in which mouth and lip movements of the virtual human are synchronized with the audio response during the conversation (Lee: In one embodiment, the 3D avatar is imparted with lifelike characteristics, including facial expressions and voice, such as facial, eye and mouth movements as well as natural speech corresponding to the response [0018]; see Note 2A).
Note 1A: Lee teaches that the chatbot may generate a satisfaction factor (par [0019] as cited above), and an animation based on the audio response (par [0006] as cited above).
Regarding the satisfaction factor, Lee teaches “The satisfaction factor may be based on emotional indicators of the user, such as facial expressions, hand gestures and speech, such as loudness, speed and intonation,” and that “The emotional indicators can be analyzed to infer the satisfaction factor of the user, such as contentment, happiness, frustration, uncertainty and anger,” thereby indicating a behavioral state of the user. The satisfaction factor is then further utilized to train the rendering network, which ultimately impacts the video rendering generated by the system: “From the analysis of the assessment unit, the chatbot can understand misunderstandings in the user's question, learning from the misunderstandings and mistakes to improve the chatbot's performance” [0019], and the performance of the chatbot includes generating a video rendering of a virtual human engaging in a conversation with the user.
Regarding the audio to face module, Lee teaches that the audio to face module utilizes the audio response to “generate an animated 3D lifelike avatar speaking the audio response with facial movement,” which would enable the avatar to display an emotion or behavior depending on the contents of the audio response.
Note 2A: Lee teaches “lifelike characteristics, including facial expressions and voice, such as facial, eye and mouth movements” [0018] which inherently requires synchronizing lip movements of the virtual human with the audio.
Lee fails to explicitly teach:
generating, by a behavior determiner, behavioral data based on the supplemental data and an audio response including viseme features generated by the conversational platform in response to the input to the conversational platform,
wherein the behavior determiner is trained to output behavioral data including contour drawings specifying a spatial arrangement of eyes and eyebrows and head position for a virtual human based on predicted emotive condition of the user;
generating, by a rendering network, based on the behavioral data including the contour drawings and the audio response, a video rendering of the virtual human engaging in a conversation with the user
combining the audio response including the viseme features and the behavioral data to generate one or more head poses of the virtual human during the conversation in which mouth and lip movements of the virtual human are synchronized with the audio response during the conversation.
Steptoe teaches:
generating, by a behavior determiner, behavioral data based on the supplemental data (Steptoe: Modules 102 may include an identifying module 104 that identifies a set of action units (AUs) associated with a face of a user, Paragraph 10; see Note 7B below) and an audio response including viseme features (see Note 1B) generated by the conversational platform in response to the input to the conversational platform;
Note 1B: Steptoe teaches: “audio data 146 may include, without limitation, information associated with one or more phonemes that may be produced by a user, data associated with one or more audio clips associated with one or more phonemes, data representative of one or more waveforms associated with one or more phonemes, data associated with recognition of one or more phonemes, and so forth.” (Paragraph 16). Steptoe further teaches: “a viseme may be conceptually viewed as a visual analog to a phoneme” (Paragraph 33).
Furthermore, Steptoe teaches: “one or more of the systems described herein may direct a computer-generated avatar that represents the user to produce the viseme in accordance with the set of AU parameters associated with each AU in response to detecting that the user has produced the sound,” (Paragraph 76). That is, the avatar may respond to a user’s audio input by generating a viseme. Therefore, when the teachings of Steptoe are combined with Lee, the audio response taught by Lee would include viseme features.
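For illustration only (and not as a characterization of any reference of record), the phoneme-to-viseme relationship discussed in Note 1B may be sketched as follows; the mapping tables, AU identifiers, and function names below are hypothetical placeholders:

    # Illustrative sketch (hypothetical names): map a detected phoneme to a viseme
    # and to a set of action-unit (AU) parameters that could drive an avatar's mouth.
    PHONEME_TO_VISEME = {          # hypothetical phoneme-to-viseme table
        "p": "bilabial_closure",
        "f": "labiodental",
        "aa": "open_jaw",
    }

    VISEME_TO_AU_PARAMS = {        # hypothetical AU parameter sets per viseme
        "bilabial_closure": {"AU23": 0.8, "AU24": 0.9},
        "labiodental": {"AU10": 0.4, "AU25": 0.3},
        "open_jaw": {"AU26": 1.0, "AU27": 0.7},
    }

    def au_params_for_phoneme(phoneme: str) -> dict:
        """Return AU parameters for the viseme associated with a detected phoneme."""
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        return VISEME_TO_AU_PARAMS.get(viseme, {})

    # e.g., upon detecting the sound "p", the avatar would be directed to produce
    # the corresponding viseme in accordance with these AU parameters.
    print(au_params_for_phoneme("p"))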
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Steptoe with Lee. Generating, by a behavior determiner, behavioral data based on the supplemental data and an audio response including viseme features generated by the conversational platform in response to the input to the conversational platform, as in Steptoe, would benefit the Lee teachings by enabling realistic and convincing animation of the avatar.
Lee in view of Steptoe fails to explicitly teach:
wherein the behavior determiner is trained to output behavioral data including contour drawings specifying a spatial arrangement of eyes and eyebrows and head position for a virtual human based on predicted emotive condition of the user;
generating, by a rendering network, based on the behavioral data including the contour drawings and the audio response, a video rendering of the virtual human engaging in a conversation with the user
combining the audio response including the viseme features and the behavioral data to generate one or more head poses of the virtual human during the conversation in which mouth and lip movements of the virtual human are synchronized with the audio response during the conversation.
Seol teaches:
combining the audio response and the behavioral data (Seol: an animation system (or other image data generation or synthesis system, component, module, or device) can accept or infer an emotional state, and attempt to generate animation of this character that not only matches any audio to be uttered by this character, but also conveys that utterance with an emotional behavior. [0022]) to generate one or more head poses of the virtual human during the conversation (Seol: a character such as a character corresponding to the head region illustrated in a set 100 of images illustrated in FIG. 1 might be animated to have their mouth, face, and/or head move in such a way as to convey that the character is uttering speech represented by audio data, which may be provided for playback or other presentation along with this animation. [0022])
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Seol with Lee in view of Steptoe. Combining the audio response and the behavioral data to generate one or more head poses of the virtual human during the conversation, as in Seol, would benefit the Lee in view of Steptoe teachings by enabling more natural head movements of the virtual human while conversing with the user.
Lee in view of Steptoe and Seol still fails to explicitly teach:
wherein the behavior determiner is trained to output behavioral data including contour drawings specifying a spatial arrangement of eyes and eyebrows and head position for a virtual human based on predicted emotive condition of the user;
generating, by a rendering network, based on the behavioral data including the contour drawings and the audio response, a video rendering of the virtual human engaging in a conversation with the user
Ji teaches:
wherein the behavior determiner is trained (Ji: The [generated landmarks with the edge map of the target image] can serve as the guidance to train an Edge-to-Video translation network, Pg. 5, Section 3.3: Target-Adaptive Face Synthesis; see Note 1C) to output behavioral data including contour drawings (Ji: guidance map, Pg. 5, Edge-to-Video Translation Network, par. 1; see Note 1D) specifying a spatial arrangement of eyes and eyebrows and head position for a virtual human (see Note 1E) based on predicted emotive condition of the user (see Note 1F);
generating, by a rendering network, based on the behavioral data including the contour drawings and the audio, a video rendering of the virtual human (Ji: the rendering network gives us photo-realistic animations of the target portrait based on the target video and edge maps, Pg. 3, Figure 2).
Note 1C: Ji teaches: “we adopt a conditional-GAN architecture for our Edge-to-Video translation network” (Pg. 5, Edge-to-Video Translation Network). A GAN (Generative Adversarial Network) is known in the art to be trained in an adversarial feedback loop in which a discriminator evaluates whether the generator’s own output is “real” or “fake”. Therefore, one of ordinary skill in the art would understand the Edge-to-Video Translation Network taught by Ji to be trained to produce a guidance map (i.e., the generated landmarks merged with the edge map).
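As general background for the adversarial training discussed in Note 1C, a single conditional-GAN training step may be sketched as follows (a schematic sketch in PyTorch; the network sizes, tensor shapes, and variable names are illustrative assumptions and are not taken from Ji):

    # Schematic conditional-GAN training step (illustrative only).
    import torch
    import torch.nn as nn

    cond_dim, noise_dim, out_dim = 16, 8, 32   # condition (e.g., guidance), noise, output sizes
    G = nn.Sequential(nn.Linear(cond_dim + noise_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
    D = nn.Sequential(nn.Linear(cond_dim + out_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    cond = torch.randn(4, cond_dim)   # conditioning input, e.g., an encoded guidance map
    real = torch.randn(4, out_dim)    # real target, e.g., an encoded video frame
    fake = G(torch.cat([cond, torch.randn(4, noise_dim)], dim=1))

    # Discriminator step: learn to label real samples "real" and generated samples "fake".
    d_loss = bce(D(torch.cat([cond, real], dim=1)), torch.ones(4, 1)) + \
             bce(D(torch.cat([cond, fake.detach()], dim=1)), torch.zeros(4, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step (the feedback loop): update G so that D scores its output as real.
    g_loss = bce(D(torch.cat([cond, fake], dim=1)), torch.ones(4, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()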
Note 1D: Ji teaches: “Given the adapted landmarks and the target frame, we merge the landmarks and the edge map extracted from this frame into a guidance map for portrait generation. In particular, […] we connect adjacent facial landmarks to create a face sketch.” (Pg. 5, Edge-to-Video Translation Network).
The specification of the present application teaches: “Once trained on the annotated segments, the behavior determiner is capable of generating outputs (e.g., contour drawings) that when fed into the rendering network guide the network in generating a video rendering of the virtual human.” [0030]. Ji likewise teaches generating a video based on the edge maps merged with the adapted landmarks: “the rendering network gives us photo-realistic animations of the target portrait based on the target video and edge maps” (Pg. 3, Figure 2).
Therefore, the Examiner considers the edge maps of Ji to be analogous to the claimed contour drawings.
Note 1E: Figure 7 on Pg. 8 shows that the edge map contains representations of eyes, eyebrows, and a head. Similarly, the landmarks shown in Figure 2 also contain representations of eyes, eyebrows, and a head. Therefore, when “the landmarks and the edge map extracted from this frame [are merged] into a guidance map” (Pg. 5, Edge-to-Video Translation Network), the resulting guidance map would also contain representations of eyes, eyebrows, and a head.
Note 1F: Ji teaches: “We first extract disentangled content and emotion information from the audio signal. Then we predict landmark motion from audio representations.” That is, the predicted landmarks represent a predicted emotion expression based on the audio. Figure 2 on Pg. 3 of Ji shows that the predicted landmarks are then utilized to generate a reconstructed mesh, which is projected to the edge maps, which are in turn fed to the Edge-to-Video translation network. Put simply, the guidance map generated by the Edge-to-Video network is indirectly based on predicted emotion data from the audio of the user.
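The data flow summarized in Note 1F may be sketched schematically as follows; each function below is a hypothetical stub standing in for a stage that Ji describes, and none of the code is taken from Ji:

    # Schematic sketch of the pipeline summarized in Note 1F (hypothetical stubs).
    import numpy as np

    def extract_content_and_emotion(audio):
        """Disentangle speech-content and emotion features from the audio signal."""
        return np.zeros(64), np.zeros(16)            # placeholder features

    def predict_landmark_motion(content, emotion):
        """Predict facial-landmark motion from the audio representations."""
        return np.zeros((68, 2))                     # placeholder 2-D landmarks

    def reconstruct_mesh_and_project(landmarks, target_frame):
        """Reconstruct a mesh from the landmarks and project it to an edge map."""
        return np.zeros_like(target_frame)           # placeholder edge map

    def merge_into_guidance_map(landmarks, edge_map):
        """Merge the adapted landmarks with the target frame's edge map."""
        return edge_map                              # placeholder guidance map

    def edge_to_video(guidance_map, target_frame):
        """Edge-to-Video translation network: render a photo-realistic frame."""
        return target_frame                          # placeholder rendered frame

    audio, target_frame = np.zeros(16000), np.zeros((256, 256))
    content, emotion = extract_content_and_emotion(audio)
    landmarks = predict_landmark_motion(content, emotion)
    edge_map = reconstruct_mesh_and_project(landmarks, target_frame)
    guidance = merge_into_guidance_map(landmarks, edge_map)
    frame = edge_to_video(guidance, target_frame)    # emotion from the audio propagates to the render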
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Ji with Lee in view of Steptoe and Seol. Outputting behavioral data including contour drawings specifying a spatial arrangement of eyes and eyebrows and head position would benefit the Lee in view of Steptoe and Seol teachings by enabling the network to generate realistic head movements based on emotions detected from audio: “Audio does not supply any cues for head poses and the global movements of a head, thus the edited head inferred from audio may have large head pose and movement variances with the target videos” (Ji: Pg. 2, Section 1: Introduction, par. 4).
Regarding claim 3:
Lee in view of Steptoe, Seol, and Ji teaches:
The computer-implemented method of claim 1 (as shown above), wherein the supplemental data includes user speech (Lee: The satisfaction factor may be based on emotional indicators of the user, such as […] speech [0019]), and wherein the generating the behavioral data includes:
generating the behavioral data, at least in part, based on a machine-generated sentiment analysis of the user speech (Lee: The emotional indicators can be analyzed to infer the satisfaction factor of the user, such as contentment, happiness, frustration, uncertainty and anger [0019]).
Regarding claim 4:
Lee in view of Steptoe, Seol, and Ji teaches:
The computer-implemented method of claim 1 (as shown above), wherein the supplemental data includes one or more user facial expressions (Lee: The satisfaction factor may be based on emotional indicators of the user, such as facial expressions [0019]), and wherein the generating the behavioral data includes:
generating the behavioral data, at least in part, based on a machine-generated expression analysis of the one or more user facial expressions (Lee: The emotional indicators can be analyzed to infer the satisfaction factor of the user, such as contentment, happiness, frustration, uncertainty and anger [0019]).
Regarding claim 5:
Lee in view of Steptoe, Seol, and Ji teaches:
The computer-implemented method of claim 1 (as shown above), wherein the audio response is generated from a textual reply, generated by the conversational platform, to the input provided from the user to the conversational platform (Lee: the database module is configured to generated[sic] a text response based on the processed user input, wherein the MMC module converts the text response to an audio response. [0006]).
Regarding claim 6:
Lee in view of Steptoe, Seol, and Ji teaches:
The computer-implemented method of claim 1 (as shown above), wherein the generating the video rendering comprises:
combining the audio response and the behavioral data (Lee: the audio to face module and text to speech unit are configured to animate the 3D model of the virtual human which includes the animated face with the voice from the audio to face module and the text to speech [0031]; see Note 6A) to generate both head movement and body movement (Lee: For example, the 3D avatar is a lifelike 3D avatar with body movements, including facial expressions as well as a human-like voice [0031]) of the virtual human during the conversation.
Note 6A: Lee teaches that the “assessment unit can analyze the video of the user to determine the satisfaction factor of the session”, where the satisfaction factor is analogous to the behavioral data as previously shown. Lee further teaches that “the chatbot can understand misunderstandings in the user's question, learning from the misunderstandings and mistakes to improve the chatbot's performance,” i.e., the chatbot of Lee may respond based on the user’s behavior and emotions. Lee further teaches “the 3D avatar is imparted with lifelike characteristics, including facial expressions and voice, such as facial, eye and mouth movements as well as natural speech corresponding to the response” [0018], indicating that the head and body movements of the avatar are generated from the response, which is based on both the audio and the behavioral data.
Lee fails to explicitly teach:
combining the audio response and the behavioral data to generate both head and body movements of the virtual human during the conversation.
Seol teaches:
combining the audio response and the behavioral data (Seol: an animation system (or other image data generation or synthesis system, component, module, or device) can accept or infer an emotional state, and attempt to generate animation of this character that not only matches any audio to be uttered by this character, but also conveys that utterance with an emotional behavior. [0022]) to generate both head and body movements of the virtual human during the conversation (Seol: a character such as a character corresponding to the head region illustrated in a set 100 of images illustrated in FIG. 1 might be animated to have their mouth, face, and/or head move in such a way as to convey that the character is uttering speech represented by audio data, which may be provided for playback or other presentation along with this animation. [0022]; Seol: The motion or deformation information output by the network can correspond to a set of facial (or other body) components or portions that can be animated [0021]).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Seol with Lee in view of Steptoe and Ji. Combining the audio response and the behavioral data to generate both head and body movements of the virtual human during the conversation, as in Seol, would benefit the Lee in view of Steptoe and Ji teachings by enabling more natural head and body movements of the virtual human while conversing with the user.
Regarding claim 8:
Claim 8 is substantially similar to Claim 1, and is therefore rejected for similar reasons. Claim 8 contains the following notable differences:
Claim 8 claims a system instead of a method. Lee teaches a “chatbot system” [0006].
Regarding claim 10:
Claim 10 is substantially similar to Claim 3, and is therefore rejected for similar reasons. Claim 10 contains the following notable differences:
Claim 10 claims a system instead of a method. Lee teaches a “chatbot system” [0006].
Regarding claim 11:
Claim 11 is substantially similar to Claim 4, and is therefore rejected for similar reasons. Claim 11 contains the following notable differences:
Claim 11 claims a system instead of a method. Lee teaches a “chatbot system” [0006].
Regarding claim 12:
Claim 12 is substantially similar to Claim 5, and is therefore rejected for similar reasons. Claim 12 contains the following notable differences:
Claim 12 claims a system instead of a method. Lee teaches a “chatbot system” [0006].
Regarding claim 13:
Claim 13 is substantially similar to Claim 6, and is therefore rejected for similar reasons. Claim 13 contains the following notable differences:
Claim 13 claims a system instead of a method. Lee teaches a “chatbot system” [0006].
Regarding claim 14:
Claim 14 is substantially similar to Claim 1, and is therefore rejected for similar reasons. Claim 14 contains the following notable differences:
Claim 14 claims a computer program product instead of a method. Lee teaches a computer program product:
A computer program product (Lee: an AI-based conversational chatbot platform 100 [0013]), the computer program product comprising: one or more computer-readable storage media and program instructions collectively stored on the one or more computer-readable storage media (Lee: The DB module 290 stores the knowledge base of the owner of the chatbot system. For example, the DB module stores the knowledge for the AI avatar [0027]), the program instructions executable by a processor to cause the processor to initiate operations (Lee: In one embodiment, the input is processed by the MMC module, [0025])
Regarding claim 16:
Claim 16 is substantially similar to Claim 3, and is therefore rejected for similar reasons. Claim 16 contains the following notable differences:
Claim 16 claims a computer program product instead of a method. In the rejection of claim 14, it was shown that Lee teaches a computer program product.
Regarding claim 17:
Claim 17 is substantially similar to Claim 4, and is therefore rejected for similar reasons. Claim 17 contains the following notable differences:
Claim 17 claims a computer program product instead of a method. In the rejection of claim 14, it was shown that Lee teaches a computer program product.
Regarding claim 18:
Claim 18 is substantially similar to Claim 5, and is therefore rejected for similar reasons. Claim 18 contains the following notable differences:
Claim 18 claims a computer program product instead of a method. In the rejection of claim 14, it was shown that Lee teaches a computer program product.
Regarding claim 19:
Claim 19 is substantially similar to Claim 6, and is therefore rejected for similar reasons. Claim 19 contains the following notable differences:
Claim 19 claims a computer program product instead of a method. In the rejection of claim 14, it was shown that Lee teaches a computer program product.
Claims 7 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lee (US 20220398794 A1) in view of Steptoe (US 11270487 B1, see attached document for paragraph numbers), Seol (US 20240013462 A1), Ji (NPL: Audio-Driven Emotional Video Portraits), and Fruhstuck et al. (NPL: InsetGAN for Full-Body Image Generation; from applicant’s IDS (cited by NPL “Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement”)).
Regarding claim 7:
Lee in view of Steptoe, Seol, and Ji teaches:
The computer-implemented method of claim 6 (as shown above),
Lee in view of Steptoe, Seol, and Ji fails to teach:
wherein the rendering network comprises distinct subnetworks for generating, respectively, the head and body movements of the virtual human during the conversation.
Fruhstuck teaches:
wherein the rendering network includes a first subnetwork for generating the head movement and a second subnetwork, distinct from the first subnetwork, for generating the body movement of the virtual human during the conversation (Fruhstuck: we show that a face GAN trained with the face regions cropped from our full-body training images can be used to improve the appearance of the body GAN results. Alternatively, we can also leverage a face generator trained on other datasets such as FFHQ [14] for face enhancement as well. Similarly, specialized hands or feet generators can also be used in our framework to improve other regions of the body, Pg. 3, Section 3.2: Multi-GAN Optimization, par. 2; see Note 7A);
Note 7A: When the teachings of Fruhstuck are combined with Lee in view of Steptoe, Seol, and Ji, it would be obvious to utilize the dedicated face and body generators of Fruhstuck to generate the virtual human during a conversation with the user.
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Fruhstuck with Lee in view of Steptoe, Seol, and Ji. Having the rendering network include distinct subnetworks for generating, respectively, the head and body movements of the virtual human, as in Fruhstuck, would benefit the Lee in view of Steptoe, Seol, and Ji teachings by ensuring that each network, trained on a particular part of the avatar, can generate that part with the best accuracy possible while minimizing errors. (Fruhstuck: Faces are especially hard since we humans are ultra-sensitive to artifacts in these areas. They therefore deserve dedicated networks and special treatment. Pg. 3, Section 3.1: Full Body GAN)
Lee in view of Seol, Ji, and Fruhstuck still fails to explicitly teach:
wherein the behavioral data generated by the behavior determiner includes head-related behavior data and body-related behavior data; and
wherein the first subnetwork uses the head-related behavior data to generate the head movement and the second subnetwork uses the body-related behavior data to generate the body movement.
Steptoe teaches:
wherein the behavioral data generated by the behavior determiner (Steptoe: Modules 102 may include an identifying module 104 that identifies a set of action units (AUs), Paragraph 10) includes head-related behavior data and body-related behavior data (Steptoe: in some examples, body data 142 may include face data 144 that may include any suitable data associated with one or more faces including, without limitation, information associated with one or more visemes, Paragraph 67; see Note 7B); and
wherein the first subnetwork uses the head-related behavior data to generate the head movement and the second subnetwork uses the body-related behavior data to generate the body movement (see Note 7C).
Note 7B: Steptoe teaches “action units (AUs) associated with a face of a user, each AU associated with at least one muscle group engaged by the user to produce a viseme associated with a sound produced by the user” and that the action units are included in “body data” that “may describe any relationship of a muscle group to a body action” (Paragraph 93).
The AU parameters may represent both face movement and body movement data: “Although described by way of example above in reference to AU parameters used to produce visemes, such parameters may describe any relationship of a muscle group to a body action. Such parameters may be associated with any user, any suitable muscle group and any suitable predefined body action, such as a hand movement, a walking gait, a throwing motion, a head movement, and so forth.” (Paragraph 93). The AU parameters are eventually used to control an avatar to perform a corresponding motion: “directing module 110 may cause user device 202, server 206, and/or target device 208 to direct a computer-generated avatar (e.g., computer-generated avatar 238) that represents the user to produce the viseme in accordance with the set of AU parameters,” (Paragraph 23).
Furthermore, one of ordinary skill in the art would liken the AU data to behavioral data, because Steptoe teaches: “face data 144 may include AU parameters, such as an onset curve (e.g., onset curve 224) and/or a falloff curve (e.g., falloff curve 226) […] (e.g., an onset curve and/or offset curve that may describe behavior of the muscle groups of most users …” (Paragraph 67) That is, the AU data represents the behavior of a user. Therefore, the Examiner understands Steptoe to teach behavioral data generated by a behavior determiner including head-related behavior data and body-related behavior data.
Note 7C: Previously, it was shown that Fruhstuck teaches distinct networks for head and body movement generation. In Note 7B, it was shown that Steptoe teaches body data and face data, with corresponding behavioral data (“action units” or AU data) used to generate movements. Therefore, it would be obvious to one of ordinary skill in the art to apply the face data to the “face GAN” to generate head movements, and the body data to the “body GAN” to generate body movements.
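For illustration, the routing described in Note 7C (head-related behavior data to a dedicated face subnetwork, body-related behavior data to a dedicated body subnetwork) may be sketched as follows; the module names, input dimensions, and output dimensions are hypothetical and do not represent the implementation of Fruhstuck or Steptoe:

    # Illustrative sketch: distinct subnetworks for head and body movement.
    import torch
    import torch.nn as nn

    class HeadSubnetwork(nn.Module):
        """Generates head movement from head-related behavior data (e.g., facial AU parameters)."""
        def __init__(self, in_dim=32, out_dim=6):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
        def forward(self, head_behavior):
            return self.net(head_behavior)

    class BodySubnetwork(nn.Module):
        """Generates body movement from body-related behavior data (e.g., body AU parameters)."""
        def __init__(self, in_dim=48, out_dim=20):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
        def forward(self, body_behavior):
            return self.net(body_behavior)

    head_net, body_net = HeadSubnetwork(), BodySubnetwork()
    head_movement = head_net(torch.randn(1, 32))   # hypothetical head-related behavior data
    body_movement = body_net(torch.randn(1, 48))   # hypothetical body-related behavior data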
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Steptoe with Lee in view of Seol, Ji, and Fruhstuck. Having the behavioral data generated by the behavior determiner include head-related behavior data and body-related behavior data, and having the first subnetwork use the head-related behavior data to generate the head movement and the second subnetwork use the body-related behavior data to generate the body movement, as in Steptoe, would benefit the Lee in view of Seol, Ji, and Fruhstuck teachings by enabling the system to apply body movements separately from facial movements, allowing for more natural movement.
Regarding claim 20:
Claim 20 is substantially similar to Claim 7, and is therefore rejected for similar reasons. Claim 20 contains the following notable differences:
Claim 20 claims a computer program product instead of a method. In the rejection of claim 14, it was shown that Lee teaches a computer program product.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT ALEXANDER PROVIDENCE whose telephone number is (571)270-5765. The examiner can normally be reached Monday-Thursday 8:30-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Poon, can be reached at (571)270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/VINCENT ALEXANDER PROVIDENCE/
Examiner, Art Unit 2617

/KING Y POON/
Supervisory Patent Examiner, Art Unit 2617