DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yi et al. (NPL, “Predicting Personalized Head Movement from Short Video and Speech Signals”, 09/2022, hereinafter “Yi”) in view of Shu et al. (US Publication Number 2024/0062495 A1, hereinafter “Shu”), and further in view of Valentin et al. (US Patent Number 11,429,835 B1, hereinafter “Valentin”).
(1) regarding claim 8:
As explained in the introduction, Yi disclosed a computer device (page 2, I. Introduction, para. [0006], note that based on the proposed mapping strategy, we build a system that can transfer the speech signal of an arbitrary source person into the talking face video of an arbitrary target person with learning-based personalized head pose), comprising:
cause the computer device to perform a method for training a video generation model including:
obtaining a training video of a target user (page 3, III. Predicting Head Movement From Multi-Modal Input, para. [0001], note that we propose a novel two-step mapping strategy to predict head movement from input speech signal and short video);
extracting, from the training video, a phonetic feature of the target user, an expression parameter of the target user, and a head parameter of the target user (page 4, III. Predicting Head Movement From Multi-Modal Input, para. [0004], note that we characterize the head motion in each video of 10-15 seconds by a motion behavior pattern. Also see para. [0001], in addition to learning the transformation from the speech signal to lip motion and facial expression, our system specially considers the generation of personalized head movement of the target person);
synthesizing the phonetic feature of the target user, the expression parameter of the target user, and the head parameter of the target user to obtain a condition input of the training video (page 4, para. [0003], note that we characterize the head motion in each video of 10-15 seconds by a motion behavior pattern. Our modeling strategy has the following characteristics. First, different people have different motion behavior patterns. Second, since the coupling between speech and head motion changes from utterance to utterance, one person can have multiple motion behavior patterns in multiple short videos); and
wherein the video generation model is configured to perform object reconstruction on a target video of the target user to obtain a corresponding reconstructed video of the target user (see fig. 1, note that Fig. 1. Flowchart of our method. Stage 1: from audio-visual information to 3D facial animation, including (1) reconstructing 3D face of the target person (Section IV-A1), (2) training a general mapping from speech to the facial expression (Section IV-A2), and (3) training an encoder and decoder for personalized head movement for multiple subjects (Section IV-A3). Stage 2: from 3D facial animation to realistic talking face video generation, including (1) rendering 3D facial animation into video frames using a lightweight graphic engine (Section IV-B1), (2) background matching (Section IV-B2), (3) fine tuning rendered frames into realistic ones using a rendering-to-realistic GAN module (Section IV-B3), and (4) an enhancement module to obtain high quality results).
Yi disclosed most of the subject matter as described above except for specifically teaching a memory; one or more processors coupled to the memory; and one or more computer programs that, when executed by the one or more processors, performing network training on a neural radiance field based on the condition input, three-dimensional coordinates, and a viewing direction to obtain a video generation model; and the head parameter of the target user representing head pose information and head position information of the target user.
However, Shu disclosed a memory (804, fig. 8, note that a memory is disclosed); one or more processors coupled to the memory (802, fig. 8, note that the memory and the processor are coupled); and one or more computer programs that, when executed by the one or more processors, performing network training on a neural radiance field based on the condition input, three-dimensional coordinates, and a viewing direction to obtain a video generation model (para. [0037], note that the color model 215 models a neural radiance field (NeRF), which is defined as the continuous function F that, given a position of a point in the 3D scene (e.g., the editable 3D scene 104), x (i.e., three-dimensional coordinates), and a direction it is being viewed from, d (e.g., based on a camera position and camera view angle, i.e., a viewing direction), outputs a color c = (r, g, b) and a density σ).
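For illustration only, and not as a characterization of the claims as filed or of any cited reference's actual implementation, the limitation at issue can be pictured as a radiance field network that receives a condition input together with three-dimensional coordinates and a viewing direction and outputs a color and a density. The minimal PyTorch sketch below assumes the module name ConditionalNeRF, the layer sizes, and a simple concatenation of the condition vector; all are hypothetical choices.

```python
# Minimal sketch of a condition-driven neural radiance field (NeRF) MLP.
# Hypothetical layer sizes and input layout; not the implementation of any cited reference.
import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    def __init__(self, cond_dim: int, hidden: int = 256):
        super().__init__()
        # Inputs: 3D point (x, y, z), viewing direction (dx, dy, dz), condition vector.
        self.backbone = nn.Sequential(
            nn.Linear(3 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)            # density depends on position + condition
        self.color_head = nn.Sequential(                  # color additionally depends on view direction
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir, cond):
        h = self.backbone(torch.cat([xyz, cond], dim=-1))
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.color_head(torch.cat([h, view_dir], dim=-1))
        return rgb, sigma

# Toy usage: one training step on random tensors standing in for sampled points.
model = ConditionalNeRF(cond_dim=64)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
xyz = torch.rand(1024, 3)         # sampled three-dimensional coordinates
view_dir = torch.rand(1024, 3)    # viewing directions
cond = torch.rand(1024, 64)       # condition input (e.g., synthesized phonetic/expression/head features)
target_rgb = torch.rand(1024, 3)  # ground-truth colors from training frames
rgb, sigma = model(xyz, view_dir, cond)
loss = torch.mean((rgb - target_rgb) ** 2)
loss.backward()
optimizer.step()
```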
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to modify Yi to include a memory; one or more processors coupled to the memory; and one or more computer programs that, when executed by the one or more processors, performing network training on a neural radiance field based on the condition input, three-dimensional coordinates, and a viewing direction to obtain a video generation model. The suggestion/motivation for doing so would have been in order to generate, based on an input video and using a deformable NeRF scene representation model, a 3D scene including an editable object (para. [0003]). Therefore, it would have been obvious to combine Yi with Shu to obtain the invention as specified in claim 8.
In addition to that, Valentin disclosed the head parameter of the target user representing head pose information and head position information of the target user (col. 8, lines 27-43, note that the user device 208 may begin capturing image data 216 of the user 206 as part of a guided process where the user 206 is prompted to assume certain head motions or positions and/or facial expressions during capture of the image data 216. Also see col. 14, lines 54-56, head pose).
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to further modify the combination of Yi and Shu to include the head parameter of the target user representing head pose information and head position information of the target user. The suggestion/motivation for doing so would have been in order to improve and facilitate remote visual communication (col. 2, lines 40-42). Therefore, it would have been obvious to combine Yi, Shu and Valentin to obtain the invention as specified in claim 8.
(2) regarding claim 9:
Yi further disclosed the computer device according to claim 8, wherein the video generation model is obtained by optimizing an image reconstruction loss between a color value of a predicted object and a color value of a real object, the color value of the predicted object being generated by the neural radiance field based on the condition input, the three-dimensional coordinates, and the viewing direction (page 7, Attention-Based Generator G, para. [0001], note that we use an attention-based generator to refine rendered frames into realistic ones. Given a window of rendered frames (r_{t-2}, r_{t-1}, r_t), the generator synthesizes both a color mask C_t and an attention mask A_t, and outputs a refined frame that is the weighted average of the rendered frame and color mask).
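As a purely illustrative sketch of an image reconstruction loss between a predicted color value and a real color value, a generic L2 photometric term can be written as below; this is not the objective of Yi's attention-based generator or of the claims, and the function name and tensor shapes are assumptions.

```python
# Illustrative image reconstruction loss between predicted and ground-truth colors.
# A generic L2 photometric loss; not the specific objective used by any cited reference.
import torch

def image_reconstruction_loss(pred_rgb: torch.Tensor, real_rgb: torch.Tensor) -> torch.Tensor:
    # pred_rgb, real_rgb: (N, 3) colors for the same sampled pixels/rays.
    return torch.mean((pred_rgb - real_rgb) ** 2)
```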
(3) regarding claim 10:
Yi further disclosed the computer device according to claim 8, wherein the video generation model is obtained by optimizing a mouth emphasis loss between a color value of a predicted mouth and a color value of a real mouth, and the color value of the predicted mouth being generated by the neural radiance field based on the condition input, the three-dimensional coordinates, and the viewing direction (page 10, para. [0001], note that the second question was to ask the participants to select the best video in terms of overall quality (including head movement, lip synchronization and video quality). 3) Finally, to evaluate the personalized head movements, given a short video of the target person, we generate two synthesized videos using two different speeches. Then the third question was to ask the participants to watch two generated videos side by side and rate the head motion difference between them: 1 (exactly the same), 2 (slightly different), 3 (quite different), and 4 (totally different). We calculate the pose variety score as the average rating. A high pose variety score indicates that the method can generate diverse head poses for different speeches.).
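Purely for illustration, a mouth emphasis loss can be sketched as a reconstruction error re-weighted inside a mouth-region mask; the mask source and the weight value below are hypothetical and are not drawn from the claims or the cited references.

```python
# Illustrative "mouth emphasis" term: the reconstruction error is re-weighted inside a
# mouth-region mask so lip pixels contribute more to the total loss.
import torch

def mouth_emphasis_loss(pred_rgb, real_rgb, mouth_mask, weight: float = 5.0):
    # pred_rgb, real_rgb: (N, 3); mouth_mask: (N,) with 1.0 inside the mouth region, 0.0 elsewhere.
    per_pixel = torch.mean((pred_rgb - real_rgb) ** 2, dim=-1)
    weights = 1.0 + (weight - 1.0) * mouth_mask   # emphasize mouth pixels by the chosen factor
    return torch.sum(weights * per_pixel) / torch.sum(weights)
```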
(4) regarding claim 11:
Yi further disclosed the computer device according to claim 8, wherein the expression parameter of the target user is extracted by:
performing three-dimensional face reconstruction on the training video of the target user to obtain a face shape representation of a three-dimensional face shape of the target user (page 6, B. Stage 2: From 3D Facial Animation to Realistic Talking Face Video Generation, para. [0002], note that a detailed albedo is computed from the input video; that is, we first project the reconstructed 3D shape (i.e., a face mesh) onto the image plane, and then we assign the pixel color to each mesh vertex. In this way, the albedo is computed by dividing illumination. Finally, the albedo from the frame with the most neutral expression and the smallest rotation angles is set as the albedo of the video); and
determining the expression parameter of the target user based on the face shape representation (fig. 2, fig. 4, C. Comparison with State of the arts, para. [0001], note that the method is proposed with a focus on personalized head pose predicted from an input short video (Fig. 4), which meanwhile has comparable talking facial quality (e.g., lip synchronization and expression) as good as state-of-the-art methods).
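For illustration of the projection step quoted above (projecting the reconstructed face mesh onto the image plane and assigning a pixel color to each mesh vertex), the following sketch uses a generic pinhole camera model; the intrinsics and nearest-pixel sampling are assumptions rather than the cited method's code.

```python
# Illustrative sketch: project camera-space face-mesh vertices onto the image plane and
# sample a per-vertex color from the frame (generic pinhole projection).
import numpy as np

def project_and_sample(vertices: np.ndarray, image: np.ndarray, focal: float) -> np.ndarray:
    # vertices: (V, 3) camera-space points with z > 0; image: (H, W, 3); returns (V, 3) colors.
    h, w, _ = image.shape
    cx, cy = w / 2.0, h / 2.0
    u = focal * vertices[:, 0] / vertices[:, 2] + cx
    v = focal * vertices[:, 1] / vertices[:, 2] + cy
    u = np.clip(np.round(u).astype(int), 0, w - 1)
    v = np.clip(np.round(v).astype(int), 0, h - 1)
    return image[v, u]  # nearest-pixel color assigned to each vertex
```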
(5) regarding claim 12:
Yi further disclosed the computer device according to claim 8, wherein the head parameter of the target user is extracted by:
performing three-dimensional face reconstruction on the training video of the target user to obtain a face shape representation of a three-dimensional face shape of the target user (page 6, B. Stage 2: From 3D Facial Animation to Realistic Talking Face Video Generation, para. [0001], note that rendering of 3D Face With Personalized Pose: After reconstructing the 3D face of the target person and generating the expression and pose sequences, we obtain a sequence of 3DMM coefficients synchronized with the speech signal);
performing transformation and mapping on the three-dimensional face shape of the target user to obtain a rotation matrix and a translation vector corresponding to the three-dimensional face shape (page 6, A. Stage 1: From Audio-Visual Information to 3D Facial Animation, para. [0001], note that this method reconstructs the 3DMM coefficients χ(I) = {α, β, δ, γ, p} ∈ R^257, where α ∈ R^80 is the coefficient vector for face identity, β ∈ R^64 is for expression, δ ∈ R^80 is for texture, γ ∈ R^27 is the coefficient vector for illumination, and p ∈ R^6 is the pose vector including rotation and translation); and
determining the head pose information based on the rotation matrix, determining the head position information based on the translation vector, and obtaining the head parameter of the target user based on the head pose information and the head position information (page 6, A. Stage 1: From Audio-Visual Information to 3D Facial Animation, para. [0007], note that we use the head motion encoder and the head motion decoder to predict head pose. As shown in Fig. 1, (1) the head motion encoder first extracts a motion behavior pattern from the head pose sequence of the input short video; (2) then the head motion decoder predicts a head pose sequence from both the audio features of the input speech and the motion behavior pattern).
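Purely as an illustration of splitting a 3DMM-style pose vector p ∈ R^6 into a rotation matrix (head pose information) and a translation vector (head position information), the sketch below assumes an XYZ Euler-angle convention and a [pitch, yaw, roll, tx, ty, tz] layout; neither is asserted to be the convention of the cited references.

```python
# Illustrative split of a 6-dimensional pose vector into head pose (rotation matrix)
# and head position (translation vector). Angle convention and layout are assumed.
import numpy as np

def head_parameter_from_pose(p: np.ndarray):
    # p: (6,) = [pitch, yaw, roll, tx, ty, tz] (assumed layout).
    pitch, yaw, roll = p[:3]
    translation = p[3:]                       # head position information
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    rotation = Rz @ Ry @ Rx                   # head pose information as a rotation matrix
    return rotation, translation
```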
(6) regarding claim 13:
Yi further disclosed the computer device according to claim 8, wherein the obtaining a training video of a target user comprises:
obtaining an initial video of preset duration, the initial video recording audio content of a speech of the target user (page 4, III. Predicting Head Movement From Multi-Modal Input, para. [0004], note that we characterize the head motion in each video of 10-15 seconds by a motion behavior pattern. Our modeling strategy has the following characteristics. First, different people have different motion behavior patterns. Second, since the coupling between speech and head motion changes from utterance to utterance, one person can have multiple motion behavior patterns in multiple short videos, but most of the time one short video of a person (which is 10–15 seconds and only contains one basic utterance unit) only corresponds to one motion behavior pattern).
Yi disclosed most of the subject matter as described above except for specifically teaching performing preprocessing on the initial video to obtain the training video by anchoring a portrait of the target user of the initial video in a central area of a video frame of the training video.
However, Shu disclosed performing preprocessing on the initial video to obtain the training video by anchoring a portrait of the target user of the initial video in a central area of a video frame of the training video (para. [0042], note that the input frame 102 is a predefined frame, for example, a first frame of the video, a center frame of the video, or other predetermined frame. The user may request to display the editable 3D scene 104 corresponding to the input frame 102 so that the user can edit the particular frame of the video 103 and generate a modified video 106 including the edited input frame 102).
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to modify Yi to include performing preprocessing on the initial video to obtain the training video by anchoring a portrait of the target user of the initial video in a central area of a video frame of the training video. The suggestion/motivation for doing so would have been in order to generate, based on an input video and using a deformable NeRF scene representation model, a 3D scene including an editable object (para. [0003]). Therefore, it would have been obvious to combine Yi with Shu to obtain the invention as specified in claim 13.
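For illustration only, the preprocessing limitation can be pictured as cropping each frame so that a detected face box sits in the central area of the training frame; the sketch below assumes a face box supplied by an arbitrary detector and a fixed output size, neither of which comes from the claims or the cited references.

```python
# Illustrative preprocessing step: crop each frame so the face box is centered in the output frame.
import numpy as np

def center_portrait(frame: np.ndarray, face_box, out_size: int = 512) -> np.ndarray:
    # frame: (H, W, 3); face_box: (x0, y0, x1, y1) pixel coordinates of the detected face.
    h, w, _ = frame.shape
    cx = (face_box[0] + face_box[2]) // 2
    cy = (face_box[1] + face_box[3]) // 2
    half = out_size // 2
    x0 = int(np.clip(cx - half, 0, max(w - out_size, 0)))
    y0 = int(np.clip(cy - half, 0, max(h - out_size, 0)))
    return frame[y0:y0 + out_size, x0:x0 + out_size]  # portrait anchored near the frame center
```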
(7) regarding claim 14:
Yi further disclosed the computer device according to claim 8, wherein the reconstructed video of the target user is obtained by:
obtaining a preset number of target video frames from the target video (page 5, IV. Talking Face Video Generation System, para. [0004], note that we render the 3D facial animation into video frames using the texture and lighting information obtained from input video);
inputting each target video frame to the video generation model, and calculating a reconstructed video frame corresponding to the target video frame (page 5, IV. Talking Face Video Generation System, para. [0004], note that we then use a rendering-to-realistic GAN module that can refine these rendered frames into realistic ones); and
synthesizing the reconstructed video frames to obtain the reconstructed video corresponding to the target user (page 7, Rendering-to-Realistic GAN for Refining Frames, para. [0003], note that We model the frame refinement process as a function Φ that maps from the rendered frame (i.e., synthesized frame rendered by the graphic engine) domain R to the real frame domain T using paired training data {(ri, gi)}, ri ∈ R and gi ∈ T . Given a real frame gi, the corresponding frame ri in the training data is synthesized by rendering the 3D face reconstructed from gi).
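Purely as an illustration of the claimed frame-by-frame reconstruction, the sketch below reads target video frames, passes each through a stand-in video_generation_model callable, and writes the reconstructed frames back into a video; the codec and frame rate are arbitrary choices and the model is hypothetical.

```python
# Illustrative loop: run each target frame through a (hypothetical) video generation model and
# assemble the reconstructed frames into an output video.
import cv2

def reconstruct_video(target_path: str, output_path: str, video_generation_model, fps: float = 25.0):
    reader = cv2.VideoCapture(target_path)
    writer = None
    while True:
        ok, frame = reader.read()                  # obtain the next target video frame
        if not ok:
            break
        recon = video_generation_model(frame)      # compute the reconstructed video frame
        if writer is None:
            h, w, _ = recon.shape
            writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        writer.write(recon)                        # synthesize frames into the reconstructed video
    reader.release()
    if writer is not None:
        writer.release()
```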
The proposed rejection, as explained for claims 8-14, renders obvious the steps of the method claims 1-7 and the non-transitory computer-readable storage medium claims 15-20 because these steps occur in the operation of the proposed combination as discussed above. Thus, arguments similar to those presented above for claims 8-14 are equally applicable to claims 1-7 and 15-20.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Gafni et al. (NPL, “Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction”, 2021) disclosed a dynamic neural radiance fields for modeling the appearance and dynamics of a human face.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communication from the examiner should be directed to Hilina K Demeter whose telephone number is (571) 270-1676.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Y. Poon, can be reached at (571) 270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/HILINA K DEMETER/Primary Examiner, Art Unit 2617