Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The Amendment filed October 31, 2025 has been entered and considered. Claims 1 and 13 have been amended. Claims 5, 11-12, 18, and 24-28 were canceled by way of previous amendment. In light of the amendment, the prior art rejections of claims 1 and 13 are withdrawn as moot. The new grounds of rejection set forth in the present action were necessitated by Applicant's claim amendments.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-4, 6-7, 9, 13-17, 19, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Yin et al. (NPL, published 3/16/2022, PDF attached) in view of Liu (NPL, previously cited), and further in view of Mouizina (US Patent Pub. No. 2022/0192346, previously cited).
[Yin, Fig. 1 (reprinted)]
Regarding claim 1, Yin teaches a method in a computing system having a display device and a camera (Fig. 1, shows pictures being used and displayed), the method comprising: with the camera, capturing an audio/video sequence of a person speaking, the captured audio/video sequence comprising a sequence of frames and an audio track (Pg. 1, Col. 1, “One-shot talking face generation aims at synthesizing a high-quality talking face video from an arbitrary portrait image, driven by a video or an audio segment”; Pg. 1, Col. 2, “We design a video-based motion generation module and an audio-based one, which can be plugged into the framework either individually or jointly to drive the video generation”); causing presentation on the display of a plurality of images of the person (Fig. 1, reprinted above); receiving user input selecting one of the plurality of images, wherein the selected image shows a target environmental setting including at least one of particular hair, particular makeup, particular clothing, particular background, or particular lighting (Fig. 1, reprinted above; Pg. 6, Col. 1, “The goal of the video-driven motion generator is to generate dense flows with the driving video and the source image as inputs”, a user needs to select the source image for input); performing facial mapping for the frames in the captured audio/video sequence to produce a first facial mapping result (Pg. 6, Col. 1, “We use the 3DMM parameters 𝒑𝑡 from the driving frame 𝑑𝑡 as the motion representation. Specifically, these parameters are first mapped to a latent vector via a 3-layer MLP to aggregate the temporal information”, the driving frame is the captured audio/video sequence); performing facial mapping for the selected image to produce a second facial mapping result (Pg. 5, Col. 2, “Given a single source image, we first use the GAN inversion method [55] to get the latent style code and feature maps of the source image.”); for each frame of the captured audio/video sequence: spatially correlating the frame with the selected image using the first and second facial mapping results to produce spatial correlation results (Fig. 4, partly reprinted below; both
[Yin, Fig. 4 (reprinted in part)]
mapping results are used to produce flow fields (spatial correlation results)); and generating a target frame consistent with the target environmental setting shown in the selected image, at least in part by warping and transferring one or more first regions of the frame of the captured audio/video sequence to one or more second regions of the selected image using the spatial correlation results while maintaining one or more third regions of the selected image in the generated target frame, wherein the one or more first regions warped and transferred correspond to one or more facial portions of the person shown in the frame and wherein the one or more third regions maintained correspond to one or more body portions of the person shown in the selected image; and combining the audio track with the generated target frames to obtain a resulting audio/video sequence (Figs. 9-10 showcase qualitative results of the method, displaying the warping and transferring of facial portions of the source image based on the driving frame while maintaining the body portions of the source image to generate a target frame).
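To illustrate the motion-representation step quoted above (3DMM parameters mapped to a latent vector via a 3-layer MLP), the following is a minimal PyTorch sketch. The parameter count (257) and latent width (512) are illustrative assumptions, not values taken from Yin.

    # Minimal sketch: a 3-layer MLP mapping per-frame 3DMM parameters to a
    # latent motion vector, per the Yin passage quoted above. Dimensions are
    # assumptions for illustration.
    import torch
    import torch.nn as nn

    class MotionEncoder(nn.Module):
        def __init__(self, num_3dmm_params=257, latent_dim=512):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(num_3dmm_params, latent_dim), nn.ReLU(),
                nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                nn.Linear(latent_dim, latent_dim),
            )

        def forward(self, p_t):
            # p_t: (batch, num_3dmm_params) 3DMM coefficients of driving frame d_t
            return self.mlp(p_t)

    encoder = MotionEncoder()
    latent = encoder(torch.randn(1, 257))  # latent motion vector, shape (1, 512)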
Yin does not explicitly disclose performing facial mapping at least in part by detecting first facial points including corners of eyes, mouth, and nose from each of the frames, or performing spatial correlation based, at least in part, on the detected facial points. Yin also does not explicitly disclose receiving, responsive to the presentation on the display, user input selecting one of the plurality of images presented on the display. However, Yin does require the user to enter the file path of the image into the code (Pg. 21).
Liu teaches performing facial mapping for the frames in the captured audio/video sequence to produce a facial mapping result, at least in part by detecting first facial points including corners of eyes, mouth, and nose from each of the frames (Fig. 1, reprinted below;
[Liu, Fig. 1 (reprinted)]
Pg. 2, Col. 1, “For the generation of target’s facial landmark, a GAN-based synthesizer is utilized to build the mapping from source’s FAUP to target’s facial landmark.”)... and for each frame of the captured video sequence: spatially correlating the frame with the selected image using the first and second facial mapping results to produce spatial correlation results based, at least in part, on the detected first and second facial points (Fig. 4, reprinted below;
[Liu, Fig. 4 (reprinted)]
which shows spatial correlation results based on the detected facial points).
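For concreteness, the claimed facial points can be detected per frame with an off-the-shelf landmark model. The sketch below uses dlib's 68-point predictor as a stand-in for Liu's landmark stage; the model file path and the index groupings (standard in the 68-point scheme) are assumptions, not Liu's implementation.

    # Minimal sketch: detect corners of eyes, mouth, and the nose in a frame.
    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()
    # Hypothetical local path to dlib's standard 68-point landmark model
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    # Standard 68-point indices: eye corners, mouth corners, nose tip
    KEYPOINTS = {"eye_corners": [36, 39, 42, 45],
                 "mouth_corners": [48, 54],
                 "nose": [33]}

    def facial_points(frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if not faces:
            return None  # no face found in this frame
        shape = predictor(gray, faces[0])
        return {name: [(shape.part(i).x, shape.part(i).y) for i in idxs]
                for name, idxs in KEYPOINTS.items()}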
Mouizina teaches causing presentation on the display of a plurality of images of the person; responsive to the presentation on the display, receiving user input selecting one of the plurality of images presented on the display, wherein the selected image shows a target environmental setting including at least one of particular hair, particular makeup, particular clothing, particular background, or particular lighting (Para. 139, “As shown in FIG. 22C, the display can show a plurality of user-selectable options 210A-210H. Each of the user-selectable options 210A-210H are associated with a unique modification of a first characteristic related to the user. In this example, that characteristic is hair color, and each individual user-selectable option 210A-21H is associated with a unique proposed modification of that characteristic, e.g. different hair colors. Each of the user-selectable options 210A-210H can also include an image showing the different proposed modifications.” Para. 140, “Once the system 10 receives via an input device a selection from the user of one of the options 210A-210H, the display can show a modified image 208 that shows the user with the proposed modification of the characteristic corresponding to the elected option.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yin to incorporate the teachings of Liu and Mouizina to include facial mapping using key facial points as well as receiving user input, responsive to a display, selecting one of the plurality of images presented on the display. Yin performs facial mapping using GAN inversion, and a predicted flow field is used as the motion descriptor; however, this warping “introduces noticeable artifacts in the final output, especially around the eyes and teeth” (Pg. 2, Col. 2), as disclosed by Yin. One of ordinary skill in the art would recognize that the facial mapping of Yin could be improved, making it obvious to implement the similar GAN-based method of Liu for more accurate and precise facial mapping and motion synthesis.
The method of Yin requires a user to input images and videos, but Yin does not disclose a presentation on a display to perform these actions. One skilled in the art would recognize that allowing a user to specify the input image after it is shown on a display is an obvious enhancement that gives the user more control over the merging process. Since the user can see the environment and characteristics of the image used for the merge, the user can find “any other suitable or desired modification” (Para. 156), as disclosed by Mouizina.
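As a hedged illustration of the selection step this combination relies on, the sketch below presents a plurality of candidate source images and receives the user's choice. The file names are hypothetical; a production system would render thumbnails on the display device rather than listing paths in a console.

    # Minimal sketch: present candidate images, receive the user's selection.
    candidate_images = ["studio_lighting.png",     # hypothetical file names
                        "office_background.png",
                        "formal_clothing.png"]

    for i, path in enumerate(candidate_images):
        print(f"[{i}] {path}")  # in practice: draw each image on the display

    choice = int(input("Select a target environmental setting: "))
    selected_image = candidate_images[choice]  # source image driving the merge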
Regarding claim 2, Yin as modified teaches all of the elements of claim 1, as stated above, as well as wherein the one or more first regions comprise at least one region containing the mouth of the person, and wherein the one or more third regions comprise at least one region containing one or more of hair of the person, clothing worn by the person, makeup applied to the person, and surroundings of the person (Fig. 1, reprinted above, shows the surroundings, as well as all aspects of the person being captured and subjected to feature mapping and transferring).
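The region split recited in this claim can be pictured as a mask-based composite: warped facial regions replace the corresponding regions of the selected image, while the body regions of the selected image are kept. The sketch below is illustrative only; the binary face mask is a hypothetical input, since Yin composites in feature space rather than with an explicit mask.

    # Minimal sketch: keep body/background of the selected image, transfer the
    # warped facial regions. All arrays are hypothetical inputs.
    import numpy as np

    def composite(warped_face, selected_image, face_mask):
        # warped_face, selected_image: (H, W, 3) uint8; face_mask: (H, W) in [0, 1]
        m = face_mask[..., None].astype(np.float32)
        out = m * warped_face + (1.0 - m) * selected_image
        return out.astype(np.uint8)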
Regarding claim 3, Yin as modified teaches all of the elements of claim 2, as stated above, as well as wherein the one or more first regions comprise at least one region collectively containing the eyes, cheeks, and mouth of the person (Fig. 1, reprinted above, shows the eyes, cheeks, and mouth being captured).
Regarding claim 4, Yin as modified teaches all of the elements of claim 1, as stated above, as well as for each frame of the captured audio/video sequence: before the generating of the target frame, applying a geometric transformation to warp the selected image to more closely match a shape and size of the person's face in the frame using the spatial correlation results (Fig. 2, reprinted below; Pg. 2, Col. 2, “The predicted flow field is used to spatially warp the latent feature map”; Pg. 4, Col. 2, “To determine the proper layer for performing spatial transformation, we warp the feature map of each layer individually.”).
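As a hedged illustration of warping a feature map with a predicted flow field, per the Yin passage quoted above, the sketch below uses torch.nn.functional.grid_sample. The tensor shapes and the zero (identity) flow are assumptions for demonstration; in Yin the flow is predicted by the motion module.

    # Minimal sketch: spatially warp a latent feature map with a flow field.
    import torch
    import torch.nn.functional as F

    def warp_feature_map(feat, flow):
        # feat: (N, C, H, W); flow: (N, H, W, 2) offsets in normalized [-1, 1]
        # coordinates (zero offsets reproduce the input exactly)
        n, _, h, w = feat.shape
        gy, gx = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(n, h, w, 2)
        return F.grid_sample(feat, base + flow, mode="bilinear",
                             align_corners=True)

    feat = torch.randn(1, 64, 32, 32)
    flow = torch.zeros(1, 32, 32, 2)       # a predicted flow would go here
    warped = warp_feature_map(feat, flow)  # equals feat for zero flow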
Regarding claim 6, Yin as modified teaches all of the elements of claim 1, as stated above, as well as causing presentation on the display of at least a portion of the resulting audio/video sequence (Fig. 1, reprinted above, shows the output being displayed).
Regarding claim 7, Yin as modified teaches all of the elements of claim 1, as stated above, as well as persistently storing the resulting audio/video sequence (Fig. 1, reprinted above, shows the resulting audio/video sequence, which would need to be stored).
[Yin, Fig. 5 (reprinted)]
Regarding claim 9, Yin as modified teaches all of the elements of claim 1, as stated above, as well as wherein the generating of the target frame uses output of an identity module trained with video captured of the person (Fig. 5, reprinted above).
Regarding claim 13, the computer-readable media perform substantially the same function as the method of claim 1. It is rejected under the same analysis.
Regarding claim 14, the recited elements perform the same function as that of claim 2. It is rejected under the same analysis.
Regarding claim 15, the recited elements perform the same function as that of claim 4. It is rejected under the same analysis.
[Yin, Fig. 2 (reprinted)]
Regarding claim 16, Yin as modified teaches all of the elements of claim 15, as stated above, as well as wherein the geometric transformation uses one or more of an affine transformation, a thin plate spline transformation, or both an affine transformation and a thin plate spline transformation (Fig. 2, reprinted above).
Regarding claim 17, Yin as modified teaches all of the elements of claim 13, as stated above, as well as wherein the spatial correlation uses a keypoint detection technique (Pg. 5, Col. 1, “Finally, when the complicated deformations, such as TPS operations, are applied to the feature map, the source image is also interpolated to match the randomly sampled target keypoints”).
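The two transformation options of claim 16, and the keypoint-driven thin plate spline of claim 17, can be illustrated with OpenCV (the TPS transformer requires the opencv-contrib build). The point correspondences below are illustrative assumptions, not values from the references.

    # Minimal sketch: an affine warp and a thin plate spline (TPS) warp.
    import cv2
    import numpy as np

    img = np.zeros((240, 320, 3), dtype=np.uint8)  # placeholder image

    # Affine: three point correspondences determine the transform exactly
    src = np.float32([[50, 50], [200, 50], [50, 200]])
    dst = np.float32([[60, 70], [210, 60], [70, 220]])
    M = cv2.getAffineTransform(src, dst)
    affine_warped = cv2.warpAffine(img, M, (320, 240))

    # TPS: a non-rigid warp fit to matched keypoints (e.g., detected facial
    # points); note OpenCV expects the target shape first when the source
    # image is warped afterward
    tps = cv2.createThinPlateSplineShapeTransformer()
    matches = [cv2.DMatch(i, i, 0) for i in range(3)]
    tps.estimateTransformation(dst.reshape(1, -1, 2), src.reshape(1, -1, 2),
                               matches)
    tps_warped = tps.warpImage(img)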
Regarding claim 23, the recited elements perform the same function as that of claim 9. It is rejected under the same analysis.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID A WAMBST whose telephone number is (703)756-1750. The examiner can normally be reached M-F 9-6:30 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Gregory Morse can be reached at (571)272-3838. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DAVID ALEXANDER WAMBST/ Examiner, Art Unit 2663
/GREGORY A MORSE/ Supervisory Patent Examiner, Art Unit 2698