DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments, see p. 7, 1st paragraph, filed December 12, 2025, with respect to the objection to the specification for lacking a “What is claimed is” statement have been fully considered and are persuasive. The objection to the specification for lacking a “What is claimed is” statement has been withdrawn.
Applicant's arguments filed December 12, 2025, with respect to the objection to the specification for the title have been fully considered but they are not persuasive. Applicant states that the title will be submitted upon indication of allowable subject matter if necessary (p. 6).
In reply, the Examiner points out that since the title has not yet been amended, the objection to the specification for the title is maintained.
Applicant’s arguments with respect to claim(s) 1-5 have been considered but are moot because new grounds of rejection are made in view of Meng (see citation below) and Berlin (US 20210056348A1).
Specification
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-5 are rejected under 35 U.S.C. 103 as being unpatentable over Meng (see citation below) in view of Jeong (US 20220207262A1) and Berlin (US 20210056348A1).
As per Claim 1, Meng teaches an image processing device comprising: at least one processor, wherein the at least one processor executes computer-executable instructions that cause the image processing device to perform operations comprising: receiving a voice signal; generating, based on the received voice signal as input, by machine learning, an action unit of the voice signal, wherein the action unit describes movement of a mimic muscle of a first face as a part of an action unit sequence describing a sequence of movements of the mimic muscle of the first face; generating, based on the action unit and a face still image as input, by machine learning, a sequence of facial images, wherein machine learning transforms an expression of the face still image into another expression in the sequence of facial images according to the movement of the mimic muscle of the first face as specified by the action unit; and updating, based on the action unit sequence, display of the face still image according to the voice signal (recognizing facial action units (AUs), p. 537, Abstract; facial activities can be “heard”, i.e., inferred from the information extracted from the audio channel, facial AUs and voice are highly correlated, voice/speech has strong physiological relationships with some lower-face AUs, because jaw and lower-face muscular movements are the major mechanisms to produce differing sounds, p. 537, last paragraph; learn the relationships between AUs and phonemes in a CTBN model, p. 538, last paragraph; extracted from the visual channel, i.e., static images, p. 539, left column, 4th paragraph).
However, Meng does not expressly teach a memory; and the at least one processor is connected to the memory; and that the generating, based on the received voice signal, is performed by a first neural network; and the generating, based on the action unit and a face, by a second neural network, and the second neural network transforms the expression of the face into another expression. However, Jeong teaches a memory; and at least one processor connected to the memory (storage module of the user client 100, [0057]); generating, based on the received voice signal as input, by a first neural network (20), an emotion vector of the voice signal (audio encoder 20 encodes the audio data that is a basis for the mouth shape synthesis as audio including a specific person’s audio, audio encoder 20 may be configured of ConvNet including Convolution Layers that receive a mel spectrogram (audio data) as the input data, [0074]; mouth shape synthesis device 1 using an artificial neural network may be configured such that the audio embedding vector that is the output vector of the audio encoder 20 and the emotion vector that distinguishes a plurality of emotion state Classes are output together, [0108], Fig. 10 shows that audio data is input to the audio encoder 20, which outputs the emotion vector); generating, based on the emotion vector and a face video as input, by a second neural network, a sequence of facial images, wherein the second neural network transforms an expression of the face video into another expression in the sequence of facial images (original video encoder 10 encodes the original video data that is the target of the mouth shape synthesis as a video including the face of the synthesis target and outputs an original video embedding vector, original video encoder 10 may be configured of a ConvNet which receives a specific frame of the original video as input data, [0060], synthesized video decoder 30 uses a combined vector in which the original embedding vector output from the original video encoder 10 and the audio embedding vector output from the audio encoder 20 are combined as the input data, and outputs synthetic video data that is a sequence of n specific frames in which the mouth shape corresponding to the audio data is synthesized to the target face that is the synthesis target, synthesized video decoder 30 may be configured of the ConvNet, [0076], mouth shape synthesis device 1 using an artificial neural network, the face of the specific person is included in the video, the original video data is input to the original video encoder 10 to output the original video embedding vector, the audio segment is input to the audio encoder 20 to output the audio embedding vector and the emotion vector (a plurality of emotion state Classes such as happiness, sadness, surprise, fear, and emotionlessness), and then the emotion vector, the original video embedding vector, and the audio embedding vector are input to the synthesized video decoder 30 to output synthesized video data, [0110], according to this, synchronization with the audio data of the mouth shape of the target face output from the synthesized video data is achieved, and at the same time, an effect that the mouth shape is synthesized according to the emotion state included in each segment of the audio data is generated, since the mouth shape is synthesized according to the emotion state of the audio data, which is the basis of the synthesis, an effect is generated in which a more natural synthesized video can be generated from the 
viewpoint of the viewer of the synthesized video, [0112], Fig. 10 shows that the audio encoder 20 outputs the audio embedding vector and emotion vector to the synthesized video decoder 30, and the original video embedding vector is also input to the synthesized video decoder 30, and the synthesized video decoder 30 outputs the synthesized video data). Jeong teaches that the mouth shape is synthesized according to the emotion state of the audio data, which is the basis of the synthesis [0112]. Thus, the emotion vector is equivalent to an action unit [0112]. Thus, Jeong teaches generating, based on the received voice signal as input, by a first neural network (20), an action unit of the voice signal [0074, 0108, 0112] (Fig. 10); generating, based on the action unit and a face video as input, by a second neural network, a sequence of facial images, wherein the second neural network transforms an expression of the face video into another expression in the sequence of facial images [0060, 0076, 0110, 0112] (Fig. 10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Meng to include a memory; and the at least one processor is connected to the memory; and that the generating, based on the received voice signal, is performed by a first neural network; and the generating, based on the action unit and a face, by a second neural network, and the second neural network transforms the expression of the face into another expression because Jeong suggests that this improves the precision of the synthesis of a mouth shape and reality through an audio input [0007].
However, Meng and Jeong do not teach generating, based on the face still image of a second face as input, the sequence of facial images of the second face. However, Berlin teaches generating, based on the received voice signal and a face still image of a second face as input, by a neural network, a sequence of facial images of the second face, wherein the neural network transforms an expression of the face still image into another expression in the sequence of facial images according to the received voice signal (face swapping process, the source data set of CGI facial images is provided to the trained artificial intelligence engine (trained autoencoder), the destination data set of CGI facial images is provided to the trained artificial intelligence engine, the trained artificial intelligence engine performs the face swap (by feeding the destination data set facial images to the encoder trained using source and destination facial images and to the decoder trained using the source facial images), [0144], the output video sequence images are rendered on a display (the destination image with the destination face replaced with the CGI facial image), if the destination images were associated with a speech track recorded of the person in the source images speaking, the output may have the lips and facial expressions of the CGI face synchronized with the speech track so that it appears that the CGI face is speaking the recorded speech track in the same manner as the original face, [0145], generator neural network configured to synthesize a face and swap a source face likeness from a corresponding source image to a destination face in a corresponding destination image, [0015]). Thus, this teaching of the second face from Berlin can be incorporated into the combination of Meng and Jeong so that it generates, based on a face still image of a second face as input, the sequence of facial images of the second face.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Meng and Jeong to include generating, based on the face still image of a second face as input, the sequence of facial images of the second face because Berlin suggests that face-swapping is well-known in the art, and this is a more computer resource-efficient and time-efficient technique to perform face-swapping [0003].
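Examiner's note (for illustration only; forming no part of the rejection): the two-network arrangement discussed above for Claim 1, in which a first network maps the voice signal to an action unit sequence and a second network transforms a still image of a second face according to each action unit, may be sketched roughly as follows. The sketch assumes PyTorch, and all module names, layer sizes, and tensor shapes are hypothetical placeholders; it is not the applicant's, Meng's, Jeong's, or Berlin's actual implementation.

```python
# Illustrative sketch only (forming no part of the rejection). All names,
# layer sizes, and tensor shapes below are hypothetical placeholders.
import torch
import torch.nn as nn

class AudioToActionUnits(nn.Module):
    """First network: maps a mel-spectrogram of the voice signal to a
    per-frame action-unit (AU) vector (analogous to an audio encoder)."""
    def __init__(self, n_mels=80, n_aus=17):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.head = nn.Conv1d(128, n_aus, kernel_size=1)

    def forward(self, mel):                  # mel: (batch, n_mels, time)
        return self.head(self.conv(mel))     # (batch, n_aus, time)

class AUsToFrame(nn.Module):
    """Second network: transforms the expression of a still face image
    according to one AU vector (analogous to a conditioned decoder)."""
    def __init__(self, n_aus=17):
        super().__init__()
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.au_proj = nn.Linear(n_aus, 64)
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, face, au):             # face: (B, 3, H, W), au: (B, n_aus)
        feat = self.image_enc(face)
        cond = self.au_proj(au)[:, :, None, None]   # broadcast the AU over space
        return self.decode(feat + cond)              # (B, 3, H, W)

# Toy driver: a voice signal and a still image of a second face yield a frame sequence.
audio_net, frame_net = AudioToActionUnits(), AUsToFrame()
mel = torch.randn(1, 80, 25)                 # 25 audio frames of a voice signal
second_face = torch.rand(1, 3, 64, 64)       # still image of a different face
au_seq = audio_net(mel)                      # AU sequence generated from the voice signal
frames = [frame_net(second_face, au_seq[:, :, t]) for t in range(au_seq.shape[-1])]
# `frames` is the sequence of facial images used to update the display per the voice signal.
```

In such an arrangement, the per-frame outputs of the first network condition the second network, so that the display of the still image is updated frame by frame according to the voice signal.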
As per Claim 2, Meng teaches machine learning that inputs a face still image and outputs an action unit (p. 537, Abstract; p. 537, last paragraph; p. 538, last paragraph; p. 539, left column, 4th paragraph).
However, Meng does not teach wherein the processor is further configured to: learn the first neural network in a manner such that an error between an action unit outputted from a voice signal of a face moving image with voice and an action unit extracted in advance in each frame of the face moving image is reduced; and learn the second neural network by using a third neural network that inputs a face still image and outputs an action unit in a manner such that an error between the action unit of an input of the second neural network and an action unit outputted by inputting the generated sequence of facial images to the third neural network is reduced. However, Jeong teaches wherein the processor is further configured to: learn the first neural network in a manner such that an error between an action unit outputted from a voice signal of a face moving image with voice and an action unit extracted in advance in each frame of the face moving image is reduced (learning session of the sync discriminator may be configured such that the lower half video of the face of the specific frame segment of the learning video data for which the synchronization of the mouth shape and the audio is completed is input to the synthesized video embedding module to output the learning video sync vector, the audio segment of the learning video data is input to the audio embedding module 41 to output the audio sync vector, and then the sync loss between the learning video sync vector and the audio sync vector is calculated, and the parameters of the audio embedding module 41 and the synthesized video embedding module are updated in a direction in which the sync loss is minimized, [0092], the audio data and the synthesized video data are input to the sync discriminator, and the audio sync vector in the audio embedding module 41 and the synthesized video sync vector in the synthesized video embedding module are respectively output, and the sync loss is calculated based on the output audio sync vector and the synthesized video sync vector, [0093], loss function with respect to the synchro rate for the audio and the synthesized video is separately included, so that an effect of further improving the audio synchro rate for the mouth shape of the synthesized video that is generated by the synthesized video decoder is generated, [0094]); and learn the second neural network by using a third neural network (31) (synthesized video discriminator 31 is an artificial neural network module learned to receive the synthesized video data output from the synthesized video decoder 30, and output a synthesized video discrimination vector that distinguishes Real (original unsynthesized by the synthesized video decoder) and Fake (synthesis synthesized by the synthesized video decoder) of the synthesized video data, [0079]) that inputs a face video image and outputs an emotion vector in a manner such that an error between the emotion state of an input of the second neural network and an emotion vector outputted by inputting the generated image to the third neural network is reduced (learning session may be configured such that the parameters of the original video encoder 10, the audio encoder 20 and the synthesized video decoder 30 are updated in a direction in which a sum of the reconstruction loss calculated based on the synthesized video data and the original video data, the synthesized video loss in the pre-learned synthesized video discriminator 31, the sync loss, and the emotion loss calculated based on the synthesized video data 
and the audio data in the pre-learned sync discriminator 40 is reduced, in the learning session of the synthesized video discriminator 31, the learning session may be configured such that the original video data and the synthesized video data output from the synthesized video decoder 30 are input to the synthesized video discriminator 31, the synthesized video discrimination vector that distinguishes Real and Fake of synthetic video data, and the emotion vector are output in the synthesized video discriminator 31, and the parameters of the synthesized video discriminator 31 are updated in a direction in which the loss including the difference between the synthesized video discrimination vector and the actual Label (Real or Fake) of the input data, and the difference between the emotion vector and the actual emotion state Label of the input data is reduced, [0110]). Jeong teaches that the mouth shape is synthesized according to the emotion state of the audio data, which is the basis of the synthesis [0112]. Thus, the emotion vector and the emotion state are equivalent to an action unit, since the mouth shape is synthesized according to the emotion state of the audio data, which is the basis of the synthesis [0112]. Thus, Jeong teaches learning the second neural network by using a third neural network (31) [0079] that inputs a face video and outputs an action unit in a manner such that an error between the action unit of an input of the second neural network and an action unit outputted by inputting the generated sequences of facial images to the third neural network is reduced [0110, 0112]. Since Meng teaches machine learning that inputs a face still image and outputs an action unit (p. 537, Abstract; p. 537, last paragraph; p. 538, last paragraph; p. 539, left column, 4th paragraph), this teaching from Jeong can be implemented with the face still image from Meng so that it uses the third neural network that inputs a face still image and outputs the action unit.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Meng to include learn the first neural network in a manner such that an error between an action unit outputted from a voice signal of a face moving image with voice and an action unit extracted in advance in each frame of the face moving image is reduced; and learn the second neural network by using a third neural network that inputs a face still image and outputs an action unit in a manner such that an error between the action unit of an input of the second neural network and an action unit outputted by inputting the generated sequence of facial images to the third neural network is reduced because Jeong suggests that this way, the face image that is output is more accurate [0092].
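Examiner's note (for illustration only; forming no part of the rejection): the two training criteria recited in Claim 2 may be sketched as follows, assuming PyTorch and standing in generic linear modules for the first, second, and third networks; all names, sizes, and tensors are hypothetical placeholders, not the applicant's or the references' actual models.

```python
# Illustrative sketch only (forming no part of the rejection); hypothetical stand-in modules.
import torch
import torch.nn as nn

n_aus, n_frames, img_dim = 17, 25, 3 * 64 * 64
first_net  = nn.Linear(80, n_aus)                 # per-frame voice features -> AU vector
second_net = nn.Linear(img_dim + n_aus, img_dim)  # still face + AU -> generated frame
third_net  = nn.Linear(img_dim, n_aus)            # frame -> AU (pre-learned, held fixed)
for p in third_net.parameters():
    p.requires_grad_(False)

mse = nn.MSELoss()
opt = torch.optim.Adam(list(first_net.parameters()) + list(second_net.parameters()), lr=1e-4)

# (1) First network: reduce the error between the AU output from the voice signal of a
# face moving image with voice and the AU extracted in advance in each frame of that video.
voice_feats = torch.randn(n_frames, 80)
target_aus  = torch.randn(n_frames, n_aus)        # stand-in for pre-extracted per-frame AUs
loss_first  = mse(first_net(voice_feats), target_aus)

# (2) Second network: reduce the error between the AU given as input and the AU that the
# third network outputs when the generated frame is fed back to it (a consistency loss).
still_face  = torch.rand(1, img_dim)
input_au    = torch.randn(1, n_aus)
generated   = second_net(torch.cat([still_face, input_au], dim=1))
loss_second = mse(third_net(generated), input_au)

opt.zero_grad()
(loss_first + loss_second).backward()
opt.step()                                        # step in the direction that reduces both errors
```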
As per Claim 3, Claim 3 is similar in scope to Claim 2, and therefore is rejected under the same rationale.
As per Claim 4, Meng teaches an action unit generated by machine learning from a still image (p. 537, Abstract; p. 537, last paragraph; p. 538, last paragraph; p. 539, left column, 4th paragraph).
However, Meng does not teach wherein the processor is configured to learn the third neural network in a manner such that an error between an action unit generated by the third neural network from a still image of learning data and an action unit extracted from a still image of learning data is reduced. However, Jeong teaches wherein the processor is configured to learn the third neural network (31) in a manner such that an error between an emotion vector generated by the third neural network from a video of learning data and an emotion state extracted from a video of the learning data is reduced (in the learning session of the synthesized video discriminator 31, the learning session may be configured such that the original video data and the synthesized video data output from the synthesized video decoder 30 are input to the synthesized video discriminator 31, the synthesized video discrimination vector that distinguishes Real and Fake of synthetic video data, and the emotion vector are output in the synthesized video discriminator 31, and the parameters of the synthesized video discriminator 31 are updated in a direction in which the loss including the difference between the synthesized video discrimination vector and the actual Label (Real or Fake) of the input data, and the difference between the emotion vector and the actual emotion state Label of the input data is reduced, [0110]). Jeong teaches that the mouth shape is synthesized according to the emotion state of the audio data, which is the basis of the synthesis [0112]. Thus, the emotion vector and the emotion state are equivalent to an action unit, since the mouth shape is synthesized according to the emotion state of the audio data, which is the basis of the synthesis [0112]. Thus, Jeong teaches learning the third neural network (31) in a manner such that an error between an action unit generated by the third neural network from a video of learning data and an action unit extracted from a video of the learning data is reduced [0110, 0112]. Since Meng teaches an action unit generated by machine learning from a still image (p. 537, Abstract; p. 537, last paragraph; p. 538, last paragraph; p. 539, left column, 4th paragraph), this teaching from Jeong can be implemented with the still image from Meng so that the processor is configured to learn the third neural network in a manner such that an error between an action unit generated by the third neural network from a still image of learning data and an action unit extracted from a still image of learning data is reduced.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Meng to include learn the third neural network in a manner such that an error between an action unit generated by the third neural network from a still image of learning data and an action unit extracted from a still image of learning data is reduced because Jeong suggests that this way, the face image that is output is more accurate [0110].
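Examiner's note (for illustration only; forming no part of the rejection): the supervised criterion recited in Claim 4 for the third network may be sketched as follows, assuming PyTorch; the module and shapes are hypothetical placeholders.

```python
# Illustrative sketch only (forming no part of the rejection); hypothetical stand-in module.
import torch
import torch.nn as nn

third_net = nn.Linear(3 * 64 * 64, 17)           # still image -> action-unit vector
optimizer = torch.optim.Adam(third_net.parameters(), lr=1e-4)

still_image  = torch.rand(8, 3 * 64 * 64)        # batch of learning-data still images
extracted_au = torch.randn(8, 17)                # stand-in for AUs extracted from the same images

loss = nn.functional.mse_loss(third_net(still_image), extracted_au)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                 # the error between generated and extracted AUs is reduced
```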
As per Claim 5, Claim 5 is similar in scope to Claim 1, and therefore is rejected under the same rationale.
Prior Art of Record
Meng, Zibo; "Listen to Your Face: Inferring Facial Action Units from Audio Channel"; December 2019; IEEE Transactions on Affective Computing; Vol. 10; pp. 537-549; https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8025777
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONI HSU whose telephone number is (571)272-7785. The examiner can normally be reached M-F 10am-6:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung, can be reached at (571)272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
JH
/JONI HSU/Primary Examiner, Art Unit 2611