DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-16, 21-22, and 24 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Huang et al. (US 20250061634 A1).
Regarding claim 1, Huang discloses A method for generating a human face video ([0023] generate animation that is representative of one or more characters uttering speech represented by audio data), comprising:
obtaining a mouth-shape multimedia resource and a reference human face image of a target object ([0023] receive one or more component vectors with which a character is to be animated to utter speech contained in an audio segment extracted from the input raw audio);
obtaining a reference style vector of the target object ([0023] one or more component vectors, such as a style vector or an emotion vector, that indicates one or more emotions, with potential relative weightings, to use to render the facial animation for an input audio clip.);
for each resource frame in the mouth-shape multimedia resource, obtaining a respective mouth-shape driving feature by performing a feature extraction process on each resource frame ([0023] The network may also receive a style vector (or style information incorporated in the emotion vector or other emotional data representation) that indicates a modification or fine control of the animation to be generated for the indicated emotion(s), as may relate to a style of animation or may relate to specific motions to be modified or enhanced, among other such options);
generating a respective stylistic human face image corresponding to each resource frame based on the respective mouth-shape driving feature, the reference human face image and the reference style vector ([0023] Modeling each of these facial (and/or body) components separately, and determining deformations for each of these components, can cause the rendered facial (and/or body) animation to appear more realistic for a given emotion, particularly when considering any style data provided to the network.); and
determining a stylistic human face video of the target object based on the respective stylistic human face image corresponding to each resource frame in the mouth-shape multimedia resource ([0025] When generating such image or video data for various operations, it can be a goal (or in at least some examples required) for a representation of a character—such as a human, robot, animal, or other such entity—to behave as realistically as possible.).
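For illustration only, and not as a characterization of Huang's implementation or of the claims as filed, the per-frame generation flow recited in claim 1 can be sketched as follows; all function names (extract_mouth_feature, render_face) are hypothetical placeholders:

```python
# Illustrative sketch of the claim 1 flow; all names are hypothetical placeholders.
from typing import Callable, List

def generate_stylistic_face_video(resource_frames: List,
                                  reference_face,
                                  reference_style_vector,
                                  extract_mouth_feature: Callable,
                                  render_face: Callable) -> List:
    """Per-frame feature extraction, per-frame stylistic image generation,
    then assembly of the ordered images into the stylistic face video."""
    stylistic_frames = []
    for frame in resource_frames:
        # respective mouth-shape driving feature for this resource frame
        mouth_feature = extract_mouth_feature(frame)
        # stylistic human face image conditioned on the feature, reference image, and style vector
        image = render_face(mouth_feature, reference_face, reference_style_vector)
        stylistic_frames.append(image)
    return stylistic_frames  # the ordered frames constitute the stylistic human face video
```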
Regarding claim 2, Huang discloses wherein the reference style vector conforms to a Gaussian distribution ([0077] a style state can be represented by an S-dimensional vector, where S is a tunable parameter that can be set to a value such as, without limitation, 16 or 24, and the components initialized to random values drawn from a Gaussian distribution.), and wherein obtaining the reference style vector of the target object comprises:
obtaining candidate Gaussian distributions ([0077] a Gaussian distribution);
obtaining a resource frame sample in a mouth-shape multimedia resource sample, a reference human face image sample of the target object, and a video frame sample corresponding to the resource frame sample in a stylistic human face video sample of the target object ([0077] One such vector can be allocated for each training sample, with the matrix that stores these latent variables being referred to herein as a style database);
selecting a target Gaussian distribution from the candidate Gaussian distributions based on the resource frame sample, the reference human face image sample, and the video frame sample ([0067] The component vectors may correspond to an emotion or style vector, among other options, that modifies different features of an output animation);
and determining a style vector satisfying the target Gaussian distribution as the reference style vector of the target object ([0077] If S is too low, the styles fail to disambiguate variations in the training data, leading to weak audio response. If S is too high, styles may become too specialized to be useful for general inference.).
Regarding claim 3, Huang discloses wherein selecting the target Gaussian distribution from the candidate Gaussian distributions based on the resource frame sample, the reference human face image sample and the video frame sample, comprises:
determining a mouth-shape driving feature sample of the resource frame sample ([0051] there may be a reduced likelihood of movement between the frames, which would be expected if there were no sound, as the character would likely not be moving their mouth, as an example, if they were not speaking.);
determining a respective candidate style vector matching each of the candidate Gaussian distributions sequentially ([0077] a style state can be represented by an S-dimensional vector, where S is a tunable parameter that can be set to a value such as, without limitation, 16 or 24, and the components initialized to random values drawn from a Gaussian distribution);
generating a respective stylistic human face prediction image based on the respective candidate style vector, the reference human face image sample, and the mouth-shape driving feature sample ([0051] the output character model may provide improved realism by guiding or otherwise encouraging mouth closing at appropriate times.); and
in response to a similarity between a stylistic human face prediction image and the video frame sample satisfying a similarity condition, determining a corresponding candidate Gaussian distribution as the target Gaussian distribution ([0077] If S is too low, the styles fail to disambiguate variations in the training data, leading to weak audio response. If S is too high, styles may become too specialized to be useful for general inference.).
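For illustration only (not drawn from Huang or the instant specification), the selection of a target Gaussian recited in claims 2-3 can be sketched roughly as follows, where candidates is a list of hypothetical (mean, covariance) pairs and render_face and similarity are hypothetical placeholders:

```python
# Illustrative sketch of selecting a target Gaussian; names are hypothetical placeholders.
import numpy as np

def select_target_gaussian(candidates, mouth_feature_sample, reference_face_sample,
                           video_frame_sample, render_face, similarity, threshold=0.9):
    """Draw a candidate style vector from each candidate Gaussian, render a
    stylistic face prediction image, and return the first candidate whose
    prediction is sufficiently similar to the ground-truth video frame sample."""
    for mean, cov in candidates:
        candidate_style = np.random.multivariate_normal(mean, cov)
        prediction = render_face(mouth_feature_sample, reference_face_sample, candidate_style)
        if similarity(prediction, video_frame_sample) >= threshold:  # similarity condition met
            return (mean, cov), candidate_style
    return None, None  # no candidate satisfied the similarity condition
```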
Regarding claim 4, Huang discloses wherein generating the stylistic human face image corresponding to the resource frame based on the mouth-shape driving feature, the reference human face image and the reference style vector, comprises:
determining a mouth-shape style driving feature based on the mouth-shape driving feature and the reference style vector ([0051] there may be a reduced likelihood of movement between the frames, which would be expected if there were no sound, as the character would likely not be moving their mouth, as an example, if they were not speaking.); and
generating the stylistic human face image corresponding to the resource frame based on the mouth-shape style driving feature and the reference human face image ([0100] The output of the network can be examined for several novel audio clips with every remaining emotion/style vector, and a semantic meaning (e.g., “neutral”, “amused”, “surprised”, etc.) assigned to each of them, depending at least in part on factors such as the emotional state they convey. Which semantic emotions remain can depend on the training material, and it may not be possible to extract, e.g., a “happy” emotion if the training data does not contain enough such material to be generalizable to novel audio.).
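For illustration only (one of several possible fusions, not asserted to be Huang's), the mouth-shape style driving feature of claim 4 could be formed by concatenating the mouth-shape driving feature with the reference style vector before conditioning the image generator:

```python
# Illustrative fusion sketch; the concatenation choice is an assumption, not from the record.
import numpy as np

def fuse_mouth_shape_style(mouth_feature: np.ndarray, style_vector: np.ndarray) -> np.ndarray:
    """Concatenate the per-frame mouth-shape driving feature with the reference
    style vector to obtain a mouth-shape style driving feature, which would then
    drive image generation together with the reference human face image."""
    return np.concatenate([mouth_feature, style_vector], axis=-1)
```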
Regarding claim 5, Huang discloses wherein the mouth-shape multimedia resource is a mouth-shape audio or a mouth-shape video ([0051] there may be a reduced likelihood of movement between the frames, which would be expected if there were no sound, as the character would likely not be moving their mouth, as an example, if they were not speaking.).
Regarding claim 6, Huang discloses wherein the mouth-shape multimedia resource is a mouth-shape multimedia resource of a non-target object or an integrated mouth-shape multimedia resource ([0073] A primary input to such a network is a speech audio signal, which may be converted to a format such as 16 kHz mono audio before feeding the audio to the network).
Regarding claim 7, Huang discloses A method for training a human face and mouth shape driver model ([0023] generate animation that is representative of one or more characters uttering speech represented by audio data), comprising:
obtaining a pre-trained human face and mouth shape driver model and an encoding network, the human face and mouth shape driver model comprising a feature extraction network and a human face driving network that are sequentially connected ([0046] [T]he weights of the combination layer 312 may be learned as the model is trained through back propagation.);
obtaining a respective mouth-shape driving feature sample of each resource frame sample in a mouth-shape multimedia resource sample, a reference human face image sample ([0023] receive one or more component vectors with which a character is to be animated to utter speech contained in an audio segment extracted from the input raw audio), and
a stylistic human face video sample, wherein each resource frame sample in the mouth-shape multimedia resource sample corresponds one-to-one with each video frame sample in the stylistic human face video sample ([0023] one or more component vectors, such as a style vector or an emotion vector, that indicates one or more emotions, with potential relative weightings, to use to render the facial animation for an input audio clip.);
for each resource frame sample in the mouth-shape multimedia resource sample, obtaining a style prediction vector outputted by the encoding network by inputting the mouth-shape driving feature sample and the video frame sample corresponding to the resource frame sample into an initial encoding network ([0023] The network may also receive a style vector (or style information incorporated in the emotion vector or other emotional data representation) that indicates a modification or fine control of the animation to be generated for the indicated emotion(s), as may relate to a style of animation or may relate to specific motions to be modified or enhanced, among other such options);
obtaining a stylistic human face prediction image outputted by the human face driving network by inputting the style prediction vector, the mouth-shape driving feature sample and the reference human face image sample into the human face driving network ([0023] Modeling each of these facial (and/or body) components separately, and determining deformations for each of these components, can cause the rendered facial (and/or body) animation to appear more realistic for a given emotion, particularly when considering any style data provided to the network.); and
obtaining a trained human face and mouth shape driver model by adjusting parameters of the encoding network and the human face driving network in the human face and mouth shape driver model based on a distribution to which the style prediction vector belongs, a Gaussian distribution, the stylistic human face prediction image and the video frame sample corresponding to the resource frame sample ([0077] a style state can be represented by an S-dimensional vector, where S is a tunable parameter that can be set to a value such as, without limitation, 16 or 24, and the components initialized to random values drawn from a Gaussian distribution).
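For illustration only, the training arrangement recited in claim 7 resembles a VAE-style setup; the following simplified PyTorch sketch assumes hypothetical encoder and face_driver modules and is not a representation of Huang's code:

```python
# Illustrative, simplified training step (hypothetical modules; VAE-style arrangement assumed).
import torch
import torch.nn.functional as F

def training_step(encoder, face_driver, optimizer, mouth_feature, ref_face, gt_frame, kl_weight=0.01):
    """The encoding network maps (mouth-shape driving feature sample, video frame sample)
    to a style prediction vector; the face driving network renders a stylistic face
    prediction image; parameters are adjusted from an image reconstruction term plus a
    term pulling the predicted style distribution toward a standard Gaussian."""
    mu, logvar = encoder(mouth_feature, gt_frame)                  # predicted style distribution
    style = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterized style prediction vector
    prediction = face_driver(style, mouth_feature, ref_face)       # stylistic human face prediction image
    recon = F.l1_loss(prediction, gt_frame)                        # prediction image vs. video frame sample
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # predicted distribution vs. Gaussian
    loss = recon + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```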
Regarding claim 8, Huang discloses wherein obtaining the respective mouth-shape driving feature sample of each resource frame sample in the mouth-shape multimedia resource sample, the reference human face image sample, and the stylistic human face video sample comprises:
obtaining the mouth-shape multimedia resource sample, the reference human face image sample, and the stylistic human face video sample ([0023] receive one or more component vectors with which a character is to be animated to utter speech contained in an audio segment extracted from the input raw audio); and
for each resource frame sample in the mouth-shape multimedia resource sample, obtaining the mouth-shape driving feature sample of the resource frame sample outputted by the feature extraction network by inputting the resource frame sample into the feature extraction network in the human face and mouth shape driver model ([0023] The network may also receive a style vector (or style information incorporated in the emotion vector or other emotional data representation) that indicates a modification or fine control of the animation to be generated for the indicated emotion(s), as may relate to a style of animation or may relate to specific motions to be modified or enhanced, among other such options).
Regarding claim 9, Huang discloses wherein obtaining the mouth-shape multimedia resource sample, the reference human face image sample and the stylistic human face video sample comprises:
obtaining a human face video sample ([0023] receive one or more component vectors with which a character is to be animated to utter speech contained in an audio segment extracted from the input raw audio);
determining the human face video sample as the stylistic human face video sample ([0025] it can be a goal (or in at least some examples required) for a representation of a character—such as a human, robot, animal, or other such entity—to behave as realistically as possible.);
determining the human face video sample or an audio in the human face video sample as the mouth-shape multimedia resource sample ([0051] there may be a reduced likelihood of movement between the frames, which would be expected if there were no sound, as the character would likely not be moving their mouth, as an example, if they were not speaking.); and
determining any one of video frames from the human face video sample as the reference human face image sample ([0032] provide for facial animation that provides realistic behavior under various emotional states for a variety of different character types and for a variety of different input audio types).
Regarding claim 10, Huang discloses wherein obtaining the trained human face and mouth shape driver model by adjusting the parameters of the encoding network and the human face driving network in the human face and mouth shape driver model based on the distribution to which the style prediction vector belongs, the Gaussian distribution, the stylistic human face prediction image and the video frame sample corresponding to the resource frame sample comprises:
determining a value of a first sub-loss function based on the distribution to which the style prediction vector belongs, the Gaussian distribution, and the first sub-loss function ([0037] A loss function used to determine the loss value can include terms for position, motion,);
determining a value of a second sub-loss function based on the stylistic human face prediction image, a video frame sample corresponding to the resource frame sample, and the second sub-loss function ([0037] This loss value can be used during backpropagation to update network parameters (e.g., weights and biases) for the deep neural network);
determining a value of a loss function based on the value of the first sub-loss function and the value of the second sub-loss function ([0037] [T]he weights for the audio encoder may be frozen such that the backpropagation is used to only update one or more decoders.); and
obtaining the trained human face and mouth shape driver model by adjusting the parameters of the encoding network and the human face driving network in the human face and mouth shape driver model based on the value of the loss function ([0037] Once the network is determined to converge to an acceptable or desired level of accuracy or precision, and/or another training end criterion is satisfied (e.g., processing all training data or performing a target or maximum number of training iterations), the trained network 206 can be provided or deployed for inferencing.).
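For illustration only (the weights and distance measures below are assumptions, not taken from Huang or the claims), the composite objective recited in claim 10 can be written as a weighted sum of the two sub-losses:

\mathcal{L} = \lambda_{1}\, D_{\mathrm{KL}}\big( q(z) \,\|\, \mathcal{N}(0, I) \big) + \lambda_{2}\, \lVert \hat{I} - I_{\mathrm{gt}} \rVert_{1}

where q(z) denotes the distribution to which the style prediction vector belongs, \hat{I} the stylistic human face prediction image, and I_{\mathrm{gt}} the video frame sample corresponding to the resource frame sample.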
Regarding claim 11, Huang discloses an electronic device ([0023] generate animation that is representative of one or more characters uttering speech represented by audio data) comprising:
at least one processor ([0104] portions of this functionality can be performed using multiple computing devices, or multiple processors within one or more computing devices, such as may include a combination of CPUs and GPUs.); and
a memory, communicatively coupled to the at least one processor ([0109] code and/or data storage 701 may be cache memory, dynamic randomly addressable memory (“DRAM”), static randomly addressable memory (“SRAM”), non-volatile memory (e.g., Flash memory), or other storage.);
wherein the memory stores instructions executable by the at least one processor, and
when the instructions are executed by the at least one processor, the at least one processor is configured to:
obtain a reference style vector of the target object ([0023] one or more component vectors, such as a style vector or an emotion vector, that indicates one or more emotions, with potential relative weightings, to use to render the facial animation for an input audio clip.);
for each resource frame in the mouth-shape multimedia resource, obtaining a respective mouth-shape driving feature by performing a feature extraction process on each resource frame ([0023] The network may also receive a style vector (or style information incorporated in the emotion vector or other emotional data representation) that indicates a modification or fine control of the animation to be generated for the indicated emotion(s), as may relate to a style of animation or may relate to specific motions to be modified or enhanced, among other such options);
generate a respective stylistic human face image corresponding to each resource frame based on the respective mouth-shape driving feature, the reference human face image and the reference style vector ([0023] Modeling each of these facial (and/or body) components separately, and determining deformations for each of these components, can cause the rendered facial (and/or body) animation to appear more realistic for a given emotion, particularly when considering any style data provided to the network.); and
determine a stylistic human face video of the target object based on the respective stylistic human face image corresponding to each resource frame in the mouth-shape multimedia resource ([0025] When generating such image or video data for various operations, it can be a goal (or in at least some examples required) for a representation of a character—such as a human, robot, animal, or other such entity—to behave as realistically as possible.).
Regarding claim 12, Huang discloses wherein the reference style vector conforms to a Gaussian distribution ([0077] a style state can be represented by an S-dimensional vector, where S is a tunable parameter that can be set to a value such as, without limitation, 16 or 24, and the components initialized to random values drawn from a Gaussian distribution.), and the at least one processor is configured to:
obtain candidate Gaussian distributions ([0077] a Gaussian distribution);
obtain a resource frame sample in a mouth-shape multimedia resource sample, a reference human face image sample of the target object, and a video frame sample corresponding to the resource frame sample in a stylistic human face video sample of the target object ([0077] One such vector can be allocated for each training sample, with the matrix that stores these latent variables being referred to herein as a style database);
select a target Gaussian distribution from the candidate Gaussian distributions based on the resource frame sample, the reference human face image sample, and the video frame sample ([0067] The component vectors may correspond to an emotion or style vector, among other options, that modifies different features of an output animation);
and determine a style vector satisfying the target Gaussian distribution as the reference style vector of the target object ([0077] If S is too low, the styles fail to disambiguate variations in the training data, leading to weak audio response. If S is too high, styles may become too specialized to be useful for general inference.).
Regarding claim 13, Huang discloses wherein the at least one processor is configured to:
determine a mouth-shape driving feature sample of the resource frame sample ([0051] there may be a reduced likelihood of movement between the frames, which would be expected if there were no sound, as the character would likely not be moving their mouth, as an example, if they were not speaking.);
determine a respective candidate style vector matching each of the candidate Gaussian distributions sequentially ([0077] a style state can be represented by an S-dimensional vector, where S is a tunable parameter that can be set to a value such as, without limitation, 16 or 24, and the components initialized to random values drawn from a Gaussian distribution);
generate a respective stylistic human face prediction image based on the respective candidate style vector, the reference human face image sample, and the mouth-shape driving feature sample ([0051] the output character model may provide improved realism by guiding or otherwise encouraging mouth closing at appropriate times.); and
in response to a similarity between a stylistic human face prediction image and the video frame sample satisfying a similarity condition, determining a corresponding candidate Gaussian distribution as the target Gaussian distribution ([0077] If S is too low, the styles fail to disambiguate variations in the training data, leading to weak audio response. If S is too high, styles may become too specialized to be useful for general inference.).
Regarding claim 14, Huang discloses wherein the at least one processor is configured to:
determine a mouth-shape style driving feature based on the mouth-shape driving feature and the reference style vector ([0051] there may be a reduced likelihood of movement between the frames, which would be expected if there were no sound, as the character would likely not be moving their mouth, as an example, if they were not speaking.); and
generate the stylistic human face image corresponding to the resource frame based on the mouth-shape style driving feature and the reference human face image ([0100] The output of the network can be examined for several novel audio clips with every remaining emotion/style vector, and a semantic meaning (e.g., “neutral”, “amused”, “surprised”, etc.) assigned to each of them, depending at least in part on factors such as the emotional state they convey. Which semantic emotions remain can depend on the training material, and it may not be possible to extract, e.g., a “happy” emotion if the training data does not contain enough such material to be generalizable to novel audio.).
Regarding claim 15, Huang discloses wherein the mouth-shape multimedia resource is a mouth-shape audio or a mouth-shape video ([0051] there may be a reduced likelihood of movement between the frames, which would be expected if there were no sound, as the character would likely not be moving their mouth, as an example, if they were not speaking.).
Regarding claim 16, Huang discloses wherein the mouth-shape multimedia resource is a mouth-shape multimedia resource of a non-target object or an integrated mouth-shape multimedia resource ([0073] A primary input to such a network is a speech audio signal, which may be converted to a format such as 16 kHz mono audio before feeding the audio to the network).
Regarding claim 21, Huang discloses An electronic device ([0023] generate animation that is representative of one or more characters uttering speech represented by audio data), comprising:
at least one processor ([0104] portions of this functionality can be performed using multiple computing devices, or multiple processors within one or more computing devices, such as may include a combination of CPUs and GPUs.); and
a memory, communicatively coupled to the at least one processor ([0109] code and/or data storage 701 may be cache memory, dynamic randomly addressable memory (“DRAM”), static randomly addressable memory (“SRAM”), non-volatile memory (e.g., Flash memory), or other storage.);
wherein the memory stores instructions executable by the at least one processor, and when the instructions are executed by the at least one processor, the at least one processor is configured to perform the method of claim 7.
Regarding claim 22, Huang discloses A non-transitory computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions are used to cause a computer to perform the method of claim 1 ([0109] code and/or data storage 701 may be cache memory, dynamic randomly addressable memory (“DRAM”), static randomly addressable memory (“SRAM”), non-volatile memory (e.g., Flash memory), or other storage.).
Regarding claim 24, Huang discloses A non-transitory computer-readable storage medium having computer instructions, wherein the computer instructions are used to cause a computer to perform the method of claim 7 ([0109] code and/or data storage 701 may be cache memory, dynamic randomly addressable memory (“DRAM”), static randomly addressable memory (“SRAM”), non-volatile memory (e.g., Flash memory), or other storage.).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHIVANG I PATEL whose telephone number is (571) 272-8964. The examiner can normally be reached Monday-Friday, 9:00 am to 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alicia Harrington can be reached on (571) 272-2330. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SHIVANG I PATEL/Primary Examiner, Art Unit 2615