Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-10, 12-17, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hwang et al. (U.S. Patent Application Publication No. 2024/0203099), referred to herein as Hwang, in view of Song et al. (U.S. Patent Application Publication No. 2024/0346735), referred to herein as Song.
Regarding claim 1, Hwang teaches one or more processors, comprising circuitry to use one or more neural networks (figs 1 and 15-17, processor 180, neural network portions 530, 630, 730, and 820) to:
obtain one or more conditional inputs (figs 15-17, inputs 516/730; paragraphs 280 and 281; paragraphs 291 and 292), audio data of speech (figs 15-17, audio input 518/820; paragraphs 280 and 281; paragraph 295), and a textual input (paragraphs 188 and 190; paragraphs 280 and 281; paragraph 296), at least one of the one or more conditional inputs comprising 3D position data or depth information of one or more features of an object (paragraphs 280 and 281; paragraphs 291 and 292);
determine an expression of a speaker of the speech represented by the audio data (paragraph 220; paragraph 253);
generate a representation that represents one or more body movements and facial expressions of the object conveying the expression, and iteratively predict based, at least in part, on the conditional inputs, the representation of the speaker, and the textual input, one or more body movements and facial expressions of an object when pronouncing the textual input (paragraphs 188 and 190; paragraphs 222 and 255; paragraphs 280 and 281; paragraph 297); and
generate, based on the one or more iteratively predicted body movements and iteratively predicted facial expressions, a 3D model of the object that depicts the object performing the one or more body movements while pronouncing the textual input and with facial expressions that correspond to the expression (figs 15-17, model generator 530; paragraphs 220 and 222; paragraphs 253 and 255; paragraphs 284 and 299).
Hwang teaches determining expressions based on the text, audio data of speech, and 3D position data, as shown above. However, Hwang does not explicitly teach determining an emotional state, and generating an emotion-conditioned latent space representation that represents body movements and facial expressions of the object conveying the emotional state, wherein the 3D model depicts the object with movements and facial expressions corresponding to the emotional state.
However, in a similar field of endeavor, Song teaches one or more processors to use neural networks to obtain conditional inputs comprising position data of features of an object and audio data of speech, iteratively predict body movements and facial expressions of an object pronouncing the input data, and generate a 3D model of the object that depicts the object performing the body movements and facial expressions (paragraph 38; paragraph 47, the last 11 lines; paragraph 48, lines 1-15; paragraph 50, lines 1-11; paragraph 59, lines 5-8), wherein the processor further uses the neural networks to determine an emotional state of a speaker of the speech, and generate an emotion-conditioned latent space representation that represents the body movements and facial expressions of the object conveying the emotional state, wherein the 3D model depicts the object with the body movements and facial expressions corresponding to the emotional state (paragraph 67, lines 1-7; paragraph 69, lines 1-7; paragraph 98; paragraphs 114 and 117).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the emotional state determination and latent space representation of Song with the text, audio data, and 3D position data processing of Hwang because this improves the realism of the 3D model and more accurately and naturally reflects human emotional response as a result of, and/or in reaction to, a speaker or the user, thereby improving the quality of the 3D model generation and the user interaction with the 3D model (see, for example, Song, paragraph 37, the last 10 lines; paragraph 38, lines 1-6; paragraph 119).
Regarding claim 2, Hwang in view of Song teaches the one or more processors of claim 1, wherein the one or more iteratively predicted body movements of the object comprise one or more motions of limbs of the object and the one or more iteratively predicted facial expressions of the object comprise one or more motions of features of a face of the object (Hwang, fig 17; paragraphs 182 and 188; paragraphs 252, 253, and 260; paragraphs 291 and 297; Song, paragraphs 67, 69, and 81; the motivation to combine is similar to that discussed above in the rejection of claim 1).
Regarding claim 3, Hwang in view of Song teaches the one or more processors of claim 1, wherein the one or more neural networks comprise a second portion to generate the one or more iteratively predicted body movements (Hwang, fig 17, neural network portion 730/530; paragraph 284; paragraphs 291 and 292; paragraph 299; Song, paragraphs 69 and 81; the motivation to combine is similar to that discussed above in the rejection of claim 1), and a first portion to generate the one or more iteratively predicted facial expressions based, at least in part, on the audio data of speech (Hwang, fig 17, neural network portion 630/530; paragraph 284; paragraphs 289 and 290; paragraph 299; Song, paragraphs 69 and 118; the motivation to combine is similar to that discussed above in the rejection of claim 1).
Regarding claim 5, Hwang in view of Song teaches the one or more processors of claim 1, wherein the one or more iteratively predicted body movements and iteratively predicted facial expressions of the object indicate body language conveying the emotional state of the speaker of the speech represented by the audio data, through motion of limbs and the facial expressions of the object (Hwang, paragraphs 220 and 253; paragraphs 284 and 299; Song, paragraph 48; paragraph 67, lines 1-7; paragraph 69, lines 1-7; paragraph 98; paragraphs 114 and 117; the motivation to combine is similar to that discussed above in the rejection of claim 1).
Regarding claim 6, Hwang in view of Song teaches the one or more processors of claim 1, wherein the object is an avatar of one or more portions of a human (Hwang, fig 17; paragraphs 284, 297 and 299; paragraphs 324-326; Song, paragraph 48, the last 9 lines; the motivation to combine is similar to that discussed above in the rejection of claim 1).
Regarding claim 7, Hwang in view of Song teaches the one or more processors of claim 1, wherein the audio data comprises one or more utterances of speech and the one or more conditional inputs correspond to the one or more utterances of speech (Hwang, paragraphs 280 and 281; paragraphs 291 and 292; paragraphs 295 and 297; Song, paragraph 38; the motivation to combine is similar to that discussed above in the rejection of claim 1).
Regarding claims 8-10 and 12-14, the limitations of these claims substantially correspond to the limitations of claims 1-3 and 5-7, respectively; thus they are rejected on similar grounds as their corresponding claims.
Regarding claims 15-17, 19, and 20, the limitations of these claims substantially correspond to the limitations of claims 1-3, 5, and 6, respectively; thus they are rejected on similar grounds as their corresponding claims.
Claims 4, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Hwang, in view of Song, and further in view of Li et al. (U.S. Patent Application Publication No. 2024/0153184), referred to herein as Li.
Regarding claim 4, Hwang in view of Song teaches the one or more processors of claim 1, wherein the one or more conditional inputs are generated by one or more neural networks and indicate the 3D position data or depth information of the object corresponding to the one or more iteratively predicted body movements and iteratively predicted facial expressions of the object (Hwang, paragraphs 253 and 255; paragraphs 280 and 281; paragraphs 289 and 291; Song, paragraphs 69, 81, and 118; the motivation to combine is similar to that discussed above in the rejection of claim 1).
Hwang in view of Song does not teach an input comprising a heatmap.
However, in a similar field of endeavor, Li teaches a system comprising circuits to use one or more neural networks to generate motions of an avatar object based on input motion and audio (fig 5; paragraph 56, lines 1-4; paragraph 58, lines 1-22; paragraph 64, the last 5 lines), wherein heatmaps are input that indicate a position of the object (paragraph 61, lines 1-5 and the last 8 lines; paragraph 62, lines 1-6 and the last 8 lines).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the heatmap body position indication of Li with the body position processing of Hwang in view of Song because this provides highly accurate body position identification that is efficient enough to perform in real-time, while still reducing processing resource requirements (see, for example, Li, paragraph 21, lines 1-10 and the last 5 lines; paragraph 62, the last 2 lines).
Regarding claims 11 and 18, the limitations of each of these claims substantially correspond to the limitations of claim 4; thus they are rejected on similar grounds.
Response to Arguments
Applicant’s arguments with respect to the claim objections have been fully considered, and are persuasive. The amendments have overcome the claim objections; thus they are withdrawn.
Applicant’s arguments with respect to the prior art rejections have been fully considered, but are moot in view of the new grounds of rejection presented above. The Examiner agrees that the claimed emotion-conditioned latent space representations are not taught by the previously cited prior art; however, the Examiner respectfully submits that these limitations are taught by Song, as discussed above.
Conclusion
The following prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Mulliken (U.S. Patent No. 12,125,130); Perceptually and physiologically constrained optimization of avatar models.
Kang (U.S. Patent No. 11,915,513); Apparatus for leveling person image and operating method thereof.
Khirman (U.S. Patent No. 11,631,208); Systems and methods for generating clinically relevant images that preserve physical attributes of humans while protecting personal identity.
Villanueva (U.S. Patent No. 12,592,018); Generating expressive facial animation data from speech audio using speech emotion recognition.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID T WELCH whose telephone number is (571)270-5364. The examiner can normally be reached Monday-Thursday, 8:30-5:30 EST, and alternate Fridays, 9:00-2:30 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao Wu, can be reached at 571-272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
DAVID T. WELCH
Primary Examiner
Art Unit 2613
/DAVID T WELCH/ Primary Examiner, Art Unit 2613