Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 4 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 4 recites the limitation "the scene type". There is insufficient antecedent basis for this limitation in the claim.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. (DanceIt: Music-inspired Dancing Video Synthesis, Computer Vision and Pattern Recognition, 2020) in view of Zhong et al. (Generative adversarial networks with decoder–encoder output noises, Neural Networks, 2020).
Regarding claim 1, Guo et al. (hereinafter Guo) discloses a video generation method (Guo, Fig. 6 illustrates the pipeline of the proposed framework), including:
acquiring a target audio (Guo, Fig. 1 illustrates that a piece of input audio is given);
generating an image sequence according to characteristic information of the target audio (Guo, Fig. 1 illustrates generating a dancing sequence automatically by taking the human pose as an intermediary, with the representative poses and generated dancing frames corresponding to the audio peaks) and an image generation model (Guo, Fig. 6 illustrates that the imagination module synthesizes the final dancing videos from the processed pose sequences), wherein the image generation model is used for generating a corresponding image (Guo, Fig. 6 illustrates that the imagination module synthesizes the final dancing videos from the processed pose sequences); and
combining the target audio and the image sequence to generate a target video corresponding to the target audio (Guo, Fig. 6 illustrates the generation phase: the spatial alignment is used to re-predict dance movements between discontinuous pose fragments, the temporal algorithm aligns the beats of the music and pose sequence, and the imagination module synthesizes the final dancing videos from the processed pose sequences);
Guo does not expressly disclose “based on a randomly input vector”;
Zhong et al. (hereinafter Zhong) discloses generating an image based on a randomly input vector (Zhong, 1. Introduction, [0006], “The decoder-encoder structure can transform the noninformative Gaussian noises to informative ones. Because the decoder-encoder structure carries information of the real images, the output noise vectors could accelerate the training process of the adversarial networks and improve the quality of the generated images”. The Gaussian noise vectors are randomly input vectors).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to generate Guo’s videos by applying Zhong’s noise vector as an input to Guo’s video generation process. The motivation for doing so would have been improving the quality of the generated images.
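As a purely illustrative sketch of the combination described above (not part of the record, and with all names, dimensions, and the stand-in linear "generator" being hypothetical), generating an image "based on a randomly input vector" amounts to sampling a Gaussian noise vector and mapping it through a trained generator:

```python
import numpy as np

# Hypothetical sketch: an image generator driven by a randomly
# sampled (Gaussian) input vector, as in the Guo/Zhong combination.
rng = np.random.default_rng(seed=0)

NOISE_DIM = 64          # dimensionality of the random input vector
IMG_H, IMG_W = 16, 16   # toy output image size

# Stand-in for a trained generator network: a fixed linear map from
# noise space to pixel space, squashed to (0, 1) by a sigmoid.
W = rng.standard_normal((NOISE_DIM, IMG_H * IMG_W))

def generate_image(z: np.ndarray) -> np.ndarray:
    """Map a noise vector z to a toy 'image' with pixels in (0, 1)."""
    pixels = 1.0 / (1.0 + np.exp(-z @ W))
    return pixels.reshape(IMG_H, IMG_W)

z = rng.standard_normal(NOISE_DIM)  # the "randomly input vector"
img = generate_image(z)
```

In an actual GAN, `generate_image` would be a trained neural network and, per Zhong, the noise could first pass through a decoder-encoder structure; the sketch only shows the data flow from random vector to image.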
Regarding claim 9, Guo discloses generating the image sequence according to the characteristic information of the target audio and a generator in the image generation model (Guo, Fig. 1 illustrates generating a dancing sequence automatically by taking the human pose as an intermediary, with the representative poses and generated dancing frames corresponding to the audio peaks. In addition, Fig. 6 illustrates that the imagination module synthesizes the final dancing videos from the processed pose sequences);
Guo as modified by Zhong, with the same motivation as set forth for claim 1, discloses a generative adversarial model (Zhong, 3. Generative adversarial networks with decoder-encoder output noises).
Claims 2 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. in view of Zhong et al., as applied to claims 1 and 15 above, and further in view of Liem et al. (When music makes a scene: Characterizing music in multimedia contexts via user scene descriptions, Int J Multimed Info Retr, 2013) and Qiu et al. (Image generation associated with music data, CVPR 2018 Sight and Sound Workshop).
Regarding claim 2, Guo teaches the target audio (Guo, Fig. 1) and the generating of an image sequence according to characteristic information of the target audio and an image generation model (Guo, Fig. 1); Guo as modified by Zhong, with the same motivation as set forth for claim 1, discloses a randomly input vector (Zhong, 1. Introduction, [0006], “The decoder-encoder structure can transform the noninformative Gaussian noises to informative ones. Because the decoder-encoder structure carries information of the real images, the output noise vectors could accelerate the training process of the adversarial networks and improve the quality of the generated images”. The Gaussian noise vectors are randomly input vectors);
Liem et al. (hereinafter Liem) discloses determining a target scene type to which an audio belongs (Liem, 6.1 Many descriptions of the same music fragment, [0001], “the described scene outdoors, and from the 44 respondents who went as far as indicating a geographic location for the scene, frequently mentioned regions are Europe… To the worker respondents, the fragment strongly evokes dancing/party scenes…To the worker respondents, it evokes mysterious, unknown and sometimes unpleasant situations. The most frequently mentioned actor category is that of adventurers”);
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Guo’s video synthesis process to incorporate Liem’s data for determining a scene type corresponding to a target audio. The motivation for doing so would have been improving audio-visual consistency.
Guo as modified by Zhong and Liem does not expressly disclose “determining an image generation model corresponding to the target scene type”;
Qiu et al. (hereinafter Qiu) discloses determining an image generation model corresponding to an audio feature (Qiu, 3.4. Image generation from music data, [0001], “we can obtain music features that can correspond to images through our trained CNN-LSTM model. We then fuse those extracted music features into DCGAN to generate images”);
generating an image corresponding to the target scene type (Qiu, Fig. 8).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to use Qiu’s technique of generating images from extracted audio features to generate the images used in the video synthesis process of Guo as modified by Zhong and Liem. The motivation for doing so would have been improving semantic consistency between the audio content and the generated video output.
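As a purely illustrative sketch of the claim-2 combination (not part of the record; the classifier, generators, and all names and dimensions are hypothetical stand-ins), a scene type is determined from the audio, a generation model associated with that scene type is selected, and an image is generated from the fused audio feature and a random noise vector:

```python
import numpy as np

# Hypothetical sketch: scene-type-conditioned image generation from
# audio features fused with a random noise vector (Liem + Qiu + Zhong).
rng = np.random.default_rng(seed=1)

FEAT_DIM, NOISE_DIM, IMG_PIX = 32, 16, 8 * 8

def make_generator() -> np.ndarray:
    # Stand-in for a trained per-scene generator: a fixed linear map.
    return rng.standard_normal((FEAT_DIM + NOISE_DIM, IMG_PIX))

# One generator per scene type, i.e. "an image generation model
# corresponding to the target scene type".
generators = {"party": make_generator(), "outdoors": make_generator()}

def classify_scene(audio_feat: np.ndarray) -> str:
    # Toy scene classifier: thresholds the feature mean.
    return "party" if audio_feat.mean() > 0 else "outdoors"

def generate(audio_feat: np.ndarray) -> np.ndarray:
    scene = classify_scene(audio_feat)              # target scene type
    z = rng.standard_normal(NOISE_DIM)              # random input vector
    x = np.concatenate([audio_feat, z])             # fuse feature + noise
    return (x @ generators[scene]).reshape(8, 8)    # scene-specific model

img = generate(rng.standard_normal(FEAT_DIM))
```

In Qiu the fusion is into a DCGAN conditioned on CNN-LSTM music features; the sketch only shows the select-by-scene-type and feature-plus-noise data flow.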
Regarding claim 21, claim 21 recites functions that are similar in scope to the method steps recited in claim 2 and therefore are rejected under the same rationale.
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. in view of Zhong et al. in view of Jin et al. (US 2021/0398336).
Regarding claim 15, Guo teaches video generation (Guo, Fig. 1);
Guo as modified by Zhong does not expressly disclose “a video generation device”;
Jin et al. (hereinafter Jin) discloses a video generation device (Jin, [0091], “a user may generate and acquire fusion images via a user apparatus 300 such as a personal computer”);
at least one processor and a memory (Jin, Fig. 19), the memory storing computer-executable instructions, wherein the computer-executable instructions upon execution by the at least one processor cause the at least one processor to implement operations (Jin, [0097], “a CPU (Central Processing Unit) or the like may implement them through information processing of the software items…and functions of at least a portion thereof are stored in a storage medium and may be loaded into a computer for execution”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to perform Guo’s video synthesis process using Jin’s video generation device. The motivation for doing so would have been enabling practical implementation of the video synthesis process.
The remaining limitations recited in claim 15 are similar in scope to the method steps recited in claim 1 and therefore are rejected under the same rationale.
Allowable Subject Matter
Claims 3-8 and 22-25 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KYLE ZHAI whose telephone number is (571)270-3740. The examiner can normally be reached 9AM-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ke Xiao, can be reached at (571) 272-7776. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KYLE ZHAI/Primary Examiner, Art Unit 2611