Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 15 is objected to because of a minor typographical error: in the last limitation of the claim, “adding a object” should be amended to read --adding an object--. Appropriate correction is required.
Claim 18 is objected to because of a minor typographical error: in the last limitation of the claim, “genetared” should be amended to read --generated--. Appropriate correction is required.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification, as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f):
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f). The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f). The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) is rebutted when the claim limitation recites function without reciting sufficient structure, material, or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f), except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f), except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f), because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are:
In claim 6: “an input unit configured to…;”
Because these claim limitations are being interpreted under 35 U.S.C. 112(f), they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. A review of the specification shows the following structure corresponding to the claim limitations: an input interface capable of receiving data through, for example, a communication network (see specification, page 10, lines 4-6).
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f), applicant may: (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f).
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-3, 6, 7, 9, and 11 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Liu et al. (U.S. Patent Application Publication No. 2024/0070956), referred to herein as Liu.
Regarding claim 1, Liu teaches a method for training an image generating model that generates an image from an audio (fig 2), comprising: selecting at least one frame from a video including a plurality of frames based on a correlation between an audio and an image of each frame (paragraph 35, lines 1-3; paragraph 36, lines 1-9); extracting image information and audio information from each of the selected at least one frame (paragraph 37, lines 1-11; paragraph 38); and training an audio feature vector extracting model that extracts an audio feature vector from the audio information (paragraph 37, lines 16-21; paragraph 48, lines 1-13; paragraph 49), wherein the audio feature vector is aligned within an embedding space with an image feature vector extracted from the image information by a pre-trained image feature vector extracting model (paragraph 37, lines 16-21; paragraph 38; paragraph 39, lines 1-10; paragraph 42, lines 1-11; paragraph 49).
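For illustration only, and not as a characterization of Liu or of applicant's specification, the following is a minimal sketch of the training arrangement recited in claim 1: an audio encoder is trained so that its output aligns, within a shared embedding space, with the embedding produced by a frozen, pre-trained image encoder for the paired frame. All identifiers (AudioEncoder, train_step, frozen_image_encoder) are hypothetical.

```python
# Hypothetical sketch: train an audio feature vector extracting model so that its
# output aligns with the embedding of a frozen, pre-trained image feature
# vector extracting model (the claimed embedding-space alignment).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Maps audio information (e.g., a spectrogram slice) into the shared embedding space."""
    def __init__(self, audio_dim=128, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def train_step(audio_encoder, frozen_image_encoder, audio_batch, image_batch, optimizer):
    # The image encoder is pre-trained and kept fixed; only the audio encoder is updated.
    with torch.no_grad():
        image_emb = F.normalize(frozen_image_encoder(image_batch), dim=-1)
    audio_emb = audio_encoder(audio_batch)
    # Pull each audio embedding toward the image embedding of its paired frame.
    loss = (1.0 - F.cosine_similarity(audio_emb, image_emb, dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```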
Regarding claim 2, Liu teaches the method of claim 1, wherein the selecting the at least one frame includes selecting the at least one frame from the video using a frame selection method (paragraph 35, lines 1-11; paragraph 36, lines 1-9).
Regarding claim 3, Liu teaches the method of claim 1, further comprising: inputting the audio feature vector into an image generator configured to generate the image based on the image feature vector; and providing the image generated by the image generator (paragraph 37, lines 1-16; paragraphs 38 and 39).
Regarding claim 6, Liu teaches an image generating apparatus (fig 6) comprising: an input unit configured to receive a first audio (fig 6, input unit 605; paragraph 56); a memory configured to store computer-executable instructions, an audio feature vector extracting model, an image feature vector extracting model including an image generator (fig 6, memory 602/603/608; paragraph 57); and a processor configured to execute the one or more instructions stored in the memory (fig 6, processor 601; paragraph 57), wherein the instructions, when executed by the processor, cause the processor to extract a first audio feature vector from a first audio using the audio feature vector extracting model, and generate a first image based on the first audio feature vector using the image generator (paragraph 35, lines 1-3; paragraph 37, lines 1-11; paragraph 38), wherein the audio feature vector extracting model is trained to extract, when at least one frame is selected from a video including a plurality of frames based on a correlation between an audio and an image of each frame, and a second image and a second audio are extracted from each of the selected at least one frame, a second audio feature vector from the second audio (figs 3 and 5B, frames 1-4; paragraph 35, the last 9 lines; paragraph 37, lines 1-11 and 16-21; paragraphs 38 and 39; paragraph 42, lines 1-11; paragraph 54, the last 7 lines; the extraction and processing are performed for each frame), wherein the second audio feature vector is aligned within an embedding space with a second image feature vector extracted from the second image by a pre-trained image feature vector extracting model, and wherein the image generator is pre-trained to generate the second image based on the second image feature vector (paragraph 37, lines 16-21; paragraph 38; paragraph 39, lines 1-10; paragraph 42, lines 1-11; paragraph 49).
Regarding claim 7, Liu teaches the image generating apparatus of claim 6, wherein the first audio is different from the second audio (figs 3 and 5B, frames 1-4; paragraph 37, lines 1-11 and 16-21; paragraph 48, lines 9-26; paragraph 54, the last 7 lines; the portion of audio corresponding to each frame is different from frame to frame).
Regarding claim 9, Liu teaches the image generating apparatus of claim 6, wherein the input unit is configured to receive the first audio or a third image and input the first audio and the third image to the image generator, and the image generator is configured to generate a fourth image in which the first image is reflected onto the third image (figs 3 and 5B, image frames 1-4; paragraph 36, lines 1-9; paragraph 37, lines 1-11 and 16-21; paragraph 39; paragraph 42, lines 1-11).
Regarding claim 11, Liu teaches the image generating apparatus of claim 9, wherein the fourth image is generated by modifying the third image, corresponding to the first audio (figs 3 and 5B, image frames 1-4; paragraph 36, lines 1-9; paragraph 37, lines 1-11 and 16-21; paragraph 39; paragraph 42, lines 1-11).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 4 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Liu, in view of Yuan et al. (U.S. Patent Application Publication No. 2024/0119275), referred to herein as Yuan.
Regarding claim 4, Liu teaches the method of claim 1, but does not teach that the training the audio feature vector extracting model is performed by a contrastive learning method.
However, in a similar field of endeavor, Yuan teaches a method for training an image generating model, comprising extracting image and audio information and aligning them in an embedding space (paragraph 22, lines 1-5; paragraph 24, lines 1-6), wherein training an audio feature vector extracting model is performed by a contrastive learning method (title; paragraph 24, lines 1-6; paragraph 34, lines 1-3 and the last 16 lines).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the contrastive learning method of Yuan with the model training of Liu because this improves the performance of the model by producing a trained model that more efficiently classifies input data (see, e.g., Yuan, paragraph 2; paragraph 23, the last 7 lines).
Regarding claim 5, Liu in view of Yuan teaches the method of claim 4, wherein the contrastive learning method includes InfoNCE (noise contrastive estimation) (Yuan, paragraph 34, the last 16 lines; the motivation to combine is similar to that discussed above in the rejection of claim 4).
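For illustration only, a commonly used formulation of the InfoNCE contrastive loss is sketched below; within a training batch, matched audio/image embedding pairs act as positives and all other pairings act as negatives. This is a generic sketch, not a reproduction of Yuan's implementation, and the function name and temperature value are hypothetical.

```python
# Hypothetical sketch of an InfoNCE (noise contrastive estimation) loss over a
# batch of paired audio and image embeddings.
import torch
import torch.nn.functional as F

def info_nce_loss(audio_emb, image_emb, temperature=0.07):
    audio_emb = F.normalize(audio_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = audio_emb @ image_emb.t() / temperature                    # (N, N) pairwise similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)  # diagonal entries are positives
    return F.cross_entropy(logits, targets)
```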
Claims 8, 12-14, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Liu, in view of Hansen (U.S. Patent No. 10,785,451), referred to herein as Hansen.
Regarding claim 8, Liu teaches the image generating apparatus of claim 6, but does not teach that the first image is generated, when a volume level of the first audio is changed, by reflecting the changed volume level.
However, in a similar field of endeavor, Hansen teaches an image generating apparatus comprising a processor configured to extract an audio feature from input audio and an image feature from image data, and to generate an image based on the audio and image features (figs 1 and 5; column 3, lines 18-22 and 48-65; column 4, lines 10-25), wherein the image is generated, when a volume level of the first audio is changed, by reflecting the changed volume level (column 5, lines 38-49; column 8, lines 8-29 and 55-64).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the volume reflection of Hansen with the image generation of Liu because this improves the image generation to more accurately represent the input audio and video, without increasing the bandwidth and processing requirements to do so (see, e.g., Hansen, column 1, lines 53-67; column 4, lines 25-30; column 5, lines 49-53).
Regarding claim 12, Liu teaches the image generating apparatus of claim 9, but does not teach that the fourth image is generated, when a volume level of the first audio changes, by reflecting the changed volume level.
However, in a similar field of endeavor, Hansen teaches an image generating apparatus comprising a processor configured to extract an audio feature from input audio and an image feature from image data, and to generate an image based on the audio and image features (figs 1 and 5; column 3, lines 18-22 and 48-65; column 4, lines 10-25), wherein the image is generated, when a volume level of the first audio changes, by reflecting the changed volume level (column 5, lines 38-49; column 8, lines 8-29 and 55-64).
The motivation to combine Hansen with Liu is the same as that discussed above in the rejection of claim 8.
Regarding claim 13, Liu teaches the image generating apparatus of claim 6, wherein the first audio includes an audio source originated from an entity, and the first image includes each sub-image corresponding to the entity (figs 3 and 5B, images 1-4; paragraph 35, lines 1-9; paragraph 37, lines 1-11; paragraph 39).
Liu does not explicitly teach a plurality of audio sources originated respectively from a plurality of entities, or that images are generated corresponding to each entity in the plurality of entities, respectively.
However, in a similar field of endeavor, Hansen teaches an image generating apparatus comprising a processor configured to extract an audio feature from input audio and an image feature from image data, and to generate an image based on the audio and image features (figs 1 and 5; column 3, lines 18-22 and 48-65; column 4, lines 10-25), wherein there are a plurality of audio sources originated respectively from a plurality of entities, and images are generated corresponding to each entity in the plurality of entities, respectively (figs 1, 3B, and 5; column 4, lines 10-25; column 5, lines 38-49; column 8, lines 55-64).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the plurality of audio sources of Hansen with the apparatus of Liu, because this helps ensure that multiple sound sources potentially present in the audio can be handled in such a way as to properly and accurately extract the relevant audio regardless of noise, distortion, network connection, etc. (see, e.g., Hansen, column 4, lines 44-59).
Regarding claim 14, Liu in view of Hansen teaches the image generating apparatus of claim 13, wherein the first image is generated, when respective volume levels corresponding to a plurality of audio sources included in the first audio are relatively changed with respect to each other, by reflecting the relatively changed respective volume levels (Hansen, column 5, lines 38-49; column 8, lines 8-29 and 55-64; the motivation to combine is similar to that discussed above in the rejection of claim 8).
Regarding claim 16, Liu teaches the image generating apparatus of claim 6, wherein the processor is configured to generate a video using a video generator, wherein the input unit is configured to input the first audio and a first video to the video generator, and wherein the video generator is configured to generate a second video, by adding a second plurality of images (paragraph 35, lines 1-3 and the last 5 lines; paragraph 37, lines 1-11; paragraph 38; the process repeats for every instance of the input audio and video; further, any number of frames are generated, thus everything after, for example, frames 1-4 may be considered a second video, or alternatively, frames 1 and 2 may be a first video and frames 3 and 4 may be a second video).
Liu does not teach that if a volume level of the first audio changes in a first video, images are generated by reflecting the changed volume level with a first plurality of images included in the first video.
However, in a similar field of endeavor, Hansen teaches an image generating apparatus comprising a processor configured to extract an audio feature from input audio and an image feature from image data, and to generate an image based on the audio and image features (figs 1 and 5; column 3, lines 18-22 and 48-65; column 4, lines 10-25), wherein if a volume level of a first audio changes in a first video, images are generated by reflecting the changed volume level with a first plurality of images included in the first video (column 5, lines 38-49; column 8, lines 8-29 and 55-64).
The motivation to combine Hansen with Liu is the same as that discussed above in the rejection of claim 8.
Claims 10 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Liu, in view of Sadr et al. (U.S. Patent No. 11,941,678), referred to herein as Sadr.
Regarding claim 10, Liu teaches the image generating apparatus of claim 9, wherein the fourth image is generated by aligning an object corresponding to the first audio onto the third image (figs 3 and 5B, image frames 1-4; paragraph 36, lines 1-9; paragraph 37, lines 1-11 and 16-21; paragraph 39; paragraph 42, lines 1-11).
Liu does not explicitly teach adding a new object.
However, in a similar field of endeavor, Sadr teaches an apparatus comprising a processor configured to extract an audio feature from input audio and an image feature from input image data, and to generate an image aligned in an embedding space based on the audio and image features (figs 1 and 4; column 17, lines 44-62; column 19, lines 33-45 and 56-67; column 20, lines 31-39 and 51-66), wherein a new object is added that corresponds to the image (column 20, lines 17-30).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the object addition of Sadr with the image generation of Liu because this helps to more closely identify examples of the desired image output, which improves the dataset models such that they can generate images that more closely align with said output (see, e.g., Sadr, column 5, lines 39-46, and column 6, lines 18-28).
Regarding claim 15, Liu teaches the image generating apparatus of claim 6, wherein the processor is configured to generate a video using a video generator, wherein the input unit is configured to input the first audio and a first video to the video generator, and wherein the video generator is configured to generate a second video including a second plurality of images generated by aligning an object corresponding to the first audio with a first plurality of images included in the first video (paragraph 35, lines 1-3 and the last 5 lines; paragraph 37, lines 1-11; paragraph 38; any number of frames are generated, thus everything after, for example, frames 1-4 may be considered the second plurality, or alternatively, frames 1 and 2 could be considered a first plurality and frames 3 and 4 could be considered a second plurality).
Liu does not explicitly teach adding an object.
However, in a similar field of endeavor, Sadr teaches an apparatus comprising a processor configured to extract an audio feature from input audio and an image feature from input image data, and to generate an image aligned in an embedding space based on the audio and image features (figs 1 and 4; column 17, lines 44-62; column 19, lines 33-45 and 56-67; column 20, lines 31-39 and 51-66), wherein an object is added that corresponds to the image (column 20, lines 17-30).
The motivation to combine Sadr with Liu is the same as that discussed above in the rejection of claim 10.
Claims 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu, in view of Wu et al. (U.S. Patent Application Publication No. 2022/0284220), referred to herein as Wu.
Regarding claim 17, Liu teaches a non-transitory computer readable storage medium storing computer executable instructions, wherein the instructions, when executed by a processor, cause the processor to perform a sub-image generating method (fig 6, medium 602/603/608 and processor 601; paragraph 57), the method comprising: inputting an audio data, and extracting at least one audio information at predetermined time intervals in the audio data (paragraph 35, lines 1-3; paragraph 36, lines 1-9; paragraph 37, lines 1-11 and 16-21); extracting at least one audio feature vector by inputting the extracted at least one audio information into a pre-trained audio feature vector extracting model (paragraph 38; paragraph 42, lines 1-11; paragraph 48, lines 1-13; paragraph 49); and generating at least one sub-image by inputting the audio feature vector into an image generator trained to generate an image based on an image feature vector extracted by a pre-trained image feature vector extracting model from image information corresponding to the audio information (paragraph 36, lines 1-9; paragraph 37, lines 1-11 and 16-21; paragraph 38; paragraph 39, lines 1-10; paragraph 42, lines 1-11), wherein the audio feature vector is aligned within an embedding space with the image feature vector (paragraph 38; paragraph 39, lines 1-10; paragraph 42, lines 1-11).
Liu does not explicitly teach generating thumbnail images (although one might argue that one of the frames disclosed by Liu could be considered a thumbnail image).
However, in a similar field of endeavor, Wu teaches a storage medium storing instructions for extracting audio data and video data and generating images with a pre-trained model that align in an embedding space (figs 2 and 4; paragraphs 38 and 39; paragraph 47, lines 1-13; paragraph 119, lines 1-8), wherein the generated images are selectable thumbnail images (figs 9 and 10; paragraph 77, lines 1-16; paragraph 78, the last 10 lines; paragraph 79).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the thumbnail images of Wu with the image generation of Liu because this facilitates selection of images that are of higher importance or relevance to the user, such that the output images better reflect the desired output (see, e.g., Wu, paragraph 78, lines 1-16).
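For illustration only, the sub-image generating method recited in claim 17 can be sketched as follows: the audio data is sliced at predetermined time intervals, each slice is passed to a pre-trained audio feature vector extracting model, and each resulting vector is input to the image generator. This is a generic sketch rather than Liu's or Wu's method; the encoder and generator are treated as opaque callables, and all names are hypothetical.

```python
# Hypothetical sketch: extract audio information at fixed time intervals and
# generate one sub-image per interval.
def generate_sub_images(audio_samples, sample_rate, interval_sec, audio_encoder, image_generator):
    window = int(interval_sec * sample_rate)              # samples per predetermined interval
    sub_images = []
    for start in range(0, len(audio_samples) - window + 1, window):
        audio_info = audio_samples[start:start + window]  # audio information for this interval
        audio_vec = audio_encoder(audio_info)             # pre-trained audio feature vector extracting model
        sub_images.append(image_generator(audio_vec))     # generator trained on aligned image feature vectors
    return sub_images
```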
Regarding claim 18, Liu in view of Wu teaches the non-transitory computer readable storage medium of claim 17, further comprising: classifying the at least one audio feature vectors into clusters; and determining a representative audio feature vector for each cluster, wherein the generating the thumbnail includes inputting the representative audio feature vector into the image generator and determining the thumbnail generated by the image generator (Liu, paragraph 35, the last 9 lines; paragraph 37, lines 1-11 and 16-21; paragraphs 38 and 39; Wu, paragraph 24, the last 5 lines; paragraph 64; paragraph 77, lines 1-16; paragraph 78, the last 10 lines; paragraph 79; the motivation to combine is similar to that discussed above in the rejection of claim 17).
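For illustration only, one straightforward way to perform the clustering and representative-vector selection recited in claim 18 is sketched below using k-means, with each cluster's representative taken as the member vector closest to the cluster centroid; each representative would then be input to the image generator. This is a generic sketch, not the method of Liu or Wu, and the function name and cluster count are hypothetical.

```python
# Hypothetical sketch: cluster audio feature vectors and pick one representative
# vector per cluster (the member closest to the centroid).
import numpy as np
from sklearn.cluster import KMeans

def representative_vectors(audio_vectors, n_clusters=4):
    vectors = np.asarray(audio_vectors)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(vectors)
    reps = []
    for c in range(n_clusters):
        members = vectors[km.labels_ == c]
        centroid = km.cluster_centers_[c]
        closest = members[np.argmin(np.linalg.norm(members - centroid, axis=1))]
        reps.append(closest)
    return reps  # each representative is then input to the image generator to produce a thumbnail
```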
Regarding claim 19, Liu in view of Wu teaches the non-transitory computer readable storage medium of claim 18, wherein the generating the at least one thumbnail includes generating a plurality of thumbnails and outputting the plurality of generated thumbnails sequentially (Wu, paragraph 77, lines 1-16; paragraph 78, the last 10 lines; paragraph 79; Liu, figs 3 and 5B, frames 1-4; paragraph 37, lines 1-11; paragraph 38; the motivation to combine is similar to that discussed above in the rejection of claim 17).
Regarding claim 20, Liu in view of Wu teaches the non-transitory computer readable storage medium of claim 18, wherein the generating the at least one thumbnail includes generating a plurality of thumbnails, selecting a final thumbnail from the plurality of generated thumbnails, and outputting a final thumbnail (Wu, paragraph 77, lines 1-16; paragraph 78, the last 10 lines; paragraph 79; Liu, figs 3 and 5B, frames 1-4; paragraph 37, lines 1-11; paragraph 38; the motivation to combine is similar to that discussed above in the rejection of claim 17).
Conclusion
The following prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Marti (U.S. Patent Application Publication No. 2022/0337945); Selective sound modification for video communication.
Kim (U.S. Patent No. 11,501,794); Multimodal sentiment detection.
Arya (U.S. Patent No. 11,729,573); Audio enhanced augmented reality.
Chafni (U.S. Patent Application Publication No. 2023/0029608); Saliency-based compression.
Seol (U.S. Patent Application Publication No. 2024/0013462); Audio-driven facial animation with emotion support using machine learning.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID T WELCH whose telephone number is (571)270-5364. The examiner can normally be reached on Monday-Thursday, 8:30-5:30 EST, and alternate Fridays, 9:00-2:30 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao Wu, can be reached on 571-272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DAVID T WELCH/Primary Examiner, Art Unit 2613