Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 2, 3, 13 and 14 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 2 recites that processing, by the computing system, the audio spectrogram with the machine-learned audio inpainting model comprises:
splitting, by the computing system, the audio spectrogram into non-overlapping patches;
embedding, by the computing system, each patch of the non-overlapping patches into a one-dimensional representation of the patch;
concatenating, by the computing system, the one-dimensional representations into a spectrogram embedding; and
concatenating the spectrogram embedding with two-dimensional Fourier positional encodings.
It is clear that the spectrogram embedding is generated based on the input spectrogram, which is split into non-overlapping patches, and embeddings of the patches are then concatenated to form the spectrogram embedding. However, it is not clear how the recited “two-dimensional Fourier positional encodings” are formed. The claim does not specify what is encoded by the “two-dimensional Fourier positional encodings” and it is unclear if they are generated from the input spectrogram, the one-dimensional patches, the spectrogram embeddings, or some other unnamed variable. Since it cannot be determined what the “two-dimensional Fourier positional encodings” represent, or how they are formed, claim 2 is considered indefinite.
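For illustration only, the recited steps can be sketched under one possible reading, in which the positional encodings are computed from each patch's position in the time-frequency grid rather than from the spectrogram values. That reading is an assumption and is not stated in the claim, which is the source of the ambiguity noted above. The function names, the patch size, the flattening used as a stand-in for the embedding, and the number of frequency bands are all hypothetical.

```python
import math

def split_into_patches(spec, ph, pw):
    """Split a 2D list into non-overlapping ph x pw patches (row-major order).

    Returns (grid_row, grid_col, patch) tuples. Assumes, for this sketch,
    that the spectrogram dimensions divide evenly by the patch size.
    """
    rows, cols = len(spec), len(spec[0])
    if rows % ph or cols % pw:
        raise ValueError("spectrogram must divide evenly into patches")
    patches = []
    for r in range(0, rows, ph):
        for c in range(0, cols, pw):
            patch = [spec[r + i][c:c + pw] for i in range(ph)]
            patches.append((r // ph, c // pw, patch))
    return patches

def embed_patch(patch):
    """Hypothetical embedding: flatten the patch into a one-dimensional list."""
    return [v for row in patch for v in row]

def fourier_position_2d(grid_row, grid_col, n_rows, n_cols, bands):
    """Assumed encoding: sin/cos features of the patch's two grid coordinates."""
    feats = []
    for x in (grid_row / n_rows, grid_col / n_cols):  # each coordinate in [0, 1)
        for k in range(bands):
            freq = (2 ** k) * math.pi
            feats.append(math.sin(freq * x))
            feats.append(math.cos(freq * x))
    return feats

def encode(spec, ph, pw, bands):
    """Concatenate each patch embedding with its positional features."""
    n_rows, n_cols = len(spec) // ph, len(spec[0]) // pw
    return [
        embed_patch(patch) + fourier_position_2d(gr, gc, n_rows, n_cols, bands)
        for gr, gc, patch in split_into_patches(spec, ph, pw)
    ]

# Usage: a 4 x 4 toy "spectrogram" split into four 2 x 2 patches.
toy = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = encode(toy, ph=2, pw=2, bands=2)
```

Under this reading the encodings depend only on patch location. An encoding derived from the patch contents or from the concatenated embedding would be a different construction, and the claim language does not distinguish between them.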
Claim 3 depends on claim 2 and is therefore also considered indefinite.
Claim 13 recites similar limitations as claim 2 and is rejected for the same reasons as claim 2.
Claim 14 depends on claim 13 and is therefore also considered indefinite.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1, 4, 8-10, 12, 15-16 and 18 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Tang et al. (Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration, hereinafter “Tang”).
In regard to claim 1, Tang discloses a method for performing audio inpainting, the method comprising:
receiving, by a computing system comprising one or more computing devices, an audio spectrogram, the audio spectrogram including at least one portion lacking audio content (see Fig. 2, a mel-spectrogram stream is input, where a portion to be inserted lacks audio content, section 2.2);
receiving, by the computing system, a textual transcript associated with the audio spectrogram and including text corresponding to the at least one portion lacking audio content (a transcript is input and converted to a phoneme stream, section 2.2);
processing, by the computing system, the audio spectrogram and the textual transcript with a machine-learned audio inpainting model to generate replacement audio content for the at least one portion of the audio spectrogram lacking audio content (Fig. 2, a neural network model processes the input text stream and mel-spectrogram stream to generate an output mel-spectrogram, sections 2.2 and 2.3); and
outputting, by the computing system, a completed audio spectrogram, the completed audio spectrogram having the replacement audio content in place of the at least one portion of the audio spectrogram lacking audio content (the original mel-spectrograms and the generated replacement mel-spectrograms are processed by a vocoder to synthesize the audio output, section 2.1, final paragraph).
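The outputting step, as mapped above, amounts to keeping the original frames where audio is present and substituting generated frames where it is missing. A minimal sketch, assuming spectrograms are represented as lists of frames and the missing region is identified by a boolean mask. The function name and the mask representation are hypothetical and are taken from neither Tang nor the claims.

```python
def complete_spectrogram(original, generated, missing):
    """Use generated[i] where missing[i] is True, otherwise original[i]."""
    if not (len(original) == len(generated) == len(missing)):
        raise ValueError("all inputs must have the same number of frames")
    return [g if m else o for o, g, m in zip(original, generated, missing)]

# Usage: the middle frame lacks audio content and is replaced.
original = [[1.0, 1.0], [0.0, 0.0], [3.0, 3.0]]
generated = [[9.0, 9.0], [2.0, 2.0], [9.0, 9.0]]
completed = complete_spectrogram(original, generated, [False, True, False])
```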
In regard to claim 4, Tang discloses the method further comprising training, by the computing system, the machine-learned audio inpainting model (see section 3.1.2), wherein training the machine-learned audio inpainting model comprises:
receiving, by the computing system, a set of training data, the set of training data including a plurality of training audio spectrograms, each of the training audio spectrograms of the plurality of training audio spectrograms having an associated training textual transcript (audio clips and corresponding text transcripts are used for training, section 3.1.1);
generating, by the computing system, a plurality of masked audio spectrograms by masking random consecutive frames of each training audio spectrogram of the plurality of training audio spectrograms, wherein each masked audio spectrogram of the plurality of masked audio spectrograms is associated with the training audio spectrogram used to generate the masked audio spectrogram and the training textual transcript associated with the training audio spectrogram (the spectrograms of one random word in the training examples are removed during training, section 3.2);
processing, by the computing system, a masked audio spectrogram of the plurality of masked audio spectrograms and the training textual transcript associated with the masked audio spectrogram with the machine-learned audio inpainting model to generate an output spectrogram having replacement audio content in place of the masked frames of the masked audio spectrogram (the model generates spectrograms of the missing word, sections 2.3 and 3.2);
evaluating, by the computing system, a loss function that compares the output spectrogram with the training audio spectrogram associated with the masked audio spectrogram (the generated spectrogram is compared using a loss function to the original spectrogram, sections 2.3 and 3.2); and
modifying one or more parameters of the machine-learned audio inpainting model based on the loss function (neural network machine learning models inherently update their parameters during training based on the loss function).
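The masking and loss-evaluation steps recited above can be sketched as follows. This is a minimal illustration, assuming a spectrogram stored as a list of frames, a zero fill value for masked frames, and a mean absolute error as the comparison. It is not the procedure of Tang or of the application, the names are hypothetical, and the parameter update is omitted.

```python
import random

def mask_consecutive_frames(spec, span, rng, fill=0.0):
    """Mask `span` consecutive frames starting at a random index.

    Returns (masked_copy, start_index); the input spectrogram is not modified.
    """
    if not 0 < span <= len(spec):
        raise ValueError("span must be between 1 and the number of frames")
    start = rng.randrange(len(spec) - span + 1)
    masked = [list(frame) for frame in spec]
    for t in range(start, start + span):
        masked[t] = [fill] * len(spec[t])
    return masked, start

def l1_loss(output, target):
    """Mean absolute difference between two spectrograms of equal shape."""
    total, count = 0.0, 0
    for out_frame, tgt_frame in zip(output, target):
        for a, b in zip(out_frame, tgt_frame):
            total += abs(a - b)
            count += 1
    return total / count

# Usage: five identical frames, two of which are masked at a random offset.
spec = [[1.0, 2.0] for _ in range(5)]
masked, start = mask_consecutive_frames(spec, span=2, rng=random.Random(0))
baseline = l1_loss(masked, spec)  # loss if the model simply returned its input
```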
In regard to claim 8, Tang discloses the received textual transcript is an unaligned textual transcript (the model performs an alignment between the textual transcript and the spectrograms, section 2.2).
In regard to claim 9, Tang discloses the textual transcript comprises a sequence of natural language characters (English textual transcripts, section 3.1.1).
In regard to claim 10, Tang discloses the textual transcript includes textual content for the at least one portion of the audio spectrogram lacking audio content (a randomly chosen word removed from the spectrogram, section 3.2).
In regard to claim 12, Tang discloses a system for performing audio inpainting, the system comprising:
one or more processors (Nvidia Tesla V100 GPUs, section 3.1.2); and
a non-transitory, computer-readable medium storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations (the GPUs and inherent computer required to run the GPUs include memory storing instructions), the operations comprising:
receiving, by a computing system comprising one or more computing devices, an audio spectrogram, the audio spectrogram including at least one portion lacking audio content (see Fig. 2, a mel-spectrogram stream is input, where a portion to be inserted lacks audio content, section 2.2);
receiving, by the computing system, a textual transcript associated with the audio spectrogram and including text corresponding to the at least one portion lacking audio content (a transcript is input and converted to a phoneme stream, section 2.2);
processing, by the computing system, the audio spectrogram and the textual transcript with a machine-learned audio inpainting model to generate replacement audio content for the at least one portion of the audio spectrogram lacking audio content (Fig. 2, a neural network model processes the input text stream and mel-spectrogram stream to generate an output mel-spectrogram, sections 2.2 and 2.3); and
outputting, by the computing system, a completed audio spectrogram, the completed audio spectrogram having the replacement audio content in place of the at least one portion of the audio spectrogram lacking audio content (the original mel-spectrograms and the generated replacement mel-spectrograms are processed by a vocoder to synthesize the audio output, section 2.1, final paragraph).
In regard to claim 15, Tang discloses the method further comprising training, by the computing system, the machine-learned audio inpainting model (see section 3.1.2), wherein training the machine-learned audio inpainting model comprises:
receiving, by the computing system, a set of training data, the set of training data including a plurality of training audio spectrograms, each of the training audio spectrograms of the plurality of training audio spectrograms having an associated training textual transcript (audio clips and corresponding text transcripts are used for training, section 3.1.1);
generating, by the computing system, a plurality of masked audio spectrograms by masking random consecutive frames of each training audio spectrogram of the plurality of training audio spectrograms, wherein each masked audio spectrogram of the plurality of masked audio spectrograms is associated with the training audio spectrogram used to generate the masked audio spectrogram and the training textual transcript associated with the training audio spectrogram (the spectrograms of one random word in the training examples are removed during training, section 3.2);
processing, by the computing system, a masked audio spectrogram of the plurality of masked audio spectrograms and the training textual transcript associated with the masked audio spectrogram with the machine-learned audio inpainting model to generate an output spectrogram having replacement audio content in place of the masked frames of the masked audio spectrogram (the model generates spectrograms of the missing word, sections 2.3 and 3.2);
evaluating, by the computing system, a loss function that compares the output spectrogram with the training audio spectrogram associated with the masked audio spectrogram (the generated spectrogram is compared using a loss function to the original spectrogram, sections 2.3 and 3.2); and
modifying one or more parameters of the machine-learned audio inpainting model based on the loss function (neural network machine learning models inherently update their parameters during training based on the loss function).
In regard to claim 16, Tang discloses the loss function comprises a reconstruction loss function that determines a difference between the audio spectrogram used to generate the masked audio spectrogram and the completed audio spectrogram (a reconstruction loss between a generated spectrogram and a ground-truth spectrogram is used to train an inpaint model, section 2.3).
In regard to claim 18, Tang discloses the received textual transcript is an unaligned textual transcript (the model performs an alignment between the textual transcript and the spectrograms, section 2.2).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 5-7 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Tang, in view of Ahmed et al. (U.S. Patent Application Pub. No. 2023/0282202, hereinafter “Ahmed”).
In regard to claim 5, Tang discloses training the machine-learned audio inpainting model comprises:
performing, by the computing system, a first training phase, the first training phase comprising using a reconstruction loss function to train the machine-learned audio inpainting model to identify and inpaint content of the portion of the audio spectrogram with the generated output to generate the completed audio spectrogram (a reconstruction loss between a generated spectrogram and a ground-truth spectrogram is used to train an inpaint model, section 2.3).
Tang discloses that a Griffin-Lim vocoder is separately trained for simplicity, but suggests that development of a fast, high-quality, trainable spectrogram-to-waveform converter is ongoing (section 3.1.2).
Tang does not disclose, however, adversarial training to achieve higher perceptual quality and remove artifacts from the completed audio spectrogram.
Ahmed discloses a neural vocoder that is trained using adversarial training to achieve higher perceptual quality and remove artifacts from the completed audio spectrogram (a GAN vocoder is trained to achieve high quality synthesized speech, paragraphs [0301]-[0303]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to employ a second training phase comprising using adversarial training to achieve higher perceptual quality and remove artifacts from the completed audio spectrogram, because adversarial training synthesizes very high-quality speech at a low computational cost and with fast training, as taught by Ahmed (paragraph [0302]).
In regard to claim 6, Tang discloses the reconstruction loss function determines a difference between the audio spectrogram used to generate the masked audio spectrogram and the completed audio spectrogram (a distance between the generated spectrogram and the ground-truth spectrogram, section 2.3).
In regard to claim 7, Tang does not disclose the adversarial training comprises using a discriminator model to differentiate between synthesized audio spectrograms and real audio spectrograms.
Ahmed discloses the adversarial training comprises using a discriminator model to differentiate between synthesized audio spectrograms and real audio spectrograms (GAN architecture with discriminators, paragraph [0302]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use a discriminator model to differentiate between synthesized audio spectrograms and real audio spectrograms, because adversarial training synthesizes very high-quality speech at a low computational cost and with fast training, as taught by Ahmed (paragraph [0302]).
In regard to claim 17, Tang does not disclose the loss function comprises an adversarial training loss function that uses a discriminator model to differentiate between synthesized audio spectrograms and real audio spectrograms.
Ahmed discloses the loss function comprises an adversarial training loss function that uses a discriminator model to differentiate between synthesized audio spectrograms and real audio spectrograms (GAN architecture with discriminators, paragraph [0302]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use an adversarial training loss function that uses a discriminator model to differentiate between synthesized audio spectrograms and real audio spectrograms, because adversarial training synthesizes very high-quality speech at a low computational cost and with fast training, as taught by Ahmed (paragraph [0302]).
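For context, the role of a discriminator in adversarial training can be sketched with the textbook GAN objective. This is a generic formulation, assuming the discriminator outputs a probability that its input is real. It is not asserted to be the specific loss used by Ahmed or by the application, and the function names are hypothetical.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy with real inputs labelled 1 and synthesized inputs 0.

    d_real, d_fake: discriminator probabilities on real and synthesized
    spectrograms. Lower values mean the two are told apart more reliably.
    """
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: low when synthesized output is scored as real."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)

# Usage: a discriminator that cannot tell real from synthesized (all scores 0.5).
undecided_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])
undecided_g = generator_loss([0.5, 0.5])
```

In such a scheme the two losses are minimized in alternation, so the generator is pushed toward outputs the discriminator cannot separate from real spectrograms.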
Claim(s) 11 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Tang, in view of Jaegle et al. (Perceiver: General Perception with Iterative Attention, hereinafter “Jaegle”).
In regard to claims 11 and 19, Tang discloses a transformer-based model architecture (section 2.2), but does not disclose a learned latent query and a learned output query.
Jaegle discloses a more efficient means of representing data in transformer models, where the models comprise a learned latent query and a learned output query (learned latent queries producing learned output queries through a Transformer tower, section 3.1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply a learned latent query and a learned output query to the machine-learned audio inpainting model, because it would eliminate the quadratic scaling problem of a classical Transformer model and allow very deep models to be constructed, as taught by Jaegle (section 1, final two paragraphs).
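The scaling rationale can be illustrated with a bare cross-attention step in which a small, fixed number of latent queries attends over a longer input. This is a simplified sketch, not Jaegle's implementation: it omits learned projections, multiple heads, and any output-query decoding stage, and the latent values here are constants standing in for learned parameters.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

# Usage: N = 2 latent queries over M = 4 input elements.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 2.0]]
latents = [[0.0, 0.0], [0.0, 0.0]]  # placeholder for learned latent queries
summary = cross_attention(latents, inputs, inputs)
```

The step computes N x M attention scores, whereas self-attention over the same input would compute M x M, so the cost grows linearly rather than quadratically in the input length when N is held fixed.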
Claim(s) 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Tang, in view of Official Notice.
Claim 20 recites a non-transitory, computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations, the operations comprising the same method as recited in claim 1. Tang discloses this method (see rejection of claim 1, above).
Tang also discloses the method is executed on a computer comprising GPUs (section 3.1.2).
However, Tang does not expressly disclose a non-transitory computer readable medium storing the instructions that cause the processors to perform the operations of claim 1.
Official Notice is taken that it is notoriously well-known in the art to create a non-transitory computer-readable medium comprising instructions for executing an algorithm when given a description of that algorithm (such as the disclosed claimed algorithm).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to store instructions for performing the operations recited in the claim on a non-transitory computer readable medium because, as is widely known in the art, it would allow the instructions to be provided and executed by any computer capable of reading the instructions from the medium.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Perucci et al., Li et al., Prablanc, Schechner et al., Kegler et al., and Tan et al. disclose additional methods for speech inpainting.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRIAN LOUIS ALBERTALLI whose telephone number is (571)272-7616. The examiner can normally be reached M-F 8AM-3PM, 4PM-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
BLA 3/21/26
/BRIAN L ALBERTALLI/ Primary Examiner, Art Unit 2656