Notice of Pre-AIA or AIA Status
● The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
● This action is responsive to the following communication: US Patent Application filed on 12/20/2023.
● Claims 1-11 are currently pending.
Information Disclosure Statement
● The information disclosure statement (IDS) submitted on 4/2/2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
Claim limitations “prediction unit” and “synthesis unit” have been interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because they use the generic placeholder “unit” coupled with functional language without reciting sufficient structure to achieve the function. Furthermore, the generic placeholder is not preceded by a structural modifier. Claim elements in this application that use the word “unit” are presumed to invoke 35 U.S.C. 112(f) except as otherwise indicated in an Office action. Similarly, claim elements that do not use the word “unit” are presumed not to invoke 35 U.S.C. 112(f) except as otherwise indicated in an Office action.
Since these claim limitations invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, claim 10 has been interpreted to cover the corresponding structure described in the specification that achieves the claimed function, and equivalents thereof.
A review of the specification shows that the following appears to be the corresponding structure described in the specification for the 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph limitation:
If applicant wishes to provide further explanation or dispute the examiner’s interpretation of the corresponding structure, applicant must identify the corresponding structure with reference to the specification by page and line number, and to the drawing, if any, by reference characters in response to this Office action.
If applicant does not intend to have the claim limitations treated under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may amend the claim so that it will clearly not invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, or present a sufficient showing that the claim recites sufficient structure, material, or acts for performing the claimed function to preclude application of 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
For more information, see MPEP § 2173 et seq. and Supplementary Examination Guidelines for Determining Compliance With 35 U.S.C. 112 and for Treatment of Related Issues in Patent Applications, 76 FR 7162, 7167 (Feb. 9, 2011).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-8, and 10-11 are rejected under 35 U.S.C. 103 as being unpatentable over Kotaro et al. (JP-2023014765; an English translation is provided herewith) in view of Bai et al. (A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing, 13 pages).
Regarding claim 1, Kotaro discloses a speech synthesis (speech synthesizer, abstract) method comprising:
a step of predicting a duration (a duration predictor 30 is provided, and the duration predictor is trained to minimize the error between the predicted value of the length of each phoneme predicted by the duration predictor and the length of the phoneme extracted from the voice data by the feature extractor 26, see page 3 of English translation) of each phoneme;
a step of encoding (encoder 20 receives the input data, including text data or phoneme data, and converts it into phoneme latent representations, see page 3 of English translation) the text to be synthesized and extracting (a feature quantity extractor 26 which extracts a length of a phoneme as a feature quantity from a source speaker speech as to the detail information, abstract) a text sequence which is expressed by feature information of the text;
a step of generating a speech frame sequence by regulating a length (the length adjustment unit 32 adjusts the length of each phoneme of the phoneme latent representation included in the input data input from the encoder 20 based on the predicted value of the length of each phoneme predicted by the duration predictor 30, and outputs them as extended phoneme latent representations. Therefore, the length adjustment unit 32 determines the length of the synthesized speech to be generated, see page 3 of English translation) of each phoneme of the text sequence according to the predicted duration of each phoneme;
a step of synthesizing a speech (speech synthesizer 100, see abstract and page 2 of English translation for details) from the generated speech frame sequence.
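For the reader's orientation only, the pipeline mapped above (encode text, predict per-phoneme durations, length-regulate, decode speech frames) can be illustrated with a minimal Python sketch. All function names, shapes, and values below are assumptions for illustration and are not drawn from Kotaro or Bai.

    import numpy as np

    def encode(phonemes):
        # Stand-in for encoder 20: map each phoneme to a fixed latent vector.
        rng = np.random.default_rng(0)
        table = {p: rng.standard_normal(8) for p in set(phonemes)}
        return np.stack([table[p] for p in phonemes])        # (P, 8)

    def predict_durations(latents):
        # Stand-in for duration predictor 30: frames per phoneme.
        return np.full(len(latents), 3, dtype=int)           # (P,)

    def length_regulate(latents, durations):
        # Stand-in for length adjustment unit 32: repeat each latent
        # by its predicted duration to form a speech frame sequence.
        return np.repeat(latents, durations, axis=0)         # (sum(durations), 8)

    def synthesize(frames):
        # Stand-in for decoder 24: collapse each frame to one "sample".
        return frames.sum(axis=1)

    latents = encode(["h", "e", "l", "o"])
    frames = length_regulate(latents, predict_durations(latents))
    speech = synthesize(frames)
    print(frames.shape, speech.shape)                        # (12, 8) (12,)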
Kotaro fails to expressly teach and/or suggest speech mask.
Bai, in the same field of endeavor of speech synthesis, teaches a well-known example of a speech mask (a duration predictor predicts the duration of a phoneme in a masked spectrogram (fig. 3, section 3), and speech is restored using A3T by inputting reference speech, MASK, prompt text, and target text into a TTS model, fig. 4, section 3).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the speech synthesis of Kotaro to include a speech mask, as taught by Bai, in order to enhance speech quality (e.g., to avoid speech interference such as noise by using a speech mask).
Therefore, it would have been obvious to combine Kotaro with Bai to obtain the invention as specified in claim 1.
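For illustration of what a speech mask over spectrogram frames looks like in practice, a minimal Python sketch follows. The constant mask value, shapes, and function name are assumptions for illustration, not Bai's implementation.

    import numpy as np

    def apply_speech_mask(spectrogram, start, length, mask_value=0.0):
        # Replace a contiguous span of (frames, mel_bins) spectrogram
        # frames with a MASK value and return a boolean mask of the span.
        masked = spectrogram.copy()
        masked[start:start + length, :] = mask_value
        mask = np.zeros(len(spectrogram), dtype=bool)
        mask[start:start + length] = True
        return masked, mask

    spec = np.random.rand(100, 80)             # 100 frames, 80 mel bins
    masked_spec, mask = apply_speech_mask(spec, start=40, length=20)
    print(mask.sum())                          # 20 masked frames to restore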
Regarding claim 2, Kotaro further teaches the speech synthesis method of claim 1, wherein the length (the speech length (time length) of the speech data of the source speaker and that of the speech data of the target speaker synthesized by the speech synthesizer 100 are adjusted to be equal. That is, the target speaker's speech of the same length (time length) as the source speaker's speech is synthesized as parallel data. Therefore, there is no need for a technique such as dynamic time warping (DTW) for matching the length (time length) of the source speaker's speech and the target speaker's speech to be synthesized, see page 5 of English translation) of the speech mask is the length of the speech which is synthesized at the step of the synthesizing.
Regarding claim 3, Bai further teaches the speech synthesis method of claim 2, wherein the length of the speech mask (speech mask length, section 2.2, also see fig. 3) is set by a user.
Regarding claim 5, Kotaro further teaches the speech synthesis method of claim 1, wherein the step of predicting comprises predicting a duration (a duration prediction 30 is provided, and the duration predictor is trained to minimize the error between the predicted value of the length of each phoneme predicted by the duration predictor and the length of the phoneme extracted from the voice data for the feature extractor 26, see page 3 of English translation) of each phoneme corresponding to a speech prompt and a duration of each phoneme corresponding to the speech mask, from the speech prompt, a text prompt which is text information of the speech prompt, the speech mask, and the text to be synthesized with the speech mask.
Regarding claim 6, Kotaro further teaches the speech synthesis method of claim 5, wherein the step of predicting comprises concatenating the speech prompt, the text prompt, the speech mask, and the text to be synthesized, and inputting the concatenated information (in constructing the speech conversion model, learning data obtained by combining the speech data of the source speaker, which is the source of conversion, and the speech data of the target speaker, which is the destination of conversion, is used. Speech data of the target speaker is used as teacher data in the learning data. In this embodiment, the speech data of the source speaker and the speech data of the target speaker, which is parallel data synthesized by the speech synthesizer 100 from the speech data of the source speaker and information such as text indicating the content of the speech data of the source speaker, are combined and used as learning data, page 6 of English translation) to a prediction model which is trained to predict a duration of a phoneme. Bai also teaches: “The key idea is to concatenate the prompt and the target together into a new utterance input, where the target speech consists of n [MASK] and n is predicted by a duration predictor. By inputting the concatenated speech and text, the A3T model will predict the spectrogram of these masked frames. The role of the reference text and speech in our model is similar to prompts in language models (Brown et al., 2020), and hence we call it prompt-based decoding/generation.” See section 3.5.
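As an illustration of the concatenation discussed above, the following Python sketch builds one model input from a speech prompt, n MASK frames for the target, and the prompt/target phoneme sequences, in the spirit of Bai's prompt-based decoding (section 3.5). Tensor shapes, phoneme labels, and the value of n are assumptions for illustration.

    import numpy as np

    mel_bins = 80
    speech_prompt = np.random.rand(120, mel_bins)   # reference speech frames
    n_mask = 60                                     # n, from a duration predictor
    mask_frames = np.zeros((n_mask, mel_bins))      # n [MASK] frames for the target

    # Speech stream: prompt frames followed by the masked target frames.
    speech_input = np.concatenate([speech_prompt, mask_frames], axis=0)

    # Text stream: prompt-text phonemes followed by target-text phonemes.
    text_input = ["DH", "AH", "K", "AE", "T"] + ["S", "AE", "T"]

    print(speech_input.shape, len(text_input))      # (180, 80) 8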
Regarding claim 7, Bai further teaches the speech synthesis method of claim 5, wherein the step of predicting comprises predicting a speech frame-phoneme alignment (alignment preprocessing, section 2.2.2) on the speech mask.
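For illustration, a speech frame-phoneme alignment over a masked region can be expressed by expanding per-phoneme durations into one phoneme label per frame. This Python sketch uses assumed phoneme labels and durations; it is not the alignment preprocessing of Bai's section 2.2.2 itself.

    import numpy as np

    def frames_to_phonemes(phonemes, durations):
        # Frame-level alignment: durations[i] frames labeled phonemes[i].
        return np.repeat(phonemes, durations)

    masked_phonemes = ["S", "AE", "T"]
    masked_durations = [5, 12, 7]                   # predicted frames per phoneme
    alignment = frames_to_phonemes(masked_phonemes, masked_durations)
    print(len(alignment), alignment[:6])            # 24 ['S' 'S' 'S' 'S' 'S' 'AE']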
Regarding claim 8, Kotaro further teaches the speech synthesis method of claim 5, wherein the speech prompt is expressed by a speech feature vector, and wherein the speech feature is one of an MFCC, a Mel-spectrogram (in the speech conversion device 200, the linear conversion layer 42, encoder 20, pitch predictor 34, energy predictor 36, and decoder 24 are trained by machine learning. That is, the pitch predictor 34 is trained to minimize the error between the pitch of each phoneme it predicts and the pitch of the phoneme extracted from the speech data of the target speaker by the feature quantity extractor 26. Likewise, the energy predictor 36 is trained to minimize the error between the magnitude (energy) of the phoneme it predicts and the magnitude (energy) of the phoneme extracted from the speech data of the target speaker by the feature amount extractor 26. The decoder 24 is also trained to minimize the error between the mel-spectrogram extracted from the speech data of the target speaker by the feature extractor 26 and the mel-spectrogram generated by the decoder 24. The linear transform layer 42 and encoder 20 are likewise trained such that the training in the duration predictor 30, pitch predictor 34, energy predictor 36, and decoder 24 is optimized, see page 7 of English translation), and a spectrogram.
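For illustration, the three recited feature types can be computed with the generic librosa toolkit. This is common signal-processing tooling, not the implementation of either reference; the sample rate, tone, and bin counts are assumptions.

    import numpy as np
    import librosa

    sr = 22050
    y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)  # Mel-spectrogram
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # MFCCs
    spec = np.abs(librosa.stft(y)) ** 2                          # plain spectrogram
    print(mel.shape, mfcc.shape, spec.shape)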
Regarding claims 10-11, these claims recite limitations that are similar to, and in the same scope of invention as, those in claim 1 above; therefore, claims 10-11 are rejected under the same rationale/basis as described for claim 1.
Claims 4 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Kotaro/Bai as applied to claims 1-3, 5-8, and 10-11 above, and further in view of Gupta et al. (US 20240155071).
Regarding claims 4 and 9, the combination of Kotaro/Bai fails to teach and/or suggest zero-padding and up-sampling.
Gupta, in the same field of endeavor, teaches well-known examples of zero-padding (par. 60) and up-sampling (par. 60).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the speech synthesis of Kotaro/Bai to include the zero-padding and up-sampling steps taught by Gupta in order to create a smooth signal.
Therefore, it would have been obvious to combine Kotaro/Bai with Gupta to obtain the invention as specified in claims 4 and 9.
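For illustration, zero-padding and up-sampling of a frame sequence are generic signal-processing steps, sketched below in Python. The padding amount, hop size, and nearest-neighbor repetition are assumptions for illustration, not Gupta's paragraph 60 implementation.

    import numpy as np

    frames = np.random.rand(50, 80)                 # (frames, mel_bins)

    # Zero-pad the time axis to a fixed length, e.g., for batching.
    padded = np.pad(frames, ((0, 14), (0, 0)))      # -> (64, 80)

    # Up-sample frames toward sample rate by repetition; smoother
    # interpolation could be used to avoid discontinuities in the signal.
    hop = 256                                       # assumed samples per frame
    upsampled = np.repeat(padded, hop, axis=0)      # -> (16384, 80)
    print(padded.shape, upsampled.shape)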
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to THIERRY L PHAM whose telephone number is (571)272-7439. The examiner can normally be reached M-F, 11-6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/THIERRY L PHAM/Primary Examiner, Art Unit 2654