DETAILED ACTION
This communication is in response to the Application filed on 08/06/2024. Claims 1-20 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 11/15/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1, 6-7, 11, and 15 are rejected on the ground of nonstatutory double patenting as being unpatentable over claim 18 of U.S. Patent No. 12087275. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the issued patent are similar in scope to those of the instant application except for slight wording differences. The claims of the issued patent therefore map to each of the limitations of the instant application and anticipate the claims set forth. Please see the Mapping Table below.
Each of the dependent claims listed above maps to independent claim 18.
Claims 3-5, 10, and 14 are rejected on the ground of nonstatutory double patenting as being unpatentable over claim 18 of U.S. Patent No. 12087275 in view of Jia (as cited below).
Each of these dependent claims is rejected based on the mappings to Jia noted in the prior art rejections below for claims 3-5, 10, and 14. It would have been obvious to have modified the TTS model as taught by the issued patent with the TTS model of Jia in order to leverage the knowledge of speaker variability learned by the speaker encoder and thereby generalize well and synthesize natural speech from speakers never seen during training (see Jia [0005]).
Mapping Table
Instant Application No. 18/795,734, claim 1:
1. A computing system for novel speaker generation, the computing system comprising:
one or more processors; and
one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:
obtaining an input dataset, wherein the input dataset comprises input text data and one or more speaker preferences, wherein the one or more speaker preferences are descriptive of one or more speaker characteristics;
processing the one or more speaker preferences with a first machine-learned model to determine a speaker embedding in an embedding space;
processing the input text data and the speaker embedding with a second machine-learned model to generate predicted speech data; and
providing the predicted speech data, wherein the predicted speech data comprises data descriptive of one or more sound waves, wherein the predicted speech data differs from a plurality of training speech examples associated with a plurality of training datasets used to train the first machine-learned model and the second machine-learned model.
Issued Patent (Application No. 17/673,417), claim 18:
18. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations, the operations comprising:
obtaining an input dataset, wherein the input dataset comprises phoneme data and speaker metadata;
processing the speaker metadata with a first machine-learned model to determine a speaker embedding in a learned embedding space;
processing the phoneme data and the speaker embedding with a second machine-learned model to generate predicted speech data; and
providing the predicted speech data, wherein the predicted speech data comprises data descriptive of one or more sound waves, wherein the predicted speech data differs from a plurality of training speech examples associated with a plurality of training datasets used to train the first machine-learned model and the second machine-learned model.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1, 3-7, 10-12, 14-15, and 17-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Jia et al. (WO 2019/222591 A1).
As to claims 1, 11, and 17, Jia teaches a computing system for novel speaker generation, the computing system comprising:
one or more processors (see [0064], processor or multiple processors); and
one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising (see [0066], where storage device described which includes instructions):
obtaining an input dataset, wherein the input dataset comprises input text data and one or more speaker preferences (see [0027], where input text together with an audio representation of a target speaker is described), wherein the one or more speaker preferences are descriptive of one or more speaker characteristics (see [0027], where the audio representation is of a target speaker and see [0028], where speaker vector may capture the characteristic speaking rate of the speaker);
processing the one or more speaker preferences with a first machine-learned model to determine a speaker embedding in an embedding space (see [0028]-[0029], where speaker vector/speaker embedding is determined by way of a neural network such as a LSTM and see [0039], which goes into more detail on the LSTM and embedding spaces);
processing the input text data and the speaker embedding with a second machine-learned model to generate predicted speech data (see [0053], where the input text and the speaker embedding vector are input to a spectrogram generation neural network for generation of audio of the input text); and
providing the predicted speech data, wherein the predicted speech data comprises data descriptive of one or more sound waves, wherein the predicted speech data differs from a plurality of training speech examples associated with a plurality of training datasets used to train the first machine-learned model and the second machine-learned model (see [0053], where the system generates audio representation of text in the voice of the target speaker and see [0056], where speaker embedding vector is different from speaker embedding vectors used in training the spectrogram generation neural network and parameters of the speaker embedding neural network).
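For orientation only, the following Python sketch illustrates the two-model pipeline recited in the mapping above: a first model that produces a speaker embedding and a second model that conditions on text features plus that embedding. All class and parameter names are hypothetical stand-ins and are not drawn from Jia or the instant application.

# Illustrative sketch only; SpeakerEncoder and Synthesizer are hypothetical
# stand-ins for the first and second machine-learned models.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    # First model: maps speaker audio features to a fixed-size embedding.
    def __init__(self, feat_dim=80, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, emb_dim, batch_first=True)

    def forward(self, speaker_feats):  # (batch, time, feat_dim)
        _, (h, _) = self.lstm(speaker_feats)
        return torch.nn.functional.normalize(h[-1], dim=-1)  # unit-norm embedding

class Synthesizer(nn.Module):
    # Second model: maps text features plus the embedding to spectrogram frames.
    def __init__(self, text_dim=128, emb_dim=256, mel_dim=80):
        super().__init__()
        self.proj = nn.Linear(text_dim + emb_dim, mel_dim)

    def forward(self, text_feats, speaker_emb):  # (batch, time, text_dim)
        emb = speaker_emb.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        return self.proj(torch.cat([text_feats, emb], dim=-1))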
As to claims 11 and 17, apparatus claims 1 and 17 and method claim 11 are related as an apparatus and the method of using the same, with each claimed function corresponding to the claimed apparatus function. Accordingly, claims 11 and 17 are similarly rejected under the same rationale as applied above with respect to the apparatus claim.
As to claim 3, Jia teaches wherein generating the predicted speech data comprises autoregressively predicting a spectrogram sequence based on the input text data and the speaker embedding (see [0061]-[0062], where the audio representation of the input text can be a time-domain representation produced by a vocoder, which can be a sample-by-sample autoregressive WaveNet). (The examiner notes that, per paragraph [0053], the vocoder receives this speaker embedding.)
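For illustration only, the following sketch shows the general shape of sample-by-sample autoregressive generation of the kind a WaveNet-style vocoder performs; vocoder_step is a hypothetical single-step predictor, not a function from the cited reference.

# Schematic of autoregressive generation: each new sample is predicted from
# the conditioning spectrogram and the samples generated so far.
def generate_autoregressive(vocoder_step, spectrogram, num_samples):
    history = []
    for _ in range(num_samples):
        nxt = vocoder_step(spectrogram, history)  # predict the next sample
        history.append(nxt)                       # feed the prediction back in
    return history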
As to claim 4, Jia teaches wherein the predicted speech data comprises audio frequency data, wherein the audio frequency data is descriptive of a mel spectrogram representation (see [0034], where the audio representation generated by the spectrogram generation engine can be a mel spectrogram).
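As background on the recited representation, a mel spectrogram can be computed with a standard library call; the sketch below uses librosa with typical (assumed) parameter values and a placeholder filename.

# Standard mel-spectrogram computation; parameters are illustrative defaults.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)      # placeholder input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)                # log scale, common in TTS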
As to claim 5, Jia teaches wherein the audio frequency data differs from a plurality of training audio frequency datasets associated with a plurality of training datasets used to train the first machine-learned model and the second machine-learned model (see [0007], [0043], where the system synthesizes speech for a novel speaker, unseen during training, from a single short clip).
As to claim 6, Jia teaches wherein the first machine-learned model comprises an embedding model (see [0028]-[0029], and [0038]-[0039], where LSTM speaker encoder is the embedding model).
As to claim 7, Jia teaches wherein the second machine-learned model comprises a generation model (see [0053], spectrogram generation neural network).
As to claims 10 and 18, Jia teaches wherein processing the one or more speaker preferences with the first machine-learned model to determine the speaker embedding in the learned embedding space comprises: determining the speaker embedding based on a learned distribution within the embedding space (see [0052], where speaker embedding vectors are generated such that representations of speech from the same speaker are closer together while those of different speakers are distant from each other, and see [0039], where the LSTM speaker encoder generates the speaker embedding whose cosine similarity to other embeddings in the space has been interpreted as the learned distribution). Claim 18 additionally recites wherein the speaker embedding is representative of a desired speaker (the citations provided for claim 10 address this aspect as well).
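For illustration of the geometry being relied upon, the following sketch computes cosine similarity between speaker embeddings; under the interpretation above, same-speaker embeddings score high while different speakers score low. The vectors shown are toy placeholders, not data from the reference.

# Cosine similarity in the embedding space (toy vectors for illustration).
import torch
import torch.nn.functional as F

emb_a = torch.randn(256)  # placeholder embeddings, not real speaker data
emb_b = torch.randn(256)
similarity = F.cosine_similarity(emb_a, emb_b, dim=-1)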
As to claim 12, Jia teaches further comprising:
evaluating, by the computing system, a loss function that evaluates a difference between training audio data of a training dataset and the predicted speech data (see [0045], where the reconstruction loss is the loss between the reconstructed speech and the input/ground truth); and
adjusting, by the computing system, one or more parameters of at least one of the embedding model or the generation model based at least in part on the loss function (see [0045], where the reconstruction loss can be used to drive other parts of system 200, and see Figure 2, where there is feedback from decoder 226 into attention 224).
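For illustration only, a minimal training step of the kind the claim recites: a reconstruction loss is evaluated between predicted and ground-truth spectrograms, and gradients from that loss adjust the parameters of both models. All names are hypothetical (encoder and synthesizer as in the earlier sketch).

# Illustrative train step: evaluate a reconstruction loss, then adjust
# parameters of both models based on that loss.
import torch

def train_step(encoder, synthesizer, optimizer, speaker_feats, text_feats, target_mel):
    emb = encoder(speaker_feats)
    pred_mel = synthesizer(text_feats, emb)
    loss = torch.nn.functional.mse_loss(pred_mel, target_mel)  # reconstruction loss
    optimizer.zero_grad()
    loss.backward()   # gradients flow into both models
    optimizer.step()  # parameter adjustment based on the loss
    return loss.item()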
As to claim 14, Jia teaches wherein the predicted speech data comprises audio data descriptive of a phoneme sequence of the input text data spoken by a synthetic speaker (see [0057], where the spectrogram generation neural network is trained to predict mel spectrograms from a sequence of phonemes, based on Tacotron 2).
As to claim 15, Jia teaches wherein the embedding model and the generation model are part of a neural-network-based text-to-speech model (see [0040], where the synthesizer can be an end-to-end synthesis network, and see Figure 2, where the speaker encoder 210 (claim 1 mapping) and synthesizer 220 are implemented using neural networks).
As to claim 19, Jia teaches wherein the predicted speech data comprises audio data descriptive of a phoneme sequence spoken by the desired speaker (see [0057], where the spectrogram generation neural network is trained to predict mel spectrograms from a sequence of phonemes, based on Tacotron 2).
As to claim 20, Jia teaches wherein the speaker embedding differs from each of a plurality of training embeddings, wherein the plurality of training embeddings are associated with a plurality of training datasets used for training a text-to-speech model comprising the first machine-learned model and the second machine-learned model (see [0056], where the speaker embedding vector is different from any speaker embedding vectors used during training of the speaker verification neural network or the spectrogram generation neural network).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 2 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Jia in view of Wang et al. (US 2021/0280197).
As to claim 2, Jia teaches all of the limitations as in claim 1.
Furthermore, Jia teaches processing one or more training examples with the first machine-learned model to generate one or more particular speaker embeddings (see [0028]-[0029], where the embedding vector is created by the speaker encoder, and see the training described in [0030]).
However, Jia does not specifically teach wherein the speaker embedding is determined based on one or more learned distributions, wherein the one or more learned distributions were determined based on: determining the one or more learned distributions based on annotating the one or more particular speaker embeddings with one or more speaker labels associated with the one or more training examples.
Wang does teach wherein the speaker embedding is determined based on one or more learned distributions, wherein the one or more learned distributions were determined based on: determining the one or more learned distributions based on annotating the one or more particular speaker embeddings with one or more speaker labels associated with the one or more training examples (see [0057], where probability distribution is predicted based on corresponding speaker label).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the TTS model as taught by Jia with the distribution as taught by Wang in order to produce speaker boundaries (see Wang [0003]), which would benefit Jia's speaker encoder model, which is also used to verify a speaker's presence; Wang would further provide the benefit of allowing a speaker's presence to be determined and verified.
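For illustration of the concept at issue (not Wang's disclosed method), one simple way to obtain a learned distribution from labeled training embeddings is to annotate each embedding with its speaker label and fit per-label statistics; the sketch below is hypothetical.

# Hypothetical sketch: fit a diagonal Gaussian per speaker label over
# embeddings that have been annotated with those labels.
import numpy as np
from collections import defaultdict

def fit_label_distributions(embeddings, labels):
    by_label = defaultdict(list)
    for emb, lab in zip(embeddings, labels):
        by_label[lab].append(emb)  # annotate embeddings with speaker labels
    return {lab: (np.mean(v, axis=0), np.var(v, axis=0))  # per-label mean/variance
            for lab, v in by_label.items()}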
As to claim 13, Jia teaches all of the limitations as in claim 12.
However, Jia does not specifically teach determining, by the computing system, a prior distribution based at least in part on the speaker embedding in the embedding space and a speaker label of the training dataset, wherein the speaker label comprises one or more specific speaker characteristics associated with a speaker.
Wang does teach determining, by the computing system, a prior distribution based at least in part on the speaker embedding in the embedding space and a speaker label of the training dataset, wherein the speaker label comprises one or more specific speaker characteristics associated with a speaker (see [0049], where a probability distribution is predicted using an RNN based on the previous speaker label sequence and the prior sequence of embeddings).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the TTS model as taught by Jia with the distribution as taught by Wang in order to produce speaker boundaries (see Wang [0003]), which would benefit Jia's speaker encoder model, which is also used to verify a speaker's presence; Wang would further provide the benefit of allowing a speaker's presence to be determined and verified.
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Jia in view of Cho et al. (“Learning Speaker Embedding from Text-to-Speech”, 2020).
As to claim 8, Jia teaches all of the limitations as in claim 1.
However, Jia does not specifically teach wherein the first machine-learned model and the second machine-learned model were jointly trained.
Cho does teach wherein the first machine-learned model and the second machine-learned model were jointly trained (see page 2, left column, paragraph after eqn 4, where vocoder and speaker encoder are jointly trained).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the training of the TTS model as taught by Jia with the joint training as taught by Cho in order to build a controllable, robust TTS system (see Cho, page 1, right column, lines 4-5).
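For illustration only, joint training in the sense discussed can be sketched as optimizing both models against a single objective with one optimizer; the names reuse the hypothetical modules from the earlier sketch and do not reflect Cho's implementation.

# Joint training sketch: one optimizer updates both models from one loss.
import itertools
import torch

encoder = SpeakerEncoder()       # hypothetical modules from the earlier sketch
synthesizer = Synthesizer()
optimizer = torch.optim.Adam(
    itertools.chain(encoder.parameters(), synthesizer.parameters()), lr=1e-4)
# Each call to train_step (above) then updates both models jointly.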
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Jia in view of Cho, as applied to claim 8 above, and further in view of Hsu et al. ("Text-Free Image-to-Speech Synthesis Using Learned Segmental Units").
As to claim 9, Jia in view of Cho teaches all of the limitations as in claim 8.
However, Jia does not specifically teach wherein the first machine-learned model and the second machine-learned model are part of a two-level maximum likelihood estimation model trained in order to learn a distribution over speaker embeddings.
Hsu does teach wherein the first machine-learned model and the second machine-learned model are part of a two-level maximum likelihood estimation model trained in order to learn a distribution over speaker embeddings (see page 5288, sect. 3.5, last paragraph, where two MLE estimations are based on the two models). (The examiner notes that although the reference inputs images instead of text, one skilled in the art would readily be able to substitute the image aspects with text aspects per the primary reference of Jia in order to perform similar training, as both architectures are based on a two-model approach.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the TTS model as taught by Jia in view of Cho with the two-level MLE as taught by Hsu in order to optimize the models for better accuracy, as is evident from the usage of objective functions (see Hsu, page 5288, sect. 3.5, last paragraph).
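For illustration of the two-level structure being discussed (not Hsu's model), a two-level maximum-likelihood objective can be sketched as one likelihood term for the speech given the embedding and a second for the embedding under a learned prior; the Gaussian prior here is an assumption made for the sketch.

# Hypothetical two-level MLE objective (negative log-likelihood form),
# assuming a diagonal-Gaussian prior over speaker embeddings.
import torch

def two_level_nll(pred_mel, target_mel, emb, prior_mu, prior_logvar):
    level1 = torch.nn.functional.mse_loss(pred_mel, target_mel)  # -log p(x | z), up to constants
    level2 = 0.5 * (((emb - prior_mu) ** 2) / prior_logvar.exp()
                    + prior_logvar).sum()                        # -log p(z), up to constants
    return level1 + level2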
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Jia in view of Kim et al. (US 2020/0082807).
As to claim 16, Jia teaches all of the limitations as in claim 11.
However, Jia does not specifically teach wherein the embedding model and the generation model are part of a recurrent attention-based text-to-speech model.
Kim does teach wherein the embedding model and the generation model are part of a recurrent attention-based text-to-speech model (see [0085], where attention RNN as part of the TTS model).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the TTS model as taught by Jia with the attention RNN as taught by Kim in order to generate synthesized speech reflecting the articulatory features of the first speaker, as well as emotion, prosody, voice tone, and pitch (see Kim [0085]).
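For illustration of the recited architecture style only (not Kim's disclosed model), one step of a recurrent attention-based decoder can be sketched as an RNN state attending over encoded text before predicting the next frame; all modules and dimensions are hypothetical.

# Hypothetical single step of a recurrent attention-based TTS decoder.
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    def __init__(self, enc_dim=256, hid_dim=256, mel_dim=80):
        super().__init__()
        self.rnn = nn.GRUCell(enc_dim + mel_dim, hid_dim)
        self.score = nn.Linear(hid_dim, enc_dim)
        self.out = nn.Linear(hid_dim + enc_dim, mel_dim)

    def forward(self, prev_frame, hidden, enc_states):
        # prev_frame: (B, mel_dim); hidden: (B, hid_dim); enc_states: (B, T, enc_dim)
        scores = torch.bmm(enc_states, self.score(hidden).unsqueeze(-1)).squeeze(-1)
        attn = torch.softmax(scores, dim=-1)                    # attention weights over text
        context = torch.bmm(attn.unsqueeze(1), enc_states).squeeze(1)
        hidden = self.rnn(torch.cat([prev_frame, context], dim=-1), hidden)
        return self.out(torch.cat([hidden, context], dim=-1)), hidden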
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PARAS D SHAH whose telephone number is (571)270-1650. The examiner can normally be reached Monday-Thursday 7:30AM-2:30PM, 5PM-7PM (EST), Friday 8AM-noon (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Paras D Shah/Supervisory Patent Examiner, Art Unit 2653
02/14/2026