Last updated: May 29, 2026

Application No. 18/457,221

UNSUPERVISED ALIGNMENT FOR TEXT TO SPEECH SYNTHESIS USING NEURAL NETWORKS

Final Rejection §103

Filed

Aug 28, 2023

Priority

Oct 07, 2021 — continuation of 11/769,481 +2 more

Examiner

PATEL, SHREYANS A

Art Unit

2659

Tech Center

2600 — Communications

Assignee

Nvidia Corporation

OA Round

2 (Final)

Interview Optional

+7.7% interview lift. An interview has already been conducted in this application's prosecution history. This examiner has an 89% grant rate with a +7.7% interview lift. Because an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.

Based on 406 resolved cases, 2023–2026

Examiner Intelligence

PATEL, SHREYANS A

Grants 89% — above average

Career Allowance Rate

361 granted / 406 resolved

+26.9% vs TC avg

Moderate +8% lift

+7.7%

Interview Lift

resolved cases with interview

Fast prosecutor

2y 0m

Avg Prosecution

26 currently pending

Career history

449

Total Applications

across all art units

Statute-Specific Performance

§101

11.0%

-29.0% vs TC avg

§103

67.1%

+27.1% vs TC avg

§102

11.6%

-28.4% vs TC avg

§112

0.8%

-39.2% vs TC avg

Tech Center averages are estimates. Based on career data from 406 resolved cases.

Office Action

§103

DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments with respect to the 35 U.S.C. 101 abstract idea rejection of claims 1-20 have been considered and found persuasive due to the amendments, and the rejection has been withdrawn.
Applicant's arguments under 35 U.S.C. 102 regarding claims 1, 9 and 16 have been considered but are moot due to the new grounds of rejection necessitated by the amendments. See the detailed rejection below.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-2 and 5-8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Binkowski et al. (“High Fidelity Speech Synthesis with Adversarial Networks”; Sep. 25, 2019; DeepMind) in view of Kim et al. (“Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech”; 2021; Proceedings of the 38th International Conference on Machine Learning, PMLR).

Claim 1,
Binkowski teaches a computer-implemented method, comprising: updating one or more parameters of one or more text-to-speech (TTS) machine learning systems using, at least, a first distribution corresponding to a first set of audio samples and a second distribution corresponding to a second set of synthesized audio samples ([2.2] [3.3] [D Training Details] the generator, which attempts to produce samples that mimic the reference distribution, and the discriminator, which tries to differentiate between real and generated samples; instead of a single discriminator, we use an ensemble of Random Window Discriminators (RWDs) which operate on randomly sub-sampled fragments of the real or generated samples; we train all models with a single discriminator step).
The difference between the prior art and the claimed invention is that Binkowski does not explicitly teach removing, after the updating of the one or more parameters of the one or more TTS systems, the second distribution to form an inferencing distribution; and causing inferencing by the one or more TTS systems to be performed using the inferencing distribution.
Kim teaches removing, after the updating of the one or more parameters of the one or more TTS systems, the second distribution to form an inferencing distribution ([2.1.1] [Eq. 1] [Fig. 1] [2.5] where p0(z|c) denotes a prior distribution and q0(z|x) is an approximate posterior distribution; Figure 1. System diagram depicting (a) training procedure and (b) inference procedure; the posterior encoder and discriminator are only used for training, not for inference); and 
causing inferencing by the one or more TTS systems to be performed using the inferencing distribution ([2.] [2.1.1] [2.5] Figures 1a and 1b show the training and inference procedures of our method, respectively; where p0(z|c) denotes a prior distribution; the posterior encoder and discriminator are only used for training, not for inference).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Binkowski with teachings of Kim by modifying high fidelity speech synthesis with adversarial networks as taught by Binkowski to include removing, after the updating of the one or more parameters of the one or more TTS systems, the second distribution to form an inferencing distribution; and causing inferencing by the one or more TTS systems to be performed using the inferencing distribution as taught by Kim for the benefit of outperforming the best publicly available TTS systems and achieving a MOS comparable to ground truth (Kim [Abstract]).
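The Kim passage quoted above (the posterior encoder and discriminator are used only for training, not for inference) can be pictured with a toy sketch. This is only an illustration under stated assumptions: the `Gaussian` class, `prior_from_text`, `posterior_from_audio`, and `decode` are hypothetical stand-ins invented here, not Kim's architecture and not the claimed method.

```python
import random
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: float
    std: float

    def sample(self, rng: random.Random) -> float:
        return rng.gauss(self.mean, self.std)


def prior_from_text(text: str) -> Gaussian:
    """Hypothetical stand-in for a text-conditioned prior p(z|c)."""
    return Gaussian(mean=float(len(text)), std=1.0)


def posterior_from_audio(audio: list[float]) -> Gaussian:
    """Hypothetical stand-in for an audio-conditioned posterior q(z|x); training only."""
    return Gaussian(mean=sum(audio) / len(audio), std=0.5)


def decode(z: float, n: int = 4) -> list[float]:
    """Toy decoder that maps one latent value to a short 'waveform'."""
    return [z / (i + 1) for i in range(n)]


def training_forward(audio: list[float], rng: random.Random) -> list[float]:
    """Training path: the latent is drawn from the posterior, which sees real audio."""
    return decode(posterior_from_audio(audio).sample(rng))


def inference_forward(text: str, rng: random.Random) -> list[float]:
    """Inference path: the posterior is never called; the latent comes from the prior."""
    return decode(prior_from_text(text).sample(rng))


rng = random.Random(0)
reconstruction = training_forward([0.1, 0.2, 0.3], rng)
clip = inference_forward("hello", rng)
```

The point of the sketch is only that `inference_forward` needs no audio input and no training-only component; whether that corresponds to the claimed "removing ... the second distribution" is the legal question at issue and is not resolved by the code.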

Claim 2,
Binkowski further teaches the computer-implemented method of claim 1, further comprising: generating the second set of synthesized audio samples ([Abstract] [3.2] a conditional feed-forward generator producing raw speech audio; the input to G is a sequence of linguistic and pitch features at 200Hz, and its output is the raw waveform at 24kHz); and 
generating the second distribution for the second set of synthesized audio samples ([eq. 8] XG = (DS (G(ci; zi)))Ni=1; in the conditional case, we use the same conditioning in the reference and generated samples, comparing conditional distributions p(xG|c) and p(xreal|c); in unconditional case, we compare p(xG) and p(xreal)).

Claim 5,
Kim further teaches the computer-implemented method of claim 1, wherein one or more identifiers are applied to the second set of synthesized audio samples ([2.5.3] [4.3] [Fig. 2b] for the multi-speaker setting, we add a linear layer that transforms speaker embedding and add it to the input latent variables z; fig. 2b shows the lengths of 100 utterances generated with each of five speaker identities from our model).

Claim 6,
Binkowski further teaches the computer-implemented method of claim 1, further comprising: inferencing a generated audio clip corresponding to a text input ([Abstract] [Introduction] the TTS task consists in the conversion of text into speech audio; our architecture is composed of a conditional feed-forward generator producing raw speech audio).

Claim 7,
Kim further teaches the computer-implemented method of claim 1, wherein the first distribution includes a phoneme distribution ([2.2.2] [2.5.2] the duration distribution of given phonemes; the sampling procedure is relatively simple; the phoneme duration is sampled; a text encoder that processes the input phonemes ctext).

Claim 8,
Kim further teaches the computer-implemented method of claim 7, wherein a phoneme duration for sampling from the phoneme distribution is determined at inference ([2.2.2] [2.5] phoneme duration is sampled; sampling phoneme duration from the stochastic duration predictor; training and inference procedures; that sampling is the inference-side duration determination).

Claim(s) 3 and 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over Binkowski et al. (“High Fidelity Speech Synthesis with Adversarial Networks”; Sep. 25, 2019; DeepMind) in view of Kim et al. (“Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech”; 2021; Proceedings of the 38th International Conference on Machine Learning, PMLR) and further in view of Mohammadi (US 10,186,252).

Claim 3,
Binkowski and Kim teach all the limitations in claim 2. The difference between the prior art and the claimed invention is that neither Binkowski nor Kim explicitly teaches wherein generating the second set of synthesized audio samples further comprises: modifying one or more features of at least a subset of the first set of audio samples.
Mohammadi teaches wherein generating the second set of synthesized audio samples further comprises: modifying one or more features of at least a subset of the first set of audio samples ([col. 1 line 53 to col. 2 line 13] [col. 2 line 46 to col. 3 line 8] the text is converted using three deep neural networks, which are first trained using training speech from this same speaker; three main characteristics of human speech 1) speaking rate or duration 2) intonation, pitch 3) spectrum; a speech synthesizer modifies the temporal length of each normalized spectrogram; the pitch contours associated with phonemes are also normalized before retrieval and subsequently de-normalized).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Binkowski with teachings of Kim by modifying high fidelity speech synthesis with adversarial networks as taught by Binkowski to include wherein generating the second set of synthesized audio samples further comprises: modifying one or more features of at least a subset of the first set of audio samples as taught by Mohammadi for the benefit of yielding speech that is realistic sounding (Mohammadi [Abstract]).

Claim 4,
Binkowski and Kim teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that neither Binkowski nor Kim explicitly teaches wherein the second set of synthesized audio samples includes at least one audio sample from the first set of audio samples.
Mohammadi teaches wherein the second set of synthesized audio samples includes at least one audio sample from the first set of audio samples ([col. 1 line 53 to col. 2 line 13] [col. 2 line 46 to col. 3 line 8] [claim 11] the text is converted using three deep neural networks, which are first trained using training speech from this same speaker; the spectrum generator is configured to generate the normalized spectrograms based in part on spectral representations of audio data from a speaker; waveforms that are concatenated into the synthetic speech).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Binkowski with teachings of Kim by modifying high fidelity speech synthesis with adversarial networks as taught by Binkowski to include wherein the second set of synthesized audio samples includes at least one audio sample from the first set of audio samples as taught by Mohammadi for the benefit of yielding speech that is realistic sounding (Mohammadi [Abstract]).

Claim(s) 9-11 and 14-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (“Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech”; 2021; Proceedings of the 38th International Conference on Machine Learning, PMLR) in view of Binkowski et al. (“High Fidelity Speech Synthesis with Adversarial Networks”; Sep. 25, 2019; DeepMind).

Claim 9,
Kim teaches a processor, comprising: one or more processing units to ([3.3] we used mixed precision training on 4 NVIDIA V100 GPUs):
remove the one or more synthetic training clips from the distribution to form an inference distribution ([2.1.1] [Fig. 1] [2.5] p0(z|c) denotes a prior distribution and q0(z|x) is an approximate posterior distribution; Fig. 1 system diagram depicting a) training procedure and b) inference procedure; the posterior encoder and discriminator are only used for training, not for inference);
sample, during inferencing and using the TTS machine learning system, from the inference distribution ([2.2.2] [2.1.1] [Fig. 1] the sampling procedure is relatively simple; the phoneme duration is sampled from random noise through the inverse transformation of the stochastic duration predictor; p0(z|c) denotes a prior distribution; Fig. 1 shows the inference procedure); and
generating, using one or more samples from the inference distribution, an audio clip corresponding to an input text sequence ([Abstract] [5.1] our model 1) learns to synthesize raw waveforms directly from text 3) generates samples in parallel; stochastic duration predictor to synthesize speech with diverse rhythms from input text).
The difference between the prior art and the claimed invention is that Kim does not explicitly teach update one or more parameters of one or more text-to-speech (TTS) machine learning systems using one or more synthetic training clips; identify one or more locations within a distribution corresponding to the one or more synthetic training clips.
Binkowski teaches update one or more parameters of one or more text-to-speech (TTS) machine learning systems using one or more synthetic training clips ([3.1] [3.2] [3.3] [App. D] for training, we sample 2 second windows; a 2s training clip corresponds to a sequence of 48000 samples; RWDs operate on randomly subsampled fragments of the real or generated samples; we train all models with a single discriminator step per generator step);
identify one or more locations within a distribution corresponding to the one or more synthetic training clips ([Eq. 1] [App. B.3] XG={DS (G(ci, zi))}; generated samples, comparing conditional distributions p(xG|c) and p(xreal|c); in the unconditional case, we compare p(xG) and p(xreal)).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Kim with teachings of Binkowski by modifying the conditional variational autoencoder with adversarial learning for end-to-end text to speech as taught by Kim to include update one or more parameters of one or more text-to-speech (TTS) machine learning systems using one or more synthetic training clips; identify one or more locations within a distribution corresponding to the one or more synthetic training clips as taught by Binkowski for the benefit of generating high-fidelity speech with naturalness comparable to the state-of-the-art models (Binkowski [Abstract]).

Claim 10,
Kim further teaches the processor of claim 9, wherein the one or more TTS machine learning systems form at least a portion of an end-to-end parallel speech synthesis system ([Abstract] [Intro] [5.1] we present a parallel end-to-end TTS method; our model 1) learns to synthesize raw waveforms directly from text 3) generates samples in parallel; we adopt a VAE to a parallel end-to-end TTS system).

Claim 11,
Kim further teaches the processor of claim 10, wherein the one or more processing units are further to generate the one or more synthetic training clips from a plurality of audio segments ([3.3] we adopt the windowed generator training, a method of generating only a part of raw waveforms; we randomly extract segments of latent representations and also extract the corresponding audio segments from the ground truth raw waveforms as training targets).

Claim 14,
Binkowski further teaches the processor of claim 9, wherein one or more identifiers are applied to the synthetic training clips ([3.2] as the generator is producing raw audio (e.g. a 2s training clip corresponds to a sequence of 48000 samples); convolutions are preceded by conditional batch normalization, conditioned on, the concatenation of z and a one-hot representation of the speaker ID in the multi-speaker case).

Claim 15,
Binkowski further teaches the processor of claim 9, wherein the one or more processing units are further to generate the distribution using at least the one or more synthetic training clips ([App. B.3] in the conditional case, we use the same conditioning in the reference and generated samples, comparing conditional distributions p(xG|c) and p(xreal|c)).

Claim 16,
Kim teaches a system comprising: one or more processing units ([3.3] 4 NVIDIA V100 GPUs) 
to perform, based at least on an inferencing distribution, inferencing using one or more machine learning models ([Fig. 1] [Eq. 1] [2.] figures 1a and 1b show the training and inference procedures of our method, respectively; p0(z|c) denotes a prior distribution of the latent variables z given condition c)
wherein the inferencing distribution is generated, at least, from a training distribution of the one or more machine learning models ([Eq. 1] [Fig. 1] p0(z|c) denotes a prior distribution and q0(z|x) is an approximate posterior distribution; Fig. 1 System diagram depicting a) training procedure and b) inference procedure)
that omits one or more distribution values corresponding to a set of synthesized audio samples ([2.5] the posterior encoder and discriminator are only used for training, not for inference).
The difference between the prior art and the claimed invention is that Kim does not explicitly teach to produce an output audio clip corresponding to an input text segment, after updating one or more parameters of the one or more machine learning models that were determined using the training distribution.
Binkowski teaches to produce an output audio clip corresponding to an input text segment ([Abstract] [Intro] the TTS task consists in the conversion of text into speech audio; our architecture is composed of a conditional feed-forward generator producing raw speech audio), 
after updating one or more parameters of the one or more machine learning models that were determined using the training distribution ([2.2] [App. D] the generator attempts to produce samples that mimic the reference distribution and the discriminator, differentiate between real and generated samples; we train all models with a single discriminator step per generator step).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Kim with teachings of Binkowski by modifying the conditional variational autoencoder with adversarial learning for end-to-end text to speech as taught by Kim to include to produce an output audio clip corresponding to an input text segment, after updating one or more parameters of the one or more machine learning models that were determined using the training distribution as taught by Binkowski for the benefit of generating high-fidelity speech with naturalness comparable to the state-of-the-art models (Binkowski [Abstract]).

Claim 17,
Binkowski further teaches the system of claim 16, wherein the training distribution includes distribution values for a first distribution corresponding to a set of audio samples and the set of synthesized audio samples ([4.2] [App. B.3] assume that variables xreal and xG are drawn from the real and generated distributions; in the conditional case, comparing conditional distributions p(xG|c) and p(xreal|c); in the unconditional case, FDSD and KDSD compare p(xG) and p(xreal)).

Claim 18,
Binkowski further teaches the system of claim 17, wherein the one or more processing units are further to generate the set of synthesized audio samples ([Abstract] [App. B.3] [Eq. 8] our architecture is composed of a conditional feed-forward generator producing raw speech audio; see eq. 8).

Claim 19,
Binkowski further teaches the system of claim 16, wherein the one or more processing units are further to use the one or more machine learning models to generate synthetic speech based at least on an input text sequence ([Abstract] [Intro] the TTS task consists in the conversion of text into speech audio; GAN-TTS a generative adversarial network for TTS; a conditional feed-forward generator producing raw speech audio).

Claim 20,
Kim further teaches the system of claim 16, wherein the one or more machine learning models form at least a portion of an end-to-end parallel speech synthesis system ([Intro] [5.1] we present a parallel end to end TTS method; we adopt a VAE to a parallel end to end TTS system; our model, generates samples in parallel).

Claim(s) 12-13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (“Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech”; 2021; Proceedings of the 38th International Conference on Machine Learning, PMLR) in view of Binkowski et al. (“High Fidelity Speech Synthesis with Adversarial Networks”; Sep. 25, 2019; DeepMind) and further in view of Iqbal et al. (“Enhancing audio augmentation methods with consistency learning”; 2021; IEEE).

Claim 12,
Kim and Binkowski teach all the limitations in claim 10. The difference between the prior art and the claimed invention is that neither Kim nor Binkowski explicitly teaches wherein the one or more synthetic training clips correspond to augmented audio samples including one or more features augmented based at least on an augmentation probability.
Iqbal teaches wherein the one or more synthetic training clips correspond to augmented audio samples including one or more features augmented based at least on an augmentation probability ([Abstract] [Intro] [3.1] [4.3] data augmentation is an inexpensive way to increase training data diversity; Pitch shifting: The pitch of the audio clip is shifted; a specific variation is selected randomly each time; combination applies either pitch-shifting or reverberations randomly with equal probability).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Kim with teachings of Iqbal by modifying the conditional variational autoencoder with adversarial learning for end-to-end text to speech as taught by Kim to include wherein the one or more synthetic training clips correspond to augmented audio samples including one or more features augmented based at least on an augmentation probability as taught by Iqbal for the benefit of improving learning by enforcing consistency (Iqbal [Abstract]).

Claim 13,
Iqbal further teaches the processor of claim 12, wherein the one or more features correspond to at least one of pitch or energy ([3.1] pitch shifting: the pitch of the audio clip is shifted).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
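As a rough illustration of the periods described above, the calendar arithmetic can be sketched as follows. The sketch assumes the Apr 22, 2026 mailing date shown in the prosecution timeline, and `add_months` is a hypothetical helper written for this example. It does not handle the advisory-action contingency, extension fees, or deadlines that fall on weekends or holidays, so it is not a substitute for docketing.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the month's last day."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


# Assumption: mailing date taken from the prosecution timeline on this page.
mailing_date = date(2026, 4, 22)

two_month_mark = add_months(mailing_date, 2)        # first-reply window for the advisory-action rule
shortened_period_end = add_months(mailing_date, 3)  # THREE MONTHS shortened statutory period
statutory_limit = add_months(mailing_date, 6)       # SIX MONTHS outer limit

print(two_month_mark, shortened_period_end, statutory_limit)
```

Under that assumed mailing date, the three dates come out as June 22, July 22, and October 22, 2026.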
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Primary Examiner
Art Unit 2653



/SHREYANS A PATEL/               Examiner, Art Unit 2659

Prosecution Timeline

Aug 28, 2023

Application Filed

Oct 28, 2025

Non-Final Rejection mailed — §103

Jan 15, 2026

Applicant Interview (Telephonic)

Jan 15, 2026

Examiner Interview Summary

Jan 26, 2026

Response Filed

Apr 22, 2026

Final Rejection mailed — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/132,165

Patent 12608559

METHOD AND SYSTEM FOR ENHANCING A MUTIMODAL INPUT CONTENT

3y 0m to grant Granted Apr 21, 2026

18/696,802

Patent 12609128

METHOD FOR IMPROVING FAR-FIELD SPEECH INTERACTION PERFORMANCE, AND FAR-FIELD SPEECH INTERACTION SYSTEM

2y 0m to grant Granted Apr 21, 2026

17/934,906

Patent 12586597

ENHANCED AUDIO FILE GENERATOR

3y 6m to grant Granted Mar 24, 2026

18/744,449

Patent 12586561

TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM, AND A METHOD OF CALCULATING AN EXPRESSIVITY SCORE

1y 9m to grant Granted Mar 24, 2026

17/983,671

Patent 12548549

ON-DEVICE PERSONALIZATION OF SPEECH SYNTHESIS FOR TRAINING OF SPEECH RECOGNITION MODEL(S)

3y 3m to grant Granted Feb 10, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

3-4

Expected OA Rounds

89%

Grant Probability

97%

With Interview (+7.7%)

2y 0m (~0m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 406 resolved cases by this examiner. Grant probability derived from career allowance rate.
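The projection figures above can be reproduced with simple arithmetic. This is a minimal sketch, assuming the grant probability is the career allowance rate (361 granted out of 406 resolved) and that the interview lift is added as percentage points. The additive treatment and the function names are assumptions made for illustration, not a documented method.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Return the allowance rate as a percentage of resolved cases."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return 100.0 * granted / resolved


def with_interview(rate_pct: float, lift_points: float) -> float:
    """Add the interview lift in percentage points, capped at 100 (assumed additive)."""
    return min(100.0, rate_pct + lift_points)


base = allowance_rate(361, 406)      # career figures reported on this page
boosted = with_interview(base, 7.7)  # +7.7 interview lift reported on this page

print(f"{base:.1f}% -> {boosted:.1f}%")
```

Rounded to whole percentages, these values match the 89% and 97% figures shown in the projections.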