Last updated: May 29, 2026

Application No. 18/449,969

UNSUPERVISED ALIGNMENT FOR TEXT TO SPEECH SYNTHESIS USING NEURAL NETWORKS

Non-Final OA §103

Filed

Aug 15, 2023

Priority

Oct 07, 2021 — continuation of 11/769,481 +1 more

Examiner

PATEL, SHREYANS A

Art Unit

2659

Tech Center

2600 — Communications

Assignee

Nvidia Corporation

OA Round

3 (Non-Final)

Interview Optional

+7.7% interview lift. An interview has already been conducted in this application's prosecution history. This examiner has an 89% grant rate with a +7.7% interview lift. Because an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.

Based on 406 resolved cases, 2023–2026

Examiner Intelligence

PATEL, SHREYANS A

Grants 89% — above average

Career Allowance Rate

361 granted / 406 resolved

+26.9% vs TC avg

Moderate +8% lift

+7.7%

Interview Lift

resolved cases with interview

Fast prosecutor

2y 0m

Avg Prosecution

26 currently pending

Career history

449

Total Applications

across all art units

Statute-Specific Performance

§101

11.0%

-29.0% vs TC avg

§103

67.1%

+27.1% vs TC avg

§102

11.6%

-28.4% vs TC avg

§112

0.8%

-39.2% vs TC avg

Tech Center average is an estimate. Based on career data from 406 resolved cases.

Office Action

§103

DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments with respect to Double Patenting of claims 1-20 have been considered and found persuasive due to amendments, and the rejection has been withdrawn. 
Applicant's arguments with respect to 35 U.S.C. 101 Abstract Idea rejection of claims 1-20 have been considered and found persuasive due to amendments, and the rejection has been withdrawn.
Independent claim 1 is rejected. Independent claims 10 and 17 are allowable. Dependent claims 5 and 9 are objected to. See detailed description below.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-2 and 7-8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Mohammadi (US 10,186,252) in view of Qian et al. (US 2010/0066742).

Claim 1,
Mohammadi teaches a computer-implemented method, comprising ([col. 1 lines 52-53] a system and method for converting text to speech): 
determining a phoneme distribution from respective phoneme durations of a plurality of first audio segments provided as training data ([Fig. 1] [col. 3 lines 10-40] [col. 4 lines 5-16] obtaining phoneme durations from training audio by forced alignment and forming a representation (matrix) of phoneme timing (duration); see “duration matrix 141”);
determining, from at least respective phoneme pitches or respective phoneme energies of the plurality of first audio segments, at least a pitch distribution, corresponding to the respective phoneme pitches, or an energy distribution, corresponding to the respective phoneme energies ([col. 4 lines 5-16] extracting/representing pitch per phoneme in a pitch representation; see “pitch matrix 142”; the pitch contour based on log f0, normalized/fixed-length per phoneme);
determining a speech alignment for a second audio segment, generated by the text-to-speech model, based at least on the duration alignment, the phoneme distribution, and at least one of the pitch distribution or the energy distribution ([col. 7 lines 22-46] aligning generated speech units by duration-based time scaling: aligning the generated spectrogram/pitch contours to the phoneme durations by stretch or compress operations in de-normalization, then generating speech segments and concatenating them); and
generating a synthesized audio recitation, as an output audio signal, corresponding to the second audio segment ([col. 7 lines 22-46] producing the output speech signal (concatenated segments) as synthesized speech).
The difference between the prior art and the claimed invention is that Mohammadi does not explicitly teach determining, for an input received at a text-to-speech model trained, at least in part, using the training data, a duration alignment between a sequence of text associated with the input and a total speech duration using the respective phoneme durations.
Qian teaches determining, for an input received at a text-to-speech model trained, at least in part, using the training data, a duration alignment between a sequence of text associated with the input and a total speech duration using the respective phoneme durations ([Figs. 2-3] [0020] [0023] obtaining durations from a duration model and aligning/adjusting durations to meet a user-modifiable total duration (T is the total duration)).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Mohammadi with the teachings of Qian by modifying text to speech synthesis using deep neural network with constant unit length spectrogram as taught by Mohammadi to include determining, for an input received at a text-to-speech model trained, at least in part, using the training data, a duration alignment between a sequence of text associated with the input and a total speech duration using the respective phoneme durations as taught by Qian for the benefit of post-editing synthesized speech to make it sound more natural (Qian [0002]).
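
For orientation only, the kind of duration adjustment attributed to Qian above (per-phoneme durations adjusted to meet a total duration T) can be sketched as follows. The function name fit_durations_to_total, the proportional-scaling rule, and the sample values are assumptions made for illustration; they are not taken from Qian, Mohammadi, or the claims.

```python
def fit_durations_to_total(durations, total):
    """Scale per-phoneme durations proportionally so they sum to `total`.

    Proportional scaling is an assumed rule for this sketch, not one
    stated in the cited references.
    """
    if not durations or any(d <= 0 for d in durations):
        raise ValueError("durations must be non-empty and positive")
    if total <= 0:
        raise ValueError("total must be positive")
    scale = total / sum(durations)
    return [d * scale for d in durations]


# Hypothetical durations (seconds) from a duration model for four phonemes.
predicted = [0.08, 0.12, 0.05, 0.15]

# Adjust them so the utterance lasts T = 0.60 s in total.
aligned = fit_durations_to_total(predicted, total=0.60)
```

Each phoneme keeps its relative share of the utterance; only the overall length changes.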

Claim 2,
Mohammadi further teaches the computer-implemented method of claim 1, wherein the pitch distribution corresponds to frequency and the energy distribution corresponds to amplitude ([col. 4 lines 5-16] the pitch corresponds to fundamental frequency).

Claim 7,
Qian further teaches the computer-implemented method of claim 1, wherein the synthesized audio recitation is generative such that a first synthesized recitation is different from a second synthesized recitation, each of the first synthesized recitation and the second synthesized recitation based on the sequence of text ([Abstract] a visual representation of synthesized speech as one or more waveforms, along with the corresponding text from which the speech was synthesized; to change data corresponding to the prosody, e.g., duration, pitch, and/or loudness data, with respect to at least one part of the speech; the changed speech can be played back to hear the change in prosody resulting from the interactive changes).

Claim 8,
Mohammadi further teaches the computer-implemented method of claim 1, further comprising: aligning a plurality of text tokens, from the sequence of text, to respective mel frames, based on the duration alignment ([col. 3 lines 41-58] [col. 7 lines 22-31] converts the audio representation of the speaker from an audio file to a representation in terms of Mel Cepstral coefficients and pitch; t is the time-resolution of the spectrogram section per phoneme; and the de-normalize spectrum module 482 either stretches or compresses the normalized spectrum so that it is the length specified by the duration).
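
The stretch-or-compress step cited from Mohammadi (a normalized, fixed-length spectrum per phoneme resized to the length specified by the duration) can be pictured with the sketch below. It is a minimal illustration that assumes linear interpolation along the time axis; the function name resample_frames and the toy two-coefficient frames are hypothetical, and the reference may use a different resampling method.

```python
def resample_frames(frames, target_len):
    """Resize a sequence of feature frames to `target_len` frames.

    Uses linear interpolation between neighbouring frames (an assumption
    of this sketch). Each frame is a list of floats of equal length.
    """
    if not frames or target_len < 1:
        raise ValueError("need at least one frame and target_len >= 1")
    n = len(frames)
    if n == 1 or target_len == 1:
        # Degenerate case: repeat (or keep only) the first frame.
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


# Toy normalized spectrum: 3 frames, 2 coefficients per frame.
normalized = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

stretched = resample_frames(normalized, 5)   # longer phoneme duration
compressed = resample_frames(normalized, 2)  # shorter phoneme duration
```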

Claim(s) 3-4 and 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Mohammadi (US 10,186,252) in view of Qian et al. (US 2010/0066742) and further in view of Wei et al. (“Neural Network-Based Modeling of Phonetic Durations”; Sept. 6, 2019).

Claim 3,
Mohammadi and Qian teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that neither Mohammadi nor Qian explicitly teaches applying a prior distribution to the duration alignment configured to exclude pairs of phonemes and durations from the plurality of first audio segments that are outside of a specified range.
Wei teaches applying a prior distribution to the duration alignment configured to exclude pairs of phonemes and durations from the plurality of first audio segments that are outside of a specified range ([2.3] [2.4] [5.] we obtain the reference durations from forced-alignment; we detect outliers using the output from the DNN; we use the value of the bin to which the reference duration belongs as the probability of the duration, and by ranking the probabilities we can get a list of phonemes with the lowest probabilities; these phonemes with unlikely duration, which we regard as outliers, can indicate misalignments or departures from the transcription; in training material for TTS these anomalies can be used to correct transcriptions and dictionary entries as well as to exclude unsuitable speech from the TTS training set).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Mohammadi and Qian with teachings of Wei by modifying text to speech synthesis using deep neural network with constant unit length spectrogram as taught by Mohammadi to include applying a prior distribution to the duration alignment configured to exclude pairs of phonemes and durations from the plurality of first audio segments that are outside of a specified range as taught by Wei for the benefit of checking that text-to-speech (TTS) training speech follows the script and words are pronounced as expected (Wei [Abstract]).
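
The outlier screening summarized from Wei above (take the probability of the bin containing each reference duration, rank the phonemes by that probability, and treat the least likely ones as outliers) can be sketched as follows. The bin edges, the per-phoneme probability tables standing in for the DNN output, the 0.10 cutoff, and all function names are hypothetical values chosen for illustration only.

```python
from bisect import bisect_right


def duration_probability(duration, bin_edges, bin_probs):
    """Return the probability of the bin that contains `duration`.

    `bin_edges` holds the lower edge of each bin; the last bin is open-ended.
    """
    idx = bisect_right(bin_edges, duration) - 1
    idx = max(0, min(idx, len(bin_probs) - 1))  # clamp to the first/last bin
    return bin_probs[idx]


def rank_outliers(items, bin_edges, predicted):
    """Sort (phoneme, duration) pairs from least to most probable."""
    scored = [
        (duration_probability(d, bin_edges, predicted[p]), p, d)
        for p, d in items
    ]
    return sorted(scored)


# Hypothetical duration bins (seconds) and per-phoneme bin probabilities.
# A real system would take these probabilities from a trained model.
bin_edges = [0.00, 0.05, 0.10, 0.20]
predicted = {
    "AH": [0.10, 0.60, 0.25, 0.05],
    "S": [0.05, 0.30, 0.55, 0.10],
}

# Reference durations, e.g. from forced alignment.
aligned = [("AH", 0.07), ("S", 0.12), ("AH", 0.45)]

ranked = rank_outliers(aligned, bin_edges, predicted)
# Keep only pairs whose probability meets the assumed cutoff.
kept = [(p, d) for prob, p, d in ranked if prob >= 0.10]
```

In this toy data the 0.45 s "AH" falls in the least likely bin, so it ranks first and is excluded.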

Claim 4,
Qian further teaches the computer-implemented method of claim 3, wherein the prior distribution is cigar-shaped ([0019] multi-space probability distribution HMM).

Claim 6,
Qian further teaches the computer-implemented method of claim 3, wherein the prior distribution is constructed from a beta-binomial distribution ([0019] multi-space probability distribution HMM).
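
For readers unfamiliar with the distribution named in claim 6, the sketch below evaluates the standard beta-binomial probability mass function. It shows the distribution only; how the claimed prior is constructed from it is not described in this Office action, and the parameter values used here (n = 4, alpha = beta = 2) are arbitrary assumptions.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(K = k) for K ~ BetaBinomial(n, alpha, beta), with 0 <= k <= n."""
    if not 0 <= k <= n:
        return 0.0
    log_ratio = log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return comb(n, k) * exp(log_ratio)


# Arbitrary example parameters: probabilities over k = 0..4.
pmf = [beta_binomial_pmf(k, 4, 2.0, 2.0) for k in range(5)]
```

With equal alpha and beta the mass is symmetric and peaks at the middle value of k.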

Allowable Subject Matter
Claims 10-20 are allowed.
Claims 5 and 9 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
For Claim 10:
Mohammadi (US 10,186,252) in view of Ping et al. (“Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning”; Feb. 22, 2018) teach all the limitations. The difference between the prior art and the claimed invention is that neither Mohammadi nor Ping explicitly teaches that the synthetic alignment is based on probabilistically sampling phoneme distributions for a plurality of audio segments used to train the text-to-speech model, at inference, and a first alignment between the sequence of text and a total speech duration.
For Claim 17:
Mohammadi (US 10,186,252) in view of Ping et al. (“Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning”; Feb. 22, 2018) teach all the limitations. The difference between the prior art and the claimed invention is that neither Mohammadi nor Ping explicitly teaches determining one or more vectors corresponding to one or more speaker characteristics based on a concentrated probability distribution of a text sequence from the text and mel-frames of the alignment distribution across the duration of the plurality of audio samples.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Primary Examiner
Art Unit 2653



/SHREYANS A PATEL/Examiner, Art Unit 2659

Prosecution Timeline

Oct 22, 2025

Non-Final Rejection mailed — §103

Jan 15, 2026

Applicant Interview (Telephonic)

Jan 15, 2026

Examiner Interview Summary

Jan 26, 2026

Response Filed

Feb 24, 2026

Final Rejection mailed — §103

May 14, 2026

Request for Continued Examination

May 19, 2026

Response after Non-Final Action

May 26, 2026

Non-Final Rejection mailed — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/132,165

Patent 12608559

METHOD AND SYSTEM FOR ENHANCING A MUTIMODAL INPUT CONTENT

3y 0m to grant Granted Apr 21, 2026

18/696,802

Patent 12609128

METHOD FOR IMPROVING FAR-FIELD SPEECH INTERACTION PERFORMANCE, AND FAR-FIELD SPEECH INTERACTION SYSTEM

2y 0m to grant Granted Apr 21, 2026

17/934,906

Patent 12586597

ENHANCED AUDIO FILE GENERATOR

3y 6m to grant Granted Mar 24, 2026

18/744,449

Patent 12586561

TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM, AND A METHOD OF CALCULATING AN EXPRESSIVITY SCORE

1y 9m to grant Granted Mar 24, 2026

17/983,671

Patent 12548549

ON-DEVICE PERSONALIZATION OF SPEECH SYNTHESIS FOR TRAINING OF SPEECH RECOGNITION MODEL(S)

3y 3m to grant Granted Feb 10, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

3-4

Expected OA Rounds

89%

Grant Probability

97%

With Interview (+7.7%)

2y 0m (~0m remaining)

Median Time to Grant

High

PTA Risk

Based on 406 resolved cases by this examiner. Grant probability derived from career allowance rate.
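
The projection figures can be reproduced from the counts reported on this page, as the short check below shows. It assumes that the interview lift is added as percentage points to the career allowance rate and that results are rounded to whole percentages; the page does not state its exact method, so this is a plausible reconstruction only.

```python
# Figures reported on this page.
granted, resolved = 361, 406
interview_lift_pts = 7.7  # percentage points (assumed to be additive)

# Career allowance rate, as a percentage.
allowance_pct = 100 * granted / resolved

# Assumed projection with an interview: allowance rate plus the lift.
with_interview_pct = allowance_pct + interview_lift_pts

# Whole-percent values for comparison with the displayed 89% and 97%.
summary = (round(allowance_pct), round(with_interview_pct))
```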