Prosecution Insights
Last updated: May 29, 2026
Application No. 18/449,969

UNSUPERVISED ALIGNMENT FOR TEXT TO SPEECH SYNTHESIS USING NEURAL NETWORKS

Non-Final OA §103
Filed: Aug 15, 2023
Priority: Oct 07, 2021 (continuation of 11/769,481 +1 more)
Examiner: PATEL, SHREYANS A
Art Unit: 2659
Tech Center: 2600 (Communications)
Assignee: Nvidia Corporation
OA Round: 3 (Non-Final)
Grant Probability: 89% (Favorable)
OA Rounds: 3-4
Est. Remaining: 0m
Grant Probability With Interview: 97%

Examiner Intelligence

Grants 89% (above average)
Career Allowance Rate: 89% (361 granted / 406 resolved), +26.9% vs TC avg
Interview Lift: +7.7% across resolved cases with interview (moderate, about +8%)
Avg Prosecution: 2y 0m (fast prosecutor), 26 currently pending
Total Applications: 449 (career history, across all art units)

Statute-Specific Performance

§101: 11.0% (-29.0% vs TC avg)
§103: 67.1% (+27.1% vs TC avg)
§102: 11.6% (-28.4% vs TC avg)
§112: 0.8% (-39.2% vs TC avg)
Tech Center (TC) averages are estimates. Based on career data from 406 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments with respect to Double Patenting of claims 1-20 have been considered and found persuasive due to amendments, and the rejection has been withdrawn.

Applicant's arguments with respect to 35 U.S.C. 101 Abstract Idea rejection of claims 1-20 have been considered and found persuasive due to amendments, and the rejection has been withdrawn.

Independent claim 1 is rejected. Independent claims 10 and 17 are allowable. Dependent claims 5 and 9 are objected to. See detailed description below.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2 and 7-8 are rejected under 35 U.S.C. 103 as being unpatentable over Mohammadi (US 10,186,252) in view of Qian et al. (US 2010/0066742).
Claim 1, Mohammadi teaches a computer-implemented method, comprising ([col. 1 lines 52-53] a system and method for converting text to speech):

determining a phoneme distribution from respective phoneme durations of a plurality of first audio segments provided as training data ([Fig. 1] [col. 3 lines 10-40] [col. 4 lines 5-16] obtaining phoneme durations from training audio by forced alignment and forming a representation (matrix) of phoneme timing (duration); see “duration matrix 141”);

determining, from at least respective phoneme pitches or respective phoneme energies of the plurality of first audio segments, at least a pitch distribution, corresponding to the respective phoneme pitches, or an energy distribution, corresponding to the respective phoneme energies ([col. 4 lines 5-16] extracting/representing pitch per phoneme in a pitch representation; see “pitch matrix 142”; the pitch contour based on log f0, normalized/fixed-length per phoneme);

determining a speech alignment for a second audio segment, generated by the text-to-speech model, based at least on the duration alignment, the phoneme distribution, and at least one of the pitch distribution or the energy distribution ([col. 7 lines 22-46] aligning generated speech units by duration-based time scaling: aligning the generated spectrogram/pitch contours to the phoneme durations by stretch or compress operations in de-normalization and then generating speech segments and concatenating); and

generating a synthesized audio recitation, as an output audio signal, corresponding to the second audio segment ([col. 7 lines 22-46] producing the output speech signal (concatenated segments) as synthesized speech).
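As an illustration of the duration-based time scaling cited for the speech-alignment limitation, the following is a minimal sketch of stretching or compressing a fixed-length per-phoneme contour to a target number of frames. The function name `resample_contour` and the use of linear interpolation are assumptions made for this example; the office action does not state which interpolation Mohammadi uses.

```python
from typing import List


def resample_contour(values: List[float], target_len: int) -> List[float]:
    """Stretch or compress a contour to target_len points.

    Hypothetical helper: linear interpolation is assumed here purely
    for illustration, not taken from the cited reference.
    """
    if not values:
        raise ValueError("values must not be empty")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if target_len == 1 or len(values) == 1:
        return [values[0]] * target_len

    last = len(values) - 1
    out = []
    for i in range(target_len):
        # Map output index i onto a fractional position in the input.
        pos = i * last / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


# Stretch a 2-point contour to 3 frames, then compress a 5-point one to 3.
stretched = resample_contour([0.0, 2.0], 3)
compressed = resample_contour([0.0, 1.0, 2.0, 3.0, 4.0], 3)
```

In this sketch the target length plays the role of the phoneme duration expressed in frames.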
The difference between the prior art and the claimed invention is that Mohammadi does not explicitly teach determining, for an input received at a text-to-speech model trained, at least in part, using the training data, a duration alignment between a sequence of text associated with the input and a total speech duration using the respective phoneme durations.

Qian teaches determining, for an input received at a text-to-speech model trained, at least in part, using the training data, a duration alignment between a sequence of text associated with the input and a total speech duration using the respective phoneme durations ([Figs. 2-3] [0020] [0023] obtaining durations from a duration model and aligning/adjusting durations to meet a user-modifiable total duration (T is the total duration)).

Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Mohammadi with the teachings of Qian by modifying text to speech synthesis using deep neural network with constant unit length spectrogram as taught by Mohammadi to include determining, for an input received at a text-to-speech model trained, at least in part, using the training data, a duration alignment between a sequence of text associated with the input and a total speech duration using the respective phoneme durations as taught by Qian for the benefit of post-editing synthesized speech to make it sound more natural (Qian [0002]).

Claim 2, Mohammadi further teaches the computer-implemented method of claim 1, wherein the pitch distribution corresponds to frequency and the energy distribution corresponds to amplitude ([col. 4 lines 5-16] the pitch corresponds to fundamental frequency).
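To make the Qian citation for claim 1 concrete, here is a minimal sketch of adjusting per-phoneme durations so that they sum to a requested total duration T. Proportional scaling and the name `scale_durations` are assumptions for illustration; the office action only says that durations are adjusted to meet a user-modifiable total duration, not how.

```python
from typing import List


def scale_durations(durations_ms: List[float], total_ms: float) -> List[float]:
    """Rescale phoneme durations so they sum to total_ms.

    Hypothetical helper: uniform proportional scaling is an assumed
    strategy, not a description of the cited reference.
    """
    current = sum(durations_ms)
    if current <= 0:
        raise ValueError("durations must sum to a positive value")
    factor = total_ms / current
    return [d * factor for d in durations_ms]


# Three phonemes predicted at 400 ms in total, stretched to T = 800 ms.
adjusted = scale_durations([100.0, 200.0, 100.0], 800.0)
```

Each phoneme keeps its relative share of the utterance, which is one simple way to hit a total duration without changing the rhythm.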
Claim 7, Qian further teaches the computer-implemented method of claim 1, wherein the synthesized audio recitation is generative such that a first synthesized recitation is different from a second synthesized recitation, each of the first synthesized recitation and the second synthesized recitation based on the sequence of text ([Abstract] a visual representation of synthesized speech as one or more waveforms, along with the corresponding text from which the speech was synthesized; to change data corresponding to the prosody e.g. duration, pitch, and/or loudness data with respect to at least one part of the speech; the changed speech can be played back to hear the change in prosody resulting from the interactive changes).

Claim 8, Mohammadi further teaches the computer-implemented method of claim 1, further comprising: aligning a plurality of text tokens, from the sequence of text, to respective mel frames, based on the duration alignment ([col. 3 lines 41-58] [col. 7 lines 22-31] converts the audio representation of the speaker from an audio file to a representation in terms of Mel Cepstral coefficients and pitch; t is the time-resolution of the spectrogram section per phoneme; and the de-normalize spectrum module 482 either stretches or compresses the normalized spectrum so that it is the length specified by the duration).

Claims 3-4 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Mohammadi (US 10,186,252) in view of Qian et al. (US 2010/0066742) and further in view of Wei et al. (“Neural Network-Based Modeling of Phonetic Durations”; Sept. 6, 2019).

Claim 3, Mohammadi and Qian teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that neither Mohammadi nor Qian explicitly teaches applying a prior distribution to the duration alignment configured to exclude pairs of phonemes and durations from the plurality of first audio segments that are outside of a specified range.
Wei teaches applying a prior distribution to the duration alignment configured to exclude pairs of phonemes and durations from the plurality of first audio segments that are outside of a specified range ([2.3] [2.4] [5.] we obtain the reference durations from forced-alignment; we detect outliers using the output from the DNN; we use the value of the bin to which the reference duration belongs as the probability of the duration, and by ranking the probabilities we can get a list of phonemes with the lowest probabilities; these phonemes with unlikely duration, which we regard as outliers, can indicate misalignments or departures from the transcription; in training material for TTS these anomalies can be used to correct transcriptions and dictionary entries as well as to exclude unsuitable speech from the TTS training set).

Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Mohammadi and Qian with the teachings of Wei by modifying text to speech synthesis using deep neural network with constant unit length spectrogram as taught by Mohammadi to include applying a prior distribution to the duration alignment configured to exclude pairs of phonemes and durations from the plurality of first audio segments that are outside of a specified range as taught by Wei for the benefit of checking that text-to-speech (TTS) training speech follows the script and words are pronounced as expected (Wei [Abstract]).

Claim 4, Qian further teaches the computer-implemented method of claim 3, wherein the prior distribution is cigar-shaped ([0019] multi-space probability distribution HMM).

Claim 6, Qian further teaches the computer-implemented method of claim 3, wherein the prior distribution is constructed from a beta-binomial distribution ([0019] multi-space probability distribution HMM).

Allowable Subject Matter

Claims 10-20 are allowed.
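The outlier ranking attributed to Wei for claim 3 can be sketched as follows: look up the probability of the bin that contains each reference duration, then rank phonemes from least to most probable. The bin edges, the hard-coded probabilities, and the names `rank_outliers` and `BIN_EDGES_MS` are all hypothetical; in the cited paper the per-bin probabilities come from a DNN, which is not modeled here.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

# Hypothetical bin boundaries in milliseconds, giving four bins:
# [0, 50), [50, 100), [100, 200), [200, inf).
BIN_EDGES_MS = [50, 100, 200]


def rank_outliers(
    items: Sequence[Tuple[str, float, Sequence[float]]],
    bin_edges_ms: Sequence[float],
    k: int,
) -> List[Tuple[str, float, float]]:
    """Return the k phonemes whose reference duration is least probable.

    Each item is (phoneme, reference_duration_ms, per_bin_probabilities).
    """
    scored = []
    for phoneme, ref_ms, probs in items:
        # Probability of the bin that the reference duration falls into.
        prob = probs[bisect_right(bin_edges_ms, ref_ms)]
        scored.append((phoneme, ref_ms, prob))
    scored.sort(key=lambda item: item[2])
    return scored[:k]


# Made-up per-bin probabilities standing in for a duration model's output.
examples = [
    ("AH", 80, [0.10, 0.60, 0.25, 0.05]),
    ("S", 450, [0.05, 0.50, 0.40, 0.05]),
    ("T", 30, [0.30, 0.50, 0.15, 0.05]),
]
suspects = rank_outliers(examples, BIN_EDGES_MS, k=2)
```

The lowest-ranked entries are the candidates one might inspect or exclude from a training set, which is the use the office action quotes.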
Claims 5 and 9 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

The following is a statement of reasons for the indication of allowable subject matter:

For Claim 10: Mohammadi (US 10,186,252) in view of Ping et al. (“Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning”; Feb. 22, 2018) teach all the limitations. The difference between the prior art and the claimed invention is that neither Mohammadi nor Ping explicitly teaches the synthetic alignment is based on probabilistically sampling phoneme distributions for a plurality of audio segments used to train the text-to-speech model, at inference, and a first alignment between the sequence of text and a total speech duration.

For Claim 17: Mohammadi (US 10,186,252) in view of Ping et al. (“Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning”; Feb. 22, 2018) teach all the limitations. The difference between the prior art and the claimed invention is that neither Mohammadi nor Ping explicitly teaches determine one or more vectors corresponding to one or more speaker characteristics based on a concentrated probability distribution of a text sequence from the text and mel-frames of the alignment distribution across the duration of the plurality of audio samples.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Primary Examiner
Art Unit 2653

/SHREYANS A PATEL/
Examiner, Art Unit 2659

Prosecution Timeline

Oct 22, 2025: Non-Final Rejection mailed, §103
Jan 15, 2026: Applicant Interview (Telephonic)
Jan 15, 2026: Examiner Interview Summary
Jan 26, 2026: Response Filed
Feb 24, 2026: Final Rejection mailed, §103
May 14, 2026: Request for Continued Examination
May 19, 2026: Response after Non-Final Action
May 26, 2026: Non-Final Rejection mailed, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12608559: METHOD AND SYSTEM FOR ENHANCING A MUTIMODAL INPUT CONTENT (3y 0m to grant, granted Apr 21, 2026)
Patent 12609128: METHOD FOR IMPROVING FAR-FIELD SPEECH INTERACTION PERFORMANCE, AND FAR-FIELD SPEECH INTERACTION SYSTEM (2y 0m to grant, granted Apr 21, 2026)
Patent 12586597: ENHANCED AUDIO FILE GENERATOR (3y 6m to grant, granted Mar 24, 2026)
Patent 12586561: TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM, AND A METHOD OF CALCULATING AN EXPRESSIVITY SCORE (1y 9m to grant, granted Mar 24, 2026)
Patent 12548549: ON-DEVICE PERSONALIZATION OF SPEECH SYNTHESIS FOR TRAINING OF SPEECH RECOGNITION MODEL(S) (3y 3m to grant, granted Feb 10, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 89%
With Interview: 97% (+7.7%)
Median Time to Grant: 2y 0m (~0m remaining)
PTA Risk: High
Based on 406 resolved cases by this examiner. Grant probability derived from career allowance rate.
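
The projection figures can be reproduced from the counts shown on this page. The sketch below assumes that the grant probability is simply granted divided by resolved, and that the interview lift is added as percentage points; the second assumption is inferred from the displayed 89% and 97% values rather than stated by the page.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a fraction of resolved cases."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return granted / resolved


# Counts taken from the page: 361 granted out of 406 resolved.
base = allowance_rate(361, 406)

# Assumption: the +7.7% lift is additive, in percentage points.
with_interview = base + 0.077

print(f"Grant probability: {base:.1%}")
print(f"With interview:    {with_interview:.1%}")
```

Rounded to whole percentages, these two values match the 89% and 97% shown above.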
