Prosecution Insights
Last updated: April 19, 2026
Application No. 18/449,969

UNSUPERVISED ALIGNMENT FOR TEXT TO SPEECH SYNTHESIS USING NEURAL NETWORKS

Final Rejection §103 §DP
Filed: Aug 15, 2023
Examiner: PATEL, SHREYANS A
Art Unit: 2659
Tech Center: 2600 (Communications)
Assignee: Nvidia Corporation
OA Round: 2 (Final)
Grant Probability: 89% (Favorable)
OA Rounds: 3-4
To Grant: 2y 3m
With Interview: 96%

Examiner Intelligence

Grants 89% (above average)
Career Allow Rate: 89% (359 granted / 403 resolved; +27.1% vs TC avg)
Interview Lift: +7.4% (moderate +7% lift; resolved cases with interview)
Avg Prosecution: 2y 3m (typical timeline; 46 currently pending)
Total Applications: 449 (career history, across all art units)

Statute-Specific Performance

§101: 21.3% (-18.7% vs TC avg)
§103: 36.0% (-4.0% vs TC avg)
§102: 22.6% (-17.4% vs TC avg)
§112: 8.8% (-31.2% vs TC avg)
Tech Center averages are estimates. Based on career data from 403 resolved cases.
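Each delta above is reported against an estimated Tech Center average, so the implied baseline can be recovered by subtracting the delta from the examiner's figure. The sketch below does that arithmetic on the four values listed. The dictionary name `stats` and the reading of each delta as a plain percentage-point difference are assumptions; the page does not state how the deltas are computed.

```python
# Examiner figure and reported delta vs. Tech Center average, in percent.
# Values are copied from the list above; treating the delta as a simple
# percentage-point difference is an assumption.
stats = {
    "§101": (21.3, -18.7),
    "§103": (36.0, -4.0),
    "§102": (22.6, -17.4),
    "§112": (8.8, -31.2),
}

# Implied baseline = examiner figure minus delta, rounded to one decimal.
implied_tc_avg = {
    statute: round(value - delta, 1) for statute, (value, delta) in stats.items()
}

for statute, baseline in implied_tc_avg.items():
    print(f"{statute}: implied TC average {baseline}%")
```

Under that reading, all four rows imply the same 40.0% baseline, which suggests the comparison uses a single flat estimate rather than a per-statute average.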

Office Action

§103 §DP
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments with respect to Double Patenting of claims 1-20 have been considered and found persuasive due to amendments, and the rejection has been withdrawn.

Applicant's arguments with respect to 35 U.S.C. 101 Abstract Idea rejection of claims 1-20 have been considered and found persuasive due to amendments, and the rejection has been withdrawn.

Independent claim 1 is rejected. Independent claims 10 and 17 are allowable. Dependent claims 5 and 9 are objected to. See detailed description below.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-2 and 7-8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Mohammadi (US 10,186,252) in view of Qian et al. (US 2010/0066742).

Claim 1, Mohammadi teaches a computer-implemented method, comprising ([col. 1 lines 52-53] a system and method for converting text to speech):

determining a phoneme distribution from respective phoneme durations of a plurality of first audio segments provided as training data ([Fig. 1] [col. 3 lines 10-40] [col. 4 lines 5-16] obtaining phoneme durations from training audio by forced alignment and forming a representation (matrix) of phoneme timing (duration); see “duration matrix 141”);

determining, from at least respective phoneme pitches or respective phoneme energies of the plurality of first audio segments, at least a pitch distribution, corresponding to the respective phoneme pitches, or an energy distribution, corresponding to the respective phoneme energies ([col. 4 lines 5-16] extracting/representing pitch per phoneme in a pitch representation; see “pitch matrix 142”; the pitch contour based on log f0, normalized/fixed-length per phoneme);

determining a speech alignment for a second audio segment, generated by the text-to-speech model, based at least on the duration alignment, the phoneme distribution, and at least one of the pitch distribution or the energy distribution ([col. 7 lines 22-46] aligning generated speech units by duration-based time scaling: aligning the generated spectrogram/pitch contours to the phoneme durations through stretching or compressing operations in de-normalization, and then generating speech segments and concatenating them); and

generating a synthesized audio recitation, as an output audio signal, corresponding to the second audio segment ([col. 7 lines 22-46] producing the output speech signal (concatenated segments) as synthesized speech).

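The time-scaling step cited for the speech-alignment limitation (stretching or compressing a normalized per-phoneme contour to the length given by its duration) can be pictured with a small sketch. This is only an illustration using linear interpolation. The function names `resample_contour` and `align_to_durations`, the sample values, and the choice of interpolation are assumptions and are not taken from Mohammadi or from the application.

```python
from typing import List, Sequence


def resample_contour(contour: Sequence[float], target_len: int) -> List[float]:
    """Stretch or compress a contour to target_len frames by linear interpolation."""
    if not contour or target_len <= 0:
        raise ValueError("need a non-empty contour and a positive target length")
    n = len(contour)
    if n == 1 or target_len == 1:
        return [float(contour[0])] * target_len
    out = []
    for i in range(target_len):
        # Map output frame i onto a fractional position in the input contour.
        pos = i * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(contour[lo] * (1.0 - frac) + contour[hi] * frac)
    return out


def align_to_durations(contours, durations):
    """Rescale each phoneme's contour to its duration (in frames) and concatenate."""
    if len(contours) != len(durations):
        raise ValueError("one duration per phoneme contour is required")
    frames = []
    for contour, n_frames in zip(contours, durations):
        frames.extend(resample_contour(contour, n_frames))
    return frames


# Hypothetical fixed-length (3-point) contours for two phonemes.
normalized = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
aligned = align_to_durations(normalized, durations=[5, 2])
print(len(aligned), aligned)
```

Here the first contour is stretched from 3 to 5 frames and the second is compressed from 3 to 2 frames, so the concatenated result has one frame per unit of duration.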
The difference between the prior art and the claimed invention is that Mohammadi does not explicitly teach determining, for an input received at a text-to-speech model trained, at least in part, using the training data, a duration alignment between a sequence of text associated with the input and a total speech duration using the respective phoneme durations.

Qian teaches determining, for an input received at a text-to-speech model trained, at least in part, using the training data, a duration alignment between a sequence of text associated with the input and a total speech duration using the respective phoneme durations ([Figs. 2-3] [0020] [0023] obtaining durations from a duration model and aligning/adjusting durations to meet a user-modifiable total duration (T is the total duration)).

Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Mohammadi with the teachings of Qian by modifying text to speech synthesis using deep neural network with constant unit length spectrogram as taught by Mohammadi to include determining, for an input received at a text-to-speech model trained, at least in part, using the training data, a duration alignment between a sequence of text associated with the input and a total speech duration using the respective phoneme durations as taught by Qian, for the benefit of post-editing synthesized speech to make it sound more natural (Qian [0002]).

Claim 2, Mohammadi further teaches the computer-implemented method of claim 1, wherein the pitch distribution corresponds to frequency and the energy distribution corresponds to amplitude ([col. 4 lines 5-16] the pitch corresponds to fundamental frequency).

Claim 7, Qian further teaches the computer-implemented method of claim 1, wherein the synthesized audio recitation is generative such that a first synthesized recitation is different from a second synthesized recitation, each of the first synthesized recitation and the second synthesized recitation based on the sequence of text ([Abstract] a visual representation of synthesized speech as one or more waveforms, along with the corresponding text from which the speech was synthesized; to change data corresponding to the prosody e.g. duration, pitch, and/or loudness data with respect to at least one part of the speech; the changed speech can be played back to hear the change in prosody resulting from the interactive changes).

Claim 8, Mohammadi further teaches the computer-implemented method of claim 1, further comprising: aligning a plurality of text tokens, from the sequence of text, to respective mel frames, based on the duration alignment ([col. 3 lines 41-58] [col. 7 lines 22-31] converts the audio representation of the speaker from an audio file to a representation in terms of Mel Cepstral coefficients and pitch; t is the time-resolution of the spectrogram section per phoneme; and the de-normalize spectrum module 482 either stretches or compresses the normalized spectrum so that it is the length specified by the duration).

Claim(s) 3-4 and 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Mohammadi (US 10,186,252) in view of Qian et al. (US 2010/0066742) and further in view of Wei et al. (“Neural Network-Based Modeling of Phonetic Durations”; Sept. 6, 2019).

Claim 3, Mohammadi and Qian teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that neither Mohammadi nor Qian explicitly teaches applying a prior distribution to the duration alignment configured to exclude pairs of phonemes and durations from the plurality of first audio segments that are outside of a specified range.

Wei teaches applying a prior distribution to the duration alignment configured to exclude pairs of phonemes and durations from the plurality of first audio segments that are outside of a specified range ([2.3] [2.4] [5.] we obtain the reference durations from forced-alignment; we detect outliers using the output from the DNN; we use the value of the bin to which the reference duration belongs as the probability of the duration, and by ranking the probabilities we can get a list of phonemes with the lowest probabilities; these phonemes with unlikely duration, which we regard as outliers, can indicate misalignments or departures from the transcription; in training material for TTS these anomalies can be used to correct transcriptions and dictionary entries as well as to exclude unsuitable speech from the TTS training set).

Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Mohammadi and Qian with the teachings of Wei by modifying text to speech synthesis using deep neural network with constant unit length spectrogram as taught by Mohammadi to include applying a prior distribution to the duration alignment configured to exclude pairs of phonemes and durations from the plurality of first audio segments that are outside of a specified range as taught by Wei, for the benefit of checking that text-to-speech (TTS) training speech follows the script and words are pronounced as expected (Wei [Abstract]).

Claim 4, Qian further teaches the computer-implemented method of claim 3, wherein the prior distribution is cigar-shaped ([0019] multi-space probability distribution HMM).

Claim 6, Qian further teaches the computer-implemented method of claim 3, wherein the prior distribution is constructed from a beta-binomial distribution ([0019] multi-space probability distribution HMM).

Allowable Subject Matter

Claims 10-20 are allowed.

Claims 5 and 9 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

The following is a statement of reasons for the indication of allowable subject matter:

For Claim 10: Mohammadi (US 10,186,252) in view of Ping et al. (“Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning”; Feb. 22, 2018) teach all the limitations. The difference between the prior art and the claimed invention is that neither Mohammadi nor Ping explicitly teaches that the synthetic alignment is based on probabilistically sampling phoneme distributions for a plurality of audio segments used to train the text-to-speech model, at inference, and a first alignment between the sequence of text and a total speech duration.

For Claim 17: Mohammadi (US 10,186,252) in view of Ping et al. (“Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning”; Feb. 22, 2018) teach all the limitations. The difference between the prior art and the claimed invention is that neither Mohammadi nor Ping explicitly teaches determining one or more vectors corresponding to one or more speaker characteristics based on a concentrated probability distribution of a text sequence from the text and mel-frames of the alignment distribution across the duration of the plurality of audio samples.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Primary Examiner
Art Unit 2653
/SHREYANS A PATEL/
Examiner, Art Unit 2659

Prosecution Timeline

Aug 15, 2023: Application Filed
Oct 18, 2025: Non-Final Rejection (§103, §DP)
Jan 15, 2026: Applicant Interview (Telephonic)
Jan 15, 2026: Examiner Interview Summary
Jan 26, 2026: Response Filed
Feb 20, 2026: Final Rejection (§103, §DP; current)
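
The reply periods described in the Conclusion of the Office action are counted from the mailing date of the final action. As a rough sketch, the snippet below applies them to the Feb 20, 2026 date shown in this timeline, which is assumed here to be the mailing date. The helper names `add_months` and `shortened_period_end` and the example reply and advisory dates are hypothetical, and the sketch does not model weekend or federal-holiday rollover or extension fees.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Return the same day-of-month `months` later, clamped to the month's end."""
    total = d.month - 1 + months
    year = d.year + total // 12
    month = total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def shortened_period_end(
    mailing: date,
    first_reply: Optional[date] = None,
    advisory_mailed: Optional[date] = None,
) -> date:
    """End of the shortened statutory period under the rule quoted in the action."""
    two_months = add_months(mailing, 2)
    three_months = add_months(mailing, 3)
    six_months = add_months(mailing, 6)
    # If the first reply came within two months and the advisory action was
    # mailed after the three-month date, the period runs to the advisory date,
    # but never past the six-month statutory limit.
    if (
        first_reply is not None
        and advisory_mailed is not None
        and first_reply <= two_months
        and advisory_mailed > three_months
    ):
        return min(advisory_mailed, six_months)
    return three_months


mailing_date = date(2026, 2, 20)  # assumption: timeline date equals mailing date
print("Two-month mark:   ", add_months(mailing_date, 2))
print("Shortened period: ", shortened_period_end(mailing_date))
print("Six-month limit:  ", add_months(mailing_date, 6))
```

Passing a hypothetical early reply and a late advisory date, for example `shortened_period_end(mailing_date, date(2026, 4, 10), date(2026, 6, 1))`, moves the end of the shortened period to the advisory mailing date.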

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586597
ENHANCED AUDIO FILE GENERATOR
2y 5m to grant; granted Mar 24, 2026
Patent 12586561
TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM, AND A METHOD OF CALCULATING AN EXPRESSIVITY SCORE
2y 5m to grant; granted Mar 24, 2026
Patent 12548549
ON-DEVICE PERSONALIZATION OF SPEECH SYNTHESIS FOR TRAINING OF SPEECH RECOGNITION MODEL(S)
2y 5m to grant; granted Feb 10, 2026
Patent 12548583
ACOUSTIC CONTROL APPARATUS, STORAGE MEDIUM AND ACCOUSTIC CONTROL METHOD
2y 5m to grant; granted Feb 10, 2026
Patent 12536988
SPEECH SYNTHESIS METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM
2y 5m to grant; granted Jan 27, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 89%
With Interview: 96% (+7.4%)
Median Time to Grant: 2y 3m
PTA Risk: Moderate
Based on 403 resolved cases by this examiner. Grant probability derived from career allow rate.
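
The headline figures can be reproduced from the counts given on this page. The sketch below assumes the grant probability is simply granted cases divided by resolved cases, and that the interview lift is added in percentage points. The page does not spell out either formula, so both are assumptions, as are the variable names.

```python
granted = 359          # from "359 granted / 403 resolved"
resolved = 403
interview_lift = 7.4   # percentage points, from the reported interview lift

# Career allow rate as a percentage of resolved cases.
allow_rate = 100.0 * granted / resolved

# Assumed additive model: the interview lift is added in percentage points.
with_interview = allow_rate + interview_lift

print(f"Allow rate: {allow_rate:.1f}%")
print(f"With interview: {with_interview:.1f}%")
```

Rounded to whole percentages, these values match the 89% grant probability and the 96% with-interview figure reported above.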
