Prosecution Insights
Last updated: May 29, 2026
Application No. 18/352,980

CONFIGURABLE NEURAL SPEECH SYNTHESIS

Status: Non-Final OA §103
Filed: Jul 14, 2023
Priority: Jun 12, 2020 (provisional 62/705,127, +1 more)
Examiner: PULLIAS, JESSE SCOTT
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: Soundhound Inc.
OA Round: 5 (Non-Final)
Grant Probability: 83% (Favorable)
OA Rounds: 5-6
Est. Remaining: 0m
With Interview: 96%

Examiner Intelligence

Grants 83% (above average)
Career Allowance Rate: 83% (879 granted / 1059 resolved; +21.0% vs TC avg)
Interview Lift: +12.7% (moderate, about +13%; resolved cases with interview)
Typical timeline: 2y 7m avg prosecution; 32 currently pending
Career history: 1100 total applications across all art units

Statute-Specific Performance

§101: 5.4% (-34.6% vs TC avg)
§103: 80.4% (+40.4% vs TC avg)
§102: 8.6% (-31.4% vs TC avg)
§112: 1.2% (-38.8% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 1059 resolved cases
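
As a quick consistency check on the figures above, the sketch below backs out the Tech Center baseline implied by each statute. It assumes each "vs TC avg" delta is the examiner's figure minus the Tech Center average, in percentage points; the page does not state its formula, and the variable names are illustrative only.

```python
# (examiner figure in %, delta vs TC avg in points), as quoted on this page
stats = {
    "§101": (5.4, -34.6),
    "§103": (80.4, 40.4),
    "§102": (8.6, -31.4),
    "§112": (1.2, -38.8),
}

# Assumption: delta = examiner figure - Tech Center average
implied_baseline = {
    statute: round(value - delta, 1) for statute, (value, delta) in stats.items()
}

for statute, baseline in implied_baseline.items():
    print(f"{statute}: implied TC avg = {baseline}%")
```

Under that assumption, all four statutes point to the same 40.0% baseline, which would be consistent with a single Tech Center average estimate drawn as the black line.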

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/27/26 has been entered.

This office action is in response to correspondence filed 01/27/26 regarding application 18/352,980, in which claims 1 and 14 were amended, claims 10 and 20 were cancelled, and new claims 21-22 were added. Claims 1, 3-9, 12, 14-19, 21, and 22 are pending in the application and have been considered.

Response to Arguments

Applicant’s arguments on pages 7-12 regarding the 35 U.S.C. 103 rejections of claims 1, 3-10, 12, and 14-20 based on Zhao, Sahu, and Li have been considered but are moot in view of the new grounds for rejection based in part on the newly discovered reference to Saito et al. (“Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks”. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 26, NO. 1, JANUARY 2018).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-9, 12, 14-19, 21, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao et al. (“Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder”. IEEE Access, Volume 6, pages 60478-60488, 24 September 2018) in view of Sahu et al. (“Modeling Feature Representations for Affective Speech using Generative Adversarial Networks”. arXiv:1911.00030v1 [cs.LG] 31 Oct 2019), in further view of Li et al. (US 20090006096), in further view of Saito et al. (“Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks”. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 26, NO. 1, JANUARY 2018).

Consider claim 1, Zhao discloses a computerized process of training a neural speech synthesis model (training of the WaveNet vocoder, which is a neural speech synthesis model, Abstract, page 60478) that can generate speech audio conditioned on a value of a voice property (the generator of GAN is conditioned on speaker code, page 60479, Section I C), the computerized process comprising: obtaining source samples of speech audio (for each speaker, speech waveforms were recorded with sampling frequency of 16 kHz and 16-bit PCM format, page 60484, Section IV); obtaining a transcription of said source samples (obtaining text corresponding to the 1000 utterances used for training is inherent to use as a TTS training example, Section IV, page 60484); labeling the source samples with discrete values of a voice property (speaker codes consist of seven dimensions, where six dimensions represent speaker identity and the other dimension denotes gender, page 60484, Section IV); training, from the source samples and labels, a discriminator (Discriminator is trained using Natural Speech Params, Fig 3, page 60481; see Algorithm 1 on page 60483, step 2, noting that y is the mel-spectrogram derived from the speech samples, page 60484 section IV, and c is the speaker code, i.e. 
label); and training the neural speech synthesis model by: synthesizing a multiplicity of synthesized speech samples using the neural speech synthesis model with a multiplicity of values of the voice property to generate synthesized speech samples (generate s^ from the well-trained WaveNet model using parameter C, see Algorithm 1 on page 60483, step 2; WaveNet models a joint distribution of sequential data as a product of conditional distributions, page 60481, Section IIC), computing corresponding probabilities for the synthesized speech samples using the discriminator (discriminator estimates the probability that the sample y came from a real data set distribution rather than generator distribution, page 60481, Section IIB). Zhao does not specifically mention wherein the discrete values comprises at least an attitude property; and a discriminator configured to output a probability vector representing voice property values including one or more attitude properties for voice synthesis. Sahu discloses discrete values comprises at least an attitude property (utterances are labeled as happy, sad, angry, excitement, and neutral, page 4, Section 4.1.1); and a discriminator configured to output a probability vector representing voice property values including one or more attitude properties for voice synthesis (discriminator D_2 auxiliary layer outputs conditional distribution of the labels, i.e. probability vector, given the synthetic feature vectors, i.e. the synthesized voice data, page 5, Section 4.2). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhao such that the discrete values comprises at least an attitude property; and such that the discriminator configured to output a probability vector representing voice property values including one or more attitude properties for voice synthesis in order to optimize training for separating the emotional classes, as suggested by Sahu (page 12, Section 6), predictably improving classification robustness, as suggested by Sahu (page 1, Section 1). The references cited are analogous art in the same field of speech synthesis. Zhao and Sahu do not specifically mention discrete values comprises a timbre voice property; and voice property values including timbre voice properties for voice synthesis; receiving, via a graphic user interface, a plurality of parameter values representing output voice property values including one or more attitude properties and timbre voice properties for voice synthesis; and outputting synthesized speech audio, wherein the synthesized speech audio are defined by the plurality of parameter values representing output voice property values. Li discloses discrete values comprises a timbre voice property (e.g. Hoarse-Like, Fig. 4, [0032], the slider showing 3 discrete notches); and voice property values including timbre voice properties for voice synthesis (Hoarse-Like being a timbre property applied to the synthesized speech, [0032], [0047]); receiving, via a graphic user interface, a plurality of parameter values representing output voice property values including one or more attitude properties and timbre voice properties for voice synthesis (user adjusts parameters using sliders and buttons, [0061], Fig. 4, to enter Hoarse-Like, a timbre property, and other parameters indicative of various emotions, i.e. 
attitudes, [0020]); and outputting synthesized speech audio, wherein the synthesized speech audio are defined by the plurality of parameter values representing output voice property values (the parameters are applied to the synthesized speech, [0032], [0047], and the waveform is output, [0062]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhao and Sahu such that discrete values comprises a timbre voice property; and voice property values including timbre voice properties for voice synthesis; receiving, via a graphic user interface, a plurality of parameter values representing output voice property values including one or more attitude properties and timbre voice properties for voice synthesis; and outputting synthesized speech audio, wherein the synthesized speech audio are defined by the plurality of parameter values representing output voice property values in order to ease the process of customizing a text-to-speech engine, as suggested by Li ([0002]), predictably reducing user frustration, as suggested by Li ([0003]). The references cited are analogous art in the same field of speech synthesis. Zhao, Sahu, and Li do not specifically mention computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities; and computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples. 
Saito discloses computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities (minimizing the Discriminator loss L(GAN)ADV, which depends on differences between the discriminator’s output probabilities and the desired classification, via backpropagation of the neural synthesis model, Fig. 2, page 86); and computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples (generation loss LMGE, which depends on differences between the generated speech and the natural speech, is minimized via backpropagation, Fig. 2, page 86, causing the synthesized speech to more closely match the source). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhao, Sahu, and Li by computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities; and computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples in order to alleviate the effect of over-smoothing generated speech parameters, as suggested by Saito (Section 1, page 85), predictably improving synthesized speech quality, as suggested by Saito (Section 1, pages 84-85). The references cited are analogous art in the same field of speech synthesis. 
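
For readers less familiar with the two loss terms discussed in the claim 1 analysis, the following is a minimal sketch of the general idea, not the method of the application or of any cited reference. It assumes a toy two-feature "generator", a toy "discriminator" that applies a softmax directly to those features, cross-entropy as the property-learning loss, and mean squared error as the source-matching loss. All function names and numbers are hypothetical, and central finite differences stand in for back-propagation.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability vector."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def property_loss(target_onehot, probs):
    """Cross-entropy between the requested property value and the probability vector."""
    return -sum(t * math.log(p) for t, p in zip(target_onehot, probs) if t > 0)

def source_matching_loss(source, synthesized):
    """Mean squared error between source features and synthesized features."""
    return sum((s - y) ** 2 for s, y in zip(source, synthesized)) / len(source)

def synthesize(weights, cond):
    """Toy generator: weights[i][j] maps property value j to output feature i."""
    return [sum(w * c for w, c in zip(row, cond)) for row in weights]

def total_loss(weights, cond, source):
    synth = synthesize(weights, cond)
    probs = softmax(synth)  # toy discriminator: features used directly as class scores
    return property_loss(cond, probs) + source_matching_loss(source, synth)

def gradient_step(weights, cond, source, lr=0.1, eps=1e-6):
    """One descent step; central differences stand in for back-propagation."""
    updated = [row[:] for row in weights]
    for i, row in enumerate(weights):
        for j in range(len(row)):
            up = [r[:] for r in weights]
            down = [r[:] for r in weights]
            up[i][j] += eps
            down[i][j] -= eps
            grad = (total_loss(up, cond, source) - total_loss(down, cond, source)) / (2 * eps)
            updated[i][j] = weights[i][j] - lr * grad
    return updated

cond = [0.0, 1.0]                    # one-hot label for a hypothetical property value
source = [0.2, 1.0]                  # made-up source features
weights = [[0.0, 0.0], [0.0, 0.0]]   # untrained toy generator

before = total_loss(weights, cond, source)
weights = gradient_step(weights, cond, source)
after = total_loss(weights, cond, source)
print(f"loss before: {before:.4f}, after one step: {after:.4f}")
```

In this toy setup a single weight update lowers the combined loss, since the step moves the synthesized features toward both the labeled property value and the source features. A real system would use automatic differentiation and a separately trained discriminator network.
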
Consider claim 14, Zhao discloses a computer system for training a neural speech synthesis model (training of the WaveNet vocoder, which is a neural speech synthesis model, Abstract, page 60478) to generate speech audio conditioned on a value of a voice property (the generator of GAN is conditioned on speaker code, page 60479, Section I C), comprising: at least one processor (a GeForce GTX 1080 was used for training, page 60484, Section IV, which required a week to train the WaveNet vocoder and 8 minutes to synthesize 10 seconds of speech); and memory including instructions that, when executed by the at least one processor, cause the computer system (inherent for running the artificial neural network experiments on the data sets described on page 60484, Section IV) to: obtain source samples of speech audio (for each speaker, speech waveforms were recorded with sampling frequency of 16 kHz and 16-bit PCM format, page 60484, Section IV); obtain a transcription of said source samples (obtaining text corresponding to the 1000 utterances used for training is inherent to use as a TTS training example, Section IV, page 60484); label the source samples with discrete values of a voice property (speaker codes consist of seven dimensions, where six dimensions represent speaker identity and the other dimension denotes gender, page 60484, Section IV); train, from the source samples and labels, a discriminator (Discriminator is trained using Natural Speech Params, Fig 3, page 60481; see Algorithm 1 on page 60483, step 2, noting that y is the mel-spectrogram derived from the speech samples, page 60484 section IV, and c is the speaker code, i.e. 
label); and train the neural speech synthesis model by: synthesizing a multiplicity of synthesized speech samples using the neural speech synthesis model with a multiplicity of values of the voice property to generate synthesized speech samples (generate s^ from the well-trained WaveNet model using parameter C, see Algorithm 1 on page 60483, step 2; WaveNet models a joint distribution of sequential data as a product of conditional distributions, page 60481, Section IIC), computing corresponding probabilities for the synthesized speech samples using the discriminator (discriminator estimates the probability that the sample y came from a real data set distribution rather than generator distribution, page 60481, Section IIB). Zhao does not specifically mention wherein the discrete values comprises at least an attitude property; and a discriminator configured to output a probability vector representing voice property values including one or more attitude properties for voice synthesis. Sahu discloses discrete values comprises at least an attitude property (utterances are labeled as happy, sad, angry, excitement, and neutral, page 4, Section 4.1.1); and a discriminator configured to output a probability vector representing voice property values including one or more attitude properties for voice synthesis (discriminator D_2 auxiliary layer outputs conditional distribution of the labels, i.e. probability vector, given the synthetic feature vectors, i.e. the synthesized voice data, page 5, Section 4.2). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhao such that the discrete values comprises at least an attitude property; and such that the discriminator configured to output a probability vector representing voice property values including one or more attitude properties for voice synthesis for reasons similar to those for claim 1. 
Zhao and Sahu do not specifically mention discrete values comprises a timbre voice property; and voice property values including timbre voice properties for voice synthesis; receive, via a graphic user interface, a plurality of parameter values representing output voice property values including one or more attitude properties and timbre voice properties for voice synthesis; and output synthesized speech audio, wherein the synthesized speech audio are defined by the plurality of parameter values representing output voice property values. Li discloses discrete values comprises a timbre voice property (e.g. Hoarse-Like, Fig. 4, [0032], the slider showing 3 discrete notches); and voice property values including timbre voice properties for voice synthesis (Hoarse-Like being a timbre property applied to the synthesized speech, [0032], [0047]); receiving, via a graphic user interface, a plurality of parameter values representing output voice property values including one or more attitude properties and timbre voice properties for voice synthesis (user adjusts parameters using sliders and buttons, [0061], Fig. 4, to enter Hoarse-Like, a timbre property, and other parameters indicative of various emotions, i.e. attitudes, [0020]); and outputting synthesized speech audio, wherein the synthesized speech audio are defined by the plurality of parameter values representing output voice property values (the parameters are applied to the synthesized speech, [0032], [0047], and the waveform is output, [0062]). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhao and Sahu such that discrete values comprises a timbre voice property; and voice property values including timbre voice properties for voice synthesis; receiving, via a graphic user interface, a plurality of parameter values representing output voice property values including one or more attitude properties and timbre voice properties for voice synthesis; and outputting synthesized speech audio, wherein the synthesized speech audio are defined by the plurality of parameter values representing output voice property values for reasons similar to those for claim 1. Zhao, Sahu, and Li do not specifically mention computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities; and computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples. Saito discloses computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities (minimizing the Discriminator loss L(GAN)ADV, which depends on differences between the discriminator’s output probabilities and the desired classification, via backpropagation of the neural synthesis model, Fig. 
2, page 86); and computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples (generation loss LMGE, which depends on differences between the generated speech and the natural speech, is minimized via backpropagation, Fig. 2, page 86, causing the synthesized speech to more closely match the source). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhao, Sahu, and Li by computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities; and computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples for reasons similar to those for claim 1. Consider claim 3, Zhao discloses the neural speech synthesis model is configured to: receive a string of text and at least one voice property value with a perceptible meaning (for speech synthesis from text, the generator is conditioned on linguistic vectors, page 60481, Section IIB, as well as the speaker codes, where six dimensions represent speaker identity and the other dimension denotes gender, page 60484, Section IV, the gender of the speaker being perceivable by a listener); synthesize speech audio corresponding to the string of text using a neural speech synthesis model that conditions a sound of speech audio on the at least one voice property value to generate synthesized speech audio (the generated audio samples produced using WaveNet, which are produced from the Generator conditioned on the speaker codes, page 60483, Fig. 
5); and output the synthesized speech audio, wherein the sound of the synthesized speech audio perceptually relates to the at least one voice property value (as perceived by the listeners in the listening tests, which perceived same or different speakers, page 60486, Section VA). Consider claim 4, Zhao discloses discrete values further comprise at least one of a gender voice property, an age voice property, and an accent voice property (six dimensions of the speaker codes represent speaker identity and the other dimension denotes gender, page 60484, Section IV, noting that the claim language only requires “at least one of”). Consider claim 5, Zhao and Sahu do not, but Li discloses the speech synthesis model is further configured to: enable download of the synthesized speech audio (the synthesized .wav file is downloaded by the user over the internet, [0062]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhao and Sahu that the speech synthesis model is further configured to: enable download of the synthesized speech audio for reasons similar to those for claim 1. Consider claim 6, Zhao discloses the speech synthesis model is further configured to: enable playback of the synthesized speech audio (inherent for conducting listening tests, which perceived same or different speakers, page 60486, Section VA). Consider claim 7, Zhao and Sahu do not, but Li discloses the speech synthesis model is further configured to: provide a graphical user interface that includes one of a text input field or a voice property value input field (user adjusts parameters using sliders and buttons, [0061], on the GUI of Fig. 4, to input voice properties, [0020]). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhao and Sahu such that the speech synthesis model is further configured to: provide a graphical user interface that includes one of a text input field or a voice property value input field for reasons similar to those for claim 1. Consider claim 8, Zhao discloses the string of text is associated with at least one text tag (linguistic features in the form of linguistic vectors x are considered to categorize, i.e. tag, the text string, page 60481, Section IIB; at the very least, the linguistic vectors are “associated” with the input text). Consider claim 9, Zhao discloses the string of text indicates dynamically configurable voice parameter values (for speech synthesis from text, the generator is conditioned on linguistic vectors, page 60481, Section IIB; therefore the source text string may be said to “indicate” dynamically configurable voice parameter values, noting that the claim language does not actually require dynamically configuring them). Consider claim 12, Zhao discloses the source samples of the speech audio are obtained from one of a person and an audio generation system (six speakers who provided 1000 utterances, Section IV, page 60484). Consider claim 15, Zhao discloses the discrete values comprise at least one of a gender voice property, an age voice property, and an accent voice property (six dimensions of the speaker codes represent speaker identity and the other dimension denotes gender, page 60484, Section IV, noting that the claim language only requires “at least one of”). Consider claim 16, Zhao and Sahu do not, but Li discloses the speech synthesis model is further configured to: enable download of the synthesized speech audio (the synthesized .wav file is downloaded by the user over the internet, [0062]). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhao and Sahu that the speech synthesis model is further configured to: enable download of the synthesized speech audio for reasons similar to those for claim 1. Consider claim 17, Zhao discloses the neural speech synthesis model is further configured to: enable playback of the synthesized speech audio (inherent for conducting listening tests, which perceived same or different speakers, page 60486, Section VA). Consider claim 18, Zhao and Sahu do not, but Li discloses the speech synthesis model is further configured to: provide a graphical user interface that includes one of a text input field or a voice property value input field (user adjusts parameters using sliders and buttons, [0061], on the GUI of Fig. 4, to input voice properties, [0020]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhao and Sahu such that the speech synthesis model is further configured to: provide a graphical user interface that includes one of a text input field or a voice property value input field for reasons similar to those for claim 1. Consider claim 19, Zhao discloses the string of text is associated with at least one text tag (linguistic features in the form of linguistic vectors x are considered to categorize, i.e. tag, the text string, page 60481, Section IIB; at the very least, the linguistic vectors are “associated” with the input text). 
Consider claim 21, Zhao discloses training the neural speech synthesis model further comprises: training the speech synthesis model for text to speech and training the speech synthesis model's ability to provide different voice sounds (Algorithm 1 Training Algorithm for Acoustic Modeling, page 60483; training the TTS is considered to train the speech synthesis model's ability to provide different voice sounds such as vowels and consonants). Consider claim 22, Zhao discloses training the neural speech synthesis model further comprises: training the speech synthesis model for text to speech and training the speech synthesis model's ability to provide different voice sounds (Algorithm 1 Training Algorithm for Acoustic Modeling, page 60483; training the TTS is considered to train the speech synthesis model's ability to provide different voice sounds such as vowels and consonants). Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jesse Pullias whose telephone number is 571/270-5135. The examiner can normally be reached on M-F 8:00 AM - 4:30 PM. The examiner’s fax number is 571/270-6135. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders can be reached on 571/272-7516. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. 
Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). /Jesse S Pullias/ Primary Examiner, Art Unit 2655 05/07/26

Prosecution Timeline

Jul 16, 2025
Request for Continued Examination
Jul 17, 2025
Response after Non-Final Action
Aug 01, 2025
Non-Final Rejection mailed — §103
Oct 31, 2025
Response Filed
Nov 17, 2025
Final Rejection mailed — §103
Jan 27, 2026
Request for Continued Examination
Jan 28, 2026
Response after Non-Final Action
May 11, 2026
Non-Final Rejection mailed — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12639531
System and method for increasing the accuracy of text summarization
2y 2m to grant Granted May 26, 2026
Patent 12632483
Determining Repair Information Via Automated Analysis Of Structured And Unstructured Repair Data
3y 1m to grant Granted May 19, 2026
Patent 12632659
EXPLAINABLE AND EFFICIENT TEXT SUMMARIZATION
2y 7m to grant Granted May 19, 2026
Patent 12626063
FORMING A HYPOTHESIS SET FROM SENTENCES ACROSS DOCUMENTS REPRESENTATIVE OF DIFFERENT STANCES TAKEN ACROSS THE DOCUMENTS
2y 9m to grant Granted May 12, 2026
Patent 12626070
SERVERLESS FUNCTIONAL ROUTING FOR LARGE LANGUAGE MODEL INFERENCE SERVICE
2y 3m to grant Granted May 12, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

5-6
Expected OA Rounds
83%
Grant Probability
96%
With Interview (+12.7%)
2y 7m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 1059 resolved cases by this examiner. Grant probability derived from career allowance rate.
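
The headline percentages can be reproduced from the counts quoted on this page. The sketch below assumes the grant probability is simply granted divided by resolved, and that the interview lift is added in percentage points; the page does not publish its exact formula, so treat both as assumptions.

```python
# Counts and lift as quoted on this page
granted = 879
resolved = 1059
interview_lift_points = 12.7  # "+12.7%" read as percentage points (assumption)

# Career allowance rate, used here as the grant probability
allowance_rate = 100 * granted / resolved

# Assumed additive lift for cases with an examiner interview
with_interview = allowance_rate + interview_lift_points

print(f"Career allowance rate: {allowance_rate:.1f}%")
print(f"With interview: {with_interview:.1f}%")
```

Rounded to whole percentages, these give 83% and 96%, matching the figures shown above.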
