Prosecution Insights
Last updated: April 19, 2026
Application No. 18/174,145

TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, AND A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM

Non-Final OA §103
Filed
Feb 24, 2023
Examiner
ISKENDER, ALVIN ALIK
Art Unit
2654
Tech Center
2600 — Communications
Assignee
Spotify AB
OA Round
3 (Non-Final)
48%
Grant Probability
Moderate
3-4
OA Rounds
3y 4m
To Grant
99%
With Interview

Examiner Intelligence

Grants 48% of resolved cases
48%
Career Allow Rate
12 granted / 25 resolved
-14.0% vs TC avg
Strong +60% interview lift (with vs. without interview)
+60.3%
Interview Lift
based on resolved cases with interview
Typical timeline
3y 4m
Avg Prosecution
20 currently pending
Career history
45
Total Applications
across all art units

Statute-Specific Performance

§101
15.6%
-24.4% vs TC avg
§103
53.0%
+13.0% vs TC avg
§102
25.8%
-14.2% vs TC avg
§112
5.4%
-34.6% vs TC avg
"vs TC avg" figures are relative to a Tech Center average estimate • Based on career data from 25 resolved cases

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 04 November 2025 have been fully considered but they are not persuasive. Applicant argues that Chen does not teach “training a second model of the synthesizer, distinct from the first model, to output speech data with a second level of the speech attribute, different from the first level of the speech attribute, by further training the first model using the second sub-dataset until the performance metric reaches a second predetermined value that is larger than the first predetermined value.” As explained in the rejection below, Wang teaches this limitation in Sections 2.2, 6, and 6.1. The inference method disclosed by Wang conditions the first model on new data, creating a second model distinct from the first. Conditioning is understood by those with ordinary skill in the art to be further training of a model. The statement in Section 6, “We train models using 147 hours of American English audiobook data”, referring to the inference method, also indicates this. There is no scaling operation that modifies the output of a trained model directly; rather, a model may be conditioned on a scaled token, or the model may be conditioned on a further set of data (such as the 147 hours of audiobook data used as an example in Wang). The latter maps to training a second model by further training the first model using the second sub-dataset. The outcome is the creation of multiple models that emphasize different aspects of speech, such as faster or emotive speech.

Claim Objections

Claim 8 is objected to because of the following informalities: the word 'wherein' should be removed from "wherein further comprising refreshing the second model" in lines 1-2. Claims 9-11 are objected to as being dependent on claim 8 and containing a similar defect. Appropriate correction is required.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 3-6, 17-18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Chen (US 20200135172 A1) in view of Wang ("Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis") and Tits (Exploring Transfer Learning for Low Resource Emotional TTS).
Regarding claim 1, Chen discloses a method of text-to-speech synthesis, comprising: training a synthesizer comprising a prediction network, including: obtaining a first sub-dataset and a second sub-dataset, wherein the first sub-dataset and the second sub-dataset each comprise audio samples and corresponding text, and the speech attribute of the audio samples of the second sub-dataset is more pronounced than the speech attribute of the audio samples of the first sub-dataset (claim 1, [0024]-[0026], [0028]-[0030]: a training set containing text and audio data of a plurality of speakers, and an adaptation set containing text and audio data for a new individual speaker); training a first model using the first sub-dataset until a performance metric reaches a first predetermined value ([0028], [0035]: training a model with a first plurality of speakers; training inherently involves optimizing a value to a predetermined margin; maximizing log-likelihood); after training the second model: receiving text ([0026]: text is used as an input); inputting the received text in a synthesizer, wherein the synthesizer comprises a prediction network that is configured to convert the received text into speech data having a speech attribute ([0026]: the synthesizer converts input text to a waveform having voice characteristics of a particular speaker); and outputting the speech data corresponding to the received text ([0026]: output waveform).

However, Chen does not disclose: wherein the speech attribute comprises one or more of the group consisting of: intention, projection, pace, and accent; training a first model to output the speech data with a first level of the speech attribute; training a second model of the synthesizer, distinct from the first model, to output speech data with a second level of the speech attribute, different from the first level of the speech attribute, by further training the first model using the second sub-dataset until the performance metric reaches a second predetermined value that is larger than the first predetermined value; or selecting one of the trained first and second models as the prediction network based on a desired level of the speech attribute of the output speech data corresponding to the received text.

Wang does disclose wherein the speech attribute comprises one or more of the group consisting of: intention, projection, pace, and accent (Abstract, Introduction: adjusting speaking style including intention and speed); training a first model to output the speech data with a first level of the speech attribute (Section 6.1.2: style scaling; the model may be conditioned to produce faster or more animated speech, for example); training a second model of the synthesizer, distinct from the first model, to output speech data with a second level of the speech attribute, different from the first level of the speech attribute, by further training the first model using the second sub-dataset until the performance metric reaches a second predetermined value that is larger than the first predetermined value (Section 2.2: conditioning the initial model on new data to achieve style control or style transfer; conditioning is understood in the art to be further training of the model; Section 6: "We train models using 147 hours of American English audiobook data", indicating that the inference/conditioning method is further training; Section 6.1: training multiple models to accentuate certain attributes of speech such as speaking speed or the emotion of the speech); and selecting one of the trained first and second models as the prediction network based on a desired level of the speech attribute of the output speech data (Sections 2.2, 6.1: models can be conditioned to output different styles or attribute levels).

Tits also discloses training a first model to output the speech data with a first level of the speech attribute using a first sub-dataset (Figure 1: untuned neutral model); and training a second model to output the speech data with a second level of the speech attribute using a second sub-dataset (Figure 1: models fine-tuned to sound more amused, angry, disgusted, or sleepy using a sub-dataset).

While Chen discloses outputting speech data having particular voice characteristics with its prediction model, it does not disclose the particular speech attributes claimed. Wang does disclose this concept, as well as the ability to scale the speech attribute to varying degrees. Tits also shows how fine-tuning with sub-datasets may be used for such a purpose. It would have been obvious to one with ordinary skill in the art before the effective filing date to incorporate the teachings of Wang and Tits with Chen because it would allow one to control attributes such as the speed and style of speech synthesis models (Wang, Abstract).

Regarding claim 3, the limitations of parent claim 1 are disclosed as explained above. Chen further discloses the method wherein the performance metric comprises one or more of the group consisting of: a validation loss, a speech pattern accuracy test, a mean opinion score (MOS), a MUSHRA score, a transcription metric, an attention score, and a robustness score ([0037]: using hold-out data [i.e. a validation set for computing validation loss] to calculate an early stopping point).

Regarding claim 4, the limitations of parent claim 3 are disclosed as explained above. Chen further discloses the method wherein the performance metric is the validation loss, and the first predetermined value is less than the second predetermined value ([0037]: using hold-out data [i.e. a validation set for computing validation loss] to calculate an early stopping point. The particular threshold to calculate the stopping point for each set of training is the result of routine optimization and experimentation with machine-learning parameters well-known to those with ordinary skill in the art. Furthermore, it would be reasonable to assume that the error would be higher for the second model as it is adapted to a more specialized and smaller dataset).

Regarding claim 5, the limitations of parent claim 4 are disclosed as explained above. Chen further teaches the method wherein the second predetermined value is 0.6 or less ([0037]: early termination based on a particular fraction of hold-out data; the particular threshold to calculate the stopping point would be arrived at through routine optimization and experimentation with machine-learning parameters well-known to those with ordinary skill in the art).

Regarding claim 6, the limitations of parent claim 3 are disclosed as explained above. Tits further teaches the method wherein the performance metric is the transcription metric, and the second predetermined value is 1 or less (Section 3.1: word accuracy, i.e. a transcription metric; the metric is between 0 and 1, see Table 2).

Regarding claims 17-18, they are analogous to claim 1 and are thus rejected in a similar fashion.
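As a rough illustration only (not part of the Office action, the application, or the cited references), the two-threshold training scheme the rejection maps onto claims 1 and 4 can be sketched in plain Python: train a first model on the first sub-dataset until a validation loss reaches a first value, produce a second model by further training a copy of it on the second sub-dataset until a larger second value is reached, then select between them by the desired attribute level. The helper functions, dummy "model" and "dataset" dicts, and threshold numbers below are all assumptions for illustration.

```python
import copy

def train_one_step(model, dataset):
    """Stand-in for one optimization step on (audio, text) pairs."""
    model["steps"] += 1

def validation_loss(model, dataset):
    """Stand-in for a held-out validation loss that falls as training proceeds."""
    return dataset["difficulty"] / (1 + model["steps"])

def train_until(model, dataset, loss_threshold, max_steps=1000):
    """Further train `model` on `dataset` until the validation loss reaches the threshold."""
    while validation_loss(model, dataset) > loss_threshold and model["steps"] < max_steps:
        train_one_step(model, dataset)
    return model

first_sub_dataset = {"difficulty": 10.0}   # e.g. neutral read speech
second_sub_dataset = {"difficulty": 30.0}  # speech attribute more pronounced (smaller set)

# First model, trained until the first (smaller) predetermined value is reached.
first_model = train_until({"steps": 0}, first_sub_dataset, loss_threshold=0.3)

# Second model: further training of a copy of the first model on the second
# sub-dataset, stopping at a larger second predetermined value.
second_model = train_until(copy.deepcopy(first_model), second_sub_dataset, loss_threshold=0.6)

# Select the prediction network based on the desired level of the speech attribute.
want_pronounced_attribute = True
prediction_network = second_model if want_pronounced_attribute else first_model
```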
Claim(s) 2 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Wang and Tits as applied to claims 1 and 18 above, and further in view of Chu (An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation).

Regarding claim 2, the limitations of parent claim 1 are disclosed as explained above. None of Wang, Tits, or Chen discloses the method wherein the obtaining of the prediction network further comprises refreshing the second model, wherein refreshing the second model comprises further training the second model using the first sub-dataset until the performance metric reaches a third predetermined value. However, Chu does teach the method wherein the obtaining of the prediction network further comprises refreshing the second model, wherein refreshing the second model comprises further training the second model using the first sub-dataset until the performance metric reaches a third predetermined value (Section 3.3: mixed fine-tuning; further train the fine-tuned model with the initial first data until convergence). It would have been obvious to one with ordinary skill in the art before the effective filing date to incorporate the mixed fine-tuning technique of Chu because it addresses the problem of overfitting to the adaptation data (See Chen section 1).

Regarding claim 19, it is analogous to claim 2 and is thus rejected in a similar fashion.
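Continuing the illustrative sketch above (it reuses `copy`, `train_until`, `first_sub_dataset`, and `second_model`), "refreshing" the second model as the rejection reads claim 2 onto Chu's mixed fine-tuning amounts to further training the already fine-tuned model on the first sub-dataset until a third value is reached. The third threshold is an invented figure; Chu itself describes training on a mix of the original and in-domain data until convergence.

```python
# Illustrative continuation of the previous sketch, not code from the record.
# "Refreshing" the second model per the claim 2 mapping: further train the
# fine-tuned second model on the first sub-dataset until a third predetermined
# value is reached (threshold below is an assumption).
refreshed_second_model = train_until(
    copy.deepcopy(second_model),
    first_sub_dataset,        # back to the initial data, per the claim language
    loss_threshold=0.15,      # a third predetermined value (illustrative)
)
```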
Claim(s) 7, 12-16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Wang and Tits, and further in view of Lee (Learning pronunciation from a foreign language in speech synthesis networks).

Regarding claim 7, the following is obvious over Chen in view of Wang and Tits for the same reasons as explained for claim 1 above, as they are analogous: training a synthesizer comprising a prediction network, including obtaining a first sub-dataset and a second sub-dataset, wherein the first sub-dataset and the second sub-dataset each comprise audio samples and corresponding text, and the speech attribute of the audio samples of the second sub-dataset is more pronounced than the speech attribute of the audio samples of the first sub-dataset; training a first model to output the speech data with a first level of the speech attribute until a performance metric reaches a first predetermined value; training a second model of the synthesizer, distinct from the first model, to output speech data with a second level of the speech attribute, different from the first level of the speech attribute, by further training the first model using the second sub-dataset until the performance metric reaches a second predetermined value that is larger than the first predetermined value; selecting one of the trained first and second models as the prediction network based on a desired level of the speech attribute of the output speech data; after training the second model: receiving text; inputting the received text in a synthesizer, wherein the synthesizer comprises a prediction network that is configured to convert the received text into speech data having a speech attribute, wherein the speech attribute comprises one or more of the group consisting of emotion, intention, projection, pace, and accent; and outputting the speech data.

However, neither Chen, Wang, nor Tits discloses combining the first sub-dataset and the second sub-dataset into a combined dataset, wherein the combined dataset comprises audio samples and corresponding text from the first sub-dataset and the second sub-dataset, or training a first model using the combined dataset. Lee does disclose combining the first sub-dataset and the second sub-dataset into a combined dataset, wherein the combined dataset comprises audio samples and corresponding text from the first sub-dataset and the second sub-dataset (Section 2.2: union of the high resource language dataset and the low resource language dataset), and training a first model using the combined dataset (Section 2.2: pre-training a first model using the union). It would have been obvious to one with ordinary skill in the art before the effective filing date to combine the first and second datasets as in Lee when training the first model as in Chen, because doing so would leverage the prior availability of the second dataset and improve the generation quality (See Lee Section 3).

Regarding claim 12, the limitations of parent claim 7 are disclosed as explained above. Chen further discloses the method wherein the second sub-dataset comprises fewer samples than the first sub-dataset ([0030]: adaptation data orders of magnitude smaller than the training data).

Regarding claim 13, the limitations of parent claim 7 are disclosed as explained above. Chen further discloses the method wherein the audio samples of the first sub-dataset and the second sub-dataset are recorded by a human actor ([0027]-[0028]: recorded audio of speakers).

Regarding claim 14, the limitations of parent claim 7 are disclosed as explained above. Tits further discloses the method wherein the first model is pre-trained prior to training with the first sub-dataset or the combined dataset (Figure 1: the model is pre-trained on LJ-speech before being trained with the neutral subset).

Regarding claim 15, the limitations of parent claim 14 are disclosed as explained above. Tits further discloses the method wherein the first model is pre-trained using a dataset comprising audio samples from one or more human voices (Section 2.4: recordings of speech from a speaker).

Regarding claim 16, the limitations of parent claim 7 are disclosed as explained above. Tits further discloses the method wherein the audio samples of the first sub-dataset and of the second sub-dataset are from a same domain, wherein the same domain refers to a topic that the method is applied in (Fig 1: the first neutral subset and the second emotion subset are from a shared emotion database).
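Purely as an illustration of the combined-dataset step that claim 7 adds, and that the rejection attributes to Lee Section 2.2, combining the two sub-datasets can be pictured as a union of their (audio, text) pairs; the file names below are invented placeholders, not data from the record.

```python
# Illustrative only; file names are made-up placeholders. Claim 7, as mapped
# onto Lee Section 2.2, trains the first model on the union of the two
# sub-datasets rather than on the first sub-dataset alone.
first_sub = [
    ("neutral_0001.wav", "Hello there."),
    ("neutral_0002.wav", "Good morning."),
]
second_sub = [
    ("pronounced_0001.wav", "Hello there!"),
    ("pronounced_0002.wav", "Good morning!"),
]

# The combined dataset contains audio samples and corresponding text from both.
combined_dataset = first_sub + second_sub
```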
Claim(s) 8-11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Wang, Tits, and Lee as applied to claim 7 above, and further in view of Chu.

Regarding claim 8, the limitations of parent claim 7 are disclosed as explained above. None of Tits, Wang, Lee, or Chen discloses the method comprising refreshing the second model, wherein refreshing the second model comprises further training the second model using the combined dataset. However, Chu does disclose the method wherein the obtaining of the prediction network further comprises refreshing the second model, wherein refreshing the second model comprises further training the second model using the combined dataset (Section 3.3: mixed fine-tuning). It would have been obvious to one with ordinary skill in the art before the effective filing date to incorporate the mixed fine-tuning technique of Chu because it addresses the problem of overfitting to the adaptation data (See Chen section 1).

Regarding claim 9, the limitations of parent claim 8 are disclosed as explained above. Chu further discloses the method wherein refreshing the second model is performed until a performance metric reaches a predetermined value (Section 3.3: mixed fine-tuning until convergence).

Regarding claim 10, the limitations of parent claim 9 are disclosed as explained above. Chu further discloses the method wherein the performance metric comprises one or more of the group consisting of: a validation loss, a speech pattern accuracy test, a mean opinion score (MOS), a MUSHRA score, a transcription metric, an attention score, and a robustness score ([0037]: using hold-out data [i.e. a validation set for computing validation loss] to calculate an early stopping point).

Regarding claim 11, the limitations of parent claim 10 are disclosed as explained above. Chen further discloses the method wherein the performance metric is the validation loss, and the predetermined value is 0.6 or less ([0037]: early termination based on a particular fraction of hold-out data; the particular threshold to calculate the stopping point would be arrived at through routine optimization and experimentation with machine-learning parameters well-known to those with ordinary skill in the art).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALVIN ISKENDER whose telephone number is (703)756-4565. The examiner can normally be reached M-F. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, HAI PHAN, can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ALVIN ISKENDER/
Examiner, Art Unit 2654

/HAI PHAN/
Supervisory Patent Examiner, Art Unit 2654

Prosecution Timeline

Feb 24, 2023
Application Filed
Dec 14, 2024
Non-Final Rejection — §103
Mar 26, 2025
Response Filed
Mar 26, 2025
Applicant Interview (Telephonic)
Mar 27, 2025
Examiner Interview Summary
Jun 28, 2025
Final Rejection — §103
Oct 30, 2025
Applicant Interview (Telephonic)
Nov 01, 2025
Examiner Interview Summary
Nov 04, 2025
Request for Continued Examination
Nov 13, 2025
Response after Non-Final Action
Jan 10, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12562244
COMBINING DOMAIN-SPECIFIC ONTOLOGIES FOR LANGUAGE PROCESSING
2y 5m to grant • Granted Feb 24, 2026
Patent 12531078
NOISE SUPPRESSION FOR SPEECH ENHANCEMENT
2y 5m to grant • Granted Jan 20, 2026
Patent 12505825
SPONTANEOUS TEXT TO SPEECH (TTS) SYNTHESIS
2y 5m to grant • Granted Dec 23, 2025
Patent 12456457
ALL DEEP LEARNING MINIMUM VARIANCE DISTORTIONLESS RESPONSE BEAMFORMER FOR SPEECH SEPARATION AND ENHANCEMENT
2y 5m to grant • Granted Oct 28, 2025
Patent 12407783
DOUBLE-MICROPHONE ARRAY ECHO ELIMINATING METHOD, DEVICE AND ELECTRONIC EQUIPMENT
2y 5m to grant • Granted Sep 02, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
48%
Grant Probability
99%
With Interview (+60.3%)
3y 4m
Median Time to Grant
High
PTA Risk
Based on 25 resolved cases by this examiner. Grant probability derived from career allow rate.
