DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 2/04/2026. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Response to Amendment
Claims 1 and 10 are amended. Claims 5-9 are cancelled and claims 12-15 were previously cancelled. Claims 16-23 are newly added. As such, claims 1-4, 10-11, and 16-23 are presented for examination.
Response to Arguments
Rejection under 35 U.S.C. 101
Applicant’s arguments have been fully considered and are persuasive. The amended independent claims recite generating embeddings of an initial phoneme sequence, generating a first phoneme sequence by inserting a feature representation of an additional phoneme into the initial phoneme sequence, determining the additional phoneme using a model trained on raw spontaneous speech and labeled phonemes, generating a second phoneme sequence using a duration determining module that classifies phonemes according to duration, and decoding the second phoneme sequence to generate spontaneous speech. Thus, the claim recites additional elements, such as a specific model trained on spontaneous speech and labeled phonemes, that provide an improvement to speech synthesis for spontaneous speech.
Rejection under 35 U.S.C. 103
Applicant’s arguments have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 4, 10-11, 16-21, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Joly et al. (US 11978431 B1; hereinafter referred to as Joly) in view of Cong et al. (Cong, Jian, et al. "Controllable context-aware conversational speech synthesis." arXiv preprint arXiv:2106.10828 (2021); hereinafter referred to as Cong) and Van Der Ploeg et al. (US 20230351990 A1; hereinafter referred to as Van Der Ploeg).
Regarding claim 1, Joly teaches: a computer-implemented method, comprising: generating, via an embedder ([col 8, lines 41-43] a phoneme encoder 304 that processes input data 302 to determine phoneme embedding data 306), embeddings of an initial phoneme sequence in vector form corresponding to text ([col 8, lines 43-47] The phoneme encoder 304 may be a neural network and may process the text input data 302, which may be a sequence of phonemes representing text, to determine the phoneme embedding data 306, which may be a vector of N values that represents the sequence), the initial phoneme sequence comprising feature representations of a plurality of phonemes… ([col 8, lines 55-59] The phoneme embedding data 306 may thus correspond to a point in an embedding space corresponding to the text input data 302, wherein the embedding space is an N-dimensional space representing all possible words, sentences, paragraphs, chapters, or books);
generating, based on the first phoneme sequence, a second phoneme sequence ([col 10, lines 5-15] The phoneme duration predictor 550 may thus determine, for a given item of feature embedding data 526, how many items of phoneme data (and/or parts of phonemes) should be modified using the item of feature embedding data 526. For example, if a given item of duration data corresponds to duration of “5,” the upsampling encoder 552 may upsample (e.g., duplicate) a corresponding item of feature embedding data 526 by a factor of 5 and apply the feature embedding data to five phonemes in the phoneme embedding data 506) by using a duration determining module comprising a routing module and multiple experts models…([col 10, lines 16-21] The phoneme duration predictor 550 may include one or more BiLSTM layer(s) that may process the phoneme embedding data 506, and one or more CNN layer(s) that may process the output of the BiLSTM layer(s). One or more LSTM layer(s) may process the output(s) of the CNN layer(s) to determine the duration data 554);
and decoding the second phoneme sequence ([col 9, lines 6-12] A speech decoder 310 may process both the phoneme embedding data 306 and the feature embedding data 312a . . . 312n to determine audio output data 314. The audio output data 314 may include a representation of synthesized speech that corresponds to words in the text input data 302 as well as prosody represented in the feature embedding data 312a . . . 312n) to generate spontaneous-style speech corresponding to the text ([col 2, lines 18-22] Synthesized speech may appear to a human listener to be “flat” or “robotic”; by predicting the audio properties, and by then introducing appropriate variations in the audio properties, the synthesized speech may appear more natural to the human listener).
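For illustration only, and not as part of the record, the duration-driven upsampling quoted from Joly above (duplicating an item of feature embedding data by its predicted duration, e.g., by a factor of 5 for a duration of "5") can be sketched as follows; the function name and example values are hypothetical.

```python
# Illustrative sketch of duration-based upsampling in the manner quoted
# from Joly: each feature embedding is duplicated by its predicted
# duration so it can be applied across the corresponding phonemes.

def upsample_embeddings(feature_embeddings, durations):
    """Duplicate each embedding by its duration; a duration of 5
    repeats the corresponding embedding five times."""
    upsampled = []
    for embedding, duration in zip(feature_embeddings, durations):
        upsampled.extend([embedding] * duration)
    return upsampled

# Example: one embedding with duration 3, another with duration 2.
print(upsample_embeddings([[0.1, 0.2], [0.3, 0.4]], [3, 2]))
# → [[0.1, 0.2], [0.1, 0.2], [0.1, 0.2], [0.3, 0.4], [0.3, 0.4]]
```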
Joly does not explicitly teach, but Cong discloses: generating a first phoneme sequence by inserting the feature representation of an additional phoneme into the initial phoneme sequence ([2.1] It is worth noticing that the filled-pause here does not come from a normal rhythm change in fluent reading-style speech, and it is a spontaneous event may be inserted anywhere in an utterance. Together with the input phones, tones, and prosody labels, these phoneme-level linguistic features are utilized to predict the corresponding acoustic targets. We simply replicate them to form phoneme-level representations according to the character pronunciations), the additional phoneme being related to a characteristic of spontaneous speech ([2.1] We mainly focus on two common acoustic spontaneous behaviors–prolongation and filled-pause), and the additional phoneme being generated by a model trained using data that includes a sequence of phonemes ([3] The manually-labelled spontaneous behaviors are used in training, which are not readily unavailable during inference for new conversation. We need to determine the location and type of spontaneous behaviors in the text automatically. To this end, we design an individual predictor to predict the spontaneous behaviors directly from text features) determined from raw speech in a spontaneous style and labels identifying the additional phonemes ([4.1] The total of 486 conversations are from two female speakers, where each conversation includes 10-20 rounds on a specific topic. It contains about 7 hour speech at 16kHz, where there are about 3,218 manually-labelled spontaneous behaviors (prolongation and filled-pause). We reserve 10 complete conversations, a total of 64 conversation pairs with 128 utterance as a test set. This configuration is applied to both the acoustic model and the spontaneous behavior predictor);
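For illustration only, the insertion step mapped to Cong above (placing a spontaneous-behavior phoneme, such as a filled-pause, into the initial phoneme sequence at a predicted position) can be sketched as follows; the token "FP" and the function name are hypothetical placeholders, not claim language or quoted disclosure.

```python
# Illustrative sketch: inserting an additional phoneme (e.g., a
# filled-pause token "FP") into an initial phoneme sequence at a
# position predicted from the text, in the manner described in the
# passages quoted from Cong.

def insert_additional_phoneme(phonemes, position, additional="FP"):
    """Return a new sequence with the additional phoneme inserted
    at the given position; the original sequence is not modified."""
    return phonemes[:position] + [additional] + phonemes[position:]

# Example: insert a filled-pause into the middle of a short sequence.
print(insert_additional_phoneme(["HH", "AH", "L", "OW"], 2))
# → ['HH', 'AH', 'FP', 'L', 'OW']
```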
Joly and Cong are considered analogous in the field of speech analysis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Joly to combine the teachings of Cong because doing so would allow for the use of a model trained on labeled spontaneous speech data in order to determine different positions and types of spontaneous speech in an input, leading to improved spontaneous speech synthesis (Cong [5] we utilize an acoustic context encoder to model the entrainment in conversation and a BERT encoder to extract semantic information from texts, which significantly improve the performance of conversational speech synthesis. To achieve flexible control during inference, we also propose a behavior predictor to determine the positions and types of spontaneous behaviors from, which can be used to control the degree of fluency in the synthesized speech).
The combination of Joly and Cong does not specifically teach, but Van Der Ploeg teaches: wherein the routing module classifies a phoneme among the plurality of phonemes and the additional phoneme into a category related to a duration length of the phoneme ([0063] the spoken length of phonemes may be determined and/or categorized according to their position in a larger syntactic unit (e.g., a word or sentence), their part of speech, or their meaning. In some examples, a dictionary-like reference may provide a phoneme length for specific phonemes and degrees of accent. For example, some phonemes may be categorized as having a phoneme length of less than 0.1 seconds, less than 0.2 seconds, less than 0.3 seconds, less than 0.4 seconds, or less than 1.0 seconds. Similarly, some pauses may be categorized according to their length during natural spoken speech), and wherein an expert model corresponding to the category among the multiple expert models is selected to predict the duration of the phoneme… ([0097] The length of the pauses and the phonemes represented in the text input may be determined with the help of open source software or other sources of information regarding the prosodic, syntactic, and semantic features of the text or voice. The process may involve a lookup table that synthesizes duration information about phonemes and pauses between syllables, words, sentences, and other units from other sources which describe normal speech. The process may also involve a neural sequence-to-sequence model, for instance, a transformer or LSTM trained to predict sequences of durations and pauses from sequences of words (which may be mapped to an embedding space)).
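For illustration only, the routing-and-expert arrangement in the claim language mapped above (classifying a phoneme into a duration-length category, then selecting the expert model corresponding to that category to predict its duration) can be sketched as follows; the category thresholds, expert functions, and names are hypothetical and drawn loosely from the duration categories quoted from Van Der Ploeg [0063], not from any actual implementation of record.

```python
# Illustrative sketch of a routing module that classifies a phoneme into
# a duration-length category and dispatches it to the expert model
# corresponding to that category. Thresholds and experts are placeholders.

def route_category(rough_length_seconds):
    """Classify into coarse duration categories (thresholds illustrative)."""
    if rough_length_seconds < 0.1:
        return "short"
    if rough_length_seconds < 0.3:
        return "medium"
    return "long"

# One hypothetical expert per category; each predicts a refined duration.
EXPERTS = {
    "short": lambda phoneme: 0.05,
    "medium": lambda phoneme: 0.2,
    "long": lambda phoneme: 0.5,
}

def predict_duration(phoneme, rough_length_seconds):
    category = route_category(rough_length_seconds)  # routing step
    expert = EXPERTS[category]                       # expert selection
    return expert(phoneme)

# Example: a phoneme with a rough length of 0.25 s routes to "medium".
print(predict_duration("AH", 0.25))
# → 0.2
```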
Joly, Cong, and Van Der Ploeg are considered analogous in the field of speech analysis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Joly and Cong to combine the teachings of Van Der Ploeg because doing so would allow for different phonemes to be categorized based on their duration and used for synthesizing audio that facilitates comprehension, leading to clearer and more natural voice synthesis (Van Der Ploeg [0067] the plurality of spoken pause lengths and the plurality of spoken phoneme lengths applied in steps 210 and 212, respectively, may be determined with reference to one or more parameters. Those parameters may include optimal breaks between sentences, optimal tempo, optimal time signature, optimal pitch range, and optimal length of phonemes, where optimality is measured with respect to facilitating comprehension and/or recollection).
Regarding claim 2, the combination of Joly, Cong, and Van Der Ploeg teaches: the method of claim 1. Van Der Ploeg further teaches: wherein generating the second phoneme sequence based on the first phoneme sequence comprises: determining a category of the phoneme among the plurality of phonemes and the additional phoneme ([0063] the spoken length of phonemes may be determined and/or categorized according to their position in a larger syntactic unit (e.g., a word or sentence), their part of speech, or their meaning. In some examples, a dictionary-like reference may provide a phoneme length for specific phonemes and degrees of accent. For example, some phonemes may be categorized as having a phoneme length of less than 0.1 seconds, less than 0.2 seconds, less than 0.3 seconds, less than 0.4 seconds, or less than 1.0 seconds. Similarly, some pauses may be categorized according to their length during natural spoken speech);
and predicting the duration of the phoneme by using an expert model of multiple expert models that corresponds to the category ([0097] The length of the pauses and the phonemes represented in the text input may be determined with the help of open source software or other sources of information regarding the prosodic, syntactic, and semantic features of the text or voice. The process may involve a lookup table that synthesizes duration information about phonemes and pauses between syllables, words, sentences, and other units from other sources which describe normal speech. The process may also involve a neural sequence-to-sequence model, for instance, a transformer or LSTM trained to predict sequences of durations and pauses from sequences of words (which may be mapped to an embedding space)).
Regarding claim 4, the combination of Joly, Cong, and Van Der Ploeg teaches: the method of claim 1. Cong further teaches: wherein the additional phoneme comprises at least one of: a phoneme indicating a pause; a phoneme indicating a repetition; and a phoneme indicating an idiom ([2.1] We mainly focus on two common acoustic spontaneous behaviors–prolongation and filled-pause. Specially, we define two linguistic features for these behaviors named prolongation and filled-pause in the text-side, where pronunciation prolongation may occur at the end of a character and filled-pause may come right after the character).
Regarding claim 10, Joly teaches: an electronic device, comprising: a processing unit; and a memory coupled to the processing unit and comprising instructions stored thereon which, when executed by the processing unit, cause the device to perform acts comprising… ([col 12, lines 47-52] Each of these devices/systems (110/120) may include one or more controllers/processors (804/904), which may each include a central processing unit (CPU) for processing data and computer-readable instructions, and a memory (806/906) for storing data and instructions of the respective device). The rest of the claim recites similar limitations as claim 1 and therefore is rejected similarly.
Regarding claim 11, it recites similar limitations as claim 2 and therefore is rejected similarly.
Regarding claim 16, the combination of Joly, Cong, and Van Der Ploeg teaches: the method of claim 1. Joly further teaches: wherein decoding the second phoneme sequence comprises generating a mel-spectrogram ([col 4, lines 26-27] The output of the decoder may be a mel-spectrogram).
Regarding claim 17, the combination of Joly, Cong, and Van Der Ploeg teaches: the method of claim 16. Joly further teaches: wherein the mel-spectrogram is converted to the spontaneous-style speech ([col 4, lines 27-29] a vocoder may process the mel-spectrogram to determine time-domain audio data representing the speech).
Regarding claim 18, the combination of Joly, Cong, and Van Der Ploeg teaches: the method of claim 1. Joly further teaches: wherein the spontaneous-style speech is provided to an output device ([col 13, lines 23-26] the device 110 may include input/output device interfaces 802 that connect to a variety of components such as an audio output component (e.g., a microphone 1004 or a loudspeaker 1006)).
Regarding claim 19, the combination of Joly, Cong, and Van Der Ploeg teaches: the method of claim 18. Joly further teaches: wherein the output device comprises a loudspeaker ([col 13, lines 23-26] the device 110 may include input/output device interfaces 802 that connect to a variety of components such as an audio output component (e.g., a microphone 1004 or a loudspeaker 1006)).
Regarding claim 20, Joly teaches: a non-transitory, machine-readable medium, comprising instructions, which when performed by a processor of a machine, cause the processor to perform operations… ([col 13, lines 7-9] A device's computer instructions may be stored in a non-transitory manner in non-volatile memory (806/906), storage (808/908), or an external device(s)). The rest of the claim recites similar limitations as claim 1 and therefore is rejected similarly.
Regarding claim 21, it recites similar limitations as claim 2 and therefore is rejected similarly.
Regarding claim 23, it recites similar limitations as claim 4 and therefore is rejected similarly.
Claims 3 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Joly in view of Cong and Van Der Ploeg, as applied to claims 1-2, 4, 10-11, 16-21, and 23 above, and further in view of Arik et al. (US 20180336880 A1; hereinafter referred to as Arik).
Regarding claim 3, the combination of Joly, Cong, and Van Der Ploeg teaches: the method of claim 1. The combination of Joly, Cong, and Van Der Ploeg does not explicitly teach, but Arik teaches: wherein determining the spontaneous style speech corresponding to the text based on the second phoneme sequence comprises: generating a third phoneme sequence by updating the second phoneme sequence based on a speech characteristic of a target speaker ([0106] A trained frequency model (e.g., 325 in FIG. 3) receive as inputs the phonemes, the speaker identifier, and the phoneme durations and outputs (1615) frequency profiles for the phonemes relative to the speaker identifier. Also see Fig. 1.);
and determining, based on the third phoneme sequence, the spontaneous-style speech corresponding to both of the text and the target speaker ([0106] a trained vocal model (e.g., 355 in FIG. 3) receives as input the speaker identifier, the phonemes, the phoneme durations, the frequency profiles for the phonemes (e.g., a frequency profile for a phoneme is the fundamental frequency profile and a probability of whether it is voiced) to synthesize (1620) a signal representing synthesized speech of the written text 210, in which the synthesized audio has audio characteristics corresponding to the speaker identity).
Joly, Cong, Van Der Ploeg, and Arik are considered analogous in the field of speech analysis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Joly, Cong, and Van Der Ploeg to combine the teachings of Arik because doing so would allow for improved spontaneous voice synthesis by using phoneme durations and speaker embeddings to generate more natural speech (Arik [0105] the trained vocal model 355 receives the phonemes 235 of the input text 210, the phoneme durations 260 from the trained duration model 340, the frequency profiles 230 from the trained frequency model 325, and an input speaker embedding 305 identifying the speaker and outputs a signal representing synthesized human speech of the input text 210 that has audio characteristics corresponding to the speaker identity).
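For illustration only, the claim-3 limitation mapped to Arik above (updating the second phoneme sequence based on a speech characteristic of a target speaker to produce a third sequence) can be sketched as follows; the per-dimension scaling by a speaker embedding is a hypothetical stand-in for speaker conditioning, not Arik's actual frequency or vocal model.

```python
# Illustrative sketch: conditioning a phoneme feature sequence on a
# target speaker's characteristic. Here the "speaker embedding" simply
# scales each feature dimension, as a placeholder for real speaker
# conditioning such as that described in Arik.

def apply_speaker(sequence, speaker_embedding):
    """Return a new sequence with each frame scaled element-wise by
    the speaker embedding."""
    return [
        [value * scale for value, scale in zip(frame, speaker_embedding)]
        for frame in sequence
    ]

# Example: two 2-dimensional frames, one hypothetical speaker embedding.
print(apply_speaker([[1.0, 2.0], [3.0, 4.0]], [0.5, 2.0]))
# → [[0.5, 4.0], [1.5, 8.0]]
```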
Regarding claim 22, it recites similar limitations as claim 3 and therefore is rejected similarly.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Nathan Tengbumroong whose telephone number is (703)756-1725. The examiner can normally be reached Monday - Friday, 11:30 am - 8:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan, can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NATHAN TENGBUMROONG/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654