Prosecution Insights
Last updated: April 19, 2026
Application No. 18/797,760

MULTILINGUAL SPEECH SYNTHESIS AND CROSS-LANGUAGE VOICE CLONING

Non-Final OA (§101, §103)
Filed: Aug 08, 2024
Examiner: PATEL, SHREYANS A
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 1 (Non-Final)
Grant Probability: 89% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 3m
Grant Probability with Interview: 96%

Examiner Intelligence

Career Allow Rate: 89% (above average; 359 granted / 403 resolved; +27.1% vs TC avg)
Interview Lift: +7.4% (moderate), measured across resolved cases with an interview
Typical Timeline: 2y 3m average prosecution; 46 applications currently pending
Career History: 449 total applications across all art units
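The headline figures above are simple ratios over the examiner's resolved cases. Here is a minimal sketch of one plausible way the numbers relate, using the counts shown on this page; the function names are illustrative, not any vendor API, and the dashboard may compute the interview figure differently.

```python
# Sketch of the examiner metrics above, derived from raw disposition counts.
# Counts mirror the figures shown on this page; names are illustrative only.

def allow_rate(granted: int, resolved: int) -> float:
    """Career allow rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved

base = allow_rate(granted=359, resolved=403)  # -> 89.1%
lift = 7.4                                    # reported lift, percentage points
with_interview = base + lift                  # -> 96.5%; the page reports 96%

print(f"Career allow rate: {base:.1f}%")
print(f"With interview:    {with_interview:.1f}%")
```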

Statute-Specific Performance

§101: 21.3% (-18.7% vs TC avg)
§103: 36.0% (-4.0% vs TC avg)
§102: 22.6% (-17.4% vs TC avg)
§112: 8.8% (-31.2% vs TC avg)
Tech Center averages are estimates. Based on career data from 403 resolved cases.
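Each row reads as an examiner rate plus a delta against the Tech Center average, so the TC average itself can be recovered by subtracting the delta. A small sketch of that comparison, using the values from the table above:

```python
# Statute-level comparison from the table above: (examiner rate, delta vs.
# Tech Center average), both in percent. TC average = rate - delta.

statute_stats = {
    "§101": (21.3, -18.7),
    "§103": (36.0, -4.0),
    "§102": (22.6, -17.4),
    "§112": (8.8, -31.2),
}

for statute, (rate, delta) in statute_stats.items():
    tc_avg = rate - delta  # e.g. §101: 21.3 - (-18.7) = 40.0
    print(f"{statute}: examiner {rate:.1f}% vs TC avg {tc_avg:.1f}% ({delta:+.1f})")
```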

Office Action

Rejections: §101, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: "Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title."

Claims 1-20 are rejected under 35 U.S.C. 101. Claims 1 and 11 are directed to an abstract idea: translating spoken language from one language to another and converting it to speech, a fundamental linguistic process that humans perform naturally. The steps of receiving speech, recognizing it, translating it, and outputting synthesized speech in another language represent methods of organizing human activity and mental processes, which are abstract. The recitation of generic computer components (a speech recognizer, a translator, and a TTS model) does not transform the nature of the claim, as these are simply computer-implemented analogs to conventional human translation tasks.

The claim limitations, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application in a meaningful way. The claim does not recite any particular improvement to the functioning of a computer or to speech processing technology itself; nothing is claimed as a specific technical improvement in how speech recognition, translation accuracy, or TTS synthesis is performed. The claim simply combines known technologies to achieve a result (translated synthesized speech) that is itself the abstract idea.

The claim also lacks an inventive concept sufficient to transform the abstract idea into patent-eligible subject matter. The speech recognizer, translator, and TTS model are generic functional recitations reflecting conventional components well known in the art. The claims do not include additional elements sufficient to amount to significantly more than the judicial exception because they are (i) mere instructions to implement the idea on a computer, and/or (ii) recitations of generic computer structure performing generic computer functions that are well-understood, routine, and conventional activities previously known to the pertinent industry. Viewed as a whole, these additional elements do not provide meaningful limitations transforming the abstract idea into a patent-eligible application such that the claims amount to significantly more than the abstract idea itself. Therefore, the claims are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter. There is further no improvement to the computing device.

Dependent claims 2-10 and 12-20 further recite an abstract idea performable by a human and do not amount to significantly more than the abstract idea, as they provide no steps beyond what is conventionally known in speech processing:
- Claims 2 and 12: no specific technical improvement.
- Claims 3 and 13: a mathematical model without a specific technological improvement.
- Claims 4 and 14: not integrated into a practical application.
- Claims 5 and 15: converting between data representations and outputting results without a particular application.
- Claims 6 and 16: no inventive concept.
- Claims 7 and 17: no inventive concept.
- Claims 8 and 18: no inventive concept.
- Claims 9 and 19: no inventive concept nor a technological improvement.
- Claims 10 and 20: no inventive concept.
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action: "A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made."

Claims 1, 3-4, 6-7, 10-11, 13-14, 16-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Quidilig et al. (WO 2010129056) in view of Wang et al. ("Tacotron: Towards End-to-End Speech Synthesis"; Apr. 6, 2017).

Claims 1 and 11: Quidilig teaches a computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising ([0028]: speech processing system 200 may be implemented as dedicated processing hardware or as software executable on a general-purpose processor): receiving a spoken input comprising an utterance spoken in a first language, the spoken input comprising a phrase and an instruction to synthesize the phrase into speech in a second language different than the first language ([0046]: the user 10 speaks ("Sample Speech 1") the following: "send email to John at domain dot com subject line test only email message hi John comma new line test only period question mark exclamation mark translate to Spanish send now"); processing, using a speech recognizer, the spoken input to convert the spoken input into corresponding text in the first language ([0049]: the speech processing system 200 receives and processes the input audio stream by invoking the speech to text function 122; each audio segment of Sample Speech 1 is then converted into corresponding text by the speech to text function 122); and processing, using a translator, the corresponding text in the first language to transliterate the corresponding text into translated text that recites the phrase in the second language ([0063-0064]: the phrase "translate to" corresponds to the language translation operation 130; the Optional Function Parameter is "Spanish"; accordingly, in the present example, the Subject Line, the Message Text, or both are translated to Spanish, and the translated text, in Spanish, is then sent via email to the recipient).
The difference between the prior art and the claimed invention is that Quidilig does not explicitly teach processing, using a text-to-speech (TTS) model configured to receive the translated text that recites the phrase in the second language as input, the translated text to generate an output audio feature representation as output from the TTS model, the output audio feature representation representing synthesized speech of the translated text that recites the phrase in the second language.

Wang teaches processing, using a TTS model so configured ([Fig. 1], [3. Model Architecture]: the model takes characters as input and outputs the corresponding raw spectrogram, which is then fed to the Griffin-Lim reconstruction algorithm to synthesize speech; at a high level, the model takes characters as input and produces spectrogram frames, which are then converted to waveforms; an 80-band mel spectrogram is used as the target).

Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the speech processing and speech-to-text system and method taught by Quidilig to include a TTS model that generates an output audio feature representation representing synthesized speech of the translated text, as taught by Wang, for the benefit of generating speech at the frame level, which is faster than sample-level autoregressive methods (Wang [Abstract]).

Claims 3 and 13: Wang further teaches the computer-implemented method of claim 1, wherein processing the translated text that recites the phrase in the second language to generate the output audio feature representation as output from the TTS model comprises, for each of a plurality of time steps: processing, using an encoder neural network, a respective portion of translated text for the time step to generate a corresponding text encoding for the time step ([3.2 Encoder]: the encoder extracts robust sequential representations of text; the input to the encoder is a character sequence, where each character is represented as a one-hot vector and embedded into a continuous vector; a CBHG module transforms the pre-net outputs into the final encoder representation used by the attention module); and processing, using a decoder neural network, the text encoding for the time step to generate a corresponding output audio feature representation for the time step ([3.3 Decoder]: a stateful recurrent layer produces the attention query at each decoder time step; an 80-band mel-scale spectrogram is used as the target; the first decoder step is conditioned on an all-zero <GO> frame, and in inference, at decoder step t, the last frame of the r predictions is fed as input to the decoder at step t+1).

Claims 4 and 14: Wang further teaches the computer-implemented method of claim 1, wherein the output audio feature representation comprises mel-frequency spectrograms ([3.3 Decoder]: mel-scale spectrogram).

Claims 6 and 16: Wang further teaches the computer-implemented method of claim 1, wherein the translated text corresponds to a character input representation ([3.2 Encoder]: the input to the encoder is a character sequence).
Claims 7 and 17: Wang further teaches the computer-implemented method of claim 1, wherein the translated text corresponds to a phoneme input representation ([2.]: the model is trained on phoneme inputs).

Claims 10 and 20: Quidilig further teaches the computer-implemented method of claim 1, wherein the first language comprises English and the second language comprises French ([0046]: translate first-language text (English) into second-language text (Spanish or another language); the particular language pair is purely a design choice).

Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Quidilig et al. (WO 2010129056) in view of Wang et al. ("Tacotron: Towards End-to-End Speech Synthesis"; Apr. 6, 2017) and further in view of Jia et al. ("Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis"; 2018).

Claims 2 and 12: Quidilig and Wang teach all the limitations of claim 1. The difference between the prior art and the claimed invention is that neither Quidilig nor Wang explicitly teaches obtaining a speaker embedding specifying specific voice characteristics of a target speaker for cloning a voice of the target speaker in synthesized speech, wherein processing, using the TTS model, the translated text further comprises processing, using the TTS model configured to receive the speaker embedding and the translated text that recites the phrase in the second language as input, the speaker embedding and the translated text to generate the output audio feature representation as output from the TTS model, the output feature representation representing the synthesized speech of the translated text that recites the phrase in the second language and that clones the voice of the target speaker.

Jia teaches obtaining a speaker embedding specifying specific voice characteristics of a target speaker for cloning a voice of the target speaker in synthesized speech ([Abstract], [Introduction]: a fixed-dimensional embedding vector is generated from only seconds of reference speech from a target speaker; a few seconds of untranscribed reference audio from a target speaker is used to synthesize new speech in that speaker's voice; a speaker-discriminative embedding network captures the space of speaker characteristics), and conditioning the TTS model on the speaker embedding and the translated text to generate the output audio feature representation ([Abstract], [2.], [2.2]: a mel spectrogram is generated from text, conditioned on the speaker embedding; an embedding vector for the target speaker is concatenated with the synthesizer encoder output at each time step; a mel spectrogram is predicted from a sequence of grapheme or phoneme inputs, conditioned on the speaker embedding vector).

Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Quidilig and Wang with the teachings of Jia to obtain a speaker embedding and condition the TTS model on it when generating the output audio feature representation, for the benefit of a model whose speaker embeddings can synthesize speech in the voices of novel speakers dissimilar from those used in training, indicating that it has learned a high-quality speaker representation (Jia [Abstract]).
Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Quidilig et al. (WO 2010129056) in view of Wang et al. ("Tacotron: Towards End-to-End Speech Synthesis"; Apr. 6, 2017) and further in view of Tachibana et al. (US 2016/0012035).

Claims 5 and 15: Wang teaches the computer-implemented method of claim 1, wherein the operations further comprise inverting, using a waveform synthesizer, the output audio feature representation into a time-domain waveform ([3.4]: the Griffin-Lim algorithm (Griffin & Lim, 1984) is used to synthesize a waveform from the predicted spectrogram). The difference between the prior art and the claimed invention is that neither Quidilig nor Wang explicitly teaches generating, using the time-domain waveform, a synthesized speech representation of the translated text that clones the voice of the target speaker in the second language.

Tachibana teaches generating, using the time-domain waveform, a synthesized speech representation of the translated text that clones the voice of the target speaker in the second language ([Abstract], [0012], [0033]: filtering is carried out using parameters of a spectral envelope representing vocal tract characteristics or the like to generate a speech waveform (Gaussian distribution at time T); second-language synthesis for a specific/target speaker and selection of output that sounds like the target speaker (selecting a speech synthesis dictionary of the specific speaker in the first language and a speech synthesis dictionary of the specific speaker in the second language)).

Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Quidilig and Wang with the teachings of Tachibana to include generating, using the time-domain waveform, a synthesized speech representation of the translated text that clones the voice of the target speaker in the second language, for the benefit of improving the quality of synthetic speech (Tachibana [0004]).
Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Quidilig et al. (WO 2010129056) in view of Wang et al. ("Tacotron: Towards End-to-End Speech Synthesis"; Apr. 6, 2017) and further in view of Luo et al. (CN 105702130).

Claims 8 and 18: Quidilig and Wang teach all the limitations of claim 1. The difference between the prior art and the claimed invention is that neither Quidilig nor Wang explicitly teaches wherein the translated text corresponds to an 8-bit Unicode Transformation Format (UTF-8) encoding sequence. Luo teaches this limitation ([Invention Contents]: the user's character code is identified as UTF-8, and the source and target language types for translation are appointed). Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Quidilig and Wang with the teachings of Luo to include the translated text corresponding to a UTF-8 encoding sequence, for the benefit of realizing a simple alternative means of communication between deaf and hearing persons (Luo [Invention Contents]).
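The claims 8 and 18 limitation is narrow: the translated text is represented as a UTF-8 byte (encoding) sequence, which a byte-level front end could consume directly instead of characters or phonemes. A tiny illustration, with an arbitrary example string:

```python
# Tiny illustration of the claims 8 and 18 limitation: translated text
# represented as a UTF-8 byte sequence; accented characters take two bytes.

translated = "¿Dónde está la biblioteca?"   # arbitrary Spanish example
utf8_sequence = translated.encode("utf-8")  # bytes object
print(list(utf8_sequence[:8]))              # [194, 191, 68, 195, 179, 110, 100, 101]
```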
Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Quidilig et al. (WO 2010129056) in view of Wang et al. ("Tacotron: Towards End-to-End Speech Synthesis"; Apr. 6, 2017) and further in view of Micheal et al. (JP 2007072927).

Claims 9 and 19: Quidilig and Wang teach all the limitations of claim 1. The difference between the prior art and the claimed invention is that neither Quidilig nor Wang explicitly teaches a first language training set comprising a plurality of utterances spoken in the first language and corresponding reference text, and a second language training set comprising a plurality of utterances spoken in the second language and corresponding reference text. Micheal teaches these limitations ([Tech Solution]: a first set of sentences in the first language; a second set of sentences in the second language). Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Quidilig and Wang with the teachings of Micheal to include the first- and second-language training sets, for the benefit of classifying a sentence as a well or poorly translated sentence (Micheal [Tech Solution]).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Zhang et al. (US 11,580,952): a method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL, whose telephone number is (571) 270-0689. The examiner can normally be reached Monday-Friday, 8am-5pm PST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre Desir, can be reached at 571-272-7799. The fax number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHREYANS A PATEL/
Primary Examiner, Art Unit 2659

Prosecution Timeline

Aug 08, 2024
Application Filed
Feb 18, 2026
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586597: ENHANCED AUDIO FILE GENERATOR
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12586561: TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM, AND A METHOD OF CALCULATING AN EXPRESSIVITY SCORE
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12548549: ON-DEVICE PERSONALIZATION OF SPEECH SYNTHESIS FOR TRAINING OF SPEECH RECOGNITION MODEL(S)
Granted Feb 10, 2026 (2y 5m to grant)

Patent 12548583: ACOUSTIC CONTROL APPARATUS, STORAGE MEDIUM AND ACCOUSTIC CONTROL METHOD
Granted Feb 10, 2026 (2y 5m to grant)

Patent 12536988: SPEECH SYNTHESIS METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM
Granted Jan 27, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 89%
Grant Probability with Interview: 96% (+7.4%)
Median Time to Grant: 2y 3m
PTA Risk: Low

Based on 403 resolved cases by this examiner. Grant probability is derived from the career allow rate.
