Last updated: May 29, 2026

Application No. 18/708,348

METHOD FOR PROVIDING VOICE SYNTHESIS SERVICE AND SYSTEM THEREFOR

Non-Final OA §102 §103 §112

Filed: May 08, 2024
Priority: Nov 09, 2021, RE 10-2021-0153451 (+1 more)
Examiner: CHAVEZ, RODRIGO A
Art Unit: 2658
Tech Center: 2600 (Communications)
Assignee: LG Electronics Inc.
OA Round: 1 (Non-Final)

This examiner grants 51% of resolved cases, with a +38.6% interview lift. A telephonic interview to clarify the technical implementation could significantly improve the outcome.

Based on 233 resolved cases, 2023–2026

Examiner Intelligence

CHAVEZ, RODRIGO A

Career Allowance Rate: 51% of resolved cases (119 granted / 233 resolved), -10.9% vs TC avg
Interview Lift: +38.6% (resolved cases with an interview vs. without)

Typical timeline: 3y 3m avg prosecution; 15 currently pending
Career history: 252 total applications across all art units

Statute-Specific Performance

§101: 3.8% (-36.2% vs TC avg)
§103: 84.7% (+44.7% vs TC avg)
§102: 9.6% (-30.4% vs TC avg)
§112: 0.7% (-39.3% vs TC avg)

Based on career data from 233 resolved cases

Office Action

§102 §103 §112

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 05/08/2024 was filed.  The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections
Claim 1 is objected to because of the following informalities: the limitation “being inputted second text” is not grammatically complete. It appears to be missing the word “when” at the beginning, as written in similar claim 9.  Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 5, 7, 8 and 14-15 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

Claim 5 recites the limitation "the voice synthesis model" for which independent claim 1 recites two different “a voice synthesis model”.  There is insufficient antecedent basis for this limitation in the claim because it is unclear which voice synthesis model is being referred to.

Claim 7 recites the limitation "the generated voice synthesis model" for which independent claim 1 recites two different generations of “a voice synthesis model”.  There is insufficient antecedent basis for this limitation in the claim because it is unclear which “voice synthesis model” is being referred to.

	Claim 8 depends on claim 7 and therefore it is rejected similarly based on its dependency.

Claim 14 recites the limitation "the generated voice synthesis model" for which independent claim 9 recites two different generations of “a voice synthesis model”.  There is insufficient antecedent basis for this limitation in the claim because it is unclear which “voice synthesis model” is being referred to.

Claim 15 depends on claim 14 and therefore it is rejected similarly based on its dependency.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-5, 7-12, 14 and 15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Garman (US PG Pub 20210256961).

	As per claims 1 and 9, Garman discloses:	A method of providing voice synthesis service and an artificial intelligence-based synthesis service system, comprising:
	an artificial intelligence device (Garman; Fig. 10; p. 0070 - In the example shown in FIG. 10, exemplary memory contents are shown for both a training system (AI/computing device) and a speech synthesis system (AI/computing device). While in some cases, these may both be implemented in the same computer system, typically, they are implemented in different computer systems. For example, the training system may typically be implemented in a server system or systems, while the speech synthesis system may typically be implemented in an end user device, such as a smartphone, tablet, personal computer, etc); and
	a computing device configured to exchange data with the artificial intelligence device (Garman; Fig. 10; p. 0070 - In the example shown in FIG. 10, exemplary memory contents are shown for both a training system (AI/computing device) and a speech synthesis system (AI/computing device). While in some cases, these may both be implemented in the same computer system, typically, they are implemented in different computer systems. For example, the training system may typically be implemented in a server system or systems, while the speech synthesis system may typically be implemented in an end user device, such as a smartphone, tablet, personal computer, etc), wherein the computing device includes:
	a processor (Garman; Fig. 10, items 1002A-1002N; p. 0066 - Computer system 1000 may include one or more processors (CPUs) 1002A-1002N) configured to:
	receive sound source data for synthesizing a speaker's voice for a plurality of predefined first texts through a voice synthesis service platform that provides a development toolkit (Garman; Fig. 8, item 104; p. 0044 - the end-user can optionally add his or her voice to the inventory of speakers. To add a voice, the user may read and record a series of prompts in his or her native or fluent language, for example…; see also p. 0024 - Examples of such uses may include presenting voice instructions, reading texts of books, magazines, etc. aloud, etc…; see also p. 0062);
	learn tone conversion for the speaker's sound source data using a pre-generated tone conversion base model (Garman; p. 0041 - …When trained, DRNN 228 may be a single universal (base) speech model and may encode all the information necessary to produce speech for all of the trained languages and speakers. The system learns, in the sense of “deep learning”, the phonemes for the various languages, as well as the voice characteristics of each of the speakers. The resulting DRNN 228 may be a single model of speech that contains all of the phonemes and prosodic patterns (tone conversion) for each of the languages and the voice characteristics of each of the included speakers; see also p. 0036; see also p. 0037-0040 – Every utterance (speaker’s sound source data) used for training may be encoded using a feature extraction function 230…; see also p. 0027 - In embodiments, the language 208 for each utterance is presented to the DRNN via embedding, while all other layers are language-independent. This allows the ASR 212 to employ transfer learning from one language to the next, resulting in an ASR 212 that gets more robust with each language that is added. For example, training on 8 different languages may produce an accuracy of about 80% at frame level phoneme identification. In addition to an embedding for each language, there is an embedding for a “universal” language. This “universal” language may be trained with a small percentage of data from all languages, and is useful for doing recognition on an “unknown” language that is not already trained, such as a “new” language);
	generate a voice synthesis model for the speaker through learning the tone conversion (Garman; p. 0073 - Model training routines 1016 may include software routines to train the model using, for example, a sequence of embedded phonemes, prosodic values, language identifiers, and speaker identifiers as input, along with acoustic features, to generate trained model data 1018);
	when being inputted second text (Garman; p. 0045 - the inputs may include: the text to be spoken 302…);
	generate a voice synthesis model through voice synthesis inference based on the voice synthesis model for the speaker and the second text (Garman; p. 0052 - The inputs for the phonemes 316, accents 308, and speakers 306 may be fed into an Embedding layer to generated embeddings 320, 324, 326. The prosodic inputs may be fed into an Embedding Bag layer to generate embeddings 322. These inputs may include, but are not limited to, stress, tone, focus, syllable position, punctuation type, part-of-speech; generating the speaker embeddings (voice synthesis inference) using the prosodic inputs (tone conversion)); and
	generate a synthesized voice using the voice synthesis model (Garman; Fig. 3, item 314; p. 0053).

	As per claims 2 and 10, Garman discloses:	The method and system of claims 1 and 9, wherein the step of receiving sound source data for synthesizing the speaker's voice for the plurality of predefined first texts includes: receiving the speaker's sound source multiple times for each first text; and generating sound source data for synthesizing the speaker's voice based on the speaker's sound source input multiple times (Garman; Fig. 8, item 104; p. 0044 - the end-user can optionally add his or her voice to the inventory of speakers. To add a voice, the user may read and record a series of prompts in his or her native or fluent language, for example…; see also p. 0024 - Examples of such uses may include presenting voice instructions, reading texts of books, magazines, etc. aloud, etc…; see also p. 0062).

	As per claims 3 and 11, Garman discloses:	The method and system of claims 2 and 10, wherein the sound source data for voice synthesis of the speaker is an average value of the speaker's sound source input multiple times (Garman; p. 0032 - In embodiments, the variables of prosody used by the system may include, for example, pitch, duration, and loudness. For each of these, the prosodic value may be relative to an average or predictable value for that variable).
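
For illustration only, the "average value" limitation of claims 3 and 11 can be read as an element-wise mean over repeated recordings of the same first text. The sketch below is a minimal Python example under that reading; the function name `average_recordings` and the representation of each recording as an equal-length list of numeric features are assumptions made here and do not come from the application or from Garman.

```python
from statistics import fmean


def average_recordings(takes):
    """Return the element-wise mean of several equal-length feature vectors.

    Each item in `takes` stands for one repeated reading of the same text
    (hypothetical representation: a list of numeric features).
    """
    if not takes:
        raise ValueError("at least one recording is required")
    length = len(takes[0])
    if any(len(take) != length for take in takes):
        raise ValueError("all recordings must have the same number of features")
    # zip(*takes) groups the i-th feature of every take together.
    return [fmean(column) for column in zip(*takes)]


# Two hypothetical takes of the same prompt.
takes = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
averaged = average_recordings(takes)
```

Under this reading, the averaged vector (not any single take) would serve as the speaker's sound source data, which is narrower than a prosodic value that is merely expressed relative to an average.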

	As per claims 4 and 12, Garman discloses:	The method and system of claims 3 and 11, wherein the step of learning the tone conversion includes performing speaker transfer learning based on the tone conversion base model (Garman; p. 0027 - In embodiments, the language 208 for each utterance is presented to the DRNN via embedding, while all other layers are language-independent. This allows the ASR 212 to employ transfer learning from one language to the next, resulting in an ASR 212 that gets more robust with each language that is added. For example, training on 8 different languages may produce an accuracy of about 80% at frame level phoneme identification. In addition to an embedding for each language, there is an embedding for a “universal” language. This “universal” language may be trained with a small percentage of data from all languages, and is useful for doing recognition on an “unknown” language that is not already trained, such as a “new” language).

	As per claim 5, Garman discloses:	The method of claim 1, wherein a plurality of the voice synthesis model is generated for the speaker (Garman; p. 0073 - Model training routines 1016 may include software routines to train the model using, for example, a sequence of embedded phonemes, prosodic values, language identifiers, and speaker identifiers as input, along with acoustic features, to generate trained model data 1018; see also p. 0052 - The inputs for the phonemes 316, accents 308, and speakers 306 may be fed into an Embedding layer to generated embeddings 320, 324, 326. The prosodic inputs may be fed into an Embedding Bag layer to generate embeddings 322. These inputs may include, but are not limited to, stress, tone, focus, syllable position, punctuation type, part-of-speech; generating the speaker embeddings (voice synthesis inference) using the prosodic inputs (tone conversion)).
	As per claims 7 and 14, Garman discloses:	The method and system of claims 1 and 9, further comprises: receiving a speaker ID and third text (Garman; p. 0045 - …the inputs may include: the text to be spoken 302, the language of the text 304, identification of the speaker 306, and the output accent 308 (which may be the same as the language of the text)…; see also p. 0034 - each speaker 210 has a unique identifier that may, for example, be derived from the name of the corpus that contains them and the identifier within that corpus); calling the generated voice synthesis model for the speaker corresponding to the speaker ID (Garman; p. 0052 - A speaker may be chosen. This may be a built-in speaker or an enrolled speaker. An output language may be chosen. This may be the language of the text or some other language. Typically, the language of the text may be chosen, producing accent-free speech in the target language. However, if desired, any accent can be introduced intentionally. The series of phonemes may be converted to speech using the output language and the selected speaker's voice characteristics); synthesizing voice for the third text based on the called voice synthesis model (Garman; p. 0053 - The output of DRNN 228 for each frame represents the acoustic features of speech for that frame. Decoder 312 may take the acoustic features and generate a speech signal); and generating a synthesized voice for the third text (Garman; p. 0053 - The output of DRNN 228 for each frame represents the acoustic features of speech for that frame. Decoder 312 may take the acoustic features and generate a speech signal).  

	As per claims 8 and 15, Garman discloses:	The method and system of claims 7 and 14, further comprises: receiving an input for at least one of volume level, pitch, and speed for the generated synthesized voice; and adjusting one of a volume level, pitch, and speed for the generated synthesized voice based on the received input (Garman; p. 0051 - Finally, the duration of each phoneme, in frames, may be determined 406. For this step, speaker 306 may be used, as speech rate varies between speakers. For a “fast” speaker, phonemes will have, on average, shorter duration. Note also that phoneme duration varies between languages. The text conversion system may handle this interaction of speaker and language to produce a speech rate that is consistent with both the speaker and language. In addition to duration, the other prosodic elements, pitch, and loudness, are also postulated during text conversion).
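
For illustration only, the flow recited in claims 7-8 and 14-15 (receive a speaker ID and text, call the model stored for that speaker, synthesize, and apply volume, pitch, or speed inputs) might be sketched as below. Every name here (`SynthesisRequest`, `VoiceRegistry`, `toy_model`) is hypothetical, and the "model" is a stand-in function that returns a string rather than audio; nothing in this sketch is taken from the application or from Garman.

```python
from dataclasses import dataclass


@dataclass
class SynthesisRequest:
    """Hypothetical request: speaker ID, text, and optional adjustments."""
    speaker_id: str
    text: str
    volume: float = 1.0
    pitch: float = 1.0
    speed: float = 1.0


class VoiceRegistry:
    """Maps each speaker ID to that speaker's previously generated model."""

    def __init__(self):
        self._models = {}

    def register(self, speaker_id, model):
        self._models[speaker_id] = model

    def synthesize(self, request):
        # "Calling the generated voice synthesis model ... corresponding to
        # the speaker ID" is modeled as a dictionary lookup.
        if request.speaker_id not in self._models:
            raise KeyError(f"no voice model for speaker {request.speaker_id!r}")
        model = self._models[request.speaker_id]
        return model(request.text, request.volume, request.pitch, request.speed)


def toy_model(text, volume, pitch, speed):
    """Stand-in for a trained model; returns a description, not audio."""
    return f"[vol={volume} pitch={pitch} speed={speed}] {text}"


registry = VoiceRegistry()
registry.register("speaker-001", toy_model)
result = registry.synthesize(SynthesisRequest("speaker-001", "hello", speed=1.5))
```

The sketch makes one distinction visible: the adjustment values arrive as inputs on the request, whereas the cited Garman passage describes duration, pitch, and loudness being postulated during text conversion.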

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 6 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Garman in view of Lin (US PG Pub 20200058288).
	As per claim 6, Garman discloses:	The method of claim 1, upon which claim 6 depends.
	Garman, however, fails to disclose wherein only the first text selected from the plurality of predefined first text is used for the voice synthesis.
	Lin does teach wherein only the first text selected from the plurality of predefined first text is used for the voice synthesis (Lin; p. 0041 - the processing apparatus 170 may select the text script 153 for model training (step S230). The text script 153 for model training may be the same or different from the indicating text in step S210, or may be other text materials designed to facilitate subsequent training of the timbre transformation model (for example, sentences including all finals or vowels)).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and system of Garman to include wherein only the first text selected from the plurality of predefined first text is used for the voice synthesis, as taught by Lin, because although a specific text article can be converted into synthetic human voice through text-to-speech (TTS) technology, there is no related existing products that provides a friendly operation interface for the user to select the voice timbre of a specific person that the user intends to listen to (Lin; p. 0004).

	As per claim 13, the claim recites language similar to the combination of claims 5 and 6, and therefore claim 13 is rejected similarly in view of Garman and Lin.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art made of record and not relied upon includes:
	Fernandez Guajardo (US PG Pub 20220392428) discloses: An online system receives, from a client device of a posting user, a script for a voice-based content item. The online system retrieves a voice synthesis model stored in the user profile of the posting user and generates a synthetic audio stream using the retrieved voice synthesis model and based on the received script. The online system presents the generated synthetic audio stream to the posting user and receives instructions for modifying the synthetic audio stream. The online system generates a second audio stream based on the received instructions and composes the voice-based content item based on the generated second audio stream (Fernandez Guajardo; Abstract).
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez whose telephone number is (571)270-0139. The examiner can normally be reached Monday - Friday 9-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RODRIGO A CHAVEZ/Examiner, Art Unit 2658

/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658

Prosecution Timeline

May 08, 2024: Application Filed
Apr 03, 2026: Non-Final Rejection mailed (§102, §103, §112; current)

Precedent Cases

Applications granted by this same examiner with similar technology

17/544,008, Patent 12620044: SYSTEMS AND METHODS FOR TRACKING DISASTER FOOTPRINTS WITH SOCIAL STREAMING DATA (4y 5m to grant; granted May 05, 2026)

18/175,355, Patent 12597430: MULTI-CHANNEL SIGNAL GENERATOR, AUDIO ENCODER AND RELATED METHODS RELYING ON A MIXING NOISE SIGNAL (3y 1m to grant; granted Apr 07, 2026)

17/579,750, Patent 12579984: DATA AUGMENTATION SYSTEM AND METHOD FOR MULTI-MICROPHONE SYSTEMS (4y 1m to grant; granted Mar 17, 2026)

17/513,419, Patent 12541653: ENTERPRISE COGNITIVE SOLUTIONS LOCK-IN AVOIDANCE (4y 3m to grant; granted Feb 03, 2026)

17/532,315, Patent 12542136: DYNAMICALLY CONFIGURING A WARM WORD BUTTON WITH ASSISTANT COMMANDS (4y 2m to grant; granted Feb 03, 2026)

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 51%
With Interview (+38.6%): 90%
Median Time to Grant: 3y 3m (~1y 2m remaining)
PTA Risk: Low

Based on 233 resolved cases by this examiner. Grant probability derived from career allowance rate.
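
As a rough check of how these figures relate, the short Python sketch below recomputes the grant probability from the counts shown on this page (119 granted out of 233 resolved). It assumes the interview lift is added in percentage points, which is consistent with the 90% figure shown but is not stated by the source; the variable names are illustrative.

```python
# Counts and lift as shown on this page.
granted = 119
resolved = 233
interview_lift_points = 38.6  # assumed to be additive percentage points

# Career allowance rate as a percentage of resolved cases.
allowance_rate = 100 * granted / resolved

# Projected rate with an interview, capped at 100%.
with_interview = min(100.0, allowance_rate + interview_lift_points)

print(f"Grant probability: {allowance_rate:.0f}%")
print(f"With interview:    {with_interview:.0f}%")
```

Both rounded values match the page (51% and 90%). The projection is a career-wide average for this examiner and is not specific to the rejections in this application.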