Prosecution Insights
Last updated: May 29, 2026
Application No. 18/708,348

METHOD FOR PROVIDING VOICE SYNTHESIS SERVICE AND SYSTEM THEREFOR

Non-Final OA: §102, §103, §112
Filed: May 08, 2024
Priority: Nov 09, 2021 (RE 10-2021-0153451, +1 more)
Examiner: CHAVEZ, RODRIGO A
Art Unit: 2658
Tech Center: 2600 (Communications)
Assignee: LG Electronics Inc.
OA Round: 1 (Non-Final)
Grant Probability: 51% (Moderate)
OA Rounds: 1-2
Est. Remaining: 1y 2m
With Interview: 90%

Examiner Intelligence

Career Allowance Rate: 51% of resolved cases (119 granted / 233 resolved), -10.9% vs TC avg
Interview Lift: +38.6% (strong), measured on resolved cases with interview
Typical timeline: 3y 3m avg prosecution, 15 currently pending
Career history: 252 total applications across all art units
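
The examiner figures above can be cross-checked with simple arithmetic. The sketch below assumes that the interview lift and the "vs TC avg" gap are both expressed in percentage points and combine additively; the page does not say how they are computed, so the variable names and the additive model are illustrative only.

```python
# Figures copied from the examiner summary above.
granted = 119
resolved = 233

# Career allowance rate as a percentage of resolved cases.
allowance_rate = 100 * granted / resolved

# Assumption: the interview lift is additive, in percentage points.
interview_lift_points = 38.6
with_interview = allowance_rate + interview_lift_points

# Assumption: "-10.9% vs TC avg" is a percentage-point gap below the Tech Center average.
gap_vs_tc_points = -10.9
implied_tc_average = allowance_rate - gap_vs_tc_points

print(f"Career allowance rate: {allowance_rate:.1f}%")
print(f"With interview (additive assumption): {with_interview:.1f}%")
print(f"Implied Tech Center average: {implied_tc_average:.1f}%")
```

Under these assumptions the rounded results agree with the 51% and 90% figures shown on the page.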

Statute-Specific Performance

Statute   Rate     vs TC avg
§101      3.8%     -36.2%
§103      84.7%    +44.7%
§102      9.6%     -30.4%
§112      0.7%     -39.3%

The Tech Center average is an estimate. Based on career data from 233 resolved cases.
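
One way to read the "vs TC avg" column is as a percentage-point difference from a baseline. The sketch below backs out that baseline for each statute under that assumption; the function name and the interpretation are hypothetical, not something the page documents.

```python
# (examiner rate %, stated difference vs. Tech Center average), copied from the panel above.
statute_stats = {
    "101": (3.8, -36.2),
    "103": (84.7, 44.7),
    "102": (9.6, -30.4),
    "112": (0.7, -39.3),
}


def implied_tc_average(rate, delta):
    """Back out the baseline, assuming delta = rate - baseline in percentage points."""
    return round(rate - delta, 1)


implied = {
    statute: implied_tc_average(rate, delta)
    for statute, (rate, delta) in statute_stats.items()
}

for statute, baseline in implied.items():
    print(f"Section {statute}: implied Tech Center average {baseline}%")
```

Every row resolves to the same baseline, which suggests the Tech Center estimate is a single flat figure rather than a statute-specific one.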

Office Action

§102 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 05/08/2024 was filed. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections

Claim 1 is objected to because of the following informalities: limitation “being inputted second text” is not a proper sentence; the limitation appears to be missing the word “when” at the beginning of the sentence, as it is written in similar claim 9. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION. The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 5, 7, 8 and 14-15 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim 5 recites the limitation "the voice synthesis model" for which independent claim 1 recites two different “a voice synthesis model”. There is insufficient antecedent basis for this limitation in the claim because it is unclear which voice synthesis model is being referred to.

Claim 7 recites the limitation "the generated voice synthesis model" for which independent claim 1 recites two different generations of “a voice synthesis model”. There is insufficient antecedent basis for this limitation in the claim because it is unclear which “voice synthesis model” is being referred to. Claim 8 depends on claim 7 and therefore it is rejected similarly based on its dependency.

Claim 14 recites the limitation "the generated voice synthesis model" for which independent claim 9 recites two different generations of “a voice synthesis model”. There is insufficient antecedent basis for this limitation in the claim because it is unclear which “voice synthesis model” is being referred to. Claim 15 depends on claim 14 and therefore it is rejected similarly based on its dependency.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-5, 7-12, 14 and 15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Garman (US PG Pub 20210256961).

As per claims 1 and 9, Garman discloses: A method of providing voice synthesis service and an artificial intelligence-based synthesis service system, comprising:

an artificial intelligence device (Garman; Fig. 10; p. 0070 - In the example shown in FIG. 10, exemplary memory contents are shown for both a training system (AI/computing device) and a speech synthesis system (AI/computing device). While in some cases, these may both be implemented in the same computer system, typically, they are implemented in different computer systems. For example, the training system may typically be implemented in a server system or systems, while the speech synthesis system may typically be implemented in an end user device, such as a smartphone, tablet, personal computer, etc.); and

a computing device configured to exchange data with the artificial intelligence device (Garman; Fig. 10; p. 0070 - In the example shown in FIG. 10, exemplary memory contents are shown for both a training system (AI/computing device) and a speech synthesis system (AI/computing device). While in some cases, these may both be implemented in the same computer system, typically, they are implemented in different computer systems. For example, the training system may typically be implemented in a server system or systems, while the speech synthesis system may typically be implemented in an end user device, such as a smartphone, tablet, personal computer, etc.),

wherein the computing device includes: a processor (Garman; Fig. 10, items 1002A-1002N; p. 0066 - Computer system 1000 may include one or more processors (CPUs) 1002A-1002N) configured to:

receive sound source data for synthesizing a speaker's voice for a plurality of predefined first texts through a voice synthesis service platform that provides a development toolkit (Garman; Fig. 8, item 104; p. 0044 - the end-user can optionally add his or her voice to the inventory of speakers. To add a voice, the user may read and record a series of prompts in his or her native or fluent language, for example…; see also p. 0024 - Examples of such uses may include presenting voice instructions, reading texts of books, magazines, etc. aloud, etc…; see also p. 0062);

learn tone conversion for the speaker's sound source data using a pre-generated tone conversion base model (Garman; p. 0041 - …When trained, DRNN 228 may be a single universal (base) speech model and may encode all the information necessary to produce speech for all of the trained languages and speakers. The system learns, in the sense of “deep learning”, the phonemes for the various languages, as well as the voice characteristics of each of the speakers. The resulting DRNN 228 may be a single model of speech that contains all of the phonemes and prosodic patterns (tone conversion) for each of the languages and the voice characteristics of each of the included speakers; see also p. 0036; see also p. 0037-0040 – Every utterance (speaker’s sound source data) used for training may be encoded using a feature extraction function 230…; see also p. 0027 - In embodiments, the language 208 for each utterance is presented to the DRNN via embedding, while all other layers are language-independent. This allows the ASR 212 to employ transfer learning from one language to the next, resulting in an ASR 212 that gets more robust with each language that is added. For example, training on 8 different languages may produce an accuracy of about 80% at frame level phoneme identification. In addition to an embedding for each language, there is an embedding for a “universal” language. This “universal” language may be trained with a small percentage of data from all languages, and is useful for doing recognition on an “unknown” language that is not already trained, such as a “new” language);

generate a voice synthesis model for the speaker through learning the tone conversion (Garman; p. 0073 - Model training routines 1016 may include software routines to train the model using, for example, a sequence of embedded phonemes, prosodic values, language identifiers, and speaker identifiers as input, along with acoustic features, to generate trained model data 1018);

when being inputted second text (Garman; p. 0045 - the inputs may include: the text to be spoken 302…);

generate a voice synthesis model through voice synthesis inference based on the voice synthesis model for the speaker and the second text (Garman; p. 0052 - The inputs for the phonemes 316, accents 308, and speakers 306 may be fed into an Embedding layer to generated embeddings 320, 324, 326. The prosodic inputs may be fed into an Embedding Bag layer to generate embeddings 322. These inputs may include, but are not limited to, stress, tone, focus, syllable position, punctuation type, part-of-speech; generating the speaker embeddings (voice synthesis inference) using the prosodic inputs (tone conversion)); and

generate a synthesized voice using the voice synthesis model (Garman; Fig. 3, item 314; p. 0053).

As per claims 2 and 10, Garman discloses: The method and system of claims 1 and 9, wherein the step of receiving sound source data for synthesizing the speaker's voice for the plurality of predefined first texts includes: receiving the speaker's sound source multiple times for each first text; and generating sound source data for synthesizing the speaker's voice based on the speaker's sound source input multiple times (Garman; Fig. 8, item 104; p. 0044 - the end-user can optionally add his or her voice to the inventory of speakers. To add a voice, the user may read and record a series of prompts in his or her native or fluent language, for example…; see also p. 0024 - Examples of such uses may include presenting voice instructions, reading texts of books, magazines, etc. aloud, etc…; see also p. 0062).

As per claims 3 and 11, Garman discloses: The method and system of claims 2 and 10, wherein the sound source data for voice synthesis of the speaker is an average value of the speaker's sound source input multiple times (Garman; p. 0032 - In embodiments, the variables of prosody used by the system may include, for example, pitch, duration, and loudness. For each of these, the prosodic value may be relative to an average or predictable value for that variable).

As per claims 4 and 12, Garman discloses: The method and system of claims 3 and 11, wherein the step of learning the tone conversion includes performing speaker transfer learning based on the tone conversion base model (Garman; p. 0027 - In embodiments, the language 208 for each utterance is presented to the DRNN via embedding, while all other layers are language-independent. This allows the ASR 212 to employ transfer learning from one language to the next, resulting in an ASR 212 that gets more robust with each language that is added. For example, training on 8 different languages may produce an accuracy of about 80% at frame level phoneme identification. In addition to an embedding for each language, there is an embedding for a “universal” language. This “universal” language may be trained with a small percentage of data from all languages, and is useful for doing recognition on an “unknown” language that is not already trained, such as a “new” language).

As per claim 5, Garman discloses: The method of claim 1, wherein a plurality of the voice synthesis model is generated for the speaker (Garman; p. 0073 - Model training routines 1016 may include software routines to train the model using, for example, a sequence of embedded phonemes, prosodic values, language identifiers, and speaker identifiers as input, along with acoustic features, to generate trained model data 1018; see also p. 0052 - The inputs for the phonemes 316, accents 308, and speakers 306 may be fed into an Embedding layer to generated embeddings 320, 324, 326. The prosodic inputs may be fed into an Embedding Bag layer to generate embeddings 322. These inputs may include, but are not limited to, stress, tone, focus, syllable position, punctuation type, part-of-speech; generating the speaker embeddings (voice synthesis inference) using the prosodic inputs (tone conversion)).

As per claims 7 and 14, Garman discloses: The method and system of claims 1 and 9, further comprising:

receiving a speaker ID and third text (Garman; p. 0045 - …the inputs may include: the text to be spoken 302, the language of the text 304, identification of the speaker 306, and the output accent 308 (which may be the same as the language of the text)…; see also p. 0034 - each speaker 210 has a unique identifier that may, for example, be derived from the name of the corpus that contains them and the identifier within that corpus);

calling the generated voice synthesis model for the speaker corresponding to the speaker ID (Garman; p. 0052 - A speaker may be chosen. This may be a built-in speaker or an enrolled speaker. An output language may be chosen. This may be the language of the text or some other language. Typically, the language of the text may be chosen, producing accent-free speech in the target language. However, if desired, any accent can be introduced intentionally. The series of phonemes may be converted to speech using the output language and the selected speaker's voice characteristics);

synthesizing voice for the third text based on the called voice synthesis model (Garman; p. 0053 - The output of DRNN 228 for each frame represents the acoustic features of speech for that frame. Decoder 312 may take the acoustic features and generate a speech signal); and

generating a synthesized voice for the third text (Garman; p. 0053 - The output of DRNN 228 for each frame represents the acoustic features of speech for that frame. Decoder 312 may take the acoustic features and generate a speech signal).

As per claims 8 and 15, Garman discloses: The method and system of claims 7 and 14, further comprising: receiving an input for at least one of volume level, pitch, and speed for the generated synthesized voice; and adjusting one of a volume level, pitch, and speed for the generated synthesized voice based on the received input (Garman; p. 0051 - Finally, the duration of each phoneme, in frames, may be determined 406. For this step, speaker 306 may be used, as speech rate varies between speakers. For a “fast” speaker, phonemes will have, on average, shorter duration. Note also that phoneme duration varies between languages. The text conversion system may handle this interaction of speaker and language to produce a speech rate that is consistent with both the speaker and language. In addition to duration, the other prosodic elements, pitch, and loudness, are also postulated during text conversion).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 6 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Garman in view of Lin (US PG Pub 20200058288).

As per claim 6, Garman discloses the method of claim 1, upon which claim 6 depends. Garman, however, fails to disclose wherein only the first text selected from the plurality of predefined first text is used for the voice synthesis. Lin does teach wherein only the first text selected from the plurality of predefined first text is used for the voice synthesis (Lin; p. 0041 - the processing apparatus 170 may select the text script 153 for model training (step S230). The text script 153 for model training may be the same or different from the indicating text in step S210, or may be other text materials designed to facilitate subsequent training of the timbre transformation model (for example, sentences including all finals or vowels)).

Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and system of Garman to include wherein only the first text selected from the plurality of predefined first text is used for the voice synthesis, as taught by Lin, because although a specific text article can be converted into synthetic human voice through text-to-speech (TTS) technology, there are no related existing products that provide a friendly operation interface for the user to select the voice timbre of a specific person that the user intends to listen to (Lin; p. 0004).

As per claim 13, the claim recites language similar to the combination of claims 5 and 6, and therefore claim 13 is rejected similarly in view of Garman and Lin.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art made of record and not relied upon includes:

Fernandez Guajardo (US PG Pub 20220392428) discloses: An online system receives, from a client device of a posting user, a script for a voice-based content item. The online system retrieves a voice synthesis model stored in the user profile of the posting user and generates a synthetic audio stream using the retrieved voice synthesis model and based on the received script. The online system presents the generated synthetic audio stream to the posting user and receives instructions for modifying the synthetic audio stream. The online system generates a second audio stream based on the received instructions and composes the voice-based content item based on the generated second audio stream (Fernandez Guajardo; Abstract).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez whose telephone number is (571)270-0139. The examiner can normally be reached Monday - Friday 9-6 ET.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RODRIGO A CHAVEZ/
Examiner, Art Unit 2658

/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658
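
The §112(b) rejections in this action turn on a simple pattern: a dependent claim refers to "the" model while its parent claim introduces "a" model more than once. A minimal sketch of that check follows. The helper names are hypothetical, the claim excerpts are abbreviated from the examiner's paraphrase above rather than from the filed claims, and a regex match is only a rough screening aid, not a legal test for indefiniteness.

```python
import re

TERM = "voice synthesis model"

# Abbreviated excerpts taken from the examiner's claim mapping, not the filed claim text.
CLAIM_1_EXCERPT = (
    "generate a voice synthesis model for the speaker through learning the tone "
    "conversion; generate a voice synthesis model through voice synthesis inference "
    "based on the voice synthesis model for the speaker and the second text"
)
CLAIM_5_EXCERPT = (
    "wherein a plurality of the voice synthesis model is generated for the speaker"
)


def count_introductions(claim_text, term):
    """Count how many times the claim introduces the term with 'a' or 'an'."""
    pattern = rf"\b(?:a|an)\s+{re.escape(term)}\b"
    return len(re.findall(pattern, claim_text, flags=re.IGNORECASE))


def is_ambiguous_reference(parent_claim, dependent_claim, term):
    """True when the dependent claim says 'the <term>' and the parent introduces it more than once."""
    refers_back = re.search(
        rf"\bthe\s+{re.escape(term)}\b", dependent_claim, flags=re.IGNORECASE
    )
    return bool(refers_back) and count_introductions(parent_claim, term) > 1


introductions = count_introductions(CLAIM_1_EXCERPT, TERM)
flagged = is_ambiguous_reference(CLAIM_1_EXCERPT, CLAIM_5_EXCERPT, TERM)
print(f"Introductions in parent claim: {introductions}; ambiguous reference: {flagged}")
```

A check like this could help confirm that an amendment distinguishing the two models (for example, by giving each a distinct name) leaves every later "the ..." reference with exactly one antecedent.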

Prosecution Timeline

May 08, 2024: Application Filed
Apr 03, 2026: Non-Final Rejection mailed (§102, §103, §112; current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12620044: SYSTEMS AND METHODS FOR TRACKING DISASTER FOOTPRINTS WITH SOCIAL STREAMING DATA
4y 5m to grant; granted May 05, 2026
Patent 12597430: MULTI-CHANNEL SIGNAL GENERATOR, AUDIO ENCODER AND RELATED METHODS RELYING ON A MIXING NOISE SIGNAL
3y 1m to grant; granted Apr 07, 2026
Patent 12579984: DATA AUGMENTATION SYSTEM AND METHOD FOR MULTI-MICROPHONE SYSTEMS
4y 1m to grant; granted Mar 17, 2026
Patent 12541653: ENTERPRISE COGNITIVE SOLUTIONS LOCK-IN AVOIDANCE
4y 3m to grant; granted Feb 03, 2026
Patent 12542136: DYNAMICALLY CONFIGURING A WARM WORD BUTTON WITH ASSISTANT COMMANDS
4y 2m to grant; granted Feb 03, 2026
Based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 51%
With Interview: 90% (+38.6%)
Median Time to Grant: 3y 3m (~1y 2m remaining)
PTA Risk: Low
Based on 233 resolved cases by this examiner. Grant probability derived from career allowance rate.
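
The "~1y 2m remaining" estimate can be reproduced from dates shown on this page. The sketch below assumes the projection simply adds the 3y 3m figure to the filing date and measures the remainder from the page's last-updated date (May 29, 2026); both assumptions, and the helper names, are illustrative rather than documented behavior.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date forward by whole months, clamping the day to the target month's length."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def whole_months_between(start, end):
    """Count completed months from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


filed = date(2024, 5, 8)              # "Filed" date from the page
last_updated = date(2026, 5, 29)      # "Last updated" date from the page
time_to_grant_months = 3 * 12 + 3     # the "3y 3m" figure

projected_grant = add_months(filed, time_to_grant_months)
remaining = whole_months_between(last_updated, projected_grant)
years, months = divmod(remaining, 12)

print(f"Projected grant: {projected_grant.isoformat()}")
print(f"Remaining: {years}y {months}m")
```

Measuring from the last-updated date yields 1y 2m, matching the page; measuring from the April 3, 2026 mailing date instead would give a longer remainder.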
