Last updated: May 28, 2026

Application No. 18/396,025

METHOD OF CONSTRUCTING TRAINING DATASET FOR SPEECH SYNTHESIS THROUGH FUSION OF LANGUAGE, SPEAKER, AND EMOTION WITHIN UTTERANCE

Final Rejection §103§112

Filed

Dec 26, 2023

Priority

Nov 06, 2023 — RE 10-2023-0151356

Examiner

JACKSON, JAKIEDA R

Art Unit

2657

Tech Center

2600 — Communications

Assignee

Korea Electronics Technology Institute

OA Round

2 (Final)

Interview Optional

— +15.4% interview lift. Examiner has a relatively high allowance rate (74%); +15.4% interview lift. A written response may suffice.

Based on 911 resolved cases, 2023–2026

Examiner Intelligence

JACKSON, JAKIEDA R View full profile →

Grants 74% — above average

Career Allowance Rate

675 granted / 911 resolved

+12.1% vs TC avg

Strong +15% interview lift

Without

With

+15.4%

Interview Lift

resolved cases with interview

Typical timeline

3y 0m

Avg Prosecution

26 currently pending

Career history

946

Total Applications

across all art units

Statute-Specific Performance

§101

13.3%

-26.7% vs TC avg

§103

66.8%

+26.8% vs TC avg

§102

12.8%

-27.2% vs TC avg

§112

1.1%

-38.9% vs TC avg

Black line = Tech Center average estimate • Based on career data from 911 resolved cases

Office Action

§103 §112

DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Response to Amendment
In response to the Office Action mailed August 27, 2025, applicant submitted an amendment filed on November 21, 2025, in which the applicant amended and requested reconsideration.

Response to Arguments
Applicants argue that the prior art cited fails to teach the claims as amended.  Applicants’ arguments are persuasive, but are moot in view of new ground of rejection.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1 and 10 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. The term “speaker identity” is not mentioned or described in the specification or the claims.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



Claim(s) 1-19 is/are rejected under 35 U.S.C. 103(a) as being anticipated by Kapilow et al. (USPN 7,454,348) in view of Pearson (PGPUB 2021/0217431).

Regarding claims 1 and 10, Kapilow discloses a processor-implemented method and system, hereinafter referenced a method of constructing a training dataset for a text-to-speech model, comprising: 
collecting, by a processor, speech data including different speech utterance information (fig. 2a, elements 302-304 with column 2, lines 58-67) comprising digitized audio signals and corresponding metadata including at least on of language, speaker identity and emotion (column 4, line 58 – column 5, line 20);
generating, by the processor, a training dataset by a storing the fused feature sequence as a synthetic utterance dataset configured as training input for the text-to-speech model (blend voice; fig. 3A, element 308 with column 3, lines 2-27); and
training, by the processor, the text-to-speech model using the synthetic utterance dataset such that synthesis performance across variations of language, speaker, and emotion is improved (column 4, line 58 – column 5, line 20), but does not specifically teach increasing, by the processor, the speech data by fusing the collected speech data within a single utterance, time sequence, including combining time-series portions of at least two speech data samples to generate increased speech data.
Pearson discloses a method wherein increasing, by the processor, the speech data by fusing the collected speech data within a single utterance, time sequence, including combining time-series portions of at least two speech data samples to generate increased speech data (voice morphing apparatus may operate upon a series of time steps; p. 0066, wherein a plurality of audio data sets are used; p. 0069, 0082), to provide natural sounding audio.
Therefore, it would have been obvious to one of ordinary skill of the art, before the effective filing data of the claimed invention, to modify the method as described above, to assist with providing intelligible audio.
Regarding claim 2, Kapilow discloses a method wherein the training dataset includes a text on speech data and speech utterance information as input data of the speech synthesis model, wherein the training dataset includes speech data as output data of the speech synthesis model (synthetic; column 2, lines 36-67).  
Regarding claim 3, Kapilow discloses a method wherein the speech utterance information includes at least one of a language of a text, a speaker uttering the text, and an emotion of the speaker uttering the text (emotion; column 2, lines 25-35).  
Regarding claim 4, Kapilow discloses a method wherein the increasing comprises fusing speech data of different languages within one utterance according to time series (languages/accents; column 2, lines 25-35).  
`	Regarding claim 5, Kapilow discloses a method wherein the increasing comprises fusing speech data of different speakers within one utterance according to time series (blended characteristics of two voices; column 3, lines 2-27).  
Regarding claim 6, it is interpreted and rejected for similar reasons as set forth in the combination of claims 4-5.
Regarding claim 7, Kapilow discloses a method wherein the step of increasing comprises fusing speech data of different emotions within one utterance according to time series (emotion; column 2, lines 25-35).  
Regarding claim 8, Kapilow discloses a method wherein the increasing comprises fusing speech data of different paralinguistic expressions within one utterance according to time series (voice characteristics; column 2, lines 10-67).  
Regarding claim 9, Kapilow discloses a method further comprising training the speech synthesis model with a generated training dataset (train; column 4, lines 10-21).  
Regarding claim 11, Kapilow discloses a training method of a speech synthesis model, the method comprising: 
a step of increasing speech data by fusing speech data having different speech utterance information within one utterance (blend voice; fig. 3, element 308 with column 3, lines 2-27); 
a step of generating a training dataset by using the increased speech data (train; column 4, lines 10-21); and
 a step of training the speech synthesis model with the generated training dataset (synthetic; column 2, lines 36-67).
Regarding claim 12, Kapilow discloses a training method wherein the training dataset includes a text on speech data and speech utterance information as input data of the speech synthesis model, wherein the training dataset includes speech data as output data of the speech synthesis model (column 4, line 58 – column 5, line 20).  
Regarding claim 13, Kapilow discloses a training method wherein the speech utterance information includes at least one of a language of a text, a speaker uttering the text, and an emotion of the speaker uttering the text (language/emotion; column 5, lines 1-20).  
Regarding claim 14, it is interpreted and rejected for similar reasons as set forth above.  In addition, Kapilow discloses a training method wherein, for the increasing, the processor is configured to fuse speech data of different languages (column 4, lines 10-40).
Regarding claim 15, it is interpreted and rejected for similar reasons as set forth above.  In addition, Kapilow discloses a training method wherein, for the increasing, the processor is configured to fuse speech data of different speakers (column 4, lines 10-40).  Furthermore, Pearson discloses a training method wherein, for the increasing, the processor is configured to fuse speech data of different speakers (different speakers; p. 0104) within one utterance according to time series (p. 0066). 
Regarding claim 16, it is interpreted and rejected for similar reasons as set forth above.  In addition, Kapilow discloses a training method wherein, for the increasing, the processor is configured to fuse speech data of different languages (column 5, lines 1-20) and different speakers (column 4, lines 10-40).
Regarding claim 17, it is interpreted and rejected for similar reasons as set forth above.  In addition, Kapilow discloses a training method wherein, for the increasing, the processor is configured to fuse speech data of different emotions (column 5, lines 1-20).
Regarding claim 18, it is interpreted and rejected for similar reasons as set forth above.  In addition, Kapilow discloses a training method wherein, for the increasing, the processor is configured to fuse speech data of different paralinguistic expressions (pitch; column 3, line 28 – column 4, line 9). 
Regarding claim 19, it is interpreted and rejected for similar reasons as set forth above.  In addition, Kapilow discloses a training method wherein the processor is further configured to train a speech synthesis model with the generated training dataset (synthesis; column 2, lines 10-67).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.  This information has been detailed in the PTO 892 attached (Notice of References Cited).
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAKIEDA R JACKSON whose telephone number is (571)272-7619. The examiner can normally be reached Mon - Fri 6:30a-2:30p.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached at 571.272.5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JAKIEDA R JACKSON/Primary Examiner, Art Unit 2657

Read full office action

Prosecution Timeline

Dec 26, 2023

Application Filed

Aug 27, 2025

Non-Final Rejection mailed — §103, §112

Nov 21, 2025

Response Filed

Mar 25, 2026

Final Rejection mailed — §103, §112

May 21, 2026

Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/519,305

Patent 12639530

SYSTEMS AND METHODS FOR A CONVERSATIONAL ASSISTANT USING ARTIFICIAL INTELLIGENCE IN A FREIGHT MANAGEMENT PLATFORM

2y 6m to grant Granted May 26, 2026

18/594,019

Patent 12633282

PROCESSING METHOD AND PROCESSING APPARATUS OF SOUND SIGNAL

2y 2m to grant Granted May 19, 2026

17/490,514

Patent 12609102

TRAINING DATASET GENERATION FOR SPEECH-TO-TEXT SERVICE

4y 6m to grant Granted Apr 21, 2026

18/587,860

Patent 12609111

CLUSTERING AND MINING ACCENTED SPEECH FOR INCLUSIVE AND FAIR SPEECH RECOGNITION

2y 1m to grant Granted Apr 21, 2026

18/151,953

Patent 12603079

PROVIDING A REPOSITORY OF AUDIO FILES HAVING PRONUNCIATIONS FOR TEXT STRINGS TO PROVIDE TO A SPEECH SYNTHESIZER

3y 3m to grant Granted Apr 14, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.

Typically takes 5-10 seconds — AI-generated, attorney review required before filing

Prosecution Projections

3-4

Expected OA Rounds

74%

Grant Probability

90%

With Interview (+15.4%)

3y 0m (~7m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 911 resolved cases by this examiner. Grant probability derived from career allowance rate.