Prosecution Insights
Last updated: April 19, 2026
Application No. 18/842,141

Speech Synthesizer and Method for Speech Synthesis

Non-Final OA: §102, §103
Filed: Aug 28, 2024
Examiner: COLUCCI, MICHAEL C
Art Unit: 2655
Tech Center: 2600 — Communications
Assignee: Siemens Aktiengesellschaft
OA Round: 1 (Non-Final)
Grant Probability: 76% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 1m
With Interview: 91%

Examiner Intelligence

Career Allow Rate: 76% (749 granted / 990 resolved), +13.7% vs TC avg; above average
Interview Lift: +15.3% among resolved cases with an interview; a strong lift
Typical Timeline: 3y 1m average prosecution; 41 applications currently pending
Career History: 1,031 total applications across all art units
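
The card's headline figures are simple ratios of the raw counts it shows. Here is a minimal sketch of that arithmetic in Python, assuming the "+13.7% vs TC avg" delta is the examiner's allow rate minus a Tech Center baseline; the ~62% baseline is back-computed from the displayed delta, not published on this page.

```python
# Reproduce the card's headline figures from its raw counts.
granted, resolved = 749, 990
tc_avg_allow = 0.620                       # assumption: implied by "+13.7% vs TC avg"

allow_rate = granted / resolved            # 0.7566 -> displayed as 76%
delta_vs_tc = allow_rate - tc_avg_allow    # 0.1366 -> displayed as +13.7%

print(f"Career allow rate: {allow_rate:.1%}")    # Career allow rate: 75.7%
print(f"vs TC average:     {delta_vs_tc:+.1%}")  # vs TC average:     +13.7%
```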

Statute-Specific Performance

§101: 14.2% (-25.8% vs TC avg)
§103: 59.2% (+19.2% vs TC avg)
§102: 8.5% (-31.5% vs TC avg)
§112: 6.0% (-34.0% vs TC avg)
Tech Center averages are estimates; based on career data from 990 resolved cases.
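
If each "vs TC avg" figure is simply the examiner's rate minus the Tech Center average, the baselines can be backed out from the chart values; they all land on the same 40.0%, consistent with the footnote's single Tech Center average estimate. A sketch under that assumption (the subtraction convention is inferred, not documented):

```python
# Back out the TC baseline implied by each bar, assuming
# delta = examiner_rate - tc_average (all figures in percent).
rates  = {"§101": 14.2, "§103": 59.2, "§102": 8.5, "§112": 6.0}
deltas = {"§101": -25.8, "§103": 19.2, "§102": -31.5, "§112": -34.0}

for statute, rate in rates.items():
    tc_avg = rate - deltas[statute]
    print(f"{statute}: examiner {rate:.1f}% vs TC avg {tc_avg:.1f}%")
# Every implied baseline is 40.0%, i.e. one TC-wide estimate, not per-statute averages.
```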

Office Action

Rejection bases: §102, §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-4, 7, 9-11, 14, and 15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US 20220122580 A1 to ZHAO, Pei et al. (hereinafter ZHAO).

Re claim 1, ZHAO teaches:

1. A speech synthesizer, comprising: (synthesizer expressly, fig. 5)

a processor with a speech analysis module to analyze and process natural language to formulate content of an utterance and an emotional module to perform emotional modeling of the utterance in synthetic speech; (hardware in figs. 14a-14b; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

a neural network with an AI system programmed with a generic algorithm; (a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

a microphone with recording; (mic, 0041)

a memory storing a recording of natural and/or artificially spoken speech as acoustic data; (0108 and 0110, such as housing the neural network, e.g. a DNN, 0104, using multiple emotions; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

wherein the processor receives, analyzes, and processes the acoustic data stored in the memory; (0108 and 0110, such as housing the neural network, e.g. a DNN, 0104, using multiple emotions; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

wherein the speech analysis module and the emotional module are connected to the neural network; (in fig. 5, a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

wherein the neural network provides a suggestion for the emotional modeling with regard to content of the utterance; (creating an estimate or suggestion per se for emotion with a scale such as a strength of emotion via metric, 0058, using a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

wherein the AI system develops a suggestion for the emotional modeling on the basis of appropriate training data at least partly generated by human interaction; and (a human provides the input and the model is therefore trained; the system is then tasked with creating an estimate or suggestion per se for emotion with a scale such as a strength of emotion via metric, 0058, using a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

a speaker for reproducing synthetic speech. (0121, speaker using the fig. 5 synthesis output)

Re claim 2, ZHAO teaches:

2. The speech synthesizer as claimed in claim 1, further comprising a speech processing model using a deep learning architecture to generate human-like text. (modeling human input to produce human-like output; a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 3, ZHAO teaches:

3. The speech synthesizer as claimed in claim 1, further comprising an interface to a library. (a corpus, for instance, 0118-0121 with 0043; a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 4, ZHAO teaches:

4. The speech synthesizer as claimed in claim 1, further comprising a module to capture human emotions with a series of controllers, each of which can be assigned to different emotions. (emotions assigned to slots and different text, the text tags being controllers under BRI, via a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 7, ZHAO teaches:

7. The speech synthesizer as claimed in claim 1, wherein the memory stores acquired data for comparison with already existing data. (comparison of data, 0077, 0097; a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 9, ZHAO teaches:

9. A method for speech synthesis, the method comprising: (synthesizer, fig. 5)

playing back synthetic and/or human speech; (fig. 5, playing back human speech)

capturing one or more human responses to the speech in real time; (a command would be in real time, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

converting the captured data into machine-processable data; (fig. 5 conversions)

storing the machine-processable data; (figs. 14a-b, e.g. an interface with processor)

repeating the above a multiplicity of times; (as in fig. 7 and 0078, an nth number of repetitions, e.g. a user using the device more than once)

forwarding the machine-processable data as training data to a neural network to provide solutions for speech synthesis via generic programming based on the machine-processable data; (a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

implementing suggestions for speech synthesis generated by the AI system by means of a suitably configured processor; and (a human provides the input and the model is therefore trained; the system is then tasked with creating an estimate or suggestion per se for emotion with a scale such as a strength of emotion via metric, 0058, using a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

broadcasting the synthesized speech. (0121, speaker using the fig. 5 synthesis output)

Re claim 10, ZHAO teaches:

10. The method as claimed in claim 9, further comprising capturing human response with regard to emotions including: admiration, pleasure, fear, annoyance, approval, compassion, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, agitation, anxiety, gratitude, sorrow, joy, love, nervousness, optimism, pride, and/or awareness. (e.g. 0102, happy/joy, anger, etc., under various interpretations depending on context, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 11, ZHAO teaches:

11. The method as claimed in claim 10, further comprising capturing an intensity of the human response. (creating an estimate or suggestion per se for emotion with a scale such as a strength of emotion via metric, 0058; a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 14, ZHAO teaches:

14. The method as claimed in claim 9, wherein the multiplicity of times is from 2 to 1,000 times. (as in fig. 7 and 0078, an nth number of repetitions, e.g. a user using the device more than once)

Re claim 15, ZHAO teaches:

15. The method as claimed in claim 1, wherein the processor provides a classification of the various learned emotional models. (one or more models, several sub-models, 0033, 0053; a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5, 6, 8, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over US 20220122580 A1 to ZHAO, Pei et al. (hereinafter ZHAO) in view of US 20200074154 A1 to el Kaliouby, Rana et al. (hereinafter el Kaliouby).

Re claim 5, while ZHAO teaches speech synthesis with emotion, it fails to teach:

5. The speech synthesizer as claimed in claim 1, wherein the microphone includes a filter for noise selection. (el Kaliouby: noise filtering, 0056, while including or excluding inputs such as breathing sounds, 0140, in learning models, e.g. emotion-based, 0093, in well-known multimedia compression schemes)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of ZHAO to incorporate the above claim limitations as taught by el Kaliouby, combining prior art elements according to known methods to yield predictable results, such as applying a generalized concept like noise filtration to produce a cleaner signal in ZHAO using well-known noise filters as in el Kaliouby.

Re claim 6, while ZHAO teaches speech synthesis with emotion, it fails to teach:

6. The speech synthesizer as claimed in claim 1, wherein the microphone captures breathing sounds. (el Kaliouby: noise filtering, 0056, while including or excluding inputs such as breathing sounds, 0140, in learning models, e.g. emotion-based, 0093, in well-known multimedia compression schemes)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of ZHAO to incorporate the above claim limitations as taught by el Kaliouby, combining prior art elements according to known methods to yield predictable results, such as extracting breathing information as a form of emotional analysis, thereby contributing to the data extraction of ZHAO by expanding emotion into cognitive states, moods, mental states, emotional states, etc.

Re claim 8, while ZHAO teaches speech synthesis with emotion, it fails to teach:

8. The speech synthesizer as claimed in claim 1, wherein the memory compresses incoming data. (el Kaliouby: noise filtering, 0056, while including or excluding inputs such as breathing sounds, 0140, in learning models, e.g. emotion-based, 0093, in well-known multimedia compression schemes)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of ZHAO to incorporate the above claim limitations as taught by el Kaliouby, combining prior art elements according to known methods to yield predictable results, such as using well-known multimedia compression, e.g. H.264, to handle streamed inputs in smaller amounts for communication, removing unnecessary data to reduce file size while maintaining quality of the audio (or video, or both).

Re claim 13, while ZHAO teaches speech synthesis with emotion, it fails to teach:

13. The method as claimed in claim 9, further comprising identifying breathing sounds of the human speaker. (el Kaliouby: noise filtering, 0056, while including or excluding inputs such as breathing sounds, 0140, in learning models, e.g. emotion-based, 0093, in well-known multimedia compression schemes)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of ZHAO to incorporate the above claim limitations as taught by el Kaliouby, combining prior art elements according to known methods to yield predictable results, such as extracting breathing information as a form of emotional analysis, thereby contributing to the data extraction of ZHAO by expanding emotion into cognitive states, moods, mental states, emotional states, etc.

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over US 20220122580 A1 to ZHAO, Pei et al. (hereinafter ZHAO) in view of US 20230127120 A1 to CHANDRAN, Chakkaradeep Chinnakonda et al. (hereinafter CHANDRAN).

Re claim 12, while ZHAO teaches speech synthesis with emotion, it fails to teach:

12. The method as claimed in claim 9, further comprising identifying filler words in the speech. (CHANDRAN: removing filler words, 0033-0034, 0069)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of ZHAO to incorporate the above claim limitations as taught by CHANDRAN, combining prior art elements according to known methods to yield predictable results, such as removing a filler word for a smoother, more articulate output by tagging the filler word in the text so that the synthesizer can remove it for a more natural-sounding output.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: US 11605384 B1 to Dalton, Steven et al., "AI driven emotional analysis in conversation."

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL C COLUCCI, whose telephone number is (571) 270-1847. The examiner can normally be reached M-F, 9 AM - 5 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MICHAEL COLUCCI/
Primary Examiner, Art Unit 2655
(571) 270-1847
Examiner FAX: (571) 270-2847
Michael.Colucci@uspto.gov
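
For readers mapping the rejection back onto the claim, a purely illustrative Python skeleton of the system recited in claim 1 follows. Every name in it is invented for this sketch and is not drawn from the application's disclosure, from ZHAO, or from the examiner's analysis.

```python
# Illustrative skeleton of the elements recited in claim 1; one field per
# recited component. All names are invented here, not the application's.
from dataclasses import dataclass, field

@dataclass
class SpeechSynthesizer:
    speech_analysis_module: object      # formulates the content of an utterance
    emotional_module: object            # performs emotional modeling of the utterance
    neural_network: object              # AI system; both modules connect to it
    microphone: object                  # captures and records speech
    speaker: object                     # reproduces the synthetic speech
    memory: list = field(default_factory=list)  # stores recordings as acoustic data

    def process_stored_audio(self) -> None:
        """Processor role per the claim: receive, analyze, and process the
        acoustic data in memory, with the network supplying a suggestion for
        the emotional modeling of the utterance's content."""
        for acoustic_data in self.memory:
            ...  # analysis -> emotional modeling -> synthesis -> playback
```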

Prosecution Timeline

Aug 28, 2024
Application Filed
Feb 24, 2026
Non-Final Rejection — §102, §103 (current)
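
The gap between the two docketed events above works out to roughly a year and a half; a quick check with the standard library:

```python
# Pendency from filing to first Office action, using the dates above.
from datetime import date

filed    = date(2024, 8, 28)   # Application Filed
first_oa = date(2026, 2, 24)   # Non-Final Rejection

days = (first_oa - filed).days
print(days, "days, ~", round(days / 30.44, 1), "months")  # 545 days, ~ 17.9 months
```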

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592240
ENCODING AND DECODING OF ACOUSTIC ENVIRONMENT
2y 5m to grant · Granted Mar 31, 2026
Patent 12586570
CHUNK-WISE ATTENTION FOR LONGFORM ASR
2y 5m to grant · Granted Mar 24, 2026
Patent 12573405
WORD CORRECTION USING AUTOMATIC SPEECH RECOGNITION (ASR) INCREMENTAL RESPONSE
2y 5m to grant · Granted Mar 10, 2026
Patent 12573380
MANAGING AMBIGUOUS DATE MENTIONS IN TRANSFORMING NATURAL LANGUAGE TO A LOGICAL FORM
2y 5m to grant · Granted Mar 10, 2026
Patent 12567414
SYSTEM AND METHOD FOR DETECTING A WAKEUP COMMAND FOR A VOICE ASSISTANT
2y 5m to grant · Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview: 91% (+15.3%)
Median Time to Grant: 3y 1m
PTA Risk: Low
Based on 990 resolved cases by this examiner. Grant probability derived from career allow rate.
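
The with-interview figure composes from the other two numbers on this card: base grant probability plus interview lift, capped at 100%. A sketch assuming that simple additive model (the dashboard's actual method is not stated):

```python
# With-interview projection as base probability plus interview lift,
# assuming a simple additive model (the dashboard's method is not published).
base_grant_prob = 0.757        # career allow rate, 749 / 990
interview_lift  = 0.153        # "+15.3%" from the examiner card

with_interview = min(base_grant_prob + interview_lift, 1.0)
print(f"With interview: {with_interview:.0%}")   # With interview: 91%
```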
