Prosecution Insights
Last updated: April 19, 2026
Application No. 18/842,141

Speech Synthesizer and Method for Speech Synthesis

Non-Final OA: §102, §103
Filed: Aug 28, 2024
Examiner: COLUCCI, MICHAEL C
Art Unit: 2655
Tech Center: 2600 — Communications
Assignee: Siemens Aktiengesellschaft
OA Round: 1 (Non-Final)
Grant Probability: 76% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 1m
With Interview: 91%

Examiner Intelligence

Career Allow Rate: 76% (749 granted / 990 resolved), +13.7% vs TC avg; above average
Interview Lift: +15.3% among resolved cases with an interview; a strong lift
Typical Timeline: 3y 1m average prosecution; 41 applications currently pending
Career History: 1,031 total applications across all art units
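
The card's headline figures are simple ratios of the raw counts it shows. Here is a minimal sketch of that arithmetic in Python, assuming the "+13.7% vs TC avg" delta is the examiner's allow rate minus a Tech Center baseline; the ~62% baseline is back-computed from the displayed delta, not published on this page.

```python
# Reproduce the card's headline figures from its raw counts.
granted, resolved = 749, 990
tc_avg_allow = 0.620                       # assumption: implied by "+13.7% vs TC avg"

allow_rate = granted / resolved            # 0.7566 -> displayed as 76%
delta_vs_tc = allow_rate - tc_avg_allow    # 0.1366 -> displayed as +13.7%

print(f"Career allow rate: {allow_rate:.1%}")    # Career allow rate: 75.7%
print(f"vs TC average:     {delta_vs_tc:+.1%}")  # vs TC average:     +13.7%
```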

Statute-Specific Performance

§101: 14.2% (-25.8% vs TC avg)
§103: 59.2% (+19.2% vs TC avg)
§102: 8.5% (-31.5% vs TC avg)
§112: 6.0% (-34.0% vs TC avg)
Tech Center averages are estimates; based on career data from 990 resolved cases.
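
If each "vs TC avg" figure is simply the examiner's rate minus the Tech Center average, the baselines can be backed out from the chart values; they all land on the same 40.0%, consistent with the footnote's single Tech Center average estimate. A sketch under that assumption (the subtraction convention is inferred, not documented):

```python
# Back out the TC baseline implied by each bar, assuming
# delta = examiner_rate - tc_average (all figures in percent).
rates  = {"§101": 14.2, "§103": 59.2, "§102": 8.5, "§112": 6.0}
deltas = {"§101": -25.8, "§103": 19.2, "§102": -31.5, "§112": -34.0}

for statute, rate in rates.items():
    tc_avg = rate - deltas[statute]
    print(f"{statute}: examiner {rate:.1f}% vs TC avg {tc_avg:.1f}%")
# Every implied baseline is 40.0%, i.e. one TC-wide estimate, not per-statute averages.
```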

Office Action

Rejection bases: §102, §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-4, 7, 9-11, 14, and 15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US 20220122580 A1 to ZHAO, Pei et al. (hereinafter ZHAO).

Re claim 1, ZHAO teaches:

1. A speech synthesizer, comprising: (synthesizer expressly, fig. 5)

a processor with a speech analysis module to analyze and process natural language to formulate content of an utterance and an emotional module to perform emotional modeling of the utterance in synthetic speech; (hardware in figs. 14a-14b; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

a neural network with an AI system programmed with a generic algorithm; (a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

a microphone with recording; (mic, 0041)

a memory storing a recording of natural and/or artificially spoken speech as acoustic data; (0108 and 0110, such as housing the neural network, e.g. a DNN, 0104, using multiple emotions; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

wherein the processor receives, analyzes, and processes the acoustic data stored in the memory; (0108 and 0110, such as housing the neural network, e.g. a DNN, 0104, using multiple emotions; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

wherein the speech analysis module and the emotional module are connected to the neural network; (in fig. 5, a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

wherein the neural network provides a suggestion for the emotional modeling with regard to content of the utterance; (creating an estimate or suggestion per se for emotion with a scale such as a strength of emotion via metric, 0058, using a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

wherein the AI system develops a suggestion for the emotional modeling on the basis of appropriate training data at least partly generated by human interaction; and (a human provides the input and the model is therefore trained; the system is then tasked with creating an estimate or suggestion per se for emotion with a scale such as a strength of emotion via metric, 0058, using a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

a speaker for reproducing synthetic speech. (0121, speaker using the fig. 5 synthesis output)

Re claim 2, ZHAO teaches:

2. The speech synthesizer as claimed in claim 1, further comprising a speech processing model using a deep learning architecture to generate human-like text. (modeling human input to produce human-like output; a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 3, ZHAO teaches:

3. The speech synthesizer as claimed in claim 1, further comprising an interface to a library. (a corpus, for instance, 0118-0121 with 0043; a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 4, ZHAO teaches:

4. The speech synthesizer as claimed in claim 1, further comprising a module to capture human emotions with a series of controllers, each of which can be assigned to different emotions. (emotions assigned to slots and different text, the text tags being controllers under BRI, via a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 7, ZHAO teaches:

7. The speech synthesizer as claimed in claim 1, wherein the memory stores acquired data for comparison with already existing data. (comparison of data, 0077, 0097; a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 9, ZHAO teaches:

9. A method for speech synthesis, the method comprising: (synthesizer, fig. 5)

playing back synthetic and/or human speech; (fig. 5, playing back human speech)

capturing one or more human responses to the speech in real time; (a command would be in real time, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

converting the captured data into machine-processable data; (fig. 5 conversions)

storing the machine-processable data; (figs. 14a-b, e.g. an interface with processor)

repeating the above a multiplicity of times; (as in fig. 7 and 0078, an nth number of repetitions, e.g. a user using the device more than once)

forwarding the machine-processable data as training data to a neural network to provide solutions for speech synthesis via generic programming based on the machine-processable data; (a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

implementing suggestions for speech synthesis generated by the AI system by means of a suitably configured processor; and (a human provides the input and the model is therefore trained; the system is then tasked with creating an estimate or suggestion per se for emotion with a scale such as a strength of emotion via metric, 0058, using a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

broadcasting the synthesized speech. (0121, speaker using the fig. 5 synthesis output)

Re claim 10, ZHAO teaches:

10. The method as claimed in claim 9, further comprising capturing human response with regard to emotions including: admiration, pleasure, fear, annoyance, approval, compassion, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, agitation, anxiety, gratitude, sorrow, joy, love, nervousness, optimism, pride, and/or awareness. (e.g. 0102, happy/joy, anger, etc., under various interpretations depending on context, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 11, ZHAO teaches:

11. The method as claimed in claim 10, further comprising capturing an intensity of the human response. (creating an estimate or suggestion per se for emotion with a scale such as a strength of emotion via metric, 0058; a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Re claim 14, ZHAO teaches:

14. The method as claimed in claim 9, wherein the multiplicity of times is from 2 to 1,000 times. (as in fig. 7 and 0078, an nth number of repetitions, e.g. a user using the device more than once)

Re claim 15, ZHAO teaches:

15. The method as claimed in claim 1, wherein the processor provides a classification of the various learned emotional models. (one or more models, several sub-models, 0033, 0053; a neural network, e.g. a DNN, using multiple emotions, 0104; extracting multiple emotions from speech and thus tagged emotion text, 0067, 0073 with 0102, using emotional models and speech processing to extract intent with emotion, 0029, such as for a command, 0057, as in fig. 5 using the ASR results into text from fig. 3)

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5, 6, 8, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over US 20220122580 A1 to ZHAO, Pei et al. (hereinafter ZHAO) in view of US 20200074154 A1 to el Kaliouby, Rana et al. (hereinafter el Kaliouby).

Re claim 5, while ZHAO teaches speech synthesis with emotion, it fails to teach:

5. The speech synthesizer as claimed in claim 1, wherein the microphone includes a filter for noise selection. (el Kaliouby: noise filtering, 0056, while including or excluding inputs such as breathing sounds, 0140, in learning models, e.g. emotion-based, 0093, in well-known multimedia compression schemes)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of ZHAO to incorporate the above claim limitations as taught by el Kaliouby, combining prior art elements according to known methods to yield predictable results, such as applying a generalized concept like noise filtration to produce a cleaner signal in ZHAO using well-known noise filters as in el Kaliouby.

Re claim 6, while ZHAO teaches speech synthesis with emotion, it fails to teach:

6. The speech synthesizer as claimed in claim 1, wherein the microphone captures breathing sounds. (el Kaliouby: noise filtering, 0056, while including or excluding inputs such as breathing sounds, 0140, in learning models, e.g. emotion-based, 0093, in well-known multimedia compression schemes)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of ZHAO to incorporate the above claim limitations as taught by el Kaliouby, combining prior art elements according to known methods to yield predictable results, such as extracting breathing information as a form of emotional analysis, thereby contributing to the data extraction of ZHAO by expanding emotion into cognitive states, moods, mental states, emotional states, etc.

Re claim 8, while ZHAO teaches speech synthesis with emotion, it fails to teach:

8. The speech synthesizer as claimed in claim 1, wherein the memory compresses incoming data. (el Kaliouby: noise filtering, 0056, while including or excluding inputs such as breathing sounds, 0140, in learning models, e.g. emotion-based, 0093, in well-known multimedia compression schemes)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of ZHAO to incorporate the above claim limitations as taught by el Kaliouby, combining prior art elements according to known methods to yield predictable results, such as using well-known multimedia compression, e.g. H.264, to handle streamed inputs in smaller amounts for communication, removing unnecessary data to reduce file size while maintaining quality of the audio (or video, or both).

Re claim 13, while ZHAO teaches speech synthesis with emotion, it fails to teach:

13. The method as claimed in claim 9, further comprising identifying breathing sounds of the human speaker. (el Kaliouby: noise filtering, 0056, while including or excluding inputs such as breathing sounds, 0140, in learning models, e.g. emotion-based, 0093, in well-known multimedia compression schemes)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of ZHAO to incorporate the above claim limitations as taught by el Kaliouby, combining prior art elements according to known methods to yield predictable results, such as extracting breathing information as a form of emotional analysis, thereby contributing to the data extraction of ZHAO by expanding emotion into cognitive states, moods, mental states, emotional states, etc.

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over US 20220122580 A1 to ZHAO, Pei et al. (hereinafter ZHAO) in view of US 20230127120 A1 to CHANDRAN, Chakkaradeep Chinnakonda et al. (hereinafter CHANDRAN).

Re claim 12, while ZHAO teaches speech synthesis with emotion, it fails to teach:

12. The method as claimed in claim 9, further comprising identifying filler words in the speech. (CHANDRAN: removing filler words, 0033-0034, 0069)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of ZHAO to incorporate the above claim limitations as taught by CHANDRAN, combining prior art elements according to known methods to yield predictable results, such as removing a filler word for a smoother, more articulate output by tagging the filler word in the text so that the synthesizer can remove it for a more natural-sounding output.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: US 11605384 B1 to Dalton, Steven et al., "AI driven emotional analysis in conversation."

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL C COLUCCI, whose telephone number is (571) 270-1847. The examiner can normally be reached M-F, 9 AM - 5 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MICHAEL COLUCCI/
Primary Examiner, Art Unit 2655
(571) 270-1847
Examiner FAX: (571) 270-2847
Michael.Colucci@uspto.gov
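
For readers mapping the rejection back onto the claim, a purely illustrative Python skeleton of the system recited in claim 1 follows. Every name in it is invented for this sketch and is not drawn from the application's disclosure, from ZHAO, or from the examiner's analysis.

```python
# Illustrative skeleton of the elements recited in claim 1; one field per
# recited component. All names are invented here, not the application's.
from dataclasses import dataclass, field

@dataclass
class SpeechSynthesizer:
    speech_analysis_module: object      # formulates the content of an utterance
    emotional_module: object            # performs emotional modeling of the utterance
    neural_network: object              # AI system; both modules connect to it
    microphone: object                  # captures and records speech
    speaker: object                     # reproduces the synthetic speech
    memory: list = field(default_factory=list)  # stores recordings as acoustic data

    def process_stored_audio(self) -> None:
        """Processor role per the claim: receive, analyze, and process the
        acoustic data in memory, with the network supplying a suggestion for
        the emotional modeling of the utterance's content."""
        for acoustic_data in self.memory:
            ...  # analysis -> emotional modeling -> synthesis -> playback
```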

Prosecution Timeline

Aug 28, 2024
Application Filed
Feb 24, 2026
Non-Final Rejection — §102, §103 (current)
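
The gap between the two docketed events above works out to roughly a year and a half; a quick check with the standard library:

```python
# Pendency from filing to first Office action, using the dates above.
from datetime import date

filed    = date(2024, 8, 28)   # Application Filed
first_oa = date(2026, 2, 24)   # Non-Final Rejection

days = (first_oa - filed).days
print(days, "days, ~", round(days / 30.44, 1), "months")  # 545 days, ~ 17.9 months
```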

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592240
ENCODING AND DECODING OF ACOUSTIC ENVIRONMENT
2y 5m to grant · Granted Mar 31, 2026
Patent 12586570
CHUNK-WISE ATTENTION FOR LONGFORM ASR
2y 5m to grant · Granted Mar 24, 2026
Patent 12573405
WORD CORRECTION USING AUTOMATIC SPEECH RECOGNITION (ASR) INCREMENTAL RESPONSE
2y 5m to grant · Granted Mar 10, 2026
Patent 12573380
MANAGING AMBIGUOUS DATE MENTIONS IN TRANSFORMING NATURAL LANGUAGE TO A LOGICAL FORM
2y 5m to grant · Granted Mar 10, 2026
Patent 12567414
SYSTEM AND METHOD FOR DETECTING A WAKEUP COMMAND FOR A VOICE ASSISTANT
2y 5m to grant · Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview: 91% (+15.3%)
Median Time to Grant: 3y 1m
PTA Risk: Low
Based on 990 resolved cases by this examiner. Grant probability derived from career allow rate.
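
The with-interview figure composes from the other two numbers on this card: base grant probability plus interview lift, capped at 100%. A sketch assuming that simple additive model (the dashboard's actual method is not stated):

```python
# With-interview projection as base probability plus interview lift,
# assuming a simple additive model (the dashboard's method is not published).
base_grant_prob = 0.757        # career allow rate, 749 / 990
interview_lift  = 0.153        # "+15.3%" from the examiner card

with_interview = min(base_grant_prob + interview_lift, 1.0)
print(f"With interview: {with_interview:.0%}")   # With interview: 91%
```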
