Prosecution Insights
Last updated: April 19, 2026
Application No. 17/580,289

Text-to-Speech Adapted by Machine Learning

Status: Non-Final OA (§103)
Filed: Jan 20, 2022
Examiner: ARMSTRONG, ANGELA A
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: SoundHound AI IP, LLC
OA Round: 3 (Non-Final)

Grant Probability: 75% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 11m
Grant Probability with Interview: 84%

Examiner Intelligence

Career Allow Rate: 75% (478 granted / 641 resolved; +12.6% vs Tech Center average, above average)
Interview Lift: +9.5% among resolved cases with an interview (a moderate, roughly +10% lift)
Typical Timeline: 3y 11m average prosecution; 25 applications currently pending
Career History: 666 total applications across all art units
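
For readers who want to sanity-check the headline figures, here is a minimal Python sketch assuming the dashboard simply divides grants by resolved cases and adds the interview lift; the product's actual methodology is not documented here.

```python
# Minimal sketch: reproduce the headline figures from the raw counts above.
# Assumes a simple ratio plus an additive interview lift; the dashboard's
# real methodology may differ.

def allow_rate_pct(granted: int, resolved: int) -> float:
    """Career allow rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved

base = allow_rate_pct(478, 641)   # 74.6 -> displayed as 75%
with_interview = base + 9.5       # 84.1 -> displayed as 84%

print(f"Career allow rate: {base:.1f}%")            # 74.6%
print(f"With interview:    {with_interview:.1f}%")  # 84.1%
```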

Statute-Specific Performance

§101: 21.9% (-18.1% vs TC avg)
§103: 43.7% (+3.7% vs TC avg)
§102: 14.8% (-25.2% vs TC avg)
§112: 7.7% (-32.3% vs TC avg)

Tech Center averages shown are a single estimate; based on career data from 641 resolved cases.
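
Each delta is the examiner's rate minus the Tech Center average, so the implied averages can be recovered by subtraction. A small sketch (the subtraction is grounded in the numbers above; the interpretation of the percentages themselves is not specified by the dashboard):

```python
# Recover the implied Tech Center averages: TC avg = examiner rate - delta.
stats = {
    "§101": (21.9, -18.1),
    "§103": (43.7, +3.7),
    "§102": (14.8, -25.2),
    "§112": (7.7, -32.3),
}
for statute, (rate, delta) in stats.items():
    print(f"{statute}: examiner {rate:.1f}% vs TC avg {rate - delta:.1f}%")
# All four rows recover the same 40.0% figure, consistent with the note
# that the Tech Center average shown is a single estimate.
```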

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on September 4, 2025 has been entered. Claims 1, 4, 9, and 14 have been amended. Claims 1-20 remain pending.

Claim Rejections - 35 USC § 103

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

Claims 1-19 are rejected under 35 U.S.C. 103 as being unpatentable over Kirsch et al. (US Patent Application Publication No. 2010/0057465), hereinafter Kirsch, in view of Flores et al. (US Patent Application Publication No. 2017/0244834), hereinafter Flores. Kirsch discloses variable text-to-speech for an automotive application.

Regarding claim 1, Kirsch teaches a method of speech synthesis [Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031], the method comprising: producing a TTS prosody parameter according to a model [TTS parameters & TTS Tuning Model -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; synthesizing digital audio samples of speech, an attribute of which depends upon the TTS prosody parameter [TTS Speech Synthesizer -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; and driving a speaker to produce audio as represented by the digital audio samples, wherein the prosody parameter can change at run time for more dynamic effects [para 0025 -- TTS audio stream played to the driver based on the current state of the vehicle/environment; Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Kirsch fails to teach processing data relating to at least one of an attribute of a listener and a profile for the listener, or that the prosody parameter is based on the processed data, wherein the TTS prosody parameters computed by the model comprise a machine learning algorithm trained using historical listener attribute and behavior data.

In a similar field of endeavor, Flores [para 0022; 0039; 0044-0054] teaches a virtual voice response agent individually configured for a user, in which the voice response agent 136 can be customized such that the TTS engine 122 provides synthesized speech having characteristics [from data gathered in the user profile and user interactions] that are tailored to the user 150 and the user's sentiments on the present call, so as to present a speech personality matching the user's personality traits and speech patterns and appropriate for the user's current sentiment ["listener's responses"]. Flores provides processes to identify speech traits based on pattern recognition, including hidden Markov models, neural networks, pattern matching, frequency estimation, mixed models, and deep learning [para 0040], and provides a custom agent manager 124 that can collect and analyze data pertaining to VIVR agent features 134 selected by various users 150-154. Based on such analysis, the custom agent manager 124 can learn how different users select different VIVR agent features 134. From time to time, the custom agent manager 124 can automatically update one or more baseline VIVR agent profiles 132 to implement VIVR agent features [para 0065], and Flores specifically teaches that the system's user interaction analytics can identify virtual intelligent agent features shown to be more effective in satisfying users [para 0052]. Therefore, one having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the user interaction analytics to generate user-tailored speech synthesis, as suggested by Flores, in the TTS tuning module system of Kirsch; the results would have been predictable and would provide speech synthesis that is more effective in satisfying the user, as suggested by Flores.

Regarding claim 2, the combination of Kirsch and Flores teaches that processing the sensor signal determines a value of a situational attribute [sensor interface -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' user interaction analytics], and producing the TTS prosody parameter is in dependence upon the value of the situational attribute [TTS parameters & TTS Tuning Module -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' user interaction analytics].

Regarding claim 3, the combination of Kirsch and Flores teaches that the dependence upon the value of the situational attribute is programmable using text rules [TTS parameters & TTS Tuning Modules -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' user interaction analytics].

Regarding claim 4, Kirsch teaches a method of speech synthesis [Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031], the method comprising: processing a sensor signal of a vehicle to determine a value of a situational attribute [312; 320]; producing a TTS parameter according to a model in dependence upon a value of the situational attribute [TTS parameters & TTS Tuning Model -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; synthesizing digital audio samples of speech based on input text such that an attribute of the digital audio samples depends upon the TTS parameter [TTS Speech Synthesizer -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; and driving a speaker to produce audio as represented by the digital audio samples [para 0025 -- TTS audio stream played to the driver based on the current state of the vehicle/environment; Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Kirsch fails to teach processing data relating to at least one of an attribute of a listener and a profile for the listener to, in part, determine the value of the situational attribute, wherein the model utilizes two or more listener profile features selected from the group consisting of age, gender, emotional state, and linguistic background.
In a similar field of endeavor, Flores [para 0022; 0039; 0044-0054] teaches a virtual voice response agent individually configured for a user, in which the voice response agent 136 can be customized such that the TTS engine 122 provides synthesized speech having characteristics [from data gathered in the user profile and user interactions -- including a language spoken by the user 150, a dialect spoken by the user 150, a particular accent of the user's speech, vocabulary, language and colloquialisms used by the user 150, the user's speech rate or speech tempo, a gender corresponding to the user's tone of voice, and a sentiment of the user 150] that are tailored to the user 150 and the user's sentiments on the present call, so as to present a speech personality matching the user's personality traits and speech patterns and appropriate for the user's current sentiment ["listener's responses"]. Flores provides processes to identify speech traits based on pattern recognition, including hidden Markov models, neural networks, pattern matching, frequency estimation, mixed models, and deep learning [para 0040], and provides a custom agent manager 124 that can collect and analyze data pertaining to VIVR agent features 134 selected by various users 150-154. Based on such analysis, the custom agent manager 124 can learn how different users select different VIVR agent features 134. From time to time, the custom agent manager 124 can automatically update one or more baseline VIVR agent profiles 132 to implement VIVR agent features [para 0065], and Flores specifically teaches that the system's user interaction analytics can identify virtual intelligent agent features shown to be more effective in satisfying users [para 0052]. Therefore, one having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the user interaction analytics to generate user-tailored speech synthesis, as suggested by Flores, in the TTS tuning module system of Kirsch; the results would have been predictable and would provide speech synthesis that is more effective in satisfying the user, as suggested by Flores.

Regarding claim 5, the combination of Kirsch and Flores teaches that the dependence upon the value of the situational attribute is programmable using text rules [TTS parameters & TTS Tuning Modules -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Regarding claim 6, the combination of Kirsch and Flores teaches that the TTS parameter represents a prosody attribute [TTS parameters & TTS Tuning Modules...pitch, speed, volume -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' user interaction analytics].

Regarding claim 7, the combination of Kirsch and Flores teaches that prosody can be changed at run time for more dynamic effects [TTS speed and volume change with changing vehicle speed -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' real-time user interaction analytics].

Regarding claim 8, the combination of Kirsch and Flores teaches synthesizing the digital audio samples of speech such that the prosody attribute is further based on markup in the input text [TTS parameters & TTS Tuning Modules -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].
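To make claim 4's disputed limitation concrete, a model mapping two or more listener profile features (age, gender, emotional state, linguistic background) to TTS output, here is a minimal, hypothetical Python sketch. All names and values are illustrative; nothing below is taken from Kirsch, Flores, or the application.

```python
# Hypothetical illustration of claim 4's limitation: a model that consumes
# two or more listener-profile features and emits TTS prosody parameters.
# Illustrative only; not code from Kirsch, Flores, or the application.

from dataclasses import dataclass

@dataclass
class ListenerProfile:
    age: int
    gender: str
    emotional_state: str        # e.g. "calm", "frustrated"
    linguistic_background: str  # e.g. "en-US", "en-IN"

@dataclass
class Prosody:
    rate: float    # relative speaking rate (1.0 = default)
    pitch: float   # relative pitch
    volume: float  # relative volume

def prosody_from_profile(p: ListenerProfile) -> Prosody:
    """Toy stand-in for a trained model: profile features in, prosody out."""
    rate = 0.85 if p.age >= 70 else 1.0                        # slower for older listeners
    volume = 1.2 if p.emotional_state == "frustrated" else 1.0
    return Prosody(rate=rate, pitch=1.0, volume=volume)

print(prosody_from_profile(ListenerProfile(
    age=72, gender="f", emotional_state="calm", linguistic_background="en-US")))
# Prosody(rate=0.85, pitch=1.0, volume=1.0)
```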
Regarding claim 9, Kirsch teaches a method of speech synthesis [Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031], the method comprising: processing a sensor signal of a vehicle [312; 320]; producing a TTS parameter according to a model [TTS parameters & TTS Tuning Model -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; synthesizing digital audio samples of speech, an attribute of which depends upon the TTS parameter [TTS Speech Synthesizer -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; and driving a speaker to produce audio as represented by the digital audio samples [para 0025 -- TTS audio stream played to the driver based on the current state of the vehicle/environment; Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Kirsch fails to teach processing data relating to at least one of an attribute of a listener and a profile for the listener, or that the prosody parameter is based on the processed data and the sensor signal from the vehicle, wherein the TTS prosody parameters computed by the model comprise a machine learning algorithm trained using historical listener attribute and behavior data.

In a similar field of endeavor, Flores [para 0022; 0039; 0044-0054] teaches a virtual voice response agent individually configured for a user, in which the voice response agent 136 can be customized such that the TTS engine 122 provides synthesized speech having characteristics that are tailored to the user 150 and the user's sentiments on the present call, so as to present a speech personality matching the user's personality traits and speech patterns and appropriate for the user's current sentiment ["listener's responses"], and specifically teaches that the system's user interaction analytics can identify virtual intelligent agent features shown to be more effective in satisfying users [para 0052]. Therefore, one having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the user interaction analytics to generate user-tailored speech synthesis, as suggested by Flores, in the system of Kirsch; the results would have been predictable, would provide speech synthesis that is more effective in satisfying the user, as suggested by Flores, and would subsequently modify the vehicle-sensor-based synthesized speech output so as to ensure the speech is intelligible to the user and the user is able to ascertain the important content of the speech.

Regarding claim 10, the combination of Kirsch and Flores teaches that processing the sensor signal determines a value of a situational attribute [sensor interface -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' sentiment processing], and producing the TTS parameter is in dependence upon the value of the situational attribute [TTS parameters & TTS Tuning Module -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' sentiment processing].

Regarding claim 11, the combination of Kirsch and Flores teaches that the dependence upon the value of the situational attribute is programmable using text rules [TTS parameters & TTS Tuning Modules -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Regarding claim 12, the combination of Kirsch and Flores teaches that the TTS parameter represents a prosody attribute [TTS parameters & TTS Tuning Modules...pitch, speed, volume -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' user interaction analytics].
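The crux of claim 9 as argued is that the prosody parameter depends on both the processed listener data and the vehicle sensor signal. A minimal, hypothetical sketch of that combination follows; the names and constants are invented for illustration.

```python
# Hypothetical sketch of claim 9's combination: the prosody parameter is a
# function of listener data AND a vehicle sensor signal. Names and constants
# are invented for illustration.

def situational_urgency(speed_kmh: float) -> float:
    """Processed sensor signal -> situational attribute in [0, 1]."""
    return min(1.0, speed_kmh / 130.0)

def prosody_rate(listener_base_rate: float, urgency: float) -> float:
    """Scale the listener's preferred speaking rate with driving urgency."""
    return listener_base_rate * (1.0 + 0.25 * urgency)

# Listener prefers slightly slow speech; car is at highway speed.
print(round(prosody_rate(0.9, situational_urgency(110.0)), 3))  # 1.09
```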
Regarding claim 13, the combination of Kirsch and Flores teaches that prosody can change at run time for more dynamic effects [TTS speed and volume change with changing vehicle speed -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' real-time synthesis parameter updates].

Regarding claim 14, Kirsch teaches a text-to-speech system [Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031], comprising a computer processor programmed to: perform machine-learned parametric speech synthesis using a TTS voice parameter [TTS speech synthesis based on TTS parameters & TTS Tuning Modules -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; and produce the TTS voice parameter by a function that transforms at least one voice attribute and at least one situational attribute according to a model [TTS parameters & TTS Tuning Model -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031], wherein the model has a specified rule-based algorithm coded with a user text rule [TTS parameters & TTS Tuning Modules for synthesis based on driver preferences -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Kirsch fails to specifically teach measuring how listeners respond to the TTS voice parameter and improving the machine-learned parametric speech synthesis based on how listeners respond to the TTS voice parameter.

In a similar field of endeavor, Flores [para 0022; 0039; 0044-0054] teaches a virtual voice response agent individually configured for a user, in which the voice response agent 136 can be customized such that the TTS engine 122 provides synthesized speech having characteristics that are tailored to the user 150 and the user's sentiments on the present call, so as to present a speech personality matching the user's personality traits and speech patterns and appropriate for the user's current sentiment ["listener's responses"], and specifically teaches that the system's user interaction analytics can identify virtual intelligent agent features shown to be more effective in satisfying users [para 0052]. Therefore, one having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the user interaction analytics to generate user-tailored speech synthesis, as suggested by Flores, in the system of Kirsch; the results would have been predictable and would provide speech synthesis that is more effective in satisfying the user, as suggested by Flores.

Regarding claim 15, the combination of Kirsch and Flores teaches that the user text rule depends on situational attributes [TTS parameters & TTS Tuning Modules for synthesis based on driver preferences -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' sentiment processing].

Regarding claim 16, the combination of Kirsch and Flores teaches that a situational attribute is noise level [interior noise sensor (208); Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Regarding claim 17, the combination of Kirsch and Flores teaches that the user text rule depends on noise level [interior noise sensor (208); Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Regarding claim 18, the combination of Kirsch and Flores teaches that the TTS voice parameter is related to volume [TTS parameters & TTS Tuning Modules...pitch, speed, volume -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].
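Claims 14-18 turn on a rule-based model "coded with a user text rule" that reacts to situational attributes such as cabin noise. Here is a minimal, hypothetical interpreter for such rules; the rule syntax below is invented for illustration and does not come from either reference.

```python
# Hypothetical sketch of claims 14-18: user-authored text rules that map a
# situational attribute (noise level) to a TTS voice parameter (volume).
# The rule syntax is invented for illustration.

RULES = [
    "if noise_db > 70 then volume = 1.4",
    "if noise_db > 60 then volume = 1.2",
]

def apply_text_rules(rules: list[str], attrs: dict[str, float]) -> dict[str, float]:
    """Evaluate 'if <attr> > <num> then <param> = <num>' rules; the first
    matching rule wins for each parameter."""
    params: dict[str, float] = {}
    for rule in rules:
        _, attr, _, thresh, _, param, _, value = rule.split()
        if attrs.get(attr, 0.0) > float(thresh) and param not in params:
            params[param] = float(value)
    return params

print(apply_text_rules(RULES, {"noise_db": 65.0}))  # {'volume': 1.2}
print(apply_text_rules(RULES, {"noise_db": 75.0}))  # {'volume': 1.4}
```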
Regarding claim 19, the combination of Kirsch and Flores teaches that the at least one situational attribute is a noise level [interior noise sensor (208); Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' sentiment processing]; the TTS voice parameter is related to volume [TTS parameters & TTS Tuning Modules...pitch, speed, volume -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; the user text rule depends on a value of the at least one situational attribute [TTS parameters & TTS Tuning Modules for synthesis based on driver preferences -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; and the value of the at least one situational attribute is received from a sensor in a vehicle [sensor interface -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANGELA A ARMSTRONG, whose telephone number is (571) 272-7598. The examiner can normally be reached M, T, Th, F 11:30-8:00.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

ANGELA A. ARMSTRONG
Primary Examiner, Art Unit 2659

/ANGELA A ARMSTRONG/
Primary Examiner, Art Unit 2659

Prosecution Timeline

Jan 20, 2022
Application Filed
Oct 19, 2024
Non-Final Rejection — §103
Feb 24, 2025
Response Filed
May 31, 2025
Final Rejection — §103
Sep 04, 2025
Request for Continued Examination
Sep 08, 2025
Response after Non-Final Action
Dec 13, 2025
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by the same examiner involving similar technology

Patent 12602547: DOMAIN ADAPTING GRAPH NETWORKS FOR VISUALLY RICH DOCUMENTS
Granted Apr 14, 2026 (2y 5m to grant)

Patent 12596879: METHOD AND SYSTEM FOR IDENTIFYING CITATIONS WITHIN REGULATORY CONTENT
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12585892: AUTO-TRANSLATION OF CUSTOMIZED ASSISTANT
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12555491: Inclusive Intelligence for Digital Workplace
Granted Feb 17, 2026 (2y 5m to grant)

Patent 12547843: SYSTEMS AND METHODS FOR GENERALIZED ENTITY MATCHING
Granted Feb 10, 2026 (2y 5m to grant)
Study what changed in these cases to get past this examiner. Based on the examiner's 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 75%
With Interview: 84% (+9.5%)
Median Time to Grant: 3y 11m
PTA Risk: High

Based on 641 resolved cases by this examiner. Grant probability is derived from the career allow rate.
