Prosecution Insights
Last updated: April 19, 2026
Application No. 17/580,289

Text-to-Speech Adapted by Machine Learning

Status: Non-Final OA (§103)
Filed: Jan 20, 2022
Examiner: ARMSTRONG, ANGELA A
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: SoundHound AI IP, LLC
OA Round: 3 (Non-Final)

Grant Probability: 75% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 11m
Grant Probability with Interview: 84%

Examiner Intelligence

Career Allow Rate: 75% (478 granted / 641 resolved; +12.6% vs Tech Center average, above average)
Interview Lift: +9.5% among resolved cases with an interview (a moderate, roughly +10% lift)
Typical Timeline: 3y 11m average prosecution; 25 applications currently pending
Career History: 666 total applications across all art units
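
For readers who want to sanity-check the headline figures, here is a minimal Python sketch assuming the dashboard simply divides grants by resolved cases and adds the interview lift; the product's actual methodology is not documented here.

```python
# Minimal sketch: reproduce the headline figures from the raw counts above.
# Assumes a simple ratio plus an additive interview lift; the dashboard's
# real methodology may differ.

def allow_rate_pct(granted: int, resolved: int) -> float:
    """Career allow rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved

base = allow_rate_pct(478, 641)   # 74.6 -> displayed as 75%
with_interview = base + 9.5       # 84.1 -> displayed as 84%

print(f"Career allow rate: {base:.1f}%")            # 74.6%
print(f"With interview:    {with_interview:.1f}%")  # 84.1%
```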

Statute-Specific Performance

§101: 21.9% (-18.1% vs TC avg)
§103: 43.7% (+3.7% vs TC avg)
§102: 14.8% (-25.2% vs TC avg)
§112: 7.7% (-32.3% vs TC avg)

Tech Center averages shown are a single estimate; based on career data from 641 resolved cases.
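
Each delta is the examiner's rate minus the Tech Center average, so the implied averages can be recovered by subtraction. A small sketch (the subtraction is grounded in the numbers above; the interpretation of the percentages themselves is not specified by the dashboard):

```python
# Recover the implied Tech Center averages: TC avg = examiner rate - delta.
stats = {
    "§101": (21.9, -18.1),
    "§103": (43.7, +3.7),
    "§102": (14.8, -25.2),
    "§112": (7.7, -32.3),
}
for statute, (rate, delta) in stats.items():
    print(f"{statute}: examiner {rate:.1f}% vs TC avg {rate - delta:.1f}%")
# All four rows recover the same 40.0% figure, consistent with the note
# that the Tech Center average shown is a single estimate.
```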

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on September 4, 2025 has been entered. Claims 1, 4, 9, and 14 have been amended. Claims 1-20 remain pending.

Claim Rejections - 35 USC § 103

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

Claims 1-19 are rejected under 35 U.S.C. 103 as being unpatentable over Kirsch et al. (US Patent Application Publication No. 2010/0057465), hereinafter Kirsch, in view of Flores et al. (US Patent Application Publication No. 2017/0244834), hereinafter Flores. Kirsch discloses variable text-to-speech for an automotive application.

Regarding claim 1, Kirsch teaches a method of speech synthesis [Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031], the method comprising: producing a TTS prosody parameter according to a model [TTS parameters & TTS Tuning Model -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; synthesizing digital audio samples of speech, an attribute of which depends upon the TTS prosody parameter [TTS Speech Synthesizer -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; and driving a speaker to produce audio as represented by the digital audio samples, wherein the prosody parameter can change at run time for more dynamic effects [para 0025 -- TTS audio stream played to the driver based on the current state of the vehicle/environment; Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Kirsch fails to teach processing data relating to at least one of an attribute of a listener and a profile for the listener, or that the prosody parameter is based on the processed data, wherein the TTS prosody parameters computed by the model comprise a machine learning algorithm trained using historical listener attribute and behavior data.

In a similar field of endeavor, Flores [para 0022; 0039; 0044-0054] teaches a virtual voice response agent individually configured for a user, in which the voice response agent 136 can be customized such that the TTS engine 122 provides synthesized speech having characteristics [from data gathered in the user profile and user interactions] that are tailored to the user 150 and the user's sentiments on the present call, so as to present a speech personality matching the user's personality traits and speech patterns and appropriate for the user's current sentiment ["listener's responses"]. Flores provides processes to identify speech traits based on pattern recognition, including hidden Markov models, neural networks, pattern matching, frequency estimation, mixed models, and deep learning [para 0040], and provides a custom agent manager 124 that can collect and analyze data pertaining to VIVR agent features 134 selected by various users 150-154. Based on such analysis, the custom agent manager 124 can learn how different users select different VIVR agent features 134. From time to time, the custom agent manager 124 can automatically update one or more baseline VIVR agent profiles 132 to implement VIVR agent features [para 0065], and Flores specifically teaches that the system's user interaction analytics can identify virtual intelligent agent features shown to be more effective in satisfying users [para 0052]. Therefore, one having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the user interaction analytics to generate user-tailored speech synthesis, as suggested by Flores, in the TTS tuning module system of Kirsch; the results would have been predictable and would provide speech synthesis that is more effective in satisfying the user, as suggested by Flores.

Regarding claim 2, the combination of Kirsch and Flores teaches that processing the sensor signal determines a value of a situational attribute [sensor interface -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' user interaction analytics], and producing the TTS prosody parameter is in dependence upon the value of the situational attribute [TTS parameters & TTS Tuning Module -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' user interaction analytics].

Regarding claim 3, the combination of Kirsch and Flores teaches that the dependence upon the value of the situational attribute is programmable using text rules [TTS parameters & TTS Tuning Modules -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' user interaction analytics].

Regarding claim 4, Kirsch teaches a method of speech synthesis [Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031], the method comprising: processing a sensor signal of a vehicle to determine a value of a situational attribute [312; 320]; producing a TTS parameter according to a model in dependence upon a value of the situational attribute [TTS parameters & TTS Tuning Model -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; synthesizing digital audio samples of speech based on input text such that an attribute of the digital audio samples depends upon the TTS parameter [TTS Speech Synthesizer -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; and driving a speaker to produce audio as represented by the digital audio samples [para 0025 -- TTS audio stream played to the driver based on the current state of the vehicle/environment; Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Kirsch fails to teach processing data relating to at least one of an attribute of a listener and a profile for the listener to, in part, determine the value of the situational attribute, wherein the model utilizes two or more listener profile features selected from the group consisting of age, gender, emotional state, and linguistic background.
In a similar field of endeavor, Flores [para 0022; 0039; 0044-0054] teaches a virtual voice response agent individually configured for a user, in which the voice response agent 136 can be customized such that the TTS engine 122 provides synthesized speech having characteristics [from data gathered in the user profile and user interactions -- including a language spoken by the user 150, a dialect spoken by the user 150, a particular accent of the user's speech, vocabulary, language and colloquialisms used by the user 150, the user's speech rate or speech tempo, a gender corresponding to the user's tone of voice, and a sentiment of the user 150] that are tailored to the user 150 and the user's sentiments on the present call, so as to present a speech personality matching the user's personality traits and speech patterns and appropriate for the user's current sentiment ["listener's responses"]. Flores provides processes to identify speech traits based on pattern recognition, including hidden Markov models, neural networks, pattern matching, frequency estimation, mixed models, and deep learning [para 0040], and provides a custom agent manager 124 that can collect and analyze data pertaining to VIVR agent features 134 selected by various users 150-154. Based on such analysis, the custom agent manager 124 can learn how different users select different VIVR agent features 134. From time to time, the custom agent manager 124 can automatically update one or more baseline VIVR agent profiles 132 to implement VIVR agent features [para 0065], and Flores specifically teaches that the system's user interaction analytics can identify virtual intelligent agent features shown to be more effective in satisfying users [para 0052]. Therefore, one having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the user interaction analytics to generate user-tailored speech synthesis, as suggested by Flores, in the TTS tuning module system of Kirsch; the results would have been predictable and would provide speech synthesis that is more effective in satisfying the user, as suggested by Flores.

Regarding claim 5, the combination of Kirsch and Flores teaches that the dependence upon the value of the situational attribute is programmable using text rules [TTS parameters & TTS Tuning Modules -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Regarding claim 6, the combination of Kirsch and Flores teaches that the TTS parameter represents a prosody attribute [TTS parameters & TTS Tuning Modules...pitch, speed, volume -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' user interaction analytics].

Regarding claim 7, the combination of Kirsch and Flores teaches that prosody can be changed at run time for more dynamic effects [TTS speed and volume change with changing vehicle speed -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' real-time user interaction analytics].

Regarding claim 8, the combination of Kirsch and Flores teaches synthesizing the digital audio samples of speech such that the prosody attribute is further based on markup in the input text [TTS parameters & TTS Tuning Modules -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].
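To make claim 4's disputed limitation concrete, a model mapping two or more listener profile features (age, gender, emotional state, linguistic background) to TTS output, here is a minimal, hypothetical Python sketch. All names and values are illustrative; nothing below is taken from Kirsch, Flores, or the application.

```python
# Hypothetical illustration of claim 4's limitation: a model that consumes
# two or more listener-profile features and emits TTS prosody parameters.
# Illustrative only; not code from Kirsch, Flores, or the application.

from dataclasses import dataclass

@dataclass
class ListenerProfile:
    age: int
    gender: str
    emotional_state: str        # e.g. "calm", "frustrated"
    linguistic_background: str  # e.g. "en-US", "en-IN"

@dataclass
class Prosody:
    rate: float    # relative speaking rate (1.0 = default)
    pitch: float   # relative pitch
    volume: float  # relative volume

def prosody_from_profile(p: ListenerProfile) -> Prosody:
    """Toy stand-in for a trained model: profile features in, prosody out."""
    rate = 0.85 if p.age >= 70 else 1.0                        # slower for older listeners
    volume = 1.2 if p.emotional_state == "frustrated" else 1.0
    return Prosody(rate=rate, pitch=1.0, volume=volume)

print(prosody_from_profile(ListenerProfile(
    age=72, gender="f", emotional_state="calm", linguistic_background="en-US")))
# Prosody(rate=0.85, pitch=1.0, volume=1.0)
```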
Regarding claim 9, Kirsch teaches a method of speech synthesis [Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031], the method comprising: processing a sensor signal of a vehicle [312; 320]; producing a TTS parameter according to a model [TTS parameters & TTS Tuning Model -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; synthesizing digital audio samples of speech, an attribute of which depends upon the TTS parameter [TTS Speech Synthesizer -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; and driving a speaker to produce audio as represented by the digital audio samples [para 0025 -- TTS audio stream played to the driver based on the current state of the vehicle/environment; Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Kirsch fails to teach processing data relating to at least one of an attribute of a listener and a profile for the listener, or that the prosody parameter is based on the processed data and the sensor signal from the vehicle, wherein the TTS prosody parameters computed by the model comprise a machine learning algorithm trained using historical listener attribute and behavior data.

In a similar field of endeavor, Flores [para 0022; 0039; 0044-0054] teaches a virtual voice response agent individually configured for a user, in which the voice response agent 136 can be customized such that the TTS engine 122 provides synthesized speech having characteristics that are tailored to the user 150 and the user's sentiments on the present call, so as to present a speech personality matching the user's personality traits and speech patterns and appropriate for the user's current sentiment ["listener's responses"], and specifically teaches that the system's user interaction analytics can identify virtual intelligent agent features shown to be more effective in satisfying users [para 0052]. Therefore, one having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the user interaction analytics to generate user-tailored speech synthesis, as suggested by Flores, in the system of Kirsch; the results would have been predictable, would provide speech synthesis that is more effective in satisfying the user, as suggested by Flores, and would subsequently modify the vehicle-sensor-based synthesized speech output so as to ensure the speech is intelligible to the user and the user is able to ascertain the important content of the speech.

Regarding claim 10, the combination of Kirsch and Flores teaches that processing the sensor signal determines a value of a situational attribute [sensor interface -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' sentiment processing], and producing the TTS parameter is in dependence upon the value of the situational attribute [TTS parameters & TTS Tuning Module -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' sentiment processing].

Regarding claim 11, the combination of Kirsch and Flores teaches that the dependence upon the value of the situational attribute is programmable using text rules [TTS parameters & TTS Tuning Modules -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Regarding claim 12, the combination of Kirsch and Flores teaches that the TTS parameter represents a prosody attribute [TTS parameters & TTS Tuning Modules...pitch, speed, volume -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' user interaction analytics].
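The crux of claim 9 as argued is that the prosody parameter depends on both the processed listener data and the vehicle sensor signal. A minimal, hypothetical sketch of that combination follows; the names and constants are invented for illustration.

```python
# Hypothetical sketch of claim 9's combination: the prosody parameter is a
# function of listener data AND a vehicle sensor signal. Names and constants
# are invented for illustration.

def situational_urgency(speed_kmh: float) -> float:
    """Processed sensor signal -> situational attribute in [0, 1]."""
    return min(1.0, speed_kmh / 130.0)

def prosody_rate(listener_base_rate: float, urgency: float) -> float:
    """Scale the listener's preferred speaking rate with driving urgency."""
    return listener_base_rate * (1.0 + 0.25 * urgency)

# Listener prefers slightly slow speech; car is at highway speed.
print(round(prosody_rate(0.9, situational_urgency(110.0)), 3))  # 1.09
```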
Regarding claim 13, the combination of Kirsch and Flores teaches that prosody can change at run time for more dynamic effects [TTS speed and volume change with changing vehicle speed -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' real-time synthesis parameter updates].

Regarding claim 14, Kirsch teaches a text-to-speech system [Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031], comprising a computer processor programmed to: perform machine-learned parametric speech synthesis using a TTS voice parameter [TTS speech synthesis based on TTS parameters & TTS Tuning Modules -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; and produce the TTS voice parameter by a function that transforms at least one voice attribute and at least one situational attribute according to a model [TTS parameters & TTS Tuning Model -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031], wherein the model has a specified rule-based algorithm coded with a user text rule [TTS parameters & TTS Tuning Modules for synthesis based on driver preferences -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Kirsch fails to specifically teach measuring how listeners respond to the TTS voice parameter and improving the machine-learned parametric speech synthesis based on how listeners respond to the TTS voice parameter.

In a similar field of endeavor, Flores [para 0022; 0039; 0044-0054] teaches a virtual voice response agent individually configured for a user, in which the voice response agent 136 can be customized such that the TTS engine 122 provides synthesized speech having characteristics that are tailored to the user 150 and the user's sentiments on the present call, so as to present a speech personality matching the user's personality traits and speech patterns and appropriate for the user's current sentiment ["listener's responses"], and specifically teaches that the system's user interaction analytics can identify virtual intelligent agent features shown to be more effective in satisfying users [para 0052]. Therefore, one having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the user interaction analytics to generate user-tailored speech synthesis, as suggested by Flores, in the system of Kirsch; the results would have been predictable and would provide speech synthesis that is more effective in satisfying the user, as suggested by Flores.

Regarding claim 15, the combination of Kirsch and Flores teaches that the user text rule depends on situational attributes [TTS parameters & TTS Tuning Modules for synthesis based on driver preferences -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' sentiment processing].

Regarding claim 16, the combination of Kirsch and Flores teaches that a situational attribute is noise level [interior noise sensor (208); Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Regarding claim 17, the combination of Kirsch and Flores teaches that the user text rule depends on noise level [interior noise sensor (208); Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Regarding claim 18, the combination of Kirsch and Flores teaches that the TTS voice parameter is related to volume [TTS parameters & TTS Tuning Modules...pitch, speed, volume -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].
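Claims 14-18 turn on a rule-based model "coded with a user text rule" that reacts to situational attributes such as cabin noise. Here is a minimal, hypothetical interpreter for such rules; the rule syntax below is invented for illustration and does not come from either reference.

```python
# Hypothetical sketch of claims 14-18: user-authored text rules that map a
# situational attribute (noise level) to a TTS voice parameter (volume).
# The rule syntax is invented for illustration.

RULES = [
    "if noise_db > 70 then volume = 1.4",
    "if noise_db > 60 then volume = 1.2",
]

def apply_text_rules(rules: list[str], attrs: dict[str, float]) -> dict[str, float]:
    """Evaluate 'if <attr> > <num> then <param> = <num>' rules; the first
    matching rule wins for each parameter."""
    params: dict[str, float] = {}
    for rule in rules:
        _, attr, _, thresh, _, param, _, value = rule.split()
        if attrs.get(attr, 0.0) > float(thresh) and param not in params:
            params[param] = float(value)
    return params

print(apply_text_rules(RULES, {"noise_db": 65.0}))  # {'volume': 1.2}
print(apply_text_rules(RULES, {"noise_db": 75.0}))  # {'volume': 1.4}
```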
Regarding claim 19, the combination of Kirsch and Flores teaches that the at least one situational attribute is a noise level [interior noise sensor (208); Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031; Flores' sentiment processing]; the TTS voice parameter is related to volume [TTS parameters & TTS Tuning Modules...pitch, speed, volume -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; the user text rule depends on a value of the at least one situational attribute [TTS parameters & TTS Tuning Modules for synthesis based on driver preferences -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031]; and the value of the at least one situational attribute is received from a sensor in a vehicle [sensor interface -- Fig 2; Fig 3; Fig 7; Fig 8; para 0009-0012; 0022-0026; 0027-0031].

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANGELA A ARMSTRONG, whose telephone number is (571) 272-7598. The examiner can normally be reached M, T, Th, F 11:30-8:00.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

ANGELA A. ARMSTRONG
Primary Examiner, Art Unit 2659

/ANGELA A ARMSTRONG/
Primary Examiner, Art Unit 2659

Prosecution Timeline

Jan 20, 2022
Application Filed
Oct 19, 2024
Non-Final Rejection — §103
Feb 24, 2025
Response Filed
May 31, 2025
Final Rejection — §103
Sep 04, 2025
Request for Continued Examination
Sep 08, 2025
Response after Non-Final Action
Dec 13, 2025
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by the same examiner involving similar technology

Patent 12602547: DOMAIN ADAPTING GRAPH NETWORKS FOR VISUALLY RICH DOCUMENTS
Granted Apr 14, 2026 (2y 5m to grant)

Patent 12596879: METHOD AND SYSTEM FOR IDENTIFYING CITATIONS WITHIN REGULATORY CONTENT
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12585892: AUTO-TRANSLATION OF CUSTOMIZED ASSISTANT
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12555491: Inclusive Intelligence for Digital Workplace
Granted Feb 17, 2026 (2y 5m to grant)

Patent 12547843: SYSTEMS AND METHODS FOR GENERALIZED ENTITY MATCHING
Granted Feb 10, 2026 (2y 5m to grant)
Study what changed in these cases to get past this examiner. Based on the examiner's 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 75%
With Interview: 84% (+9.5%)
Median Time to Grant: 3y 11m
PTA Risk: High

Based on 641 resolved cases by this examiner. Grant probability is derived from the career allow rate.
