Last updated: May 29, 2026

Application No. 18/526,441

Interactive System Rendering Human Speaker Specified Expressions

Non-Final OA §103

Filed: Dec 01, 2023
Examiner: WOZNIAK, JAMES S
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: Disney Enterprises Inc.
OA Round: 3 (Non-Final)

This examiner grants 60% of resolved cases, with a +39.4% interview lift. A telephonic interview to clarify the technical implementation could significantly improve the outcome.

Based on 391 resolved cases, 2023–2026

Examiner Intelligence

WOZNIAK, JAMES S

Grants 60% of resolved cases

Career Allowance Rate

233 granted / 391 resolved

-2.4% vs TC avg

Strong +39% interview lift

Interview Lift: +39.4% (allowance without vs. with an interview, among resolved cases with interview)

Typical timeline

3y 8m

Avg Prosecution

21 currently pending

Career history

429

Total Applications

across all art units

Statute-Specific Performance

§101: 7.2% (-32.8% vs TC avg)
§103: 82.5% (+42.5% vs TC avg)
§102: 5.8% (-34.2% vs TC avg)
§112: 4.2% (-35.8% vs TC avg)

Tech Center average is an estimate. Based on career data from 391 resolved cases.

Office Action

§103

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

In response to the Final Office Action from 12/17/2025, Applicant has filed a Request for Continued Examination (RCE) on 3/11/2026.  In this reply, Applicant has amended independent claims 1, 8, and 15 to further specify that analysis of the audio segment of interest involves determining "stress, rhythm, and intonation of the audio segment of interest for use in replicating a pronunciation of the audio segment of interest by the human speaker," that converting the text string uses an added TTS module and the determined stress, rhythm, and intonation of the audio segment of interest to produce the replicating audio response, and include an additional playback step of the audio response of a non-human social agent replicating the human speaker using an added audio speaker.  New claims 22-26 have also been added.  
Applicant has also argued that the prior art of record fails to teach the limitations added via the instant amendment (Remarks, Pages 20-22).  These arguments have been fully considered but are moot in view of the new grounds of rejection, which further rely on Tischer (U.S. PG Publication: 2004/0111271 A1).

Applicant argues that the amendments to claim 19 resolve the indefiniteness rejections under 35 U.S.C. 112(b) (Remarks, Page 10).  In response to the correction of the antecedent basis issues in these claims, the 35 U.S.C. 112(b) rejection is now moot and has been withdrawn.

Applicant has presented a number of arguments traversing the patent subject matter eligibility rejection of claims 1, 3-8, 10-15, and 17-21 under 35 U.S.C. 101 in light of the amended claims entered via the instant RCE.  In particular, Applicant argues that the claims defined in independent claims 1, 8, and 15 are directed towards patent eligible subject matter because under step 2A prong 1 the amended independent claims, given the added prosodic analysis and hardware elements and speech synthesis technologies, cannot practically be performed in the human mind (Remarks, Pages 14-15), under step 2A prong 2 the amended claims relate to a practical application and improvement in human computer interaction (Remarks, Pages 16-17), and under step 2B the amended claim includes a non-conventional arrangement of components (Remarks, Pages 18-19).
In response, while it is maintained that the amended independent claims place no limits upon how stress, rhythm, and intonation are analyzed such that a human (e.g., a speech pathologist) can make observations on such voice features under the broadest reasonable interpretation (BRI) and that Applicant did not invent/improve computer automation of the manual process of transcribing speech and reading out text (i.e., ASR and STT), the amended claims do utilize the analyzed audio features to meaningfully synthesize audio that is then played back via a speaker that relates to a practical application for the identified abstract idea via synthesized speech playback of a non-human social agent replicating the human speaker pronunciation.  Accordingly, while most of Applicant's arguments are not found to be persuasive, the 35 U.S.C. 101 rejection has been withdrawn since the claims have been found to be eligible by applying the identified abstract mental process under the BRI to a practical application under Step 2A prong 2.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4, 8, 11, 15, 18, 22, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Peddinti, et al. (U.S. PG Publication:  2022/0284882 A1) in view of Tischer (U.S. PG Publication: 2004/0111271 A1).
With respect to Claim 1, Peddinti discloses:
A system comprising: 
a microphone (microphone for audio capture, Paragraphs 0026 and 0029); 
an audio speaker (speaker for communicating an audible audio signal, Paragraphs 0026 and 0035); 
an automatic speech recognition (ASR) sensor (ASR system/sensor relating acoustics to word confidence scores including a neural network and/or confidence models, Paragraphs 0029, 0049, 0057; Fig. 1, Element 140); 
a text-to-speech (TTS) module (text-to-speech (TTS) system/software, Paragraphs 0027 and 0035); and 
a speech-to-text (STT) module (speech-to-text software within the ASR system that uses confidence data to generate/assemble a transcription of the spoken user query, Paragraphs 0025, 0029, 0049, and 0057);
a computing platform having a hardware processor and a system memory storing a software code and a natural language understanding (NLU) model; the hardware processor configured to execute the software code to (computing device having a processor and memory that stores software code and an NLU model, Paragraphs 0021, 0029, and 0066-0067):
receive, via the microphone, an audio input, the audio input including speech by a human speaker (input speech data from a user is received at a digital assistant via a microphone, Paragraphs 0026 and 0029; Fig. 1, Element 12); 
produce, using the ASR sensor and the STT module, a machine generated text transcription of the audio input (Paragraph 0029- “generate, as output, a transcription 142 of the query 12”; utilizing the ASR system to generate the word confidence scores and STT software to generate the transcription, Paragraphs 0025, 0029, 0049, and 0057; note that the ASR system and STT software are machine based, thus the transcription is “machine generated” as claimed); 
identify, using the NLU model and the machine generated text transcription, an audio segment of interest (identification of (e.g., "the particular word" such as the name of a city referenced in a spoken user query) audio data in the spoken input using the NLU module/model, Paragraph 0021 and 0029-0031); 
analyze one or more audio characteristics of the feature of the audio segment of interest (obtaining audio related features in the form of "pronunciation-related features" for the particular term, Paragraphs 0031, 0040, and 0045);
generate, using the machine generated text transcription, a text string corresponding to the audio segment of interest (generating “a textual representation of the response to the query” with respect to the spoken user query audio data term of interest, Paragraphs 0029-0031); 
convert the text string, using the TTS module, to produce an audio response that replicates the pronunciation of the audio segment of the human speaker and that has a characteristic voice of a non-human social agent (use of the text-to-speech (TTS) system of the digital assistant relying upon a user pronunciation of the particular word in the query differs from that of the TTS pronunciation and modifies the response such that the digital assistant (i.e., non-human social agent) utters "the synthesized speech that approximates how a human would pronounce words formed by the sequence of graphemes/characters defining the TTS input 152 including the textual representation of the response to the query 12," Paragraphs 0027, 0031-0032 (i.e., the human speaker pronunciation is “replicated” because it is an approximation of the human pronunciation); see also paragraph 0027- "the TTS audio 154 pronounces the particular word using the one of the user pronunciation 202"); and
play, using the audio speaker, the audio response having the characteristic voice of the non-human social agent while replicating the pronunciation of the human speaker (audibly outputting the synthesized speech of the digital assistant via the speaker response to the user query relying upon the user pronunciation of the audio of interest, Paragraphs 0027, 0031-0032, and 0035).
While Peddinti teaches analyzing an audio segment of interest with respect to "pronunciation-related features" in generating a TTS output replicating a user's pronunciation by a digital assistant (see Paragraphs 0031-0032), Peddinti does not teach that analysis involves determining "stress, rhythm, and intonation of the audio segment of interest for use in replicating a pronunciation of the audio segment of interest by the human speaker" nor then converting the text string using those same "stress, rhythm, and intonation" features.  Tischer, however, discloses that speech samples recorded of interest from a person's "own voice file" are analyzed and then used for "customizing...text to speech" with parameters including "emphasis" (i.e., stress), "rhythm", and "intonations" in the voice input samples of words (Paragraphs 0009, 0034-0035, 0037, 0041, 0053, 0055, and 0061 (discussing the use of a person's "own voice file")).
Peddinti and Tischer are analogous art because they are from a similar field of endeavor in text-to-speech synthesis.  Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to utilize the user pronunciation of words with respect to various prosodic features as taught by Tischer to the word-specific pronunciations taught by Peddinti to provide a predictable result of providing synthesized voices that are recognizable and with greater clarity (Tischer, Paragraph 0023).
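For orientation only, the claim 1 data flow discussed above (analyze stress, rhythm, and intonation of the segment of interest, then hand those features to the TTS step together with the agent's own voice) can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's, Peddinti's, or Tischer's implementation: the `Syllable` and `Prosody` types, the energy-above-mean stress heuristic, the `plan_tts` stand-in for a TTS module, and the sample values for the place name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Syllable:
    """Hypothetical per-syllable measurements from the audio segment of interest."""
    text: str
    duration_s: float
    pitch_hz: float
    energy: float


@dataclass
class Prosody:
    stress: List[int]        # 1 = stressed syllable, 0 = unstressed
    rhythm: List[float]      # syllable durations in seconds
    intonation: List[float]  # pitch contour in Hz


def analyze_prosody(syllables: List[Syllable]) -> Prosody:
    """Assumed heuristic: a syllable counts as stressed if its energy exceeds the mean."""
    mean_energy = sum(s.energy for s in syllables) / len(syllables)
    return Prosody(
        stress=[1 if s.energy > mean_energy else 0 for s in syllables],
        rhythm=[s.duration_s for s in syllables],
        intonation=[s.pitch_hz for s in syllables],
    )


def plan_tts(text: str, prosody: Prosody, agent_voice: str) -> Dict[str, object]:
    """Stand-in for a TTS module: returns a render plan instead of audio samples."""
    return {
        "voice": agent_voice,  # characteristic voice of the non-human agent
        "text": text,          # text string for the segment of interest
        "stress": prosody.stress,
        "rhythm": prosody.rhythm,
        "intonation": prosody.intonation,
    }


# Hypothetical measurements for a place name spoken by the user.
segment = [
    Syllable("AN", 0.20, 180.0, 0.9),
    Syllable("a", 0.10, 160.0, 0.4),
    Syllable("heim", 0.25, 150.0, 0.6),
]
plan = plan_tts("Anaheim", analyze_prosody(segment), agent_voice="agent_default")
print(plan)
```

The point of the sketch is the separation the rejection turns on: the speaker contributes only the prosodic parameters, while the voice identity stays with the agent.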
With respect to Claim 4, Peddinti further discloses:
The system of claim 1, wherein the audio segment of interest comprises a first name, a surname, a nickname, a name of a pet, a place name, a brand name, or a company name (contact name or musical artist first and last names, Paragraph 0022 and 0057; name of a place (city/restaurant examples), Paragraphs 0030 and 0046).
Claim 8 is a method embodiment carrying out the functionality of the system set forth in claim 1, and thus, is rejected under similar rationale.
Claim 11 contains subject matter similar to claim 4, and thus, is rejected under similar rationale.
Claim 15 is a "computer-readable non-transitory medium" embodiment storing processor instructions for carrying out the functionality of system claim 1, and thus, is rejected under similar rationale.  Moreover, Peddinti further recites method implementation as a program stored on a non-transitory computer-readable medium (Paragraphs 0071-0072).
Claim 18 contains subject matter similar to claim 4, and thus, is rejected under similar rationale.
With respect to Claim 22, Peddinti further discloses:
The system of claim 1, wherein the audio segment of interest comprises a phoneme string (audio segment of interest including words, Paragraph 0027 and 0029; note that phonemes are a basic unit of speech so an audio input including words inherently comprises phonemes; also note that Tischer more explicitly discloses an audio input comprising phoneme strings in Paragraphs 0016 and 0036).
With respect to Claim 24, Peddinti and Tischer further disclose:
The system of claim 1, wherein the software code includes an audio input block, a segmentation block, an analysis block, an alignment block and a response generation block (software implementation including code sections (Paragraphs 0071-0072) include processing for receiving audio (Paragraph 0064), generating text segments from an audio input such as words (Paragraph 0029), and response generation (Paragraph 0064); note that alignment/phonemes for user prosody generation is taught by Tischer in Paragraphs 0034, 0036, and 0041).

Claims 5, 12, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Peddinti, et al. in view of Tischer and further in view of Ingel, et al. (U.S. PG Publication:  2025/0006182 A1).
With respect to Claim 5, Peddinti in view of Tischer teaches the system for outputting a response to a spoken user input relying upon a user pronunciation as applied to Claim 1.  Although Peddinti models TTS speech pronunciations of particular terms after user pronunciations, the particular terms discussed in Peddinti in view of Tischer do not refer to "a non-vocal sound produced by the human speaker" as set forth in claim 5.   Ingel, however, teaches a voice assistant voice clone that clones/mimics non-vocal sounds from a user such as laughter (Paragraphs 0061, 0174, 0405, and 0420).
Peddinti, Tischer, and Ingel are analogous art because they are from a similar field of endeavor in text-to-speech synthesis.  Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to include non-speech/verbal vocalizations taught by Ingel in the user pronunciations that are adapted into TTS taught by Peddinti in view of Tischer to provide a predictable result of producing a synthetic voice that better approximates a user by also including distinct non-verbal sounds like crying and laughter.
Claims 12 and 21 contain subject matter similar to claim 5, and thus, are rejected under similar rationale.

Claims 6, 13, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Peddinti, et al. in view of Tischer and further in view of Schaaf, et al. (U.S. Patent:  9,405,741 B1).
With respect to Claim 6, Peddinti in view of Tischer teaches the system for outputting a response to a spoken user input relying upon a user pronunciation as applied to Claim 1.  Peddinti in view of Tischer does not teach the substitute language replacement procedure for prohibited words in a reply as set forth in claim 6.  Schaaf, however, discloses:
a language database stored in the system memory, the language database including a list of prohibited words and a plurality of generic responses (database of terms that are prohibited/offensive and corresponding generic replacements (e.g., “Playing the requested song” instead of actually using the prohibited song title), Col. 3, Line 35- Col. 4, Line 17; Col. 9, Lines 7-14; Col. 15, Line 56- Col. 16, Line 30), wherein the hardware processor is further configured to execute the software code to:
determine whether the text string comprises a word in the list of prohibited words; select, based on the machine generated text transcription, a substitute response from among the plurality of generic responses; and replace the text string with the substitute response when producing the audio response (spoken request is transcribed with natural language understanding and when a word that is on the prohibited/offensive list is encountered, a generic substitution in the audio output response is provided (e.g., “Playing the requested song” instead of actually using the prohibited song title in a similar response), Col. 11, Lines 1-25; Col. 15, Line 56- Col. 16, Line 30).
Peddinti, Tischer, and Schaaf are analogous art because they are from a similar field of endeavor in text-to-speech synthesis in digital assistants.  Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to enable the digital assistant of Peddinti in view of Tischer to use the prohibited term substitution taught by Schaaf in response generation to provide a predictable result of improving the user experience by removing an output that the user may find to be inappropriate (Schaaf, Col. 2, Lines 26-34).
Claims 13 and 19 contain subject matter similar to claim 6, and thus, are rejected under similar rationale.
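The substitution logic recited in claim 6 (check the text string against a prohibited-word list, then swap in a generic response) can be illustrated with a short sketch. It is an assumption-laden example, not Schaaf's or the applicant's code: the word list, the intent keys, and the function names are hypothetical, and only the "Playing the requested song" wording comes from the example quoted in the rejection.

```python
import re
from typing import Dict, Set

# Hypothetical language database contents, for illustration only.
PROHIBITED: Set[str] = {"darn"}
GENERIC_RESPONSES: Dict[str, str] = {
    "play_song": "Playing the requested song.",
    "default": "Okay.",
}


def contains_prohibited(text: str, prohibited: Set[str]) -> bool:
    """Return True if any whole word of the text is on the prohibited list."""
    words = re.findall(r"[a-z']+", text.lower())
    return any(word in prohibited for word in words)


def build_response(text_string: str, intent: str) -> str:
    """Replace the text string with a generic response when it contains a prohibited word.

    The intent (assumed to come from the transcription via NLU) selects which
    generic response to use; unknown intents fall back to a default.
    """
    if contains_prohibited(text_string, PROHIBITED):
        return GENERIC_RESPONSES.get(intent, GENERIC_RESPONSES["default"])
    return text_string


print(build_response("Playing Darn It All", "play_song"))  # substituted
print(build_response("Playing Blue Skies", "play_song"))   # passed through
```

Matching on whole words rather than substrings avoids flagging innocent words that merely contain a prohibited string, which is one design choice a real system would need to make explicitly.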

Claims 7, 14, 20, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Peddinti, et al. in view of Tischer and further in view of Winter, et al. (U.S. PG Publication:  2019/0096387 A1; cited in the PTO-892 from 7/21/2025).
With respect to Claim 7, Peddinti in view of Tischer teaches the system for outputting a response to a spoken user input relying upon a user pronunciation as applied to Claim 1.  Peddinti in view of Tischer does not teach the detection and removal of impediments in a spoken input of interest for a system response as set forth in claim 7.  Winter, however, discloses: 
The system of claim 1, wherein the audio segment of interest includes a speech impediment element (detection of user audio self-repairs/impediments such as a stutter or repetition in an utterance, Paragraphs 0034-0036), and wherein the hardware processor is further configured to execute the software code to: remove, before playing the audio response, the speech impediment from the audio response (performing natural language processing to "remove" the speech impediment including the stutter/repetition from the TTS output, Paragraphs 0030 and 0034-0035).
Peddinti, Tischer, and Winter are analogous art because they are from a similar field of endeavor in interactive systems utilizing text-to-speech synthesis.  Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to employ the filler word removal taught by Winter in the response generation taught by Peddinti in view of Tischer to provide for a more accurate read back of the audio of interest (Winter, Paragraph 0034).
Claims 14 and 20 contain subject matter similar to claim 7, and thus, are rejected under similar rationale.
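As a rough illustration of the claim 7 step (removing a speech impediment element before the response is played), the sketch below cleans a transcript of stuttered onsets, filler words, and immediate word repetitions. It is a text-level heuristic under assumed rules, not Winter's natural language processing or the applicant's method; the filler list and the function name are hypothetical, and the stutter pattern is deliberately crude (it would also collapse a legitimate word such as "co-counsel").

```python
import re

# Hypothetical filler list, for illustration only.
FILLERS = {"um", "uh", "er"}


def remove_disfluencies(text: str) -> str:
    """Strip stuttered onsets, fillers, and immediately repeated words."""
    # Collapse stuttered onsets such as "b-b-ball" into "ball".
    text = re.sub(r"\b(\w{1,3})-(?:\1-)*(?=\1)", "", text, flags=re.IGNORECASE)

    cleaned = []
    for token in text.split():
        key = token.strip(".,!?").lower()
        if key in FILLERS:
            continue  # drop filler words
        if cleaned and cleaned[-1].strip(".,!?").lower() == key:
            continue  # drop an immediate repetition of the previous word
        cleaned.append(token)
    return " ".join(cleaned)


print(remove_disfluencies("my name is um S-S-Sam Sam"))
print(remove_disfluencies("I I want the b-b-ball"))
```

In a full pipeline this cleanup would run on the text string before it reaches the TTS step, so the played response keeps the speaker's pronunciation of the segment without reproducing the disfluency.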
With respect to Claim 25, Winter further discloses:
The system of claim 7, wherein the speech impediment element comprises one or more repeated syllables due to stuttering by the human speaker or any other disfluency by the human speaker (note that the user disfluencies that are removed or repaired include a "stutter" that inherently includes one or more syllables of a word or a filler/repetition as another type of disfluency, Paragraph 0035).

Claims 23 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Peddinti, et al. in view of Tischer and further in view of Hantrakul, et al. (U.S. PG Publication:  2023/0377591 A1).
With respect to Claim 23, Peddinti in view of Tischer teaches the system for outputting a non-human digital assistant response to a spoken user input relying upon a user pronunciation as applied to Claim 1.  Although Peddinti deals with computer systems and neural networks (see Paragraph 0035) that likely could provide a digital agent response to a received audio input in near real-time on the order of one hundred milliseconds or less, Peddinti in view of Tischer does not explicitly describe such near real-time processing. Hantrakul, however, discloses speech synthesis to generate a "real time" reply with a perception of an immediate response on the order of "within few milliseconds."  Note that the ordinary and customary meaning of "few" is a "small number" (https://www.dictionary.com/browse/few).  Thus, a small number would be within the range disclosed by the applicant and obvious to choose on the order of 50 ms or less in order to provide a response that is "virtually immediate when observed by a user" (Paragraph 0029).
Peddinti, Tischer, and Hantrakul are analogous art because they are from a similar field of endeavor in speech synthesis.  Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to provide the responses of Peddinti in view of Tischer in real time as taught by Hantrakul to provide a predictable result of providing a response that is "virtually immediate when observed by a user" (Paragraph 0029) and implement continuous interaction between the user and the digital assistant in Peddinti.
Claim 26 contains subject matter similar to claim 23, and thus, is rejected under similar rationale.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:  
Sivaprasad, et al. ("Emotional prosody control for speech generation," 2001)- teaches capturing rhythm and emphasis as a part of prosody control from a source speaker to mimic a reference speaker (Section 2, Page 2 and Section 5, Page 4).
Beaufays, et al. (U.S. PG Publication:  2022/0115000 A1)- teaches voice characteristics used to bias TTS generation with respect to prosodic properties including "one or more of intonation, tone, stress, rhythm, tempo, or pause...that are indicative of speech of the user of the client device" (Paragraphs 0023 and 0087).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAMES S WOZNIAK whose telephone number is (571)272-7632. The examiner can normally be reached 7-3, off alternate Fridays.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant may use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

JAMES S. WOZNIAK
Primary Examiner
Art Unit 2655



/JAMES S WOZNIAK/               Primary Examiner, Art Unit 2655

Prosecution Timeline

Dec 01, 2023: Application Filed
Jul 21, 2025: Non-Final Rejection mailed, §103
Oct 21, 2025: Response Filed
Dec 17, 2025: Final Rejection mailed, §103
Mar 11, 2026: Request for Continued Examination
Mar 13, 2026: Response after Non-Final Action
Apr 27, 2026: Non-Final Rejection mailed, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/585,204, Patent 12640139
METHOD AND APPARATUS FOR IMPROVING PERFORMANCE OF ARTIFICIAL INTELLIGENCE MODEL USING SPEECH RECOGNITION RESULTS AS TEXT INPUT
2y 3m to grant; granted May 26, 2026

18/535,521, Patent 12609113
NATURAL LANGUAGE PROCESSING SYSTEMS AND METHODS FOR INTENT CLASSIFICATION OF SPEECH TRANSCRIPTION
2y 4m to grant; granted Apr 21, 2026

18/544,354, Patent 12609106
EMOTIVE TEXT-TO-SPEECH WITH AUTO DETECTION OF EMOTIONS
2y 4m to grant; granted Apr 21, 2026

18/399,876, Patent 12597422
SPEAKING PRACTICE SYSTEM WITH RELIABLE PRONUNCIATION EVALUATION
2y 3m to grant; granted Apr 07, 2026

18/488,578, Patent 12586569
Knowledge Distillation with Domain Mismatch For Speech Recognition
2y 5m to grant; granted Mar 24, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 60%
With Interview (+39.4%): 99%
Median Time to Grant: 3y 8m (~1y 2m remaining)
PTA Risk: High

Based on 391 resolved cases by this examiner. Grant probability derived from career allowance rate.
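The projection figures above appear to follow from simple arithmetic on the career numbers reported on this page. The sketch below reproduces them under one stated assumption: that the +39.4% interview lift is applied as additive percentage points on top of the career allowance rate. The page does not say how the lift is computed, so treat this as a consistency check, not the tool's actual formula.

```python
# Career figures reported on this page.
granted = 233
resolved = 391
interview_lift_points = 39.4  # assumed to be additive percentage points

# Career allowance rate as a percentage.
allowance_pct = 100.0 * granted / resolved

# Assumed projection with an interview, capped at 100%.
with_interview_pct = min(allowance_pct + interview_lift_points, 100.0)

print(f"Career allowance rate: {allowance_pct:.1f}%")
print(f"With interview (assumed additive lift): {with_interview_pct:.1f}%")
```

Rounded to whole percentages, these values match the 60% grant probability and the 99% with-interview figure shown in the projections.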