Prosecution Insights
Last updated: April 19, 2026
Application No. 18/491,266

SYSTEMS AND METHODS FOR PROVIDING NON-LEXICAL CUES IN SYNTHESIZED SPEECH

Final Rejection — §101, §103
Filed: Oct 20, 2023
Examiner: MASTERS, KRISTEN MICHELLE
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: Intel Corporation
OA Round: 2 (Final)
Grant Probability: 62% (Moderate)
Expected OA Rounds: 3-4
Time to Grant: 3y 2m
With Interview: 87%

Examiner Intelligence

Career Allow Rate: 62% (25 granted / 40 resolved; +0.5% vs TC avg) — grants 62% of resolved cases
Interview Lift: +24.7% — strong lift in allow rate for resolved cases with an interview vs. without
Avg Prosecution: 3y 2m (typical timeline; 36 currently pending)
Total Applications: 76 (career history, across all art units)

Statute-Specific Performance

§101: 35.2% (-4.8% vs TC avg)
§103: 46.9% (+6.9% vs TC avg)
§102: 8.0% (-32.0% vs TC avg)
§112: 7.1% (-32.9% vs TC avg)
Tech Center averages are estimates • Based on career data from 40 resolved cases
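As a sanity check on those figures: if each "vs TC avg" delta is simply the examiner's per-statute rate minus the Tech Center estimate for that statute (an assumption — the dashboard does not state its methodology), the implied baselines can be backed out directly. A minimal sketch:

```python
# Sketch of the arithmetic behind the "vs TC avg" deltas above.
# Assumption: each delta = (examiner's per-statute rate) - (TC average estimate);
# the dashboard's actual methodology is not stated on this page.

rates = {"101": 35.2, "103": 46.9, "102": 8.0, "112": 7.1}       # examiner, %
deltas = {"101": -4.8, "103": 6.9, "102": -32.0, "112": -32.9}   # vs TC avg, points

for statute, rate in rates.items():
    implied_tc_avg = rate - deltas[statute]
    print(f"§{statute}: examiner {rate:.1f}%, implied TC average ≈ {implied_tc_avg:.1f}%")
```

Under that assumption, every implied baseline works out to roughly 40%, i.e. the deltas appear to be measured against a single ~40% Tech Center estimate rather than per-statute baselines.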

Office Action

§101, §103
Detailed Action

This communication is in response to the Arguments and Amendments filed on 1/23/2026. Claims 2-22 are pending and have been examined. Hence, this action has been made Final. Claims 1, 10 and 16 are independent storage, apparatus and apparatus claims, respectively. Apparent priority: 9/26/2014. Any previous objection/rejection not mentioned in this Office Action has been withdrawn by the Examiner.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

The Applicant has amended the claims to include "trigger, at a speech synthesizer,".

As to the claim rejections under 35 U.S.C. § 101, Applicant notes: Applicant has amended the independent claims to recite a speech synthesizer. As defined in the application, a speech synthesizer is not an abstract idea. See, e.g., P[0028]. Examiner notes the addition of the speech synthesizer acts as a routine, conventional output mechanism. Examiner notes this as a generic computer component. The speech synthesizer is noted as an additional element and does not amount to significantly more than the judicial exception. Applicant's Arguments and Amendments do not overcome the 35 U.S.C. § 101 rejections.

As to the claim rejections under 35 U.S.C. § 103, Applicant notes: independent claim 2 recites "insert a non-verbal disfluency cue at a first insertion point to enhance speech to be synthesized from text". The Office Action states that this is shown in P[0045] of Yamabe. However, Yamabe discusses only adding "audio effects". Nowhere does Yamabe teach or suggest inserting a non-verbal disfluency cue at a first insertion point to enhance speech to be synthesized from text. Examiner notes Yamabe does teach this limitation in [0045]. Examiner notes the citations read on the claim language as currently presented. Examiner notes Yamabe makes the insertions for the purpose of synthesized audio effects as presented in the claims. Yamabe adds vocalization features (pitch modulation, filler utterances, etc.) which constitute non-verbal cues inserted into speech.

Applicant notes that, as stated in the specification, examples of non-verbal disfluencies include "'um,' 'oh,' 'uh,' or the like". P[0024]. Nowhere do Yamabe or Huang teach or suggest adding a non-verbal disfluency to audio data generated from converting the text data to speech. In fact, Yamabe teaches away from adding non-verbal disfluencies, focusing on "semantic conversion to audio..." Examiner notes "or the like" is broad language. Yamabe adds vocalization features (pitch modulation, filler utterances, etc.) which constitute non-verbal cues inserted into speech.

Applicant notes independent claim 2 recites "insert a prosody cue at a second insertion point, the prosody cue associated with a second tag of the markup language". The Office Action states that this is shown in Huang (col 4, lines 18-60). However, Huang discusses only "[tagging] each word of the text with a descriptive label ... [determining] the prosodic parameters ... transmitted to speech synthesizer". Nowhere does Huang teach or suggest inserting a prosody cue as recited in the claims. Examiner notes Huang does teach this limitation. (see Huang (4:18-60) "(8) When input device 14 receives text data, the input device 14 routes the text data input to the synthesis components 15 which perform speech synthesis. FIGS.
5-7 illustrate the speech synthesis technique employed in the preferred embodiment of the present invention and will be described in more detail below. The natural language processor (NLP) 32 receives the input text and tags each word of the text with a descriptive label. The tags are passed to a letter-to-sound (LTS) component 33 and a prosody engine 35. The letter-to-sound component 33 utilizes dictionary input from the dictionary storage 22 and letter-to-phoneme rules from the letter-to-phoneme rule storage 40 to convert the letters in the input text to phonemes. The letter-to-sound component 33 may, for example, determine the proper pronunciation of the input text. The letter-to-sound component 33 is connected to a phonetic string and stress component 34. The phonetic string and stress component 33 generates a phonetic string with proper stressing for the input text, that is passed to a prosody engine 35. The letter-to-sound component 33 and phonetic stress component 33 may, in alternative embodiments, be encapsulated into a single component. The prosody engine 35 receives the phonetic string and inserts pause markers and determines the prosodic parameters which indicate the intensity, pitch, and duration of each phoneme in the string. The prosody engine 35 uses prosody models, stored in prosody database storage 42. The phoneme string with pause markers and the prosodic parameters indicating pitch, duration, and amplitude is transmitted to speech synthesizer 36….” Examiner further notes Yamabe also teaches this limitation by inserting filler sounds and laughter using markup language. Applicants arguments and amendments do not overcome the 35 U.S.C. § 103 rejections. Updated mappings to reflect the amendments have been provided below. Claim Rejections - 35 USC § 101 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. Claims 2-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The independent Claims are directed to statutory categories: Claim 2 is a storage system claim and directed to the machine or manufacture category of patentable subject matter. Claim 10 is an apparatus claim and is directed to the machine or manufacture category of patentable subject matter. Claim 16 is an apparatus claim and is directed to the machine or manufacture category of patentable subject matter. Independent claim 2 recites, “2. A memory comprising machine readable instructions to cause one or more of at least one processor circuits to: insert a non-verbal disfluency cue at a first insertion point to enhance speech to be synthesized from text, the non-verbal disfluency cue associated with a first tag of a markup language; (this relates to a human using pen and paper to insert a non-verbal disfluency cue or tag.) insert a prosody cue at a second insertion point, the prosody cue associated with a second tag of the markup language; (this relates to a human using pen and paper to insert a prosody cue or tag.) and trigger, at a speech synthesizer, a synthesis of the speech based on the text including the non-verbal disfluency cue, and the prosody cue (this relates to a human synthesizing speech using the human voice.) 
The Dependent Claims do not include additional limitations that could incorporate the abstract idea into a practical application or cause the Claim as a whole to amount to significantly more than the underlying abstract idea.

As to Independent Claim 10, Claim 10 is a parallel apparatus claim with limitations similar to those of Claim 2 and is rejected under the same rationale. As to Independent Claim 16, Claim 16 is a parallel apparatus claim with limitations similar to those of Claim 2 and is rejected under the same rationale.

This judicial exception is not integrated into a practical application. In particular, claims 1 and 10 recite additional elements of "processors", and "memory". For example, in [0010] of the as-filed specification, there is description of using "a processor 102, memory 104, an audio output 106, an input device 108, and a network interface 140. The processor 102 may be dedicated to the system 100 or may be incorporated into and/or borrowed from another system (e.g., a natural language interface) or computing device, such as a desktop computer or a mobile computing device (e.g., laptop, tablet, smartphone, or the like). The memory 104 may be coupled to or otherwise accessible by the processor 102..." Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.

The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is noted as a general computer. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Further, the additional limitations in the claims noted above are directed towards insignificant solution activity. The claims are not patent eligible.

Dependent claim 3 recites, "3. The memory of claim 2, wherein the instructions cause one or more of the at least one processor circuits to determine a user intent from a natural language input by the user" (this relates to a human determining user intent using logic and reasoning and perception.)

Dependent claim 4 recites, "4. The memory of claim 3, wherein the instructions cause one or more of the at least one processor circuits to determine the user intent based on machine learning." (this relates to a human determining user intent using logic and reasoning and perception.)

Dependent claim 5 recites, "5. The memory of claim 3, wherein the instructions cause one or more of the at least one processor circuits to cause a device to take an action based on the user intent." (This relates to a human taking an action.)

Dependent claim 6 recites, "6. The memory of claim 2, wherein the instructions cause one or more of the at least one processor circuits insert a phrasal stress cue on a word in the speech and trigger the synthesis of the speech with the phrasal stress." (this relates to a human using speech or pen and paper to insert a stress cue.)

Dependent claim 7 recites, "7. The memory of claim 2, wherein the instructions cause one or more of the at least one processor circuits to determine a user intent from user behavior." (this relates to a human using logic and reasoning and perception to determine intent based on another individual's behavior.)

Dependent claim 8 recites, "8.
The memory of claim 7, wherein the instructions cause one or more of the at least one processor circuits to cause a device to take an action based on the user intent. (This relates to a human taking an action.) Dependent claim 9 recites, “The memory of claim 2, wherein to trigger the synthesis of the speech, the instructions cause one or more of the at least one processor circuits to cause a speaker to output the speech. (this relates to a human using voice to output speech.) A speaker is noted as a generic element. Regarding Dependent claims 11 and 16, Claims 11 and 16 are parallel apparatus claims with limitations similar to that of claim 3 and are rejected under the same rationale. Regarding Dependent claims 12 and 19, Claims 12 and 19 are parallel apparatus claims with limitations similar to that of claim 5 and are rejected under the same rationale. Regarding Dependent claims 13 and 20, Claims 13 and 20 are parallel apparatus claims with limitations similar to that of claim 6 and are rejected under the same rationale. Regarding Dependent claims 14 and 21, Claims 14 and 21 are parallel apparatus claims with limitations similar to that of claim 7 and are rejected under the same rationale. Regarding Dependent claims 15 and 22, Claims 15 and 22 are parallel apparatus claims with limitations similar to that of claim 8 and are rejected under the same rationale. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 2, 10, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Yamabe (U.S. Patent Number US 20090157407 A1), in view of Huang (U.S. Patent Number US 5913193 A). 
Regarding Independent Claim 2, Yamabe teaches A memory comprising machine readable instructions to cause one or more of at least one processor circuits to: (see Yamabe [0054] “…These computer program instructions may also be stored in a computer-readable memory that may direct a computing device or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block(s) or step(s)…”) insert a non-verbal disfluency cue at a first insertion point to enhance speech to be synthesized from text, the non-verbal disfluency cue associated with a first tag of a markup language; (see Yamabe [0045] “In an exemplary embodiment, audio effects may be based at least in part on tag information, such as HTML tags, used to format the text, which may include for example having a short pause in the audio playback of the converted text data following an HTML tag for a line break, having the converted audio data be played back louder over portions of text encased in HTML tags which serve to bold or emphasize words, inserting an introduction of linked pages at the tail end of the audio if there are hyperlinks to other HTML pages contained within the source blog data 104, and/or the like. In another exemplary embodiment, audio effects may be based at least in part on special word pairings or on special HTML tags embedded within the source blog data 104 that serve a purpose other than to format the text. For example, the TTS converter 112 may determine to add an audio effect of a dog barking in response to reading a word pairing within the semantic structure model 110 such as "barking dog" or in response to special HTML tags such as <bark></bark> created for the purpose of adding audio effects to the converted file. In another exemplary embodiment, audio effects may be based at least in part on special character combinations embedded within the text extracted from the blog data 104 by the parser 108 and contained within the semantic structure model 110. Examples of such special character combinations include what are known as emoticons, or smiley faces, such as ";)" or ":)." In response to encountering such a character combination a laughing voice audio effect may be added to the audio data generated by the TTS converter 112. It will be appreciated, however, that the above examples are merely a few examples of means for determining from the data contained within the semantic structure model 110 whether to and what audio effects to add to the converted audio data and that the invention is not limited to just these example scenarios. Moreover, the term "tags" as used herein should be construed not just to include tags used in a markup language, but to include any similar means or device used to designate data formatting or special effects which should be added upon semantic conversion to audio and/or video data.”) Yamabe does not specifically teach insert a prosody cue at a second insertion point, the prosody cue associated with a second tag of the markup language; and trigger at a speech synthesizer, a synthesis of the speech based on the text including the non-verbal disfluency cue, and the prosody cue. (see Huang (4:18-60) “(8) When input device 14 receives text data, the input device 14 routes the text data input to the synthesis components 15 which perform speech synthesis. FIGS. 
5-7 illustrate the speech synthesis technique employed in the preferred embodiment of the present invention and will be described in more detail below. The natural language processor (NLP) 32 receives the input text and tags each word of the text with a descriptive label. The tags are passed to a letter-to-sound (LTS) component 33 and a prosody engine 35. The letter-to-sound component 33 utilizes dictionary input from the dictionary storage 22 and letter-to-phoneme rules from the letter-to-phoneme rule storage 40 to convert the letters in the input text to phonemes. The letter-to-sound component 33 may, for example, determine the proper pronunciation of the input text. The letter-to-sound component 33 is connected to a phonetic string and stress component 34. The phonetic string and stress component 33 generates a phonetic string with proper stressing for the input text, that is passed to a prosody engine 35. The letter-to-sound component 33 and phonetic stress component 33 may, in alternative embodiments, be encapsulated into a single component. The prosody engine 35 receives the phonetic string and inserts pause markers and determines the prosodic parameters which indicate the intensity, pitch, and duration of each phoneme in the string. The prosody engine 35 uses prosody models, stored in prosody database storage 42. The phoneme string with pause markers and the prosodic parameters indicating pitch, duration, and amplitude is transmitted to speech synthesizer 36. The prosody models may be speaker-independent or speaker-dependent. (9) The speech synthesizer 36 converts the phonetic string into the corresponding string of diphones or other acoustical units, selects the best instance for each unit, adjusts the instances in accordance with the prosodic parameters and generates a speech waveform reflecting the input text. For illustrative purposes in the discussion below, it will be assumed that the speech synthesizer converts the phonetic string into a string of diphones. Nevertheless, the speech synthesizer could alternatively convert the phonetic string into a string of alternative acoustical units. In performing these tasks, the synthesizer utilizes the instances for each unit which are stored in unit storage 28.”) Yamabe and Huang are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the memory of Yamabe to incorporate insert a prosody cue at a second insertion point, the prosody cue associated with a second tag of the markup language; and trigger a synthesis of the speech based on the text including the non-verbal disfluency cue, and the prosody cue of Huang. This allows for improved naturalness of synthesized speech as recognized by Huang (9:57-58). Regarding independent Claim 10, Claim 10 is an apparatus claim with limitations similar to that of claim 2 and is rejected under the same rationale. Additionally, Yamabe teaches An apparatus comprising: memory; instructions; and processor circuitry to: (see Yamabe [0054] “…As will be appreciated, any such computer program instructions may be loaded onto a computing device or other programmable apparatus (e.g., hardware) to produce a machine, such that the instructions which execute on the computing device or other programmable apparatus create means for implementing the functions specified in the flowchart block(s) or step(s). 
These computer program instructions may also be stored in a computer-readable memory that may direct a computing device or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block(s) or step(s). The computer program instructions may be loaded onto a computing device or other programmable apparatus to cause a series of operational steps to be performed on the computing device or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computing device or other programmable apparatus provide steps for implementing the functions specified in the flowchart block(s) or step(s).”) Regarding independent Claim 16, Claim 16 is an apparatus claim with limitations similar to that of Claim 2 and is rejected under the same rationale. Additionally, Yamabe teaches An apparatus comprising:means for storing instructions; and means for executing the instructions to: (see Yamabe [0054] “…As will be appreciated, any such computer program instructions may be loaded onto a computing device or other programmable apparatus (e.g., hardware) to produce a machine, such that the instructions which execute on the computing device or other programmable apparatus create means for implementing the functions specified in the flowchart block(s) or step(s). These computer program instructions may also be stored in a computer-readable memory that may direct a computing device or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block(s) or step(s). The computer program instructions may be loaded onto a computing device or other programmable apparatus to cause a series of operational steps to be performed on the computing device or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computing device or other programmable apparatus provide steps for implementing the functions specified in the flowchart block(s) or step(s).”) Claims 3-5, 7, 8, 11, 12, 14, 15, 17-19, 21, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Yamabe (U.S. Patent Number US 20090157407 A1), in view of Huang (U.S. Patent Number US 5913193 A) and further in view of Hakkani-Tur (U.S. Patent Number US 8473451 B1). As to Claim 3 Yamabe in view of Huang teaches 3. The memory of claim 2, Yamabe in view of Huang do not specifically teach wherein the instructions cause one or more of the at least one processor circuits to determine a user intent from a natural language input by the user. However, Hakkani-Tur does teach this limitation. (see Hakkani-Tur (3:35-53) “(12) Thus, the modules of system 100 may recognize speech input, such as speech utterances, may transcribe the speech input, may identify (or understand) the meaning of the transcribed speech, may determine an appropriate response to the speech input, may generate text of the appropriate response and from that text, generate audible "speech" from system 100, which the user then hears. In this manner, the user can carry on a natural language dialog with system 100. 
Those of ordinary skill in the art will understand the programming languages and means for generating and training ASR module 102 or any of the other modules in the spoken dialog system. Further, the modules of system 100 may operate independent of a full dialog system. For example, a computing device such as a smartphone (or any processing device having an audio processing capability, for example a PDA with audio and a WiFi network interface) may have an ASR module wherein a user may say "call mom" and the smartphone may act on the instruction without a "spoken dialog interaction".”) Yamabe in view of Huang and Hakkani-Tur are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the memory combination of Yamabe and Huang to incorporate the teachings of Hakkani-Tur to include wherein the instructions cause one or more of the at least one processor circuits to determine a user intent from a natural language input by the user. This allows for natural language to be understood and interpreted as recognized by Hakkani-Tur (3:35-53). As to Claim 4 Yamabe in view of Huang and further in view of Hakkani-Tur teaches 4. The memory of claim 3, Furthermore, Hakkani-Tur teaches wherein the instructions cause one or more of the at least one processor circuits to determine the user intent based on machine learning. (see Hakkani-Tur (3:15-21) “(10) FIG. 1 is a functional block diagram of an exemplary natural language spoken dialog system 100. Natural language spoken dialog system 100 may include an automatic speech recognition (ASR) module 102, a spoken language understanding (SLU) module 104, a dialog management (DM) module 106, a spoken language generation (SLG) module 108, and a text-to-speech (TTS) module 110.”) (see Hakkani-Tur (6:56-7:13) “(33) Text sanitization may also help protect data against some indirect threats. To protect against indirect threats, the utterances may be changed and therefore, the utterance and call-type distribution, by up-sampling or down-sampling the data. By changing the utterance and call-type distribution, data mining methods attacking privacy would not work. Spoken language understanding models may be trained using spoken dialog utterances labeled with user intents (call-types). Changing the utterance and therefore, the call-type distribution will prevent others from extracting such indirect information. The utterances may be down-sampled by collapsing the same or very similar utterances into one example. This is known as cloning. For example, once the filled pauses and some other dysfluencies are filtered, and some named entities and frequent sequences are collapsed into their category tags, the utterances may be down-sampled by just collapsing. For up-sampling, some utterances may be selected and duplicated by adding variations and by inserting dysfluencies, using a synonym list to change words, paraphrasing, or changing the named entity values. In an alternative implementation, the utterances may be compiled into a finite state machine (FSM), and may generate as many paths and utterances as needed. FSMs may be used to generate the alternatives of similar frequent sequences such as "I would like to" and "I wanna", and named entities.”) As to Claim 5, Yamabe in view of Huang and further in view of Hakkani-Tur teaches 5. 
The memory of claim 3, Furthermore, Hakkani-Tur teaches wherein the instructions cause one or more of the at least one processor circuits to cause a device to take an action based on the user intent. (see Hakkani-Tur (3:35-53) “(12) Thus, the modules of system 100 may recognize speech input, such as speech utterances, may transcribe the speech input, may identify (or understand) the meaning of the transcribed speech, may determine an appropriate response to the speech input, may generate text of the appropriate response and from that text, generate audible "speech" from system 100, which the user then hears. In this manner, the user can carry on a natural language dialog with system 100. Those of ordinary skill in the art will understand the programming languages and means for generating and training ASR module 102 or any of the other modules in the spoken dialog system. Further, the modules of system 100 may operate independent of a full dialog system. For example, a computing device such as a smartphone (or any processing device having an audio processing capability, for example a PDA with audio and a WiFi network interface) may have an ASR module wherein a user may say "call mom" and the smartphone may act on the instruction without a "spoken dialog interaction".”) As to Claim 7, Yamabe in view of Huang teaches 7. The memory of claim 2, Yamabe in view of Huang doe not specifically teach wherein the instructions cause one or more of the at least one processor circuits to determine a user intent from user behavior. However, Hakkani-Tur does teach this limitation (see Hakkani-Tur (3:35-53) “(12) Thus, the modules of system 100 may recognize speech input, such as speech utterances, may transcribe the speech input, may identify (or understand) the meaning of the transcribed speech, may determine an appropriate response to the speech input, may generate text of the appropriate response and from that text, generate audible "speech" from system 100, which the user then hears. In this manner, the user can carry on a natural language dialog with system 100. Those of ordinary skill in the art will understand the programming languages and means for generating and training ASR module 102 or any of the other modules in the spoken dialog system. Further, the modules of system 100 may operate independent of a full dialog system. For example, a computing device such as a smartphone (or any processing device having an audio processing capability, for example a PDA with audio and a WiFi network interface) may have an ASR module wherein a user may say "call mom" and the smartphone may act on the instruction without a "spoken dialog interaction".”) Yamabe in view of Huang and Hakkani-Tur are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the memory combination of Yamabe and Huang to incorporate the teachings of Hakkani-Tur to include the instructions cause one or more of the at least one processor circuits to determine a user intent from user behavior. This allows for natural language to be understood and interpreted as recognized by Hakkani-Tur (3:35-53). As to Claim 8, Yamabe in view of Huang and further in view of Hakkani-Tur teaches The memory of claim 7, Furthermore, Hakkani-Tur teaches wherein the instructions cause one or more of the at least one processor circuits to cause a device to take an action based on the user intent. 
(see Hakkani-Tur (3:35-53) “(12) Thus, the modules of system 100 may recognize speech input, such as speech utterances, may transcribe the speech input, may identify (or understand) the meaning of the transcribed speech, may determine an appropriate response to the speech input, may generate text of the appropriate response and from that text, generate audible "speech" from system 100, which the user then hears. In this manner, the user can carry on a natural language dialog with system 100. Those of ordinary skill in the art will understand the programming languages and means for generating and training ASR module 102 or any of the other modules in the spoken dialog system. Further, the modules of system 100 may operate independent of a full dialog system. For example, a computing device such as a smartphone (or any processing device having an audio processing capability, for example a PDA with audio and a WiFi network interface) may have an ASR module wherein a user may say "call mom" and the smartphone may act on the instruction without a "spoken dialog interaction".”) Regarding Dependent claims 11 and 16, Claims 11 and 16 are parallel apparatus claims with limitations similar to that of claim 3 and are rejected under the same rationale. Regarding Dependent claims 12 and 19, Claims 12 and 19 are parallel apparatus claims with limitations similar to that of claim 5 and are rejected under the same rationale. Regarding Dependent claims 14 and 21, Claims 14 and 21 are parallel apparatus claims with limitations similar to that of claim 7 and are rejected under the same rationale. Regarding Dependent claims 15 and 22, Claims 15 and 22 are parallel apparatus claims with limitations similar to that of claim 8 and are rejected under the same rationale. Claims 6, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yamabe (U.S. Patent Number US 20090157407 A1), in view of Huang (U.S. Patent Number US 5913193 A) and further in view of Freeland (U.S. Patent Number US 20030028380 A1). As to Claim 6 Yamabe in view of Huang teaches The memory of claim 2, Yamabe in view of Huang do not specifically teach wherein the instructions cause one or more of the at least one processor circuits insert a phrasal stress cue on a word in the speech and trigger the synthesis of the speech with the phrasal stress. However, Freeland does teach this limitation. (see Freeland, [0165] “An additional method of producing better prosodic models for use in TTS systems is similar to the prosidy training method described above but is suitable for use in STS systems. In an STS system, the user's voice input is required to generate the text for conversion by the TTS system to a character's voice. The recorded audio file of the user's input speech can thus be analysed for its prosodic model which is subsequently used to train the TTS system's prosodic response as described above. Effectively, this method allows the STS system to mimic the user's original intonation and speaking speed. Yet another method of producing better prosodic models for use in TTS systems involves marking up the input text with emotional cues to the TTS system. One such markup language is SABLE which looks similar to HTML. Regions of the text to be converted to speech that require specific emphasis or emotion are marked with escape sequences that instruct the TTS system to modify the prosodic model from what would otherwise be produced. 
For example, a TTS system would probably generate the word `going` with rising pitch in the text message "So where do you think you're going?". A markup language can be used to instruct the TTS system to generate the word `you're` with a sarcastic emphasis and the word `going` with an elongated duration and falling pitch. This markup would modify the prosidy generation phase of the TTS or STS system. Whilst this method of prosidy generation is prior art, one novel extension is to include emotion markups in the actual corpus (the corpus is the textual script of all of the recordings that make up the recorded speech database) and lots of different emotional speech recordings so that the recorded speech database has a large variation in prosidy and the TTS can use the markups in the corpus to enhance the unit selection algorithm.”) Yamabe in view of Huang and Freeland are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the memory combination of Yamabe and Huang to incorporate the instructions cause one or more of the at least one processor circuits insert a phrasal stress cue on a word in the speech and trigger the synthesis of the speech with the phrasal stress of Freeland. This allows for improved rate and naturalness of the spoken audio as recognized by Freeland [0118]. Regarding Dependent claims 13 and 20, Claims 13 and 20 are parallel apparatus claims with limitations similar to that of claim 6 and are rejected under the same rationale. Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Yamabe (U.S. Patent Number US 20090157407 A1), in view of Huang (U.S. Patent Number US 5913193 A) and further in view of Tsang (U.S. Patent Number US 8731905 B1). As to Claim 9, Yamabe in view of Huang teaches The memory of claim 2, Yamabe in view of Huang do not specifically teach wherein to trigger the synthesis of the speech, the instructions cause one or more of the at least one processor circuits to cause a speaker to output the speech. However, Tsang does teach this limitation. (see Tsang, (6:36-52) “(21) A processor may be adapted by software to use the segmentation of the text from a document into phrasal segments for various purposes. A system may employ the segmentation to highlight successive segments in the document for display to a user in sequence to facilitate the user reading the document. For example, the user may desire to read the text from a document, a portion of which is shown in FIG. 2a. The system may also employ a text-to-speech (TTS) engine, which are widely available. Such a TTS engine can be instructed by the system to convert one or a sequence of words into sounds that mimic how a person would speak the words so that they can be played aloud through acoustic speakers in electronic communication with the system. Such a TTS engine will speak the words provided to it at a rate, which may be adjustable, that is comparable to normal human speech. 
Rates of 120 to 180 words per minute are typical, although lower or higher rates may be suitable for some purposes.”) Yamabe in view of Huang and Tsang are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the memory combination of Yamabe and Huang to incorporate to trigger the synthesis of the speech, the instructions cause one or more of the at least one processor circuits to cause a speaker to output the speech of Tsang. This allows for significant improvements in the understanding of a listener of the text being spoken by providing extra time for the listener to digest the content of the current segment as recognized by Tsang (7:1-4). Conclusion THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to KRISTEN MICHELLE MASTERS whose telephone number is (703)756-1274. The examiner can normally be reached M-F 8:30 AM - 5:00 PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Louis Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /KRISTEN MICHELLE MASTERS/Examiner, Art Unit 2659 /PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659
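To make the disputed claim language concrete for readers who are not speech-synthesis practitioners: independent claim 2, as quoted in the rejection above, covers inserting a non-verbal disfluency cue at one insertion point and a prosody cue at another, each associated with a markup-language tag, and then triggering synthesis of the marked-up text. A minimal sketch of that flow, with hypothetical tag names and a stubbed synthesizer call — none of this is taken from the application, Yamabe, Huang, or any particular speech-synthesis API:

```python
# Illustrative sketch only. The <disfluency> tag and the synthesize() stub are
# hypothetical; SSML-style <prosody> is used loosely for the second cue.

def insert_cue(text: str, position: int, tag: str) -> str:
    """Insert a markup cue into the text at the given character offset."""
    return text[:position] + tag + text[position:]

def synthesize(marked_up_text: str) -> None:
    """Stand-in for handing the marked-up text to a speech synthesizer."""
    print("Synthesizing:", marked_up_text)

text = "I think the meeting is at three."

# First tag: a non-verbal disfluency cue at a first insertion point.
text = insert_cue(text, 0, '<disfluency type="um"/> ')

# Second tag: a prosody cue at a second insertion point (wrapping "three").
text = insert_cue(text, text.index("three"), '<prosody rate="slow" pitch="low">')
text = text.replace("three.", "three.</prosody>")

# Trigger synthesis of the speech from the text including both cues.
synthesize(text)
```

The dispute in the §103 rejection is essentially whether Yamabe's tag-driven "audio effects" and Huang's prosody engine together read on the first and second cue insertions shown in this kind of flow.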

Prosecution Timeline

Oct 20, 2023: Application Filed
Oct 18, 2025: Non-Final Rejection — §101, §103
Jan 23, 2026: Response Filed
Apr 03, 2026: Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592219
Hearing Device User Communicating With a Wireless Communication Device
2y 5m to grant • Granted Mar 31, 2026
Patent 12548569
METHOD AND SYSTEM OF DETECTING AND IMPROVING REAL-TIME MISPRONUNCIATION OF WORDS
2y 5m to grant • Granted Feb 10, 2026
Patent 12548564
SYSTEM AND METHOD FOR CONTROLLING A PLURALITY OF DEVICES
2y 5m to grant • Granted Feb 10, 2026
Patent 12547894
ENTROPY-BASED ANTI-MODELING FOR MACHINE LEARNING APPLICATIONS
2y 5m to grant • Granted Feb 10, 2026
Patent 12547840
MULTI-STAGE PROCESSING FOR LARGE LANGUAGE MODEL TO ANSWER MATH QUESTIONS MORE ACCURATELY
2y 5m to grant • Granted Feb 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 62%
With Interview: 87% (+24.7%)
Median Time to Grant: 3y 2m
PTA Risk: Moderate
Based on 40 resolved cases by this examiner. Grant probability derived from career allow rate.
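The footnote above says the grant probability is derived from the career allow rate. Assuming the headline numbers come straight from the counts shown under Examiner Intelligence (25 granted of 40 resolved) with the reported interview lift added on top — the dashboard's actual model may be more involved — the figures reproduce as follows:

```python
# Sketch only: reconstructs the headline projections from counts shown
# elsewhere on this page. The dashboard's real model is not documented here.

granted, resolved = 25, 40          # examiner's career counts
interview_lift = 24.7               # percentage-point lift reported above

base_grant_probability = 100 * granted / resolved          # 62.5% -> shown as 62%
with_interview = base_grant_probability + interview_lift   # 87.2% -> shown as 87%

print(f"Base grant probability: {base_grant_probability:.1f}%")
print(f"With interview:         {with_interview:.1f}%")
```

That matches the displayed 62% and 87%, so the projection appears to be a simple base-rate-plus-lift estimate rather than a case-specific model.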
