Prosecution Insights
Last updated: April 19, 2026
Application No. 18/497,967

SYSTEMS FOR PROVIDING REAL-TIME FEEDBACK TO REDUCE UNDESIRED SPEAKING PATTERNS, AND METHODS OF USING THE SAME

Status: Non-Final OA (§101, §103)
Filed: Oct 30, 2023
Examiner: LEE, EUNICE SOMIN
Art Unit: 2656
Tech Center: 2600 — Communications
Assignee: Cdc Phone App Ip 2023 LLC
OA Round: 1 (Non-Final)

Grant Probability: 89% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 89%, above average (24 granted / 27 resolved; +26.9% vs TC avg)
Interview Lift: +27.3% across resolved cases with interview
Avg Prosecution: 2y 10m typical; 20 applications currently pending
Career History: 47 total applications across all art units

Statute-Specific Performance

§101: 18.7% (-21.3% vs TC avg)
§103: 53.0% (+13.0% vs TC avg)
§102: 7.3% (-32.7% vs TC avg)
§112: 2.7% (-37.3% vs TC avg)

Tech Center averages are estimates. Based on career data from 27 resolved cases.

Office Action

Rejections: §101, §103
DETAILED ACTION

This communication is in response to the Application filed on October 30, 2023. Claims 1-14 are pending and have been examined. Claim 1 is independent. Domestic priority: October 31, 2022. PCT/US2023/078252 was filed on October 30, 2023.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on January 2, 2024 and May 3, 2024 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Drawings

The drawings filed on October 30, 2023 have been accepted and considered by the Examiner.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-2, 4-9, and 13-14 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding Claim 1: Step 1: Claim 1 recites a system for providing speech-related feedback, which falls under the statutory category of machine.

Step 2A Prong I: As drafted, Claim 1 recites steps that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. That is, other than reciting “a system for providing speech-related feedback”, nothing in the claim elements precludes the steps from practically being performed in the mind. Accordingly, the claim recites a judicial exception, and the analysis must therefore proceed to Step 2A Prong II.

Step 2A Prong II: The additional limitations of “receive an audio signal at an input device based on the speech of an individual within proximity of the system”, “assess the audio signal for the presence of one or more instances of filler speech”, “assess the audio signal to verify a speaker identity to determine if the audio signal originates from speech of a targeted user of the device”, and “upon detecting the presence of at least one instance of filler speech and confirming that the audio signal originates from a verified targeted user of the device, outputting a discreet sensory signal to the targeted user informing them of the detection of filler speech”, as drafted, can be practically performed in the human mind using observation, evaluation, and judgment, or by a human using pen and paper. Using observation, evaluation, and judgment, a human can hear/receive audio/speech of an individual within proximity. Using observation, evaluation, and judgment, a human can assess the presence of filler speech (ah, um, you know, etc.). Using observation, evaluation, and judgment, a human can recognize familiar/unfamiliar voices to verify a speaker identity. Practicing in front of another person, using observation, evaluation, and judgment, a human can, upon detecting the presence of at least one instance of filler speech and confirming that the audio signal originates from a verified targeted user, provide feedback/notification to the individual about speech habits.

Accordingly, the additional element(s) do(es) not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, and the claim is therefore directed to the judicial exception.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of using “a processor”, “memory accessible by the processor”, and a “device” to perform the aforementioned steps amount to no more than mere instructions to apply the exception using a generic computer component, which cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 2: Dependent claim 2 further narrows the steps for providing speech-related feedback by reciting “to determine the presence of at least one instance of filler speech”, “one or more filler words”, “one or more filler sounds”, and “speaking pattern or behavior with one or more vocal disfluencies”. However, the claimed limitations recite only the idea of steps for providing speech-related feedback with a high level of generality. No additional components appear in the claim that apply the abstract idea to a practical application or amount to significantly more than the abstract idea. No additional limitations are present. The claim is not patent eligible.

Regarding Claim 4: Dependent claim 4 further narrows the steps for providing speech-related feedback by reciting “assess the audio signal for detection of filler speech in the form of filler words”. However, the claimed limitations recite only the idea of steps for providing speech-related feedback with a high level of generality. No additional components appear in the claim that apply the abstract idea to a practical application or amount to significantly more than the abstract idea. No additional limitations are present. The claim is not patent eligible.

Regarding Claim 5: Dependent claim 5 further narrows the steps for providing speech-related feedback by reciting “determine the context of a detected filler word to determine if the detected filler words was used in a proper non-filler context”. However, the claimed limitations recite only the idea of steps for providing speech-related feedback with a high level of generality. No additional components appear in the claim that apply the abstract idea to a practical application or amount to significantly more than the abstract idea. No additional limitations are present. The claim is not patent eligible.

Regarding Claim 6: Dependent claim 6 further narrows the steps for providing speech-related feedback by reciting “converting the audio signal to a text transcript”, “text-searching the text transcript for words matching a predetermined list of filler words”, and “the predetermined list of filler words being stored in the memory”. However, the claimed limitations recite only the idea of steps for providing speech-related feedback with a high level of generality. No additional components appear in the claim that apply the abstract idea to a practical application or amount to significantly more than the abstract idea. No additional limitations are present. The claim is not patent eligible.

Regarding Claim 7: Dependent claim 7 further narrows the steps for providing speech-related feedback by reciting “identify surrounding words in proximity to the detected filler word” and “determining based on the surrounding words whether the detected filler word was used in a proper non-filler context”. However, the claimed limitations recite only the idea of steps for providing speech-related feedback with a high level of generality. No additional components appear in the claim that apply the abstract idea to a practical application or amount to significantly more than the abstract idea. No additional limitations are present. The claim is not patent eligible.

Regarding Claim 8: Dependent claim 8 further narrows the steps for providing speech-related feedback by reciting “selectively update the predetermined list of filler words to either remove or add words”. However, the claimed limitations recite only the idea of steps for providing speech-related feedback with a high level of generality. No additional components appear in the claim that apply the abstract idea to a practical application or amount to significantly more than the abstract idea. No additional limitations are present. The claim is not patent eligible.

Regarding Claim 9: Dependent claim 9 further narrows the steps for providing speech-related feedback by reciting “assess the audio signal for detection of filler speech in the form of filler sounds”. However, the claimed limitations recite only the idea of steps for providing speech-related feedback with a high level of generality. No additional components appear in the claim that apply the abstract idea to a practical application or amount to significantly more than the abstract idea. No additional limitations are present. The claim is not patent eligible.

Regarding Claim 13: Dependent claim 13 further narrows the steps for providing speech-related feedback by reciting “output discreet sensory signals in the form of one or more of: haptic signals, auditory signals, and visual signals”. However, the claimed limitations recite only the idea of steps for providing speech-related feedback with a high level of generality. No additional components appear in the claim that apply the abstract idea to a practical application or amount to significantly more than the abstract idea. No additional limitations are present. The claim is not patent eligible.

Regarding Claim 14: Dependent claim 14 further narrows the steps for providing speech-related feedback by reciting “record a history of detected filler speech for review by a user”. However, the claimed limitations recite only the idea of steps for providing speech-related feedback with a high level of generality. No additional components appear in the claim that apply the abstract idea to a practical application or amount to significantly more than the abstract idea. No additional limitations are present. The claim is not patent eligible.

Claim Rejections - 35 USC § 103

The following is a quotation of pre-AIA 35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains.
Patentability shall not be negatived by the manner in which the invention was made.

Claims 1-5 and 9 are rejected under 35 U.S.C. 103(a) as being unpatentable over Liao et al. (U.S. Patent Application Publication 2021/0065582), hereinafter referred to as Liao, in view of Tsiartas et al. (U.S. Patent Application 2017/0084295), hereinafter referred to as Tsiartas.

Regarding Claim 1, Liao teaches:

1. A system for providing speech-related feedback, comprising: [Liao, “If they continue using filler words, the program may continue displaying notifications. This can be very helpful in assisting a user improve their public speaking and presentation skills.” Par. 0037]

a processor, [Liao, “[0066] The machine 700 may include processors 710, memory 730, and I/O components 750, which may be communicatively coupled via, for example, a bus 702. The bus 702 may include multiple buses coupling various elements of machine 700 via various bus technologies and protocols. In an example, the processors 710 (including, for example, a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an ASIC, or a suitable combination thereof) may include one or more processors 712a to 712n that may execute the instructions 716 and process data.” Par. 0066]

memory accessible by the processor, and [Liao, “[0067] The memory/storage 730 may include a main memory 732, a static memory 734, or other memory, and a storage unit 736, both accessible to the processors 710 such as via the bus 702.” Par. 0067]

programmed instructions and data stored in the memory and executable by the processor whereby the device is configured to: [Liao, “…memory comprising executable instructions (i.e., the claimed “programmed instructions”) that, when executed by the processor, cause the data processing system (i.e., the claimed “device”) to perform functions,” Par. 0078; “The memory/storage 730 may also store temporary, intermediate, and/or long-term data (i.e., the claimed “data stored in memory”) for processors 710. The instructions (i.e., the claimed “programmed instructions”) 716 may also reside (i.e., the claimed “stored”), completely or partially, within the memory 732, 734, within the storage unit 736, within at least one of the processors 710 (for example, within a command buffer or cache memory), within memory at least one of I/O components 750, or any suitable combination thereof, during execution thereof.” Par. 0067]

receive an audio signal at an input device based on the speech of an individual within proximity of the system; [Liao, “The function may include receiving audio data (i.e., the claimed “audio signal”) from a speech rehearsal session over a network, the speech rehearsal session being performed for a digital presentation, receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session, determining a number of syllables in each of the plurality of words, calculating a speaking rate based at least in part on the number of syllables, determining if the speaking rate is within a threshold range, and enabling display of a notification on a display device in real time, if the speaking rate falls outside the threshold range.” Par. 0004; “In addition, in some implementations, a user device (i.e., the claimed “input device”) can be configured to transmit data (i.e., the claimed “audio signal”) captured locally (i.e., the claimed “individual within proximity of the system”) during use of relevant application(s) to the cloud or the local ML program and provide supplemental training data that can serve to fine-tune or increase the effectiveness of the MLA.” Par. 0023]

assess the audio signal for the presence of one or more instances of filler speech; [Liao, “This may occur when the speech rehearsal assistance service detects the use of one or more filler words in the user's speech (i.e., the claimed “filler speech”).” Par. 0037; “The function may include receiving audio data (i.e., the claimed “audio signal”) from a speech rehearsal session over a network, receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session, detecting utterance of a filler phrase or sound (i.e., the claimed “filler speech”) during the speech rehearsal session using at least in part a machine learning model trained for identifying filler phrases and sounds in a text, upon detecting the utterance of the filler phrase or sound, enabling real time display of a notification on a display device, wherein detecting the utterance of the filler phrase or sound (i.e., the claimed “filler speech”) is done based on at least one of the transcript of the audio data or the audio data (i.e., the claimed “audio signal”).” Par. 0005]

upon detecting the presence of at least one instance of filler speech and confirming that the audio signal originates from a verified targeted user of the device, outputting a discreet sensory signal to the targeted user informing them of the detection of filler speech. [Liao, “The function may include receiving audio data (i.e., the claimed “audio signal”) from a speech rehearsal session over a network, receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session, detecting utterance of a filler phrase or sound (i.e., the claimed “filler speech”) during the speech rehearsal session using at least in part a machine learning model trained for identifying filler phrases and sounds in a text, upon detecting the utterance of the filler phrase or sound, enabling real time display of a notification (i.e., the claimed “outputting a discreet sensory signal”) on a display device, wherein detecting the utterance of the filler phrase or sound (i.e., the claimed “filler speech”) is done based on at least one of the transcript of the audio data or the audio data (i.e., the claimed “audio signal”).” Par. 0005; “If they continue using filler words (i.e., the claimed “filler speech”), the program may continue displaying notifications (i.e., the claimed “outputting a discreet sensory signal to the targeted user informing them of the detection of filler speech”). This (i.e., the claimed “informing them of the detection of filler speech”) can be very helpful in assisting a user improve their public speaking and presentation skills.” Par. 0037]

Liao fails to teach assessing the audio signal to verify a speaker identity to determine if the audio signal originates from speech of a targeted user of the device.

However, Tsiartas teaches:

assess the audio signal to verify a speaker identity to determine if the audio signal originates from speech of a targeted user of the device; and [Tsiartas, “The speaker verification/speaker identification module 459 can utilize speaker-specific models to associate speaker identities with the speaker segments (i.e., the claimed “audio signal”) identified by the diarization module 458 (i.e., the claimed “determine if the audio signal originates from speech of a targeted user”). That is, the speech data of the current sample can be compared to an existing speaker-specific model to verify the identity of the speaker (the speaker's identity matches his or her stated identity) or to identify the speaker (this is who the speaker is).” Par. 0089; “Certain implementations of the disclosed technology are directed to an electronic device, system, or computer program product configured to provide as output real-time/interaction-time speech-derived speaker state analytics.” Par. 0098]

upon detecting the presence of at least one instance of filler speech and confirming that the audio signal originates from a verified targeted user of the device, outputting a discreet sensory signal to the targeted user informing them of the detection of filler speech. [Tsiartas, see the same passages at Par. 0089 and Par. 0098]

Liao and Tsiartas pertain to real-time speech systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the real-time speech systems art to modify Liao’s teachings of “machine learning models trained for identifying filler phrases and sounds in a text, upon detecting the utterance of the filler phrase or sound, enabling real time display of a notification on a display device” for “helping users to improve their public speaking and presentation skills” (Liao, Par. 0005, Par. 0037) with the teachings of a “speaker verification/speaker identification module” (Tsiartas, Par. 0089) in order to enhance “speaker identification tools to diarize a multi-party recording and direct identified speech to appropriate instances of disclosed analytic engines”, “detect cases of mismatched identity”, and “provide detection of additional speaker data in non-labeled speech for building up additional background model information to improve model quality” (Tsiartas, Par. 0067).
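[Editor's note: The speaker-verification step mapped to Tsiartas above (comparing current speech data against an existing speaker-specific model, Par. 0089) is easy to picture in code. The Python sketch below is purely illustrative of that general idea under stated assumptions; the toy spectral embedding, the names voice_embedding and is_target_speaker, and the 0.85 threshold are all hypothetical stand-ins, not Tsiartas's module 459 or the applicant's implementation.]

```python
# Illustrative only: verify that incoming audio matches a stored
# speaker-specific model by comparing fixed-length voice signatures.
# The "embedding" here is a toy spectral average; a real system would
# use a trained speaker model. All names are hypothetical.
import numpy as np

def voice_embedding(signal: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Reduce a mono audio signal to a fixed-length, unit-norm spectral signature."""
    spectrum = np.abs(np.fft.rfft(signal))
    # Pool the spectrum into n_bins coarse bands so length is input-independent.
    bands = np.array_split(spectrum, n_bins)
    emb = np.array([b.mean() for b in bands])
    norm = np.linalg.norm(emb)
    return emb / norm if norm > 0 else emb

def is_target_speaker(signal, enrolled_emb, threshold: float = 0.85) -> bool:
    """Cosine similarity against the enrolled model decides identity."""
    return float(voice_embedding(signal) @ enrolled_emb) >= threshold

# Enrollment: store the targeted user's signature "in the memory".
rng = np.random.default_rng(0)
enrolled = voice_embedding(rng.standard_normal(16000))   # 1 s at 16 kHz

# A 440 Hz tone has a very different spectrum: verification fails.
t = np.arange(16000) / 16000.0
print(is_target_speaker(np.sin(2 * np.pi * 440.0 * t), enrolled))   # False
# Spectrally similar audio passes the gate.
print(is_target_speaker(rng.standard_normal(16000), enrolled))      # True
```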
Regarding Claim 2, Liao in view of Tsiartas has been discussed above. The combination further teaches:

the system is configured to assess the audio signal for detection of filler speech in the form of filler words and filler sounds, and [Liao, see mapping applied to claim 1]

to determine the presence of at least one instance of filler speech upon detecting the presence of at least one instance of either [Liao, see mapping applied to claim 1]

one or more filler words, [Liao, “The function may include receiving audio data (i.e., the claimed “audio signal”) from a speech rehearsal session over a network, receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session, detecting utterance of a filler phrase (i.e., the claimed “filler words”) or sound during the speech rehearsal session using at least in part a machine learning model trained for identifying filler phrases and sounds in a text, upon detecting the utterance of the filler phrase or sound, enabling real time display of a notification on a display device, wherein detecting the utterance of the filler phrase (i.e., the claimed “filler words”) or sound is done based on at least one of the transcript of the audio data or the audio data (i.e., the claimed “audio signal”).” Par. 0005]

one or more filler sounds, [Liao, see the same passage at Par. 0005, where the “sound” of the uttered filler phrase reads on the claimed “filler sounds”]

or a speaking pattern or behavior with one or more vocal disfluencies. [Liao, “In another example, utterance of filler words and sounds or disfluency in speech (i.e., the claimed “vocal disfluencies”) may be detected and notification may be provided to the speaker in real-time to inform the user of issues they need to address as they are speaking.” Par. 0020]

Regarding Claim 3, Liao in view of Tsiartas has been discussed above. The combination further teaches:

the system is configured to assess the audio signal for detection of filler words using a text-classification model, and [Liao, see mapping applied to claim 1; “As a general matter, the methods and systems described herein may include, or otherwise make use of, a machine-trained model to identify contents related to a text. Machine learning (ML) generally involves various algorithms that can automatically learn over time. The foundation of these algorithms is generally built on mathematics and statistics that can be employed to predict events, classify entities (i.e., the claimed “text classification”), diagnose problems, and model function approximations. As an example, a system can be trained using data generated by a ML model in order to identify patterns in user activity, determine associations between various words and contents (e.g., icons, images, or emoticons) and/or identify filler words or speaking disfluency in speech.” Par. 0023; “detecting utterance of a filler phrase or sound during the speech rehearsal session using at least in part a machine learning model trained for identifying filler phrases and sounds in a text (i.e., the claimed “text classification”);” Par. 0093; “The filler word detection model (i.e., the claimed “text classification model”), on the other hand, may examine the transcript of the audio data to determine if any words in the transcript correspond to filler words, sounds or phrases.” Par. 0027]

assess the audio signal for detection of filler sounds using an acoustic-classification model. [Liao, see mapping applied to claim 1; “Another approach for detecting utterance of filler phrases may involve the use of a deep neural network. For example, a masked convolutional or a recurrent convolutional neural network may be developed that examines every time stamp in every audio frame of the audio data across a certain window to determine how to classify the words in the audio (i.e., the claimed “acoustic classification”).” Par. 0052; “Similarly, the audio-based model (i.e., the claimed “acoustic classification model”) may examine the audio data to detect filler words and/or other disfluencies based on audio data. Other models may also be used.” Par. 0027]

Regarding Claim 4, Liao in view of Tsiartas has been discussed above. The combination further teaches: the system is configured to assess the audio signal for detection of filler speech in the form of filler words. [Liao, see mapping applied to claim 1]

Regarding Claim 5, Liao in view of Tsiartas has been discussed above. The combination further teaches: the system is further configured to determine the context of a detected filler word to determine if the detected filler words was used in a proper non-filler context. [Liao, see mapping applied to claim 1; “Filler words may include words, phrases and sounds that are not necessary to a spoken sentence or phrase and may include phrases such as like, basically, I mean, um, uh, and the like. To detect such filler words a first approach may simply examine the transcript of the audio portion to determine if any phrases or sounds identified as potential filler phrases exist in the transcribed text. However, this may not always identify the correct filler phrases, as some phrases may sometimes be a necessary part of a sentence (i.e., the claimed “proper non-filler context”). For example, the word “like” can be filler phrase or it can be necessary to a sentence (i.e., the claimed “proper non-filler context”). To distinguish such cases, a trained ML model may be used that examines the text and identifies which phrases are necessary to the text (i.e., the claimed “proper non-filler context”). To achieve this, context, meaning, grammar and the like may be taken into account. The trained ML model may be a natural language processing (NPL) model such as NPLs known in the art.” Par. 0050]

Regarding Claim 9, Liao in view of Tsiartas has been discussed above. The combination further teaches: the system is configured to assess the audio signal for detection of filler speech in the form of filler sounds. [Liao, see mapping applied to claim 1]

Claims 6-8 and 13-14 are rejected under 35 U.S.C. 103(a) as being unpatentable over Liao in view of Tsiartas and Krasikov et al. (U.S. Patent Application 2006/0106611), hereinafter referred to as Krasikov.

Regarding Claim 6, Liao in view of Tsiartas has been discussed above.
The combination further teaches: the system is configured to assess the audio signal for detection of filler words by: [Liao, see mapping applied to claim 1]

converting the audio signal to a text transcript, [Liao, “Filler words may include words, phrases and sounds that are not necessary to a spoken sentence or phrase and may include phrases such as like, basically, I mean, um, uh, and the like. To detect such filler words a first approach may simply examine the transcript of the audio portion (i.e., the claimed “convert the audio signal to a text transcript”) to determine if any phrases or sounds identified as potential filler phrases exist in the transcribed text (i.e., the claimed “text transcript”).” Par. 0050]

text-searching the text transcript for words matching a predetermined list of filler words, [Liao, “The potentially identified phrases (i.e., the claimed “predetermined list of filler words”) may then be compared against the transcript to determine if they correspond (i.e., the claimed “match”) to actual words. If they do and those words are not identified as potential filler phrases, then those may be overlooked. However, if they do not correspond to actual words in the transcript or if they correspond (i.e., the claimed “match”) to words that may be filler phrases, then they may be identified as a detected utterance of a filler phrase.” Par. 0051]

the predetermined list of filler words being stored in the memory. [Liao, “As an example, a system can be trained using data generated by a ML model in order to identify patterns in user activity, determine associations between various words and contents (e.g., icons, images, or emoticons) and/or identify filler words or speaking disfluency in speech. Such determination may be made following the accumulation (i.e., the claimed “list of filler words stored in memory”), review, and/or analysis of user data (i.e., the claimed “predetermined list of filler words”) from a large number of users over time, that may be configured to provide the ML algorithm (MLA) with an initial or ongoing training set (i.e., the claimed “predetermined list of filler words”).” Par. 0023]

The combination fails to explicitly teach the predetermined list stored in memory. However, Krasikov teaches:

the predetermined list of filler words being stored in the memory. [Krasikov, “In one exemplary embodiment of the invention, the speech processing system can be programmed to recognize/detect certain words, expressions, sounds or unusual habits such as throat clearing or teeth clicking, for example, from a pre-programmed vocabulary list (i.e., the claimed “predetermined list of filler words”). During an active session in which the device monitors the user's speech, when the user utters one or more of the words and/or makes one or more of the sounds, for example, programmed in the vocabulary list (i.e., the claimed “predetermined list of filler words”), the language analysis system can cause alert notifications to be made instantaneously (real time) or in an off-line mode.” Par. 0006; “In one exemplary embodiment of the invention, the engine(s) (108) include speech recognition methods to process speech/audio data captured during a verbal session (e.g., conversation, speech, etc.) to recognize one or more target words, expressions, and/or sounds (e.g., clicking teeth) that are specified in a vocabulary list (i.e., the claimed “predetermined list of filler words”) persistently maintained (i.e., the claimed “stored”) in the local memory (109).” Par. 0016]

Liao, Tsiartas and Krasikov pertain to real-time speech systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the real-time speech systems art to modify Liao’s teachings of “machine learning models trained for identifying filler phrases and sounds in a text, upon detecting the utterance of the filler phrase or sound, enabling real time display of a notification on a display device” for “helping users to improve their public speaking and presentation skills” (Liao, Par. 0005, Par. 0037) with the teachings of a “speaker verification/speaker identification module” (Tsiartas, Par. 0089) and the explicit teachings of “words, expressions, and/or sounds (e.g., clicking teeth) that are specified in a vocabulary list (i.e., the claimed “predetermined list of filler words”) persistently maintained (i.e., the claimed “stored”) in the local memory” (Krasikov, Par. 0016) in order to enhance “speaker identification tools to diarize a multi-party recording and direct identified speech to appropriate instances of disclosed analytic engines”, “detect cases of mismatched identity”, and “provide detection of additional speaker data in non-labeled speech for building up additional background model information to improve model quality” (Tsiartas, Par. 0067) and to “improve verbal communication skills by eliminating bad speech habits” (Krasikov, Par. 0003).

Regarding Claim 7, Liao in view of Tsiartas and Krasikov has been discussed above. The combination further teaches: the system is further configured such that, upon detecting a filler word in the text transcript, [Liao, see mapping applied to claim 6] to identify surrounding words in proximity to the detected filler word and [Liao, see mapping applied to claim 5] determining based on the surrounding words whether the detected filler word was used in a proper non-filler context. [Liao, see mapping applied to claim 5]

Regarding Claim 8, Liao in view of Tsiartas and Krasikov has been discussed above. The combination further teaches: the system is configured for a user to selectively update the predetermined list of filler words to either remove or add words. [Liao and Krasikov, see mappings applied to claim 7; Krasikov, “For instance, the speech processing system (107) may include machine self-learning methods that enable the device (100) to learn the sounds of certain fillers or repetitive words that are identified during usage of the device (100) and compile/update the vocabulary list accordingly (i.e., the claimed “predetermined list of filler words”). In other embodiments, the vocabulary list (i.e., the claimed “predetermined list of filler words”) can be populated and edited (i.e., the claimed “either remove or add words”) by a user (i.e., the claimed “user to selectively update”) via a suitable user interface function.” Par. 0018]

Regarding Claim 13, Liao in view of Tsiartas and Krasikov has been discussed above. The combination further teaches: the system is configured to output discreet sensory signals in the form of one or more of: haptic signals, auditory signals, and visual signals. [Liao, “In such cases, a notification may be displayed (i.e., the claimed “output discreet sensory signal”, “visual signal”) in the UI element 310 to inform the user of their disfluency. In an example, the notification may provide information about how the user is being disfluent. This may assist the user in addressing the issue.” Par. 0038; Krasikov, “During an active session in which the device monitors the user's speech, when the user utters one or more of the words and/or makes one or more of the sounds, for example, programmed in the vocabulary list, the language analysis system can cause alert notifications to be made instantaneously (real time) or in an off-line mode. For purposes of discreetness, real-time alert notifications (i.e., the claimed “output discreet sensory signals”) can be provided in one or more of various manners including tactile (i.e., the claimed “haptic”) alerts (minor vibration, slight pin-prick) or audio alerts (a low sound) (i.e., the claimed “auditory signals”), or off-line notification can be provided by generating session summary reports for user display and review (i.e., the claimed “visual signal”). In this way, only the user is made aware of when he/she has spoken inappropriately and the user can thus monitor him/herself without drawing undue attention from other persons conversing with the user.” Par. 0007]

Regarding Claim 14, Liao in view of Tsiartas and Krasikov has been discussed above. The combination further teaches: the system is configured to record a history of detected filler speech for review by a user. [Liao, “In one implementation, in addition to the notifications provided in real-time, a summary report may also be provided to the user (i.e., the claimed “record a history of detected filler speech for review by a user”) after the rehearsal session is complete. The summary report may provide an overall assessment of the users' performance and may include information such as the overall pace of speaking, the number and list of most frequently used filler words, the number of times the user was disfluent, the total time used for rehearsal, and the like.” Par. 0039; Krasikov, “For purposes of discreetness, real-time alert notifications can be provided in one or more of various manners including tactile alerts (minor vibration, slight pin-prick) or audio alerts (a low sound), or off-line notification can be provided by generating session summary reports for user display and review (i.e., the claimed “record a history of detected filler speech for review by a user”). In this way, only the user is made aware of when he/she has spoken inappropriately (i.e., the claimed “filler speech”) and the user can thus monitor him/herself without drawing undue attention from other persons conversing with the user.” Par. 0007; “The vocabulary list may contain various filler words or expressions (such as “uh”, “like”, and “you know”, etc.) (i.e., the claimed “filler speech”),” Par. 0016]

Claims 10-12 are rejected under 35 U.S.C. 103(a) as being unpatentable over Liao in view of Tsiartas, Krasikov, and Sheeder et al. (U.S. Patent Application 2020/0335128), hereinafter referred to as Sheeder.

Regarding Claim 10, Liao in view of Tsiartas and Krasikov has been discussed above. The combination further teaches:

the system is configured to assess the audio signal for detection of filler sounds by comparing a waveform of the audio signal to waveforms of sound files of predetermined filler sounds, [Liao and Krasikov, see mappings applied to claim 6]

the sound files of the predetermined filler sounds being stored in the memory.
[Krasikov, “The captured audio stream (i.e., the claimed “audio signal”) is analyzed in real-time using the embedded speech processing system (107) to detect target words/expressions/sounds that are spoken/made by the user, if any (step 201). During the active session, the detection results are output to the analyzer module (110) and such results are stored together with any other relevant session data, if any, that may be collected during the active session (step 202).” Par. 0030; “More specifically, the data processing system (101) comprises a speech processing system (107) having one or more engines (108) for processing speech/audio data that is captured via the microphone, and local memory (109) for persistent storage of data files and meta information used by the speech/audio engines (108) for processing captured speech/audio signals.” Par. 0016]

The combination fails to explicitly teach waveforms. However, Sheeder teaches:

the system is configured to assess the audio signal for detection of filler sounds by comparing a waveform of the audio signal to waveforms of sound files of predetermined filler sounds, [Sheeder, “For example, interstitial sounds can include hesitation sounds (e.g., “um,” “uh”); elongated syllables (e.g., an elongated “to(ooo)” at the end of the phrase “I'm going to”); repetitions (e.g., “and, and, and . . . ”); trailing filler words (e.g., “like,” “I mean”), and/or other indications that the user is likely to provide additional input audio belonging to the current utterance.” Par. 0049; “Audio input data 1016 can be compared with the speech segments of speech/sensor data 1029—for example, by comparing an audio waveform of audio input data (i.e., the claimed “audio signal”) 1016 with analogous waveforms of speech/sensor data (i.e., the claimed “sound files”) 1029, or by comparing parameters of audio input data 1016 (such as may be characterized at stage 1075) with analogous parameters of speech/sensor data 1029. Based on such comparisons, probability value 1078 can be determined for audio input data 1016.” Par. 0057; “…audio waveform data, sensor parameters, or raw sensor data described above); and to determine probability value 1078 based on the relative distances between an element of input data 1010 and its neighbors in the N-dimensional space. As another example, support vector machines can be used at stage 1076 to determine, based on speech/sensor database (i.e., the claimed “sound files of predetermined filler sounds”) 1029,” Par. 0059]

Liao, Tsiartas and Sheeder pertain to real-time speech systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the real-time speech systems art to modify Liao’s teachings of “machine learning models trained for identifying filler phrases and sounds in a text, upon detecting the utterance of the filler phrase or sound, enabling real time display of a notification on a display device” for “helping users to improve their public speaking and presentation skills” (Liao, Par. 0005, Par. 0037) with the teachings of a “speaker verification/speaker identification module” (Tsiartas, Par. 0089) and the teachings of “comparing an audio waveform of audio input data (i.e., the claimed “audio signal”) 1016 with analogous waveforms of speech/sensor data (i.e., the claimed “sound files”)” (Sheeder, Par. 0057) in order to enhance “speaker identification tools to diarize a multi-party recording and direct identified speech to appropriate instances of disclosed analytic engines”, “detect cases of mismatched identity”, and “provide detection of additional speaker data in non-labeled speech for building up additional background model information to improve model quality” (Tsiartas, Par. 0067) and to overcome the “problem of ASR systems of determining, promptly and in real-time” whether a “corresponding speech segment indicates a pause or an interstitial sound” (i.e., the claimed “filler words”) (Sheeder, Par. 0006, Par. 0057).

Regarding Claim 11, Liao in view of Tsiartas, Krasikov and Sheeder has been discussed above. The combination further teaches: the system is configured for a user to selectively update the sound files of predetermined filler sounds to either remove or add sound files. [Liao, see mappings applied to claims 6 and 7; Krasikov, see mappings applied to claims 6, 7, 8 and 10; Sheeder, see mapping applied to claim 10. The claim repeats, for sound files, the subject matter of a user selectively updating a predetermined list. Repeating steps known from the prior art with sound files is straightforward, amounts to the normal use of the teachings of Liao in view of Tsiartas, Krasikov and Sheeder, and is rejected under a similar rationale.]

Regarding Claim 12, Liao in view of Tsiartas, Krasikov and Sheeder has been discussed above. The combination further teaches: the system is configured to verify speaker identity [Tsiartas, see mapping applied to claim 1] by comparing one or more waveforms of the audio signal to one or more waveforms from sound files of user voice recordings, [Sheeder and Krasikov, see mappings applied to claim 10; Tsiartas, “…a comparison to one or more previously-obtained speech analytics of a particular user (such as a comparison of the user's current speech-based analytics to the user's previously measured analytics (i.e., the claimed “user voice recordings”), which may indicate a change in the user's state over a period of time), or a comparison to a population of users, e.g., a comparison of the user's particular analytics to a set of analytics that are representative of a selected reference demographic such as age, gender, language fluency, etc.” Par. 0041] the sound files of the user voice recordings being stored in the memory. [Sheeder and Krasikov, see mappings applied to claim 10; Tsiartas, see the same passage at Par. 0041]

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Giovanardi et al. (U.S. Patent Application Publication 2024/0428791) teaches identifying filler speech. Vuarrraz (WO2024079605) teaches identifying filler speech. Pakhomov et al. (U.S. Patent 9,576,593) teaches identifying filler speech using audio waveforms.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to EUNICE LEE whose telephone number is 571-272-1886. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/EUNICE LEE/
Examiner, Art Unit 2656

/BHAVESH M MEHTA/
Supervisory Patent Examiner, Art Unit 2656
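Editor's note: Read together, claims 1, 6, 8, 13, and 14 describe a small pipeline: transcribe audio, text-search the transcript against a user-editable filler list held in memory, gate the result on speaker verification, emit a discreet cue, and log a reviewable history. The sketch below only illustrates that flow under stated assumptions; it is not the applicant's code or any cited reference's. The class and method names are hypothetical, and ASR and speaker verification are stubbed out.

```python
# Minimal, hypothetical sketch of the claimed flow: transcript text-search
# against a predetermined filler list "stored in the memory", gated by a
# speaker-verification flag, with a discreet cue and a reviewable history.
from dataclasses import dataclass, field

# User-editable predetermined list (claims 6 and 8); single-word fillers only here.
FILLER_WORDS = {"um", "uh", "like", "basically"}

@dataclass
class FillerFeedback:
    filler_words: set = field(default_factory=lambda: set(FILLER_WORDS))
    history: list = field(default_factory=list)   # reviewable log (claim 14)

    def detect(self, transcript: str) -> list:
        """Text-search the transcript for words on the predetermined list."""
        words = transcript.lower().replace(",", " ").split()
        return [w for w in words if w in self.filler_words]

    def process(self, transcript: str, speaker_verified: bool) -> None:
        """Only notify when fillers are found AND the speaker is the target user."""
        hits = self.detect(transcript)
        if hits and speaker_verified:
            self.history.extend(hits)
            self.notify(hits)

    def notify(self, hits: list) -> None:
        # Stand-in for a haptic buzz, soft tone, or on-screen cue (claim 13).
        print(f"[discreet cue] {len(hits)} filler word(s): {', '.join(hits)}")

fb = FillerFeedback()
fb.process("so, um, basically the results were, like, fine", speaker_verified=True)
print(fb.history)   # ['um', 'basically', 'like']
```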

Prosecution Timeline

Oct 30, 2023
Application Filed
Sep 24, 2025
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603078
GENERATING SPEECH DATA USING ARTIFICIAL INTELLIGENCE TECHNIQUES
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12597365
AUTOMATIC TRANSLATION BETWEEN SIGN LANGUAGE AND SPOKEN LANGUAGE
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12585876
METHOD OF TRAINING POS TAGGING MODEL, COMPUTER-READABLE RECORDING MEDIUM AND POS TAGGING METHOD
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12579385
EMBEDDED TRANSLATE, SUMMARIZE, AND AUTO READ
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12566928
READABILITY BASED CONFIDENCE SCORE FOR LARGE LANGUAGE MODELS
Granted Mar 03, 2026 (2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 89%
With Interview: 99% (+27.3%)
Median Time to Grant: 2y 10m
PTA Risk: Low

Based on 27 resolved cases by this examiner. Grant probability derived from career allow rate.
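
The headline figures can be reproduced from the counts shown above: 24 grants out of 27 resolved cases is 88.9%, displayed as 89%. A short sketch, assuming the with-interview figure simply adds the +27.3% lift and caps at 99% (the cap is our assumption; only the counts and the lift come from this page):

```python
# Reproduce the dashboard's headline numbers from its stated inputs.
# Assumption: grant probability = grants / resolved, and the with-interview
# figure adds the +27.3% lift, capped at 99%. The cap is our guess.
granted, resolved = 24, 27
allow_rate = granted / resolved                      # 0.888... -> "89%"
interview_lift = 0.273                               # +27.3% lift from the page
with_interview = min(allow_rate + interview_lift, 0.99)

print(f"career allow rate: {allow_rate:.0%}")        # 89%
print(f"with interview:    {with_interview:.0%}")    # 99%
```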
