DETAILED ACTION
This action is in response to the communication filed on 11/11/2025. Claims 1-20 are currently pending; claims 1, 19, and 20 have been amended.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDSs) submitted on 08/14/2025, 11/11/2025, and 01/14/2026 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements have been considered by the examiner.
Response to Amendment
Applicant's arguments regarding Priority have been fully considered but they are not persuasive. On page 7 of the Response, Applicant asserts that the Provisional’s descriptions of (1) “outputting decoded silent speech can be performed in real-time with the user speaking” and (2) “signal measurements produced by speech (e.g., EMG signals) can be measured up to 200 milliseconds or longer prior to the sound produced by the speech” provide adequate support for “substantially in parallel” and “…elapsed from when the signal indicative of the speech muscle activation patterns of the user were measured”, respectively. The Examiner respectfully disagrees.
“Real-time” is a term of art which generally refers to the timing of an event or output, whereas “in parallel” is a term of art which generally refers to the sequence of events/outputs. Although these terms may be related, the provisional application does not provide a clear mapping between “real-time” and “in parallel.”
Applicant's arguments regarding the rejection of claim 12 under §112, fourth paragraph, have been fully considered but they are not persuasive. On page 7, last paragraph to page 8, second paragraph, of the Response, Applicant appears to argue that the disclosure that “the user’s speech can occur over an extended period of time” somehow distinguishes between outputting “the synthesized audio signal of the input speech substantially in parallel in time with the user speaking” as recited in independent claim 1 and the limitation “outputting the audio signal of the input speech of the user substantially in parallel in time with the user speaking comprises playing back the audio signal of the input speech of the user at a time that has elapsed from when the signal indicative of the speech muscle activation patterns of the user were measured.” The Examiner respectfully disagrees.
Although the Examiner appreciates that the specification supports that the “speech muscle activation patterns” occur some time prior to the vocal speech sounds being output by the user, the currently pending claim language does not specify “the user speaking” as being when the vocal sounds are output. Under the broadest reasonable interpretation, a user speaking includes the speech muscle activation. Accordingly, the added limitation of “at a time that has elapsed…” as recited in claim 12 either (1) improperly broadens the substantially in parallel limitation of claim 1 or (2) does not alter the limitation of claim 1, as in parallel with the user speaking necessarily requires activation of the speech muscles to have occurred prior to the outputting of the audio signal. Therefore, the previous rejection under 35 U.S.C. 112, fourth paragraph, is maintained.
If Applicant’s intention is to distinguish between outputting the synthesized audio upon detection of the speech muscle activation but before the vocal output and/or some time after the vocal output, the Examiner suggests amending the claims to clarify what is meant by “the user speaking.”
Applicant’s arguments with respect to amended claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.
The following is a quotation of pre-AIA 35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA 35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.
Claim 12 is rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends.
Claim 12, which depends from claim 1, recites that the “outputting the audio signal of the input speech of the user substantially in parallel in time with the user speaking” as recited in claim 1 comprises “playing back the audio signal of the input speech of the user at a time that has elapsed from when the signal indicative of the speech muscle activation patterns of the user were measured”. Accordingly, the independent base claim requires that the output be “substantially in parallel in time with the user speaking”, which limits the outputting to substantially the same time as the user speech. However, claim 12 now recites that the output occurs after some time has elapsed from when the signal indicative of the speech muscle activation patterns of the user was measured, which either (1) improperly broadens the substantially in parallel limitation of claim 1 or (2) does not alter the limitation of claim 1, as in parallel with the user speaking necessarily requires activation of the speech muscles to have occurred prior to the outputting of the audio signal. Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements.
For examination purposes, claim 12 is interpreted as limiting “substantially in parallel” to being some predefined time period after detection of activation of the speech muscles.
Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1-3, 12-15 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Diener et al., “An Initial Investigation into the Real-Time Conversion of Facial Surface EMG Signals to Audible Speech,” 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, USA, 2016, pp. 888-891 (herein “Diener”), in view of Antunes et al. (US 10,834,265 B1; herein “Antunes”).
Regarding claims 1, 19 and 20, Diener teaches a system, method, and non-transitory computer readable medium for synthesizing input speech of a user, the system comprising:
a speech system configured to measure a signal indicative of speech muscle activation patterns of the user when the user is speaking (p. 888, section I, 2nd paragraph teaches “Our approach to building an SSI uses facial surface electromyography to capture electrical signals generated by the articulatory muscles and directly (without first performing speech recognition) converts them to audible speech”);
a machine learning model configured to synthesize an audio signal of the input speech of the user using the signal indicative of the speech muscle activation patterns of the user (p. 889, Section II. D. Mapping teaches “we use a deep neural network (DNN) using rectified-linear units to perform the mapping of EMG to audio features”); and
a processor configured to output the synthesized audio signal of the input speech substantially in parallel in time with the user speaking (p. 890, Fig. 4 illustrates the complete pipeline during real-time conversion; p. 888, Abstract teaches “Our system allows for performing EMG-to-Speech conversion with low latency…enabling near instantaneous audio output during audible as well as silent speech production”; and p. 888, section II teaches “EMG pre-processing, mapping and synthesis have to be performed both in real-time…as well as with low latency (i.e., with a low delay between user speech production and audio output)”).
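For illustration only, the frame-wise streaming structure Diener describes (acquire EMG, featurize, map with a DNN, synthesize, output) can be sketched as follows. This is not code from the reference; the function names and the toy feature extractor are hypothetical placeholders.

```python
import numpy as np

def extract_emg_features(frame: np.ndarray) -> np.ndarray:
    """Toy per-channel time-domain features (mean and power); placeholder only."""
    return np.stack([frame.mean(axis=-1), (frame ** 2).mean(axis=-1)], axis=-1)

def run_pipeline(emg_frames, dnn_predict, synthesize, play):
    """Frame-wise loop: EMG frame -> EMG features -> audio features -> waveform.
    dnn_predict, synthesize, and play stand in for the DNN mapping, the
    synthesis stage, and the audio output device, respectively."""
    for frame in emg_frames:                     # continuous stream of EMG data
        features = extract_emg_features(frame)
        audio_features = dnn_predict(features)   # DNN maps EMG to audio features
        waveform = synthesize(audio_features)    # e.g., spectrogram-to-waveform
        play(waveform)                           # low-latency audio output
```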
Although Diener teaches frame buffering, Diener fails to explicitly disclose receiving a first audio prediction from the machine learning model, the first audio prediction comprising a first audio frame of the synthesized audio signal and a second audio frame of the synthesized audio signal; outputting the first audio frame; and outputting the second audio frame when there is a delay between an end of outputting the first audio frame and receiving a second audio prediction.
Antunes teaches a de-jitter buffer for processing audio data, e.g., voice-over-IP data, that includes, inter alia, receiving a first audio prediction from the machine learning model, the first audio prediction comprising a first audio frame of the synthesized audio signal and a second audio frame of the synthesized audio signal (Col. 7, lines 59-67 teaches “At the receiver side, voice packets are stored in a queue to be played back, which is generally referred to as the de-jitter buffer…”);
outputting the first audio frame (Col. 7, lines 59-67 teaches “…The de-jitter buffer generally stores packets and delays their playback for a specified maximum amount of time to reduce the effect of packet delay variations…”); and
outputting the second audio frame when there is a delay between an end of outputting the first audio frame and receiving a second audio prediction (Col. 7, lines 59-67 teaches “…The de-jitter buffer generally stores packets and delays their playback for a specified maximum amount of time to reduce the effect of packet delay variations…”).
Diener differs from the claimed invention, as defined by claims 1, 19, and 20, in that Diener fails to explicitly disclose utilizing a de-jitter buffer to reduce the impact of delays in receiving the audio frames. De-jitter buffers are well known in the art as evidenced by Antunes. Therefore, it would have been obvious to one having ordinary skill in the art, before the effective filing date of the invention, to have modified the system taught by Diener to include a de-jitter buffer as taught by Antunes, as it merely constitutes the combination of known architecture to achieve the predictable result of reducing or preventing audio jitter due to latency.
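As a minimal sketch of the claimed buffering behavior (not Antunes’s implementation; class and method names are hypothetical), a prediction carrying two audio frames allows playback to continue when the next prediction is late:

```python
from collections import deque

class DeJitterBuffer:
    """Minimal de-jitter buffer sketch: queue decoded audio frames so playback
    continues when the next prediction arrives late."""

    def __init__(self):
        self.frames = deque()

    def push_prediction(self, frames):
        # A single prediction may carry multiple frames (e.g., a first and a
        # second audio frame of the synthesized signal).
        self.frames.extend(frames)

    def next_frame(self):
        # If the second prediction has not arrived when the first frame ends,
        # the buffered second frame is output to cover the gap.
        return self.frames.popleft() if self.frames else None
```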
Regarding claim 2, the combination of Diener and Antunes teaches all of the elements of claim 1 (see detailed element mapping above). In addition, Diener further teaches synthesizing the audio signal of the input speech of the user comprises:
inputting the signal indicative of the speech muscle activation patterns of the user to the machine learning model to generate a representation of the audio signal of the input speech of the user (Under a broadest reasonable interpretation “muscle activation patterns” is interpreted as corresponding to electrical signals generated by articulatory muscles. p. 888, section I, 2nd paragraph teaches “Our approach to building an SSI uses facial surface electromyography to capture electrical signals generated by the articulatory muscles and directly (without first performing speech recognition) converts them to audible speech”); and
synthesizing the audio signal of the input speech of the user using the representation of the audio signal (p. 890, Fig. 4, the DNN mapping is fed to the synthesis module and p. 888, Abstract teaches “extraction of EMG features, mapping of EMG features to audio features, synthesis of audio waveforms from audio features and output of the audio waveforms via speakers or headphones”).
Regarding claim 3, the combination of Diener and Antunes teaches all of the elements of claim 2 (see detailed element mapping above). In addition, Diener further teaches the representation of the audio signal comprises a spectrogram of the input speech of the user (p. 889, Section II.E. Synthesis teaches “As our audio features are magnitude spectrograms…”).
Regarding claim 12, the combination of Diener and Antunes teaches all of the elements of claim 1 (see detailed element mapping above). In addition, Diener further teaches outputting the audio signal of the input speech of the user substantially in parallel in time with the user speaking comprises playing back the audio signal of the input speech of the user at a time that has elapsed from when the signal indicative of the speech muscle activation patterns of the user were measured (p. 888, section II teaches “EMG pre-processing, mapping and synthesis have to be performed both in real-time…as well as with low latency (i.e., with a low delay between user speech production and audio output)”).
Regarding claim 13, the combination of Diener and Antunes teaches all of the elements of claim 12 (see detailed element mapping above). In addition, Diener further teaches the time that has elapsed is less than 200ms (p. 890, Section III. A. Component latency teaches “Computation latency: Latency due to the computation time taken up by feature calculation, mapping and synthesis…total overall latency l = lnet + lbuf + lcomp =…= 49.97 ms”).
Regarding claim 14, the combination of Diener and Antunes teaches all of the elements of claim 12 (see detailed element mapping above). In addition, Diener further teaches the time that has elapsed is less than 50ms (p. 890, Section III. A. Component latency teaches “Computation latency: Latency due to the computation time taken up by feature calculation, mapping and synthesis…total overall latency l = lnet + lbuf + lcomp =…= 49.97 ms”).
Regarding claim 15, the combination of Diener and Antunes teaches all of the elements of claim 12 (see detailed element mapping above). In addition, Diener further teaches the time that has elapsed is a period between when the speech muscle activation patterns of the user are produced and when a sound would be produced if the user were to speak out loud (the time period/delay from speech muscle activation to when a sound would be produced is interpreted as about 200 ms based on the Specification p. 31, lines 2-7. p. 890, Section III. A. Component latency teaches “Computation latency: Latency due to the computation time taken up by feature calculation, mapping and synthesis…total overall latency l = lnet + lbuf + lcomp =…= 49.97 ms”).
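For reference, Diener’s reported latency budget (the individual component values are elided in the quotation above) sums to a total that satisfies both the less-than-200 ms limitation of claim 13 and the less-than-50 ms limitation of claim 14:

```latex
l = l_{\text{net}} + l_{\text{buf}} + l_{\text{comp}} = 49.97~\text{ms} < 50~\text{ms} < 200~\text{ms}
```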
Regarding claim 18, the combination of Diener and Antunes teaches all of the elements of claim 1 (see detailed element mapping above). In addition, Diener further teaches the processor is configured to output the synthesized audio signal to a receiving device configured to playback the synthesized audio signal (p. 888, Abstract teaches “…synthesis of audio waveforms from audio features and output of the audio waveforms via speakers or headphones”; the speakers or headphones are interpreted as a receiving device).
Claims 4, 6, 7 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Diener and Antunes as applied to claim 1 above, and further in view of Kapur et al., “AlterEgo: A Personalized Wearable Silent Speech Interface” (cited by Applicant in the IDS filed on 05/16/2024; herein “Kapur”).
Regarding claim 4, the combination of Diener and Antunes teaches all of the elements of claim 1 (see detailed element mapping above). In addition, Diener further teaches the speech system is a wearable device comprising an electromyography (EMG) sensor, whereby the signal indicative of the speech muscle activation patterns of the user when the user is speaking comprises EMG data received from the EMG sensor when the user is speaking (p. 888, section I, 2nd paragraph teaches “Our approach to building an SSI uses facial surface electromyography to capture electrical signals generated by the articulatory muscles and directly (without first performing speech recognition) converts them to audible speech” and Abstract teaches “The system uses a pipeline approach to enable only acquisition of EMG data, extraction of EMG features…” and Fig. 1 illustrates positioning of EMG electrodes).
Diener fails to explicitly disclose that the EMG acquisition device is a wearable device.
Kapur teaches a wearable device comprising electrodes for acquiring EMG signals from muscular areas of interest. See Fig. 2 and Fig. 3. Thus, Kapur teaches a wearable device comprising an electromyography (EMG) sensor, whereby the signal indicative of the speech muscle activation patterns of the user when the user is speaking comprises EMG data received from the EMG sensor when the user is speaking.
The combination of Diener and Antunes differs from the claimed invention as defined by claim 4, in that the combination fails to disclose the EMG sensors are part of a wearable device. Wearable non-invasive devices for acquiring EMG signals are known in the art as evidenced by Kapur. Therefore, it would have been obvious to one having ordinary skill in the art, before the effective filing date of the invention, to have modified the SSI system taught by the combination of Diener and Antunes to include a wearable EMG sensor device as taught by Kapur, as it merely constitutes the combination of known elements to achieve the predictable result of a wearable SSI interface.
Regarding claim 6, the combination of Diener, Antunes and Kapur teaches all of the elements of claim 4 (see detailed element mapping above). In addition, Diener further teaches the system further comprises a vocoder implementing an algorithm (p. 889, Section II.E. Synthesis teaches “As our audio features are magnitude spectrograms, the problem of synthesizing a waveform…we use the method proposed by Griffin and Lim [12], implemented to operate on a continuous stream of data” The method proposed by Griffin and Lim utilizes a vocoder.); and
synthesizing the audio signal of the input speech of the user comprises:
using the machine learning model to convert the EMG data to a spectrogram (p. 889, Section II.C2) Audio features teaches “to represent the audio signal, our system uses a magnitude spectrogram” and Section II. D. teaches “we use a deep neural network (DNN)…to perform the mapping of EMG to audio features”); and
using the vocoder implementing an algorithm to convert the spectrogram to the audio signal representing the speech of the user (p. 889, Section II.E. Synthesis teaches “As our audio features are magnitude spectrograms, the problem of synthesizing a waveform…we use the method proposed by Griffin and Lim [12], implemented to operate on a continuous stream of data” The method proposed by Griffin and Lim utilizes a vocoder.).
Regarding claim 7, the combination of Diener, Antunes and Kapur teaches all of the elements of claim 6 (see detailed element mapping above). In addition, Diener further teaches the algorithm implemented by the vocoder is a Griffin-Lim algorithm (p. 889, Section II.E. Synthesis teaches “As our audio features are magnitude spectrograms, the problem of synthesizing a waveform…we use the method proposed by Griffin and Lim [12], implemented to operate on a continuous stream of data”).
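For illustration, a minimal batch sketch of the Griffin-Lim iteration on a magnitude spectrogram follows. Diener’s implementation operates on a continuous stream of data, which this batch sketch does not reproduce, and the parameter values below are arbitrary assumptions.

```python
import numpy as np
import librosa

def griffin_lim(magnitude, n_iter=32, n_fft=1024, hop_length=256):
    """Reconstruct a waveform from a magnitude spectrogram by alternately
    enforcing the known magnitude and a consistent phase estimate."""
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random initial phase
    stft = magnitude.astype(np.complex128) * angles
    for _ in range(n_iter):
        audio = librosa.istft(stft, hop_length=hop_length)     # candidate waveform
        rebuilt = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(rebuilt))                # keep phase only
        stft = magnitude * angles                              # re-impose magnitude
    return librosa.istft(stft, hop_length=hop_length)
```

librosa also packages this procedure as librosa.griffinlim, which could be used in place of the explicit loop above.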
Regarding claim 11, the combination of Diener, Antunes and Kapur teaches all of the elements of claim 4 (see detailed element mapping above). In addition, Diener further teaches the EMG sensor is configured to measure the EMG data when the user is speaking silently (p. 888, abstract teaches “Our system allows for performing EMG-to-Speech conversion with low latency and on a continuous stream of EMG data, enabling near instantaneous audio output during audible as well as silent speech production” ).
Claims 5 and 8-10 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Diener, Antunes and Kapur as applied to claim 4 above, and further in view of Gabrys et al. (US 2023/0260502 A1; herein “Gabrys”).
Regarding claim 5, the combination of Diener, Antunes and Kapur teaches all of the elements of claim 4 (see detailed element mapping above). Although Diener teaches using a first machine learning model to convert the EMG data to a spectrogram (see discussion above with respect to claim 1), Diener fails to disclose that the system further comprises a second machine learning model and that synthesizing the audio signal of the input speech of the user comprises: using the second machine learning model to convert the spectrogram to the audio signal of the input speech of the user.
Gabrys teaches a speech synthesis system comprising a second machine learning model, wherein synthesizing the audio signal of the input speech of the user comprises using the second machine learning model to convert the spectrogram to the audio signal of the input speech of the user (¶[0039] teaches “A vocoder component 195 may receive the modified spectrogram data 184 and generate audio waveform data 315. The vocoder component 195 may be, for example, a universal neural vocoder based on Parallel WaveNet. The vocoder component 195 may take as input audio data in the form of, for example, a Mel-spectrogram…”).
The combination of Diener, Antunes and Kapur differs from the claimed invention, as defined by claim 5, in that the combination fails to disclose that the vocoder utilized to synthesize the spectrogram data generated by the first machine learning model is a neural vocoder, e.g., includes a second machine learning model. Neural vocoders for synthesizing spectrogram data generated by a first machine learning model are known in the art as evidenced by Gabrys. Therefore, it would have been obvious to one having ordinary skill in the art to have modified the synthesis vocoder taught by the combination of Diener, Antunes and Kapur to include a neural vocoder as taught by Gabrys in order to provide a speech synthesis device which is capable of creating customized synthesized speech for many different potential target voices (Gabrys, ¶[0021]).
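As a sketch of the two-model arrangement the claim recites (all callables are hypothetical placeholders; Gabrys’s vocoder is a Parallel WaveNet-based neural vocoder, which this placeholder does not reproduce):

```python
def synthesize_speech(emg_features, emg_to_spec, neural_vocoder):
    """Two-stage synthesis: a first learned model maps EMG features to a
    (mel-)spectrogram; a second learned model, the neural vocoder, converts
    that spectrogram to an audio waveform."""
    spectrogram = emg_to_spec(emg_features)   # first machine learning model
    return neural_vocoder(spectrogram)        # second machine learning model
```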
Regarding claim 8, the combination of Diener, Antunes and Kapur teaches all of the elements of claim 4 (see detailed element mapping above). However, the combination of Diener, Antunes and Kapur fails to disclose that the machine learning model is trained to synthesize the audio signal of the input speech of the user from the EMG data in one of a plurality of voices.
Gabrys teaches a speech synthesis system in which the machine learning model is trained to synthesize the audio signal of the input speech of the user in one of a plurality of voices (¶[0021] teaches a speech synthesis system “incorporating the voice-modifying model may be used to create customized synthesized speech for many different potential target voices”).
The combination of Diener, Antunes and Kapur differs from the claimed invention, as defined by claim 8, in that the combination fails to disclose that the synthesis system is trained to synthesize the audio/speech data in one of a plurality of voices. Speech synthesis systems that are trained to synthesize audio in one of a plurality of target voices are known in the art as evidenced by Gabrys. Therefore, it would have been obvious to one having ordinary skill in the art to have modified the synthesis vocoder taught by the combination of Diener, Antunes and Kapur to include a voice-modifying model as taught by Gabrys in order to provide a speech synthesis device which is capable of creating customized synthesized speech for many different potential target voices (Gabrys, ¶[0021]).
Regarding claim 9, the combination of Diener, Antunes, Kapur, and Gabrys teaches all of the elements of claim 8 (see detailed element mapping above). In addition, Diener further teaches a first voice option of the plurality of voices comprises speech mimicking how the user should hear the own voice of the user (“mimicking how the user should hear the own voice of the user” is interpreted, based on the Specification p. 31, lines 4-7, as the auditory feedback signal being played back with a latency of 200 ms or less. p. 890, Section III. A. Component latency teaches “Computation latency: Latency due to the computation time taken up by feature calculation, mapping and synthesis…total overall latency l = lnet + lbuf + lcomp =…= 49.97 ms”).
Regarding claim 10, the combination of Diener, Antunes, Kapur and Gabrys teaches all of the elements of claim 9 (see detailed element mapping above). In addition, Gabrys further teaches the processor is further configured to change one or more attributes of the first voice option (¶[0039] teaches “the voice modifier component 190 may receive the synthesized spectrogram data 182, the adapted frequency data, and the target speaker embedding 166 of the target voice and generate a modified spectrogram data 184”).
The combination of Diener, Antunes and Kapur differs from the claimed invention, as defined by claim 10, in that the combination fails to disclose changing one or more attributes of the first voice option. Speech synthesis systems that are trained to synthesize audio in one of a plurality of target voices by modifying attributes of the target voice are known in the art as evidenced by Gabrys. Therefore, it would have been obvious to one having ordinary skill in the art to have modified the synthesis vocoder taught by the combination of Diener, Antunes and Kapur to include a voice-modifying model as taught by Gabrys in order to provide a speech synthesis device which is capable of creating customized synthesized speech for many different potential target voices (Gabrys, ¶[0021]).
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Diener and Antunes as applied to claim 12 above, and further in view of Vandikas et al. (WO 2023/128847; herein “Vandikas”).
Regarding claim 16, the combination of Diener and Antunes teaches all of the elements of claim 12 (see detailed element mapping above). In addition, Diener further teaches the audio signal of the input speech of the user is a first audio signal and the signal indicative of the speech muscle activation patterns of the user is a first signal.
However, the combination of Diener and Antunes fails to disclose that the processor is further configured to, following the playback of the first audio signal, receive a second audio signal and a second signal indicative of the speech muscle activation patterns of the user indicative of the user speaking a correcting word; and the machine learning model is further configured to receive as input the second audio signal and the second signal indicative of the speech muscle activation patterns of the user to calibrate the machine learning model based on the correcting word.
Vandikas teaches a face mask for capturing speech produced by a wearer, wherein the mask includes sensors adapted to capture changes in the shape of a part of a face of the wearer while producing speech (Vandikas, ¶[006]). In addition, Vandikas teaches the processor is further configured to, following the playback of the first audio signal, receive a second audio signal and a second signal indicative of the speech muscle activation patterns of the user indicative of the user speaking a correcting word (¶[0058] teaches “the wearer trains the machine learning model for the niche technical environment by uttering each word while the face mask 200 generates strain maps…” the words uttered during retraining for the calibration to the wearer’s speech are considered second audio signals and second signals indicative of the speech muscle activation patterns); and the machine learning model is further configured to receive as input the second audio signal and the second signal indicative of the speech muscle activation patterns of the user to calibrate the machine learning model based on the correcting word (¶[0058] teaches “the wearer trains the machine learning model for the niche technical environment by uttering each word while the face mask 200 generates strain maps. The strain maps may then be converted to a graph and then a graph-to-syllable model will learn to associate graphs with syllables, i.e., the syllables used in building words for the specific vocabulary. The face mask 200 which is pre-installed with graph-to-syllable models mitigates the extent of training required to calibrate the face mask 200 to the wearer’s speech”).
The combination of Diener and Antunes differs from the claimed invention as defined by claim 16, in that the combination fails to explicitly disclose receiving correcting words for calibrating the silent speech interface device. Silent Speech Interface devices which use correcting words, i.e., user speech while wearing the device, are known in the art as evidenced by Vandikas. Therefore, it would have been obvious to one having ordinary skill in the art to have modified the Silent Speech Interface taught by the combination of Diener and Antunes to include training the machine learning model for the niche technical environment by uttering each word while the face mask generates strain maps as taught by Vandikas in order to mitigate the extent of training required to calibrate the face mask to the wearer’s speech (Vandikas, ¶[0058]).
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Diener and Antunes as applied to claim 1 above, and further in view of Kapur et al. (US 2019/0074012 A1; cited by Applicant in the IDS filed on 05/16/2024; herein “Kapur ‘012”).
Regarding claim 17, the combination of Diener and Antunes teaches all of the elements of claim 1 (see detailed element mapping above). However, the combination of Diener and Antunes fails to disclose the processor is further configured to detect a pause in the speech of the user and play back the audio signal in response to detecting the pause in the speech of the user.
Kapur ‘012 teaches a silent speech interface that includes detecting a pause in the speech of the user and playing back the audio signal in response to detecting the pause in the speech of the user (¶[0106] teaches “the system may buffer data regarding a group of phonemes until a pause exceeds the pause limit (e.g., 4 seconds); and then (b) the system may forward data regarding the group of phonemes to the NLP”).
The combination of Diener and Antunes differs from the claimed invention as defined by claim 17, in that the combination fails to explicitly disclose buffering the audio signal until a pause is detected. Silent Speech Interfaces which detect pauses in the user’s speech to buffer words or phrases are known in the art as evidenced by Kapur ‘012. Therefore, it would have been obvious to one having ordinary skill in the art to have modified the Silent Speech Interface taught by the combination of Diener and Antunes to include detecting pauses in the user’s speech as taught by Kapur ‘012, as it merely constitutes the combination of known processes to achieve the predictable result of buffering words or phrases prior to NLP processing.
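As a sketch of the pause-gated buffering Kapur ‘012 describes (the event format and function name are hypothetical; the 4-second limit is the example given in ¶[0106]):

```python
def buffer_until_pause(phoneme_events, pause_limit=4.0):
    """Accumulate decoded phonemes and release the buffered group once the gap
    between consecutive events exceeds pause_limit seconds."""
    group, last_t = [], None
    for t, phoneme in phoneme_events:          # events: (timestamp_sec, phoneme)
        if last_t is not None and t - last_t > pause_limit:
            yield group                        # forward group (e.g., to an NLP stage)
            group = []
        group.append(phoneme)
        last_t = t
    if group:
        yield group                            # flush any trailing phonemes
```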
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PENNY L CAUDLE whose telephone number is (703)756-1432. The examiner can normally be reached M-Th 8:00 am to 5:00 pm eastern.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PENNY L CAUDLE/Examiner, Art Unit 2657
/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657