Prosecution Insights
Last updated: April 19, 2026
Application No. 18/055,739

DETECTING AND CLASSIFYING FILLER WORDS IN AUDIO USING NEURAL NETWORKS

Final Rejection — §103
Filed: Nov 15, 2022
Examiner: MUELLER, PAUL JOSEPH
Art Unit: 2657
Tech Center: 2600 — Communications
Assignee: Adobe Inc.
OA Round: 4 (Final)

Grant Probability: 76% (Favorable)
Expected OA Rounds: 5-6
Time to Grant: 3y 0m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 76% — above average (97 granted / 128 resolved; +13.8% vs TC avg)
Interview Lift: +34.6% — strong (allowance lift among resolved cases with an interview)
Typical Timeline: 3y 0m average prosecution; 25 applications currently pending
Career History: 153 total applications across all art units

Statute-Specific Performance

§101: 13.2% (-26.8% vs TC avg)
§102: 7.4% (-32.6% vs TC avg)
§103: 62.2% (+22.2% vs TC avg)
§112: 14.8% (-25.2% vs TC avg)

Tech Center averages are estimates. Based on career data from 128 resolved cases.

Office Action — §103
DETAILED ACTION

Introduction

This office action is in response to Applicant’s amendment filed on October 22, 2025. Claims 1, 2, 5, 9, 10, 13, 17, 18 and 20 have been amended. Claims 4 and 12 have been previously cancelled. Claims 1-3, 5-11 and 13-20 are pending in the application. As such, claims 1-3, 5-11 and 13-20 have been examined.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings

The drawings were received on November 15, 2022. These drawings have been accepted and considered by the Examiner.

Response to Amendments and Arguments

The amendments to claims 1, 2, 5, 9, 10, 13, 17, 18 and 20 have been acknowledged and entered. In view of the amendments, the objections to claims 2, 10 and 18 have been withdrawn, and the prior rejections of claims 1-3, 5-11 and 13-20 under 35 U.S.C. 103 have been withdrawn. In light of the amendments, new grounds of rejection for claims 1-3, 5-11 and 13-20 under 35 U.S.C. 103 are provided below. The new grounds of rejection are based at least upon the following new elements: receiving an input including a media sequence, the media sequence including a video sequence paired with an audio sequence; and generating an output media sequence based on the modification to the visualization of the audio sequence, including adding transitions to the video sequence and the audio sequence at a point of removal of the at least one filler word.

Applicant’s arguments regarding the prior art rejections under 35 U.S.C. 103, received on October 22, 2025, have been fully considered. The arguments with respect to claims 1-3, 5-11 and 13-20 are directed to the newly amended matter in the claims, are not considered persuasive, and are addressed in the updated rejection rationale below.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 7-9 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over de Brébisson et al. (US Patent Pub. No. 20220036005 A1), hereinafter de Brébisson, in view of Siyavudeen et al. (US Patent Pub. No. 20220399024 A1), hereinafter Siyavudeen, in view of Lewis et al. (US Patent Pub. No. 20230396833 A1), hereinafter Lewis, in view of Warnick et al. (US Patent No. 12154598 B1), hereinafter Warnick (as supported by provisional 63/324,714 filed 3/29/2022).
Regarding claims 1 and 9, de Brébisson teaches a computer-implemented method and a non-transitory computer-readable storage medium (de Brébisson in [0024, claim 1] teaches using computer programs and associated computer-implemented techniques, and a non-transitory computer-readable medium) [claim 9 only] storing executable instructions, which when executed by a processing device, cause the processing device to perform operations (de Brébisson in [0043] teaches storing instructions that can be executed by the processor) comprising:

receiving an input including [a media sequence, the media sequence including a video sequence paired with] an audio sequence (de Brébisson in [0085] teaches the media production platform can produce a transcript by transcribing the audio file received);

determining filler word candidates [by comparing locations of voice activity in the audio sequence and a transcription] of the audio sequence (de Brébisson in [0067] teaches a media production platform identifies the filler words in a transcript);

displaying a visualization of the audio sequence, the visualization of the audio sequence including an identification of filler words identified from the filler word candidates (de Brébisson in [0025] teaches using an interface for editing a text or transcript, and in [0071] teaches each filler word may be visually highlighted by being rendered in a different color, underlined, bolded, etc.);

receiving a modification to the visualization of the audio sequence, the modification including a selection of at least one filler word for removal (de Brébisson in [0085] teaches if the user is interested in deleting one or more words in the transcript (e.g., the phrase “no, no, no”), the individual may simply indicate as much by interacting with those word(s) in the word bar located along the bottom of the interface; the media production platform can then make appropriate adjustments to the underlying audio file, e.g., by deleting the portion in which those word(s) are uttered); and

generating an output [media] sequence based on the modification to the visualization of the audio sequence, [including adding transitions to the video sequence and the audio sequence at a point of removal of the at least one filler word] (de Brébisson in [0084]-[0085] teaches each word included in the transcript may be associated with a corresponding portion of the audio file so as to enable easy modification; as an example, if the individual is interested in deleting one or more words in the transcript (e.g., the phrase “no, no, no”), the individual may simply indicate as much by interacting with those word(s) in the word bar located along the bottom of the interface, and the media production platform can then make appropriate adjustments to the underlying audio file (e.g., by deleting the portion in which those word(s) are uttered)).
de Brébisson does not teach, however Siyavudeen teaches classifying, by a filler word classification model, each filler word candidate of the filler word candidates into one of a set of categories (Siyavudeen in [0029, 0032, Fig. 4] teaches using a filler words classifier which generates an observed filler words feature of the text data, and a filler word feature table 460). Siyavudeen is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson further in view of Siyavudeen to allow for using a filler word classifier. Motivation to do so would allow for validating a digitized audio signal that is generated by a conference participant, where if the reference speech features are sufficiently similar to the observed speech features, the digitized audio signal is validated and the conference participant is allowed to remain in the conference (Siyavudeen [Abstract]).

de Brébisson, as modified above, teaches filler word candidates, the audio sequence, the output audio sequence, identification of a subset of the filler word candidates in a filler words category, and identified filler words. de Brébisson, as modified above, does not teach, however Lewis teaches determining filler word candidates by comparing locations of voice activity in the audio sequence and a transcription of the audio sequence (Lewis in [0016] teaches using timestamps for disfluency events (filler words), and in [0026] teaches using timestamps for uttered words of a transcript, for direct analysis of comparing the timestamps of the disfluencies to the timestamps of the words in the transcript to confirm if a disfluency should be removed). Lewis is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Lewis to allow for using timestamps to identify filler words. Motivation to do so would allow for a more thorough detection of disfluency segments (Lewis [0049]).

de Brébisson, as modified above, teaches receiving an input including an audio sequence, and generating an output audio sequence based on the modification to the visualization of the audio sequence. de Brébisson, as modified above, does not teach, however Warnick teaches receiving an input including a media sequence, the media sequence including a video sequence paired with an audio sequence (Warnick in [col 11 lines 1-20] teaches receiving video clips with corresponding audio), generating an output media sequence based on the modification to the visualization of the audio sequence (Warnick in [col 11 lines 1-20] teaches removing filler words and generating synthetic video that matches the automatically edited audio), including adding transitions to the video sequence and the audio sequence at a point of removal of the at least one filler word (Warnick in [col 10 lines 34-50] teaches generating an edited video segment corresponding to the removal of filler words, where the video is defocused briefly as part of the transitional effect between segments which are merged). Warnick is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Warnick to allow for generating an edited video segment corresponding to the removal of filler words, where the video is defocused briefly as part of the transitional effect between segments which are merged. Motivation to do so would provide for systems and methods that allow users to generate synthetic video segments that are synchronized with an edited audio track (Warnick [Abstract]).
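To make the combined teaching concrete, here is a minimal sketch of the candidate-detection step as the rejection reads it onto Lewis: voice-activity intervals that no transcribed word accounts for become filler-word candidates, each already carrying start and end timecodes. The interval type, function name, and overlap heuristic are illustrative assumptions, not code from any cited reference.

```python
# Illustrative only: assumes VAD output and transcript word timings as
# (start, end) intervals in seconds; none of this comes from the references.
from dataclasses import dataclass

@dataclass
class Interval:
    start: float  # seconds
    end: float    # seconds

def filler_word_candidates(vad_intervals, transcript_words, min_covered=0.5):
    """Return voice-activity intervals mostly uncovered by transcribed words."""
    candidates = []
    for v in vad_intervals:
        # Total overlap between this voice-activity interval and all words.
        covered = sum(
            max(0.0, min(v.end, w.end) - max(v.start, w.start))
            for w in transcript_words
        )
        # Speech the recognizer produced no word for is a filler candidate.
        if covered < min_covered * (v.end - v.start):
            candidates.append(v)
    return candidates
```

On this reading, an "um" spanning 1.2-1.6 s that the recognizer dropped from the transcript would survive the overlap test and be handed to the classifier with its timecodes intact.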
Regarding claims 7 and 15, de Brébisson, as modified above, teaches the computer-implemented method and non-transitory computer-readable storage medium of claims 1 and 9. de Brébisson teaches wherein classifying each filler word candidate of the filler word candidates into one of the set of categories [claim 15 only: wherein to classify each filler word candidate of the filler word candidates into one of the set of categories the instructions further cause the processing device to perform operations] comprises: for each filler word candidate, assigning a category label to the filler word candidate (de Brébisson in [0024] teaches any filler words that are discovered in the transcript can be identified as such so that appropriate action(s) can be taken [here all filler words are identified as filler words]).

Regarding claims 8 and 16, de Brébisson, as modified above, teaches the computer-implemented method and non-transitory computer-readable storage medium of claims 1 and 9. [claim 16 only] wherein the instructions further cause the processing device to perform operations wherein displaying the visualization of the audio sequence further comprises rendering a representation of the audio sequence, wherein the identification of filler words identified from the filler word candidates is represented visually associated with the representation of the audio sequence.

Claims 2 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over de Brébisson, in view of Siyavudeen, in view of Lewis, in view of Warnick, in view of Fage et al. (US Patent Pub. No. 20230252979 A1), hereinafter Fage.

Regarding claims 2 and 10, de Brébisson, as modified above, teaches the computer-implemented method and non-transitory computer-readable storage medium of claims 1 and 9. de Brébisson teaches wherein determining the filler word candidates by comparing the locations of voice activity in the audio sequence and a transcription of the audio sequence [claim 10 only: wherein to determine the filler word candidates by comparing the locations of voice activity in the audio sequence and a transcription of the audio sequence the instructions further cause the processing device to perform operations] comprises: generating, by a speech recognition model, the transcription of the audio sequence (de Brébisson in [0046] teaches applying speech recognition to audio content to create a transcript).
de Brébisson, as modified above, does not teach, however Fage teaches detecting, by a trained voice activity detection model, the locations of voice activity in the audio sequence (Fage in [0230] teaches using voice activity detection to discriminate between voice and non-voice input, including filler words); and determining the filler word candidates at detected locations of voice activity without corresponding transcript data in the transcription of the audio sequence (Fage in [0230] teaches using voice activity detection to discriminate between voice and non-voice input (e.g., filler words or interjections such as “um,” “uh,” “hmm,” etc.); in at least some examples, a “silent period” or “reduced-volume portion” can refer to a non-speech portion and/or a non-keyword portion of the voice input, even if audible sounds (such as background noise) are still present, e.g., discriminating between voice input and non-voice input using spectral analysis or other suitable technique [here the filler words are identified without using a transcript]). Fage is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Fage to allow for using a filler word identifier. Motivation to do so would allow for cross-checking of keyword detection to improve confidence and lower error rates (Fage [0218]).

Claims 3 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over de Brébisson, in view of Siyavudeen, in view of Lewis, in view of Warnick, in view of Fage, in view of Sheikh et al. (US Patent Pub. No. 20200020340 A1), hereinafter Sheikh.

Regarding claims 3 and 11, de Brébisson, as modified above, teaches the computer-implemented method and non-transitory computer-readable storage medium of claims 2 and 10. de Brébisson, as modified above, teaches wherein determining the filler word candidates at the detected locations of voice activity without corresponding transcript data in the transcription of the audio sequence further [claim 11 only: the instructions further cause the processing device to perform operations] comprises: de Brébisson, as modified above, does not teach, however Sheikh teaches determining a start timecode and end timecode for each filler word candidate of the filler word candidates (Sheikh in [0034] teaches obtaining the time stamp parameter for the given segments of validated classifier word; the time stamp starts when the time occurrence of the validated classifier word begins, and ends when the time occurrence of the validated classifier word stops). Sheikh is considered to be analogous to the claimed invention because it is in the same field of muting information from an audio. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Sheikh to allow for using a time stamp parameter. Motivation to do so would allow for muting segments of classified information from an audio (Sheikh [0007]).

Claims 5 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over de Brébisson, in view of Siyavudeen, in view of Lewis, in view of Warnick, in view of Sheikh.
Regarding claims 5 and 13, de Brébisson, as modified above, teaches the computer-implemented method and non-transitory computer-readable storage medium of claims 1 and 9. de Brébisson, as modified above, teaches the media sequence. de Brébisson teaches wherein the modification further includes automatically deleting portions of the output media sequence at the locations of the identified filler words (de Brébisson in [0085] teaches if the individual is interested in deleting one or more words in the transcript, the individual may simply indicate as much by interacting with those word(s) in the word bar located along the bottom of the interface; as discussed above, the media production platform can then make appropriate adjustments to the underlying audio file (e.g., by deleting the portion in which those word(s) are uttered)), and substituting audio at the locations of the identified filler words (de Brébisson in [0046] teaches the media production platform may intelligently add, remove, or modify media in an audio).

de Brébisson, as modified above, does not teach, however Sheikh teaches muting the output media sequence at the locations of the identified filler words. Sheikh is considered to be analogous to the claimed invention because it is in the same field of muting information from an audio. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Sheikh to allow for muting the information. Motivation to do so would allow for muting segments of classified information from an audio (Sheikh [0007]).
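For orientation, the recited modifications (automatically deleting, substituting, or muting audio at the identified filler locations) reduce to simple sample-level operations. The sketch below, with NumPy arrays standing in for the audio and a short crossfade standing in for the transitional effect at each point of removal, is an assumption-laden illustration rather than any reference's implementation; substitution would follow the same pattern with replacement samples written into the span.

```python
import numpy as np

def mute_fillers(samples, sr, fillers):
    """Silence each (start_s, end_s) filler span in a 1-D sample array."""
    out = samples.copy()
    for s, e in fillers:
        out[int(s * sr):int(e * sr)] = 0.0  # keep timing, drop the filler
    return out

def delete_fillers(samples, sr, fillers, fade_ms=20):
    """Cut filler spans out and rejoin the remainder with short crossfades."""
    cuts = sorted(int(t * sr) for span in fillers for t in span)
    edges = [0] + cuts + [len(samples)]
    # Even-numbered (start, end) pairs are the regions we keep.
    kept = [samples[edges[i]:edges[i + 1]] for i in range(0, len(edges) - 1, 2)]
    fade = int(fade_ms / 1000 * sr)
    out = kept[0]
    for seg in kept[1:]:
        if fade and len(out) >= fade and len(seg) >= fade:
            ramp = np.linspace(1.0, 0.0, fade)
            # Crossfade at the joint: a small transition at the removal point.
            joint = out[-fade:] * ramp + seg[:fade] * ramp[::-1]
            out = np.concatenate([out[:-fade], joint, seg[fade:]])
        else:
            out = np.concatenate([out, seg])
    return out
```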
Claims 6 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over de Brébisson, in view of Siyavudeen, in view of Lewis, in view of Warnick, in view of Niemasik et al. (US Patent No. 10372991 B1), hereinafter Niemasik, in view of Cosgrove et al. (US Patent Pub. No. 20150220537 A1), hereinafter Cosgrove, in view of Rahman et al. (US Patent Pub. No. 20220054039 A1), hereinafter Rahman.

Regarding claims 6 and 14, de Brébisson, as modified above, teaches the computer-implemented method and non-transitory computer-readable storage medium of claims 1 and 9, and teaches the set of categories.

de Brébisson, as modified above, does not teach wherein the set of categories include filler words; however, Siyavudeen teaches filler words (Siyavudeen in [0029, 0032, Fig. 4] teaches using a filler words classifier which generates an observed filler words feature of the text data, and a filler word feature table 460). Siyavudeen is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Siyavudeen to allow for using a filler word classifier. Motivation to do so would allow for validating a digitized audio signal that is generated by a conference participant, where if the reference speech features are sufficiently similar to the observed speech features, the digitized audio signal is validated and the conference participant is allowed to remain in the conference (Siyavudeen [Abstract]).

de Brébisson, as modified above, does not teach, however Niemasik teaches regular words and laughter (Niemasik in [col 18 ln 39-54] teaches an audio analyzer can be configured to analyze an audio signal and label one or more portions of the audio signal with one or more audio classifier labels; the one or more audio classifier labels can be descriptive of various audio events, such as laughter, clapping, crying, footsteps, wind, waves, animals, speech, or other sounds). Niemasik is considered to be analogous to the claimed invention because it is in the same field of audio signal analysis and classification. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Niemasik to allow for classifying speech and laughter. Motivation to do so would allow for an audiovisual content capture, curation, and editing system that includes an image capture device that intelligently captures audiovisual content, which can leverage machine learning to selectively store images and generate edited video at an image capture device (Niemasik [col 1 ln 8-14]).

de Brébisson, as modified above, does not teach, however Cosgrove teaches music (Cosgrove in [0045] teaches Sound Type Classification (voices, laughter, cheering, yelling, music, applause, barking, wind, surf, rain, thunder, engine sounds, skiing sounds, etc.); if music is recognized, an internet service can be used to recognize what song is playing). Cosgrove is considered to be analogous to the claimed invention because it is in the same field of audio signal analysis and classification. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Cosgrove to allow for classifying music. Motivation to do so would allow for the system to produce results that are customized to the interests and preferences of the user (Cosgrove [0022]).

de Brébisson, as modified above, does not teach, however Rahman teaches breath (Rahman in [0052] teaches an audio segment classified as breathing). Rahman is considered to be analogous to the claimed invention because it is in the same field of audio signal analysis and classification. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Rahman to allow for classifying breathing. Motivation to do so would allow for more effectively assisting the user in managing and improving the user's breathing habits (Rahman [0069]).
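The category set the rejection assembles across Siyavudeen (filler words), Niemasik (regular words, laughter), Cosgrove (music), and Rahman (breath) could be modeled as below. The enum and the argmax step are a hypothetical sketch of per-segment category assignment, not any reference's API.

```python
from enum import Enum

class SegmentCategory(Enum):
    FILLER_WORD = "filler word"    # Siyavudeen
    REGULAR_WORD = "regular word"  # Niemasik
    LAUGHTER = "laughter"          # Niemasik
    MUSIC = "music"                # Cosgrove
    BREATH = "breath"              # Rahman

def assign_category(probabilities):
    """probabilities: dict of SegmentCategory -> float for one audio segment;
    the highest-probability category becomes the segment's label."""
    return max(probabilities, key=probabilities.get)
```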
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over de Brébisson, in view of Siyavudeen, in view of Lewis, in view of Warnick, in view of Fage, in view of Pakhomov et al. (US Patent No. 6535849 B1), hereinafter Pakhomov.

Regarding claim 17, de Brébisson teaches a system (de Brébisson in [0087] teaches a processing system) comprising: a memory component (de Brébisson in [0088] teaches a processing system uses a memory); and a processing device coupled to the memory component, the processing device to perform operations (de Brébisson in [0087, 0088, Fig. 8] teaches a processing system uses a memory, they are coupled together, and they can implement operations) comprising:

receiving an input including [a media sequence, the media sequence including a video sequence paired with] an audio sequence (de Brébisson in [0085] teaches the media production platform can produce a transcript by transcribing the audio file received);

generating an output [media sequence] (de Brébisson in [0084] teaches each word included in the transcript may be associated with a corresponding portion of the audio file so as to enable easy modification; as an example, if the individual is interested in deleting one or more words in the transcript (e.g., the phrase “no, no, no”), the individual may simply indicate as much by interacting with those word(s) in the word bar located along the bottom of the interface, and the media production platform can then make appropriate adjustments to the underlying audio file (e.g., by deleting the portion in which those word(s) are uttered)), wherein the output [media] sequence includes a modification of [the video sequence and] the audio sequence, [including adding transitions to the video sequence and the audio sequence at locations corresponding to segments of the audio sequence classified into the filler words category].

de Brébisson does not teach, however Siyavudeen teaches classifying, by the filler word classification model, voice activity within the detected locations of voice activity into one of a set of categories (Siyavudeen in [0029, 0032, Fig. 4] teaches using a filler words classifier which generates an observed filler words feature of the text data, and a filler word feature table 460). Siyavudeen is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson further in view of Siyavudeen to allow for using a filler word classifier. Motivation to do so would allow for validating a digitized audio signal that is generated by a conference participant, where if the reference speech features are sufficiently similar to the observed speech features, the digitized audio signal is validated and the conference participant is allowed to remain in the conference (Siyavudeen [Abstract]).

de Brébisson, as modified above, teaches the detected locations, the voice activity, the audio sequence, the filler word classification model, the output audio sequence, and identification of a subset of the voice activity as identified filler words.
de Brébisson, as modified above, does not teach, however Lewis teaches providing the detected locations of the voice activity in the audio sequence to [a filler word classification model] (Lewis in [0016] teaches using timestamps for disfluency events (filler words), and in [0026] teaches using timestamps for uttered words of a transcript, for direct analysis of comparing the timestamps of the disfluencies to the timestamps of the words in the transcript to confirm if a disfluency should be removed), wherein the output audio sequence is a modification of the audio sequence [that includes an identification of a subset of the voice activity as identified filler words] (Lewis in [0016] teaches the audio stream may be edited/modified to remove at least some of the disfluency segments (e.g., portions of the audio recording that include disfluencies) to result in a shorter audio stream with speech that flows smoother/more fluently than the original audio stream, and in [0051] teaches identifying the presence of each disfluency event that is determined from a window of audio). Lewis is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Lewis to allow for using timestamps to identify filler words. Motivation to do so would allow for a more thorough detection of disfluency segments (Lewis [0049]).

de Brébisson, as modified above, does not teach, however Fage teaches detecting, by a trained voice activity detection model, locations of voice activity in the audio sequence (Fage in [0230] teaches using voice activity detection to discriminate between voice and non-voice input, including filler words). Fage is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Fage to allow for using a filler word identifier. Motivation to do so would allow for cross-checking of keyword detection to improve confidence and lower error rates (Fage [0218]).

de Brébisson, as modified above, does not teach, however Pakhomov teaches [classifying, by the filler word classification model, voice activity within the detected locations of voice activity into one of a set of categories] by analyzing segments of the audio sequence, determining probability values for each segment of a likelihood that the segment is associated with each category of the set of categories, and grouping contiguous segments having probability values above a threshold (Pakhomov in [col 6 ln 25-55] teaches identifying “filled pause words”, and in [col 6 ln 57 – col 7 ln 18] teaches determining the probability of a filled pause word being in a particular location). Pakhomov is considered to be analogous to the claimed invention because it is in the same field of identifying speech in audio. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Pakhomov to allow for using a sliding window model.
Motivation to do so would allow for a number of advantages over the prior art, such as adaptation of the models of a speech recognizer without having to generate literal transcripts of recorded speech (Pakhomov [col 3 ln 37-46]).

de Brébisson, as modified above, does not teach, however Warnick teaches receiving an input including a media sequence, the media sequence including a video sequence paired with an audio sequence (Warnick in [col 11 lines 1-20] teaches receiving video clips with corresponding audio), generating an output media sequence based on the modification to the visualization of the audio sequence (Warnick in [col 11 lines 1-20] teaches removing filler words and generating synthetic video that matches the automatically edited audio), wherein the output media sequence includes a modification of the video sequence and the audio sequence, including adding transitions to the video sequence and the audio sequence at locations corresponding to segments of the audio sequence classified into the filler words category (Warnick in [col 10 lines 34-50] teaches generating an edited video segment corresponding to the removal of filler words, where the video is defocused briefly as part of the transitional effect between segments which are merged). Warnick is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Warnick to allow for generating an edited video segment corresponding to the removal of filler words, where the video is defocused briefly as part of the transitional effect between segments which are merged. Motivation to do so would provide for systems and methods that allow users to generate synthetic video segments that are synchronized with an edited audio track (Warnick [Abstract]).

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over de Brébisson, in view of Siyavudeen, in view of Lewis, in view of Warnick, in view of Fage, in view of Pakhomov, in view of Gupta et al. (US Patent No. 11538461 B1), hereinafter Gupta.

Regarding claim 18, de Brébisson, as modified above, teaches the system of claim 17. de Brébisson, as modified above, teaches wherein to classify the voice activity within the detected locations of voice activity into one of the set of categories the processing device further performs operations comprising: de Brébisson, as modified above, does not teach, however Siyavudeen teaches the filler word classification model (Siyavudeen in [0029, 0032, Fig. 4] teaches using a filler words classifier which generates an observed filler words feature of the text data, and a filler word feature table 460). Siyavudeen is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Siyavudeen to allow for using a filler word classifier.
Motivation to do so would allow for validating a digitized audio signal that is generated by a conference participant, where if the reference speech features are sufficiently similar to the observed speech features, the digitized audio signal is validated and the conference participant is allowed to remain in the conference (Siyavudeen [Abstract]).

de Brébisson, as modified above, does not teach, however Fage teaches identifying locations of the identified filler words (Fage in [0230] teaches using voice activity detection to discriminate between voice input and non-voice input using spectral analysis or other suitable technique [here the filler words are identified without using a transcript]). Fage is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Fage to allow for using a filler word identifier. Motivation to do so would allow for cross-checking of keyword detection to improve confidence and lower error rates (Fage [0218]).

de Brébisson, as modified above, does not teach, however Pakhomov teaches [for each time segment], predicting a probability value, the probability value indicating a likelihood of the time segment including a filler word (Pakhomov in [col 6 ln 25-55] teaches identifying “filled pause words”, and in [col 6 ln 57 – col 7 ln 18] teaches determining the probability of a filled pause word being in a particular location); and [identifying locations of the identified filler words] based on the probability values for each time segment and a threshold value (Pakhomov in [col 6 ln 25-55] teaches identifying “filled pause words”, and in [col 6 ln 57 – col 7 ln 18] teaches determining the probability of a pause word being in a particular location). Pakhomov is considered to be analogous to the claimed invention because it is in the same field of identifying speech in audio. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Pakhomov to allow for using a sliding window model. Motivation to do so would allow for a number of advantages over the prior art, such as adaptation of the models of a speech recognizer without having to generate literal transcripts of recorded speech (Pakhomov [col 3 ln 37-46]).

de Brébisson, as modified above, does not teach, however Gupta teaches sliding [the filler word classification model] across the audio sequence in predetermined time segments (Gupta in [col 5 ln 55 – col 6 ln 28] teaches using a sliding window approach to identify audio segments of an audio sequence to determine speech segments, teaches using a threshold value to make the determination, and the time segments are 800 ms). Gupta is considered to be analogous to the claimed invention because it is in the same field of identifying speech in audio. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Gupta to allow for using a sliding window model. Motivation to do so would allow for identifying missing subtitles associated with a media presentation (Gupta [col 1 ln 60-65]).
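Read together, Pakhomov and Gupta describe a sliding-window pass: step a classifier across the audio in fixed time segments (Gupta's example window is 800 ms), record a per-segment filler probability, and group contiguous segments that clear a threshold into filler spans. The sketch below assumes a `model` object with a `filler_probability` method; that interface is a placeholder, not either reference's code.

```python
def sliding_window_fillers(samples, sr, model, window_s=0.8, threshold=0.5):
    """Return [(start_s, end_s)] spans of contiguous above-threshold segments."""
    win = int(window_s * sr)
    spans, open_start = [], None
    for i in range(0, len(samples), win):
        # Per-segment probability that this window contains a filler word.
        p = model.filler_probability(samples[i:i + win])
        if p >= threshold:
            if open_start is None:
                open_start = i                       # a filler span opens
        elif open_start is not None:
            spans.append((open_start / sr, i / sr))  # contiguous run ends
            open_start = None
    if open_start is not None:
        spans.append((open_start / sr, len(samples) / sr))
    return spans
```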
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over de Brébisson, in view of Siyavudeen, in view of Lewis, in view of Warnick, in view of Fage, in view of Pakhomov, in view of Gupta, in view of Sheikh.

Regarding claim 19, de Brébisson, as modified above, teaches the system of claim 18. de Brébisson teaches wherein the processing device further performs operations comprising: for each of the identified locations of the identified filler words (de Brébisson in [0084] teaches each word included in the transcript may be associated with a corresponding portion of the audio file so as to enable easy modification; as an example, if the individual is interested in deleting one or more words in the transcript (e.g., the phrase “no, no, no”), the individual may simply indicate as much by interacting with those word(s) in the word bar located along the bottom of the interface, and the media production platform can then make appropriate adjustments to the underlying audio file (e.g., by deleting the portion in which those word(s) are uttered)).

de Brébisson, as modified above, does not teach, however Sheikh teaches determining a start timecode and end timecode (Sheikh in [0034] teaches obtaining the time stamp parameter for the given segments of validated classifier word; the time stamp starts when the time occurrence of the validated classifier word begins, and ends when the time occurrence of the validated classifier word stops). Sheikh is considered to be analogous to the claimed invention because it is in the same field of muting information from an audio. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Sheikh to allow for using a time stamp parameter. Motivation to do so would allow for muting segments of classified information from an audio (Sheikh [0007]).

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over de Brébisson, in view of Siyavudeen, in view of Lewis, in view of Fage, in view of Pakhomov, in view of Sheikh, in view of Warnick.

Regarding claim 20, de Brébisson, as modified above, teaches the system of claim 17. de Brébisson, as modified above, teaches the media sequence. de Brébisson teaches wherein to generate the output media sequence the processing device further performs operations comprising: receiving modifications to the visualization of the audio sequence (de Brébisson in [0085] teaches if the individual is interested in deleting one or more words in the transcript (e.g., the phrase “no, no, no”), the individual may simply indicate as much by interacting with those word(s) in the word bar located along the bottom of the interface; as discussed above, the media production platform can then make appropriate adjustments to the underlying audio file (e.g., by deleting the portion in which those word(s) are uttered)), wherein the modifications include one or more of: automatically deleting portions of the output media sequence at the locations of the identified filler words (de Brébisson in [0085] teaches if the individual is interested in deleting one or more words in the transcript (e.g., the phrase “no, no, no”), the individual may simply indicate as much by interacting with those word(s) in the word bar located along the bottom of the interface.
As discussed above, the media production platform can then make appropriate adjustments to the underlying audio file (e.g., by deleting the portion in which those word(s) are uttered)), and substituting audio at the locations of the identified filler words (de Brébisson in [0046] teaches the media production platform may intelligently add, remove, or modify media in an audio); and generating the output media sequence based on the received modifications to the audio sequence (de Brébisson in [0084] teaches each word included in the transcript may be associated with a corresponding portion of the audio file so as to enable easy modification, and the media production platform can then make appropriate adjustments to the underlying audio file (e.g., by deleting the portion in which those word(s) are uttered)).

de Brébisson, as modified above, does not teach, however Sheikh teaches muting the output media sequence at the locations of the identified filler words. Sheikh is considered to be analogous to the claimed invention because it is in the same field of muting information from an audio. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Sheikh to allow for using a time stamp parameter. Motivation to do so would allow for muting segments of classified information from an audio (Sheikh [0007]).

de Brébisson, as modified above, does not teach, however Warnick teaches generating the output media sequence based on the received modifications to the audio sequence (Warnick in [col 11 lines 1-20] teaches removing filler words and generating synthetic video that matches the automatically edited audio, and in [col 10 lines 34-50] teaches generating an edited video segment corresponding to the removal of filler words, where the video is defocused briefly as part of the transitional effect between segments which are merged). Warnick is considered to be analogous to the claimed invention because it is in the same field of filler words. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified de Brébisson, as modified above, further in view of Warnick to allow for generating an edited video segment corresponding to the removal of filler words, where the video is defocused briefly as part of the transitional effect between segments which are merged. Motivation to do so would provide for systems and methods that allow users to generate synthetic video segments that are synchronized with an edited audio track (Warnick [Abstract]).

Conclusion

Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL J. MUELLER whose telephone number is (571) 272-1875. The examiner can normally be reached M-F 9:00am-5:00pm (Eastern). Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PAUL J. MUELLER/
Examiner, Art Unit 2657

/DANIEL C WASHBURN/
Supervisory Patent Examiner, Art Unit 2657

Prosecution Timeline

Nov 15, 2022: Application Filed
Oct 01, 2024: Non-Final Rejection (§103)
Feb 28, 2025: Response Filed
Apr 11, 2025: Final Rejection (§103)
Jun 25, 2025: Applicant Interview (Telephonic)
Jun 25, 2025: Examiner Interview Summary
Jun 26, 2025: Request for Continued Examination
Jun 30, 2025: Response after Non-Final Action
Aug 05, 2025: Non-Final Rejection (§103)
Oct 22, 2025: Examiner Interview Summary
Oct 22, 2025: Response Filed
Oct 22, 2025: Applicant Interview (Telephonic)
Nov 07, 2025: Final Rejection (§103) — current

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597419 — NATURAL LANGUAGE PROCESSING APPARATUS AND NATURAL LANGUAGE PROCESSING METHOD (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596867 — Detecting Computer-Generated Hallucinations using Progressive Scope-of-Analysis Enlargement (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596886 — PERSONALIZED RESPONSES TO CHATBOT PROMPT BASED ON EMBEDDING SPACES BETWEEN USER AND SOCIETY (granted Apr 07, 2026; 2y 5m to grant)
Patent 12579378 — USING LLM FUNCTIONS TO EVALUATE AND COMPARE LARGE TEXT OUTPUTS OF LLMS (granted Mar 17, 2026; 2y 5m to grant)
Patent 12562174 — NOISE SUPPRESSION LOGIC IN ERROR CONCEALMENT UNIT USING NOISE-TO-SIGNAL RATIO (granted Feb 24, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 76%
Grant Probability with Interview: 99% (+34.6%)
Median Time to Grant: 3y 0m
PTA Risk: High

Based on 128 resolved cases by this examiner. Grant probability is derived from the career allow rate.
