DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/13/2025 has been entered.
Response to Arguments
35 USC 101
Applicant's arguments filed 10/10/2025 have been considered in this Office action. The § 101 rejection has been updated in view of the amendment, as set forth below.
The Applicant asserts that (Remarks, page 6): (“.. The Office Action does not appear to address this argument at all in the Response to Arguments section, as the Response to Arguments section does not appear to mention "mental step" at all (Office Action at pp. 2-5) and the § 101 rejection appears to be identical to the previous Office Action. (Office Action at pp. 5-8; Office Action received March 3, 2025 at pp. 2-5). Applicant respectfully requests that these arguments be addressed with specificity, and that Applicant be given a full and fair opportunity to respond. …”) Examiner notes that the non-final Office action, pages 2-3, gives a detailed explanation of how the claim could be performed as a mental process: a human can draw an audio frequency representation on a piece of paper and mentally analyze the graph to determine whether it contains speech or noise. This is a mental process of a human analyzing data. The claim is therefore directed to an abstract idea.
The Applicant further asserts that (Remarks, entire page 7): (“… The second stage 140 may operate on all of the digital audio signal from the incoming digital audio signal node 110, or it may only operate on the portions flagged as candidate speech intervals by the first stage 130 (e.g., in a cascaded manner). The second stage 140 may apply one or more additional processing steps to help determine whether an interval likely contains speech." …”) Examiner notes that a human can perform multiple steps in a cascaded manner. This is not an improvement to technology; it is merely an improvement to a process that a human can perform.
Applicant further asserts (“…. In particular, MPEP 2106.04(d)(1) states, "Second, if the specification sets forth an improvement in technology, the claim must be evaluated to ensure that the claim itself reflects the disclosed improvement. …”) Examiner notes that the claim language does not reflect the requirement as recited in MPEP 2106.04(d)(1). Also, the claim does not recite any additional element that ties it to a practical application. The claim does not recite any additional elements (Prong Two is not satisfied; see figure below) other than generic computer components that would integrate the claim into a practical application. Without such an additional element, the claim does not reach eligibility at Step 2B. Therefore, the claim recites an abstract idea.
From MPEP:
2106.04(d)(1): A claim reciting a judicial exception is not directed to the judicial exception if it also recites additional elements demonstrating that the claim as a whole integrates the exception into a practical application. One way to demonstrate such integration is when the claimed invention improves the functioning of a computer or improves another technology or technical field. The application or use of the judicial exception in this manner meaningfully limits the claim by going beyond generally linking the use of the judicial exception to a particular technological environment, and thus transforms a claim into patent-eligible subject matter. Such claims are eligible at Step 2A because they are not “directed to” the recited judicial exception.
[Figure: media_image1.png, 618 × 800, greyscale]
35 USC 102:
Regarding the rejection of claims under 35 U.S.C. § 102, applicant amended independent claims 1, 19, and 20. Regarding the rejection of claims 1, 19, and 20, applicant argued (Remarks, pages 8-9) that the previously cited references fail to teach the newly added limitations in amended claims 1, 19, and 20. Applicant further argued (Remarks, page 9) that dependent claims 2-16, 18, and 19 are allowable because of their dependency.
Applicant's arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection (35 U.S.C. § 103) does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim(s) 1, 19, and 20, the limitation(s) of “receive audio data”, “applying”, and “applying”, as drafted, are processes that, under the broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. More specifically, a human could receive audio data, draw an audio frequency representation on a piece of paper based on human knowledge, and evaluate whether a frequency value corresponds to a speech duration; representing this as two or three steps would still be a mental process of a human. A human can also compare candidate speech duration values against a predefined threshold or rule, or against the result of applying an FFT, and evaluate or assess whether speech or noise is present.
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the --Mental Processes-- grouping of abstract ideas. Accordingly, the claim(s) recite(s) an abstract idea.
This judicial exception is not integrated into a practical application because the recitation of “receiver circuit”, “processor circuit”, and “memory circuit” in claim 20 reads on generalized computer components, based upon the claim interpretation wherein the structure is interpreted using [0011], [0031], and [0064] of the specification. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim(s) is/are directed to an abstract idea.
The claim(s) do(es) not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using generalized computer components to receive, transform, determine, determine, determine, determine, and indicate amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim(s) is/are not patent eligible.
With respect to claim(s) 2, the claim(s) recite(s) “establishing respective frames defining specified durations within the digital representation of the audio signal”, which reads on a human mind creating/drawing frames corresponding to the audio signal. No additional limitations are present.
With respect to claim(s) 3, the claim(s) recite(s) “receiving a digital representation of an audio signal comprises receiving a streamed representation”, which reads on a human mind obtaining a real-time conversation or any continuous signal, where this continuous audio signal can be represented as frames. No additional limitations are present.
With respect to claim(s) 4, the claim(s) recite(s) “first frequency-domain indicator includes determining a representation of a frequency dispersion of spectral components of the digital representation of the audio signal”, which reads on a human performing spectral analysis of frequency-domain frames. No additional limitations are present.
With respect to claim(s) 5, the claim(s) recite(s) “comparing the determined representation of the dispersion with a first threshold”, which reads on a human looking at the two values to determine whether the calculated value meets the threshold value. No additional limitations are present.
With respect to claim(s) 6, the claim(s) recite(s) “adjusting the first threshold”, which reads on a human adjusting or tuning a threshold value based on spectral centroids. No additional limitations are present.
With respect to claim(s) 7, the claim(s) recite(s) “adjusting the first threshold”, which reads on a human adjusting or tuning a threshold value based on frames where no speech is present. No additional limitations are present.
With respect to claim(s) 8 and 9, the claim(s) recite(s) the mathematical calculation of the cepstrum, which converts a signal to the frequency domain through a Fourier transform, takes the logarithm, and then applies an inverse Fourier transform. No additional limitations are present.
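The cepstrum computation characterized above (Fourier transform, logarithm, inverse Fourier transform) can be illustrated with a short sketch. This sketch is explanatory only and is not claim language or code from any cited reference; a naive DFT stands in for the FFT, and the example frame is an arbitrary sinusoid.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2)); stands in for an FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def real_cepstrum(frame):
    """Cepstrum per the steps above: DFT -> log magnitude -> inverse DFT."""
    spectrum = dft(frame)
    log_mag = [math.log(abs(X) + 1e-12) for X in spectrum]  # epsilon avoids log(0)
    return [c.real for c in idft(log_mag)]

# Example: cepstrum of a short sinusoidal frame (arbitrary test signal)
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
ceps = real_cepstrum(frame)
print(len(ceps))  # one cepstral value per input sample: 64
```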
With respect to claim(s) 10, the claim(s) recite(s) “comparing the determined central tendency to a threshold”, which reads on comparing two values and determining whether the threshold is exceeded or within limits. This is a mental activity, and comparing two numbers is at the same time a mathematical algorithm. Therefore, this claim recites an abstract idea. No additional limitations are present.
With respect to claim(s) 11-13, the claim(s) recite(s) an “MFC indicator”, which characterizes the audio signal and how scattered the frequency components are in the spectral graph, and further analyzes these graphs to predict whether speech is present. A human analyzing data is a mental activity, which is an abstract idea. Therefore, these claims recite an abstract idea. No additional limitations are present.
With respect to claim(s) 14, the claim(s) recite(s) “second stage comprises both an MFC indicator and pitch indicator”, which reads on a human applying subsequent steps/stages to analyze whether a portion of the audio contains speech, with the system analyzing the MFCC to identify the pitch of the words. No additional limitations are present.
With respect to claim(s) 15 and 16, the claim(s) recite(s) sending a duration of speech to another system while it is still being partially received, which reads on a human sending portions of speech to an external device as they are received. These limitations are pre- and post-solution activity. No additional limitations are present.
With respect to claim(s) 17 and 18, the claim(s) recite(s) “applying a third stage comprising at least one temporal indicator to assess”, which reads on a human applying a predefined temporal relationship and comparing it against candidate speech duration values. No additional limitations are present.
These claims do not integrate the judicial exception into a practical application and further fail to include additional elements that are sufficient to amount to significantly more than the judicial exception.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 3, 4, 11, 15, 16, 17, 18, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over York et al. (US 11705130 B2) in view of Otani et al. (US 20050108004 A1).
Regarding Claim 1, York teaches
1. A machine-implemented method for detecting voice activity, the method comprising: receiving a digital representation of an audio signal. York teaches (“(9) … obtain one or more data streams from one or more sensors; determine, based on the one or more data streams, whether a user associated with the electronic device is speaking; in accordance with a determination that the user is speaking: …”col. 2, lines 42-45) (“(218) In some examples, determination module 850 includes speech detector 852 configured to process data streams. In some examples, speech detector 852 includes a voice activity detector (VAD) configured to detect speech in a data stream. …”col. 46, lines 56-57 ) by York et al. US 11705130 B2
applying a first stage comprising determining a first frequency-domain indicator from the digital representation of the audio signal to identify a candidate speech duration; and
York teaches (“(218) In some examples, determination module 850 includes speech detector 852 configured to process data streams. In some examples, speech detector 852 includes a voice activity detector (VAD) configured to detect speech in a data stream. In some examples, speech detector 852 divides an input data stream into frames (e.g., portions) of a predetermined length, e.g., 25, 50, 100, 200, 300, 400, or 500 milliseconds. In some examples, speech detector 852 determines whether each frame indicates that a user is speaking (indicates user speech). For example, speech detector 852 determines a probability of user speech for each frame and/or makes a “yes” or “no” decision regarding the presence of user speech for each frame. In this manner, speech detector 852 can identify the boundaries of user speech in a data stream and determine the duration(s) during which a user is speaking or not speaking.” Col. 47, lines 1-13) by York et al. US 11705130 B2
the second stage comprising determining at least one of a mel- frequency cepstral (MFC) indicator or a pitch indicator from the digital representation of the audio signal corresponding to the identified candidate speech duration to assess whether the identified candidate speech duration contains speech. York teaches (“(169) STT processing module 730 includes one or more ASR systems 758. The one or more ASR systems 758 can process the speech input that is received through I/O processing module 728 to produce a recognition result. Each ASR system 758 includes a front-end speech pre-processor. The front-end speech pre-processor extracts representative features from the speech input. For example, the front-end speech pre-processor performs a Fourier transform on the speech input to extract spectral features that characterize the speech input as a sequence of representative multi-dimensional vectors. …” COL. 35, lines 43-53) (“(221) In some examples, determination module 850 determines whether a first data stream obtained from a speech sensor (or a portion thereof) indicates user speech. In some examples, determining whether the first data stream indicates user speech includes analyzing the time domain and/or frequency domain features of the first data stream to determine whether such features indicate human speech. Exemplary time domain features include zero crossing rates, short-time energy, spectral energy, spectral flatness, and autocorrelation. Exemplary frequency domain features include mel-frequency cepstral coefficients, linear predictive cepstral coefficients, and mel-frequency discrete wavelet coefficients. In addition to analyzing such features, one of skill in the art will appreciate that any other suitable technique (e.g., comparing the data stream to a human speech model) may be employed to determine whether a data stream indicates user speech. 
For example, speech detector 852 processes the first data stream to determine that a 2 second portion of the data stream indicates user speech. For example, speech detector 852 determines that each frame of a consecutive series of frames (having a collective 2 second duration) indicates user speech.” col. 47, lines 33- 54) (“(230) … in some examples, determination module 850 determines whether a length (e.g., number of words, length of audio) of a previous notification is below a threshold length. In some examples, if the length is determined to be below the threshold length, …”) (“(234) In FIG. 8A, the user is determined to be not speaking. Specifically, during a 7 second time window after the text message was received, a 2 second duration of no user speech is determined. During that duration, device 810 provides the audio output “Mom says ‘where are you?’””) (“(278) In some examples, it is determined, based on obtained data stream(s), whether a user associated with an electronic device (e.g., devices 800 and 810) is speaking (e.g., by determination module 850). Determining whether a user is speaking is performed according to any of the techniques discussed above with respect to FIGS. 8A-8C. For example, determining whether a user is speaking includes determining whether a data stream obtained from a vibration sensor (e.g., bone conduction microphone) indicates user speech for a predetermined duration (e.g., 0.2, 0.3, 0.4, or 0.5 seconds). …”) by York et al. US 11705130 B2
applying a second stage pitch of the audio signal to the candidate speech duration identified using the first frequency-domain indicator of the first stage, corresponding to the identified candidate speech duration to assess whether the identified candidate speech duration contains speech. Otani teaches applying an FFT to the input sound signal (“[0149] (S42) The power spectrum calculator 64a applies FFT on the input sound signal and supplies the resulting power spectrum to the talkspurt detector 64b.”) (“[0038] In the spectral analysis approach, the power spectrum of a signal is calculated with fast Fourier transform (FFT), wavelet transform, or other known algorithms. In the case of FFT, the Fourier transform algorithm converts a time series of samples into a set of components in the frequency domain, i.e., the frequency spectrum of the signal. Suppose now that a time-domain data stream x for one frame period is given. The given stream is converted to a frequency-domain dataset X=(X[k].vertline.k=1, 2, . . . N), where k is frequency and N is the total number of subdivided (i.e., discretized) frequency bands.”) (“[0055] It is generally known that speech signals have different spectral envelopes and pitch structures, which result in uneven distribution of frequency components. Spectral envelopes represent the timbre of voice, which is determined by the shape of a speaker's vocal tract (i.e., structure of organs from vocal chords to mouth). A change in the shape of a vocal tract affects its transfer function including resonance characteristics, thus causing uneven distribution of acoustic energies over frequency. Pitch structures indicate the tone height, which comes from the frequency of vocal chord vibration. A temporal change in the pitch structure gives a particular accent or intonation in speech. Background noises, on the other hand, are known to have a relatively uniform spectrum.
For this reason, white noise approximation or pink noise approximation is often made to represent them.”) (“[0092] … It sets an appropriate flag to indicate the result. FIG. 15 illustrates how talkspurts are differentiated from noise periods, where the horizontal axis represents frames (time) …”) (“[0054] Talkspurt periods can be distinguished from noise periods by calculating the flatness of a power spectrum in the way described above. The following will explain how the spectral flatness varies depending on whether the signal contains speech or only background noise.”) (“[0058] The flatness factor FLT1 of signal X1 (FIG. 7) is obviously greater than FLT2 of signal X2 (FIG. 8). This fact indicates that the signal X1 is speech while the signal X2 is noise. Note here that a larger value of FLT means a less flat spectrum, and that a smaller value of FLT means a flatter spectrum. Talkspurts can be identified by calculating flatness factors of spectrums and comparing them (the voice/noise discriminator 13 actually compares the flatness factor with a predetermined threshold).”) Otani teaches (“[0114] (S14) The voice/noise discriminator 33f compares the flatness factor of each frame with a predetermined threshold. Through this comparison the voice/noise discriminator 33f determines whether the frame in question is speech or noise, and it sets an appropriate flag to indicate the result.”) by Otani et al. US 20050108004 A1. Otani teaches applying an FFT to the input signal, calculating the flatness of the spectrum, and comparing it with a threshold to determine whether speech or noise is present.
Otani is considered to be analogous to the claimed invention because it relates to a voice activity detector, and more particularly to a voice activity detector which discriminates talkspurts from background noise in a given input signal.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York to incorporate the teachings of Otani in order to include temporal length.
One could have been motivated to do so because the model can set an appropriate flag to indicate the result. (“[0114] … and it sets an appropriate flag to indicate the result.”) by Otani et al. US 20050108004 A1
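Otani's flatness-based talkspurt detection cited above can be illustrated with a short sketch. This is explanatory only: Otani's exact flatness factor FLT is not reproduced here (per paragraph [0058], Otani's FLT grows as the spectrum becomes less flat), so the sketch uses the common geometric-mean/arithmetic-mean flatness measure, which instead approaches 1 for flat, noise-like spectra, together with a hypothetical threshold.

```python
import cmath
import math

def power_spectrum(frame):
    """Power spectrum via a naive DFT (O(n^2)); an FFT would be used in practice."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        X = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(X) ** 2)
    return spec

def spectral_flatness(power):
    """Geometric mean over arithmetic mean of the power spectrum.
    Near 1.0 for flat (noise-like) spectra, near 0.0 for peaky (speech-like)."""
    eps = 1e-12
    geo = math.exp(sum(math.log(p + eps) for p in power) / len(power))
    arith = sum(power) / len(power)
    return geo / (arith + eps)

def is_talkspurt(frame, threshold=0.5):
    """Flag a frame as speech when its spectrum is sufficiently non-flat.
    The 0.5 threshold is a hypothetical value, not Otani's."""
    return spectral_flatness(power_spectrum(frame)) < threshold

# A pure tone concentrates energy in one bin, giving a very non-flat spectrum
tone = [math.sin(2 * math.pi * 8 * t / 128) for t in range(128)]
print(is_talkspurt(tone))  # True
```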
Claim 19 is a method claim with limitations similar to the limitations of method Claim 1 and is rejected under similar rationale. Additionally:
York teaches:
19. A machine-implemented method for detecting voice activity, the method comprising: receiving a digital representation of an audio signal; York teaches (“(9) … obtain one or more data streams from one or more sensors; determine, based on the one or more data streams, whether a user associated with the electronic device is speaking; in accordance with a determination that the user is speaking: …”col. 2, lines 42-45) (“(218) In some examples, determination module 850 includes speech detector 852 configured to process data streams. In some examples, speech detector 852 includes a voice activity detector (VAD) configured to detect speech in a data stream. …”col. 46, lines 56-57 ) by York et al. US 11705130 B2
establishing respective frames defining specified durations within the digital representation of the audio signal; York teaches (“(218) In some examples, determination module 850 includes speech detector 852 configured to process data streams. In some examples, speech detector 852 includes a voice activity detector (VAD) configured to detect speech in a data stream. In some examples, speech detector 852 divides an input data stream into frames (e.g., portions) of a predetermined length, e.g., 25, 50, 100, 200, 300, 400, or 500 milliseconds. In some examples, speech detector 852 determines whether each frame indicates that a user is speaking (indicates user speech). For example, speech detector 852 determines a probability of user speech for each frame and/or makes a “yes” or “no” decision regarding the presence of user speech for each frame. In this manner, speech detector 852 can identify the boundaries of user speech in a data stream and determine the duration(s) during which a user is speaking or not speaking.” Col. 46, lines 65-67) (“… For example, speech detector 852 processes the first data stream to determine that a 2 second portion of the data stream indicates user speech. For example, speech detector 852 determines that each frame of a consecutive series of frames (having a collective 2 second duration) indicates user speech.” col. 47, lines 33- 54) by York et al. US 11705130 B2
applying a first stage comprising determining a first frequency-domain indicator from at least one of the respective frames of the digital representation of the audio signal to identify a candidate speech duration; York teaches (“(221) In some examples, determination module 850 determines whether a first data stream obtained from a speech sensor (or a portion thereof) indicates user speech. In some examples, determining whether the first data stream indicates user speech includes analyzing the time domain and/or frequency domain features of the first data stream to determine whether such features indicate human speech. Exemplary time domain features include zero crossing rates, short-time energy, spectral energy, spectral flatness, and autocorrelation. Exemplary frequency domain features include mel-frequency cepstral coefficients, linear predictive cepstral coefficients, and mel-frequency discrete wavelet coefficients. In addition to analyzing such features, one of skill in the art will appreciate that any other suitable technique (e.g., comparing the data stream to a human speech model) may be employed to determine whether a data stream indicates user speech. For example, speech detector 852 processes the first data stream to determine that a 2 second portion of the data stream indicates user speech. For example, speech detector 852 determines that each frame of a consecutive series of frames (having a collective 2 second duration) indicates user speech.” col. 47, lines 33- 54) by York et al. US 11705130 B2
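Among the time-domain features York lists in the quoted paragraph (221) (zero crossing rates, short-time energy, and so on), the zero crossing rate is the simplest to state concretely. The following sketch is illustrative only; the sign convention at exact zeros and the example signals are assumptions, not details from York.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.
    Voiced speech tends toward a low ZCR; fricatives and noise toward a higher one."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)  # treat exact zero as non-negative
    )
    return crossings / (len(frame) - 1)

# A low-frequency tone crosses zero less often than a high-frequency one
slow = [math.sin(2 * math.pi * 2 * t / 100) for t in range(100)]
fast = [math.sin(2 * math.pi * 20 * t / 100) for t in range(100)]
print(zero_crossing_rate(slow) < zero_crossing_rate(fast))  # True
```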
Claim 20 is a system claim with limitations similar to the limitations of method Claim 1 and is rejected under similar rationale. Additionally:
Regarding Claim 20, York teaches
20. A voice activity detection (VAD) system, the system comprising: a receiver circuit, configured to receive a digital representation of an audio signal; and a processor circuit coupled with a memory circuit, the memory circuit containing instructions that, when executed by the processor circuit, cause the processor circuit to: (“(40)… Device 200 includes memory 202 (which optionally includes one or more computer-readable storage mediums), memory controller 222, one or more processing units (CPUs) 220, peripherals interface 218, RF circuitry 208, audio circuitry 210, speaker 211, microphone 213, input/output (I/O) subsystem 206, other input control devices 216, and external port 224. …”col. 8, lines 46-53) (“(46) Peripherals interface 218 is used to couple input and output peripherals of the device to CPU 220 and memory 202. The one or more processors 220 run or execute various software programs and/or sets of instructions stored in memory 202 to perform various functions for device 200 and to process data. In some embodiments, peripherals interface 218, CPU 220, and memory controller 222 are implemented on a single chip, such as chip 204. In some other embodiments, they are implemented on separate chips.” Col.10, lines 42-50) by York et al. US 11705130 B2
Regarding Claim 2, the combination teaches the method claim 1 as identified above.
York further teaches:
2. The method of claim 1, comprising establishing respective frames defining specified durations within the digital representation of the audio signal, wherein at least one of the first stage or the second stage operates on at least one of the respective frames. York teaches (“(218) In some examples, determination module 850 includes speech detector 852 configured to process data streams. In some examples, speech detector 852 includes a voice activity detector (VAD) configured to detect speech in a data stream. In some examples, speech detector 852 divides an input data stream into frames (e.g., portions) of a predetermined length, e.g., 25, 50, 100, 200, 300, 400, or 500 milliseconds. In some examples, speech detector 852 determines whether each frame indicates that a user is speaking (indicates user speech). For example, speech detector 852 determines a probability of user speech for each frame and/or makes a “yes” or “no” decision regarding the presence of user speech for each frame. In this manner, speech detector 852 can identify the boundaries of user speech in a data stream and determine the duration(s) during which a user is speaking or not speaking.” Col, 46 lines 65-67 and Col 47, lines 1-13) by York et al. US 11705130 B2
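The frame division York describes (splitting an input data stream into frames of a predetermined length, e.g., 25 to 500 milliseconds) can be sketched as follows. The sample rate and the convention of dropping a trailing partial frame are illustrative assumptions, not details from York.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Divide a sample stream into consecutive non-overlapping frames of
    frame_ms milliseconds, dropping any trailing partial frame (one simple
    convention among several)."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# 1 second of samples at 16 kHz split into 25 ms frames -> 40 frames of 400 samples
samples = [0.0] * 16000
frames = split_into_frames(samples, 16000, 25)
print(len(frames), len(frames[0]))  # 40 400
```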
Regarding Claim 3, the combination teaches the method claim 2 as identified above.
York further teaches:
3. The method of claim 2, wherein the receiving a digital representation of an audio signal comprises receiving a streamed representation; and wherein the establishing respective frames includes assigning or receiving the respective frames based on the streamed representation. York teaches (“(218) In some examples, determination module 850 includes speech detector 852 configured to process data streams. In some examples, speech detector 852 includes a voice activity detector (VAD) configured to detect speech in a data stream. In some examples, speech detector 852 divides an input data stream into frames (e.g., portions) of a predetermined length, e.g., 25, 50, 100, 200, 300, 400, or 500 milliseconds. In some examples, speech detector 852 determines whether each frame indicates that a user is speaking (indicates user speech). For example, speech detector 852 determines a probability of user speech for each frame and/or makes a “yes” or “no” decision regarding the presence of user speech for each frame. In this manner, speech detector 852 can identify the boundaries of user speech in a data stream and determine the duration(s) during which a user is speaking or not speaking.” Col, 46 lines 65-67 and Col 47, lines 1-13) by York et al. US 11705130 B2
Regarding Claim 4, the combination teaches the method claim 2 as identified above.
York further teaches:
4. The method of claim 2, wherein the first frequency-domain indicator includes determining a representation of a frequency dispersion of spectral components of the digital representation of the audio signal, the dispersion determined from a frequency domain transform corresponding to one frame. York teaches (“(221) In some examples, determination module 850 determines whether a first data stream obtained from a speech sensor (or a portion thereof) indicates user speech. In some examples, determining whether the first data stream indicates user speech includes analyzing the time domain and/or frequency domain features of the first data stream to determine whether such features indicate human speech. Exemplary time domain features include zero crossing rates, short-time energy, spectral energy, spectral flatness, and autocorrelation. Exemplary frequency domain features include mel-frequency cepstral coefficients, linear predictive cepstral coefficients, and mel-frequency discrete wavelet coefficients. In addition to analyzing such features, one of skill in the art will appreciate that any other suitable technique (e.g., comparing the data stream to a human speech model) may be employed to determine whether a data stream indicates user speech. For example, speech detector 852 processes the first data stream to determine that a 2 second portion of the data stream indicates user speech. For example, speech detector 852 determines that each frame of a consecutive series of frames (having a collective 2 second duration) indicates user speech.” col. 47, lines 33- 54) by York et al. US 11705130 B2
Regarding Claim 11, the combination teaches the method claim 1 as identified above.
York further teaches:
11. The method of claim 1, wherein the second stage comprises an MFC indicator. York teaches (“(221) In some examples, determination module 850 determines whether a first data stream obtained from a speech sensor (or a portion thereof) indicates user speech. In some examples, determining whether the first data stream indicates user speech includes analyzing the time domain and/or frequency domain features of the first data stream to determine whether such features indicate human speech. Exemplary time domain features include zero crossing rates, short-time energy, spectral energy, spectral flatness, and autocorrelation. Exemplary frequency domain features include mel-frequency cepstral coefficients, linear predictive cepstral coefficients, and mel-frequency discrete wavelet coefficients. In addition to analyzing such features, one of skill in the art will appreciate that any other suitable technique (e.g., comparing the data stream to a human speech model) may be employed to determine whether a data stream indicates user speech. For example, speech detector 852 processes the first data stream to determine that a 2 second portion of the data stream indicates user speech. For example, speech detector 852 determines that each frame of a consecutive series of frames (having a collective 2 second duration) indicates user speech.” col. 47, lines 33- 54) by York et al. US 11705130 B2
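The mel-frequency cepstral coefficients York cites are computed on the mel scale, a perceptually motivated frequency warping. The sketch below shows the standard HTK-style hertz-to-mel mapping and mel-spaced filter-bank center frequencies; the filter count and band edges are arbitrary example values, not taken from York.

```python
import math

def hz_to_mel(f_hz):
    """Standard HTK-style mel-scale mapping used in MFCC front ends."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Mel filter-bank center frequencies are spaced uniformly in mel, not in hertz.
# The band edges (0-8000 Hz) and filter count (10) are arbitrary examples.
low, high, n_filters = 0.0, 8000.0, 10
mel_points = [hz_to_mel(low) + i * (hz_to_mel(high) - hz_to_mel(low)) / (n_filters + 1)
              for i in range(n_filters + 2)]
centers_hz = [mel_to_hz(m) for m in mel_points]
print(round(hz_to_mel(1000)))  # 1000 -- the scale is anchored near 1 kHz
```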
Regarding Claim 15, the combination teaches the method of claim 1 as identified above.
York further teaches:
15. The method of claim 1, comprising sending a duration determined to contain speech to another system. York teaches (“(316) At block 1130, in accordance with a determination that the user is speaking: at least a portion of the one or more data streams to is provided to an external electronic device (e.g., 800, 900), the portion including data representing a received speech input requesting performance of a task associated with the notification. In some examples, the speech input does not include a trigger phrase for initiating a digital assistant.”) by York et al. US 11705130 B2
Regarding Claim 16, the combination teaches the method of claim 15 as identified above.
York further teaches:
16. The method of claim 15, wherein the sending the duration determined to contain speech to another system occurs at least partially concurrently with the receiving a digital audio signal corresponding to the duration. (“(316) At block 1130, in accordance with a determination that the user is speaking: at least a portion of the one or more data streams to is provided to an external electronic device (e.g., 800, 900), the portion including data representing a received speech input requesting performance of a task associated with the notification. In some examples, the speech input does not include a trigger phrase for initiating a digital assistant.”) by York et al. US 11705130 B2
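As a non-limiting illustration of claims 15 and 16 (sending detected speech durations to another system at least partially concurrently with receipt of the signal), the following sketch forwards each qualifying chunk as it arrives rather than after the stream ends. It is not drawn from any cited reference; the detector, the `send` callback, and the chunk contents are hypothetical.

```python
def stream_and_forward(audio_chunks, is_speech, send):
    # Forward each duration judged to contain speech as soon as it is
    # classified, interleaved with (i.e., at least partially concurrent
    # with) the ongoing receipt of the incoming digital audio signal.
    for chunk in audio_chunks:
        if is_speech(chunk):
            send(chunk)

sent = []
stream_and_forward([b"hello", b"\x00\x00", b"world"],
                   is_speech=lambda c: c != b"\x00\x00",  # hypothetical detector
                   send=sent.append)
assert sent == [b"hello", b"world"]
```

In a real system `send` would hand the data to an external device or service; here a list stands in so the interleaving is observable.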
Regarding Claim 17, the combination teaches the method of claim 1 as identified above.
York does not explicitly teach a third stage comprising at least one temporal indicator to assess whether the identified candidate speech duration contains speech.
Otani further teaches:
17. The method of claim 1, comprising applying a third stage comprising at least one temporal indicator to assess whether the identified candidate speech duration contains speech. Otani teaches (“[0023] FIG. 13 shows how the flatness of a given signal is evaluated based on the maximum difference between adjacent spectral components.”) (“[0036] Referring to FIG. 1B, signal segments with a flatter frequency spectrum are regarded as noise, and signal segments with a less flat frequency spectrum are regarded as speech. The voice activity detector 10 of the present invention identifies talkspurts in a given signal accurately by evaluating the flatness of power spectrum of an input signal to determine whether each segment of the signal contains speech or noise.”) by Otani et al. US 20050108004 A1
Otani is considered to be analogous to the claimed invention because it relates to a voice activity detector, and more particularly to a voice activity detector which discriminates talkspurts from background noises in a given input signal.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York to incorporate the teachings of Otani in order to include a temporal-length indicator.
One could have been motivated to do so because the model can set an appropriate flag to indicate the result. (“[0114] … and it sets an appropriate flag to indicate the result.”) by Otani et al. US 20050108004 A1
Regarding Claim 18, the combination teaches the method of claim 17 as identified above.
Otani further teaches:
18. The method of claim 17, wherein the candidate speech duration is determined not to contain speech if a temporal length of the duration is less than a specified value. Otani teaches (“[0092] … It sets an appropriate flag to indicate the result. FIG. 15 illustrates how talkspurts are differentiated from noise periods, where the horizontal axis represents frames (time) …”) (“0054] Talkspurt periods can be distinguished from noise periods by calculating the flatness of a power spectrum in the way described above. The following will explain how the spectral flatness varies depending on whether the signal contains speech or only background noise.”) (“[0058] The flatness factor FLT1 of signal X1 (FIG. 7) is obviously greater than FLT2 of signal X2 (FIG. 8). This fact indicates that the signal X1 is speech while the signal X2 is noise. Note here that a larger value of FLT means a less flat spectrum, and that a smaller value of FLT means a flatter spectrum. Talkspurts can be identified by calculating flatness factors of spectrums and comparing them (the voice/noise discriminator 13 actually compares the flatness factor with a predetermined threshold).”) Otani teaches (“[0114] (S14) The voice/noise discriminator 33f compares the flatness factor of each frame with a predetermined threshold. Through this comparison the voice/noise discriminator 33f determines whether the frame in question is speech or noise, and it sets an appropriate flag to indicate the result.”) by Otani et al. US 20050108004 A1.
Otani is considered to be analogous to the claimed invention because it relates to a voice activity detector, and more particularly to a voice activity detector which discriminates talkspurts from background noises in a given input signal.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York to incorporate the teachings of Otani in order to include a temporal-length indicator.
One could have been motivated to do so because the model can set an appropriate flag to indicate the result. (“[0114] … and it sets an appropriate flag to indicate the result.”) by Otani et al. US 20050108004 A1
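As a non-limiting illustration of the temporal indicator of claims 17 and 18 (a candidate duration is determined not to contain speech if its temporal length is less than a specified value), the following sketch gates candidate intervals by length. It is not drawn from any cited reference; the interval representation and minimum length are hypothetical.

```python
def prune_short_durations(candidates, min_frames):
    # Drop any candidate speech duration whose temporal length (in frames)
    # is less than the specified value. Candidates are (start, end) frame
    # index pairs with the end index exclusive.
    return [(s, e) for (s, e) in candidates if (e - s) >= min_frames]

# The 3-frame and 2-frame candidates are rejected as too short to be
# speech; only the 20-frame candidate survives.
kept = prune_short_durations([(0, 3), (10, 30), (40, 42)], min_frames=5)
assert kept == [(10, 30)]
```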
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over York and Otani in view of Xiao et al. US 20150179187 A1.
Regarding Claim 5, the combination teaches the method of claim 4 as identified above.
The combination does not explicitly teach comparing the frequency spectrum with a threshold.
Xiao teaches:
5. The method of claim 4, comprising comparing the determined representation of the dispersion with a first threshold and declaring a candidate speech duration in response to a result of the comparison. Xiao teaches (“[0066] The VAD detection technology for voice segment segmentation may be approximately divided into two steps.”) (“[0067] Step 1: Identify, frame by frame, whether each frame in a voice signal segment is active or non-active. According to a common method in the prior art, activity of each frame is determined by calculating information, such as energy and a frequency spectrum, of each frame and comparing the energy and the frequency spectrum of each frame with a threshold. When the energy and the frequency spectrum of each frame are less than the threshold, the frame is defined to be non-active; otherwise, the frame is defined to be active.”) (“[0077] FIG. 5A and FIG. 5B are schematic diagrams of a voice segment segmentation algorithm according to Embodiment 5 of the present invention. For ease of description, B10 is equivalent to T0 in FIG. 4, B21 is equivalent to T1 in FIG. 4, and a duration [B10, B21] is a voice signal segment. The voice signal segment is detected by means of VAD, and it is determined that voice activity of the following durations [B10, T10], [T11, T20] and [T21, B21] are 0, that is, a status is non-active. Voice activity of durations [T10, T11] and [T20, T21] are 1, that is, a status is active.”) by Xiao et al. US 20150179187 A1
Xiao is considered to be analogous to the claimed invention because it relates to the field of audio technologies, and more specifically, to a voice quality monitoring method and apparatus.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York and Otani to incorporate the teachings of Xiao in order to include comparing a per-frame feature with a threshold.
One could have been motivated to do so because the model can analyze a long audio signal at relatively low cost. (“[0005] In view of this, embodiments of the present invention provide a voice quality monitoring method and apparatus, so as to solve a difficult problem of how to perform proper voice quality monitoring on a relatively long audio signal by using relatively low costs.”) by Xiao et al. US 20150179187 A1
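As a non-limiting illustration of the frame-by-frame threshold comparison quoted from Xiao (“activity of each frame is determined by calculating information, such as energy … and comparing [it] with a threshold”), the following sketch labels frames active or non-active. It is not drawn from any cited reference; the frame contents and threshold value are hypothetical.

```python
def frame_activity(frames, energy_threshold):
    # Label each frame active (1) or non-active (0) by comparing its
    # short-time energy with a fixed (hypothetical) threshold.
    return [1 if sum(x * x for x in f) >= energy_threshold else 0
            for f in frames]

frames = [[0.0, 0.0, 0.0, 0.0],        # silence
          [0.5, -0.5, 0.5, -0.5],      # strong signal
          [0.01, 0.01, 0.01, 0.01]]    # weak background
assert frame_activity(frames, energy_threshold=0.1) == [0, 1, 0]
```

Runs of consecutive active frames then delimit the candidate speech durations, as in the [B10, B21] segmentation quoted above.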
Claims 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over York, Otani, and Xiao in view of Visser et al. US 20130282373 A1.
Regarding Claim 6, the combination teaches the method of claim 5 as identified above.
The combination does not explicitly teach adjusting the first threshold based upon a central tendency of the first frequency-domain indicator determined using multiple frames.
Visser teaches:
6. The method of claim 5, further comprising adjusting the first threshold based upon a central tendency of the first frequency-domain indicator determined using multiple frames. Visser teaches peak of the frequency range. (“[0206] The bin-wise VAD 2087 may determine voice activity based on the peak information, the bin-wise SNR and the frame-wise voice indicator 2079. For example, the bin-wise VAD 2087 may detect voice activity on a bin-wise basis. More specifically, the bin-wise VAD 2087 may determine which of the peaks indicated by the peak map block/module 2083 are speech peaks. The bin-wise VAD 2087 may generate a bin-wise voice indicator 2089, which may indicate any bins for which voice activity is detected. In particular, the bin-wise voice indicator 2089 may indicate speech peaks and/or non-speech peaks in the transformed audio signal 2071. The peak removal block/module 2090 may remove non-speech peaks.[0207] The bin-wise VAD 2087 may indicate peaks that are associated with speech based on distances between adjacent peaks and temporal continuity. For example, the bin-wise VAD 2087 may indicate small peaks (e.g., peaks that are more than a threshold amount (e.g., 30 dB) below the maximum peak). The bin-wise voice indicator 2089 may indicate these small peaks to the peak removal block/module 2090, which may remove the small peaks from the transformed audio signal 2071. For example, if peaks are determined to be significantly lower (e.g., 30 dB) than a maximum peak, they may not be related to the speech envelope and are thus eliminated. [0208] Additionally, if two peaks are within a certain frequency range (e.g., 90 Hz) and their magnitudes are not much different (e.g., less than 12 dB), the lower one may be indicated as a non-speech peak by the bin-wise VAD 2087 and may be removed by the peak removal block/module 2090. The frequency range may be adjusted depending on speakers. 
For example, the frequency range may be increased for women or children, who have a relatively higher pitch.[0209] The bin-wise VAD 2087 may also detect temporally isolated peaks (based on the peaks indicated by the peak map block/module 2083, for instance). For example, the bin-wise VAD 2087 may compare peaks from one or more other frames (e.g., previous frame(s) and/or subsequent frame(s)) to peaks in a current frame. For instance, the bin-wise VAD 2087 may detect peaks in a frame that do not have a corresponding peak in a previous frame within a particular range. The range may vary based on the location of the peak. For example, the bin-wise VAD may determine that a peak has a corresponding peak in a previous frame (e.g., that the peak is temporally continuous) if a corresponding peak is found in a previous frame within .+-.1 bin for lower-frequency peaks and within .+-.3 bins for higher-frequency peaks. The bin-wise VAD 2087 may indicate temporally isolated peaks (e.g., peaks in a current frame without corresponding peaks in a previous frame) to the peak removal block/module 2090, which may remove the temporally isolated peaks from the transformed audio signal 2071.”) by Visser et al. US 20130282373 A1
Visser is considered to be analogous to the claimed invention because it relates to systems and methods for audio signal processing.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York, Otani, and Xiao to incorporate the teachings of Visser in order to include comparing a frequency-peak feature with a threshold.
One could have been motivated to do so because the model can improve voice detection. (“[0061] The techniques disclosed herein may be used to improve voice activity detection (VAD) in order to enhance speech processing, such as voice coding. The disclosed voice activity detection techniques may be used to improve the accuracy and reliability of voice detection, and thus, to improve functions that depend on voice activity detection, ….”) by Visser et al. US 20130282373 A1
Regarding Claim 7, the combination teaches the method of claim 6 as identified above.
Visser further teaches:
7. The method of claim 6, wherein the first threshold is adjusted based upon frames that are determined not to contain speech. Visser teaches (“[0208] Additionally, if two peaks are within a certain frequency range (e.g., 90 Hz) and their magnitudes are not much different (e.g., less than 12 dB), the lower one may be indicated as a non-speech peak by the bin-wise VAD 2087 and may be removed by the peak removal block/module 2090. The frequency range may be adjusted depending on speakers. For example, the frequency range may be increased for women or children, who have a relatively higher pitch.”) by Visser et al. US 20130282373 A1
Visser is considered to be analogous to the claimed invention because it relates to systems and methods for audio signal processing.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York, Otani, and Xiao to incorporate the teachings of Visser in order to include comparing a frequency-peak feature with a threshold.
One could have been motivated to do so because the model can improve voice detection. (“[0061] The techniques disclosed herein may be used to improve voice activity detection (VAD) in order to enhance speech processing, such as voice coding. The disclosed voice activity detection techniques may be used to improve the accuracy and reliability of voice detection, and thus, to improve functions that depend on voice activity detection, ….”) by Visser et al. US 20130282373 A1
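As a non-limiting illustration of claims 6 and 7 (adjusting the first threshold from a central tendency of the indicator over multiple frames, in particular frames determined not to contain speech), the following sketch re-derives a threshold from the mean of noise-only frames. It is not drawn from any cited reference; the margin factor and indicator values are hypothetical.

```python
def adapt_threshold(indicator_values, speech_flags, margin=1.5):
    # Re-derive the detection threshold from the central tendency (here the
    # arithmetic mean) of the indicator over frames judged NOT to be speech;
    # the margin factor is a hypothetical tuning parameter.
    noise_vals = [v for v, s in zip(indicator_values, speech_flags) if not s]
    if not noise_vals:
        return None  # no noise-only frames observed yet
    return margin * sum(noise_vals) / len(noise_vals)

values = [1.0, 2.0, 8.0, 3.0, 9.0]
flags = [False, False, True, False, True]   # which frames were called speech
assert adapt_threshold(values, flags) == 3.0  # mean of noise frames (2.0) * 1.5
```

Because only non-speech frames feed the statistic, the threshold tracks the noise floor without being pulled upward by loud speech frames.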
Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over York and Otani in view of Lee et al. US 20220108687 A1.
Regarding Claim 8, the combination teaches the method of claim 2 as identified above.
York does not explicitly teach a logarithm of a frequency domain transform of a time-domain representation.
Lee teaches:
8. The method of claim 2, wherein the second stage comprises a pitch indicator, the pitch indicator comprising an inverse frequency domain transform of: a logarithm of a frequency domain transform of a time-domain representation of a respective one of the frames amongst the respective frames. Lee teaches (“[0093] Next the frequency domain data 136 are transformed by taking the log of the magnitude of the frequency coefficients at 138 and then applying an inverse Fourier transform using FFT 140. These operations transform the data into cepstral coefficients 144 in the frequency domain. Representing the speech signal in the frequency domain is useful in analyzing at what frequencies the sound pressure levels peak in a speech utterance.”) (“[0115] Another subclass of spectrally derived features is known as cepstral based features. The cepstrum is essentially the power spectrum of the logpower spectrum. Given the log power spectrum, as derived above, we can compute the power cepstrum for the kth frame as: … … 1. Cepstral peaks can be used to identify the fundamental frequency, FO, that is, pitch estimation. Cepstral peak:
ceps=DCT(log(|FFT(x)|.sup.2)) …”) (“[0123] In the Short-Term Fourier Transform (STFT) domain, the harmonics of the pitch frequency for voice frames become evident in the magnitude spectrum of the signal. The STFT is formed by taking the DFT from Hamming windowed buffered signal frames with possible zero padding. This observation serves as the basis for the harmonic product spectrum technique which has been utilized for noise-robust pitch detection. The HPS in the log-spectral domain is defined as: …”) by Lee et al. US 20220108687 A1
Lee is considered to be analogous to the claimed invention because it relates to voice detection and speech signal processing.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York and Otani to incorporate the teachings of Lee in order to include a pitch indicator.
One could have been motivated to do so because the model can improve recognizability. (“[0003] … Speech enhancement (SE) systems are sometimes used to improve recognizability. …”) by Lee et al. US 20220108687 A1
Regarding Claim 9, the combination teaches the method of claim 8 as identified above.
The combination does not explicitly teach wherein the pitch indicator includes determining a central tendency of a magnitude of a specified range of bins within the inverse frequency domain transform.
Lee teaches:
9. The method of claim 8, wherein the pitch indicator includes determining a central tendency of a magnitude of a specified range of bins within the inverse frequency domain transform. Lee teaches (“[0093] Next the frequency domain data 136 are transformed by taking the log of the magnitude of the frequency coefficients at 138 and then applying an inverse Fourier transform using FFT 140. These operations transform the data into cepstral coefficients 144 in the frequency domain. Representing the speech signal in the frequency domain is useful in analyzing at what frequencies the sound pressure levels peak in a speech utterance.”) (“[0115] Another subclass of spectrally derived features is known as cepstral based features. The cepstrum is essentially the power spectrum of the logpower spectrum. Given the log power spectrum, as derived above, we can compute the power cepstrum for the kth frame as: … … 1. Cepstral peaks can be used to identify the fundamental frequency, FO, that is, pitch estimation. Cepstral peak:
ceps=DCT(log(|FFT(x)|.sup.2)) …”) (“[0123] In the Short-Term Fourier Transform (STFT) domain, the harmonics of the pitch frequency for voice frames become evident in the magnitude spectrum of the signal. The STFT is formed by taking the DFT from Hamming windowed buffered signal frames with possible zero padding. This observation serves as the basis for the harmonic product spectrum technique which has been utilized for noise-robust pitch detection. The HPS in the log-spectral domain is defined as: …”) (“[0124] The periodicity is computed as the maximum peak of P(t; !) in the plausible pitch range: …”) by Lee et al. US 20220108687 A1
Lee is considered to be analogous to the claimed invention because it relates to voice detection and speech signal processing.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York and Otani to incorporate the teachings of Lee in order to include a pitch indicator.
One could have been motivated to do so because the model can improve recognizability. (“[0003] … Speech enhancement (SE) systems are sometimes used to improve recognizability. …”) by Lee et al. US 20220108687 A1
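As a non-limiting illustration of claims 8 and 9 (a pitch indicator formed as an inverse frequency-domain transform of the logarithm of a frequency-domain transform of a frame, with a central tendency taken over a specified range of bins), the following sketch computes a real cepstrum with a naive DFT. It is not drawn from any cited reference; the frame length, epsilon, and quefrency bin range are hypothetical.

```python
import cmath
import math

def dft(x):
    # Naive discrete Fourier transform (illustration only; O(n^2)).
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def cepstrum(frame, eps=1e-9):
    # Inverse frequency-domain transform of the logarithm of the frequency
    # domain transform of the frame (the real cepstrum); magnitudes returned.
    log_mag = [math.log(abs(c) + eps) for c in dft(frame)]
    n = len(log_mag)
    return [abs(sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * q / n)
                    for k in range(n))) / n
            for q in range(n)]

def pitch_indicator(ceps, lo, hi):
    # Central tendency (mean magnitude) over a specified range of bins of
    # the inverse transform; the bin range is a hypothetical choice.
    return sum(ceps[lo:hi]) / (hi - lo)

# An impulse train with period 8 produces harmonics in the spectrum and a
# strong cepstral peak near quefrency bin 8 (the pitch period).
frame = [1.0 if t % 8 == 0 else 0.0 for t in range(64)]
c = cepstrum(frame)
assert c[8] > c[5]
```

A voiced frame thus yields a large indicator over the bin range containing its pitch period, while an unvoiced or noise frame does not.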
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over York, Otani, and Lee in view of Xiao et al. US 20150179187 A1.
Regarding Claim 10, the combination teaches the method of claim 9 as identified above.
Xiao teaches:
10. The method of claim 9, comprising comparing the determined central tendency to a threshold and declaring a candidate speech duration to be speech if the threshold is exceeded. Fig. 2, Xiao teaches average value (i.e. central tendency) (“[0058] S22. For each input signal of the unit of time, determine whether an average value of the numbers of pitch components included in each input signal of the unit of time is larger. The average value of the numbers of pitch components is compared with a threshold, if the average value of the numbers of pitch components is larger, that is, a result of the determining in S22 is "yes", S23 is performed. Otherwise, a result of the determining in S22 is "no", S24 is performed.”) (“[0060] S24. For each input signal of the unit of time, determine whether a distribution ratio of the pitch components of each input signal of the unit of time at a low frequency is smaller. The distribution ratio of the pitch components is compared with a threshold, if the distribution ratio of the pitch components at a low frequency is smaller, that is, a result of the determining in S24 is "yes", S23 is performed. …”) by Xiao et al. US 20150179187 A1
Xiao is considered to be analogous to the claimed invention because it relates to the field of audio technologies, and more specifically, to a voice quality monitoring method and apparatus.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York, Otani, and Lee to incorporate the teachings of Xiao in order to include comparing the determined central tendency with a threshold.
One could have been motivated to do so because the model can analyze a long audio signal at relatively low cost. (“[0005] In view of this, embodiments of the present invention provide a voice quality monitoring method and apparatus, so as to solve a difficult problem of how to perform proper voice quality monitoring on a relatively long audio signal by using relatively low costs.”) by Xiao et al. US 20150179187 A1
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over York and Visser et al. US 20130282373 A1.
Regarding Claim 12, the combination teaches the method of claim 11 as identified above.
The combination does not explicitly teach wherein the MFC indicator includes determining a representation of a dispersion of the MFC of the digital representation of the audio signal, the dispersion determined from an MFC transform corresponding to at least two frames.
Visser teaches:
12. The method of claim 11, wherein the MFC indicator includes determining a representation of a dispersion of the MFC of the digital representation of the audio signal, the dispersion determined from an MFC transform corresponding to at least two frames. Visser teaches peak of the frequency range.(“[0184] … a noise statistic (e.g., spectral flatness measure) estimation block/module 1747, TF phase voice activity detection/gain difference based suppression block/module 1749, voice activity detection-based residual noise suppression block/module 1751, comb filtering block/module 1755 and an inverse fast Fourier transform block module 1757 that process one or more intermediate signals 1776a-f into an output signal 1780. …”) (“[0206] The bin-wise VAD 2087 may determine voice activity based on the peak information, the bin-wise SNR and the frame-wise voice indicator 2079. For example, the bin-wise VAD 2087 may detect voice activity on a bin-wise basis. More specifically, the bin-wise VAD 2087 may determine which of the peaks indicated by the peak map block/module 2083 are speech peaks. The bin-wise VAD 2087 may generate a bin-wise voice indicator 2089, which may indicate any bins for which voice activity is detected. In particular, the bin-wise voice indicator 2089 may indicate speech peaks and/or non-speech peaks in the transformed audio signal 2071. The peak removal block/module 2090 may remove non-speech peaks.[0207] The bin-wise VAD 2087 may indicate peaks that are associated with speech based on distances between adjacent peaks and temporal continuity. For example, the bin-wise VAD 2087 may indicate small peaks (e.g., peaks that are more than a threshold amount (e.g., 30 dB) below the maximum peak). The bin-wise voice indicator 2089 may indicate these small peaks to the peak removal block/module 2090, which may remove the small peaks from the transformed audio signal 2071. 
For example, if peaks are determined to be significantly lower (e.g., 30 dB) than a maximum peak, they may not be related to the speech envelope and are thus eliminated. [0208] Additionally, if two peaks are within a certain frequency range (e.g., 90 Hz) and their magnitudes are not much different (e.g., less than 12 dB), the lower one may be indicated as a non-speech peak by the bin-wise VAD 2087 and may be removed by the peak removal block/module 2090. The frequency range may be adjusted depending on speakers. For example, the frequency range may be increased for women or children, who have a relatively higher pitch.[0209] The bin-wise VAD 2087 may also detect temporally isolated peaks (based on the peaks indicated by the peak map block/module 2083, for instance). For example, the bin-wise VAD 2087 may compare peaks from one or more other frames (e.g., previous frame(s) and/or subsequent frame(s)) to peaks in a current frame. For instance, the bin-wise VAD 2087 may detect peaks in a frame that do not have a corresponding peak in a previous frame within a particular range. The range may vary based on the location of the peak. For example, the bin-wise VAD may determine that a peak has a corresponding peak in a previous frame (e.g., that the peak is temporally continuous) if a corresponding peak is found in a previous frame within .+-.1 bin for lower-frequency peaks and within .+-.3 bins for higher-frequency peaks. The bin-wise VAD 2087 may indicate temporally isolated peaks (e.g., peaks in a current frame without corresponding peaks in a previous frame) to the peak removal block/module 2090, which may remove the temporally isolated peaks from the transformed audio signal 2071.” [0295].. 
The term "frequency component" is used to indicate one among a set of frequencies or frequency bands of a signal, such as a sample of a frequency domain representation of the signal (e.g., as produced by a fast Fourier transform) or a subband of the signal (e.g., a Bark scale or mel scale subband).”) by Visser et al. US 20130282373 A1
Visser is considered to be analogous to the claimed invention because it relates to systems and methods for audio signal processing.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York and Otani to incorporate the teachings of Visser in order to include comparing a frequency-peak feature with a threshold.
One could have been motivated to do so because the model can improve voice detection. (“[0061] The techniques disclosed herein may be used to improve voice activity detection (VAD) in order to enhance speech processing, such as voice coding. The disclosed voice activity detection techniques may be used to improve the accuracy and reliability of voice detection, and thus, to improve functions that depend on voice activity detection, ….”) by Visser et al. US 20130282373 A1.
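As a non-limiting illustration of claim 12 (a representation of a dispersion of the MFC determined from an MFC transform corresponding to at least two frames), the following sketch computes a per-coefficient variance across frames. It is not drawn from any cited reference; it assumes the mel-frequency cepstral vectors were already computed elsewhere, and the input values are hypothetical.

```python
def mfc_dispersion(mfc_frames):
    # Per-coefficient variance across at least two frames as a simple
    # representation of dispersion; assumes the mel-frequency cepstral
    # vectors have already been computed elsewhere (hypothetical input).
    assert len(mfc_frames) >= 2
    n_coeffs = len(mfc_frames[0])
    dispersion = []
    for i in range(n_coeffs):
        column = [frame[i] for frame in mfc_frames]
        mean = sum(column) / len(column)
        dispersion.append(sum((v - mean) ** 2 for v in column) / len(column))
    return dispersion

# Coefficient 0 varies between the two frames; coefficient 1 does not.
assert mfc_dispersion([[1.0, 2.0], [3.0, 2.0]]) == [1.0, 0.0]
```

Speech tends to vary its cepstral shape frame to frame, so a larger dispersion over a window is one cue that the window contains speech rather than stationary noise.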
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over York, Otani, and Visser et al. US 20130282373 A1 in view of Xiao, CN 104409081 A.
Regarding Claim 13, the combination teaches the method of claim 12 as identified above.
The combination does not explicitly teach comparing the determined representation of dispersion of the MFC to at least one threshold and at least one of adjusting a candidate speech duration or declaring a candidate speech duration to be speech in response to a result of the comparison.
Xiao teaches:
13. The method of claim 12, comprising comparing the determined representation of dispersion of the MFC to at least one threshold and at least one of adjusting a candidate speech duration or declaring a candidate speech duration to be speech in response to a result of the comparison. Xiao teaches (“[0151] When the second characteristic value is the cepstrum distance, detecting the cepstrum distance of the voice signal is greater than a preset cepstrum distance threshold value, if it is more than the preset cepstrum distance threshold, it is determined that the voice signal belonging to unvoiced signal. otherwise, determining that the voice signal belonging to a voice signal. wherein the preset cepstrum distance threshold value is preset according to the actual requirement of the empirical value.”) (“[0220] when the first characteristic value is a cepstrum distance, detecting the cepstrum distance of the voice signal is greater than a preset cepstrum distance threshold value, if it is more than the preset cepstrum distance threshold value, determining the frame voice signal to voice signal. otherwise, determining that the frame voice signal to the voice signal. wherein the preset cepstrum distance threshold value is preset according to the actual requirement of the empirical value.”)
Xiao is considered to be analogous to the claimed invention because it relates to generally to the field of speech recognition.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York, Otani, and Visser to incorporate the teachings of Xiao in order to include comparing a per-frame feature with a threshold.
One could have been motivated to do so because the model improves the real-time performance and efficiency of the beatbox treatment. (“[0092] … low efficiency problem, improves the real-time performance of the beatbox treatment and efficiency, and there is no need for artificial post repair, reaches the effect of automatically detecting beatbox. …”) by Xiao, CN 104409081 A
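As a non-limiting illustration of the cepstrum-distance thresholding quoted from Xiao, CN 104409081 A (“detecting the cepstrum distance of the voice signal is greater than a preset cepstrum distance threshold value”), the following sketch classifies a frame by its distance from a noise reference. It is not drawn from any cited reference; the vectors, the noise reference, and the threshold are hypothetical empirical values.

```python
def cepstral_distance(c1, c2):
    # Euclidean distance between two cepstral coefficient vectors.
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

def is_speech(frame_ceps, noise_ref_ceps, threshold):
    # Declare the frame speech when its cepstral distance from a noise
    # reference exceeds a preset (empirical) threshold, noise otherwise.
    return cepstral_distance(frame_ceps, noise_ref_ceps) > threshold

assert is_speech([3.0, 4.0], [0.0, 0.0], threshold=4.0)       # distance 5.0
assert not is_speech([0.1, 0.1], [0.0, 0.0], threshold=4.0)
```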
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over York and Otani in view of JANKOWSKI et al. US 20200074997 A1.
Regarding Claim 14, York teaches the method of claim 1 as identified above.
York does not explicitly teach an MFC indicator and a pitch indicator.
JANKOWSKI teaches:
14. The method of claim 1, wherein the second stage comprises both an MFC indicator and a pitch indicator. JANKOWSKI teaches (“A voice activity detection method includes: training one or more computerized neural networks having a denoising autoencoder and a classifier, wherein the training is performed utilizing one or more models including Mel-frequency cepstral coefficients (MFCC) features, Δ features, ΔΔ features, and Pitch features, …”) by JANKOWSKI et al. US 20200074997 A1
JANKOWSKI is considered to be analogous to the claimed invention because it relates to voice recognition systems and methods for extracting speech and filtering speech from other audio waveforms.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify York and Otani to incorporate the teachings of JANKOWSKI in order to include both an MFC indicator and a pitch indicator.
One could have been motivated to do so because the model adds robustness to voice activity detection in noisy conditions. (“[0066] Various embodiments of the present disclosure provide improvements over existing VAD systems by utilizing a series of techniques that add robustness to voice activity detection in noisy conditions, …”) by JANKOWSKI et al. US 20200074997 A1
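As a non-limiting illustration of claim 14 (a second stage comprising both an MFC indicator and a pitch indicator), the following sketch shows one simple way to combine two per-frame indicators, requiring both to agree before flagging speech. It is not drawn from any cited reference; the scalar "frames" and the two stand-in indicator functions are hypothetical.

```python
def second_stage(frames, mfc_indicator, pitch_indicator):
    # One simple way to apply both indicators: flag a frame as speech only
    # when the MFC-based and the pitch-based indicator both agree.
    return [mfc_indicator(f) and pitch_indicator(f) for f in frames]

# Hypothetical stand-in indicators operating on scalar "frames".
flags = second_stage([1, 2, 3],
                     mfc_indicator=lambda f: f > 1,
                     pitch_indicator=lambda f: f < 3)
assert flags == [False, True, False]
```

A conjunction is only one combination rule; a weighted score or a trained classifier over the joint feature set (as in JANKOWSKI's MFCC-plus-pitch neural network features) would be alternatives.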
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FOUZIA HYE SOLAIMAN whose telephone number is (571)270-5656. The examiner can normally be reached M-F, 8 AM to 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Paras D. Shah can be reached at (571) 270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/F.H.S./Examiner, Art Unit 2653
/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656