Prosecution Insights
Last updated: April 19, 2026
Application No. 18/682,416

REVERB AND NOISE ROBUST VOICE ACTIVITY DETECTION BASED ON MODULATION DOMAIN ATTENTION

Status: Final Rejection (§103)
Filed: Feb 08, 2024
Examiner: PATEL, SHREYANS A
Art Unit: 2659
Tech Center: 2600 (Communications)
Assignee: Dolby Laboratories Licensing Corporation
OA Round: 2 (Final)

Grant Probability: 89% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 3m
With Interview: 96%

Examiner Intelligence

Career Allow Rate: 89% (359 granted / 403 resolved; +27.1% vs TC avg; above average)
Interview Lift: +7.4% (moderate), based on resolved cases with interview
Typical Timeline: 2y 3m average prosecution; 46 currently pending
Career History: 449 total applications across all art units
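As a sanity check on the figures above, the allow rate and the interview-adjusted estimate can be reproduced from the raw counts (a sketch assuming the reported lift is simply added to the baseline rate, which is how the dashboard appears to combine them):

```python
granted, resolved = 359, 403
interview_lift = 0.074  # reported +7.4% lift

allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.0%}")                    # 89%
print(f"With interview:    {allow_rate + interview_lift:.0%}")   # 96%
```

The rounded results (89% and 96%) match the displayed Grant Probability and With Interview figures.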

Statute-Specific Performance

§101: 21.3% (-18.7% vs TC avg)
§103: 36.0% (-4.0% vs TC avg)
§102: 22.6% (-17.4% vs TC avg)
§112: 8.8% (-31.2% vs TC avg)
Tech Center averages are estimates. Based on career data from 403 resolved cases.
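A quick consistency check on the table above: if each delta is (examiner rate minus Tech Center average), the implied TC average can be recovered per statute (a sketch; the sign convention of the delta is an assumption):

```python
# statute: (examiner rejection rate %, delta vs TC avg %)
stats = {"101": (21.3, -18.7), "103": (36.0, -4.0),
         "102": (22.6, -17.4), "112": (8.8, -31.2)}

for statute, (rate, delta) in stats.items():
    tc_avg = rate - delta  # delta assumed to equal rate - tc_avg
    print(f"§{statute}: implied TC average {tc_avg:.1f}%")
```

Under that assumption every statute implies the same 40.0% Tech Center average, suggesting the dashboard compares against a single TC-wide estimate rather than per-statute averages.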

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application is being examined under the pre-AIA first to invent provisions.

Response to Arguments

Applicant's arguments with respect to the 35 U.S.C. 101 abstract-idea rejection of claims 1-18 have been considered and found persuasive in view of the amendments, and the rejection has been withdrawn. The amendments allow the use of the MSM and DI in the modulation frequency domain for better separation of speech from reverberation than the conventional approach. Applicant's arguments with respect to 35 U.S.C. 103 regarding claim 1 have been considered but are not found persuasive, for the reasons below. See the detailed rejection. New claim 19 has been rejected. New claim 20 is allowable.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5-9, 12, 16 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Mizumoto et al. 
(US 2016/0064000) in view of Morita et al. ("Robust Voice Activity Detection based on Concept of Modulation Transfer Function in Noisy Reverberant Environments"; 11 June 2015; pgs. 163-173).

Regarding claim 1, Mizumoto teaches receiving, from an input device, a piece of new audio data representing an input audio signal ([0079-0080] sound signal-acquiring part 41 receives a sound signal; converts the acquired sound signals from analog signals to digital signals); obtaining, by a processor, a specific spectral temporal amplitude (STA) as a time-frequency representation corresponding to a time point covered by the new audio data in a time domain ([0134-0135] complex input spectrum Y(k, l), wherein k is an index indicating a frequency and l is an index indicating each frame; calculates a power spectrum); generating an enhanced STA that filters reverberation and other noise from the specific STA ([0086] the noise-suppressing part 46 suppresses noise components, including inversely filtering a room impulse response, a technique based on power spectrum estimation of a sound source, a technique based on a modulation transfer function (MTF) theory, and a technique based on geometric sound separation (GSS); and outputs a reverberation-suppressed sound signal); calculating one or more features from the enhanced STA ([0087] extracts a mel-scale logarithmic spectrum (MSLS), which is a sound feature value, from the reverberation-suppressed speech signal for each sound source input from the noise-suppressing part 46); creating one or more feature vectors using the DI and the one or more features ([0087-0088] the extracted feature values are provided for downstream processing: outputs the extracted sound feature value and utterance detection uses feature values; combining the DI with an extracted feature into a vector is a routine implementation of feature modeling/classification: feature extraction and feature modeling for classification in VAD); determining an estimate of an extent of speech in the 
piece of the new audio data based on the one or more feature vectors ([0088] determines silent section, determines utterance section based on feature values); updating the piece of the new audio data based on the estimate of the extent of speech to produce the enhanced piece of audio data representing an output audio signal that is less reverberant or noisy than the input audio signal ([0089] extracts the sound signal (subjected to the sound source-separating and noise-suppressing process) generated during the utterance section); and transmitting the enhanced piece of audio data to an output device ([0200] transmission part 70 transmits, and the transmitted information includes utterance-section and MSLS information).

The difference between the prior art and the claimed invention is that Mizumoto does not explicitly teach an audio method for producing an enhanced piece of audio data based on data in a modulation frequency domain, the audio method comprising: obtaining a modulation spectrum measure (MSM) for the time point, having an acoustic band dimension and a modulation band dimension, from one or more STAs obtained from the new audio data; and computing, based on the MSM, a diffuseness indicator (DI) that indicates a degree of diffuseness in a modulation frequency domain for the piece of the new audio data, wherein a higher DI indicates a more reverberant or noisy audio signal represented by the piece of the new audio data. 
Morita teaches an audio method for producing an enhanced piece of audio data based on data in a modulation frequency domain ([Abstract] the ill effects of noise and reverberation on speech can be regarded as the modulation transfer function (MTF) under noisy and reverberant conditions), the audio method comprising: obtaining a modulation spectrum measure (MSM) for the time point, having an acoustic band dimension and a modulation band dimension, from one or more STAs obtained from the new audio data ([3.1.2], [3.2] the modulation spectrum Êx(z) of êx²(t) is derived from the modulation spectrum Êl(z) of êl²(t) (see eq. 10); we adopt a constant-bandwidth filterbank (CBFB) and we have 40 sub-bands); computing, based on the MSM, a diffuseness indicator (DI) that indicates a degree of diffuseness in a modulation frequency domain for the piece of the new audio data, wherein a higher DI indicates a more reverberant or noisy audio signal represented by the piece of the new audio data ([3.1.1], [3.1.2] reverberation/noise indicators used in modulation-domain processing: two parameters, TR and â, are the reverberation time and amplitude used in inverse filtering of the MTF; the MTF under the additive-noise condition is a function of SNR only, which means flat characteristics in the modulation frequency domain (recognizing that a DI can be implemented as a monotonic function of TR and/or an SNR-derived MTF term to indicate a more reverberant or noisy signal)). 
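To make the MSM/DI terminology concrete, here is a minimal numerical sketch. It is illustrative only, not the claimed method or either reference's implementation: the MSM is approximated as the FFT of each acoustic band's temporal envelope, and the spectral-flatness choice for the DI is an assumption (speech concentrates modulation energy around 2-8 Hz, so a flatter modulation spectrum suggests more diffuse reverberation or noise):

```python
import numpy as np

def modulation_spectrum(sta):
    """MSM-like measure: FFT along time of each acoustic band's
    envelope in an STA (|STFT| magnitudes, shape: frames x bands).
    Rows index modulation frequency, columns index acoustic bands."""
    env = sta - sta.mean(axis=0, keepdims=True)  # drop per-band DC
    return np.abs(np.fft.rfft(env, axis=0))

def diffuseness_indicator(msm):
    """Hypothetical DI: spectral flatness (geometric / arithmetic mean)
    of the modulation spectrum, averaged over acoustic bands. Values
    near 1 indicate a flat (diffuse) modulation spectrum; values near
    0 indicate peaked, speech-like modulation."""
    eps = 1e-12
    geo = np.exp(np.log(msm + eps).mean(axis=0))
    arith = msm.mean(axis=0) + eps
    return float((geo / arith).mean())

# A 4 Hz amplitude-modulated envelope (speech-like) vs. random noise
t = np.arange(256) / 100.0  # 100 frames/sec
speech = np.tile((1 + 0.9 * np.sin(2 * np.pi * 4 * t))[:, None], (1, 8))
noise = np.random.default_rng(0).random((256, 8))
assert diffuseness_indicator(modulation_spectrum(noise)) > \
       diffuseness_indicator(modulation_spectrum(speech))
```

The peaked 4 Hz modulation yields a low flatness score, while broadband noise yields a higher one, which is the ordering the claim language requires of a DI.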
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Mizumoto with the teachings of Morita, by modifying the sound source-separating device and method as taught by Mizumoto to include an audio method for producing an enhanced piece of audio data based on data in a modulation frequency domain, the audio method comprising: obtaining a modulation spectrum measure (MSM) for the time point, having an acoustic band dimension and a modulation band dimension, from one or more STAs obtained from the new audio data; and computing, based on the MSM, a diffuseness indicator (DI) that indicates a degree of diffuseness in a modulation frequency domain for the piece of the new audio data, wherein a higher DI indicates a more reverberant or noisy audio signal represented by the piece of the new audio data, as taught by Morita, for the benefit of reducing the effects of background noise and reverberation under noisy conditions (Morita [Section 2.2]).

Regarding claim 5, Mizumoto further teaches the audio method of claim 1, wherein the obtaining comprises computing the MSM using pieces of new audio data corresponding to a certain number of consecutive time points before the time point with a fast Fourier transform ([0087] performing inverse discrete cosine transform on a mel frequency cepstrum coefficient using the spectrum feature value).

Regarding claim 6, Mizumoto further teaches the audio method, wherein generating the enhanced STA comprises filtering out values of the MSM outside an excluded range of modulation frequency bands ([0086] inversely filtering a room impulse response, a technique based on power spectrum estimation of a sound source, a technique based on a modulation transfer function (MTF) theory). 
Regarding claim 7, Mizumoto further teaches the audio method of claim 6, wherein the excluded range of modulation frequency bands is from 3 Hz to 30 Hz ([0077] components in a frequency band (for example, 200 Hz to 4 kHz); it is up to the user to decide the frequency range needed/required).

Regarding claim 8, Mizumoto further teaches the audio method of claim 1, wherein generating the enhanced STA comprises computing a smoothed spectral temporal energy through aggregation over time ([Figs. 14-15], [0129-0130] the amplitude of the signal level of the noise signal is equal to or less than 0.01 [Vp-p] in the waveform).

Regarding claim 9, Mizumoto further teaches the audio method of claim 1, wherein generating the enhanced STA comprises eliminating residual noise through tracking a minimum spectral temporal energy over time ([Figs. 14-15], [0129-0130] the amplitude of the signal level of the noise signal is equal to or less than 0.01 [Vp-p] in the waveform).

Regarding claim 12, Mizumoto further teaches the audio method of claim 1, wherein the calculating comprises computing an enhanced mel-frequency filter cepstral coefficient (MFCC) using the enhanced STA ([0087] performing inverse discrete cosine transform on a mel frequency cepstrum coefficient (MFCC) using the spectrum feature value).

Regarding claim 16, Morita further teaches the audio method of claim 1, further comprising: receiving the new audio data in a time domain; and converting the piece of the new audio data corresponding to the time point into the specific spectral temporal amplitude (STA) as a time-frequency representation ([Section 2.2] a VAD by means of restoring the temporal power envelope based on the modulation MTF concept to reduce the effects of background noise and reverberation under noisy conditions). 
Regarding claim 18, Mizumoto further teaches the audio method of claim 1, wherein the computing comprises using values of the MSM within a range of acoustic frequency bands from 125 Hz to 8,000 Hz ([0077] components in a frequency band (for example, 200 Hz to 4 kHz); it is up to the user to decide the frequency range needed/required).

Regarding claim 19, Mizumoto further teaches the audio method of claim 1, wherein the input device is configured to convert sounds into electrical signals ([0079] converts the acquired sound signals from analog signals to digital signals); and wherein the output device is configured to convert electrical signals into sounds ([0079] outputs the converted sound signals to the sound signal evaluating part 42).

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Mizumoto et al. (US 2016/0064000) in view of Morita et al. ("Robust Voice Activity Detection based on Concept of Modulation Transfer Function in Noisy Reverberant Environments"; 11 June 2015; pgs. 163-173), and further in view of Yang et al. (US 2025/0131941).

Regarding claim 17, Mizumoto and Morita teach all the limitations of claim 1. The difference between the prior art and the claimed invention is that neither Mizumoto nor Morita explicitly teaches that the generating is based on Parseval's theorem. Yang teaches generating based on Parseval's theorem ([0065] Parseval's theorem indicates that the sum (or integral) of the square of a function is equal to the sum (or integral) of the square of its Fourier transform). 
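The Parseval relation Yang cites is easy to verify numerically. A minimal sketch using NumPy's DFT (the 1/N factor reflects NumPy's unnormalized FFT convention, an implementation detail rather than part of the cited disclosure):

```python
import numpy as np

x = np.random.default_rng(1).standard_normal(512)
X = np.fft.fft(x)

time_energy = np.sum(np.abs(x) ** 2)
freq_energy = np.sum(np.abs(X) ** 2) / len(x)  # 1/N for unnormalized DFT

assert np.isclose(time_energy, freq_energy)
```

This energy equivalence is what lets the "generating" step be computed in whichever domain (time or frequency) is cheaper without changing the result.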
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Mizumoto and Morita with the teachings of Yang, by modifying the sound source-separating device and method as taught by Mizumoto in view of Morita to include generating based on Parseval's theorem, as taught by Yang, for the benefit of estimating the extent of speech in the piece of the new audio data (Yang [0005]).

Allowable Subject Matter

Claims 2-4, 10-11 and 13-15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims AND overcoming the 101 abstract-idea rejection. Claim 20 is allowable.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL, whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday, 8am-5pm PST. 
Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Primary Examiner, Art Unit 2653
/SHREYANS A PATEL/ Examiner, Art Unit 2659

Prosecution Timeline

Feb 08, 2024
Application Filed
Sep 15, 2025
Non-Final Rejection — §103
Dec 29, 2025
Interview Requested
Jan 07, 2026
Applicant Interview (Telephonic)
Jan 07, 2026
Examiner Interview Summary
Jan 20, 2026
Response Filed
Feb 24, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586597: ENHANCED AUDIO FILE GENERATOR (granted Mar 24, 2026; 2y 5m to grant)
Patent 12586561: TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM, AND A METHOD OF CALCULATING AN EXPRESSIVITY SCORE (granted Mar 24, 2026; 2y 5m to grant)
Patent 12548549: ON-DEVICE PERSONALIZATION OF SPEECH SYNTHESIS FOR TRAINING OF SPEECH RECOGNITION MODEL(S) (granted Feb 10, 2026; 2y 5m to grant)
Patent 12548583: ACOUSTIC CONTROL APPARATUS, STORAGE MEDIUM AND ACCOUSTIC CONTROL METHOD (granted Feb 10, 2026; 2y 5m to grant)
Patent 12536988: SPEECH SYNTHESIS METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM (granted Jan 27, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 89%
With Interview: 96% (+7.4%)
Median Time to Grant: 2y 3m
PTA Risk: Moderate

Based on 403 resolved cases by this examiner. Grant probability derived from career allow rate.
