Prosecution Insights
Last updated: April 19, 2026
Application No. 18/617,602

VOICE DETECTION METHOD, VOICE DETECTION DEVICE, AND COMPUTER DEVICE

Status: Non-Final OA (§102, §103)
Filed: Mar 26, 2024
Examiner: ROBERTS, SHAUN A
Art Unit: 2655
Tech Center: 2600 — Communications
Assignee: AAC Technologies (Nanjing) Co., Ltd.
OA Round: 1 (Non-Final)
Grant Probability: 76% (Favorable)
OA Rounds: 1-2
To Grant: 2y 10m
With Interview: 86%

Examiner Intelligence

Grants 76% — above average.
Career Allow Rate: 76% (491 granted / 647 resolved; +13.9% vs TC avg)
Interview Lift: +10.3% (moderate; among resolved cases with interview)
Avg Prosecution: 2y 10m (typical timeline)
Total Applications: 678 across all art units; 31 currently pending

Statute-Specific Performance

§101: 7.6% (-32.4% vs TC avg)
§103: 49.2% (+9.2% vs TC avg)
§102: 29.5% (-10.5% vs TC avg)
§112: 3.5% (-36.5% vs TC avg)
Based on career data from 647 resolved cases; the Tech Center average for each statute is an estimate.

Office Action

DETAILED ACTION

1. This action is responsive to Application No. 18/617,602 filed 3/26/2024. All claims have been examined and are currently pending.

Notice of Pre-AIA or AIA Status

2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Allowable Subject Matter

3. Claims 6-7 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Claim Objections

4. Claim 8 is objected to because of the following informalities: the claim recites "an acquiring module, wherein the acquisition module is…". The terms should match for consistency (both should read either "acquiring module" or "acquisition module"). Appropriate correction is required.

Claim Rejections - 35 USC § 102

5. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

6. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

7. Claims 1, 8-9 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yang et al. (11,823,706).

Regarding claim 1, Yang et al. (11,823,706) teaches: A voice detection method (abstract: method of detecting human voice activity; col 1 l. 28-30: detecting voice activity in a frame of audio signal based on a plurality of features extracted from the audio signal), comprising steps: acquiring an audio sequence (col 1 l. 28-29); extracting a first audio feature from the audio sequence, and performing voice detection on the audio sequence according to the first audio feature to obtain a first voice detection result (col 1 l. 30-38: The audio signal representing sound from an environment surrounding a client device is applied with a plurality of feature extraction algorithms to determine a plurality of features of the audio signal. The plurality of features are compared to respective predefined criteria, where the respective predefined criteria are indicative of a presence of a human voice in the audio signal. For each feature, an output is generated based on the comparison of the feature to its associated predefined criteria.); extracting a second audio feature from the audio sequence, and performing the voice detection on the audio sequence according to the second audio feature to obtain a second voice detection result (col 1 l. 30-38: The audio signal representing sound from an environment surrounding a client device is applied with a plurality of feature extraction algorithms to determine a plurality of features of the audio signal. The plurality of features are compared to respective predefined criteria, where the respective predefined criteria are indicative of a presence of a human voice in the audio signal. For each feature, an output is generated based on the comparison of the feature to its associated predefined criteria.); and determining a voice detection result of the audio sequence according to the first voice detection result and the second voice detection result (col 1 l. 40-43: The outputs of the plurality of features is then used to determine an aggregate value for the frame of audio signal. Based on the aggregate value, presence of a human voice in the frame of audio signal is determined for the frame, and a voice detection signal indicative of the presence of a human voice is generated and output.).

Regarding claim 8, Yang teaches: A voice detection device (fig 1, 2), comprising: an acquiring module (fig 1, 2), a first audio feature extraction module (fig 1, 2), a second audio feature extraction module (fig 1, 2), and a voice detection module (fig 1, 2); wherein the acquisition module is configured to acquire an audio sequence; the first audio feature extraction module is configured to extract a first audio feature from the audio sequence and perform voice detection on the audio sequence according to the first audio feature to obtain a first voice detection result; wherein the second audio feature extraction module is configured to extract a second audio feature from the audio sequence and perform the voice detection on the audio sequence according to the second audio feature to obtain a second voice detection result; and wherein the voice detection module is configured to determine a voice detection result of the audio sequence according to the first voice detection result and the second voice detection result. The claim recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.

Regarding claim 9, Yang teaches: A computer device (fig 1, 2; col 2 l. 15-16: computing device; col 5 l. 12-29), comprising: a memory, a processor, and a computer program (col 5 l. 12-29); wherein the computer program is stored in the memory and is executable on the processor (fig 1, 2; col 5 l. 12-29); the processor implements the voice detection method according to claim 1 when executing the computer program (fig 1, 2; col 5 l. 12-29). The claim recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.

Claim Rejections - 35 USC § 103

8. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

9. Claims 2, 4 are rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Muesch (2015/0243300).
Regarding claim 2, Yang teaches: The voice detection method according to claim 1, wherein the first audio feature comprises [an average energy], an energy ratio, and a zero-crossing rate of an audio signal (col 7 l. 57: zero-crossing rate; col 9 l. 27-30: energy ratio); the step of extracting the first audio feature from the audio sequence and performing the voice detection on the audio sequence according to the first audio feature to obtain the first voice detection result comprises steps: performing sampling frequency conversion and framing processing on the audio sequence to obtain frames of audio sub-signals (col 1 l. 28-29: frame of an audio signal; col 6 l. 35: frame of audio signal; col 8 l. 23-24, l. 27: down-sampled audio signal); calculating [an average energy and] a zero-crossing rate of each of the frames of the audio sub-signals according to each of the frames of the audio sub-signals to obtain [the average energy and] the zero-crossing rate of the audio signal (col 7 l. 57: zero-crossing rate); obtaining energy spectra of the audio sub-signals, obtaining low-frequency band energy and high-band energy according to the energy spectra, and calculating a ratio between an average energy of the low-frequency band energy and an average energy of the high-band energy to obtain the energy ratio of the audio signal (col 9 l. 27-30: the energy ratio calculation module 215 determines an energy ratio in the frame of audio signal between an energy in low frequency bands and an energy in high frequency bands); and performing the voice detection on the audio sequence according to [the average energy], the zero-crossing rate, and the energy ratio of the audio signal to obtain the first voice detection result (col 7 l. 43-46: The zero-crossing detection module 205 determines a number of zero-crossing points in a frame of audio signal and compares the number to a threshold value to generate a detection result.; col 7 l. 57: zero-crossing rate; col 8 l. 1-3; col 9 l. 27-30; col 10 l. 31-35: The decision module 235 receives outputs from the zero-crossing detection module 205, periodicity detection module 210, energy ratio calculation module 215, and EFR determination module 220 to decide whether there is a human voice present in the audio signal.).

Yang does not specifically teach the average energy, where Muesch (2015/0243300) teaches average energy (abstract; 0008: detecting voice activity; determining speech activity level of the frame based on…average of an energy of each of the plurality of subbands). It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Muesch for an improved system, allowing for additional criteria in helping to determine voice activity detection, while still presenting a reasonable expectation of success. Yang already teaches using multiple criteria to determine voice activity, and one could look to Muesch to further incorporate average energy to still allow for the use of multiple features to determine that there is a human voice in the audio (Yang col 6 l. 30-55: The audio processing module 158 may determine characteristics of the audio signal by applying feature extraction algorithms to extract features from the audio signal and comparing each of the extracted features to corresponding predefined criteria indicative of human voice. Based on the comparison of each of the extracted features to its corresponding predefined criteria, the audio processing module 158 determines whether the feature is indicative of human voice. For example, the audio processing module 158 determines an aggregate value that represents an overall result of the feature comparisons of the individual features. The aggregate value is compared to a threshold, and if the aggregate value is greater than the threshold, the audio processing module 158 determines that there is a human voice in the audio signal.).

Regarding claim 4, Yang and Muesch teach: The voice detection method according to claim 2, wherein the step of performing the voice detection on the audio sequence according to the average energy, the zero-crossing rate, and the energy ratio of the audio signal to obtain the first voice detection result comprises steps: comparing the average energy of the audio signal with a first predetermined threshold (Muesch abstract; 0008); comparing the energy ratio of the audio signal with a second predetermined threshold (Yang col 9 l. 27-39); comparing the zero-crossing rate of the audio signal with a third predetermined threshold (Yang col 7 l. 57-67); and determining that the first voice detection result is a voice when the average energy of the audio signal is greater than the first predetermined threshold, the energy ratio of the audio signal is greater than the second predetermined threshold, and the zero-crossing rate of the audio signal is greater than the third predetermined threshold (col 1 l. 28-44: Embodiments relate to detecting voice activity in a frame of audio signal based on a plurality of features extracted from the audio signal. The audio signal representing sound from an environment surrounding a client device is applied with a plurality of feature extraction algorithms to determine a plurality of features of the audio signal. The plurality of features are compared to respective predefined criteria, where the respective predefined criteria are indicative of a presence of a human voice in the audio signal. For each feature, an output is generated based on the comparison of the feature to its associated predefined criteria. The outputs of the plurality of features is then used to determine an aggregate value for the frame of audio signal. Based on the aggregate value, presence of a human voice in the frame of audio signal is determined for the frame, and a voice detection signal indicative of the presence of a human voice is generated and output. col 10 l. 31-55: The decision module 235 receives outputs from the zero-crossing detection module 205, periodicity detection module 210, energy ratio calculation module 215, and EFR determination module 220 to decide whether there is a human voice present in the audio signal. Each of the feature extraction modules may generate an output representing one or more characteristics indicative of a presence of a human voice in the frame of audio signal and provide the output to the decision module 235. In some embodiments, each of the modules may send a binary output for a given frame of audio signal. For example, an output of "1" may indicate that a feature corresponding to the module is indicative of human voice while an output of "0" indicates that the feature is not indicative of human voice. The decision module 235 compiles the outputs to determine an aggregate value, and determines whether the features indicate that the frame includes a human voice. To determine the aggregate value, the decision module 235 may add the outputs and compare the sum to a decision threshold value. For example, in the example shown in FIG. 2, there are four features being extracted. The decision module 235 may determine that when at least two of the four features are indicative of human voice, there is a human voice (i.e., when the sum of the outputs is at least two). The decision module 235 may also compare to a different threshold number.). Claim 4 is further rejected for similar rationale and reasoning as claim 2.

10. Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Muesch (2015/0243300) in further view of Matsuoka et al. (2016/0364963).

Regarding claim 3, Yang teaches: The voice detection method according to claim 2, wherein the step of obtaining the energy spectra of the audio sub-signals and obtaining the low-frequency band energy and the high-band energy according to the energy spectra comprises steps: obtaining the low-frequency band energy and the high-frequency band energy from a frequency domain [through fast Fourier transform]; or respectively obtaining a low-frequency signal and a high-frequency signal through a time-domain filter and a predetermined cut-off frequency, and calculating the low-frequency band energy of the low-frequency signal and the high-frequency band of the high-frequency signal (col 9 l. 40-54: the low frequency band may be defined as frequencies below a cut off frequency (e.g., 3 kHz) and the high frequency band may be defined as frequencies above the cut off frequency. The audio signal may be divided into the low frequency band and the high frequency band by applying a filter in the time domain. For example, a high pass filter with a cut off frequency at 3 kHz can be applied to the received audio signal to extract signals in the high frequency band. After the audio signal is divided into the low frequency band and the high frequency band, the energy of the low frequency band and the energy of the high frequency band are determined. The energy ratio calculation module 215 determines the energy ratio by dividing the low frequency band energy by the high frequency band energy.).

Yang does not specifically teach the fast Fourier transform branch, where Matsuoka et al. (2016/0364963) teaches wherein the step of obtaining the low-frequency band energy and the high-frequency band energy from the frequency domain through the fast Fourier transform comprises: performing windowing processing on each of the frames of the audio sub-signals to obtain windowing processing results (123); respectively performing the fast Fourier transform on the windowing processing results to obtain fast Fourier transform results (122-123); respectively calculating the energy spectra according to the fast Fourier transform results (123-124); and counting the high-frequency band energy and the low-frequency band energy from the energy spectra ([0123] Each of the plurality of FFT information extractors 1104 is associated with a respective frequency or frequency band. After the FFT engine 1102 generates a series of audio data from the audio signal, each of the plurality of FFT information extractors 1104 is applied to filter the audio data of each time window in association with a corresponding frequency or frequency band. Thus, the audio data is split to a plurality of sound components each associated with a respective frequency or frequency band and including a series of time windows. For example, similarly to the time domain approach, the feature extractor 706 here can include three FFT information extractors 1104 that are associated with three frequency bands: 900 Hz and below, 1000-5000 Hz, and 6000 Hz and above. Likewise, the audio data is split to three sound components: a low frequency sound component SCL (900 Hz and below), an intermediate frequency sound component SCI (1000-5000 Hz), and a high frequency sound component SCH (6000 Hz and above). Each sound component SCL, SCI or SCH includes the series of time windows TW1-TWn but is only associated with part of the acoustic sound within these time windows. Unlike the time domain approach, the sound components of the audio data are represented in the frequency domain, for example as FFT coefficients, in this frequency domain.). It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Matsuoka, presenting a reasonable expectation of success in still obtaining the low frequency band energy and high frequency band energy to determine the energy ratio to help make voice activity decisions.

11. Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Muesch (2015/0243300) in further view of Matsuoka et al. (2016/0364963) in further view of Keiller (6,560,575).

Regarding claim 5, Yang teaches: The voice detection method according to claim 4, and performing the sampling frequency conversion and segmentation processing on the audio sequence to obtain audio segments (col 1 l. 28-29: frame of an audio signal; col 6 l. 35: frame of audio signal; col 8 l. 23-24, l. 27: down-sampled audio signal); but does not specifically teach the spectral modulation energy, where Keiller (6,560,575) teaches wherein the second feature comprises a spectral modulation energy (col 5 l. 24-43: The way in which the endpoint detector 68 operates in this embodiment, will now be described with reference to FIGS. 6 to 9. In this embodiment, speech is detected by treating the average frame energy of the input signal as a sampled signal and looking for modulations within that sampled signal that are characteristic of speech. In particular, the energy due to speech is strongly modulated at frequencies around 4 Hz, with very little modulation below 1 Hz or above 10 Hz. In contrast, changes in noise level tend to occur relatively slowly, typically modulating the signal energy at less than 1 Hz. In addition, random fluctuations in the noise energy are uncorrelated from frame to frame and are spread over the modulation frequency range from 0 Hz to half the frame rate. Therefore, in this embodiment, the endpoint detector 68 is arranged to detect the presence of speech by band-pass filtering the average frame energy in a frequency band between 2 Hz and 6 Hz, by calculating the modulation power within this frequency band and by applying a detection threshold to the calculated modulation power.); the step of extracting the second audio feature from the audio sequence and performing the voice detection on the audio sequence according to the second audio feature to obtain the second voice detection result comprises steps: [performing the sampling frequency conversion and segmentation processing on the audio sequence to obtain audio segments]; calculating a Mel spectrum for each of the audio segments to obtain a Mel spectrogram containing channels (fig 15; col 10 l. 39-43: Mel filter bank); performing the [fast] Fourier transform on each of the channels in the Mel spectrogram, and calculating a normalized modulation energy of each of the channels (col 10 l. 18-20: DFT; col 13 l. 13-25, l. 22-23: energy…is then normalised); and performing the voice detection on the audio sequence according to the normalized modulation energy of each of the channels to obtain the second voice detection result (col 5 l. 24-43: speech is detected by treating the average frame energy of the input signal as a sampled signal and looking for modulations within that sampled signal that are characteristic of speech.).

It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Keiller for an improved system, allowing for additional criteria in helping to determine voice activity detection, while still presenting a reasonable expectation of success. Yang already teaches using multiple criteria to determine voice activity, and one could look to Keiller to further incorporate modulation energy to still allow for the use of multiple features to determine that there is a human voice in the audio (Yang col 6 l. 30-55). Keiller teaches a DFT (col 10 l. 18-20) but does not specifically teach an FFT, where Matsuoka teaches FFT (122-123: FFT). It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Matsuoka, presenting a reasonable expectation of success in still obtaining the required form of the audio signal.

Conclusion

12. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: see PTO-892.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571) 270-7541. The examiner can normally be reached Monday-Friday 9-5 EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached on 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655
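
To make the rejected claim language concrete, here is a minimal sketch of the two-feature structure claim 1 recites: two independent detections on the same audio sequence, combined into one result. The frame size, both thresholds, and the AND-combination rule are illustrative assumptions, not taken from the application or from Yang.

```python
# Hypothetical sketch of a claim 1-style pipeline: two feature-based
# detectors whose per-frame results are fused into one decision.
import numpy as np

def detect_by_energy(frame: np.ndarray, threshold: float = 1e-3) -> bool:
    """First detection result: mean short-time energy vs. a threshold."""
    return float(np.mean(frame ** 2)) > threshold

def detect_by_zero_crossings(frame: np.ndarray, threshold: float = 0.1) -> bool:
    """Second detection result: zero-crossing rate vs. a threshold."""
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)
    return zcr > threshold

def detect_voice(audio: np.ndarray, frame_len: int = 512) -> list[bool]:
    """Fuse the two per-frame results. AND is one possible reading of
    'determining a voice detection result according to the first voice
    detection result and the second voice detection result'."""
    results = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        results.append(detect_by_energy(frame) and detect_by_zero_crossings(frame))
    return results
```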
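The claim 2/4 combination rejection turns on three first-feature measurements, each compared against its own threshold: average energy (attributed to Muesch), and zero-crossing rate plus low/high band energy ratio (attributed to Yang). A hedged sketch of that logic follows; the 3 kHz cutoff echoes Yang's example, while the window choice and every threshold value are assumptions.

```python
# Sketch of the three-threshold first-feature detection mapped in the
# claim 2/4 rejection. Not the application's implementation.
import numpy as np

def first_feature_detection(frame: np.ndarray,
                            sample_rate: int = 16_000,
                            cutoff_hz: float = 3_000.0,
                            energy_thr: float = 1e-3,
                            ratio_thr: float = 1.0,
                            zcr_thr: float = 0.05) -> bool:
    # Average energy of the frame.
    avg_energy = float(np.mean(frame ** 2))

    # Zero-crossing rate of the frame.
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)

    # Energy spectrum via FFT, split at the cutoff into low and high bands
    # (Yang's example cutoff is 3 kHz).
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    low_avg = float(np.mean(spectrum[freqs < cutoff_hz]))
    high_avg = float(np.mean(spectrum[freqs >= cutoff_hz]))
    ratio = low_avg / (high_avg + 1e-12)

    # Claim 4 logic: voice only when all three exceed their thresholds.
    return avg_energy > energy_thr and ratio > ratio_thr and zcr > zcr_thr
```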

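Claim 5's second feature rests on Keiller's observation that speech modulates frame energy at roughly 4 Hz, with most of the modulation power between 2 Hz and 6 Hz. The sketch below measures that on the frame-energy sequence; the Mel filter bank channelization actually recited in claim 5 is omitted, and the frame length and detection threshold are assumptions.

```python
# Simplified modulation-energy detector in the spirit of Keiller:
# treat per-frame average energy as a sampled signal and measure the
# normalized modulation power in the 2-6 Hz speech band.
import numpy as np

def modulation_energy_detect(audio: np.ndarray,
                             sample_rate: int = 16_000,
                             frame_len: int = 160,
                             threshold: float = 0.2) -> bool:
    # Average energy of each frame, treated as a sampled signal.
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.mean(frames ** 2, axis=1)

    # Spectrum of the (mean-removed) energy sequence; the frame rate is
    # its effective sampling rate (16 kHz / 160 = 100 frames per second).
    frame_rate = sample_rate / frame_len
    spec = np.abs(np.fft.rfft(energies - np.mean(energies))) ** 2
    freqs = np.fft.rfftfreq(n_frames, d=1.0 / frame_rate)

    # Normalized modulation energy in the 2-6 Hz band, then threshold.
    band = float(spec[(freqs >= 2.0) & (freqs <= 6.0)].sum())
    total = float(spec.sum()) + 1e-12
    return band / total > threshold
```
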
Prosecution Timeline

Mar 26, 2024: Application Filed
Oct 08, 2025: Non-Final Rejection, §102 and §103 (current)

Precedent Cases

Applications granted by the same examiner in similar technology

Patent 12586599: AUDIO SIGNAL PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM WITH MACHINE LEARNING AND FOR MICROPHONE MUTE STATE FEATURES IN A MULTI PERSON VOICE CALL (granted Mar 24, 2026; 2y 5m to grant)
Patent 12586568: SYNTHETICALLY GENERATING INNER SPEECH TRAINING DATA (granted Mar 24, 2026; 2y 5m to grant)
Patent 12573376: Dynamic Language and Command Recognition (granted Mar 10, 2026; 2y 5m to grant)
Patent 12562157: GENERATING TOPIC-SPECIFIC LANGUAGE MODELS (granted Feb 24, 2026; 2y 5m to grant)
Patent 12555562: VOICE SYNTHESIS FROM DIFFUSION GENERATED SPECTROGRAMS FOR ACCESSIBILITY (granted Feb 17, 2026; 2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview: 86% (+10.3%)
Median Time to Grant: 2y 10m
PTA Risk: Low
Based on 647 resolved cases by this examiner. Grant probability derived from career allow rate.
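
A quick check of how these headline figures relate, assuming (per the caption above) that grant probability is simply the career allow rate and that the interview figure adds the observed +10.3% lift:

```python
# Worked arithmetic under the stated assumption; not the tool's actual model.
granted, resolved = 491, 647
allow_rate = granted / resolved            # 0.7589... -> displayed as 76%
with_interview = allow_rate + 0.103        # 0.8619... -> displayed as 86%
print(f"{allow_rate:.1%} -> {with_interview:.1%}")  # 75.9% -> 86.2%
```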
