DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the preliminary claim amendment filed on May 22, 2024, in which claims 4-5 and 8-9 were amended, claims 6-7 and 10 were canceled, and claims 11-18 were newly added.
Accordingly, claims 1-5, 8-9, and 11-18 are currently pending and examined in this Office Action.
In responding to this Office Action, the Examiner respectfully requests that support be shown for language added to any original claims by amendment and for any newly added claims. That is, indicate support for newly added claim language by specifically pointing to the page(s) and line number(s) in the specification and/or the drawing figure(s). This will assist the Examiner in prosecuting this application.
Claim Objections
Claims 2-5 are objected to because of the following informalities:
Claim 2 recites “The method according to claim 1,” which should read --The voice separation method according to claim 1--. Claim 3 is further objected to due to its dependency.
Claims 3-5 are objected to for at least similar reasons as described for claim 2 above, since claims 3-5 recite similarly deficient features.
Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 4-5, 8-9, 11, 13-15, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. (CN 113113044 A, a copy of which was made of record in the IDS filed on August 12, 2024; an English translation is attached herewith; hereinafter Xu) in view of Qu et al. (CN 110648656 A, copies of the original and an English translation of which are attached herewith; hereinafter Qu).
Claim 1: Xu teaches a voice separation method (title and abstract, ln 1-6, a method in fig. 3), comprising:
obtaining a voice feature corresponding to a voice to be processed (mixed audio including target audio and non-target audio from multiple sources, para 2-3, p.8) and a bottleneck feature corresponding to a reference voice (target object and first voiceprint feature of the target object determined, as the bottleneck feature with respect to the mixed audio, S11, p.6 and para 4-7, p.7, and by a determining module, para 6, p.4);
inputting, into a voice separation model (a pre-separation module, para 7, p.4), the voice feature corresponding to the voice to be processed (independent vector analysis for pre-separation applied to the mixed audio as the voice feature, para 7-10, p.11) and obtaining a voice detection result output by the voice separation model (multiple voice signals obtained via pre-separation processing, S12, p.7, including the pre-separated n-channel voice signals); and
determining, on a basis of the voice detection result (via an extraction module, a splicing module, and the target speaker separation module, the last two paragraphs of p.12 and para 1-2 of p.13), a target voice segment matching the reference voice in the voice to be processed (a target speaker’s voice is generated through the post-processing module, para 3, p.13), wherein the voice feature corresponding to the target voice segment matches the bottleneck feature corresponding to the reference voice (matching between the target audio in the mixed audio and the target object having the first voiceprint is determined, para 1, p.14, and para 4 through the end of p.10).
However, Xu does not explicitly teach also inputting, into the voice separation model, the bottleneck feature corresponding to the reference voice for obtaining the disclosed voice detection result.
Qu teaches an analogous field of endeavor by disclosing a voice separation method (title and abstract, ln 1-15, a method in figs. 1-5, executed on an electronic device in fig. 6), and wherein Qu further teaches inputting, into a voice separation model (pre-trained voice detection model, para 4, p.12), the voice feature corresponding to the voice to be processed (frequency characteristics, energy characteristics, and zero-crossing rate characteristics of one or more to-be-detected sound frames, para 4, p.12) and the bottleneck feature corresponding to the reference voice (frequency characteristics, energy characteristics, and zero-crossing rate of another one or more to-be-detected sound frames, para 4, p.12) and obtaining a voice detection result output by the voice separation model (obtaining the detection result of each to-be-detected sound frame, including voice frames and non-speech frames, para 4, p.12) for the benefits of more accurately separating voice frames from others (the last paragraph of p.2; distinguishing sounds of different types by using different characteristics, para 1, p.10; by using combined MFCC and FBank features, para 5, p.11, with LSTER and HZCRR in fig. 3; and by using a strongly robust model of new technology applications including a deep neural network DNN plus LSTM, para 6, p.14).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied the inputting, into the voice separation model, of the voice feature corresponding to the voice to be processed and the bottleneck feature corresponding to the reference voice, and the obtaining of a voice detection result output by the voice separation model, as taught by Qu, to the voice separation model into which the voice feature corresponding to the voice to be processed is input and from which the voice detection result is obtained, as taught by Xu, for the benefits discussed above.
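For illustration only, the following is a minimal Python sketch of the general flow addressed in the rejection of claim 1 above (feeding both the voice feature of the mixed input and the bottleneck feature of the reference voice to a separation model, then selecting target segments from the per-frame detection result); the separation_model callable, the feature dimensions, and the 0.5 threshold are hypothetical stand-ins and are not taken from Xu, Qu, or the claims.

import numpy as np

def separate_target_segments(voice_feature, bottleneck_feature, separation_model,
                             frame_hop_s=0.01):
    # voice_feature: (n_frames, feat_dim) feature of the voice to be processed.
    # bottleneck_feature: (bn_dim,) feature derived from the reference voice.
    # separation_model: callable taking both inputs and returning per-frame
    # match scores in [0, 1] (the "voice detection result").
    detection = separation_model(voice_feature, bottleneck_feature)
    is_target = detection > 0.5

    # Group consecutive matching frames into target voice segments (start, end) in seconds.
    segments, start = [], None
    for i, flag in enumerate(is_target):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * frame_hop_s, i * frame_hop_s))
            start = None
    if start is not None:
        segments.append((start * frame_hop_s, len(is_target) * frame_hop_s))
    return segments

# Toy usage with a stand-in model that scores each frame by its similarity
# to the bottleneck vector (for illustration only).
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 40))
bn = rng.normal(size=(40,))
toy_model = lambda v, b: 1.0 / (1.0 + np.exp(-(v @ b) / np.sqrt(v.shape[1])))
print(separate_target_segments(feats, bn, toy_model))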
Claim 8 has been analyzed and rejected according to claim 1 above and the combination of Xu and Qu further teaches, an electronic device (Xu, a terminal device in fig. 5, and Qu, an electronic device in fig. 6) comprising:
a memory and a processor (Xu, memory for storing instructions, claim 19, and Qu, memory 804 and processors 820 in fig. 8);
the memory is configured to store computer program instructions (Xu, processor, claim 19, and Qu, the memory storing instructions, para 7, p.20); and
the processor is configured to execute the computer program instructions, so that the electronic device implements the voice separation method of claim 1 (Xu, execution of the instructions by the processor, claim 19, and Qu, the processors to execute instructions for completing method steps of the voice endpoint detection method, para 6, p.20).
Claim 9 has been analyzed and rejected according to claims 1, 8 above.
Claim 2: the combination of Xu and Qu further teaches, according to claim 1 above, wherein the inputting, into a voice separation model, the voice feature corresponding to the voice to be processed and the bottleneck feature corresponding to the reference voice and obtaining a voice detection result output by the voice separation model (Xu and Qu, as discussed in claim 1 above) comprises:
inputting the voice feature corresponding to the voice to be processed into a first neural network comprised in the voice separation model (Xu, the independent vector analysis IVA module for pre-separation processing, para 1, p.4, and Qu, the deep neural network model at the first classification layer, para 7, p.13), and obtaining a vector expression corresponding to the voice feature output by the first neural network (Qu, the stitched feature matrix of each to-be-detected sound frame); splicing the vector expression corresponding to the voice feature and the bottleneck feature corresponding to the reference voice to obtain a fusion feature (Qu, the stitched feature matrix of each to-be-detected sound frame is linearly mapped to obtain the fusion feature of each to-be-detected sound frame, para 1, p.14); and inputting the fusion feature into a second neural network comprised in the voice separation model (Qu, the fusion feature of each to-be-detected sound frame is input into the first classification layer, a TDNN+LSTM model with the 42-dimensional fusion feature, para 5-7, p.14, or the combination of the first classification layer and a second classification layer comprising a time-delay neural network and a long short-term memory network, para 1, p.16), obtaining a matrix output by the second neural network (Qu, the matrix output from the first classification layer to the second classification layer, para 1, p.16, and the second classification layer can be a time-delay neural network TDNN plus a long short-term memory network LSTM, para 1, p.17), and obtaining a voice detection result on a basis of the matrix (Qu, the detection result of each to-be-detected sound frame as a speech frame or a non-speech frame, para 4, p.14 and para 2, p.16, including the speech endpoint detected, para 2, p.17).
Claim 4: the combination of Xu and Qu further teaches, according to claim 1 above, wherein the voice separation model is obtained by training (Qu, training by using the concatenated or spliced or stitched MFCC feature, LSTER feature, and HZCRR feature, para 2, p.16) on a basis of the voice feature corresponding to the sample voice, the bottleneck feature corresponding to the sample voice (Qu, stitched MFCC feature, LSTER feature, and HZCRR feature above, para 2, p.16) and a labeled voice detection result of the sample voice (Qu, the class label of the frame to be trained and compared, para 2, p.16), and the sample voice comprises the reference voice (Qu, including voice frame and non-voice frame discussed above).
Claim 5: the combination of Xu and Qu further teaches, according to claim 1 above, wherein the voice feature comprises one or more of a FBank feature, a Mel frequency spectrum feature or a pitch feature (Markush grouping, see MPEP 2117; Xu, including tone, timbre, intensity, sound wave wavelength, frequency, rhythm of change, etc., para 4, p.7, and Qu, MFCC or Mel frequency cepstrum coefficient features and log spectral characteristics or FBank features).
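For illustration only, the following sketch shows one common way to compute features of the kinds recited in claim 5 (an FBank / log-Mel feature and a pitch feature), assuming the third-party librosa library is available; the parameter values and the synthetic test tone are hypothetical and are not taken from Xu or Qu.

import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)        # one second of a 220 Hz tone as a stand-in signal

# FBank-style feature: log Mel filterbank energies per frame.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
fbank = librosa.power_to_db(mel)               # shape (40, n_frames)

# Pitch feature: per-frame fundamental frequency via the YIN estimator.
pitch = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=160)

print(fbank.shape, pitch.shape)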
Claim 11 has been analyzed and rejected according to claims 8, 2 above.
Claim 13 has been analyzed and rejected according to claims 8, 4 above.
Claim 14 has been analyzed and rejected according to claims 8, 5 above.
Claim 15 has been analyzed and rejected according to claims 9, 2 above.
Claim 17 has been analyzed and rejected according to claims 9, 4 above.
Claim 18 has been analyzed and rejected according to claims 9, 5 above.
Claims 3, 12, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Xu in view of Qu (both as applied above) and further in view of Bocklet et al. (US 20190043488 A1, hereinafter Bocklet).
Claim 3: the combination of Xu and Qu teaches the method according to claim 2 above, including the obtaining of a voice detection result on a basis of the matrix (see the discussion of claims 1-2 above), but does not explicitly teach that the obtaining comprises:
obtaining probability values matches of that each audio frame pertains to a first category and a second category respectively according to an element corresponding to the each audio frame comprised in the matrix; the voice feature corresponding to the audio frame comprised in the first category matches the bottleneck feature corresponding to the reference voice, and the voice feature corresponding to the audio frame comprised in the second category does not match the bottleneck feature corresponding to the reference voice; and determining a voice detection result corresponding to the each audio frame on a basis of a maximum value of the probability values that the each audio frame pertains to a first category and a second category respectively.
Bocklet teaches an analogous field of endeavor by disclosing a voice separation method (title and abstract, ln 1-5, and method steps in figs. 5, 15), and wherein Bocklet further teaches obtaining a voice detection result on a basis of a matrix (via keyphrase detection through a neural network in fig. 2, based on an audio input from a user and a feature extraction module 202 that generates feature vectors 212 for acoustic scoring 203 in fig. 2), the obtaining comprising:
obtaining probability values (probabilities output from acoustic scores 214 for each of feature vectors 212, para 49) matches of that each audio frame (audio frame of a frame sequence, para 55, stored in a buffer having a length of 360 ms, para 43) pertains to a first category (probabilities for a spoken component from a phone, para 50) and a second category (probabilities for silence, background noise, etc., para 50) respectively according to an element corresponding to the each audio frame comprised in the matrix (based on the feature vectors output from feature extraction 202 in fig. 2, para 50); the voice feature corresponding to the audio frame comprised in the first category matches the bottleneck feature corresponding to the reference voice (the predetermined key phrase, as the bottleneck feature of the reference voice, is stored in the key phrase and rejection models 205, para 39, and a match of the input feature vector is performed based on the scores or probabilities output from element 203 in fig. 2, para 39), and the voice feature corresponding to the audio frame comprised in the second category does not match the bottleneck feature corresponding to the reference voice (probabilities of feature vectors of non-speech audio are also evaluated and detected through the element 210 within the key phrase detection decoder 204 in fig. 2, para 54); and determining a voice detection result corresponding to the each audio frame on a basis of a maximum value of the probability values that the each audio frame pertains to a first category and a second category respectively (through maximum pooling on multiple element state score vectors, para 31, pooling the different speech and non-speech rejection categories scored by element 203, and key phrase detection indicated by outputting the key phrase score 215, para 54) for the benefits of improving efficiency (by an accelerated implementation of the neural network to reduce computational loads, para 38) and performance (by implementing pre-processing of microphone signals, para 128) in the voice separation (increasing accurate detection of the voice by phoneme scores rather than word scores, para 28).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied obtaining the voice detection result on the basis of the matrix and the obtaining further comprising obtaining the probability values matches of that each audio frame pertains to the first category and the second category respectively according to the element corresponding to the each audio frame comprised in the matrix; the voice feature corresponding to the audio frame comprised in the first category matches the bottleneck feature corresponding to the reference voice, and the voice feature corresponding to the audio frame comprised in the second category does not match the bottleneck feature corresponding to the reference voice; and determining the voice detection result corresponding to the each audio frame on the basis of the maximum value of the probability values that the each audio frame pertains to the first category and the second category respectively, as taught by Bocklet, to obtaining the voice detection result on the basis of the matrix in the voice separation method, as taught by the combination of Xu and Qu, for the benefits discussed above.
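For illustration only, a minimal NumPy rendering of the step discussed for claim 3 follows (per-frame probability values for a first category that matches the reference voice and a second category that does not, with the detection result taken as the category having the maximum probability value); the matrix values and the softmax normalization are hypothetical stand-ins, not Bocklet's acoustic scoring.

import numpy as np

rng = np.random.default_rng(3)
matrix = rng.normal(size=(8, 2))          # one row of scores per audio frame

# Softmax per row -> probability values for the first and second categories.
exp = np.exp(matrix - matrix.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

# Detection result per frame: the category with the maximum probability value.
detection_result = probs.argmax(axis=1)   # 0 = matches the reference voice, 1 = does not match
print(np.round(probs, 3), detection_result)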
Claim 12 has been analyzed and rejected according to claims 11, 3 above.
Claim 16 has been analyzed and rejected according to claims 15, 3 above.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LESHUI ZHANG whose telephone number is (571)270-5589. The examiner can normally be reached Monday-Friday, 6:30am-4:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vivian Chin can be reached at 571-272-7848. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LESHUI ZHANG/
Primary Examiner,
Art Unit 2695