Prosecution Insights
Last updated: May 29, 2026
Application No. 18/712,467

VOICE SEPARATION METHOD AND APPARATUS, ELECTRONIC DEVICE AND READABLE STORAGE MEDIUM

Non-Final OA §103
Filed: May 22, 2024
Priority: Nov 22, 2021 (CN 202111386550.8, +1 more)
Examiner: ZHANG, LESHUI
Art Unit: 2695
Tech Center: 2600 (Communications)
Assignee: BEIJING ZITIAO NETWORK TECHNOLOGY CO., LTD.
OA Round: 1 (Non-Final)
Grant Probability: 78% (Favorable)
OA Rounds: 1-2
Est. Remaining: 9m
With Interview: 99%

Examiner Intelligence

Grants 78%, above average
Career Allowance Rate: 78% (728 granted / 937 resolved; +15.7% vs TC avg)
Interview Lift: +35.5% (strong; based on resolved cases with interview)
Avg Prosecution: 2y 9m (typical timeline; 26 currently pending)
Total Applications: 980 (career history, across all art units)

Statute-Specific Performance

§101: 1.0% (-39.0% vs TC avg)
§103: 83.0% (+43.0% vs TC avg)
§102: 5.7% (-34.3% vs TC avg)
§112: 8.8% (-31.2% vs TC avg)
Based on career data from 937 resolved cases

Office Action

§103
DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This Office Action is in response to the preliminary claim amendment filed on May 22, 2024, in which claims 4-5 and 8-9 were amended, claims 6-7 and 10 were canceled, and claims 11-18 were newly added. By virtue of this communication, claims 1-5, 8-9, and 11-18 are currently pending in this Office Action.

In the response to this Office Action, the Examiner respectfully requests that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line numbers in the specification and/or drawing figure(s). This will assist the Examiner in prosecuting this application.

Claim Objections

Claims 2-5 are objected to because of the following informalities: Claim 2 recites “The method according to claim 1”, which should be -- The voice separation method according to claim 1 --. Claim 3 is objected to due to its dependency on claim 1. Claims 3-5 are objected to for at least the same reason as described for claim 2 above, since claims 3-5 recite similar deficient features as recited in claim 2. Appropriate correction is required.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-5, 8-9, 11, 13-15, 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. (CN 113113044 A, a copy of the original recorded in IDS filed on August 12, 2024, an English translation is attached here, hereinafter Xu) and in view of reference Qu et al. (CN 110648656 A, a copy of the original and an English translation version are attached here, hereinafter Qu). Claim 1: Xu teaches a voice separation method (title and abstract, ln 1-6, a method in fig. 3), comprising: obtaining a voice feature corresponding to a voice to be processed (mixed audio including target audio and non-target audio from multiple sources, para 2-3, p.8) and a bottleneck feature corresponding to a reference voice (target object and first voiceprint feature of the target object determined, as the bottleneck feature with respect to the mixed audio, S11, p.6 and para 4-7, p.7, and by a determining module, para 6, p.4); inputting, into a voice separation model (a pre-separation module, para 7, p.4), the voice feature corresponding to the voice to be processed (independent vector analysis for performing pre-separation applied on mixed audio as the voice feature, para 7-10, p.11) and obtaining a voice detection result output by the voice separation model (obtained multiple voice signals via pre-separation processing, S12, p.7, and including pre-separated n-channel voice signal); and determining, on a basis of the voice detection result (via an extraction module and a splicing module, and the target speaker separation module, the last two paragraphs of p.12 and para 1-2 of p.13), a target voice segment matching the reference voice in the voice to be processed (a target speaker’s voice is generated through the post-processing module, para 3, p.13), wherein the voice feature corresponding to the target voice segment matches the bottleneck feature corresponding to the reference voice (matching between the target audio in the mixed audio and target object having the first 
voiceprint is determined, para 1, p.14, para 4, p.10 and para 4-end of the page 10). However, Xu does not explicitly teach inputting, into a voice separation model, also the bottleneck feature corresponding to the reference voice for disclosed obtaining the voice result. Qu teaches an analogous field of endeavor by disclosing a voice separation method (title and abstract, ln 1-15 and a method in figs. 1-5 and executed on an electronic device in fig. 6) and wherein Qu further teaches, inputting, into a voice separation model (pre-trained voice detection model, para 4, p.12), the voice feature corresponding to the voice to be processed (frequency characteristics, energy characteristics, and zero-crossing rate characteristics of one or more to-be-detected sound frame, para 4, p.12) and the bottleneck feature corresponding to the reference voice (frequency characteristics, energy characteristics, and zero-crossing rate of other one or more to-be-detected sound frame, para 4, p.12) and obtaining a voice detection result output by the voice separation model (obtaining the detection results of each to-be-detected sound frame, including voice frames and non-speech frames, para 4, para 12) for benefits of more accurately separating voice frames from others (the last paragraph of p.2, distinguishing sounds in different types by using different characteristics, para 1, p.10, and by using combined MFCC with FBank features, para 5, p.11, and with LSTER and HZCRR in fig. 
3, and by using strong robust model of new technology applications including Deep Neural Network DNN plus LSTM, para 6, p.14).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied wherein inputting, into the voice separation model, the voice feature corresponding to the voice to be processed and the bottleneck feature corresponding to the reference voice and obtaining a voice detection result output by the voice separation model, as taught by Qu, to the voice separation model and wherein inputting, into the voice separation model, the voice feature corresponding to the voice to be processed and obtaining the voice detection result output by the voice separation model, as taught by Xu, for the benefits discussed above.

Claim 8 has been analyzed and rejected according to claim 1 above and the combination of Xu and Qu further teaches, an electronic device (Xu, a terminal device in fig. 5, and Qu, an electronic device in fig. 6) comprising: a memory and a processor (Xu, memory for storing instructions, claim 19, and Qu, memory 804 and processors 820 in fig. 8); the memory is configured to store computer program instructions (Xu, processor, claim 19, and Qu, the memory storing instructions, para 7, p.20); and the processor is configured to execute the computer program instructions, so that the electronic device implements the voice separation method of claim 1 (Xu, execution of the instructions by the processor, claim 19, and Qu, the processors to execute instructions for completing method steps of the voice endpoint detection method, para 6, p.20).

Claim 9 has been analyzed and rejected according to claims 1, 8 above.

Claim 2: the combination of Xu and Qu further teaches, according to claim 1 above, wherein the inputting, into a voice separation model, the voice feature corresponding to the voice to be processed and the bottleneck feature corresponding to the reference voice and obtaining a voice detection result output by the voice separation model (Xu and Qu, as discussed in claim 1 above) comprises: inputting the voice feature corresponding to the voice to be processed into a first neural network comprised in the voice separation model (Xu, independent vector analysis IVA module for pre-separation processing, para 1, p.4, and Qu, including deep neural network model at the first classification layer, para 7, p.13), and obtaining a vector expression corresponding to the voice feature output by the first neural network (stitched feature matrix of each sound frame to be detected); splicing the vector expression corresponding to the voice feature and the bottleneck feature corresponding to the reference voice to obtain a fusion feature (Qu, obtaining a fusion feature of each sound frame, in which the stitched feature matrix of each sound frame to be detected is linearly mapped to obtain the fusion feature of each sound frame to be detected, para 1, p.14); and inputting the fusion feature into a second neural network comprised in the voice separation model (the fusion feature of each to-be-detected sound frame is input into the first classification layer that is TDNN+LSTM model with the 42-dimensional fusion feature, para 5-7, p.14 or combination of the first classification layer and a second classification layer comprising delay neural network, a long-term and short-term memory network, para 1, p.16), obtaining a matrix output by the second neural network (Qu, the matrix output from the first classification layer to the second classification layer, para 1, p.16 and the second classification layer can be delayed neural network TDNN+a long-term and short-term memory network LSTM, para 1, p.17),
and obtaining a voice detection result on a basis of the matrix (Qu, detection result of each to-be-detected sound frame, speech frame or non-speech frame, para 4, p.14 and para 2, para 16, and including speech endpoint detected, para 2, p.17).

Claim 4: the combination of Xu and Qu further teaches, according to claim 1 above, wherein the voice separation model is obtained by training (Qu, training by using the concatenated or spliced or stitched MFCC feature, LSTER feature, and HZCRR feature, para 2, p.16) on a basis of the voice feature corresponding to the sample voice, the bottleneck feature corresponding to the sample voice (Qu, stitched MFCC feature, LSTER feature, and HZCRR feature above, para 2, p.16) and a labeled voice detection result of the sample voice (Qu, the class label of the frame to be trained and compared, para 2, p.16), and the sample voice comprises the reference voice (Qu, including voice frame and non-voice frame discussed above).

Claim 5: the combination of Xu and Qu further teaches, according to claim 1 above, wherein the voice feature comprises one or more of a FBank feature, a Mel frequency spectrum feature or a pitch feature (Markush, see MPEP 2117, Xu, including tone, timbre, intensity, sound wave wavelength, frequency, and rhythm of change, etc., para 4, p.7, and Qu, MFCC or Mel Frequency Cepstrum Coefficient, log spectral characteristics or Fbank feature).

Claim 11 has been analyzed and rejected according to claims 8, 2 above.
Claim 13 has been analyzed and rejected according to claims 8, 4 above.
Claim 14 has been analyzed and rejected according to claims 8, 5 above.
Claim 15 has been analyzed and rejected according to claims 9, 2 above.
Claim 17 has been analyzed and rejected according to claims 9, 4 above.
Claim 18 has been analyzed and rejected according to claims 9, 5 above.

Claims 3, 12, 16 are rejected under 35 U.S.C. 
103 as being unpatentable over Xu (above) and in view of references Qu (above) and Bocklet et al. (US 20190043488 A1, hereinafter Bocklet). Claim 3: the combination of Xu and Qu further teaches, according to claim 2 above, wherein the obtaining a voice detection result on a basis of the matrix (discussion in claims 1-2 above), except the obtaining above comprises: obtaining probability values matches of that each audio frame pertains to a first category and a second category respectively according to an element corresponding to the each audio frame comprised in the matrix; the voice feature corresponding to the audio frame comprised in the first category matches the bottleneck feature corresponding to the reference voice, and the voice feature corresponding to the audio frame comprised in the second category does not match the bottleneck feature corresponding to the reference voice; and determining a voice detection result corresponding to the each audio frame on a basis of a maximum value of the probability values that the each audio frame pertains to a first category and a second category respectively. Bocklet teaches an analogous field of endeavor by disclosing a voice separation method (title and abstract, ln 1-5 and method steps in figs. 5, 15) and wherein Bocklet further teaches obtaining a voice detection result on a basis of a matrix (via keyphrase detection through a neural network in fig. 2 and based on an audio input from a user and an feature extraction module 202 that generated feature vectors 212 for acoustic scoring 203 in fig. 
2) and comprising: obtaining probability values (probabilities outputted from acoustic scores 214 and for each of feature vectors 212, para 49) matches of that each audio frame (audio frame of a frame sequence, para 55, and stored in a buffer having a length of 360ms, para 43) pertains to a first category (probabilities for spoken component from a phone, para 50) and a second category (probabilities for silence, background noise, etc., para 50) respectively according to an element corresponding to the each audio frame comprised in the matrix (based on the vector features outputted from feature extraction 202 in fig. 2, para 50); the voice feature corresponding to the audio frame comprised in the first category matches the bottleneck feature corresponding to the reference voice (predetermined key phrase as the bottleneck feature of reference voice stored in key phrase and rejection models 205, para 39, and performing a match of the input feature vector based on the scores or probabilities outputted from element 203 in fig. 2, para 39), and the voice feature corresponding to the audio frame comprised in the second category does not match the bottleneck feature corresponding to the reference voice (probabilities of feature vector of non-speech audio also evaluated and detected through the element 210 within the key phrase detection decoder 204 in fig. 
2, para 54); and determining a voice detection result corresponding to the each audio frame on a basis of a maximum value of the probability values that the each audio frame pertains to a first category and a second category respectively (through a maximum pooling on multiple element state score vector, para 31, and pooling different speech and non-speech rejection categories scored by the element 203, and key phrase detection , and indicated by outputting key phrase score 215, para 54) for benefits of improving an efficiency (by accelerated implementation of neural network to reducing computational loads, para 38) and a performance (by implementing pre-processing of microphone signals, para 128) in the voice separation (increasing accurate detection of the voice by phoneme scores, other than word scores, para 28). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied obtaining the voice detection result on the basis of the matrix and the obtaining further comprising obtaining the probability values matches of that each audio frame pertains to the first category and the second category respectively according to the element corresponding to the each audio frame comprised in the matrix; the voice feature corresponding to the audio frame comprised in the first category matches the bottleneck feature corresponding to the reference voice, and the voice feature corresponding to the audio frame comprised in the second category does not match the bottleneck feature corresponding to the reference voice; and determining the voice detection result corresponding to the each audio frame on the basis of the maximum value of the probability values that the each audio frame pertains to the first category and the second category respectively, as taught by Bocklet, to obtaining the voice detection result on the basis of the matrix in the voice separation method, as taught by the combination 
of Xu and Qu, for the benefits discussed above.

Claim 12 has been analyzed and rejected according to claims 11, 3 above.
Claim 16 has been analyzed and rejected according to claims 15, 3 above.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to LESHUI ZHANG whose telephone number is (571)270-5589. The examiner can normally be reached Monday-Friday 6:30am-4:00pm EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vivian Chin, can be reached at 571-272-7848. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/LESHUI ZHANG/ Primary Examiner, Art Unit 2695
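To make the claim mapping easier to follow, the steps recited in claims 1-3 as characterized in the Office Action (first network, splicing with the reference bottleneck feature, second network, per-frame two-category probabilities, maximum selection, target segments) can be sketched roughly as follows. This is only an illustrative toy under stated assumptions: the function names, the layer shapes, the random weights, and the use of a single linear layer per "network" are all hypothetical and are not taken from the application, Xu, Qu, or Bocklet.

```python
import math
import random

def linear(x, w, b):
    """Apply y = W x + b to one vector (plain lists, no external deps)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]

def rand_matrix(rows, cols, rng):
    """Hypothetical untrained weights, used only so the sketch runs."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def detect_frames(frame_feats, bottleneck, params):
    """Return (probs, labels); label 1 = frame matches the reference, 0 = it does not."""
    w1, b1, w2, b2 = params
    probs, labels = [], []
    for feat in frame_feats:
        embed = relu(linear(feat, w1, b1))  # "first neural network" -> vector expression
        fused = embed + bottleneck           # splice (concatenate) -> fusion feature
        row = linear(fused, w2, b2)          # "second neural network" -> one row of the matrix
        p = softmax(row)                     # p[0]: first category (match), p[1]: second category
        probs.append(p)
        labels.append(1 if p[0] >= p[1] else 0)  # pick the category with the maximum probability
    return probs, labels

def target_segments(labels):
    """Group consecutive matching frames into (start, end_exclusive) index pairs."""
    segs, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i
        elif lab != 1 and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(labels)))
    return segs

# Toy dimensions (assumptions): 4-dim frame feature, 3-dim embedding, 2-dim bottleneck.
rng = random.Random(0)
frames = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
bottleneck = [rng.uniform(-1, 1) for _ in range(2)]
params = (rand_matrix(3, 4, rng), [0.0] * 3, rand_matrix(2, 5, rng), [0.0] * 2)

probs, labels = detect_frames(frames, bottleneck, params)
print(labels, target_segments(labels))
```

With untrained weights the labels are meaningless; the sketch only shows where the reference bottleneck feature enters the model, which is the point on which the Office Action relies on Qu rather than Xu.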

Prosecution Timeline

May 22, 2024: Application Filed
Dec 31, 2025: Non-Final Rejection mailed (§103)
Mar 30, 2026: Response Filed
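As a quick aid for reading the timeline above, the elapsed time between events can be computed directly from the three listed dates. The snippet below is a minimal sketch using only those dates; it does not model any statutory response period, which the page does not state.

```python
from datetime import date

# Dates copied from the Prosecution Timeline above.
events = [
    ("Application Filed", date(2024, 5, 22)),
    ("Non-Final Rejection mailed", date(2025, 12, 31)),
    ("Response Filed", date(2026, 3, 30)),
]

# Days elapsed between each event and the one before it.
gaps = [(later[0], (later[1] - earlier[1]).days)
        for earlier, later in zip(events, events[1:])]

for label, days in gaps:
    print(f"{label}: {days} days after the previous event")
```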

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12640156
AUDIO ENCODING METHOD AND APPARATUS, AND AUDIO DECODING METHOD AND APPARATUS
2y 6m to grant Granted May 26, 2026
Patent 12633301
METHOD AND SYSTEM FOR PERFORMING DATA AUGMENTATION BASED ON MODIFIED SURROGATES, AND, NON-TRANSITORY COMPUTER READABLE MEDIUM
3y 5m to grant Granted May 19, 2026
Patent 12620401
ACOUSTIC PATTERN DETERMINATION
2y 9m to grant Granted May 05, 2026
Patent 12621620
SOUND SIGNAL DOWNMIX METHOD, SOUND SIGNAL CODING METHOD, SOUND SIGNAL DOWNMIX APPARATUS, SOUND SIGNAL CODING APPARATUS, PROGRAM
2y 2m to grant Granted May 05, 2026
Patent 12614555
Method and System for Producing an Augmented Ambisonic Format
2y 7m to grant Granted Apr 28, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 78%
With Interview: 99% (+35.5%)
Median Time to Grant: 2y 9m (~9m remaining)
PTA Risk: Low
Based on 937 resolved cases by this examiner. Grant probability derived from career allowance rate.
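The 78% figure can be sanity-checked against the examiner counts reported earlier on the page (728 granted out of 937 resolved). The sketch below assumes the career allowance rate is simply granted divided by resolved; the page does not spell out its formula, and the 99% "With Interview" figure is not reproduced here because its derivation is not given.

```python
# Counts copied from the Examiner Intelligence section of this page.
granted, resolved = 728, 937

# Assumed definition: allowance rate = granted / resolved.
allowance_rate = granted / resolved
print(f"Career allowance rate: {allowance_rate:.1%}")
```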
