DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to correspondence filed 03 February 2026 in reference to application 18/410,794. Claims 1-20 are pending and have been examined.
Response to Amendment
The amendment filed 03 February 2026 has been entered and considered in this Office Action. Claims 1, 2, 5, 6, 8, 9, 13-18, and 20 have been amended.
Response to Arguments
Applicant’s arguments, see Remarks pages 9-10, filed 03 February 2026, with respect to the double patenting rejections have been fully considered and are persuasive. The double patenting rejection of the claims has been withdrawn.
Applicant's arguments filed 03 February 2026 with respect to the prior art rejections have been fully considered but they are not persuasive. Applicant argues, see pages 10-13, that Li in view of Zhang and Penniman fails to teach the “processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech; combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs.” The examiner respectfully disagrees. Li and Zhang were not relied upon to teach these limitations; only Penniman was. Applicant argues that Penniman does not specifically teach a weighted combination of ADCs using VAL weights. However, it is noted that the claim only requires “combining, based on weights determined by the VALs.” Thus, using the weights to determine whether to include the channels in the final mix is in fact “combining, based on weights determined by the VALs.” Applicant notes that the specification teaches using the VALs as weights for the signals. Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). For these reasons, Li in view of Zhang and Penniman teaches the limitations of the claims.
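As a purely illustrative sketch of the claim interpretation discussed above (all names are hypothetical and are not drawn from any cited reference or from the claims), binary VAD gating can be expressed as a weighted combination in which the weights determined by the VALs are restricted to zero or one:

```python
import numpy as np

def combine_by_val(channels, vals, threshold=0.5, binary=True):
    """Combine audio channels using weights determined by voice
    activity likelihoods (VALs).

    With binary=True the weights are 0/1 gates (channels below the
    threshold are excluded from the mix); with binary=False the VALs
    themselves weight the mix. Both are combinations "based on
    weights determined by the VALs"."""
    channels = np.asarray(channels, dtype=float)  # shape (n_channels, n_samples)
    vals = np.asarray(vals, dtype=float)          # shape (n_channels,)
    weights = (vals >= threshold).astype(float) if binary else vals
    if weights.sum() == 0:
        # No channel passes: return silence of the right length.
        return np.zeros(channels.shape[1])
    return weights @ channels / weights.sum()
```

Under this sketch, gating is simply the degenerate case where the VAL-determined weights take only the values 0 and 1.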
Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claim(s) 1-3, 6-11, and 13-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US PAP 2022/0199099) in view of Zhang et al. (US PAP 2019/0341057) and further in view of Penniman et al. (US Patent 11,302,347).
Consider claim 1, Li teaches A method comprising:
receiving a first set of audio data channels (ADCs) jointly capturing a speech produced by one or more speakers (figure 1a, 0062, 0055, receiving N microphone observed signals, including speech);
obtaining, using the first set of ADCs, a second set of one or more ADCs, at least one ADC of the second set of one or more ADCs representing a cluster of two or more ADCs of the first set of ADCs, the two or more ADCs of the first set of ADCs being selected based at least on a similarity of audio data of the two or more ADCs of the first set of ADCs (0062, 0066-72, N observed signals converted to M source signals based on a spatial characteristic matrix which represents correlation between the N observed signals; 0066, there may be fewer M signals than N signals); and
processing, using an audio processing model, the final set of ADCs to obtain one or more associations of the speech to the one or more speakers (0063, 0073, performing speaker diarization).
Li does not specifically teach
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech;
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs.
In the same field of speech separation, Penniman teaches
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech (col 5 lines 44-61, VAD processing, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihoods);
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs (col 5 lines 44-61, VAD processing, gating signals where voice is determined not to be present, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihood thresholds).
It would have been obvious to one of ordinary skill in the art at the time of effective filing to use VADs to gate channels as taught by Penniman in the system of Li in order to reduce errant noises being included in the signals that are further processed (Penniman col 5 lines 61-67).
Li and Penniman do not specifically teach processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers.
In the same field of speaker identification, Zhang teaches processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers (0021, using a neural network 34 to identify speakers in audio signals).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use neural network based models as taught by Zhang in the system of Li and Penniman in order to more accurately identify speakers in a voice signal (Zhang 0002-03).
Consider claim 2, Li teaches the method of claim 1, wherein the obtaining the second set of one or more ADCs comprises:
obtaining a similarity matrix, wherein an element (j, k) of the similarity matrix characterizes similarity of the audio data of the j-th ADC of the first set of ADCs and the audio data of the k-th ADC of the first set of ADCs (0083-89, especially 0087-88, where each element of the spatial matrix represents a correlation between each pair of the N different microphone signals, and the diagonal represents the energy of each microphone signal as it corresponds to itself);
identifying, using the similarity matrix, one or more clusters of ADCs of the first set of ADCs (0097, using spatial matrix to cluster the audio signals); and
using the one or more clusters of ADCs to obtain the second set of one or more ADCs (0097-104, 0062, clustering signals to obtain M audio signals corresponding to number of sources).
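The similarity-matrix clustering mapped above can be illustrated with the following sketch (a greatly simplified illustration with hypothetical names; Li's spatial characteristic matrix computation and clustering procedure are not reproduced here):

```python
import numpy as np

def cluster_channels(signals, sim_threshold=0.8):
    """Group audio channels whose pairwise correlation exceeds a
    threshold. The matrix element sim[j, k] characterizes the
    similarity of channels j and k, analogous to a similarity
    (spatial) matrix over N microphone signals. Greedy grouping is
    used here purely for illustration."""
    x = np.asarray(signals, dtype=float)   # shape (n_channels, n_samples)
    sim = np.corrcoef(x)                   # element (j, k): correlation of channels j and k
    n = x.shape[0]
    labels = [-1] * n
    next_label = 0
    for j in range(n):
        if labels[j] != -1:
            continue                       # already assigned to a cluster
        labels[j] = next_label
        for k in range(j + 1, n):
            if labels[k] == -1 and sim[j, k] >= sim_threshold:
                labels[k] = next_label     # similar enough: same cluster
        next_label += 1
    return labels                          # M distinct labels for N channels, M <= N
```

Each resulting cluster would then be reduced to a single channel of the second set, so the M output channels can be fewer than the N input channels.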
Consider claim 3, Li teaches The method of claim 2, wherein using an individual cluster of the one or more clusters of ADCs to obtain a respective ADC of the second set of one or more ADCs comprises:
aggregating the audio data of a plurality of ADCs of the individual cluster to obtain the audio data for the respective ADC of the second set of one or more ADCs (0100-0102, obtaining a source signal for each of the M clusters), wherein aggregating the audio data of the plurality of ADCs comprises at least one of:
combining the audio data of the plurality of ADCs (0102, a cluster center of the audio data is calculated, i.e. a combination of the audio data), or selecting the audio data of a maximum signal-to-noise ADC of the plurality of ADCs.
Consider claim 6, Penniman teaches the method of claim 1, wherein combining the second set of one or more ADCs comprises:
eliminating ADCs of the second set of one or more ADCs having the VAL below a VAL threshold (col 5 lines 44-61, VAD processing, gating signals where voice is determined not to be present, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihood thresholds).
It would have been obvious to one of ordinary skill in the art at the time of effective filing to use VADs to gate channels as taught by Penniman in the system of Li and Zhang in order to reduce errant noises being included in the signals that are further processed (Penniman col 5 lines 61-67).
Consider claim 7, Penniman teaches The method of claim 6, wherein the processing the second set of one or more ADCs using the voice detection model comprises processing embeddings associated with the second set of one or more ADCs using the voice detection model (col 5 lines 55-60, may use deep learning techniques for VAD, which would require embedded feature vectors in order to feed into the decision network).
Consider claim 8, Penniman teaches the method of claim 6, wherein the third set of ADCs consists of a single ADC (col 5 line 57 - col 6 line 14, signals are included in the mixed (i.e. aggregated) signal if the VAD is determined to be above the threshold).
Consider claim 9, Penniman teaches the method of claim 1, wherein the processing the second set of one or more ADCs comprises:
inputting into the voice detection model, a first plurality of embeddings associated with the second set of one or more ADCs (col 5 lines 44-61, VAD processing including deep learning, which would be based on embeddings, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihoods);
eliminating, using the VAL, one or more embeddings from the first plurality of embeddings to obtain a second plurality of embeddings associated with the second set of one or more ADCs (col 5 lines 44-61, VAD processing, gating signals where voice is determined not to be present, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihood thresholds);
wherein combining the second set of one or more ADCs to obtain the third set of one or more ADCs comprises generating, using the second plurality of embeddings, an aggregated embedding (col 5 line 57 - col 6 line 14, signals are included in the mixed (i.e. aggregated) signal if the VAD is determined to be above the threshold); and
wherein processing the third set of one or more ADCs comprises inputting, into the audio processing NN model, the aggregated embedding to obtain the association of the speech to the one or more speakers (in combination with Li and Zhang, Zhang 0021, using a neural network 34 to identify speakers in audio signals, Li, 0063, 0073, performing speaker diarization).
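The embedding elimination and aggregation recited in claim 9 can be sketched as follows (illustrative only; the mean-pooling choice and all names are assumptions, not taken from Penniman, Li, or Zhang):

```python
import numpy as np

def aggregate_embeddings(embeddings, vals, threshold=0.5):
    """Eliminate embeddings whose channel VAL falls below a
    threshold, then pool the surviving embeddings into a single
    aggregated embedding (mean pooling assumed here for
    illustration)."""
    e = np.asarray(embeddings, dtype=float)  # shape (n_channels, embedding_dim)
    v = np.asarray(vals, dtype=float)        # one VAL per channel
    keep = v >= threshold                    # eliminate low-VAL embeddings
    if not keep.any():
        return np.zeros(e.shape[1])          # nothing survives: zero embedding
    return e[keep].mean(axis=0)              # aggregated embedding
```

The aggregated embedding would then be the input to the downstream speaker-association model.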
Consider claim 10, Zhang suggests The method of claim 9, wherein the second plurality of embeddings has a predetermined number of embeddings (0028-33, embeddings may have dimensions based on the embedding network).
Consider claim 11, Li suggests The method of claim 9, wherein the aggregated embedding is generated individually for a given temporal unit of the speech (0088, processing based on frames and windows).
Consider claim 13, Li teaches The method of claim 1, wherein the processing the third set of one or more ADCs to obtain the association of the speech to the one or more speakers comprises: partitioning the speech into one or more intervals, wherein individual intervals of the one or more intervals are mapped to respective speakers who generated speech associated with the individual intervals (0063, 0073, performing speaker diarization, which is labeling segments of speech to corresponding speakers).
Consider claim 14, Li teaches A system (abstract) comprising:
one or more processing units (0160 CPU) to:
receive a first set of audio data channels (ADCs) jointly capturing a speech produced by one or more speakers (figure 1a, 0062, 0055, receiving N microphone observed signals, including speech);
obtain, using the first set of ADCs, a second set of one or more ADCs, at least one ADC of the second set of one or more ADCs representing a cluster of two or more ADCs of the first set of ADCs, the two or more ADCs of the first set of ADCs being selected based at least on a similarity of audio data of the two or more ADCs of the first set of ADCs (0062, 0066-72, N observed signals converted to M source signals based on a spatial characteristic matrix which represents correlation between the N observed signals; 0066, there may be fewer M signals than N signals); and
process, using an audio processing model, the final set of ADCs to obtain one or more associations of the speech to the one or more speakers (0063, 0073, performing speaker diarization).
Li does not specifically teach
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech;
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs.
In the same field of speech separation, Penniman teaches
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech (col 5 lines 44-61, VAD processing, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihoods);
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs (col 5 lines 44-61, VAD processing, gating signals where voice is determined not to be present, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihood thresholds).
It would have been obvious to one of ordinary skill in the art at the time of effective filing to use VADs to gate channels as taught by Penniman in the system of Li in order to reduce errant noises being included in the signals that are further processed (Penniman col 5 lines 61-67).
Li and Penniman do not specifically teach processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers.
In the same field of speaker identification, Zhang teaches processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers (0021, using a neural network 34 to identify speakers in audio signals).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use neural network based models as taught by Zhang in the system of Li and Penniman in order to more accurately identify speakers in a voice signal (Zhang 0002-03).
Claim 15 contains similar limitations as claim 2 and therefore is rejected for the same reasons.
Claim 16 contains similar limitations as claim 3 and therefore is rejected for the same reasons.
Claim 17 contains similar limitations as claim 6 and therefore is rejected for the same reasons.
Claim 18 contains similar limitations as claim 9 and therefore is rejected for the same reasons.
Consider claim 19, Li teaches The system of claim 14, wherein the system is comprised in at least one of: an in-vehicle infotainment system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing one or more deep learning operations; a system implemented using an edge device; a system for generating or presenting at least one of virtual reality content, mixed reality content, or augmented reality content; a system implemented using a robot; a system for performing one or more conversational AI operations; a system implementing one or more large language models (LLMs); a system implementing one or more language models; a system for performing one or more generative AI operations; a system for generating synthetic data; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources (0165, system may be implemented in a data center).
Consider claim 20, Li teaches A processing device (0160 CPU) to:
receive a first set of audio data channels (ADCs) jointly capturing a speech produced by one or more speakers (figure 1a, 0062, 0055, receiving N microphone observed signals, including speech);
obtain, using the first set of ADCs, a second set of one or more ADCs, at least one ADC of the second set of one or more ADCs representing a cluster of two or more ADCs of the first set of ADCs, the two or more ADCs of the first set of ADCs being selected based at least on a similarity of audio data of the two or more ADCs of the first set of ADCs (0062, 0066-72, N observed signals converted to M source signals based on a spatial characteristic matrix which represents correlation between the N observed signals; 0066, there may be fewer M signals than N signals); and
process, using an audio processing model, the final set of ADCs to obtain one or more associations of the speech to the one or more speakers (0063, 0073, performing speaker diarization).
Li does not specifically teach
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech;
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs.
In the same field of speech separation, Penniman teaches
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech (col 5 lines 44-61, VAD processing, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihoods);
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs (col 5 lines 44-61, VAD processing, gating signals where voice is determined not to be present, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihood thresholds).
It would have been obvious to one of ordinary skill in the art at the time of effective filing to use VADs to gate channels as taught by Penniman in the system of Li in order to reduce errant noises being included in the signals that are further processed (Penniman col 5 lines 61-67).
Li and Penniman do not specifically teach processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers.
In the same field of speaker identification, Zhang teaches processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers (0021, using a neural network 34 to identify speakers in audio signals).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use neural network based models as taught by Zhang in the system of Li and Penniman in order to more accurately identify speakers in a voice signal (Zhang 0002-03).
Claim(s) 4, 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Penniman and Zhang as applied to claim 1 above, and further in view of Wang et al. (Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation).
Consider claim 4, Li and Penniman and Zhang teach The method of claim 1, but do not specifically teach wherein the obtaining the second set of one or more ADCs comprises: applying the audio data of the first set of ADCs to a clustering NN model.
In the same field of clustering audio signals, Wang teaches wherein the obtaining the second set of one or more ADCs comprises: applying the audio data of the first set of ADCs to a clustering NN model (sections 2.2 and 2.3, using neural networks for signal clustering).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use neural networks to cluster as taught by Wang in the system of Li and Penniman and Zhang in order to improve signal separation performance (Wang abstract).
Consider claim 5, Wang teaches the method of claim 4, wherein the clustering NN model is trained to improve an audio quality associated with an output of the clustering NN model compared with the audio quality associated with an input into the clustering NN model (introduction, using deep clustering to improve audio quality), wherein the input into the clustering NN model comprises a plurality of input embeddings associated with the first set of ADCs (introduction, section 2, input feature embeddings), and wherein an output of the clustering NN model comprises one or more output embeddings associated with the second set of one or more ADCs (section 3, creating a mask which is applied to the embeddings to generate separated sources).
Allowable Subject Matter
Claim 12 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The following is a statement of reasons for the indication of allowable subject matter:
Consider claim 12, Li, Zhang and Penniman teach the method of claim 9, but do not specifically teach “wherein the eliminating the one or more embeddings from the first plurality of embeddings comprises: determining distances, in an embeddings space, between embeddings of the first plurality of embeddings; and eliminating the one or more embeddings based on the determined distances,” when combined with each and every other limitation of the claim, the base claim, and intervening claims. Therefore, claim 12 contains allowable subject matter.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DOUGLAS C GODBOLD whose telephone number is (571)270-1451. The examiner can normally be reached 6:30am-5pm Monday-Thursday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
DOUGLAS GODBOLD
Examiner
Art Unit 2655
/DOUGLAS GODBOLD/Primary Examiner, Art Unit 2655