DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to correspondence filed 03 February 2026 in reference to application 18/410,794. Claims 1-20 are pending and have been examined.
Response to Amendment
The amendment filed 03 February 2026 has been entered and considered in this Office Action. Claims 1, 2, 5, 6, 8, 9, 13-18, and 20 have been amended.
Response to Arguments
Applicant’s arguments, see Remarks pages 9-10, filed 03 February 2026, with respect to the double patenting rejections have been fully considered and are persuasive. The double patenting rejection of the claims has been withdrawn.
Applicant's arguments filed 03 February 2026 with respect to the prior art rejections have been fully considered but they are not persuasive. Applicant argues, see pages 10-13, that Li in view of Zhang and Penniman fails to teach the “processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech; combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs.” The examiner respectfully disagrees. Li and Zhang were not relied upon to teach these limitations; only Penniman was. Applicant argues that Penniman does not specifically teach a weighted combination of ADCs using VAL weights. However, it is noted that the claim only requires “combining, based on weights determined by the VALs.” Thus, using the weights to determine whether to include the channels in the final mix is in fact “combining, based on weights determined by the VALs.” Applicant notes that the specification teaches using the VALs as weights for the signals. Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). For these reasons, Li in view of Zhang and Penniman teaches the limitations of the claims.
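As a purely illustrative sketch of the claim interpretation discussed above (all names are hypothetical and are not drawn from any cited reference or from the claims), binary VAD gating can be expressed as a weighted combination in which the weights determined by the VALs are restricted to zero or one:

```python
import numpy as np

def combine_by_val(channels, vals, threshold=0.5, binary=True):
    """Combine audio channels using weights determined by voice
    activity likelihoods (VALs).

    With binary=True the weights are 0/1 gates (channels below the
    threshold are excluded from the mix); with binary=False the VALs
    themselves weight the mix. Both are combinations "based on
    weights determined by the VALs"."""
    channels = np.asarray(channels, dtype=float)  # shape (n_channels, n_samples)
    vals = np.asarray(vals, dtype=float)          # shape (n_channels,)
    weights = (vals >= threshold).astype(float) if binary else vals
    if weights.sum() == 0:
        # No channel passes: return silence of the right length.
        return np.zeros(channels.shape[1])
    return weights @ channels / weights.sum()
```

Under this sketch, gating is simply the degenerate case where the VAL-determined weights take only the values 0 and 1.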
Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claim(s) 1-3, 6-11, and 13-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US PAP 2022/0199099) in view of Zhang et al. (US PAP 2019/0341057) and further in view of Penniman et al. (US Patent 11,302,347).
Consider claim 1, Li teaches A method comprising:
receiving a first set of audio data channels (ADCs) jointly capturing a speech produced by one or more speakers (figure 1a, 0062, 0055, receiving N microphone observed signals, including speech);
obtaining, using the first set of ADCs, a second set of one or more ADCs, at least one ADC of the second set of one or more ADCs representing a cluster of two or more ADCs of the first set of ADCs, the two or more ADCs of the first set of ADCs being selected based at least on a similarity of audio data of the two or more ADCs of the first set of ADCs (0062, 0066-72, N observed signals converted to M source signals based on a spatial characteristic matrix which represents correlation between the N observed signals; 0066, there may be fewer M signals than N signals); and
processing, using an audio processing model, the final set of ADCs to obtain one or more associations of the speech to the one or more speakers (0063, 0073, performing speaker diarization).
Li does not specifically teach
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech;
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs.
In the same field of speech separation, Penniman teaches
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech (col 5 lines 44-61, VAD processing, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihoods);
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs (col 5 lines 44-61, VAD processing, gating signals where voice is determined not to be present, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihood thresholds).
It would have been obvious to one of ordinary skill in the art at the time of effective filing to use VADs to gate channels as taught by Penniman in the system of Li in order to reduce errant noises being included in the signals that are further processed (Penniman col 5 lines 61-67).
Li and Penniman do not specifically teach processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers.
In the same field of speaker identification, Zhang teaches processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers (0021, using a neural network 34 to identify speakers in audio signals).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use neural network based models as taught by Zhang in the system of Li and Penniman in order to more accurately identify speakers in a voice signal (Zhang 0002-03).
Consider claim 2, Li teaches the method of claim 1, wherein the obtaining the second set of one or more ADCs comprises:
obtaining a similarity matrix, wherein an element (j, k) of the similarity matrix characterizes similarity of the audio data of the j-th ADC of the first set of ADCs and the audio data of the k-th ADC of the first set of ADCs (0083-89, especially 0087-88, where each element of the spatial matrix represents a correlation between each pair of the N different microphone signals, and the diagonal represents the energy of each microphone signal as it corresponds to itself);
identifying, using the similarity matrix, one or more clusters of ADCs of the first set of ADCs (0097, using spatial matrix to cluster the audio signals); and
using the one or more clusters of ADCs to obtain the second set of one or more ADCs (0097-104, 0062, clustering signals to obtain M audio signals corresponding to number of sources).
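The similarity-matrix clustering mapped above can be illustrated with the following sketch (a greatly simplified illustration with hypothetical names; Li's spatial characteristic matrix computation and clustering procedure are not reproduced here):

```python
import numpy as np

def cluster_channels(signals, sim_threshold=0.8):
    """Group audio channels whose pairwise correlation exceeds a
    threshold. The matrix element sim[j, k] characterizes the
    similarity of channels j and k, analogous to a similarity
    (spatial) matrix over N microphone signals. Greedy grouping is
    used here purely for illustration."""
    x = np.asarray(signals, dtype=float)   # shape (n_channels, n_samples)
    sim = np.corrcoef(x)                   # element (j, k): correlation of channels j and k
    n = x.shape[0]
    labels = [-1] * n
    next_label = 0
    for j in range(n):
        if labels[j] != -1:
            continue                       # already assigned to a cluster
        labels[j] = next_label
        for k in range(j + 1, n):
            if labels[k] == -1 and sim[j, k] >= sim_threshold:
                labels[k] = next_label     # similar enough: same cluster
        next_label += 1
    return labels                          # M distinct labels for N channels, M <= N
```

Each resulting cluster would then be reduced to a single channel of the second set, so the M output channels can be fewer than the N input channels.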
Consider claim 3, Li teaches The method of claim 2, wherein using an individual cluster of the one or more clusters of ADCs to obtain a respective ADC of the second set of one or more ADCs comprises:
aggregating the audio data of a plurality of ADCs of the individual cluster to obtain the audio data for the respective ADC of the second set of one or more ADCs (0100-0102, obtaining a source signal for each of the M clusters), wherein aggregating the audio data of the plurality of ADCs comprises at least one of:
combining the audio data of the plurality of ADCs (0102, a cluster center of the audio data is calculated, i.e. a combination of the audio data), or selecting the audio data of a maximum signal-to-noise ADC of the plurality of ADCs.
Consider claim 6, Penniman teaches the method of claim 1, wherein combining the second set of one or more ADCs comprises:
eliminating ADCs of the second set of one or more ADCs having the VAL below a VAL threshold (col 5 lines 44-61, VAD processing, gating signals where voice is determined not to be present, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihood thresholds).
It would have been obvious to one of ordinary skill in the art at the time of effective filing to use VADs to gate channels as taught by Penniman in the system of Li and Zhang in order to reduce errant noises being included in the signals that are further processed (Penniman col 5 lines 61-67).
Consider claim 7, Penniman teaches The method of claim 6, wherein the processing the second set of one or more ADCs using the voice detection model comprises processing embeddings associated with the second set of one or more ADCs using the voice detection model (col 5 lines 55-60, may use deep learning techniques for VAD, which would require embedded feature vectors in order to feed into the decision network).
Consider claim 8, Penniman teaches the method of claim 6, wherein the third set of ADCs consists of a single ADC (col 5 line 57 - col 6 line 14, signals are included in the mixed (i.e. aggregated) signal if the VAD is determined to be above the threshold).
Consider claim 9, Penniman teaches the method of claim 1, wherein the processing the second set of one or more ADCs comprises:
inputting into the voice detection model, a first plurality of embeddings associated with the second set of one or more ADCs (col 5 lines 44-61, VAD processing including deep learning, which would be based on embeddings, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihoods);
eliminating, using the VAL, one or more embeddings from the first plurality of embeddings to obtain a second plurality of embeddings associated with the second set of one or more ADCs (col 5 lines 44-61, VAD processing, gating signals where voice is determined not to be present, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihood thresholds);
wherein combining the second set of one or more ADCs to obtain the third set of one or more ADCs comprises generating, using the second plurality of embeddings, an aggregated embedding (col 5 line 57 - col 6 line 14, signals are included in the mixed (i.e. aggregated) signal if the VAD is determined to be above the threshold); and
wherein processing the third set of one or more ADCs comprises inputting, into the audio processing NN model, the aggregated embedding to obtain the association of the speech to the one or more speakers (in combination with Li and Zhang, Zhang 0021, using a neural network 34 to identify speakers in audio signals, Li, 0063, 0073, performing speaker diarization).
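The embedding elimination and aggregation recited in claim 9 can be sketched as follows (illustrative only; the mean-pooling choice and all names are assumptions, not taken from Penniman, Li, or Zhang):

```python
import numpy as np

def aggregate_embeddings(embeddings, vals, threshold=0.5):
    """Eliminate embeddings whose channel VAL falls below a
    threshold, then pool the surviving embeddings into a single
    aggregated embedding (mean pooling assumed here for
    illustration)."""
    e = np.asarray(embeddings, dtype=float)  # shape (n_channels, embedding_dim)
    v = np.asarray(vals, dtype=float)        # one VAL per channel
    keep = v >= threshold                    # eliminate low-VAL embeddings
    if not keep.any():
        return np.zeros(e.shape[1])          # nothing survives: zero embedding
    return e[keep].mean(axis=0)              # aggregated embedding
```

The aggregated embedding would then be the input to the downstream speaker-association model.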
Consider claim 10, Zhang suggests The method of claim 9, wherein the second plurality of embeddings has a predetermined number of embeddings (0028-33, embeddings may have dimensions based on the embedding network).
Consider claim 11, Li suggests The method of claim 9, wherein the aggregated embedding is generated individually for a given temporal unit of the speech (0088, processing based on frames and windows).
Consider claim 13, Li teaches The method of claim 1, wherein the processing the third set of one or more ADCs to obtain the association of the speech to the one or more speakers comprises: partitioning the speech into one or more intervals, wherein individual intervals of the one or more intervals are mapped to respective speakers who generated speech associated with the individual intervals (0063, 0073, performing speaker diarization, which is labeling segments of speech to corresponding speakers).
Consider claim 14, Li teaches A system (abstract) comprising:
one or more processing units (0160 CPU) to:
receive a first set of audio data channels (ADCs) jointly capturing a speech produced by one or more speakers (figure 1a, 0062, 0055, receiving N microphone observed signals, including speech);
obtain, using the first set of ADCs, a second set of one or more ADCs, at least one ADC of the second set of one or more ADCs representing a cluster of two or more ADCs of the first set of ADCs, the two or more ADCs of the first set of ADCs being selected based at least on a similarity of audio data of the two or more ADCs of the first set of ADCs (0062, 0066-72, N observed signals converted to M source signals based on a spatial characteristic matrix which represents correlation between the N observed signals; 0066, there may be fewer M signals than N signals); and
process, using an audio processing model, the final set of ADCs to obtain one or more associations of the speech to the one or more speakers (0063, 0073, performing speaker diarization).
Li does not specifically teach
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech;
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs.
In the same field of speech separation, Penniman teaches
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech (col 5 lines 44-61, VAD processing, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihoods);
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs (col 5 lines 44-61, VAD processing, gating signals where voice is determined not to be present, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihood thresholds).
It would have been obvious to one of ordinary skill in the art at the time of effective filing to use VADs to gate channels as taught by Penniman in the system of Li in order to reduce errant noises being included in the signals that are further processed (Penniman col 5 lines 61-67).
Li and Penniman do not specifically teach processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers.
In the same field of speaker identification, Zhang teaches processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers (0021, using a neural network 34 to identify speakers in audio signals).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use neural network based models as taught by Zhang in the system of Li and Penniman in order to more accurately identify speakers in a voice signal (Zhang 0002-03).
Claim 15 contains similar limitations as claim 2 and therefore is rejected for the same reasons.
Claim 16 contains similar limitations as claim 3 and therefore is rejected for the same reasons.
Claim 17 contains similar limitations as claim 6 and therefore is rejected for the same reasons.
Claim 18 contains similar limitations as claim 9 and therefore is rejected for the same reasons.
Consider claim 19, Li teaches The system of claim 14, wherein the system is comprised in at least one of: an in-vehicle infotainment system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing one or more deep learning operations; a system implemented using an edge device; a system for generating or presenting at least one of virtual reality content, mixed reality content, or augmented reality content; a system implemented using a robot; a system for performing one or more conversational AI operations; a system implementing one or more large language models (LLMs); a system implementing one or more language models; a system for performing one or more generative AI operations; a system for generating synthetic data; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources (0165, system may be implemented in a data center).
Consider claim 20, Li teaches A processing device (0160 CPU) to:
receive a first set of audio data channels (ADCs) jointly capturing a speech produced by one or more speakers (figure 1a, 0062, 0055, receiving N microphone observed signals, including speech);
obtain, using the first set of ADCs, a second set of one or more ADCs, at least one ADC of the second set of one or more ADCs representing a cluster of two or more ADCs of the first set of ADCs, the two or more ADCs of the first set of ADCs being selected based at least on a similarity of audio data of the two or more ADCs of the first set of ADCs (0062, 0066-72, N observed signals converted to M source signals based on a spatial characteristic matrix which represents correlation between the N observed signals; 0066, there may be fewer M signals than N signals); and
process, using an audio processing model, the final set of ADCs to obtain one or more associations of the speech to the one or more speakers (0063, 0073, performing speaker diarization).
Li does not specifically teach
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech;
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs.
In the same field of speech separation, Penniman teaches
processing the second set of one or more ADCs using a voice detection model to determine voice activity likelihoods (VALs) that individual ADCs of the second set of one or more ADCs comprise the speech (col 5 lines 44-61, VAD processing, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihoods);
combining, based on weights determined by the VALs, the second set of one or more ADCs to obtain a third set of one or more ADCs (col 5 lines 44-61, VAD processing, gating signals where voice is determined not to be present, col 8 line 63- col 9 line 5, VAD may be based on confidence levels, i.e. likelihood thresholds).
It would have been obvious to one of ordinary skill in the art at the time of effective filing to use VADs to gate channels as taught by Penniman in the system of Li in order to reduce errant noises being included in the signals that are further processed (Penniman col 5 lines 61-67).
Li and Penniman do not specifically teach processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers.
In the same field of speaker identification, Zhang teaches processing, using an audio processing neural network (NN) model, the third set of one or more ADCs to obtain one or more associations of the speech to the one or more speakers (0021, using a neural network 34 to identify speakers in audio signals).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use neural network based models as taught by Zhang in the system of Li and Penniman in order to more accurately identify speakers in a voice signal (Zhang 0002-03).
Claim(s) 4, 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Penniman and Zhang as applied to claim 1 above, and further in view of Wang et al. (Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation).
Consider claim 4, Li and Penniman and Zhang teach The method of claim 1, but do not specifically teach wherein the obtaining the second set of one or more ADCs comprises: applying the audio data of the first set of ADCs to a clustering NN model.
In the same field of clustering audio signals, Wang teaches wherein the obtaining the second set of one or more ADCs comprises: applying the audio data of the first set of ADCs to a clustering NN model (sections 2.2 and 2.3, using neural networks for signal clustering).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use neural networks to cluster as taught by Wang in the system of Li and Penniman and Zhang in order to improve signal separation performance (Wang abstract).
Consider claim 5, Wang teaches the method of claim 4, wherein the clustering NN model is trained to improve an audio quality associated with an output of the clustering NN model compared with the audio quality associated with an input into the clustering NN model (introduction, using deep clustering to improve audio quality), wherein the input into the clustering NN model comprises a plurality of input embeddings associated with the first set of ADCs (introduction, section 2, input feature embeddings), and wherein an output of the clustering NN model comprises one or more output embeddings associated with the second set of one or more ADCs (section 3, creating a mask which is applied to the embeddings to generate separated sources).
Allowable Subject Matter
Claim 12 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The following is a statement of reasons for the indication of allowable subject matter:
Consider claim 12, Li, Zhang and Penniman teach the method of claim 9, but do not specifically teach “wherein the eliminating the one or more embeddings from the first plurality of embeddings comprises: determining distances, in an embeddings space, between embeddings of the first plurality of embeddings; and eliminating the one or more embeddings based on the determined distances,” when combined with each and every other limitation of the claim, the base claim, and intervening claims. Therefore, claim 12 contains allowable subject matter.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DOUGLAS C GODBOLD whose telephone number is (571)270-1451. The examiner can normally be reached 6:30am-5pm Monday-Thursday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
DOUGLAS GODBOLD
Examiner
Art Unit 2655
/DOUGLAS GODBOLD/Primary Examiner, Art Unit 2655