Prosecution Insights
Last updated: April 19, 2026
Application No. 18/224,659

System and Method for Audio Processing using Time-Invariant Speaker Embeddings

Final Rejection — §102, §103
Filed: Jul 21, 2023
Examiner: OPSASNICK, MICHAEL N
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Mitsubishi Electric Research Laboratories Inc.
OA Round: 2 (Final)
Grant Probability: 82% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 3m
With Interview: 92%

Examiner Intelligence

Career Allow Rate: 82% (above average); 737 granted / 900 resolved; +19.9% vs TC avg
Interview Lift: +10.5% (moderate), comparing resolved cases with an interview against those without
Typical Timeline: 3y 3m average prosecution; 46 currently pending
Career History: 946 total applications across all art units

Statute-Specific Performance

§101: 17.7% (-22.3% vs TC avg)
§102: 29.9% (-10.1% vs TC avg)
§103: 33.0% (-7.0% vs TC avg)
§112: 6.3% (-33.7% vs TC avg)
Comparisons are against the Tech Center average estimate. Based on career data from 900 resolved cases.

Office Action

§102 §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-7, 9-17, 19, and 20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Wang et al. (20200202869).

As per claim 1, Wang et al. (20200202869) teaches a method for processing an audio mixture formed by one or a combination of concurrent and sequential utterances of multiple speakers (as processing mixed audio data which includes at least 2 speakers and background noise – Fig. 5), wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor, carry out steps of the method, comprising: receiving the audio mixture and identification information in a form of a time-invariant speaker embedding for each of the multiple speakers (as receiving audio – Fig. 3, subblock 302, with ground-truth audio data used to create the embedding – para 0038; see also Fig. 2); processing the audio mixture with a deep neural network including: a speaker-independent layer, applied to the audio mixture of multiple speakers and producing a speaker-independent output common to all of the multiple speakers (as using a CNN processing the time-frequency spectrogram, which includes all possible speakers – Figure 3, subblock 306 in the voice filter model, the first step being the CNN (314)); and a speaker-biased layer, applied to the speaker-independent output once independently for each of the multiple speakers to produce speaker-biased outputs, the speaker-biased outputs comprising a speaker-biased output for each of the multiple speakers, each application of the speaker-biased layer being individually assigned to a corresponding speaker by inputting the corresponding time-invariant speaker embedding (as the speaker embedding is input into the RNN – see Fig. 3, subblock 318 into 316, and para 0042 – so that the output of the RNN is speaker-affected by the speaker embeddings, which contain information trained for a particular speaker – para 0009, using ground-truth information – para 0039; and not for a single speaker but for multiple speakers – para 0049); extracting data indicative of time-frequency activity regions of each speaker of the multiple speakers in the audio mixture from a combination of speaker-biased outputs; and rendering the extracted data (as combining the predicted mask with the audio spectrogram – see Figure 3, subblock 306 to 308, including the mask (322), wherein the mask emphasizes a particular speaker – para 0048).

As per claim 2, Wang et al. (20200202869) teaches the method of claim 1, wherein the time-invariant speaker embedding remains constant for the entire execution of the deep neural network (as the voice filter model takes the audio spectrogram input – see Fig. 3, subblock 312 – and performs the predicted mask output without changing the speaker embedding (318); examiner notes the speaker embedding is output only, i.e., there is no feedback/updating from the neural networks).

As per claim 3, Wang et al. (20200202869) teaches the method of claim 1 further comprising: partitioning the audio mixture into a sequence of audio segments; and executing the deep neural network for the sequence of audio segments, wherein the time-invariant speaker embedding is shared between processing of different audio segments of the sequence of audio segments (as the CNN-RNN-fully connected network operates on the audio segments of the audio mixture – Figure 3 – and the speaker embedding is integrated based on the speaker identified, in some instances – para 0042, see commentary around "given speaker").

As per claim 4, Wang et al. (20200202869) teaches the method of claim 1, wherein rendering the extracted data comprises outputting a time-frequency mask comprising: an estimate of the time-frequency activity regions of each speaker of the multiple speakers, subjected to a non-linearity function (as filtering the regions, including multiple speakers or non-speakers, for the one particular speaker – para 0042, or noise – see para 0054 and Figure 5, mixed audio data).

As per claim 5, Wang et al. (20200202869) teaches the method of claim 4, further comprising: combining the outputted time-frequency mask with the audio mixture; and generating an output for a single speaker from the multiple speakers based on the combination (as combining the predicted mask with the audio spectrogram – see Figure 3, subblock 306 to 308, including the mask (322), wherein the mask emphasizes a particular speaker – para 0048).

As per claim 6, Wang et al. (20200202869) teaches the method of claim 5, wherein the output for the single speaker comprises a text output indicative of speech transcription data of the single speaker (as providing text output after the refinement of the speaker audio – para 0007; see voice-to-text).

As per claim 7, Wang et al. (20200202869) teaches the method of claim 1, wherein the deep neural network is trained with a weakly supervised training process comprising training the deep neural network based on training data comprising time annotation data associated with the audio mixture (as using time-based audio, from the ground-truth audio, for a particular speaker – para 0038).
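For orientation, the architecture described in this rejection can be sketched in a few lines. The following is a hypothetical PyTorch illustration, not code from the application or from Wang; the layer types (LSTMs), the dimensions, and the sigmoid non-linearity on the mask are assumptions chosen only to make the per-speaker conditioning concrete:

```python
# Illustrative sketch only (not from the application or the Wang reference).
# Layer choices, shapes, and hyperparameters are assumptions for readability.
import torch
import torch.nn as nn

class SpeakerConditionedSeparator(nn.Module):
    def __init__(self, freq_bins=257, embed_dim=256, hidden=512):
        super().__init__()
        # Speaker-independent layer: runs once on the mixture spectrogram.
        self.speaker_independent = nn.LSTM(freq_bins, hidden, batch_first=True)
        # Speaker-biased layer: reused once per speaker, conditioned on that
        # speaker's time-invariant embedding concatenated at every frame.
        self.speaker_biased = nn.LSTM(hidden + embed_dim, hidden, batch_first=True)
        # Per-speaker time-frequency mask head, squashed by a sigmoid.
        self.mask_head = nn.Linear(hidden, freq_bins)

    def forward(self, mixture_spec, speaker_embeddings):
        # mixture_spec: (batch, frames, freq_bins) magnitude spectrogram
        # speaker_embeddings: (batch, n_speakers, embed_dim), constant over time
        shared, _ = self.speaker_independent(mixture_spec)  # common to all speakers
        masks = []
        for s in range(speaker_embeddings.shape[1]):
            emb = speaker_embeddings[:, s, :]               # time-invariant embedding
            emb = emb.unsqueeze(1).expand(-1, shared.shape[1], -1)
            biased, _ = self.speaker_biased(torch.cat([shared, emb], dim=-1))
            masks.append(torch.sigmoid(self.mask_head(biased)))
        masks = torch.stack(masks, dim=1)                   # (batch, n_speakers, frames, freq)
        # Applying each mask to the mixture yields one estimate per speaker.
        separated = masks * mixture_spec.unsqueeze(1)
        return masks, separated
```

In this sketch the speaker embeddings are plain inputs that are never updated during the forward pass (the time-invariance discussed for claim 2), the single shared pass over the mixture plays the role of the speaker-independent layer, and each reuse of the conditioned layer produces one speaker-biased output and one mask, which is then combined with the mixture as in claims 4 and 5.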
As per claim 9, Wang et al. (20200202869) teaches the method of claim 1, wherein the time-invariant speaker embedding comprises a speaker embedding vector obtained on the basis of audio segments of speech forming the audio mixture, when only a single speaker is active (as a speaker embedding vector – para 0008 – and using a VAD to determine and mark the instances of active speech by the speaker – para 0008).

As per claim 10, Wang et al. (20200202869) teaches the method of claim 1, wherein the deep neural network comprises a combined estimation layer for extracting the data indicative of time-frequency activity regions of each speaker of the multiple speakers (as multilayered variations of a neural network – para 0037 – and using the layers operating on spectrograms (time-frequency) for activity of a particular speaker – para 0042 – masking the spectrogram so as to separate a particular speaker from the other voices).

Claims 11-17, 19, and 20 are sound-processing system claims that perform the steps found in method claims 1-7, 9, and 10 above and, as such, are similar in scope and content to claims 1-7, 9, and 10; therefore, claims 11-17, 19, and 20 are rejected under similar rationale as presented against claims 1-7, 9, and 10 above. Furthermore, Wang et al. (20200202869) teaches a processor executing the stored steps in memory – para 0083.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (20200202869) in view of Park et al. (20230352041).

As per claims 8 and 18, Wang et al. (20200202869) teaches the method of claim 7, wherein the deep neural network is trained on the time annotation data comprising ground-truth data including: data for diarization information and data for ground-truth separated sources (as using ground-truth information from the ground-truth database, for speaker embedding – para 0039), such that: Wang et al. (20200202869) teaches one of the loss functions, but does not teach a separate diarization loss; Park et al. (20230352041) teaches a dual loss function including a diarization loss as well as a regular loss (para 0055, 0056) associated with a differential between the estimated and actual samples (see para 0036 and para 0041).
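For illustration, the dual objective attributed to Park here can be written as a signal-estimate loss plus a diarization loss. This is a hypothetical sketch, not Park's actual formulation; the specific loss functions, the weighting factor alpha, and the assumption that activity estimates are already probabilities are illustrative choices:

```python
# Hypothetical sketch of a dual loss (separation + diarization); not Park's actual method.
import torch.nn.functional as F

def dual_loss(est_sources, ref_sources, est_activity, ref_activity, alpha=0.5):
    # Signal-estimate ("regular") loss: differential between estimated and actual samples.
    separation_loss = F.mse_loss(est_sources, ref_sources)
    # Diarization loss: per-speaker, per-frame activity vs. ground-truth labels.
    # est_activity is assumed to already be probabilities in [0, 1].
    diarization_loss = F.binary_cross_entropy(est_activity, ref_activity)
    # Combined objective; copies of such a sum could also be attached to
    # intermediate encoding blocks as auxiliary losses.
    return separation_loss + alpha * diarization_loss
```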
Therefore, it would have been obvious to one of ordinary skill in the art of speaker separation using neural networks to expand upon the loss function of Wang et al. (20200202869) with a dual loss function covering both diarization and signal-estimate errors, as taught by Park et al. (20230352041), because it allows for adding auxiliary losses to each individual encoding block (para 0073), to provide a more accurate model of the system's efficiencies/inefficiencies (see para 0081, where, by using the error function, it was discovered that a residual connection reduced the error).

Response to Arguments

Applicant's arguments filed 09/10/2025 have been fully considered; the changes to the title and corrected drawings are accepted; the arguments toward multiple speakers have been considered but are not persuasive. As per applicant's arguments toward multiple speaker-biased outputs, examiner notes that the system in Wang is not restricted to a single voice filter, but is trained for multiple unique human speakers – see, e.g., para 0049. In other words, as the system is trained to separate one speaker from multiple speakers, the other speakers are not precluded/prevented from being the next speaker to be separated. Examiner also notes applicant's arguments are toward the neural network layer structure; examiner proposes that, to further differentiate from the Wang et al. reference, further claim amendments toward the data calculation/vectors themselves, to include parallel processing of multiple speakers simultaneously, would advance prosecution. Examiner further notes, from the updated search, the Sharifi et al. (11557278) reference, teaching the use of enrolled speakers and the development of a speaker vector set for multiple speakers: "Each enrolled speaker vector is associated with a different respective enrolled user of the assistant-enabled device. When the first speaker-discriminative vector matches one of the enrolled speaker vectors, the method includes identifying the user that spoke the utterance as the respective enrolled user associated with the one of the enrolled speaker vectors that matches the extracted speaker-discriminative vector."

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
In further detail, the following references were found pertinent to applicant's claim features and/or specification:

Sharifi et al. (11557278) teaches the use of enrolled speakers and the development of a speaker vector set for multiple speakers.

Nachami et al. (20210256993) teaches a multi-layer RNN/CNN performing speaker separation, using embeddings from trained VGG networks and taking advantage of ground-truth information – para 0041.

Ozturk et al. (20220291328) teaches speaker separation of multiple speakers using voice activity as one parameter (para 0253), including other embedded data as well – para 0091.

Xu et al. (20240005941) teaches target speaker separation using speaker-related cue information as well as an auditory signal representation.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Opsasnick, telephone number (571) 272-7623, who is available Monday-Friday, 9am-5pm. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mr. Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/Michael N Opsasnick/
Primary Examiner, Art Unit 2658
11/28/2025

Prosecution Timeline

Jul 21, 2023
Application Filed
Jun 09, 2025
Non-Final Rejection — §102, §103
Sep 10, 2025
Response Filed
Nov 28, 2025
Final Rejection — §102, §103
Feb 02, 2026
Interview Requested
Feb 06, 2026
Applicant Interview (Telephonic)
Feb 06, 2026
Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602554
SYSTEMS AND METHODS FOR PRODUCING RELIABLE TRANSLATION IN NEAR REAL-TIME
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12592246
SYSTEM AND METHOD FOR EXTRACTING HIDDEN CUES IN INTERACTIVE COMMUNICATIONS
Granted Mar 31, 2026 (2y 5m to grant)
Patent 12586580
System For Recognizing and Responding to Environmental Noises
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12579995
Automatic Speech Recognition Accuracy With Multimodal Embeddings Search
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12567432
VOICE SIGNAL ESTIMATION METHOD AND APPARATUS USING ATTENTION MECHANISM
Granted Mar 03, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 82%
With Interview: 92% (+10.5%)
Median Time to Grant: 3y 3m
PTA Risk: Moderate
Based on 900 resolved cases by this examiner. Grant probability derived from career allow rate.
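The projection figures are consistent with the career statistics reported above. A quick sanity check, under the assumption (mine, not stated by the page) that the interview lift is additive in percentage points:

```python
# Back-of-the-envelope check of the dashboard figures.
# The additive-lift assumption is an illustration, not the tool's documented formula.
granted, resolved = 737, 900
allow_rate = granted / resolved                 # 0.819 -> shown as 82%
interview_lift = 0.105                          # +10.5% interview lift
with_interview = allow_rate + interview_lift    # 0.924 -> shown as 92%
print(f"career allow rate {allow_rate:.1%}, with interview {with_interview:.1%}")
```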
