Last updated: May 29, 2026

Application No. 18/466,503

LOW-LATENCY SPEAKER SEPARATION

Non-Final OA §103

Filed: Sep 13, 2023
Examiner: MCLEAN, IAN SCOTT
Art Unit: 2654
Tech Center: 2600 (Communications)
Assignee: BAR ILAN UNIVERSITY
OA Round: 2 (Non-Final)

Interview Optional

An interview has already been conducted in this application's prosecution history. This examiner has a 45% grant rate with a +31.4% interview lift. Because an interview has already been tried, a written response with narrowed claims, guided by precedent claim evolution patterns, is recommended.

Based on 47 resolved cases, 2023–2026

Examiner Intelligence

MCLEAN, IAN SCOTT

Career Allowance Rate: 45% (21 granted / 47 resolved), -17.3% vs TC avg

Interview Lift: +31.4% (resolved cases with an interview vs. without)

Avg Prosecution: 3y 0m (typical timeline)

22 currently pending


Statute-Specific Performance

§103: 88.4% (+48.4% vs TC avg)

§102: 11.6% (-28.4% vs TC avg)

Based on career data from 47 resolved cases

Office Action

§103

Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
2.	Applicant’s arguments with respect to claims 1, 8 and 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
3.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

4.	Claims 1-4, 6-11, 13-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (US 2022/0301573) in view of Tripathi (US 2021/0343273).

Regarding Claim 1:
Wang discloses a system for speech separation (Wang: Fig. 1 discloses a system for voice separation) comprising:
data processing hardware (Wang: p[0016] discloses a system implemented on a client computing device that performs speaker-conditioned speech separation using a voice filter model);
memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations (Wang: p[0016] discloses a system implemented on a client computing device including a memory and processor) comprising:
generating a two-dimensional representation of a speech mixture (Wang: p[0051]-p[0053], p[0063] discloses generating frequency representations of audio using FFT or filter bank, both of which are 2D representations (time and frequency) of a source signal);
separating the speech mixture into an initial separation (Wang: Fig. 3, p[0050]-p[0053], p[0066] discloses initial separations done with a frequency representation (which is generated using automatic speech recognition and is a preliminary separation step) which then goes through power compression and normalization);
supplying the initial separation and speaker representations to a refinement module (Wang: Fig. 3, p[0050]-p[0053], p[0066] discloses supplying an initial representation (normalized frequency representation 306) together with speaker embeddings 218 to a refinement model (voice filter model 112), which refines the initial representation by generating and applying a predicted mask 322 to later be used for completed separation);
refining the initial separation based on the initial separation and the speaker representations (Wang: Fig. 3 the initial separation (normalized representation 306) is supplied to the refinement module (voice filter model) which is then used to produce the predicted mask (a refined representation));
estimating a mask per speaker (Wang: Fig. 3 308 shows the convolving of the initial input and the predicted mask (i.e., the mask is used on the speaker input));
and applying the masks to the two-dimensional representation to create two-dimensional, per-speaker representations (Wang: Fig. 3 308 shows the convolving of the initial input and the predicted mask to produce a revised frequency result (a completely isolated speaker representation)).
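For orientation, the generic masking step discussed in the limitations above can be sketched as follows. This is a minimal illustration, not Wang's voice filter model or the applicant's implementation: the function names, the toy magnitudes, and the use of element-wise multiplication with ratio masks are all assumptions made for the example.

```python
def ratio_masks(estimates, eps=1e-8):
    """Turn per-speaker magnitude estimates into masks that sum to ~1 per cell."""
    n_frames, n_bins = len(estimates[0]), len(estimates[0][0])
    # Total estimated energy in each time-frequency cell (eps avoids division by zero).
    totals = [
        [sum(est[t][f] for est in estimates) + eps for f in range(n_bins)]
        for t in range(n_frames)
    ]
    return [
        [[est[t][f] / totals[t][f] for f in range(n_bins)] for t in range(n_frames)]
        for est in estimates
    ]


def apply_masks(mixture, masks):
    """Multiply the 2D mixture by each speaker's mask, element by element."""
    return [
        [[m * x for m, x in zip(mask_row, mix_row)]
         for mask_row, mix_row in zip(mask, mixture)]
        for mask in masks
    ]


# Hypothetical mixture magnitudes: 2 time frames x 2 frequency bins.
mixture = [[4.0, 2.0], [1.0, 3.0]]
# Hypothetical rough per-speaker estimates standing in for an initial separation.
estimates = [
    [[3.0, 0.5], [0.2, 2.0]],  # speaker A
    [[1.0, 1.5], [0.8, 1.0]],  # speaker B
]

masks = ratio_masks(estimates)
per_speaker = apply_masks(mixture, masks)
```

Because the masks in this sketch sum to approximately one in every cell, the per-speaker representations add back up to the mixture, which makes the example easy to check by hand.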
Wang does not explicitly disclose:
generating per-frame speaker representations of the initial separation;
comparing the per-frame speaker representations to stored speaker representations in a speaker embedding table and enforcing, during training, that the per-frame speaker representations are consistent with the stored speaker representations;
supplying feature projections of the per frame speaker representations; and
refining the initial separation based on the feature projections of the per frame speaker representation;
However, Tripathi discloses:
generating per-frame speaker representations of the initial separation (Tripathi: ¶[0008] produces speaker-specific representations at each frame, the per-frame speaker representations correspond to the respective masked audio embeddings generated per frame for each speaker);
comparing the per-frame speaker representations to stored speaker representations in a speaker embedding table and enforcing, during training, that the per-frame speaker representations are consistent with the stored speaker representations (Tripathi: ¶[0050]-[0051]: the comparing of the per-frame speaker representations is satisfied by computing speaker embeddings derived from the per-frame masked embeddings and comparing them as part of EmbLoss; the so-called speaker embedding table reads on stored reference speaker embeddings (e.g., speaker embedding vectors used as training targets), and the per-frame speaker representations are enforced to remain consistent with the stored/reference embeddings via the embedding loss);
supplying feature projections of the per frame speaker representations (Tripathi: ¶[0031]-[0033] discloses these embeddings are projected outputs of the encoder network, wherein the encoder output embedding is a feature projection (projection from time-domain or Mel features into an embedding space)); and
refining the initial separation based on the feature projections of the per frame speaker representation (Tripathi: ¶[0033] discloses an initial separation at the feature level: the audio encoder generates encoded embeddings from a monophonic mixture, and these embeddings still represent mixed speakers, i.e., an initial separation. ¶[0008] discloses per-frame encoder embeddings and ¶[0041] discloses speaker embeddings supplied per frame, which are explicitly feature space projections representing speaker identity at the frame level. The masking model refines the encoded embeddings based on per-frame encoder embeddings and speaker embeddings/condition inputs to generate masked audio embeddings. During training, embedding loss is applied to enforce that masked embeddings correspond to only one speaker).
Wang is directed to speech separation using speaker conditioned masking and refinement while Tripathi is directed to end-to-end multi-speaker speech recognition that refines speaker-specific representations at a per-frame level using embedding based losses. These references are in the same field of endeavor (speech separation) and address the same problem of improving speaker specific representations in overlapping speech. Tripathi expressly discloses refining speaker-specific representations by enforcing consistency of per frame speaker embeddings during training, stating that “the embedding loss to each of (i) the respective masked audio embedding generated for the first speaker to enforce that an entirety of the respective masked audio embedding generated for the first speaker corresponds to only audio spoken by the first speaker and (ii) the respective masked audio embedding generated for the second speaker to enforce that an entirety of the respective masked audio embedding generated for the second speaker corresponds to only audio spoken by the second speaker” in ¶[0004].
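The idea of enforcing consistency between per-frame speaker representations and stored reference embeddings can be sketched with a simple cosine-distance penalty. This is an illustrative assumption only: it is not Tripathi's EmbLoss and not the applicant's training objective, and the table contents, speaker keys, and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consistency_loss(frame_embeddings, table, speaker_id):
    """Mean cosine distance between each frame embedding and the stored entry."""
    reference = table[speaker_id]
    distances = [1.0 - cosine_similarity(e, reference) for e in frame_embeddings]
    return sum(distances) / len(distances)


# Hypothetical speaker embedding table with two stored speakers.
table = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}
# Hypothetical per-frame embeddings produced for speaker A's separated stream.
frames = [[0.9, 0.1], [1.0, 0.0]]

loss_match = consistency_loss(frames, table, "spk_a")     # small: frames agree with spk_a
loss_mismatch = consistency_loss(frames, table, "spk_b")  # large: frames disagree with spk_b
```

Minimizing a penalty of this kind during training would pull each frame's embedding toward the stored entry for its speaker, which is the behavior the rejection attributes to the embedding loss.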

Regarding Claim 2:
	The proposed combination of Wang and Tripathi further discloses the system of Claim 1, wherein generating a two-dimensional representation of the speech mixture includes passing the speech mixture through an encoder (Wang: p[0050] states that the automatic speech recognition (ASR) engine 114 can process audio to determine a frequency representation and that this engine may use a Fast Fourier Transform (FFT) or a filter bank; both are forms of encoding, and therefore the ASR engine as a whole is acting as an encoder, as is well understood in this field of endeavor).

Regarding Claim 3:
	The proposed combination of Wang and Tripathi further discloses the system of Claim 1, further comprising supplying at least one feature projection to the refinement module for use by the refinement module in refining the initial separation (Wang: p[0063]-p[0065]: frequency representation 302 and speaker embedding 318 are both optionally normalized and/or power compressed before entering the voice filter model; this normalization, especially of the speaker embedding (which is interpreted as the speaker representations), is equivalent to the claimed supplying of at least one feature projection).

Regarding Claim 4:
	The proposed combination of Wang and Tripathi further discloses the system of Claim 1, further comprising using a speaker embedding table to refine the speaker representations (Wang: p[0005] discloses pre-generated speaker embeddings. The sequence of audio data can be associated with the pre-generated speaker embedding after verification of the first human speaker; this is voice printing, which satisfies Applicant's claimed speaker embedding table as described in p[0048] of Applicant's disclosure because the pre-enrolled audio is mapped to the pre-generated embedding, just as in a table).

Regarding Claim 6:
The proposed combination of Wang and Tripathi further discloses the system of Claim 1, further comprising a microphone in communication with the data processing hardware and configured to detect the speech mixture (Wang: p[0002] discloses a microphone of a client device).

Regarding Claim 7:
	The proposed combination of Wang and Tripathi further discloses a vehicle incorporating the microphone of Claim 6 (Wang: p[0048]: the client device may be an in-vehicle system).

Regarding Claim 8:
Claim 8 has been analyzed with regard to claim 1 (see rejection above) and
is rejected for the same reasons of anticipation used above.

Regarding Claim 9:
Claim 9 has been analyzed with regard to claim 2 (see rejection above) and
is rejected for the same reasons of anticipation used above.

Regarding Claim 10:
Claim 10 has been analyzed with regard to claim 3 (see rejection above) and
is rejected for the same reasons of anticipation used above.

Regarding Claim 11:
Claim 11 has been analyzed with regard to claim 4 (see rejection above) and
is rejected for the same reasons of anticipation used above.

Regarding Claim 13:
Claim 13 has been analyzed with regard to claim 6 (see rejection above) and
is rejected for the same reasons of anticipation used above.

Regarding Claim 14:
Claim 14 has been analyzed with regard to claim 7 (see rejection above) and
is rejected for the same reasons of anticipation used above.

Regarding Claim 15:
Claim 15 has been analyzed with regard to claim 1 (see rejection above) and
is rejected for the same reasons of anticipation used above.

Regarding Claim 16:
Claim 16 has been analyzed with regard to claim 2 (see rejection above) and
is rejected for the same reasons of anticipation used above.

Regarding Claim 17:
Claim 17 has been analyzed with regard to claim 4 (see rejection above) and
is rejected for the same reasons of anticipation used above.

Regarding Claim 19:
Claim 19 has been analyzed with regard to claim 6 (see rejection above) and
is rejected for the same reasons of anticipation used above.

Regarding Claim 20:
Claim 20 has been analyzed with regard to claim 7 (see rejection above) and
is rejected for the same reasons of anticipation used above.

5.	Claims 5, 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Tripathi and further in view of Yang (US 2024/0404509).

Regarding Claim 5:
The proposed combination of Wang and Tripathi discloses the system of Claim 1, but does not disclose further comprising passing the two-dimensional, per-speaker representations through a decoder to generate per-speaker waveforms.
	However, Yang discloses this limitation (Yang: p[0034] discloses that an intermediate representation and an embedding can be input to a decoder to obtain a final representation, which the decoder converts into a waveform to generate speech).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Yang to the per-speaker frequency representations of Wang and Tripathi in order to generate per-speaker waveforms. Decoding spectrogram or frequency domain features back into waveforms (via STFT, Griffin-Lim) was a well-known and predictable step in speech processing at the time of the invention. A person of ordinary skill in the art would recognize that producing per-speaker waveforms enables additional uses such as playback, user experience, real-time responsiveness or downstream processing beyond ASR transcription (see Yang p[0003]), and Wang’s system is fully capable of all these tasks but limits itself to ASR as a matter of implementation choice (see the last line of Wang p[0003], where it states that the voice filter model can be used for automatic speech recognition without [emphasis added] reconstructing the audio).
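As a rough illustration of what decoding a two-dimensional frequency representation back into a waveform involves, the sketch below round-trips a short signal through a discrete Fourier transform and its inverse. It is deliberately simplified and rests on stated assumptions: frames are non-overlapping and unwindowed, the complex spectrum (including phase) is kept, and the function names are hypothetical. It does not represent Yang's decoder, which the rejection describes only at a high level, and practical systems typically use windowed, overlapping frames or a learned decoder.

```python
import cmath


def dft(frame):
    """Discrete Fourier transform of one frame of real samples."""
    n = len(frame)
    return [
        sum(frame[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; the real part recovers the original real-valued samples."""
    n = len(spectrum)
    return [
        (sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)) / n).real
        for k in range(n)
    ]


def encode(waveform, frame_len):
    """Split the waveform into non-overlapping frames and transform each one."""
    return [
        dft(waveform[i:i + frame_len])
        for i in range(0, len(waveform) - frame_len + 1, frame_len)
    ]


def decode(spectrogram):
    """Invert each frame and concatenate the results into a waveform."""
    waveform = []
    for spectrum in spectrogram:
        waveform.extend(idft(spectrum))
    return waveform


# Hypothetical 8-sample signal, framed into two frames of 4 samples.
waveform = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.5, -0.25]
spectrogram = encode(waveform, 4)  # frames x frequency bins (complex values)
rebuilt = decode(spectrogram)      # matches `waveform` up to rounding error
```

In a separation pipeline, the decode step would be applied once per speaker to that speaker's masked representation rather than to the unmodified mixture shown here.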

Regarding Claim 12:
Claim 12 has been analyzed with regard to claim 5 (see rejection above) and
is rejected for the same reasons of obviousness used above.
Regarding Claim 18:
Claim 18 has been analyzed with regard to claim 5 (see rejection above) and
is rejected for the same reasons of obviousness used above.

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IAN SCOTT MCLEAN whose telephone number is (703)756-4599. The examiner can normally be reached "Monday - Friday 8:00-5:00 EST, off Every 2nd Friday".
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/IAN SCOTT MCLEAN/Examiner, Art Unit 2654   

/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654


Prosecution Timeline


Sep 18, 2025: Examiner Interview Summary
Sep 18, 2025: Applicant Interview (Telephonic)
Nov 14, 2025: Response Filed
Dec 23, 2025: Final Rejection mailed (§103)
Jan 12, 2026: Interview Requested
Jan 20, 2026: Applicant Interview (Telephonic)
Jan 20, 2026: Examiner Interview Summary
Feb 12, 2026: Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/499,296 (Patent 12609127): NEUTRALIZING DISTORTION IN AUDIO DATA. 2y 5m to grant; granted Apr 21, 2026.

18/245,802 (Patent 12602553): SPEECH TRANSLATION METHOD, DEVICE, AND STORAGE MEDIUM. 3y 0m to grant; granted Apr 14, 2026.

17/952,401 (Patent 12494199): VOICE INTERACTION METHOD AND ELECTRONIC DEVICE. 3y 2m to grant; granted Dec 09, 2025.

18/063,167 (Patent 12443805): Systems and Methods for Multilingual Data Processing and Arrangement on a Multilingual User Interface. 2y 10m to grant; granted Oct 14, 2025.

17/559,283 (Patent 12437144): Content Recommendation Method and User Terminal. 3y 9m to grant; granted Oct 07, 2025.

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 2-3

Grant Probability: 45%

With Interview (+31.4%): 76%

Median Time to Grant: 3y 0m (~4m remaining)

PTA Risk: Moderate

Based on 47 resolved cases by this examiner. Grant probability derived from career allowance rate.
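The projection figures can be reproduced from the counts shown on this page. The short check below uses only numbers stated above; treating the interview lift as an additive percentage-point adjustment is an assumption, chosen because it matches the 45% and 76% values displayed.

```python
granted, resolved = 21, 47   # from "21 granted / 47 resolved"
interview_lift = 0.314       # from "+31.4% interview lift"

allowance_rate = granted / resolved
# Assumption: the lift is added in percentage points, not applied as a multiplier.
with_interview = allowance_rate + interview_lift

print(f"{allowance_rate:.0%}")   # prints 45%
print(f"{with_interview:.0%}")   # prints 76%
```

With only 47 resolved cases behind them, both percentages are best read as rough estimates rather than precise probabilities.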