DETAILED ACTION

Claim Status
This is the first Office action on the merits in response to the application filed on 9/8/2023. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 are currently pending and have been examined.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 9/8/2023 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-2 and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Pishehvar (US 20220366927 A1) in view of Chattopadhyay (US 20240211728 A1).
Regarding Claims 1 and 19-20, Pishehvar teaches:

An audio system for facilitating an operation of a machine including one or multiple [actuators] assisting one or multiple tools to perform one or multiple tasks, comprising (Pishehvar: Abstract; Paragraph(s) 0004-0005, 0012, 0031, 0065-0066);

A system for facilitating an operation of a machine including one or multiple actuators assisting one or multiple tools to perform one or multiple tasks, comprising: a processor; and a memory having instructions stored thereon that cause the processor to (Pishehvar: Abstract; Paragraph(s) 0004-0005, 0012, 0031, 0065-0066, 0069-0070);

A method for facilitating an operation of a machine including one or multiple [actuators] assisting one or multiple tools to perform one or multiple tasks, comprising (Pishehvar: Abstract; Paragraph(s) 0004-0005, 0012, 0031, 0065-0066);

an audio input interface configured to receive an audio mixture of signals generated by multiple audio sources including at least one of: the one or multiple tools performing one or multiple tasks, or the one or multiple actuators operating the one or multiple tools, wherein at least one of the audio sources forming the audio mixture is identified by a location relative to a location of each microphone of a microphone array measuring the audio mixture (Pishehvar: Abstract; Paragraph(s) 0034-0038, 0012, 0031 teach(es) visual information of the target speaker provided by the camera may augment the audio signals captured by the microphone array to facilitate the multi-task machine learning model to better discriminate between the target speech and the interfering talker or the background noise; a scenario of a user uttering speech during a telephony or video conferencing call or issuing a voice command to a smartphone for the smartphone to detect the voice command according to one aspect of the disclosure; the smartphone may include three microphones located at various locations on the smartphone, and the microphones may form a compact microphone array to capture speech signals from the user);

a processor configured to extract an audio signal generated by an identified audio source of the multiple audio sources from the audio mixture based on a correlation of spectral features in a multi-channel spectrogram of the audio mixture with directional information indicative of the relative location of the identified audio source (Pishehvar: Abstract; Paragraph(s) 0010, 0012, 0025-0026, 0052 teach(es) the separation network may estimate the spectrogram masks of the target speaker at the time-frequency bins to multiply with the spectrogram from the STFT to mask the target speech from the interference signal; an inverse STFT (iSTFT) may transform the masked spectrogram into the time-domain target speech signal and to estimate the target audio parameters; spectral features of the mixed signal may be estimated using short-term Fourier Transform (STFT) instead of mapping the mixed signal to time-domain representations using the linear encoder of the time-domain DNN model; the DNN model receiving the multi-channel audio signals that include the target speech signal overlapped with interference signals and the multi-modal signals that contain information of a source of the target speech signal; the target speaker's voice characteristics extracted from utterances by the target speaker captured during an enrollment process, etc.)
; and an output audio interface configured to output the extracted audio signal to facilitate the operation of the machine (Pishehvar: Paragraph(s) 0024-0025, 0031-0032 teach(es) the target audio parameters may be provided as independent outputs for processing by subsequent audio processing functions, thus eliminating the need to separately generate the target audio parameters by the other audio processing functions; the enhanced target speech signals and optionally the target audio parameters may be provided to telephony or video conferencing applications to improve the user experience during the conversation or to automatic speech recognition applications to identify and interpret the voice command or query).

However, Pishehvar does not explicitly teach actuators. Chattopadhyay, from the same or a similar field of endeavor, teaches actuators (Chattopadhyay: Paragraph(s) 0123 teach(es) the audio capture device may include audio capture hardware including one or more sensors as well as actuator controls). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Pishehvar to incorporate the teachings of Chattopadhyay regarding actuators. There is motivation to combine Chattopadhyay into Pishehvar because Chattopadhyay's teachings of actuators would facilitate the encoder and the separation network to better discriminate between the target speech and speech from interfering speakers or background noise (Chattopadhyay: Paragraph(s) 0045).

Regarding Claim 2, the combination of Pishehvar and Chattopadhyay teaches all the limitations of Claim 1 above; and Pishehvar further teaches wherein the processor is configured to extract the audio signal generated by the identified audio source using a neural network (Pishehvar: Abstract; Paragraph(s) 0003 teach(es) deep neural networks (DNN) trained to model desirable and undesired signal characteristics of mixed signals may filter or separate target speech from interference signals to generate enhanced target speech signals).

Regarding Claim 15, the combination of Pishehvar and Chattopadhyay teaches all the limitations of Claim 1 above; and Pishehvar further teaches wherein the processor is further configured to: produce a control command for the operation of the machine based on the extracted signal; and transmit the control command to the machine over a communication channel (Pishehvar: Paragraph(s) 0024-0025, 0031-0032, as stated above with respect to claim 1).
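For technical context only (this sketch is not part of the prosecution record), the STFT mask-based extraction described in the Pishehvar passages cited above for claims 1 and 2 can be illustrated as follows. The structure (STFT, time-frequency mask, inverse STFT) follows the cited paragraphs; the estimate_mask function is a hypothetical placeholder standing in for a trained separation network.

```python
# Illustrative sketch of STFT-domain mask-based target extraction
# (cf. Pishehvar paras. 0010, 0038). `estimate_mask` is a hypothetical
# stand-in for a trained separation network.
import numpy as np
from scipy.signal import stft, istft

def estimate_mask(magnitude):
    # Placeholder: a real system would use a trained DNN. Here, a crude
    # magnitude-threshold mask keeps only the dominant time-frequency bins.
    return (magnitude > 0.5 * magnitude.max()).astype(float)

def extract_target(mixture, fs=16000, nperseg=512):
    """Extract a target signal from a single-channel mixture by masking."""
    _, _, spec = stft(mixture, fs=fs, nperseg=nperseg)   # complex spectrogram
    mask = estimate_mask(np.abs(spec))                   # hypothetical model call
    masked = mask * spec                                 # apply T-F mask
    _, target = istft(masked, fs=fs, nperseg=nperseg)    # back to time domain
    return target
```

A trained network would replace the threshold mask; only the pipeline shape (analysis, masking, synthesis) is the point of the sketch.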
Regarding Claim 16, the combination of Pishehvar and Chattopadhyay teaches all the limitations of Claim 15 above; and Pishehvar further teaches wherein the processor is further configured to: analyze the extracted audio signal generated by the identified acoustic source from the audio mixture to produce a state of performance of a task; select the control command from a set of control commands based on the state of performance of the task, wherein the set of control commands correspond to different states of performance of the one or multiple tasks; and cause the machine to execute the control command (Pishehvar: Paragraph(s) 0002, 0036, 0022, 0031-0032 teach(es) for the devices to isolate the speech from the target speaker in telephony or video conference calls or to invoke applications and services to respond accurately and timely to the voice commands, the devices need to suppress interference and improve intelligibility of the speech signals in the noisy environment; the target speech may be overlapped with interference signal such as voice from competing speakers (e.g., 120 of FIG. 2), background noise (e.g., 130 of FIG. 2), artifacts due to the acoustic environment such as reverberant signals, the main speaker's own interruptions, etc.).

Regarding Claim 17, the combination of Pishehvar and Chattopadhyay teaches all the limitations of Claim 15 above; however, the combination does not explicitly teach wherein the processor is further configured to: determine an anomaly score for the identified acoustic source based on the extracted audio signal corresponding to the identified audio source, wherein the anomaly score indicates a correlation between a type of an anomaly and a state of the identified acoustic source; compare the anomaly score with an anomaly threshold; select the control command from a set of control commands to be performed by the machine when the anomaly score is greater than the anomaly threshold; and transmit the selected control command to the machine for overcoming an anomaly at the identified acoustic source. Chattopadhyay further teaches wherein the processor is further configured to: determine an anomaly score for the identified acoustic source based on the extracted audio signal corresponding to the identified audio source, wherein the anomaly score indicates a correlation between a type of an anomaly and a state of the identified acoustic source; compare the anomaly score with an anomaly threshold; select the control command from a set of control commands to be performed by the machine when the anomaly score is greater than the anomaly threshold; and transmit the selected control command to the machine for overcoming an anomaly at the identified acoustic source (Chattopadhyay: Paragraph(s) 0033, 0109, 0143-0145 teach(es) the boosted signals from the separator model were detected by the second stage classifier with an accuracy of greater than 90% and with an F1 score greater than 0.9, which indicates a high performance classifier (where over 0.9 is very good, 0.8 to 0.9 is good, 0.5 to 0.8 is adequate, and less than 0.5 is poor); the audio subsystem also may be used to control the motion of articles or the selection of commands on the interface).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of the combination of Pishehvar and Chattopadhyay to incorporate the further teachings of Chattopadhyay for wherein the processor is further configured to: determine an anomaly score for the identified acoustic source based on the extracted audio signal corresponding to the identified audio source, wherein the anomaly score indicates a correlation between a type of an anomaly and a state of the identified acoustic source; compare the anomaly score with an anomaly threshold; select the control command from a set of control commands to be performed by the machine when the anomaly score is greater than the anomaly threshold; and transmit the selected control command to the machine for overcoming an anomaly at the identified acoustic source. There is motivation to combine Chattopadhyay into the combination of Pishehvar and Chattopadhyay because Chattopadhyay's teachings of anomaly scores would facilitate detection of a target audio source in noisy environments (Chattopadhyay: Abstract; Paragraph(s) 0033, 0109, 0143-0145).

Regarding Claim 18, the combination of Pishehvar and Chattopadhyay teaches all the limitations of Claim 1 above; and Pishehvar further teaches wherein the multiple audio sources generating the audio mixture belong to a same class, and wherein the processor is further configured to: extract an audio signal generated by each of the multiple audio sources from the audio mixture based on a correlation of spectral features in a multi-channel spectrogram of the audio mixture with directional information indicative of relative locations corresponding to the multiple audio sources (Pishehvar: Paragraph(s) 0035-0040 teach(es) a machine learning model that uses multi-task learning to jointly generate an enhanced target speech signal and one or more target audio parameters from a mixed signal of target speech and interference signal according to one aspect of the disclosure; the machine learning model may be an end-to-end time-domain multi-task learning framework that uses a data-driven internal representation to encode the mixed signal and separate the internal representation of the target speech from the interference signal to generate masked features in the time domain; if the machine learning model is a non-time-domain network, the separation network may estimate the spectrogram masks of the target speech at the time-frequency bins to multiply with the spectrogram from the STFT to mask the target speech from the interference signal).
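For context only (not part of the prosecution record), the anomaly-score logic recited in claim 17 above reduces to a score-threshold-select pattern, sketched below; the command names, types, and threshold value are all hypothetical.

```python
# Purely illustrative sketch of the claim 17 logic: score the identified
# acoustic source, compare against a threshold, and select a control
# command. All names and values are hypothetical.
from typing import Optional

# Hypothetical mapping from anomaly type to a corrective machine command.
COMMANDS = {
    "bearing_wear": "reduce_spindle_speed",
    "tool_chatter": "increase_damping",
}

def select_command(anomaly_score: float, anomaly_type: str,
                   threshold: float = 0.8) -> Optional[str]:
    """Return a corrective command only when the score exceeds the threshold."""
    if anomaly_score > threshold:
        return COMMANDS.get(anomaly_type)
    return None  # below threshold: no intervention needed
```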
Claims 3-13 are rejected under 35 U.S.C. 103 as being unpatentable over Pishehvar in view of Chattopadhyay, as applied to claims 1 and 2 above, and further in view of Wingate (WO 2015157013 A1).

Regarding Claim 3, the combination of Pishehvar and Chattopadhyay teaches all the limitations of Claim 2 above; and Pishehvar further teaches wherein the spectral features include inter-channel [phase] differences of channels in the multi-channel spectrogram of the audio mixture (Pishehvar: Paragraph(s) 0010, 0037, 0052 teach(es) spectral features of the mixed signal may be estimated using short-term Fourier Transform (STFT) instead of mapping the mixed signal to time-domain representations using the linear encoder of the time-domain DNN model; the learned mapped feature representation may jointly encode spectral, temporal, or spatial information of the multi-channel signal), the directional information includes target [phase] differences (TPDs) of a sound propagating from the relative location of the identified audio source to different microphones in the microphone array (Pishehvar: Paragraph(s) 0025-0026 teach(es) the multi-task machine learning model may infer a spatial VAD such that only target speech from a preferred direction is considered as active speech; the multi-task machine learning model may leverage spatial and directional information provided by the multiple microphones of the microphone array to improve the robustness of the enhanced target speech signal and the target audio parameters), the correlation of the spectral features and the directional information is represented by a target [phase] correlation spectrogram, wherein values for different time-frequency bins of the target [phase] correlation spectrogram quantify alignment of the inter-channel [phase] differences with the target [phase] differences in the corresponding time-frequency bins, and wherein the target [phase] differences are expected [phase] differences for the time-frequency bins indicative of properties of sound propagation, wherein the processor is further configured to: determine the target [phase] correlation spectrogram; and process the target [phase] correlation spectrogram with the neural network to extract the audio signal (Pishehvar: Paragraph(s) 0010, 0038 teach(es) spectral features of the mixed signal may be estimated using short-term Fourier Transform (STFT) instead of mapping the mixed signal to time-domain representations using the linear encoder of the time-domain DNN model; the separation network may estimate the spectrogram masks of the target speaker at the time-frequency bins to multiply with the spectrogram from the STFT to mask the target speech from the interference signal; an inverse STFT (iSTFT) may transform the masked spectrogram into the time-domain target speech signal and to estimate the target audio parameters).

However, the combination of Pishehvar and Chattopadhyay does not explicitly teach a phase correlation spectrogram and phase differences. Wingate, from the same or a similar field of endeavor, teaches a phase correlation spectrogram and phase differences (Wingate: Paragraph(s) 0007-0008, 0095, 0034-0035, 0099 teach(es) a plurality of acoustic sensors are employed to make a distinction between the signals acquired by different sensors (e.g., for the purpose of determining DOA by, e.g., comparing the phases of the different signals); the different BSS techniques presented herein are based on computing time-dependent spectral characteristics X of the acquired signal; direction-of-arrival (DOA) information is computed from the time signals, also indexed by frequency and frame.
For example, continuous incidence angle estimates D(f, n), which may be represented as a scalar or a multidimensional vector, are derived from the phase differences of the STFT). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of the combination of Pishehvar and Chattopadhyay to incorporate the teachings of Wingate for the phase correlation spectrogram and phase differences. There is motivation to combine Wingate into the combination of Pishehvar and Chattopadhyay because Wingate's teachings of the phase correlation spectrogram and phase differences would facilitate separating a sound generated by a particular source of interest from a mixture of different sounds (Wingate: Abstract).

Regarding Claim 4, the combination of Pishehvar, Chattopadhyay, and Wingate teaches all the limitations of Claim 3 above; however, the combination does not explicitly teach wherein the target phase correlation spectrogram includes complex numbers, and wherein the neural network is a complex neural network for processing the complex numbers of the target phase correlation spectrogram. Wingate, from the same or a similar field of endeavor, teaches wherein the target phase correlation spectrogram includes complex numbers, and wherein the neural network is a complex neural network for processing the complex numbers of the target phase correlation spectrogram (Wingate: Paragraph(s) 0096, 0009, 0017, 0093 teach(es) complex STFT; the frequency decomposition of all of the frames may be arranged in a matrix where frames and frequency are indexed (in the following, frames are described to be indexed by "n" and frequencies are described to be indexed by "f"); each element of such an array, indexed by (f, n), comprises a complex value resulting from the application of the transformation function and is referred to herein as a "time-frequency bin" or simply "bin"; the term "bin" may be viewed as indicative of the fact that such a matrix may be considered as comprising a plurality of bins into which the signal's energy is distributed; in an embodiment, the bins may be considered to contain not complex values but positive real quantities X(f, n) of the complex values, such quantities representing magnitudes of the acquired signal, presented, e.g., as an actual magnitude, a squared magnitude, or as a compressive transformation of a magnitude, such as a square root). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of the combination of Pishehvar, Chattopadhyay, and Wingate to incorporate the further teachings of Wingate for wherein the target phase correlation spectrogram includes complex numbers, and wherein the neural network is a complex neural network for processing the complex numbers of the target phase correlation spectrogram. There is motivation to combine Wingate into the combination of Pishehvar, Chattopadhyay, and Wingate because Wingate's teachings of the complex STFT and time-frequency bins would facilitate improving the accuracy and efficiency of source separation (Wingate: Abstract; Paragraph(s) 0096, 0009, 0017, 0093).
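For technical context only (not part of the prosecution record), the target phase correlation feature recited in claims 3 and 4 can be sketched as follows: inter-channel phase differences (IPDs) observed in the multi-channel STFT are compared against the target phase differences (TPDs) predicted from the source location and the array geometry. The geometry, sound speed constant, and all function names below are assumptions for illustration.

```python
# Illustrative sketch of a target phase correlation spectrogram: compare
# observed inter-channel phase differences (IPDs) with the phase
# differences expected (TPDs) for a known source location.
import numpy as np

C = 343.0  # assumed speed of sound, m/s

def target_phase_correlation(specs, mic_pos, src_pos, freqs, ref=0):
    """
    specs:   (M, F, T) complex multi-channel STFT of the mixture
    mic_pos: (M, 3) microphone coordinates in meters
    src_pos: (3,) source coordinates in meters
    freqs:   (F,) STFT bin center frequencies in Hz
    Returns an (M-1, F, T) complex spectrogram whose values are close to 1
    in time-frequency bins dominated by the source at src_pos.
    """
    dists = np.linalg.norm(mic_pos - src_pos, axis=1)        # (M,)
    out = []
    for m in range(len(mic_pos)):
        if m == ref:
            continue
        # Observed IPD between mic m and the reference mic, per T-F bin.
        ipd = np.angle(specs[m]) - np.angle(specs[ref])       # (F, T)
        # Expected TPD from the extra propagation delay to mic m.
        tau = (dists[m] - dists[ref]) / C                     # seconds
        tpd = -2.0 * np.pi * freqs * tau                      # (F,)
        # Alignment: unit magnitude, angle = IPD - TPD (equals 1 on match).
        out.append(np.exp(1j * (ipd - tpd[:, None])))
    return np.stack(out)                                      # (M-1, F, T)
```

Under the same assumptions, the channel concatenation recited in claim 13 would then stack this feature with the multi-channel STFT and frequency position encodings as the network input.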
Regarding Claim 5, the combination of Pishehvar, Chattopadhyay, and Wingate teaches all the limitations of Claim 4 above; and Pishehvar further teaches wherein the complex neural network has a complex U-net architecture (Pishehvar: Paragraph(s) 0037-0040, 0052-0055, 0010 teach(es) the encoder module may be implemented by a convolutional operation of the mixed signal with the encoder basis functions followed by a linear function or a nonlinear function such as a rectified linear unit (ReLU); in one aspect, if the machine learning model is a non-time-domain network, the encoder module may extract spectral features of the mixed signal using short-term Fourier Transform (STFT); an audio convolutional network may contain a series of convolutional filters across time and channels to transform the conditioned multi-channel signal to an internal representation using encoder basis functions).

Regarding Claim 6, the combination of Pishehvar, Chattopadhyay, and Wingate teaches all the limitations of Claim 5 above; and Pishehvar further teaches wherein the complex U-net architecture comprises: a complex convolutional encoder; a [complex bidirectional long short-term memory (BLSTM)] module arranged to process outputs of the complex convolutional encoder; and a complex convolutional decoder arranged to process the outputs of the complex convolutional encoder and outputs of the complex BLSTM module (Pishehvar: Paragraph(s) 0037-0040, 0052-0055, 0010, as stated above with respect to claim 5). However, the combination of Pishehvar, Chattopadhyay, and Wingate does not explicitly teach a complex bidirectional long short-term memory (BLSTM) module. Wingate, from the same or a similar field of endeavor, teaches a complex bidirectional long short-term memory (BLSTM) module (Wingate: Paragraph(s) 0150 teach(es) in order to capture longer range interactions, other types of neural net models may be learned, such as recurrent neural nets (RNN) or long short-term memory (LSTM) nets; further, nets may be trained to be specific to a single speaker or language, or more general, depending on the training data chosen). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of the combination of Pishehvar, Chattopadhyay, and Wingate to incorporate the teachings of Wingate for a complex bidirectional long short-term memory (BLSTM) module. There is motivation to combine Wingate into the combination of Pishehvar, Chattopadhyay, and Wingate because Wingate's teachings of long short-term memory (LSTM) nets would facilitate using a neural network for improving the accuracy and efficiency of source separation (Wingate: Abstract; Paragraph(s) 0150).
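For context only (not part of the prosecution record), one way to realize the encoder/BLSTM/decoder arrangement recited in claim 6 is sketched below, carrying complex spectrogram values as separate real and imaginary channels. This is a minimal sketch under stated assumptions, not the application's or Pishehvar's actual architecture; depths, widths, and all names are hypothetical, and a fully complex-valued network would be an alternative to the two-channel workaround used here.

```python
# Skeletal sketch of an encoder -> BLSTM -> decoder arrangement with a
# skip connection, as in claim 6. Complex values are carried as two real
# channels (real, imag); all sizes are hypothetical.
import torch
import torch.nn as nn

class TinyComplexUNet(nn.Module):
    def __init__(self, freq_bins=256, hidden=128):
        super().__init__()
        # "Complex" encoder: 2 input channels = (real, imag).
        self.enc = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, stride=(2, 1), padding=1),
            nn.ReLU(),
        )
        enc_feat = 16 * (freq_bins // 2)
        self.blstm = nn.LSTM(enc_feat, hidden, batch_first=True,
                             bidirectional=True)
        self.proj = nn.Linear(2 * hidden, enc_feat)
        # Decoder consumes encoder output and BLSTM output (skip connection).
        self.dec = nn.ConvTranspose2d(32, 2, kernel_size=3, stride=(2, 1),
                                      padding=1, output_padding=(1, 0))

    def forward(self, x):                # x: (B, 2, F, T)
        e = self.enc(x)                  # (B, 16, F/2, T)
        b, c, f, t = e.shape
        seq = e.permute(0, 3, 1, 2).reshape(b, t, c * f)
        h, _ = self.blstm(seq)           # (B, T, 2*hidden)
        h = self.proj(h).reshape(b, t, c, f).permute(0, 2, 3, 1)
        return self.dec(torch.cat([e, h], dim=1))  # (B, 2, F, T)

# Smoke test with a dummy two-channel (real/imag) spectrogram.
net = TinyComplexUNet()
y = net(torch.randn(1, 2, 256, 50))
```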
Regarding Claim 7, the combination of Pishehvar, Chattopadhyay, and Wingate teaches all the limitations of Claim 6 above; and Pishehvar further teaches wherein the neural network is trained to extract signals of multiple identified audio sources and wherein the complex U-net architecture includes at least one complex convolutional decoder for each of the identified audio sources (Pishehvar: Paragraph(s) 0022, 0032, 0007 teach(es) the DNN models may be trained to remove noise and reverberation from the target speech to enhance the target speech for subsequent audio processing such as speech routing in telephony or video conferencing applications or to recognize and interpret the target speech in voice command applications; the multi-task machine learning model may be trained to model desirable and undesired signal characteristics of the mixed signals to filter or separate the target speech signal from interference signals to generate enhanced target speech signals).

Regarding Claim 8, the combination of Pishehvar, Chattopadhyay, and Wingate teaches all the limitations of Claim 2 and "phase" above; and Pishehvar further teaches wherein the processor is further configured to: determine the target … correlation spectrogram for the audio mixture using the neural network (Pishehvar: Paragraph(s) 0010, 0038-0040 teach(es) the separation network may estimate the spectrogram masks of the target speaker at the time-frequency bins to multiply with the spectrogram from the STFT to mask the target speech from the interference signal; an inverse STFT (iSTFT) may transform the masked spectrogram into the time-domain target speech signal and to estimate the target audio parameters).

Regarding Claim 9, the combination of Pishehvar, Chattopadhyay, and Wingate teaches all the limitations of Claim 2 above; however, the combination does not explicitly teach wherein, to train the neural network, the processor is further configured to: receive a training audio mixture of signals generated by one or more training audio sources including at least one of: one or more tools performing one or more tasks, or one or more actuators operating the one or more tools, wherein at least one of the one or more training audio sources forming the training audio mixture is identified by location data relative to the location of each microphone of the microphone array measuring the training audio mixture; generate one or more training target phase correlation spectrograms associated with corresponding training audio sources, the one or more training target phase correlation spectrograms being generated based on a correlation between spectral features of the training audio mixture and directional features indicative of the location data of the one or more training audio sources forming the training audio mixture, wherein each time-frequency (TF) bin of the one or more training target phase correlation spectrograms defines a feature that quantifies a match between inter-channel phase differences observed in the spectral features of the measured training audio mixture and corresponding expected phase differences indicative of properties of sound propagation for the corresponding location data of the one or more training audio sources relative to the location of each microphone of the microphone array; and train the neural network to extract training audio signals corresponding to the one or more training audio sources based on the respective one or more training target phase correlation spectrograms.
Wingate further teaches wherein, to train the neural network, the processor is further configured to: receive a training audio mixture of signals generated by one or more training audio sources including at least one of: one or more tools performing one or more tasks, or one or more actuators operating the one or more tools, wherein at least one of the one or more training audio sources forming the training audio mixture is identified by location data relative to the location of each microphone of the microphone array measuring the training audio mixture (Wingate: Abstract; Paragraph(s) 0132, 0150, 0152, 0255-0257 teach(es) to present a user with a graphical illustration of the location of each source relative to the microphone array, allowing for manual selection of which sources to pass and block, or visual feedback about which sources are being automatically blocked; nets may be trained to be specific to a single speaker or language, or more general, depending on the training data chosen); generate one or more training target phase correlation spectrograms associated with corresponding training audio sources, the one or more training target phase correlation spectrograms being generated based on a correlation between spectral features of the training audio mixture and directional features indicative of the location data of the one or more training audio sources forming the training audio mixture, wherein each time-frequency (TF) bin of the one or more training target phase correlation spectrograms defines a feature that quantifies a match between inter-channel phase differences observed in the spectral features of the measured training audio mixture and corresponding expected phase differences indicative of properties of sound propagation for the corresponding location data of the one or more training audio sources relative to the location of each microphone of the microphone array (Wingate: Paragraph(s) 0124, 0175-0176, 0132, 0150-0152, 0255-0257 teach(es) in addition to the spectral information, the processing of the acquired signals may also include determining directional characteristics at each time frame for each of multiple components of the signals); and train the neural network to extract training audio signals corresponding to the one or more training audio sources based on the respective one or more training target phase correlation spectrograms (Wingate: Paragraph(s) 0080, 0139, 0150 teach(es) use of spoken input for user devices, e.g., smartphones, can be challenging due to the presence of other sound sources; BSS techniques aim to separate a sound generated by a particular source of interest from a mixture of various sounds; various BSS techniques disclosed herein are based on the recognition that providing additional information that is considered within iterations of a nonnegative matrix factorization (NMF) model, thus making the model a nonnegative tensor factorization model due to the presence of at least one extra dimension in the model (hence, "tensor" instead of "matrix"), improves accuracy and efficiency of source separation; examples of such information include direction estimates or neural network models trained to recognize a particular sound of interest).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of the combination of Pishehvar, Chattopadhyay, and Wingate to incorporate the further teachings of Wingate for wherein, to train the neural network, the processor is further configured to: receive a training audio mixture of signals generated by one or more training audio sources including at least one of: one or more tools performing one or more tasks, or one or more actuators operating the one or more tools, wherein at least one of the one or more training audio sources forming the training audio mixture is identified by location data relative to the location of each microphone of the microphone array measuring the training audio mixture; generate one or more training target phase correlation spectrograms associated with corresponding training audio sources, the one or more training target phase correlation spectrograms being generated based on a correlation between spectral features of the training audio mixture and directional features indicative of the location data of the one or more training audio sources forming the training audio mixture, wherein each time-frequency (TF) bin of the one or more training target phase correlation spectrograms defines a feature that quantifies a match between inter-channel phase differences observed in the spectral features of the measured training audio mixture and corresponding expected phase differences indicative of properties of sound propagation for the corresponding location data of the one or more training audio sources relative to the location of each microphone of the microphone array; and train the neural network to extract training audio signals corresponding to the one or more training audio sources based on the respective one or more training target phase correlation spectrograms. There is motivation to combine Wingate into the combination of Pishehvar, Chattopadhyay, and Wingate because Wingate's teachings of training neural networks would facilitate training neural network models to recognize a particular sound of interest (Wingate: Abstract).

Regarding Claim 10, the combination of Pishehvar, Chattopadhyay, and Wingate teaches all the limitations of Claim 9 above; however, the combination does not explicitly teach wherein, to train the neural network, the processor is further configured to: train the neural network based on a set of loss functions, the set of loss functions comprising at least one of: a location loss function corresponding to each of the separated training audio signals for the training audio sources, or a reconstruction loss function associated with a summation of the extracted training audio signals for reconstructing the training audio mixture.
Chattopadhyay further teaches wherein, to train the neural network, the processor is further configured to: train the neural network based on a set of loss functions, the set of loss functions comprising at least one of: a location loss function corresponding to each of the separated training audio signals for the training audio sources, or a reconstruction loss function associated with a summation of the extracted training audio signals for reconstructing the training audio mixture (Chattopadhyay: Paragraph(s) 0087 teach(es) to train the separator NN, loss may be estimated between the output or estimated spectrogram, which is the separated audio signal after applying the mask, and the ground-truth spectrogram based on the ideal or pure target and background audio signals to generate a mask inference loss). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of the combination of Pishehvar, Chattopadhyay, and Wingate to incorporate the further teachings of Chattopadhyay for wherein, to train the neural network, the processor is further configured to: train the neural network based on a set of loss functions, the set of loss functions comprising at least one of: a location loss function corresponding to each of the separated training audio signals for the training audio sources, or a reconstruction loss function associated with a summation of the extracted training audio signals for reconstructing the training audio mixture. There is motivation to combine Chattopadhyay into the combination of Pishehvar, Chattopadhyay, and Wingate because Chattopadhyay's teachings of estimating loss between the output or estimated spectrogram and the ground truth would facilitate separating the target audio source signal (Chattopadhyay: Abstract; Paragraph(s) 0087).
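For context only (not part of the prosecution record), the two losses recited in claims 10 and 11 can be sketched as follows, reusing the target phase correlation feature from the earlier sketch. The weighting scheme and all names are hypothetical illustration, not the application's actual training objective.

```python
# Illustrative sketch of the claim 10/11 training losses: a reconstruction
# loss on the summed extracted signals and a location loss comparing an
# estimated target phase correlation spectrogram against an ideal one.
import numpy as np

def reconstruction_loss(extracted, mixture):
    """MSE between the sum of extracted source signals and the mixture."""
    return np.mean((np.sum(extracted, axis=0) - mixture) ** 2)

def location_loss(estimated_tpc, ideal_tpc):
    """Mean squared deviation between estimated and ideal (complex-valued)
    target phase correlation spectrograms, cf. claim 11."""
    return np.mean(np.abs(estimated_tpc - ideal_tpc) ** 2)

def total_loss(extracted, mixture, estimated_tpc, ideal_tpc, alpha=0.5):
    # The weighted combination is an assumption, not from the record.
    return (alpha * reconstruction_loss(extracted, mixture)
            + (1 - alpha) * location_loss(estimated_tpc, ideal_tpc))
```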
Regarding Claim 11, the combination of Pishehvar, Chattopadhyay, and Wingate teaches all the limitations of Claim 9 above; and the combination further teaches wherein, to compute the location loss functions, the processor is configured to: compute an ideal target phase correlation spectrogram using physical properties of sound propagation for the one or more training audio sources based on location data for each of the one or more training audio sources; and compute an estimated training target phase correlation spectrogram associated with corresponding training audio sources, the one or more estimated training target phase correlation spectrograms being generated based on a correlation between spectral features associated with corresponding separated training audio signals and directional features indicative of the location data of the one or more training audio sources forming the training audio mixture, wherein each time-frequency (TF) bin of the one or more estimated training target phase correlation spectrograms defines a feature that quantifies a match between inter-channel phase differences observed in the spectral features of the corresponding separated training audio signals and corresponding expected phase differences indicative of properties of sound propagation for the corresponding location data of the one or more training audio sources relative to the location of each microphone of the microphone array; and determine a difference between the estimated training target phase correlation spectrogram and the corresponding ideal target phase correlation spectrogram for each of the one or more training audio sources, wherein the difference indicates the location loss functions, as stated above with respect to claims 9 and 10.

Regarding Claim 12, the combination of Pishehvar, Chattopadhyay, and Wingate teaches all the limitations of Claim 9 above; and Pishehvar further teaches wherein, to train the neural network, the processor is further configured to: collect the training audio mixture generated by the one or more training audio sources by moving the microphone array in different locations in proximity to the machine (Pishehvar: Paragraph(s) 0026, 0031 teach(es) the multi-task machine learning model may leverage spatial and directional information provided by the multiple microphones of the microphone array to improve the robustness of the enhanced target speech signal and the target audio parameters).

Regarding Claim 13, the combination of Pishehvar, Chattopadhyay, and Wingate teaches all the limitations of Claim 1 and "phase" above; and Pishehvar further teaches wherein the processor is further configured to: transform the received audio mixture with Fourier transformation to produce a multi-channel short-time Fourier transform (STFT) of the received audio mixture; determine inter-channel [phase] differences (IPDs) between different channels of the multi-channel STFT (Pishehvar: Abstract; Paragraph(s) 0010, 0037-0040); …; and process the channel concatenation of the received audio mixture with a neural network to extract the audio signal (Pishehvar: Paragraph(s) 0054 teach(es) the fusion module may generate a fused feature space using concatenation, linear mixing, nonlinear mixing, or a combination of these techniques).
In addition, the combination of Pishehvar, Chattopadhyay, and Wingate teaches determine target [phase] differences (TPDs) of a sound propagating from the relative location of the identified audio source to different microphones in the microphone array; correlate the IPDs with the TPDs to produce a target [phase] correlation spectrogram, wherein values of the target [phase] correlation spectrogram for different time-frequency bins quantify alignment of the IPDs with the TPDs in the corresponding time-frequency bins, and wherein the TPD is the expected phase difference for the time-frequency bin indicative of properties of sound propagation; combine the target [phase] correlation spectrogram with the multi-channel STFT and frequency position encodings to produce the channel concatenation of the received audio mixture, as stated above with respect to claim 3.

Allowable Subject Matter
Claim 14 is allowed. The prior art of record does not teach the specific steps recited in the claim.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Picco (US 20220179903 A1) teaches a Method/System for Extracting and Aggregating Demographic Features with Their Spatial Distribution from Audio Streams Recorded in a Crowded Environment, including a spectrogram, an encoder, a target, neural networks, correlation, proximity, sound, acoustics, and separating the recorded audio stream signals into individual speaker streams. Lopatka (US 20200213728 A1) teaches Audio-Based Detection and Tracking of Emergency Vehicles, including microphones, and identifying time-frequency bins of the acoustic signal spectra.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CLAY LEE, whose telephone number is (571) 272-3309. The examiner can normally be reached Monday-Friday, 8am-5pm EST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Neha Patel, can be reached at (571) 270-1492. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CLAY C LEE/
Primary Examiner, Art Unit 3699