DETAILED ACTION
This communication is in response to the Amendments and Arguments filed on 9/22/2025.
Claims 1-25 are pending and have been examined.
All previous objections / rejections not mentioned in this Office Action have been withdrawn by the examiner.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendments
Regarding the Applicant’s arguments against the rejections under 35 U.S.C. § 103, Applicant argues that Visser in view of Hlawatsch fails to teach or suggest the presently claimed subject matter. Examiner respectfully disagrees. During patent examination, pending claims must be “given their broadest reasonable interpretation consistent with the specification.” MPEP 2111. Claims also "must particularly point out and distinctly claim the invention." MPEP 2173. Most importantly, claims should not be narrowed by reading limitations of the specification into them, that is, by implicitly adding disclosed limitations that have no express basis in the claim language. In re Prater, 415 F.2d 1393. The prior art references teach the claims as currently recited and broadly interpreted. The term “deep filter” is not specifically defined in the specification, and claims 1, 20, and 21 as currently recited do not specify how the deep filter is utilized to “acquire[s] estimates of respective elements of a desired representation”. Therefore, examiner interprets a deep filter to be a multi-dimensional mask in the time and frequency domain that can be applied to the noisy signal. This interpretation is in line with the claim limitation “deep filter comprises a one or multi-dimensional tensor with elements”. It is also in line with the claim limitation “multi dimensional deep filter is estimated for each tensor element of tensor elements of the mixture”, because a multi-dimensional mask is estimated for each time and frequency bin. Visser P0033; Hlawatsch, Introduction. Claim language is interpreted broadly; the specification is used to interpret claim language, but not to narrow it.
Lastly, examiner agrees with applicant that the use of a complex conjugated 2D filter to obtain a desired signal value for a time and frequency bin, according to the equation in claim 12, is not taught by the prior art references. Currently, however, the claim language recites “deep filter” as a claim term but does not specifically define it or describe how the deep filter is utilized to “acquire[s] estimates of respective elements of a desired representation.” Therefore, the claims as currently recited do not overcome the rejection under 35 U.S.C. § 103.
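The distinction drawn above can be illustrated with a minimal sketch. It assumes a spectrogram stored as a list of frames, each a list of complex bins. `apply_mask` follows the gain relation quoted from Visser P0033, X_est(t,f)=M(t,f)*Y(t,f). `apply_2d_filter` is a hypothetical form of a complex conjugated 2D filter: the equation of claim 12 is not reproduced in this Office Action, so the window shape, index convention, and zero padding are assumptions.

```python
def apply_mask(mask, mixture):
    """Element-wise mask: one gain per time-frequency bin (Visser P0033)."""
    return [[m * y for m, y in zip(m_row, y_row)]
            for m_row, y_row in zip(mask, mixture)]


def apply_2d_filter(filters, mixture, half_t=1, half_f=1):
    """Hypothetical 2D filter: each output bin sums neighboring mixture bins,
    each weighted by the complex conjugate of a per-bin filter coefficient.

    filters[t][f] is a (2*half_t+1) x (2*half_f+1) grid of coefficients.
    Bins outside the spectrogram are treated as zero (an assumption).
    """
    n_t, n_f = len(mixture), len(mixture[0])
    out = [[0j] * n_f for _ in range(n_t)]
    for t in range(n_t):
        for f in range(n_f):
            acc = 0j
            for i in range(-half_t, half_t + 1):
                for j in range(-half_f, half_f + 1):
                    tt, ff = t + i, f + j
                    if 0 <= tt < n_t and 0 <= ff < n_f:
                        h = filters[t][f][i + half_t][j + half_f]
                        acc += h.conjugate() * mixture[tt][ff]
            out[t][f] = acc
    return out
```

Under these assumptions, a 1x1 window (`half_t=0`, `half_f=0`) reduces the 2D filter to the element-wise mask, which is the sense in which the mask interpretation is the broader reading.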
Applicant has introduced new dependent claims 24 and 25. New references have been applied.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-8, 10, 11, 14-22 are rejected under 35 U.S.C. 103 as being unpatentable over Visser et al. (U.S. PG Pub No. 20160284346), hereinafter Visser, in view of “Time-Frequency Formulation, Design, and Implementation of Time-Varying Optimal Filters for Signal Estimation” by Hlawatsch et al., hereinafter Hlawatsch.
Regarding claims 1 and 20, Visser teaches:
(Claim 20) A non-transitory digital storage medium having a computer program stored thereon to perform (P0075, A software module may reside in a non-transitory storage medium.)
(Claim 20) when said computer program is run by a computer. (P0073, Various elements of an implementation of an apparatus as disclosed herein may also be embodied as one or more computers.)
[a] method for determining a deep filter for filtering a mixture of desired and undesired signals, comprising an audio signal or a sensor signal, to extract the desired signal from the mixture of the desired and the undesired signals, the method comprising: (P0004, Methodologies for feature extraction and classification of audio signals using deep neural network based filter prediction.; P0028, Feature extraction may include normalized spectral band energies that can be used to train specific customized DNNs to predict filter gains to be applied to recorded spectrograms in order to separate underlying audio events therein from noise.)
determining the deep filter of at least one-dimension, comprising: (P0018, Extracting features from noisy audio and feeding them to the DNN to predict filters for carving out the underlying audio events from noisy spectrograms.; P0027, The feature extraction in turn may specifically provide spectral bands, delta spectral bands, and onset/offset information.; P0033, Compute filter gain M(t,f) that predicts speech based on known signal-to-noise ratio: X_est(t,f)=M(t,f)*Y(t,f).)
receiving the mixture; (P0018, Audio data is gathered in a target environment under varying conditions.)
estimating using a deep neural network the deep filter, wherein the estimating is performed, such that the deep filter, when applying to elements of the mixture, acquires estimates of respective elements of a desired representation, wherein the deep filter is acquired by defining a filter structure with filter variables for the deep filter of at least one dimension and training the deep neural network, (P0024, One or more DNNs are trained to perform filter gain prediction for target audio event extraction.; P0035, From this example, approximately 43000 frames are used for feature extraction and training, which in turn permits the training of 13 DNNs each with 257 input nodes, two hidden layers of 100 nodes each and one output layer of 17-20 nodes. (The 13 output layer nodes of these 13 DNNs combined in turn constitute predicted filter gain).; P0018, Once trained, these DNNs are used to predict underlying events in noisy audio to extract therefrom features that enable the separation of the underlying audio events from the noisy components thereof.)
wherein the training is performed using a mean-squared error between a ground truth and the desired representation and minimizing the mean-squared error or minimizing an error function between the ground truth and the desired representation; (P0030, The DNN is trained using the resultant training dataset and the outputs of the DNN training are tested against data not contained in the training data set to determine efficacy. A determination is made as to whether the test data outputs are in line with the training data outputs and, if so, the DNN's training is complete and the DNN is ready for operational use. If not, however—i.e., if the test data outputs of the DNN are dissimilar from the training data outputs, thus indicating that the DNN is not yet sufficiently predicting target audio events—then the test data is added to the training data for the DNN, and the DNN returns for further training.; P0032, The DNN training is complete and stopped when the inputs gathered for the DNN's particular use case create outputs that exhibit audio event like features, that is, when noisy speech inputs to the DNN results in speech-like output features form that DNN.; P0048, Decide the suppression gain depending on SNR to meet certain criterion such as minimum mean squared error between clean and estimated speech.)
wherein the multi dimensional deep filter is estimated for each tensor element of tensor elements of the mixture; (P0029, Labels are computed for the DNN output layers based on the predicted filter gains for the targeted audio events. For enhanced effectiveness, these computed labels for the DNN output layers may be based on knowledge of underlying target audio event, noise.; P0035, The 13 output layer nodes of these 13 DNNs combined in turn constitute predicted filter gain.)
wherein the deep filter comprises a one- or multi- dimensional tensor with elements. (P0019, Several implementations are directed to different DNN training architectures, feature vectors and applications for the resulting DNNs once trained.; Figure 3.; P0037, FIG. 3 is a progression diagram illustrating neural net based filter gain prediction in three phases (from feature extraction through training to prediction) for a DNN representative of various implementations herein disclosed. In the figure, the extracted features progress through feature extraction and training to reach the prediction phase and the resultant predicted filters.; P0027, The feature extraction in turn may specifically provide spectral bands, delta spectral bands, and onset/offset information.)
Visser does not specifically teach:
wherein the multi dimensional deep filter is estimated for each tensor element of tensor elements of the mixture;
Hlawatsch, however, teaches:
wherein the multi dimensional deep filter is estimated for each tensor element of tensor elements of the mixture; (1. Introduction, Time-varying Wiener filter.; A. Time-Varying Wiener Filter [Multi-dimensional deep filter is predicted in the time and frequency dimensions.]; Figure 2.)
Visser and Hlawatsch are analogous art because they are from a similar field of endeavor, filtering a signal to isolate speech. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to utilize a multi-dimensional deep filter. It would have been obvious to combine the references because time-frequency filters are optimal in nonstationary environments, in contrast to a time-invariant linear estimator. (Hlawatsch, Abstract)
Regarding claim 2, Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
wherein the mixture comprises a real- or complex- valued time-frequency presentation or a feature representation of it; and (P0024, From this collected audio, at feature extraction is performed to extract and label features.; P0028, Feature extraction may include normalized spectral band energies that can be used to train specific customized DNNs to predict filter gains to be applied to recorded spectrograms in order to separate underlying audio events therein from noise.; P0034, For feature and frame labeling—to continue the example—one might then extract features as (log(|Y(t,f)|)−mean(log(|Y(t,:)|))/max(|Y(:,f)|) bounded in [0-1] and use M(t,f) bound in [0-1] as labels for each neural net output layer mode. The labels are then enforced in supervised neural net learning.)
wherein the desired representation comprises a desired real- or complex-valued time-frequency presentation or a feature representation of it. (P0024, From this collected audio, at feature extraction is performed to extract and label features.; P0028, Feature extraction may include normalized spectral band energies that can be used to train specific customized DNNs to predict filter gains to be applied to recorded spectrograms in order to separate underlying audio events therein from noise.; P0034, For feature and frame labeling—to continue the example—one might then extract features as (log(|Y(t,f)|)−mean(log(|Y(t,:)|))/max(|Y(:,f)|) bounded in [0-1] and use M(t,f) bound in [0-1] as labels for each neural net output layer mode. The labels are then enforced in supervised neural net learning.)
Regarding claim 3, Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
wherein the deep filter comprises a real- or complex-valued time-frequency filter; (P0033, Next, compute the noise power N(t,f) and speech power X(t,f) for each frame t/band f, and then compute filter gain M(t,f) that predicts speech based on known signal-to-noise ratio: X_est(t,f)=M(t,f)*Y(t,f).; P0034, For feature and frame labeling—to continue the example—one might then extract features as (log(|Y(t,f)|)−mean(log(|Y(t,:)|))/max(|Y(:,f)|) bounded in [0-1] and use M(t,f) bound in [0-1] as labels for each neural net output layer mode. The labels are then enforced in supervised neural net learning.)
and/or wherein the deep filter of at least one dimension is described in the short-time Fourier transform domain. (P0033, Next, compute the noise power N(t,f) and speech power X(t,f) for each frame t/band f, and then compute filter gain M(t,f) that predicts speech based on known signal-to-noise ratio: X_est(t,f)=M(t,f)*Y(t,f).; P0028, DNNs to predict filter gains to be applied to recorded spectrograms in order to separate underlying audio events therein from noise.)
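The gain relation cited from Visser P0033 can be sketched as follows. Visser's exact gain rule is not quoted in this action; the Wiener-type form M = X/(X+N) is an assumption, chosen because it depends only on the signal-to-noise ratio and always lies in [0, 1]. The function names are hypothetical.

```python
def wiener_gain(speech_power, noise_power):
    """Assumed Wiener-type gain per bin: M = X / (X + N), always in [0, 1]."""
    total = speech_power + noise_power
    return speech_power / total if total > 0 else 0.0


def estimate_speech(speech_pow, noise_pow, mixture):
    """Apply X_est(t,f) = M(t,f) * Y(t,f) for every frame t and band f."""
    return [[wiener_gain(x, n) * y for x, n, y in zip(x_row, n_row, y_row)]
            for x_row, n_row, y_row in zip(speech_pow, noise_pow, mixture)]
```

Each output bin depends on exactly one mixture bin, so the gain acts as a time-frequency mask in the short-time Fourier transform domain.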
Regarding claim 4, Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
wherein the step of estimating is performed for each element of the mixture or for a predetermined portion of the elements of the mixture. (P0038, For the prediction phase, the trained DNN can then be used to predict filter gain and thus determine the underlying speech for each frame or combination of frames.)
Regarding claim 5 Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
wherein the estimating is performed for at least two sources. (P0030, The DNN is trained using the resultant training dataset and the outputs of the DNN training are tested against data not contained in the training data set to determine efficacy.)
Regarding claim 6 Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
wherein the deep filter is multi-dimensional complex deep filter. (P0034, For feature and frame labeling—to continue the example—one might then extract features as (log(|Y(t,f)|)−mean(log(|Y(t,:)|))/max(|Y(:,f)|) bounded in [0-1] and use M(t,f) bound in [0-1] as labels for each neural net output layer mode. The labels are then enforced in supervised neural net learning.; P0045, Furthermore, more sensor information, such as spectral features, direction of arrival (DoA), accelerometer information, etc., may be added to augment the feature space (i.e., feature vector=spectral feature+DoA+sensor position). Additionally, feature vectors may be mapped to desired filter gain and labeled with specific audio event DNNs to provide enhanced audio event classification via use of specific/dedicated DNNs.; P0033, Compute filter gain M(t,f) that predicts speech based on known signal-to-noise ratio: X_est(t,f)=M(t,f)*Y(t,f).)
Regarding claim 7 Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
wherein the deep neural network comprises a number of output parameters equal to the number of filter values of a filter function of the deep filter. (P0035, From this example, approximately 43000 frames are used for feature extraction and training, which in turn permits the training of 13 DNNs each with 257 input nodes, two hidden layers of 100 nodes each and one output layer of 17-20 nodes. (The 13 output layer nodes of these 13 DNNs combined in turn constitute predicted filter gain).)
Regarding claim 8 Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
wherein the at least one dimension are out of a group comprising time, frequency and sensor, or wherein the at least one of the dimensions is across time or frequency. (P0033, Compute filter gain M(t,f) that predicts speech based on known signal-to-noise ratio: X_est(t,f)=M(t,f)*Y(t,f).; P0034, For feature and frame labeling—to continue the example—one might then extract features as (log(|Y(t,f)|)−mean(log(|Y(t,:)|))/max(|Y(:,f)|) bounded in [0-1] and use M(t,f) bound in [0-1] as labels for each neural net output layer mode. The labels are then enforced in supervised neural net learning.)
Regarding claim 10 Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
further comprising training the deep neural network. (P0004, Training a deep neural network to perform filter gain prediction using the extracted features.)
Regarding claim 11, Visser in view of Hlawatsch teaches claim 10.
Visser further teaches:
wherein the deep neural network is trained by optimizing of the mean squared error between a ground truth of the desired representation and an estimate of the desired representation; or (P0030, The DNN is trained using the resultant training dataset and the outputs of the DNN training are tested against data not contained in the training data set to determine efficacy. A determination is made as to whether the test data outputs are in line with the training data outputs and, if so, the DNN's training is complete and the DNN is ready for operational use. If not, however—i.e., if the test data outputs of the DNN are dissimilar from the training data outputs, thus indicating that the DNN is not yet sufficiently predicting target audio events—then the test data is added to the training data for the DNN, and the DNN returns for further training.; P0032, The DNN training is complete and stopped when the inputs gathered for the DNN's particular use case create outputs that exhibit audio event like features, that is, when noisy speech inputs to the DNN results in speech-like output features form that DNN.; P0048, Decide the suppression gain depending on SNR to meet certain criterion such as minimum mean squared error between clean and estimated speech.)
wherein the deep neural network is trained by reducing the reconstruction error between the desired representation and an estimate of the desired representation; or (P0025, The extracted audio event is refined and reconstructed by fitting basis functions.; P0047, For vector quantization, the nearest candidates can be found based on spectral distance measure between input spectrum and dictionary, and the weight of each basis function can be computed based on inverse of distance. Then more than one basis function can be selected for smooth reconstruction, and the reconstructed extracted audio event can be based on selected basis functions and associated weights.; P0047, Activation coefficients for NMF can be obtained by minimizing the distance between input signal and estimated signal iteratively.; P0048, Decide the suppression gain depending on SNR to meet certain criterion such as minimum mean squared error between clean and estimated speech.)
wherein the training is performed by a magnitude reconstruction. (P0047, Then more than one basis function can be selected for smooth reconstruction, and the reconstructed extracted audio event can be based on selected basis functions and associated weights.)
Regarding claim 14 Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
wherein the tensor elements of the deep filter are bounded in magnitude or bounded in magnitude by use of the following formula:
[Formula image: media_image1.png]
(P0034, Use M(t,f) bound in [0-1] as labels for each neural net output layer mode. [Each TF bin magnitude is bounded.])
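One way to keep a filter element bounded in magnitude is sketched below. The claimed formula appears in this action only as an image and is not reproduced, so the rule used here (rescale any element whose magnitude exceeds a limit) is an illustrative assumption and not the claimed formula.

```python
def bound_magnitude(h, limit=1.0):
    """Rescale element h so that |h| <= limit; the phase is preserved."""
    mag = abs(h)
    return h if mag <= limit else h * (limit / mag)
```

A real-valued gain already bounded in [0-1], as in the cited P0034 labels, passes through this function unchanged.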
Regarding claim 15 Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
wherein the step of applying is performed element-wise. (P0005, The filter gain prediction may comprise: applying a predicted filter to an input spectrogram to yield a first output.; P0035, The 13 output layer nodes of these 13 DNNs combined in turn constitute predicted filter gain.; P0033, Compute filter gain M(t,f) that predicts speech based on known signal-to-noise ratio: X_est(t,f)=M(t,f)*Y(t,f). [Estimated signal is acquired by multiplying predicted filter gain with input signal for time and frequency.])
Regarding claim 16 Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
wherein the applying is performed by summing up to acquire an estimate of the desired representation in a respective tensor element. (P0038, DBN is folded onto the DNN and an output layer is added. The trained DNN can then be used to predict filter gain and thus determine the underlying speech for each frame or combination of frames.)
Regarding claim 17 Visser in view of Hlawatsch teaches claim 1.
Visser further teaches:
a method for filtering the mixture of desired and undesired signals comprising an audio signal or sensor signal, to extract the desired signal from the mixture of the desired and the undesired signals, the method comprising: (P0018, Once trained, these DNNs are used to predict underlying events in noisy audio to extract therefrom features that enable the separation of the underlying audio events from the noisy components thereof. Stated differently, underlying events can be predicted by extracting features from noisy audio and feeding them to the DNN to predict filters for carving out the underlying audio events from noisy spectrograms.)
applying the deep filter to the mixture. (P0005, The filter gain prediction may comprise: applying a predicted filter to an input spectrogram to yield … output.)
Regarding claim 18 Visser in view of Hlawatsch teaches claim 17.
Visser further teaches:
The use of the method according to claim 17 for signal extraction or for signal separation of at least two sources. (P0045, The methodology may be used to predict voiced and unvoiced speech activity and envelope, and multi-speaker training results in filters focusing on voice activity detecting and extracting the speech envelope.)
Regarding claim 19 Visser in view of Hlawatsch teaches claim 17.
Visser further teaches:
The use of the method according to claim 17 for signal reconstruction. (P0025, The extracted audio event is refined and reconstructed by fitting basis functions using a dictionary approach.)
Regarding claim 21 Visser teaches:
An apparatus for determining a deep filter enabling to extract a desired signal from a mixture of desired and undesired signals, the apparatus comprising (P0072, An apparatus.; P0004, Methodologies for feature extraction and classification of audio signals using deep neural network based filter prediction.; P0018, DNNs are used to predict underlying events in noisy audio to extract therefrom features that enable the separation of the underlying audio events from the noisy components thereof. Stated differently, underlying events can be predicted by extracting features from noisy audio and feeding them to the DNN to predict filters for carving out the underlying audio events from noisy spectrograms.)
an input for receiving the mixture of the desired and the undesired signals or comprising at least undesired signals comprising an audio signal or a sensor signal; (P0018, Audio data is gathered in a target environment under varying conditions.; P0023, Audio event data may comprise both speech components and noise components.)
a deep filter for estimating the deep filter such that the deep filter, when applying to elements of the mixture, acquires estimates of respective elements of a desired representation; (P0024, One or more DNNs are trained to perform filter gain prediction for target audio event extraction.; P0035, From this example, approximately 43000 frames are used for feature extraction and training, which in turn permits the training of 13 DNNs each with 257 input nodes, two hidden layers of 100 nodes each and one output layer of 17-20 nodes. (The 13 output layer nodes of these 13 DNNs combined in turn constitute predicted filter gain).; P0018, Once trained, these DNNs are used to predict underlying events in noisy audio to extract therefrom features that enable the separation of the underlying audio events from the noisy components thereof.)
wherein the deep neural network is acquired by defining a filter structure with filter variables for the deep filter of at least one dimension and training the deep neural network, wherein the training is performed using the mean-squared error between a ground truth and the desired representation and minimizing the mean-squared error or minimizing an error function between the ground truth and the desired representation; (P0033, Compute filter gain M(t,f) that predicts speech based on known signal-to-noise ratio: X_est(t,f)=M(t,f)*Y(t,f).; P0030, The DNN is trained using the resultant training dataset and the outputs of the DNN training are tested against data not contained in the training data set to determine efficacy. A determination is made as to whether the test data outputs are in line with the training data outputs and, if so, the DNN's training is complete and the DNN is ready for operational use. If not, however—i.e., if the test data outputs of the DNN are dissimilar from the training data outputs, thus indicating that the DNN is not yet sufficiently predicting target audio events—then the test data is added to the training data for the DNN, and the DNN returns for further training.; P0032, The DNN training is complete and stopped when the inputs gathered for the DNN's particular use case create outputs that exhibit audio event like features, that is, when noisy speech inputs to the DNN results in speech-like output features form that DNN.; P0048, Decide the suppression gain depending on SNR to meet certain criterion such as minimum mean squared error between clean and estimated speech.)
wherein the multi dimensional deep filter is estimated for each tensor element of tensor elements of the mixture; (P0029, Labels are computed for the DNN output layers based on the predicted filter gains for the targeted audio events. For enhanced effectiveness, these computed labels for the DNN output layers may be based on knowledge of underlying target audio event, noise.; P0035, The 13 output layer nodes of these 13 DNNs combined in turn constitute predicted filter gain.)
wherein the deep filter is of at least one dimension comprising a one- or multi-dimensional tensor with elements. (P0019, Several implementations are directed to different DNN training architectures, feature vectors and applications for the resulting DNNs once trained.; Figure 3.; P0037, FIG. 3 is a progression diagram illustrating neural net based filter gain prediction in three phases (from feature extraction through training to prediction) for a DNN representative of various implementations herein disclosed. In the figure, the extracted features progress through feature extraction and training to reach the prediction phase and the resultant predicted filters.; P0027, The feature extraction in turn may specifically provide spectral bands, delta spectral bands, and onset/offset information.)
Visser does not specifically teach:
wherein the multi dimensional deep filter is estimated for each tensor element of tensor elements of the mixture;
Hlawatsch, however, teaches:
wherein the multi dimensional deep filter is estimated for each tensor element of tensor elements of the mixture; (1. Introduction, Time-varying Wiener filter.; A. Time-Varying Wiener Filter [Multi-dimensional deep filter is predicted in the time and frequency dimensions.]; Figure 2.)
Visser and Hlawatsch are analogous art because they are from a similar field of endeavor, filtering a signal to isolate speech. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to utilize a multi-dimensional deep filter. It would have been obvious to combine the references because time-frequency filters are optimal in nonstationary environments, in contrast to a time-invariant linear estimator. (Hlawatsch, Abstract)
Regarding claim 22, Visser in view of Hlawatsch teaches claim 21, and Visser further teaches:
An apparatus filtering a mixture, the apparatus comprising the apparatus of claim 21 and the deep filter as determined and a unit for applying the deep filter to the mixture. (Figure 9, Electronic Device.; P0005, The filter gain prediction may comprise: applying a predicted filter to an input spectrogram to yield a first output.)
Claims 9 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Visser in view of Hlawatsch and further in view of Mesgarani et al. (U.S. PG Pub No. 20190066713), hereinafter Mesgarani.
Regarding claim 9, Visser in view of Hlawatsch does not teach:
wherein the deep neural network comprises a batch-normalization layer, a bidirectional long short-term memory layer, a feed-forward output layer with a tanh activation and/or one or more additional layer.
Mesgarani, however, teaches:
wherein the deep neural network comprises a batch-normalization layer, a bidirectional long short-term memory layer, a feed-forward output layer with a tanh activation and/or one or more additional layer. (P0190, Three different normalization schemes may be used, namely: channel-wise layer normalization (cLN), global layer normalization (gLN), and batch normalization (BN).; P0149, The network contained four (4) bi-directional LSTM layers. The embedding dimension was set to 20, resulting in a fully-connected feed-forward layer of 2580 hidden units (20×129) after the BLSTM layers.; P0138, Finding the similarity of each T-F bin in the embedding space to each of the attractor vectors A, where the similarity metric is defined in Equation 8. This particular metric uses the inner product followed by a sigmoid function which monotonically scales the masks between [0, 1].; P0158, Sigmoid activation function. [It is well established in the art that tanh is a typical and popular sigmoid function to use as activation function.])
Visser and Mesgarani are analogous art because they are from a similar field of endeavor, filtering a signal to isolate the speech of a person utilizing a filter derived by training a deep neural network. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the use of a deep neural network as taught by Visser with the use of specific neural network features as taught in Mesgarani. It would have been obvious to combine the references because a configured deep neural network can identify individual sound sources from multiple sound sources based on the combined sound signal. (Mesgarani P0080)
Regarding claim 13, Visser in view of Hlawatsch teaches claim 10.
Visser in view of Hlawatsch does not specifically teach:
wherein the training is performed by use of the following formula:
[Formula image: media_image2.png]
representation and Xd(n, k) the estimated desired representation where N is the total number of time-frames and K the number of frequency bins per time-frame, where n is the time-frame and k is the frequency index, or by use of the following formula:
[Formula image: media_image3.png]
where Xd(n,k) is the desired representation and Xd(n, k) is the estimated desired representation, where N is the total number of time-frames and K the number of frequency bins per time-frame, where n is the time-frame and k is the frequency index.
Mesgarani, however, teaches:
wherein the training is performed by use of the following formula:
[Formula image: media_image2.png]
representation and Xd(n, k) the estimated desired representation where N is the total number of time-frames and K the number of frequency bins per time-frame, where n is the time-frame and k is the frequency index, or by use of the following formula:
[Formula image: media_image3.png]
where Xd(n,k) is the desired representation and Xd(n, k) is the estimated desired representation, where N is the total number of time-frames and K the number of frequency bins per time-frame, where n is the time-frame and k is the frequency index. (P0135, Σ_{f,t,c} ‖S_{f,t,c} − X_{f,t} × M_{f,t,c}‖_2^2 [Equation 7] where S is the clean spectrogram (frequency F×time T) of C sources, X is the mixture spectrogram (frequency F×time T), and M is the mask formed to extract each source.; P0139, A standard L2 reconstruction error is used to generate the gradient, as shown in Equation 7. Therefore, the error for each source reflects the difference between the masked signal and the clean reference, forcing the network to optimize the global reconstruction error for better separation.)
Visser and Mesgarani are analogous art because they are from a similar field of endeavor: filtering a signal to isolate a person's speech using a filter derived by training a deep neural network. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to utilize a mean squared error function to calculate the error, or loss, between the expected signal and the predicted signal. Visser teaches the use of a mean squared error function to minimize the error during training. (Visser P0048) Mesgarani further teaches the equation used for the mean squared error function in the time and frequency domain. (Mesgarani P0135) It would have been obvious to combine the references because using a mean squared error function to calculate the reconstruction error of a signal when training a neural network is applying a known technique to a method to yield a predictable result. The known technique is the mean squared error function, and the predictable result is optimizing the reconstruction error. (Mesgarani P0139)
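For illustration only, the L2 reconstruction error quoted from Mesgarani P0135 (Equation 7) can be sketched as follows. The function name, the nested-list layout (source index first, then frequency, then time), and the sample values are assumptions made for this sketch and do not appear in any cited reference; the sketch is not the formula of claim 13, which is reproduced only as an image.

```python
def l2_reconstruction_error(S, X, M):
    """Sum over c, f, t of |S[c][f][t] - X[f][t] * M[c][f][t]|^2.

    S: clean spectrograms, indexed [source][frequency][time] (assumed layout)
    X: mixture spectrogram, indexed [frequency][time]
    M: masks, indexed [source][frequency][time]
    """
    total = 0.0
    for c in range(len(S)):
        for f in range(len(X)):
            for t in range(len(X[f])):
                # Difference between the clean reference and the masked mixture.
                diff = S[c][f][t] - X[f][t] * M[c][f][t]
                total += abs(diff) ** 2
    return total


# Hypothetical data: one frequency bin, two time frames, two sources.
X = [[2.0, 4.0]]
M = [[[0.5, 0.25]], [[0.5, 0.75]]]
S = [[[1.0, 1.0]], [[1.0, 2.0]]]
error = l2_reconstruction_error(S, X, M)
```

In this hypothetical example the first source is reconstructed exactly and the second source is off by 1 in a single bin, so the summed squared error is 1.0.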
Claims 24 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Visser in view of Hlawatsch and further in view of “Learning Spectral Mapping for Speech Dereverberation and Denoising” by Han et al., hereinafter Han.
Regarding claims 24 and 25, Visser in view of Hlawatsch teaches claims 1 and 20.
Visser in view of Hlawatsch does not specifically teach:
wherein the estimating is performed such that, for each tensor element of the mixture, the deep filter generates the estimate of the desired representation by combining weighted contributions from a plurality of neighboring tensor elements across at least one of the dimensions.
Han, however, teaches:
wherein the estimating is performed such that, for each tensor element of the mixture, the deep filter generates the estimate of the desired representation by combining weighted contributions from a plurality of neighboring tensor elements across at least one of the dimensions. (A. Spectral Features, In order to incorporate temporal dynamics, we include the spectral features of neighboring frames into a feature vector.; A. Metrics and Parameters, We utilize context information using a concatenation of features from 5 frames on each side of the current frame. Temporal information is an important property for speech signals, and thus adding these neighboring frames should be helpful to learn a spectral mapping.; Fig. 1, Neighboring frames are input into a DNN to output the frequency bins for a single frame. The DNN has weights that combine weighted contributions from the neighboring frames. B. DNN Based Spectral Mapping.)
Visser and Han are analogous art because they are from a similar field of endeavor: filtering a signal to reduce noise. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to utilize neighboring frames by combining weighted contributions of the frames. It would have been obvious to combine the references because spectral features of neighboring frames are utilized to incorporate temporal dynamics, and temporal dynamics provide rich information about speech when denoising speech. (Han, Section V. Discussion and Conclusion)
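For illustration only, combining weighted contributions from neighboring frames can be sketched as a fixed weighted sum along the time dimension. This is a deliberately simplified linear stand-in: Han feeds concatenated neighboring frames into a DNN, whereas the sketch below uses a single hypothetical weight vector. The function name, the zero-padding at the edges, and the sample values are assumptions and are not taken from any cited reference.

```python
def combine_neighbors(X, weights):
    """Estimate each time-frequency element as a weighted sum of neighboring frames.

    X: list of N frames, each a list of K frequency bins
    weights: odd-length list of 2L+1 weights, centered on the current frame
    Frames outside the signal are treated as zero (assumed edge handling).
    """
    L = len(weights) // 2
    N = len(X)
    K = len(X[0])
    out = []
    for n in range(N):
        frame = []
        for k in range(K):
            acc = 0.0
            for i, w in enumerate(weights):
                m = n + i - L  # index of the neighboring frame
                if 0 <= m < N:
                    acc += w * X[m][k]
            frame.append(acc)
        out.append(frame)
    return out


# Hypothetical data: three frames, one frequency bin, one neighbor on each side.
estimate = combine_neighbors([[1.0], [2.0], [4.0]], [0.25, 0.5, 0.25])
```

With these hypothetical weights the middle frame becomes 0.25·1 + 0.5·2 + 0.25·4 = 2.25, so its estimate draws on both of its neighbors rather than on the current frame alone.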
Allowable Subject Matter
Claim 12 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. More specifically, none of the prior art, either alone or in combination, teaches or makes obvious the specifics of the equation that is provided in claim 12.
Claim 23 is allowed. Claim 23 includes the limitations of claim 1 and claim 12. More specifically, none of the prior art, either alone or in combination, teaches or makes obvious the specifics of the equation that is provided in claim 12.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Stein (U.S. PG Pub No. 20160314800): Computationally efficient method for filtering noise
Stein teaches two-dimensional filtering, convolution in time and frequency space. (P0005, One way to address the errors and/or local processing that produces a mask is to perform a smoothing of the mask, for example, by two-dimensional filtering (e.g., convolution in time-frequency space). … The MRF is characterized by conditional distributions of the mask value at one time-frequency location based on the mask values at neighboring locations, for example, according to the four or eight nearest neighbors in the time-frequency mask.)
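For illustration only, the mask smoothing by two-dimensional filtering described in Stein P0005 can be sketched as an average over each mask value and its four nearest time-frequency neighbors. The function name, the equal weighting, and the handling of edge locations (averaging only over in-bounds neighbors) are assumptions made for this sketch; Stein's Markov random field formulation is not reproduced here.

```python
def smooth_mask(mask):
    """Average each mask value with its in-bounds four nearest neighbors.

    mask: 2-D list of mask values (rows and columns are the two
    time-frequency axes; which axis is time is an assumption left open here).
    """
    rows, cols = len(mask), len(mask[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            vals = [mask[r][c]]
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    vals.append(mask[rr][cc])
            out[r][c] = sum(vals) / len(vals)
    return out


# Hypothetical mask with a single isolated value in the center.
smoothed = smooth_mask([[0.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 0.0]])
```

In this hypothetical example the isolated center value is spread to its four neighbors while the corners, which do not touch the center, stay at zero.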
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL WONSUK CHUNG whose telephone number is (571)272-1345. The examiner can normally be reached Monday - Friday (7am-4pm)[PT].
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, PIERRE-LOUIS DESIR, can be reached at (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DANIEL W CHUNG/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659