Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
2. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
3. Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Tan et al., “A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement” (hereinafter Tan), in view of Kim et al., “Multi-Domain Processing via Hybrid Denoising Networks for Speech Enhancement” (hereinafter Kim).
Regarding Claim 1:
Tan discloses a computer-implemented method comprising:
receiving audio data including a known noisy acoustic signal, the known noisy acoustic signal including a known clean acoustic signal and at least one known additive noise (Tan: Sections 2.3, 3.1 discloses constructing noisy audio which is clean audio plus additive noise, and uses the noisy mixture as input and uses the clean speech as known ground truth during supervised training. This is exactly the claimed “known noisy acoustic signal including known clean and known additive noise”);
transforming the audio data into frequency-domain data (Tan: Section 2.3 discloses that it transforms the time-domain waveform into frequency domain STFT data);
and training a convolutional neural network (Tan: Section 1 discloses using a convolutional network trained directly on STFT frequency-domain data),
the convolutional neural network outputs (Tan: Section 2.1 discloses outputting the clean magnitude directly).
Tan does not explicitly disclose including at least one gated linear unit (GLU) [in its convolutional neural network];
the convolutional neural network outputting a frequency multiplicative mask that is to be multiplied with the frequency-domain data to estimate the known clean signal.
However, Kim discloses including at least one gated linear unit (GLU) (Kim: “Model Architecture” discloses what is commonly understood in the art as a GLU: a mechanism that operates on an input by performing two independent linear transformations, specifically two pieces of convolutional output, which it combines with a gating mechanism, typically applying a sigmoid to one of the two. The general form is GLU(x) = A(x) ⊗ σ(B(x)). Kim does exactly this under “B Model Architecture,” where it discloses that the operation ⊗ means element-wise multiplication and that the layer preceding this operation uses a sigmoid activation function);
The convolutional neural network outputting a frequency multiplicative mask that is to be multiplied with the frequency-domain data to estimate the known clean signal (Kim: Section 2.2 discloses that the CNN outputs a ratio mask, which is multiplied with the frequency bins of the STFT to produce estimates of the clean audio signal; see also Fig. 3).
Tan and Kim are combinable because they are in the same field of endeavor: supervised speech enhancement using convolutional neural networks. Tan discloses a network architecture that takes frequency-domain data as input and is trained on known ground-truth clean and noisy signals in order to denoise the original signal. Kim discloses a GLU block designed for spectrogram-based CNN speech enhancement, using spectrograms/frequency data. Replacing a standard convolutional block with a gated convolutional block, and applying its output as a mask, would be a simple substitution within Tan’s architecture. In fact, Tan notes previous applications that take this approach in its own introduction; it merely does not use this specific approach in its own model architecture. The suggestion/motivation for applying Kim’s approach is that “the spectrogram approach (U-Net) successfully removes high frequency noise,” as disclosed in Section 3.2 of Kim.
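As an illustrative aside (not part of either reference), the GLU operation discussed above, GLU(x) = A(x) ⊗ σ(B(x)), can be sketched in a few lines of numpy; this sketch uses hypothetical dense linear transforms in place of Kim’s convolutions, and the weight names are assumptions for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W_a, W_b):
    # GLU(x) = A(x) * sigmoid(B(x)): two independent linear
    # transforms of the same input, one gating the other.
    a = x @ W_a            # linear path A(x)
    b = x @ W_b            # gate path B(x)
    return a * sigmoid(b)  # element-wise product (the "⊗" gating)

# Toy input: a vector of 4 frequency-bin features.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W_a = rng.standard_normal((4, 4))
W_b = rng.standard_normal((4, 4))
out = glu(x, W_a, W_b)
```

Because the sigmoid output lies in (0, 1), each element of the gate path partially or completely blocks the corresponding element of the linear path, consistent with the claimed gating block.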
Regarding Claim 2:
The combination of Tan and Kim further discloses the computer-implemented method of claim 1, further comprising: constructing the convolutional neural network, including a plurality of neurons, the plurality of neurons arranged in a plurality of layers including at least one hidden layer, the plurality of layers including a layer including the GLU component, and the plurality of neurons being connected by a plurality of connections (Tan: Section 2.1 discloses an encoder with five convolutional layers and a decoder with five deconvolutional layers, and also shows hidden layers, e.g., two stacked LSTM layers (Section 2.2); Kim discloses a U-Net block where the preceding layer uses a sigmoid and ⊗ means element-wise multiplication. Examiner Interpretation {Tan discloses the layered CNN structure with neurons and connections; Kim provides the GLU layer embedded within a CNN layer}).
Tan and Kim are combinable for the same reasons set forth above with respect to claim 1.
Regarding Claim 3:
The combination of Tan and Kim further discloses the computer-implemented method of claim 2 wherein the layer including the GLU component includes a convolutional block that is configured to calculate a first convolutional output and a second convolutional output, the first convolutional output and the second convolutional output calculated based on the frequency-domain data, and a gating block that uses the first convolutional output to partially or completely block the second convolutional output (Kim: “B Model Architecture” discloses two convolutional paths, one producing a linear output and one producing a sigmoid gate, which are combined in the GLU calculation, both clearly derived from the frequency data).
Regarding Claim 4:
The combination of Tan and Kim further discloses the computer-implemented method of claim 3 wherein a logistic function, including a sigmoid function, receives the first convolutional output and outputs a weight, and the gating block performs an element-wise multiplication with the second convolutional output and the weight (Kim: “B Model Architecture” discloses two convolutional paths, one producing a linear output and one producing a sigmoid gate, which are combined in the GLU calculation).
Regarding Claim 5:
The combination of Tan and Kim further discloses the computer-implemented method of claim 3 wherein the convolutional block is configured to zero-pad at least a portion of the frequency-domain data (Tan: Section 2.3 encoder-decoder uses padding on convolution operations).
Regarding Claim 6:
The combination of Tan and Kim further discloses the computer-implemented method of claim 2 wherein the at least one hidden layer of the convolutional neural network includes at least one long short-term memory layer (Tan: Section 2.2 discloses that there are stacked LSTM layers).
Regarding Claim 7:
The combination of Tan and Kim further discloses the computer-implemented method wherein a first layer of the plurality of layers is configured to encode frequencies in the frequency-domain data into a lower-dimension feature space, and a second layer of the plurality of layers is configured to decode the feature space to a higher dimension and output the frequency multiplicative mask (Tan: Section 2.1 discloses reducing the frequency dimension, “we halve the frequency dimension size”; Kim: Section 2.2 discloses learning an ideal ratio mask by multiplying the estimated mask with the noisy spectrogram, i.e., decoding into the higher dimension and outputting the multiplicative mask).
Regarding Claim 8:
The combination of Tan and Kim further discloses the computer-implemented method of claim 1, further comprising providing the trained convolutional neural network to a wearable or portable audio device wherein the audio device is capable of receiving real-time audio data, transforming the real-time audio data into real-time frequency-domain data, outputting a real-time frequency multiplicative mask using the trained convolutional neural network and the real-time audio data, and applying the real-time frequency multiplicative mask to the real-time frequency-domain data (Tan: Abstract and Introduction disclose hearing aids and similar devices, which are portable audio devices, and repeatedly cite them as a use case for real-time audio processing).
Regarding Claim 9:
The combination of Tan and Kim further discloses the computer-implemented method of claim 1 wherein the audio data includes a plurality of frames, wherein the transforming the audio data into frequency-domain data further includes calculating spectral features for a plurality of frequency bins based on the plurality of frames (Tan: Section 2.3 discloses taking STFT frames as input, wherein the frequency bins are always computed at specific time snapshots and there are a plurality of such snapshots, i.e., it calculates spectral features).
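As an illustrative aside (not drawn from Tan), the claimed framing and per-bin spectral-feature computation can be sketched with a minimal numpy STFT; the frame length, hop size, and Hann window here are hypothetical choices for illustration:

```python
import numpy as np

def stft_magnitudes(signal, frame_len=256, hop=128):
    # Split the waveform into overlapping windowed frames, then
    # compute magnitude spectra: one spectral feature per frequency
    # bin, for each frame (time snapshot).
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)  # frequency bins per frame
    return np.abs(spectra)                 # magnitude spectral features

# Toy waveform: a 440 Hz tone sampled at 8 kHz for one second.
t = np.linspace(0, 1, 8000, endpoint=False)
sig = np.sin(2 * np.pi * 440 * t)
feats = stft_magnitudes(sig)  # shape: (n_frames, frame_len // 2 + 1)
```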
Regarding Claim 10:
The combination of Tan and Kim further discloses the computer-implemented method according to claim 1, further comprising receiving a test data set, the test data set including audio data with unseen noise, and evaluating the trained convolutional neural network using the received test data set (Tan: Section 3.1 discloses untrained/unseen noises which is used for testing).
Regarding Claim 11:
The combination of Tan and Kim further discloses the computer-implemented method of claim 1 wherein the frequency multiplicative mask is at least one of a complex ratio mask or an ideal ratio mask (Kim: Section 2.2, as previously explained, aims to learn an ideal ratio mask).
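As an illustrative aside (not drawn from Kim), applying a multiplicative ratio mask amounts to an element-wise product between the mask and the noisy spectrogram; the toy values below are hypothetical:

```python
import numpy as np

# Noisy magnitude spectrogram: 2 frames x 2 frequency bins (toy values).
noisy_mag = np.array([[2.0, 4.0],
                      [6.0, 8.0]])
# Ratio mask with values in [0, 1]: 1 keeps a bin, 0 suppresses it.
mask = np.array([[0.5, 0.25],
                 [1.0, 0.0]])
# Element-wise multiplication yields the estimated clean magnitudes.
clean_est = mask * noisy_mag  # [[1. 1.] [6. 0.]]
```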
Regarding Claim 12:
The combination of Tan and Kim further discloses the computer-implemented method of claim 1 wherein the audio data is synthetic audio data with a known noisy acoustic signal and at least one of a known clean acoustic signal or a known additive noise (Tan: Section 2.3 discloses constructing artificial mixtures (synthetic noisy data) where clean speech and noise are known).
Regarding Claim 13:
The combination of Tan and Kim further discloses the computer-implemented method of claim 1 wherein the known noisy acoustic signal is a known noisy speech signal and the known clean acoustic signal is a known clean speech signal (Tan: Section 2.3 explicitly trains on known noisy signal and known clean speech with the goal of clean speech as the training target).
Regarding Claim 14:
Claim 14 has been analyzed with regard to claim 1 (see rejection above) and
is rejected for the same reasons of obviousness used above.
It is noted that Tan discloses experimental computer implemented tests of the trained model that necessarily include a processing device with memory to store instructions at least at Section 3.1.
Regarding Claim 15:
Claim 15 has been analyzed with regard to claim 2 (see rejection above) and
is rejected for the same reasons of obviousness used above.
Regarding Claim 16:
Claim 16 has been analyzed with regard to claim 3 (see rejection above) and
is rejected for the same reasons of obviousness used above.
Regarding Claim 17:
Claim 17 has been analyzed with regard to claim 4 (see rejection above) and
is rejected for the same reasons of obviousness used above.
Regarding Claim 18:
Claim 18 has been analyzed with regard to claim 6 (see rejection above) and
is rejected for the same reasons of obviousness used above.
Regarding Claim 19:
Claim 19 has been analyzed with regard to claim 7 (see rejection above) and
is rejected for the same reasons of obviousness used above.
Regarding Claim 20:
Claim 20 has been analyzed with regard to claim 1 (see rejection above) and
is rejected for the same reasons of obviousness used above.
It is noted that Tan discloses experimental computer implemented tests of the trained model that necessarily include a processing device with memory to store instructions at least at Section 3.1.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IAN SCOTT MCLEAN whose telephone number is (703)756-4599. The examiner can normally be reached "Monday - Friday 8:00-5:00 EST, off Every 2nd Friday".
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/IAN SCOTT MCLEAN/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654