DETAILED ACTION
This communication is in response to the Application filed on 04/23/2024. Claims 1-20 are pending and have been examined. Claims 1, 12, and 20 are independent. This Application was published as U.S. Pub. No. 20240363132.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 11/11/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Priority
Applicant’s claim for the benefit of provisional application 63/461,665, filed on 04/25/2023, is acknowledged.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 16-19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claims 16-19 contain the trademark/trade name Tensilica®. Where a trademark or trade name is used in a claim as a limitation to identify or describe a particular material or product, the claim does not comply with the requirements of 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph. See Ex parte Simpson, 218 USPQ 1020 (Bd. App. 1982). The claim scope is uncertain since the trademark or trade name cannot be used properly to identify any particular material or product. A trademark or trade name is used to identify a source of goods, and not the goods themselves. Thus, a trademark or trade name does not identify or describe the goods associated with the trademark or trade name. In the present case, the trademark/trade name is used to identify/describe a brand name or a company; accordingly, the identification/description is indefinite.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-6 and 8-20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al., (US Pat No. 11,727,926, hereinafter, Liu) in view of Defossez et al., ("Music source separation in the waveform domain." arXiv preprint arXiv:1911.13254 (2019), hereinafter, Defossez).
Regarding Claim 1,
Liu discloses a computer-implemented method comprising:
receiving audio data including a known noisy acoustic signal, the known noisy acoustic signal including a known clean acoustic signal and at least one known additive noise (Liu, Fig.1, col.3, lls.13-29, "…the user device 110 and/or remote system 120 receives (120) first audio data that includes representations of both an utterance 104 of a user 102 and noise 108 from a noise source 106...");
transforming the audio data into frequency-domain data (Liu, col.3, lls.30-41, "…the user device 110 and/or remote system 120 may further process the audio data to, for example, convert time-domain audio data into frequency domain audio data (via, for example, a Fourier transform)..."); and
training a convolutional neural network based on the frequency-domain data and at least one of the known clean acoustic signal or the known additive noise (Liu, col.5, lls.4-15, "…noise-reduction component may be trained to process received audio data that includes a representation of both an utterance and of noise to determine output audio data that includes a representation of the utterance and reduced noise..."),
wherein the convolutional neural network is configured to: output a frequency multiplicative mask to be applied to the frequency-domain data to estimate the known clean acoustic signal (Liu, Figs.2 and 4, col.15, lls.1-34, "…The noise-reduction component 222 may further include a decoder 414 for processing the encoder output data 412 and the RNN output data 422 to produce mask data 424…"; Figs. 5 and 6 illustrate the encoder and decoder layers where dense layers of encoder/decoder, respectively, perform two-dimensional convolution; col.18, lls.33-36, "…Each dense layer 502 may perform an AxB two-dimensional convolution..."; col.19, lls.20-24, "…Each dense layer 604 may perform a transpose AxB two-dimensional convolution...").
Liu discloses a multi-layer convolutional encoder and decoder with U-Net skip connections, but does not explicitly disclose the limitation, "include an encoder configured to upsample the frequency-domain data into a feature space." However, Defossez discloses an encoder configured to upsample the frequency-domain data into a feature space (Defossez, Figure 2, 4. The Demucs Architecture, "…Demucs takes a stereo mixture as input and outputs a stereo estimate for each source…It is an encoder/decoder architecture composed of a convolutional encoder, a bidirectional LSTM, and a convolutional decoder, with the encoder and decoder linked with skip U-Net connections...The number of channels in the input mixture is 2, while we use…as the number of output channels for the first encoder block…the final number of channels is CL = 2048...").
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the system and method for speech enhancement and noise reduction of Liu with the waveform-to-waveform source separation model, with a U-Net structure and bidirectional LSTM, of Defossez, with a reasonable expectation of success, to improve speech/noise separation while compressing the model to a small size and maintaining accuracy and naturalness (Defossez, Abstract).
Regarding Claim 2,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Liu further discloses multiplying the frequency multiplicative mask to the frequency-domain data to estimate the known clean acoustic signal (Liu, Fig.4A, col.15, lls.6-22, "…A complex multiplication component 426 may process the mask data 424 and the delayed input data 428 to determine the output data 430. The mask data 424 may be a vector and/or series of vectors comprising complex numbers of the form a +bi..."; col.5, lls.4-15, "…output audio data that includes a representation of the utterance and reduced noise…noise reduction refers to reducing a magnitude of the volume of the representation of the noise represented in the audio data. This reduction in magnitude includes reducing the magnitude to zero...").
Regarding Claim 3,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Defossez further discloses spatial dimensions specified by width and height of the frequency-domain data remain the same before and after a performance of at least one 2-dimensional convolutional layer of the encoder (Defossez, Fig.2, 4.1 Convolutional auto-encoder, "…The number of channels in the input mixture is 2, while we use…as the number of output channels for the first encoder block…the final number of channels is CL = 2048..."; i.e., note that the upsampling at the encoder involves increasing the number of channels).
Regarding Claim 4,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Defossez further discloses the convolutional neural network includes a decoder configured to downsample the feature space into the frequency multiplicative mask (Defossez, Fig.2, 4.1 Convolutional auto-encoder, "…The decoder is mostly the inverse of the encoder...").
Regarding Claim 5,
Liu in view of Defossez discloses the computer-implemented method of claim 4. Defossez further discloses spatial dimensions specified by width and height of the frequency-domain data remain the same before and after a performance of at least one 2-dimensional convolutional layer of the decoder (Defossez, Fig.2, 4.1 Convolutional auto-encoder, "…The decoder is mostly the inverse of the encoder...").
Regarding Claim 6,
Liu in view of Defossez discloses the computer-implemented method of claim 1 further comprising:
Liu further discloses constructing the convolutional neural network, including a plurality of neurons arranged in a plurality of layers including encoding layers and decoding layers wherein the encoding layers and decoding layers include 2-dimensional convolutional layers (Liu, Figs.2 and 4, col.15, lls.1-34, "…The noise-reduction component 222 may further include a decoder 414 for processing the encoder output data 412 and the RNN output data 422 to produce mask data 424…"; Figs. 5 and 6 illustrate the encoder and decoder layers where dense layers of encoder/decoder, respectively, perform two-dimensional convolution; col.18, lls.33-36, "…Each dense layer 502 may perform an AxB two-dimensional convolution..."; col.19, lls.20-24, "…Each dense layer 604 may perform a transpose AxB two-dimensional convolution...").
Regarding Claim 8,
Liu in view of Defossez discloses the computer-implemented method of claim 6. Defossez discloses a first layer of the plurality of layers is configured to encode frequencies in the frequency-domain data into a higher-dimension feature space in comparison with an original dimension of the frequency-domain data, and a second layer of the plurality of layers is configured to decode the feature space to a lower dimension in comparison with the higher-dimension feature space (Defossez, Figure 2, 4. The Demucs Architecture, "…Demucs takes a stereo mixture as input and outputs a stereo estimate for each source…It is an encoder/decoder architecture composed of a convolutional encoder, a bidirectional LSTM, and a convolutional decoder, with the encoder and decoder linked with skip U-Net connections...The number of channels in the input mixture is 2, while we use…as the number of output channels for the first encoder block…the final number of channels is CL = 2048...Decoder...The decoder is mostly the inverse of the encoder...").
Regarding Claim 9,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Liu further discloses providing the trained convolutional neural network to a wearable or portable audio device wherein the audio device is capable of (Liu, Fig.1, col.2, lls.45-66, "…The device 110 may capture audio that represents both desired audio, such as the utterance 104, and undesired audio, such as the noise 108…The device 110 may contain a noise-reduction component and a number of other components..."):
receiving real-time audio data, transforming the real-time audio data into real-time frequency-domain data (Liu, Fig.1, col.3, lls.13-29, "…the user device 110 and/or remote system 120 receives (120) first audio data that includes representations of both an utterance 104 of a user 102 and noise 108 from a noise source 106..."; col.3, lls.30-41, "…the user device 110 and/or remote system 120 may further process the audio data to, for example, convert time-domain audio data into frequency domain audio data (via, for example, a Fourier transform)..."),
outputting a real-time frequency multiplicative mask using the trained convolutional neural network and the real-time audio data (Liu, Figs.2 and 4, col.15, lls.1-34, "…The noise-reduction component 222 may further include a decoder 414 for processing the encoder output data 412 and the RNN output data 422 to produce mask data 424…"; Figs. 5 and 6 illustrate the encoder and decoder layers where dense layers of encoder/decoder, respectively, perform two-dimensional convolution; col.18, lls.33-36, "…Each dense layer 502 may perform an AxB two-dimensional convolution..."; col.19, lls.20-24, "…Each dense layer 604 may perform a transpose AxB two-dimensional convolution..."), and applying the real-time frequency multiplicative mask to the real-time frequency-domain data (Liu, Fig.4A, col.15, lls.6-22, "…A complex multiplication component 426 may process the mask data 424 and the delayed input data 428 to determine the output data 430. The mask data 424 may be a vector and/or series of vectors comprising complex numbers of the form a +bi..."; col.5, lls.4-15, "…output audio data that includes a representation of the utterance and reduced noise…noise reduction refers to reducing a magnitude of the volume of the representation of the noise represented in the audio data. This reduction in magnitude includes reducing the magnitude to zero...").
Regarding Claim 10,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Liu further discloses the frequency multiplicative mask is a phase-aware complex ratio mask (Liu, col.15, lls.8-14, "…A complex multiplication component 426 may process the mask data 424 and the delayed input data 428 to determine the output data 430. The mask data 424 may be a vector and/or series of vectors comprising complex numbers of the form a +bi, wherein a denotes the real part of each number and wherein b denotes the imaginary part of each number..."; col.17, lls.38-40, "…The input data 402 and/or mask data 424 may, as described herein, be divided into complex data such as magnitude data and phase data...").
Regarding Claim 11,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Liu further discloses the known noisy acoustic signal is a known noisy speech signal and the known clean acoustic signal is a known clean speech signal (Liu, Fig.1, col.3, lls.13-29, "…the user device 110 and/or remote system 120 receives (120) first audio data that includes representations of both an utterance 104 of a user 102 and noise 108 from a noise source 106..."; col.5, lls.4-15, "…output audio data that includes a representation of the utterance and reduced noise…noise reduction refers to reducing a magnitude of the volume of the representation of the noise represented in the audio data. This reduction in magnitude includes reducing the magnitude to zero (i.e., user utterance with zero magnitude noise = clean speech signal)...").
Claim 12 is a system claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. Additionally,
Liu further discloses a system comprising: a combination of a high fidelity digital signal processor (HiFi DSP) paired with a neural processing unit (NPU) for real-time audio processing and executed on the combination of the HiFi DSP paired with the NPU (Liu, Figs. 2 and 3, col.8, lls.36-39, "…one or more of the speech-processing systems 292, which may be used to determine which, if any, of the ASR 250, NLU 260, and/or TTS 280 components..."; col.12, lls.22-66, "…audio data from two or more microphones 301 may be processed by the analysis filter bank 304 (and/or other components)…The analysis filterbank 304 may perform a Fourier transform, such as a fast Fourier transform (FFT), and may include one or more uniform discrete Fourier transform (DFT) filterbanks..."; col.13, ll.4 - col.14, ll.33, "…The acoustic-echo cancellation component 306 may subtract reference audio data 312 from the frequency-domain audio data...The user device 110 may perform a number of other audio functions, such as automatic gain control (AGC), filtering (high-, low-, and/or band-pass filtering), echo suppression, and/or beamforming. Beamforming...A synthesis filterbank 310 may be used to convert the frequency-domain data back to time-domain output audio data 316..."; under the broadest reasonable interpretation, the user devices or remote systems employ the speech-processing systems 292 (i.e., DSP) and the noise-reduction component 222 (i.e., NPU)); and
estimating a noise suppressed version of the input audio data (Liu, Fig.4A, col.15, lls.6-22, "…A complex multiplication component 426 may process the mask data 424 and the delayed input data 428 to determine the output data 430. The mask data 424 may be a vector and/or series of vectors comprising complex numbers of the form a +bi..."; col.5, lls.4-15, "…output audio data that includes a representation of the utterance and reduced noise…noise reduction refers to reducing a magnitude of the volume of the representation of the noise represented in the audio data. This reduction in magnitude includes reducing the magnitude to zero...").
…
Rationale for combination is similar to that provided for Claim 1.
Claim 13 is a system claim with limitations similar to the limitations of Claim 4 and is rejected under similar rationale.
Claim 14 is a system claim with limitations similar to the limitations of Claim 6 and is rejected under similar rationale.
Regarding Claim 15,
Liu in view of Defossez discloses the system of claim 13.
Defossez further discloses the decoding layer is configured to increase a number of channels (Defossez, Fig.2 (a), See Decoder block, Cin=4096 and Cout = 4*2).
Regarding Claim 16,
Liu in view of Defossez discloses the system of claim 12.
Liu further discloses the HiFi DSP is of the Tensilica® HiFi DSP family (Liu, Figs. 2 and 3, one or more of the speech-processing systems 292, the ASR 250, NLU 260, and/or TTS 280 components, the analysis filter bank 304, and the acoustic-echo cancellation component 306, which are made executable in combination with hardware processors (e.g., microcontrollers, systems on a chip, field programmable gate arrays, digital signal processors, graphics processing units, general processing units; Liu, col.21, lls.36-62), would be interpreted as a DSP).
Regarding Claim 17,
Liu in view of Defossez discloses the system of claim 16.
Liu further discloses the HiFi DSP is the HiFi 5 DSP of the Tensilica® HiFi DSP family (Liu, Figs. 2 and 3, one or more of the speech-processing systems 292, the ASR 250, NLU 260, and/or TTS 280 components, the analysis filter bank 304, and the acoustic-echo cancellation component 306 in combination with hardware processors would be interpreted as a DSP).
Regarding Claim 18,
Liu in view of Defossez discloses the system of claim 12.
Liu further discloses the NPU is of the Tensilica® neural network engine (NNE) family (Liu, Figs. 2 and 3, Liu discloses the noise-reduction component 222 in combination with hardware processors, which would be interpreted as an NNE under BRI).
Regarding Claim 19,
Liu in view of Defossez discloses the system of claim 18.
Liu further discloses the NPU is NNE 110 of the Tensilica® NNE family (Liu, Figs. 2 and 3, Liu discloses the noise-reduction component 222 in combination with hardware processors, which would be interpreted as an NNE under BRI).
Claim 20 is a computer-readable storage device claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. Additionally,
Liu further discloses a computer-readable storage device storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method (Liu, Figs. 8 and 9, col.20, lls.62-67, "…the user device(s) 110 and/or the remote system(s) 120 may include their own dedicated processors, memory, and/or storage...processor(s) (804/904), memory (806/906)..."; "…non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure...").
…
Rationale for combination is similar to that provided for Claim 1.
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Defossez, further in view of Borgstrom et al., (US Pub No. 2023/0162758, hereinafter, Borgstrom).
Regarding Claim 7,
Liu in view of Defossez discloses the computer-implemented method of claim 6. Liu further discloses each of the encoding layers and the decoding layers includes a 2-dimensional convolution (Liu, Figs. 5 and 6, col.18, lls.33-36, "…Each dense layer 502 may perform an AxB two-dimensional convolution..."; col.19, lls.20-24, "…Each dense layer 604 may perform a transpose AxB two-dimensional convolution..."), but does not explicitly disclose the limitations of a batch normalization and a rectified linear unit activation following each encoding/decoding layer.
Defossez discloses a rectified linear unit activation (Defossez, 4.1 Convolutional auto-encoder, "Encoder…input channels, output channels and ReLU activation, followed by a convolution with kernel size 1, 2Ci output channels and gated linear units (GLU) as activation function...Decoder...The decoder is mostly the inverse of the encoder...a ReLU activation…").
Neither Liu nor Defossez explicitly discloses the batch normalization following the convolution layers. However, Borgstrom, in the analogous field of endeavor, discloses a batch normalization (Borgstrom, Fig.4D, par [020], "…the mask estimator can include a multi-layer fully convolutional network (FCN). The FCN can include a series of convolutional blocks. Each series can include a CNN filter process, a batch normalization process, an activation process...").
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the compact system and method for speech enhancement and noise reduction of Liu in view of Defossez with the batch normalization of the convolutional blocks within the mask-estimation fully convolutional network of Borgstrom, with a reasonable expectation of success, to improve the intelligibility of speech observed in acoustically adverse environments, as well as to lower the cognitive load required during listening (Borgstrom, paras [011-013]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Andreev et al., (US Pat No. 12,400,675, hereinafter, Andreev) discloses a system for audio waveform processing based on a generative adversarial network (GAN) generator that applies two-dimensional U-Net convolutional blocks to the mel-spectrogram and a learnable spectral masking module (SpectralMaskNet) (Andreev, Summary, col.2, ll.64 - col.4, ll.42).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JANGWOEN LEE whose telephone number is (703)756-5597. The examiner can normally be reached Monday-Friday 8:00 am - 5:00 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, BHAVESH MEHTA can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JANGWOEN LEE/Examiner, Art Unit 2656
/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656