DETAILED ACTION
Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Interpretations
2. The following is a quotation of 35 U.S.C. 112(f):
(f) ELEMENT IN CLAIM FOR A COMBINATION.—An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
3. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” (or “step”) but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is/are coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are:
21. (New) A sound separation system, comprising:
an encoder comprising a short-time Fourier transform module to determine a first magnitude spectrum and a phase spectrum of an input audio signal, the input audio signal including voice;
a separator, coupled to the encoder, comprising a temporal convolution network (TCN) used to develop a separation mask using the first magnitude spectrum as input, wherein the TCN includes non-causal convolution layers and causal convolution layers to form a hybrid TCN architecture; and
a mixer, coupled to the separator, to multiply the separation mask with the magnitude spectrum to separate the voice from the input audio signal to obtain a second magnitude spectrum for the voice.
23. (New) The system of claim 22, further comprising a decoder, coupled to the mixer and the encoder, comprising an inverse short-time Fourier transform module to reconstruct the input audio signal without the additional sound using the second magnitude spectrum and the phase spectrum.
30. (New) The method of claim 29, wherein the audio signal includes additional sound and wherein the mixer is configured to separate the voice from the additional sound.
36. (New) The at least one non-transitory computer readable medium of claim 35, wherein the audio signal includes additional sound and wherein the mixer is configured to separate the voice from the additional sound.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. See, for example, Figures 2 and 6 and paragraphs 58-62 and 77 of the present specification, which describe memory storing instructions to be executed by a processor; paragraph 17 further describes a processor executing stored instructions.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
For more information, see MPEP § 2173 et seq. and Supplementary Examination Guidelines for Determining Compliance With 35 U.S.C. 112 and for Treatment of Related Issues in Patent Applications, 76 FR 7162, 7167 (Feb. 9, 2011).
Claim Rejections - 35 USC § 112
4. The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
5. Claims 30-31 and 36-37 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for pre-AIA applications, the applicant) regards as the invention.
Claim 30 recites the limitation “the mixer” in line 2. There is insufficient antecedent basis for this limitation in the claim. Claim 36 recites the limitation “the mixer” in line 2. There is insufficient antecedent basis for this limitation in the claim.
Claims 31 and 37 are rejected on the same ground by virtue of their dependency.
Double Patenting
6. The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
7. Claims 21-23, 26-31, 33-37 and 39-40 of the pending application 18/770496 filed on 10/24/2024 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 1, 1, 3, (1+6), 9, 10, 10, 10, 17, (10+13), 18, 18, 18, 22 and (18+21), respectively, of the issued patent US 12,062,369 B2. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the pending application are similar in scope to the claims of the issued patent US 12,062,369 B2. Please see the table below for the claim similarities. The pending application and the issued patent are directed to the same approach of separating voice/speech from an input audio signal. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the method/device/non-transitory computer readable medium of obtaining a denoise magnitude spectrum as recited in US 12,062,369 B2 to separate the voice from the input audio signal as claimed in the pending application 18/770496.
Pending Application 18/770496
Issued Patent US 12,062,369 B2
21. (New) A sound separation system, comprising:
an encoder comprising a short-time Fourier transform module to determine a first magnitude spectrum and a phase spectrum of an input audio signal, the input audio signal including voice;
a separator, coupled to the encoder, comprising a temporal convolution network (TCN) used to develop a separation mask using the first magnitude spectrum as input, wherein the TCN includes non-causal convolution layers and causal convolution layers to form a hybrid TCN architecture; and
a mixer, coupled to the separator, to multiply the separation mask with the magnitude spectrum to separate the voice from the input audio signal to obtain a second magnitude spectrum for the voice.
1. A dynamic noise reduction system, comprising:
an encoder comprising a short-time Fourier transform module to determine a magnitude spectrum and a phase spectrum of an input audio signal, the input audio signal comprising speech and dynamic noise;
a separator, coupled to the encoder, comprising a temporal convolution network (TCN) used to develop a separation mask using the magnitude spectrum as input, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN merges non-causal convolution layers with causal convolution layers to form a hybrid TCN architecture;
a mixer, coupled to the separator, to multiply the separation mask with the magnitude spectrum to separate the speech from the dynamic noise to obtain a denoise magnitude spectrum; and
a decoder, coupled to the mixer and the encoder, comprising an inverse short-time Fourier transform module to reconstruct the input audio signal without the dynamic noise using the denoise magnitude spectrum and the phase spectrum.
22. (New) The system of claim 21, wherein the audio signal includes additional sound and wherein the mixer is configured to separate the voice from the additional sound.
1. A dynamic noise reduction system, comprising:
an encoder comprising a short-time Fourier transform module to determine a magnitude spectrum and a phase spectrum of an input audio signal, the input audio signal comprising speech and dynamic noise;
a separator, coupled to the encoder, comprising a temporal convolution network (TCN) used to develop a separation mask using the magnitude spectrum as input, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN merges non-causal convolution layers with causal convolution layers to form a hybrid TCN architecture;
a mixer, coupled to the separator, to multiply the separation mask with the magnitude spectrum to separate the speech from the dynamic noise to obtain a denoise magnitude spectrum; and
a decoder, coupled to the mixer and the encoder, comprising an inverse short-time Fourier transform module to reconstruct the input audio signal without the dynamic noise using the denoise magnitude spectrum and the phase spectrum.
23. (New) The system of claim 22, further comprising a decoder, coupled to the mixer and the encoder, comprising an inverse short-time Fourier transform module to reconstruct the input audio signal without the additional sound using the second magnitude spectrum and the phase spectrum.
1. A dynamic noise reduction system, comprising:
an encoder comprising a short-time Fourier transform module to determine a magnitude spectrum and a phase spectrum of an input audio signal, the input audio signal comprising speech and dynamic noise;
a separator, coupled to the encoder, comprising a temporal convolution network (TCN) used to develop a separation mask using the magnitude spectrum as input, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN merges non-causal convolution layers with causal convolution layers to form a hybrid TCN architecture;
a mixer, coupled to the separator, to multiply the separation mask with the magnitude spectrum to separate the speech from the dynamic noise to obtain a denoise magnitude spectrum; and
a decoder, coupled to the mixer and the encoder, comprising an inverse short-time Fourier transform module to reconstruct the input audio signal without the dynamic noise using the denoise magnitude spectrum and the phase spectrum.
26. (New) The system of claim 21, wherein the TCN comprises at least one stack of 1-D convolution blocks that repeat n times.
3. The system of claim 1, wherein the TCN comprises at least one stack of 1-D dilated convolution blocks that repeat n times.
27. (New) The system of claim 21, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the frequency SNR cost function includes a logarithmic scale to balance quiet and loud magnitudes.
1. A dynamic noise reduction system, comprising:
an encoder comprising a short-time Fourier transform module to determine a magnitude spectrum and a phase spectrum of an input audio signal, the input audio signal comprising speech and dynamic noise;
a separator, coupled to the encoder, comprising a temporal convolution network (TCN) used to develop a separation mask using the magnitude spectrum as input, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN merges non-causal convolution layers with causal convolution layers to form a hybrid TCN architecture;
a mixer, coupled to the separator, to multiply the separation mask with the magnitude spectrum to separate the speech from the dynamic noise to obtain a denoise magnitude spectrum; and
a decoder, coupled to the mixer and the encoder, comprising an inverse short-time Fourier transform module to reconstruct the input audio signal without the dynamic noise using the denoise magnitude spectrum and the phase spectrum.
6. The system of claim 1, wherein the frequency SNR cost function includes a logarithmic scale to balance quiet and loud magnitudes.
28. (New) The system of claim 21, wherein the system is executable on small form factor devices.
9. The system of claim 1, wherein the dynamic noise reduction system is executable on small form factor devices capable of voice calls.
29. (New) A method for dynamic noise reduction, comprising:
receiving, by an encoder, an input audio signal, the input audio signal including voice;
performing, by the encoder, a short-time Fourier transform on the audio signal to generate a first magnitude spectrum and a phase spectrum;
estimating, by a temporal convolution network (TCN), a separation mask based on the first magnitude spectrum using deep learning, wherein the TCN comprises non-causal convolution layers merged with causal convolution layers; and
mixing the separation mask with the first magnitude spectrum to separate the voice from the input audio signal and obtain a second magnitude spectrum for the voice.
10. A method for dynamic noise reduction, comprising:
receiving, by an encoder, an input audio signal, the input audio signal including speech and dynamic noise;
performing, by the encoder, a short-time Fourier transform on the audio signal to generate a magnitude spectrum and a phase spectrum;
estimating, by a temporal convolution network (TCN), a separation mask based on the magnitude spectrum using deep learning, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN comprises non-causal convolution layers merged with causal convolution layers;
mixing the separation mask with the magnitude spectrum to generate a denoise magnitude spectrum; and
performing, by a decoder, an inverse short-time Fourier transform using the denoise magnitude spectrum and the phase spectrum to reconstruct the input audio signal without the dynamic noise.
30. (New) The method of claim 29, wherein the audio signal includes additional sound and wherein the mixer is configured to separate the voice from the additional sound.
10. A method for dynamic noise reduction, comprising:
receiving, by an encoder, an input audio signal, the input audio signal including speech and dynamic noise;
performing, by the encoder, a short-time Fourier transform on the audio signal to generate a magnitude spectrum and a phase spectrum;
estimating, by a temporal convolution network (TCN), a separation mask based on the magnitude spectrum using deep learning, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN comprises non-causal convolution layers merged with causal convolution layers;
mixing the separation mask with the magnitude spectrum to generate a denoise magnitude spectrum; and
performing, by a decoder, an inverse short-time Fourier transform using the denoise magnitude spectrum and the phase spectrum to reconstruct the input audio signal without the dynamic noise.
31. (New) The method of claim 30, further comprising performing, by a decoder, an inverse short-time Fourier transform using the second magnitude spectrum and the phase spectrum to reconstruct the input audio signal without the additional sound.
10. A method for dynamic noise reduction, comprising:
receiving, by an encoder, an input audio signal, the input audio signal including speech and dynamic noise;
performing, by the encoder, a short-time Fourier transform on the audio signal to generate a magnitude spectrum and a phase spectrum;
estimating, by a temporal convolution network (TCN), a separation mask based on the magnitude spectrum using deep learning, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN comprises non-causal convolution layers merged with causal convolution layers;
mixing the separation mask with the magnitude spectrum to generate a denoise magnitude spectrum; and
performing, by a decoder, an inverse short-time Fourier transform using the denoise magnitude spectrum and the phase spectrum to reconstruct the input audio signal without the dynamic noise.
33. (New) The method of claim 29, wherein the TCN comprises at least one stack of 1-D dilated convolution blocks that repeat n times to estimate the separation mask using deep learning.
17. The method of claim 10, wherein the TCN comprises at least one stack of 1-D dilated convolution blocks that repeat n times to estimate the separation mask using the deep learning.
34. (New) The method of claim 29, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and, wherein the frequency SNR cost function includes a logarithmic scale to balance quiet and loud magnitudes.
10. A method for dynamic noise reduction, comprising:
receiving, by an encoder, an input audio signal, the input audio signal including speech and dynamic noise;
performing, by the encoder, a short-time Fourier transform on the audio signal to generate a magnitude spectrum and a phase spectrum;
estimating, by a temporal convolution network (TCN), a separation mask based on the magnitude spectrum using deep learning, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN comprises non-causal convolution layers merged with causal convolution layers;
mixing the separation mask with the magnitude spectrum to generate a denoise magnitude spectrum; and
performing, by a decoder, an inverse short-time Fourier transform using the denoise magnitude spectrum and the phase spectrum to reconstruct the input audio signal without the dynamic noise.
13. The method of claim 10, wherein the frequency SNR cost function includes a logarithmic scale to balance quiet and loud magnitudes.
35. (New) At least one non-transitory computer readable medium, comprising a set of instructions, which when executed by one or more computing devices, cause the one or more computing devices to:
receive, by an encoder, an input audio signal, the input audio signal including voice;
perform, by the encoder, a short-time Fourier transform on the audio signal to generate a first magnitude spectrum and a phase spectrum;
estimate, by a temporal convolution network (TCN), a separation mask based on the first magnitude spectrum using deep learning, wherein the TCN comprises non-causal convolution layers merged with causal convolution layers; and
mix the separation mask with the first magnitude spectrum to generate a second magnitude spectrum for the voice.
18. At least one non-transitory computer readable medium, comprising a set of instructions, which when executed by one or more computing devices, cause the one or more computing devices to:
receive, by an encoder, an input audio signal, the input audio signal including speech and dynamic noise;
perform, by the encoder, a short-time Fourier transform on the audio signal to generate a magnitude spectrum and a phase spectrum;
estimate, by a temporal convolution network (TCN), a separation mask based on the magnitude spectrum using deep learning, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN comprises non-causal convolution layers merged with causal convolution layers;
mix the separation mask with the magnitude spectrum to generate a denoise magnitude spectrum; and
perform, by a decoder, an inverse short-time Fourier transform using the denoise magnitude spectrum and the phase spectrum to reconstruct the input audio signal without the dynamic noise.
36. (New) The at least one non-transitory computer readable medium of claim 35, wherein the audio signal includes additional sound and wherein the mixer is configured to separate the voice from the additional sound.
18. At least one non-transitory computer readable medium, comprising a set of instructions, which when executed by one or more computing devices, cause the one or more computing devices to:
receive, by an encoder, an input audio signal, the input audio signal including speech and dynamic noise;
perform, by the encoder, a short-time Fourier transform on the audio signal to generate a magnitude spectrum and a phase spectrum;
estimate, by a temporal convolution network (TCN), a separation mask based on the magnitude spectrum using deep learning, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN comprises non-causal convolution layers merged with causal convolution layers;
mix the separation mask with the magnitude spectrum to generate a denoise magnitude spectrum; and
perform, by a decoder, an inverse short-time Fourier transform using the denoise magnitude spectrum and the phase spectrum to reconstruct the input audio signal without the dynamic noise.
37. (New) The at least one non-transitory computer readable medium of claim 36, wherein the set of instructions further cause the one or more computing devices to:
perform, by a decoder, an inverse short-time Fourier transform using the second magnitude spectrum and the phase spectrum to reconstruct the input audio signal without the additional sound.
18. At least one non-transitory computer readable medium, comprising a set of instructions, which when executed by one or more computing devices, cause the one or more computing devices to:
receive, by an encoder, an input audio signal, the input audio signal including speech and dynamic noise;
perform, by the encoder, a short-time Fourier transform on the audio signal to generate a magnitude spectrum and a phase spectrum;
estimate, by a temporal convolution network (TCN), a separation mask based on the magnitude spectrum using deep learning, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN comprises non-causal convolution layers merged with causal convolution layers;
mix the separation mask with the magnitude spectrum to generate a denoise magnitude spectrum; and
perform, by a decoder, an inverse short-time Fourier transform using the denoise magnitude spectrum and the phase spectrum to reconstruct the input audio signal without the dynamic noise.
39. (New) The at least one non-transitory computer readable medium of claim 35, wherein the TCN comprises at least one stack of 1-D dilated convolution blocks that repeat n times to estimate the separation mask using deep learning.
22. The at least one non-transitory computer readable medium of claim 18, wherein the TCN comprises at least one stack of 1-D dilated convolution blocks that repeat n times to estimate the separation mask using the deep learning.
40. (New) The at least one non-transitory computer readable medium of claim 35, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and, wherein the frequency SNR cost function includes a logarithmic scale to balance quiet and loud magnitudes.
18. At least one non-transitory computer readable medium, comprising a set of instructions, which when executed by one or more computing devices, cause the one or more computing devices to:
receive, by an encoder, an input audio signal, the input audio signal including speech and dynamic noise;
perform, by the encoder, a short-time Fourier transform on the audio signal to generate a magnitude spectrum and a phase spectrum;
estimate, by a temporal convolution network (TCN), a separation mask based on the magnitude spectrum using deep learning, wherein the TCN is trained using a frequency SNR cost function used to calculate loss during training, and wherein the TCN comprises non-causal convolution layers merged with causal convolution layers;
mix the separation mask with the magnitude spectrum to generate a denoise magnitude spectrum; and
perform, by a decoder, an inverse short-time Fourier transform using the denoise magnitude spectrum and the phase spectrum to reconstruct the input audio signal without the dynamic noise.
21. The at least one non-transitory computer readable medium of claim 18, wherein the frequency SNR cost function includes a logarithmic scale to balance quiet and loud magnitudes.
This is a non-provisional nonstatutory double patenting rejection because the patentably indistinct claims have in fact been patented.
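(For illustration only, and not as part of the rejection: the claim charts above repeatedly recite a "frequency SNR cost function" that "includes a logarithmic scale to balance quiet and loud magnitudes." Neither document sets out the exact formula in this record, so the following Python sketch is one plausible reading; every name and constant in it is an assumption.)

```python
import torch

def log_frequency_snr_loss(est_mag: torch.Tensor, ref_mag: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical per-frequency SNR-style training loss on a logarithmic
    scale. The log compression keeps quiet frequency bins from being drowned
    out by loud ones, which is one way to read "balance quiet and loud
    magnitudes." This is an assumed formulation, not the patented one."""
    log_est = torch.log10(est_mag + eps)
    log_ref = torch.log10(ref_mag + eps)
    signal = (log_ref ** 2).sum(dim=-1)
    error = ((log_ref - log_est) ** 2).sum(dim=-1)
    snr_db = 10.0 * torch.log10(signal / (error + eps) + eps)
    return -snr_db.mean()  # maximizing SNR = minimizing its negative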
Allowable Subject Matter
8. Claims 21-40 are allowable over the prior art of record.
The following is a statement of reasons for the indication of allowable subject matter: the prior art of record, taken alone or in combination, fails to teach the following elements in combination with the other recited elements in the claims.
“a separator, coupled to the encoder, comprising a temporal convolution network (TCN) used to develop a separation mask using the first magnitude spectrum as input, wherein the TCN includes non-causal convolution layers and causal convolution layers to form a hybrid TCN architecture; and
a mixer, coupled to the separator, to multiply the separation mask with the magnitude spectrum to separate the voice from the input audio signal to obtain a second magnitude spectrum for the voice.” as recited in Claim 21.
Claims 29-35 recite features similar to those of claim 21.
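(For illustration only: the following Python sketch paraphrases the signal flow of the limitation quoted above, i.e., an STFT encoder, a hybrid causal/non-causal TCN separator, and a mask-multiplying mixer, under assumed PyTorch conventions. All module, function, and parameter names are hypothetical; this is not the applicant's implementation.)

```python
# Illustrative sketch only; hypothetical names, not the applicant's code.
import torch
import torch.nn as nn

class HybridTCNSeparator(nn.Module):
    """Paraphrase of the claimed separator: a TCN combining non-causal and
    causal 1-D convolution layers to estimate a separation mask."""
    def __init__(self, freq_bins: int, hidden: int = 256):
        super().__init__()
        # Non-causal layer: symmetric padding sees past and future frames.
        self.noncausal = nn.Conv1d(freq_bins, hidden, kernel_size=3, padding=1)
        # Causal layer: left-only padding sees past frames only.
        self.causal_pad = nn.ConstantPad1d((2, 0), 0.0)
        self.causal = nn.Conv1d(hidden, freq_bins, kernel_size=3)

    def forward(self, magnitude: torch.Tensor) -> torch.Tensor:
        # magnitude: (batch, freq_bins, frames) -> mask values in [0, 1]
        h = torch.relu(self.noncausal(magnitude))
        return torch.sigmoid(self.causal(self.causal_pad(h)))

def separate_voice(audio: torch.Tensor, separator: HybridTCNSeparator,
                   n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    # Encoder: STFT yields a first magnitude spectrum and a phase spectrum.
    spec = torch.stft(audio, n_fft=n_fft, hop_length=hop, return_complex=True)
    magnitude, phase = spec.abs(), spec.angle()
    # Separator: TCN develops a separation mask from the magnitude spectrum.
    mask = separator(magnitude.unsqueeze(0)).squeeze(0)
    # Mixer: multiply the mask with the magnitude spectrum to obtain a
    # second magnitude spectrum for the voice.
    voice_magnitude = mask * magnitude
    # Decoder (claim 23): inverse STFT reconstructs the signal from the
    # second magnitude spectrum and the original phase spectrum.
    voice_spec = torch.polar(voice_magnitude, phase)
    return torch.istft(voice_spec, n_fft=n_fft, hop_length=hop)
```

(Under these assumptions, an STFT with n_fft=512 yields 257 frequency bins, so the separator would be constructed as HybridTCNSeparator(freq_bins=257).)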
The closest prior art of record is as follows.
a. Luo et al. (“Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation”). In this reference, Luo et al. propose a fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to waveforms using a linear decoder. The mask is found using a temporal convolutional network consisting of stacked one-dimensional dilated convolution blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size (Section II, Convolutional Time-Domain Audio Separation Network, para. 4 of pg. 1257: the encoder module replaces the STFT module with a data-driven representation to determine features such as magnitude spectrums of an input audio signal, hence Luo et al. teach basis functions corresponding to parameters, see para. 3 of pg. 1256; Section D, Convolutional Separation Module, para. 1 of pg. 1259: long-range dependencies of the speech signal, where the training objective is to maximize the scale-invariant source-to-noise ratio, e.g., the audio signal includes speech and dynamic noise; Fig. 1B: the separation module is a TCN used to estimate the masks based on the waveform information of the encoder, e.g., the magnitude spectrum of the input, see the Fig. 1B description on pg. 1258). Luo et al. discuss a convolutional time-domain audio separation network including an encoder, a separator, and a decoder. Luo et al. receive an input mixture sound that can be divided into overlapping segments of length L, and perform a 1-D convolution operation on the input. Luo et al. discuss a fully convolutional separation module. Luo et al. state that “for noncausal configuration, we found empirically that global layer normalization (gLN) outperforms all other normalization methods”, and that “in causal configuration, gLN cannot be applied since it relies on the future values of the signal at any time step. Instead, we designed a cumulative layer normalization (cLN) operation...” In the right column of page 2, Luo et al. indicate that Conv-TasNet significantly increases the separation accuracy over the previous LSTM-TasNet in both causal and non-causal implementations. Luo et al. implement Conv-TasNet in both causal and non-causal configurations. Conv-TasNet is a fully convolutional time-domain audio separation network. With Conv-TasNet, Luo et al. do not convert an input audio signal from the time domain to the frequency domain. Luo et al. do not generate a separation mask using a magnitude spectrum as input to a temporal convolution network, wherein the TCN includes non-causal convolution layers and causal convolution layers to form a hybrid TCN. Thus, Luo et al. fail to teach and/or suggest the allowable subject matter.
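(To make the distinction concrete, the following hedged sketch shows why Conv-TasNet's front end never yields a magnitude spectrum: its encoder is a learned 1-D convolution over raw waveform segments rather than an STFT. Hyperparameters and names here are illustrative, not Luo et al.'s exact configuration.)

```python
import torch
import torch.nn as nn

# Hedged sketch of Conv-TasNet's time-domain front end (after Luo et al.):
# the encoder is a learned 1-D convolution over raw waveform segments, not
# an STFT, so the separation network never receives a magnitude spectrum.
L, N = 16, 512                      # segment length and number of basis filters
encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

waveform = torch.randn(1, 1, 16000)       # (batch, channel, samples)
features = torch.relu(encoder(waveform))  # learned representation, not |STFT|
mask = torch.sigmoid(features)            # stand-in for the TCN-estimated mask
separated = decoder(features * mask)      # masked features inverted to waveform
```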
b. Wu (CN 111261145.) In this reference, Wu discusses a voice extraction model using a multiple-channel time-domain audio separation network, which may include an encoder, an enhancer, a multiplier, and a decoder. Wu states that the encoder may include a short-time Fourier transform and that the decoder may include an inverse short-time Fourier transform. In Wu, the voice extraction model 301 includes an encoder using a short-time Fourier transform (STFT) that outputs amplitude spectrum information indicative of the magnitude and phase spectrum information of an input audio signal; the decoder uses an inverse Fourier transform to reconstruct, i.e., decode, the input audio signal using the magnitude and phase spectrum information for speech separation, using the target object voice mask mixed with the amplitude spectrum to achieve a target signal. Wu uses a temporal fully-convolutional network to obtain the target object voice mask. However, Wu does not generate a separation mask using a magnitude spectrum as input to a temporal convolution network, wherein the TCN includes non-causal convolution layers and causal convolution layers to form a hybrid TCN. Thus, Wu fails to teach and/or suggest the allowable subject matter.
c. Kremer et al. (US 2004/0078199 A1.) In this reference, Kremer et al. disclose a method/a system for reducing noise in audio signal (Kremer et al. [0043] Apparatus 100 includes: (i) high pass filter 110, (ii) a frequency converter such as Weighted OverLap-Add (WOLA) analyzer 120, (iii) first voice activity detector 130, (iv) noise estimator 140, (v) spectral subtracting block 150, (vi) masking threshold calculator 160, (vii) optimal parameters calculator 170, (viii) parametric subtracting block 180, (ix) signal to noise estimator 190, (x) musical noise suppressor 200, (xi) WOLA synthesizer 210, (xii) second voice activity detector 220, (xiii) low pass filter 230 and (xiv) output suppressor 240. It is noted that the spectral subtracting block 150, the masking threshold calculator 160, the optimal parameters calculator 170, and the parametric subtracting block form a parametric subtraction entity, [0045] An output of the first voice activity detectors 130 and an output of second voice activity detector 220 each are connected to noise estimator 140, while the output of the noise estimator 140 is connected to an input of spectral subtracting block 150 and to an input of signal to noise estimator 190. The output of spectral subtracting block 150 is connected to an input of the optical parametric calculator 170 and to the input of the masking threshold calculator 160. The output of the masking threshold calculator 160 is connected to an input of the optimal parameters calculator 170. The output of the optimal parameters calculator 170 is connected to an input of the parametric subtracting block 180. The output of the parametric subtracting block 180 is connected to an input of the musical noise suppressor 200, while another input of the musical noise suppressor 200 is connected to the output of the signal to noise estimator 190. The output of the musical noise suppressor 200 is connected to an input of the WOLA synthesizer 210. The output of the WOLA synthesizer 210 is connected to an input of second voice activity detector 220 and to the input of the low pass filter 230. The output of the low pass filter 230 is connected to an input of the output suppressor 240, while another input of the output suppressor 240 is connected to the output of the second voice activity detector 220. The output of output suppressor 240 provides the output signal of apparatus 100 that is an estimation of the speech signal (during estimated speech periods) or a noise signal (during estimated non-speech periods), [0073] According to an aspect of the invention the spectral subtracting occurs only if first voice activity detector 130 determines that the noisy input signal includes speech signals (that the likelihood that the noisy input signal includes a speech signal exceeds a threshold), [0074] According to another aspect of the invention the spectral subtraction is implemented for each noisy input signal, regardless the determination of the first voice activity detector 130, [0076] The masking threshold calculator 160 is operable to compute a masking threshold per band, and for each frame. For each band and for each frame the computation includes summing the energies of frequency components of the roughly estimated speech signal that belong to the band. The summed energies undergo a convolution operation with frequency components of a spreading function that reflects the masking phenomenon. Frequency components of a relative threshold offset are subtracted from the product of the convolution.) Kremer et al. 
use the convolution operation in the masking threshold calculator. However, Kremer et al. do not generate a separation mask using a magnitude spectrum as input to a temporal convolution network, wherein the TCN includes non-causal convolution layers and causal convolution layers to form a hybrid TCN. Thus, Kremer et al. fail to teach and/or suggest the allowable subject matter.
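(For contrast with the claimed mask multiplication: Kremer et al.'s chain is built around spectral subtraction. The following is a generic textbook sketch of magnitude spectral subtraction with assumed parameters, not Kremer et al.'s full WOLA/masking-threshold/parametric apparatus.)

```python
import numpy as np

def spectral_subtract(mag: np.ndarray, noise_mag: np.ndarray,
                      alpha: float = 1.0, floor: float = 0.01) -> np.ndarray:
    """Generic magnitude spectral subtraction (textbook form, not Kremer
    et al.'s exact parametric method): subtract an estimated noise magnitude
    per frequency bin, then clamp to a spectral floor to limit musical
    noise. No multiplicative separation mask is involved."""
    cleaned = mag - alpha * noise_mag
    return np.maximum(cleaned, floor * mag)
```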
Conclusion
9. The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. See PTO-892.
a. Sun et al. (US 2022/0223144 A1.) In this reference, Sun et al. disclose a method and a system for separating sources based on a convolutional neural network.
b. Mesgarani et al. (US 2019/0066713 A1.) In this reference, Mesgarani et al. disclose speech-separation processing.
c. Ochiai et al. (US 2023/0067132 A1.) In this reference, Ochiai et al. disclose a method and a system for extracting a separated signal from a mixed signal by beamforming.
10. Any inquiry concerning this communication or earlier communications from the examiner should be directed to THUYKHANH LE whose telephone number is (571)272-6429. The examiner can normally be reached Mon-Fri: 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C. Flanders can be reached on 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/THUYKHANH LE/Primary Examiner, Art Unit 2655