DETAILED ACTION
Claims 1-20 of the instant application are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 04/25/2024, 11/06/2025, and 11/13/2025 have been received and are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Specification
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
The lengthy specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant’s cooperation is requested in correcting any errors of which applicant may become aware in the specification.
Claim Objections
Claims 2, 11, and 14 are objected to because of the following informalities: the first instance of the limitation “a target audio data” should read: the [[a]] target audio data. Appropriate correction is required.
Claims 6 and 18 are objected to because of the following informalities: the first instance of the limitation “M first-order time derivatives” should read: the M first-order time derivatives. Appropriate correction is required.
Claims 6 and 18 are objected to because of the following informalities: the first instance of the limitation “M second-order time derivatives” should read: the M second-order time derivatives. Appropriate correction is required.
Claims 8 and 20 are objected to because of the following informalities: the first instance of the limitation “a dynamic spectrum feature” should read: the [[a]] dynamic spectrum feature. Appropriate correction is required.
Claim 9 is objected to because of the following informalities: the limitations of “a target mask estimation model” and “a target mask” should read: the [[a]] target mask estimation model and the [[a]] target mask, respectively. Appropriate correction is required.
Claims 2 and 14 are objected to because of the following informalities: the second instance of the limitation “each audio data segment” should read: each of the audio data segments. Appropriate correction is required.
Claims 3 and 15 are objected to because of the following informalities: the first and second instances of the limitation “each audio data segment” should read: each audio data segment i. Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 1, 12, and 13 recite the limitation "inputting… the dynamic spectrum feature into a target mask…" in the limitation of “inputting the N target cepstrum coefficients…”. There is insufficient antecedent basis for this limitation in these claims. The Examiner understands the limitation should read: inputting… the dynamic spectrum feature associated with the target audio data frame into a target mask…
Hence, dependent claims 2-11 and 14-20 are also rejected.
Claims 2 and 14 recite the limitation "from obtained H audio data frames…" in the limitation of “determining, from obtained H audio data frames,…”. There is insufficient antecedent basis for this limitation in the claim. The Examiner understands the limitation should read: determining, from the obtained H audio data segments,…
Hence, dependent claims 3 and 15 are also rejected.
Claim 9 recites the limitations:
"the inputting … the dynamic spectrum feature to a target mask…" in the limitation of “the inputting the N target cepstrum coefficients,…” and
"using… the dynamic spectrum feature to a target mask…" in the limitation of “using the N target cepstrum coefficients,…”.
There is insufficient antecedent basis for these limitations in the claim. The Examiner understands the limitations should read:
the inputting … the dynamic spectrum feature associated with the target audio data frame to a target mask…
using… the dynamic spectrum feature associated with the target audio data frame to a target mask…
Hence, dependent claim 10 is also rejected.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2, 6-8, 12-14, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Fuchs et al. (US 20220223161 A1) and further in view of Ichikawa et al. (US 20110301945 A1).
As to independent claim 1, Fuchs et al. teaches:
1. An audio data processing method, performed by a computer device (see ¶ [0010]: “According to an embodiment, an audio decoder for providing a decoded audio representation on the basis of an encoded audio representation may have a filter for providing an enhanced audio representation of the decoded audio representation, wherein the filter is configured to obtain a plurality of scaling values, which are associated with different frequency bins or frequency ranges, on the basis of spectral values of the decoded audio representation which are associated with different frequency bins or frequency ranges, and wherein the filter is configured to scale spectral values of the decoded audio signal representation, or a pre-processed version thereof, using the scaling values, to obtain the enhanced audio representation.”) and comprising:
obtaining a target audio data frame and K historical audio data frames that are associated with raw audio data (see ¶ [0010] citation as in the preamble above and further ¶ [0032]: “…Rather, the adjustment of the filter may be solely based on the decoded spectral values of a currently processed frame regardless of the coding scheme used for generating the encoded and the decoded representations of the audio signal, and possibly decoded spectral values of one or more previously decoded frames and/or one or more subsequently decoded frames.” and
¶ [0120]: “The audio decoder 100 optionally comprises a decoder core 120, which may receive the encoded audio representation 110 and provide, on the basis thereof, a decoded audio representation 122. The audio decoder further comprises a filter 130, which is configured to provide the enhanced audio representation 112 on the basis of the decoded audio representation 122...”
The Examiner notes that the target audio data frame and the K historical audio data frames associated with raw audio data read on Fuchs’ teachings of current, previous, and subsequent decoded frames being processed, as cited above.),
the target audio data frame and the K historical audio data frames being spectral frames (see ¶ [0010, 0032, and 0120] citation(s) as in limitation(s) above. More specifically: ¶ [0032]: “…Rather, the adjustment of the filter may be solely based on the decoded spectral values of a currently processed frame regardless of the coding scheme used for generating the encoded and the decoded representations of the audio signal, and possibly decoded spectral values of one or more previously decoded frames and/or one or more subsequently decoded frames.” ) ,
each of the K historical audio data frames being a spectral frame preceding the target audio data frame, and K being a positive integer (see ¶ [0010, 0032, and 0120] citation(s) as in limitation(s) above. More specifically: ¶ [0032]: “…Rather, the adjustment of the filter may be solely based on the decoded spectral values of a currently processed frame regardless of the coding scheme used for generating the encoded and the decoded representations of the audio signal, and possibly decoded spectral values of one or more previously decoded frames and/or one or more subsequently decoded frames.”
The Examiner notes that the one or more previously/subsequently decoded frames correspond to a positive integer number of frames (i.e., K ≥ 1).);
in a case that N target cepstrum coefficients of the target audio data frame are obtained (see ¶ [0038-0039]: “[0038] In an embodiment of the audio decoder, the filter is configured to obtain magnitude values |{circumflex over (X)}(k, n)| (which may, for example, describe an absolute value or an amplitude or a norm) of the enhanced audio representation according to |{circumflex over (X)}(k, n)|=M(k, n)*|{tilde over (X)}(k, n)|, wherein M(k, n) is a scaling value, wherein k is a frequency index (e.g. designating different frequency bins or frequency ranges), wherein n is a time index (e.g. designating different overlapping or non-overlapping frames), and wherein |{tilde over (X)}(k, n)| is a magnitude value of a spectral value of decoded audio representation. The magnitude value |{tilde over (X)}(k, n)| can be a magnitude, an absolute value, or any norm of a spectral value obtained by applying a time-frequency transform like SIFT (Short-term Fourier transform), FFT or MDCT, to the decoded audio signal. [0039] Alternatively, the filter may be configured to obtain values {circumflex over (X)}(k, n) of the enhanced audio representation according to {circumflex over (X)}(k, n)=M(k, n)*{tilde over (X)}(k, n), wherein M(k, n) is a scaling value, wherein k is a frequency index (e.g. designating different frequency bins or frequency ranges), wherein n is a time index (e.g. designating different overlapping or non-overlapping frames), and wherein {tilde over (X)}(k, n) is a spectral value of the decoded audio representation.” and
¶ [0071]: “In an embodiment of the audio decoder, the filter is configured to obtain short term Fourier transform coefficients (e.g. {tilde over (X)}(k, n)) which represent the spectral values of the decoded audio representation, which are associated with different frequency bins or frequency ranges.”,
¶ [0080]: “The apparatus is configured to obtain spectral values (e.g. magnitudes or phases or MDCT coefficients, e.g. represented by magnitude values, e.g. |{tilde over (X)}(k, n)|)of the decoded audio representation, which are associated with different frequency bins or frequency ranges.”),
obtaining, based on the N target cepstrum coefficients (see ¶ [0179-0181]: “[0179] 6.1.1 Ideal Ratio Mask (IRM) [0180] From a very simplistic mathematical point of view, one can describe the coded speech x(n), e.g., a decoded speech provided by a decoder core, (e.g., the decoder core 120 or the decoder core 320 or the decoder core 430 or the decoder core 530) as: {tilde over (x)}(n)=x(n)+δ(n) (1) where x(n) is the input to the encoder (e.g., to the audio encoder 410, 510) and δ(n) is the quantization noise. The quantization noise δ(n) is correlated to the input speech since ACELP uses perceptual models during the quantization process. This correlation property of the quantization noise makes our post-filtering problem unique to speech enhancement problem which assumes the noise to be uncorrelated. In order to reduce the quantization noise, we estimate a real valued mask per time-frequency bin and multiply this mask with that of magnitude of the coded speech for that time-frequency bin. |{circumflex over (X)}(k, n)|=M(k, n)* |{tilde over (X)}(k, n)| (2) where M(k, n) is the real valued mask, {tilde over (X)}(k, n) is magnitude of the coded speech, {circumflex over (X)}(k, n) is the magnitude of enhanced speech, k is the frequency index and n is the time index. If our mask is ideal (e.g., if the scaling values M(k, n) are ideal), we can reconstruct the clean speech from coded speech. |X(k, n)|=IRM(k, n)*|{tilde over (X)}(k, n)| (3) where |X(k, n)| is the magnitude of the clean speech. [0181] Comparing the Eq. 2 and 3, we obtain the ideal ratio mask (IRM) (e.g., an ideal value of the scaling values M (k, n)) and is given by IRM(k, n)=|X(k, n)|/(|{tilde over (X)}(k, n)|+γ) (4) where γ is very small constant factor to prevent division by zero. Since the magnitude values lies in the range [0, ∞], the values of IRM also lie in the range [0, ∞].” and
¶ [0222-0226]: “[0222] According to a first aspect, a mask-based post-filter to enhance the quality of the coded speech is used in embodiments according to the invention. [0223] a. The mask is real valued (or the scaling values are real-valued). It is estimated for each frequency bin by a machine-learning algorithm (or by a neural network) from the input features [0224] b. {circumflex over (X)}(k, n)=M.sub.est(k, n)*{tilde over (X)}(k, n) [0225] c. Where M.sub.est(k, n) is the estimated mask, {tilde over (X)}(k, n) is the magnitude value of coded speech and {tilde over (X)}(k, n) is the post-processed speech at frequency bin k and time index n [0226] d. The input features used currently are log magnitude spectrum but can also be any derivative of magnitude spectrum.”),
N being a positive integer greater than 1, and M being a positive integer less than N (see ¶ [0179-0181] citations as in limitation(s) above. More specifically: ¶ [0181]: “… Since the magnitude values lies in the range [0, ∞], the values of IRM also lie in the range [0, ∞].” and ¶ [0226] “…d. The input features used currently are log magnitude spectrum but can also be any derivative of magnitude spectrum.” ) ;
obtaining N historical cepstrum coefficients corresponding to each historical audio data frame (see ¶ [0010, 0032, and 0120] citation(s) as in limitation(s) above. More specifically: ¶ [0071]: “In an embodiment of the audio decoder, the filter is configured to obtain short term Fourier transform coefficients (e.g. {tilde over (X)}(k, n)) which represent the spectral values of the decoded audio representation, which are associated with different frequency bins or frequency ranges.”, and
¶ [0032]: “…Rather, the adjustment of the filter may be solely based on the decoded spectral values of a currently processed frame regardless of the coding scheme used for generating the encoded and the decoded representations of the audio signal, and possibly decoded spectral values of one or more previously decoded frames and/or one or more subsequently decoded frames.” ), and
determining, based on obtained K×N historical cepstrum coefficients, a dynamic spectrum feature associated with the target audio data frame (see ¶ [0038-0039]: “[0038] … The magnitude value |{tilde over (X)}(k, n)| can be a magnitude, an absolute value, or any norm of a spectral value obtained by applying a time-frequency transform like SIFT (Short-term Fourier transform), FFT or MDCT, to the decoded audio signal. [0039] Alternatively, the filter may be configured to obtain values {circumflex over (X)}(k, n) of the enhanced audio representation according to {circumflex over (X)}(k, n)=M(k, n)*{tilde over (X)}(k, n), wherein M(k, n) is a scaling value, wherein k is a frequency index (e.g. designating different frequency bins or frequency ranges), wherein n is a time index (e.g. designating different overlapping or non-overlapping frames), and wherein {tilde over (X)}(k, n) is a spectral value of the decoded audio representation.”
¶ [0046]: “It has been found that it is advantageous to provide logarithmic magnitudes of spectral values, amplitudes of spectral values or norms of spectral values as input signals of the neural network or of the machine-learning structure. It has been found that the sign or the phase of the spectral values is of subordinate importance for the adjustment of the filter, i.e. for the determination of the scaling values. In particular, it has been found that logarithmizing magnitudes of the spectral values of the decoded audio representation is particularly advantageous, since a dynamic range can be reduced. It has been found that a neural network or a machine-learning structure can typically better handle logarithmized magnitudes of the spectral values when compared to the spectral values themselves, since the spectral values typically have a high dynamic range. By using logarithmized values, it is also possible to use a simplified number representation in the (artificial) neural network or in the machine-learning structure, since it is often not needed to use a floating point number of representation. Rather, it is possible to design the neural network or the machine-learning structure using a fixed point number representation, which significantly reduces an implementation effort.”, and
¶ [0153]: “Accordingly, the scaling 338 may, for example, multiply the spectral values which are input into the scaling 338 with the scaling values, wherein different scaling values are associated with different frequency bins or frequency ranges…”); and
inputting the N target cepstrum coefficients, the (see ¶ [0046]: “It has been found that it is advantageous to provide logarithmic magnitudes of spectral values, amplitudes of spectral values or norms of spectral values as input signals of the neural network or of the machine-learning structure. It has been found that the sign or the phase of the spectral values is of subordinate importance for the adjustment of the filter, i.e. for the determination of the scaling values. In particular, it has been found that logarithmizing magnitudes of the spectral values of the decoded audio representation is particularly advantageous, since a dynamic range can be reduced. and further:
¶ [0096]: “[0096] The method comprises obtaining a plurality of scaling values (e.g. mask values, e.g. M(k, n)), which may, for example, be real valued and which may, for example, be non-negative, and which may, for example, be limited to a predetermined range, and which are associated with different frequency bins or frequency ranges (e.g. having frequency bin index or frequency range index k), on the basis of spectral values of the decoded audio representation which are associated with different frequency bins or frequency ranges (e.g. having frequency bin index or frequency range index k).”, and
¶ [0221-0226]: “[0221] In the following, some additional important points will be described. [0222] According to a first aspect, a mask-based post-filter to enhance the quality of the coded speech is used in embodiments according to the invention. [0223] a. The mask is real valued (or the scaling values are real-valued). It is estimated for each frequency bin by a machine-learning algorithm (or by a neural network) from the input features [0224] b. {circumflex over (X)}(k, n)=M.sub.est(k, n)*{tilde over (X)}(k, n) [0225] c. Where M.sub.est(k, n) is the estimated mask, {tilde over (X)}(k, n) is the magnitude value of coded speech and {tilde over (X)}(k, n) is the post-processed speech at frequency bin k and time index n [0226] d. The input features used currently are log magnitude spectrum but can also be any derivative of magnitude spectrum.); and
applying the target mask to obtain enhanced audio data corresponding to the raw audio data by suppressing noise data in the raw audio data (see ¶ [0046, 0096, and 0221-0226] citation(s) as in limitation(s) above. More specifically and/or further: ¶ [0097]: “The method comprises scaling spectral values of the decoded audio signal representation (e.g. {tilde over (X)}(k, n)), or a pre-processed version thereof, using the scaling values (e.g. M(k, n)), to obtain the enhanced audio representation (e.g. {circumflex over (X)}(k, n))” and
¶ [0222]: “According to a first aspect, a mask-based post-filter to enhance the quality of the coded speech is used in embodiments according to the invention. [0223] a. The mask is real valued (or the scaling values are real-valued)…”).
However, Fuchs et al. does not explicitly teach, but Ichikawa et al. does teach:
obtaining, based on the N target cepstrum coefficients, M first-order time derivatives and M second-order time derivatives that are associated with the target audio data frame (see ¶ [0041]: “A feature extraction unit 210 receives a spectrum of a speech signal (or a spectrum of a speech signal from which noise was eliminated), extracts a static or dynamic feature, and outputs the feature. Conventionally, a combination of an MFCC (Mel-Frequency Cepstral Coefficient) and its delta (first-order, difference) and delta-delta (second-order difference), or linear transforms of these are often used. They are extracted as static or dynamic features.”)
inputting the N target cepstrum coefficients, the M first-order time derivatives, the M second-order time derivatives, and the dynamic spectrum feature into a target mask estimation model to obtain a target mask corresponding to the target audio data frame (see ¶ [0031]: “The feature vector calculation means which receives an output from the vowel enhancement means or the masking means as an input may extract any feature that can be calculated by a known calculation method, such as a combination of a cepstral coefficient such as MFCC and its delta (first order difference) and delta-delta (second-order difference), or LDA (Linear Discriminant Analysis), which is a linear transform of these.” and
¶ [0069]: “The feature vector calculation unit 335 receives an output from the masking unit 345 as an input and extracts a speech feature from the input. The feature vector calculation unit 335 outputs the extracted speech feature, together with the time-series data of the maximum CSP coefficient values output from the time-series data generation unit 330, as a speech feature vector. Here, the input from the masking unit 345 is a spectrum in which sound from a non-harmonic structure is weakened. The feature that the feature vector calculation unit 335 extracts from the input from the masking unit 345 can be any feature that can be calculated with a known calculation method, for example a combination of a cepstral coefficient such as an MFCC and its delta (first-order difference) or delta-delta (second-order difference), or linear transforms of these.”)
Fuchs et al. and Ichikawa et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. to incorporate the teachings of Ichikawa et al. of obtaining, based on the N target cepstrum coefficients, M first-order time derivatives and M second-order time derivatives that are associated with the target audio data frame and inputting the N target cepstrum coefficients, the M first-order time derivatives, the M second-order time derivatives, and the dynamic spectrum feature into a target mask estimation model to obtain a target mask corresponding to the target audio data frame which provides the benefit of being capable of improving the accuracy of speech recognition even under a very low SNR condition ([0011] of Ichikawa et al.).
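For illustration only, the following is a minimal NumPy sketch of one way the feature assembly and mask-based noise suppression recited in claim 1 could be arranged. The function and variable names (assemble_features, enhance_frame, mask_model, dyn_feature) are hypothetical and do not reproduce the applicant's or the cited references' implementations.

```python
# Minimal sketch, assuming per-frame cepstrum coefficients are already available.
import numpy as np

def assemble_features(target_cepstra, dyn_feature, M):
    """target_cepstra: (N,) cepstrum coefficients of the target spectral frame.
    dyn_feature: dynamic spectrum feature derived from the K historical frames."""
    d1 = np.diff(target_cepstra)[:M]        # M first-order time derivatives
    d2 = np.diff(target_cepstra, n=2)[:M]   # M second-order time derivatives
    return np.concatenate([target_cepstra, d1, d2, np.atleast_1d(dyn_feature)])

def enhance_frame(noisy_mag, features, mask_model):
    """mask_model stands in for a trained mask estimation model that maps the
    feature vector to a per-bin mask (same length as noisy_mag)."""
    mask = mask_model(features)             # target mask for the target frame
    return mask * noisy_mag                 # suppress noise by scaling the bins
```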
As to independent claim 13, Fuchs et al. in combination with Ichikawa et al. teach the limitations as in claim 1, above.
Fuchs et al. further teaches:
13. A computer device (see ¶ [0014]: “Another embodiment may have a non-transitory digital storage medium having a computer program stored thereon to perform any one of the inventive methods when said computer program is run by a computer.”), comprising:
a processor and a memory (see ¶ [0250]: “…Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.” and
¶ [0261]: “A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.”),
the processor being connected to the memory, the memory being configured to store a computer program, and the processor being configured to invoke the computer program, so that the computer device performs an audio data processing method (see ¶ [0250 and 0261] citation(s) as in limitation(s) above. More specifically: ¶ [0261]: “A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.”), comprising:
[the limitations as in claim 1, above].
Regarding claims 2 and 14, Fuchs et al. in combination with Ichikawa et al. teach the limitations as in claims 1 and 13, above.
Fuchs et al. further teaches:
2 and 14. The method/computer device according to claims 1 and 13,
wherein the obtaining a target audio data frame and K historical audio data frames that are associated with raw audio data (see ¶ [0010, 0032, and 0120] citation(s) as in claims 1 and 13, above) comprises:
Ichikawa et al. further teaches:
performing framing and windowing preprocessing on the raw audio data to obtain H audio data segments, H being a positive integer greater than 1 (see ¶ [0012]: “The first aspect of the present invention provides a speech signal processing system including: framing unit dividing an input speech signal into frames so that a pair of consecutive frames have a frame shift length greater than or equal to one period of the speech signal and have an overlap greater than or equal to a predetermined length; discrete Fourier transform means applying discrete Fourier transform to each of the frames and outputting a spectrum of the speech signal…” and
¶ [0040]: “FIG. 2 illustrates a configuration of a typical conventional speech recognition device 200. A pre-processing unit 205 receives a digital speech signal converted from an analog speech signal, divides the signal into frames by an appropriate method such as a Hann window or a Hamming window, then applies discrete Fourier transform to the frames to output spectra of the speech signal.”);
performing time-frequency transform on each audio data segment to obtain an audio data frame corresponding to each audio data segment (see ¶ [0012 and 0040] citation as in limitation above. More specifically: “[0012] …discrete Fourier transform means applying discrete Fourier transform to each of the frames and outputting a spectrum of the speech signal…
[0040] …then applies discrete Fourier transform to the frames to output spectra of the speech signal…”); and
determining, from obtained H audio data frames, the target audio data frame and K historical audio data frames preceding the target audio data frame, K being less than H (see ¶ [0012 and 0040] citation as in limitation above. More specifically: “[0012] …framing unit dividing an input speech signal into frames so that a pair of consecutive frames have a frame shift length greater than or equal to one period of the speech signal and have an overlap greater than or equal to a predetermined length; …”).
Fuchs et al. and Ichikawa et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. to incorporate the teachings of Ichikawa et al. of performing framing and windowing preprocessing on the raw audio data to obtain H audio data segments, H being a positive integer greater than 1; performing time-frequency transform on each audio data segment to obtain an audio data frame corresponding to each audio data segment; and determining, from obtained H audio data frames, the target audio data frame and K historical audio data frames preceding the target audio data frame, K being less than H which provides the benefit of being capable of improving the accuracy of speech recognition even under a very low SNR condition ([0011] of Ichikawa et al.).
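As a point of reference for the framing/windowing mapping above, a brief sketch of Hann-window framing followed by a per-segment FFT is given below; the frame length, hop size, and function names are assumptions for illustration, not taken from Ichikawa.

```python
# Hedged sketch: frame the raw audio, apply a Hann window, and transform each
# segment, yielding one spectral frame per segment.
import numpy as np

def frame_and_transform(raw, frame_len=256, hop=128):
    win = np.hanning(frame_len)
    H = 1 + (len(raw) - frame_len) // hop           # H audio data segments
    frames = []
    for h in range(H):
        seg = raw[h * hop: h * hop + frame_len] * win
        frames.append(np.fft.rfft(seg))             # time-frequency transform
    return np.array(frames)                         # shape (H, frame_len // 2 + 1)

# The target audio data frame is then e.g. frames[t], with frames[t - K:t]
# serving as the K historical audio data frames preceding it (K < H).
```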
Regarding claims 6 and 18, Fuchs et al. in combination with Ichikawa et al. teach the limitations as in claims 1 and 13, above.
Fuchs et al. further teaches:
6 and 18. The method/computer device according to claims 1 and 13,
wherein the obtaining, based on the N target cepstrum coefficients, M first-order time derivatives and M second-order time derivatives that are associated with the target audio data frame (see ¶ [0179-0181] and ¶ [0222-0226] citations as in claim 1, above.) comprises:
Ichikawa et al. further teaches:
performing a differential operation on the N target cepstrum coefficients to obtain (N–1) differential operation values, using each of the (N–1) differential operation values as a first-order time derivative, and obtaining, from the (N–1) first-order time derivatives, the M first-order time derivatives associated with the target audio data frame (see ¶ [0031]: “The feature vector calculation means which receives an output from the vowel enhancement means or the masking means as an input may extract any feature that can be calculated by a known calculation method, such as a combination of a cepstral coefficient such as MFCC and its delta (first order difference) and delta-delta (second-order difference), or LDA (Linear Discriminant Analysis), which is a linear transform of these.”); and
performing a secondary differential operation on the (N–1) first-order time derivatives to obtain (N–2) differential operation values, using each of the (N–2) differential operation values as a second-order time derivative, and obtaining, from the (N–2) second-order time derivatives, the M second-order time derivatives associated with the target audio data frame (see ¶ [0031]: “The feature vector calculation means which receives an output from the vowel enhancement means or the masking means as an input may extract any feature that can be calculated by a known calculation method, such as a combination of a cepstral coefficient such as MFCC and its delta (first order difference) and delta-delta (second-order difference), or LDA (Linear Discriminant Analysis), which is a linear transform of these.”).
Fuchs et al. and Ichikawa et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. to incorporate the teachings of Ichikawa et al. of performing a differential operation on the N target cepstrum coefficients to obtain (N–1) differential operation values, using each of the (N–1) differential operation values as a first-order time derivative, and obtaining, from the (N–1) first-order time derivatives, the M first-order time derivatives associated with the target audio data frame; and performing a secondary differential operation on the (N–1) first-order time derivatives to obtain (N–2) differential operation values, using each of the (N–2) differential operation values as a second-order time derivative, and obtaining, from the (N–2) second-order time derivatives, the M second-order time derivatives associated with the target audio data frame which provides the benefit of being capable of improving the accuracy of speech recognition even under a very low SNR condition ([0011] of Ichikawa et al.).
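A short sketch of the successive-difference reading of the delta / delta-delta limitation discussed above follows; it assumes the N target cepstrum coefficients are held in a NumPy array, and the names are illustrative only.

```python
# (N-1) first differences and (N-2) second differences, truncated to M values each.
import numpy as np

def cepstral_derivatives(c, M):
    """c: (N,) target cepstrum coefficients of the target audio data frame."""
    first = np.diff(c)             # N-1 differential operation values
    second = np.diff(first)        # N-2 values from the secondary differential operation
    return first[:M], second[:M]   # M first-order and M second-order time derivatives
```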
Regarding claims 7 and 19, Fuchs et al. in combination with Ichikawa et al. teach the limitations as in claims 1 and 13, above.
Fuchs et al. further teaches:
7 and 19. The method/computer device according to claims 1 and 13,
wherein the obtaining N historical cepstrum coefficients corresponding to each historical audio data frame (see ¶ [0010, 0032, and 0120] citation(s) as in claims 1 and 13, above. More specifically: ¶ [0071]: “In an embodiment of the audio decoder, the filter is configured to obtain short term Fourier transform coefficients (e.g. {tilde over (X)}(k, n)) which represent the spectral values of the decoded audio representation, which are associated with different frequency bins or frequency ranges.”, and
¶ [0032]: “…Rather, the adjustment of the filter may be solely based on the decoded spectral values of a currently processed frame regardless of the coding scheme used for generating the encoded and the decoded representations of the audio signal, and possibly decoded spectral values of one or more previously decoded frames and/or one or more subsequently decoded frames.” ) comprises:
obtaining any two adjacent historical audio data frames from the K historical audio data frames as a first historical audio data frame and a second historical audio data frame, the second historical audio data frame being a spectral frame obtained after the first historical audio data frame (see ¶ [0010, 0032, and 0120] citation(s) as in claims 1 and 13, above. ); and
obtaining, from a cache related to the target audio data frame, N historical cepstrum coefficients corresponding to the first historical audio data frame and N historical cepstrum coefficients corresponding to the second historical audio data frame (see ¶ [0010, 0032, and 0120] citation(s) as in claims 1 and 13, above and further ¶ [0251]: “The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet…”).
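For the cache-based reading of claims 7 and 19 above, a small illustrative sketch follows; the cache layout (a dictionary keyed by frame index) is an assumption made for illustration.

```python
# Hypothetical cache of per-frame cepstrum coefficients, so the coefficients of
# two adjacent historical frames can be looked up without recomputation.
import numpy as np

class CepstrumCache:
    def __init__(self):
        self._store = {}                           # frame index -> (N,) coefficients

    def put(self, frame_idx, cepstra):
        self._store[frame_idx] = np.asarray(cepstra)

    def adjacent_pair(self, first_idx):
        """Return the N coefficients of a first historical frame and of the
        second (immediately following) historical frame."""
        return self._store[first_idx], self._store[first_idx + 1]
```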
Regarding claims 8 and 20, Fuchs et al. in combination with Ichikawa et al. teach the limitations as in claims 7 and 19, above.
Fuchs et al. further teaches:
8 and 20. The method/computer device according to claims 7 and 19,
wherein the determining, based on obtained K×N historical cepstrum coefficients, a dynamic spectrum feature associated with the target audio data frame (see ¶ [0038-0039]: “[0038] … The magnitude value |{tilde over (X)}(k, n)| can be a magnitude, an absolute value, or any norm of a spectral value obtained by applying a time-frequency transform like SIFT (Short-term Fourier transform), FFT or MDCT, to the decoded audio signal. [0039] Alternatively, the filter may be configured to obtain values {circumflex over (X)}(k, n) of the enhanced audio representation according to {circumflex over (X)}(k, n)=M(k, n)*{tilde over (X)}(k, n), wherein M(k, n) is a scaling value, wherein k is a frequency index (e.g. designating different frequency bins or frequency ranges), wherein n is a time index (e.g. designating different overlapping or non-overlapping frames), and wherein {tilde over (X)}(k, n) is a spectral value of the decoded audio representation.”
¶ [0046]: “It has been found that it is advantageous to provide logarithmic magnitudes of spectral values, amplitudes of spectral values or norms of spectral values as input signals of the neural network or of the machine-learning structure. It has been found that the sign or the phase of the spectral values is of subordinate importance for the adjustment of the filter, i.e. for the determination of the scaling values. In particular, it has been found that logarithmizing magnitudes of the spectral values of the decoded audio representation is particularly advantageous, since a dynamic range can be reduced. It has been found that a neural network or a machine-learning structure can typically better handle logarithmized magnitudes of the spectral values when compared to the spectral values themselves, since the spectral values typically have a high dynamic range. By using logarithmized values, it is also possible to use a simplified number representation in the (artificial) neural network or in the machine-learning structure, since it is often not needed to use a floating point number of representation. Rather, it is possible to design the neural network or the machine-learning structure using a fixed point number representation, which significantly reduces an implementation effort.”, and
¶ [0153]: “Accordingly, the scaling 338 may, for example, multiply the spectral values which are input into the scaling 338 with the scaling values, wherein different scaling values are associated with different frequency bins or frequency ranges…”) comprises:
using N coefficient difference values between the N historical cepstrum coefficients corresponding to the first historical audio data frame and the N historical cepstrum coefficients corresponding to the second historical audio data frame as interframe difference values between the first historical audio data frame and the second historical audio data frame (see ¶ [0038-0039, 0046, and 0153] citation(s) as in limitation(s) above and further ¶ [0197]: “Our proposed post-filter computes short time Fourier transform (SIFT) of frames with length 16 ms with 50% overlap (8 ms) at 16 kHz sampling rate (e.g., in block 324). The time frames are windowed with hann window before fast Fourier transform (FFT) of length 256 was computed resulting in 129 frequency bins (e.g., spatial domain representation 326). From the FFT, log magnitude values are computed in order to compress the very high dynamic range of magnitude values (e.g., logorithmized absolute values 372). Since speech has temporal dependency, we used context frames around the processed time frame (e.g., designated with 373). We tested our proposed model in two conditions: a) only past context frames were used and b) both past and future context frames were used. This was done because the future context frames adds to the delay of the proposed post-filter and we wanted to test the benefit of using the future context frames. The context window of 3 was chosen for our experiments leading of delay of just one frame (16 ms) when only past context frames was considered. When both past and future context frames were considered, the delay of the proposed post-filter was 4 frames (64 ms).”); and
determining the dynamic spectrum feature associated with the target audio data frame based on K–1 interframe difference values between adjacent historical audio data frames in the K historical audio data frames (see ¶ [0038-0039, 0046, 0153, and 0197] citation(s) as in limitation(s) above, more specifically: ¶ [0046]: “It has been found that it is advantageous to provide logarithmic magnitudes of spectral values, amplitudes of spectral values or norms of spectral values as input signals of the neural network or of the machine-learning structure. It has been found that the sign or the phase of the spectral values is of subordinate importance for the adjustment of the filter, i.e. for the determination of the scaling values. In particular, it has been found that logarithmizing magnitudes of the spectral values of the decoded audio representation is particularly advantageous, since a dynamic range can be reduced. It has been found that a neural network or a machine-learning structure can typically better handle logarithmized magnitudes of the spectral values when compared to the spectral values themselves, since the spectral values typically have a high dynamic range.” and
¶ [0197]: “…We tested our proposed model in two conditions: a) only past context frames were used and b) both past and future context frames were used. This was done because the future context frames adds to the delay of the proposed post-filter and we wanted to test the benefit of using the future context frames. The context window of 3 was chosen for our experiments leading of delay of just one frame (16 ms) when only past context frames was considered. When both past and future context frames were considered, the delay of the proposed post-filter was 4 frames (64 ms).”).
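The inter-frame difference reading of claims 8 and 20 discussed above can be sketched as follows; averaging the K-1 difference vectors is only one of several ways such differences could be combined and is an assumption for illustration.

```python
# K-1 interframe difference vectors between adjacent historical frames,
# combined into a single dynamic spectrum feature.
import numpy as np

def dynamic_spectrum_feature(hist_cepstra):
    """hist_cepstra: (K, N) cepstrum coefficients of the K historical frames,
    ordered in time (first row is the earliest frame)."""
    diffs = np.diff(hist_cepstra, axis=0)   # (K-1, N) interframe difference values
    return diffs.mean(axis=0)               # one N-dimensional dynamic feature
```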
As to independent claim 12, Fuchs et al. in combination with Ichikawa et al. teach the limitations as in claim 1, above.
Fuchs et al. further teaches:
12. An audio data processing method, performed by a computer device (see ¶ [0010] citation as in claim 1, above.) and comprising:
obtaining a target sample audio data frame and K historical sample audio data frames that are associated with sample audio data (see ¶ [0010, 0032, and 0120] citation(s) as in claim 1, above.), and
obtaining a sample mask corresponding to the target sample audio data frame (see ¶ [0046, 0096, and 0221-0226] citation(s) as in claim 1, above.),
the target sample audio data frame and the K historical sample audio data frames being spectral frames (see ¶ [0010, 0032, and 0120] citation(s) as in claim 1, above.),
each of the K historical sample audio data frames being a spectral frame preceding the target sample audio data frame, and K being a positive integer (see ¶ [0010, 0032, and 0120] citation(s) as in claim 1, above.);
in a case that N target sample cepstrum coefficients of the target sample audio data frame are obtained (see ¶ [0038-0039, 0071 and 0080] citation(s) as in claim 1, above.),
obtaining, based on the N target sample cepstrum coefficients sample, (see ¶ [0179-0181 and 0222-0226] citation(s) as in claim 1, above.),
N being a positive integer greater than 1, and M being a positive integer less than N (see ¶ [0179-0181 and 0222-0226] citation(s) as in claim 1, above.);
obtaining N historical sample cepstrum coefficients corresponding to each historical sample audio data frame (see ¶ [0010, 0032, and 0120] citation(s) as in claim 1, above.), and
determining, based on obtained K×N historical sample cepstrum coefficients, a sample dynamic spectrum feature associated with the target sample audio data frame (see ¶ [0038-0039, 0046, and 0153] citation(s) as in claim 1, above.);
inputting the N target sample cepstrum coefficients, the sample audio data frame (see ¶ [0046, 0096, and 0221-0226] citation(s) as in claim 1, above.); and
performing iterative training on the initial mask estimation model based on the predicted mask and the sample mask to obtain a target mask estimation model, the target mask estimation model outputting a target mask corresponding to a target audio data frame associated with raw audio data (see ¶ [0171]: “It should be noted here that, if the enhanced audio representation 592 approximates the training audio representation 510 with a good accuracy, signal degradations caused by the lossy encoding are at least partially compensated by the scaling 590. Worded yet differently, the neural net training 596 may, for example, determine a (weighted) difference between the training audio representation 510 and the enhanced audio representation 592 and adjust the coefficients 594 of the machine-learning structure or of the neural network 580 in order to reduce or minimize this difference. The adjustment of the coefficients 594 may, for example, be performed in an iterative procedure.”), and
the target mask being used for suppressing noise data in the raw audio data to obtain enhanced audio data corresponding to the raw audio data (see ¶ [0046, 0096-0097, and 0221-0226] citation(s) as in claim 1, above.).
However, Fuchs et al. does not explicitly teach, but Ichikawa et al. does teach:
obtaining, based on the N target sample cepstrum coefficients sample, M sample first-order time derivatives and M sample second-order time derivatives that are associated with the target sample audio data frame (see ¶ [0041] citation(s) as in claim 1, above.),
inputting the N target sample cepstrum coefficients, the M sample first-order time derivatives, the M sample second-order time derivatives, and the sample dynamic spectrum feature to an initial mask estimation model, the initial mask estimation model outputting a predicted mask corresponding to the target sample audio data frame (see ¶ [0031 and 0069] citation(s) as in claim 1, above.)
Fuchs et al. and Ichikawa et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. to incorporate the teachings of Ichikawa et al. obtaining, based on the N target sample cepstrum coefficients sample, M sample first-order time derivatives and M sample second-order time derivatives that are associated with the target sample audio data frame, and inputting the N target sample cepstrum coefficients, the M sample first-order time derivatives, the M sample second-order time derivatives, and the sample dynamic spectrum feature to an initial mask estimation model, the initial mask estimation model outputting a predicted mask corresponding to the target sample audio data frame which provides the benefit of being capable of improving the accuracy of speech recognition even under a very low SNR condition ([0011] of Ichikawa et al.).
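The iterative-training reading of claim 12 above can be illustrated with a toy mask estimator; the linear-plus-sigmoid model, the mean-squared-error objective, and the learning-rate/iteration values are assumptions made for illustration and are not the applicant's or Fuchs' training procedure.

```python
# Toy iterative training: fit a sigmoid mask estimator so that the predicted
# mask approaches the sample (reference) mask.
import numpy as np

def train_mask_model(features, sample_masks, lr=1e-3, iters=500):
    """features: (T, F) feature vectors; sample_masks: (T, B) reference masks in [0, 1]."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(features.shape[1], sample_masks.shape[1]))
    for _ in range(iters):                           # iterative training
        pred = 1.0 / (1.0 + np.exp(-features @ W))   # predicted mask
        err = pred - sample_masks
        grad = features.T @ (err * pred * (1 - pred)) / len(features)
        W -= lr * grad                               # reduce prediction error
    return W                                         # parameters of the trained mask estimation model
```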
Claims 3 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Fuchs et al. (US 20220223161 A1) and further in view of Ichikawa et al. (US 20110301945 A1) as applied to claims 1 and 13 above, and further in view of Kong et al. (US 20040220800 A1).
Regarding claims 3 and 15, Fuchs et al. in combination with Ichikawa et al. teach the limitations as in claims 2 and 14, above.
Ichikawa et al. further teaches:
3 and 15. The method/computer device according to claims 2 and 14,
wherein the H audio data segments comprise an audio data segment i, i being a positive integer less than or equal to H (see ¶ [0012]: “The first aspect of the present invention provides a speech signal processing system including: framing unit dividing an input speech signal into frames so that a pair of consecutive frames have a frame shift length greater than or equal to one period of the speech signal and have an overlap greater than or equal to a predetermined length; discrete Fourier transform means applying discrete Fourier transform to each of the frames and outputting a spectrum of the speech signal…” and
¶ [0040]: “FIG. 2 illustrates a configuration of a typical conventional speech recognition device 200. A pre-processing unit 205 receives a digital speech signal converted from an analog speech signal, divides the signal into frames by an appropriate method such as a Hann window or a Hamming window, then applies discrete Fourier transform to the frames to output spectra of the speech signal.”); and
the performing time-frequency transform on each audio data segment to obtain an audio data frame corresponding to each audio data segment (see ¶ [0012 and 0040] citation as in limitation above. More specifically: “[0012] …discrete Fourier transform means applying discrete Fourier transform to each of the frames and outputting a spectrum of the speech signal…
[0040] …then applies discrete Fourier transform to the frames to output spectra of the speech signal…”)
Fuchs et al. and Ichikawa et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. to incorporate the teachings of Ichikawa et al. wherein the H audio data segments comprise an audio data segment i, i being a positive integer less than or equal to H; and the performing time-frequency transform on each audio data segment to obtain an audio data frame corresponding to each audio data segment which provides the benefit of being capable of improving the accuracy of speech recognition even under a very low SNR condition ([0011] of Ichikawa et al.).
However, Fuchs et al. in combination with Ichikawa et al. do not explicitly teach, but Kong et al. does teach:
comprises: performing Fourier transform on the audio data segment i to obtain a direct-current component frequency bin and 2S frequency bins for the audio data segment i in frequency domain (see ¶ [0084]: “FIG. 14 illustrates a relationship between a channel and a frequency bin which are used by the VAD 1320, according to an aspect of the invention. In a graph shown in FIG. 14, the horizontal axis indicates the frequency bin and the vertical axis indicates the channel. In this aspect, 128-point DFT is performed and 64 frequency bins are generated. However, actually, 62 frequency bins are used because a first frequency bin corresponding to a direct current component and a second frequency bin corresponding to a very low frequency component are excluded.”),
the 2S frequency bins comprising S frequency bins related to a first frequency bin type and S frequency bins related to a second frequency bin type, and S being a positive integer (see ¶ [0084] citation as in limitation above, more specifically: “…128-point DFT is performed and 64 frequency bins are generated…”); and
determining an audio data frame corresponding to the audio data segment i based on the S frequency bins related to the first frequency bin type and the direct-current component frequency bin (see ¶ [0084] citation as in limitation above, more specifically: “…However, actually, 62 frequency bins are used because a first frequency bin corresponding to a direct current component and a second frequency bin corresponding to a very low frequency component are excluded.”).
Fuchs et al., Ichikawa et al. and Kong et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/audio processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. in combination with Ichikawa et al. to incorporate the teachings of Kong et al. of performing Fourier transform on the audio data segment i to obtain a direct-current component frequency bin and 2S frequency bins for the audio data segment i in frequency domain, the 2S frequency bins comprising S frequency bins related to a first frequency bin type and S frequency bins related to a second frequency bin type, and S being a positive integer which provides the benefit of effectively receiving a target signal among signals input into a microphone array, a method of decreasing the amount of computation required for a multiple signal classification (MUSIC) algorithm ([0003] of Kong et al.).
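A brief sketch of the one-sided-spectrum reading of claims 3 and 15 follows; an odd segment length of 2S + 1 samples is assumed so that the FFT splits cleanly into a DC bin, S positive-frequency bins, and S conjugate bins.

```python
# FFT of a real, windowed segment: keep the DC bin and the S non-redundant bins.
import numpy as np

def segment_to_frame(segment_i):
    """segment_i: windowed audio data segment of odd length 2S + 1 (assumed)."""
    spec = np.fft.fft(segment_i)         # DC bin followed by 2S further bins
    S = (len(segment_i) - 1) // 2
    dc = spec[0:1]                       # direct-current component frequency bin
    first_type = spec[1:S + 1]           # S positive-frequency bins (first frequency bin type)
    second_type = spec[S + 1:]           # S conjugate bins (second type), redundant for real input
    return np.concatenate([dc, first_type])   # audio data frame for segment i
```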
Claims 4 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Fuchs et al. (US 20220223161 A1) and further in view of Ichikawa et al. (US 20110301945 A1) as applied to claims 1 and 13 above, and further in view of Kong et al. (US 20040220800 A1) and Smith et al. (US 20200066257 A1).
Regarding claims 4 and 16, Fuchs et al. in combination with Ichikawa et al. teach the limitations as in claims 1 and 13, above.
Fuchs et al. further teaches:
4 and 16. The method/computer device according to claims 1 and 13,
the obtaining of the N target cepstrum coefficients of the target audio data frame (see ¶ [0038-0039]: “[0038] In an embodiment of the audio decoder, the filter is configured to obtain magnitude values |{circumflex over (X)}(k, n)| (which may, for example, describe an absolute value or an amplitude or a norm) of the enhanced audio representation according to |{circumflex over (X)}(k, n)|=M(k, n)*|{tilde over (X)}(k, n)|, wherein M(k, n) is a scaling value, wherein k is a frequency index (e.g. designating different frequency bins or frequency ranges), wherein n is a time index (e.g. designating different overlapping or non-overlapping frames), and wherein |{tilde over (X)}(k, n)| is a magnitude value of a spectral value of decoded audio representation. The magnitude value |{tilde over (X)}(k, n)| can be a magnitude, an absolute value, or any norm of a spectral value obtained by applying a time-frequency transform like SIFT (Short-term Fourier transform), FFT or MDCT, to the decoded audio signal. [0039] Alternatively, the filter may be configured to obtain values {circumflex over (X)}(k, n) of the enhanced audio representation according to {circumflex over (X)}(k, n)=M(k, n)*{tilde over (X)}(k, n), wherein M(k, n) is a scaling value, wherein k is a frequency index (e.g. designating different frequency bins or frequency ranges), wherein n is a time index (e.g. designating different overlapping or non-overlapping frames), and wherein {tilde over (X)}(k, n) is a spectral value of the decoded audio representation.” and
¶ [0071]: “In an embodiment of the audio decoder, the filter is configured to obtain short term Fourier transform coefficients (e.g. {tilde over (X)}(k, n)) which represent the spectral values of the decoded audio representation, which are associated with different frequency bins or frequency ranges.”,
¶ [0080]: “The apparatus is configured to obtain spectral values (e.g. magnitudes or phases or MDCT coefficients, e.g. represented by magnitude values, e.g. |{tilde over (X)}(k, n)|)of the decoded audio representation, which are associated with different frequency bins or frequency ranges.”)
However, Fuchs et al. in combination with Ichikawa et al. do not explicitly teach, but Kong et al. does teach:
wherein the target audio data frame comprises S1 frequency bins (see ¶ [0084]: “FIG. 14 illustrates a relationship between a channel and a frequency bin which are used by the VAD 1320, according to an aspect of the invention. In a graph shown in FIG. 14, the horizontal axis indicates the frequency bin and the vertical axis indicates the channel. In this aspect, 128-point DFT is performed and 64 frequency bins are generated. However, actually, 62 frequency bins are used because a first frequency bin corresponding to a direct current component and a second frequency bin corresponding to a very low frequency component are excluded.”),
the S1 frequency bins comprise a direct-current component frequency bin and S2 frequency bins related to a frequency bin type (see ¶ [0084] citation as in limitation above, more specifically: “…However, actually, 62 frequency bins are used because a first frequency bin corresponding to a direct current component and a second frequency bin corresponding to a very low frequency component are excluded.”), and
both S1 and S2 are positive integers (see ¶ [0084] citation as in limitation above, more specifically: “…In this aspect, 128-point DFT is performed and 64 frequency bins are generated…”); and
Fuchs et al., Ichikawa et al. and Kong et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/audio processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. in combination with Ichikawa et al. to incorporate the teachings of Kong et al. of wherein the target audio data frame comprises S1 frequency bins, the S1 frequency bins comprise a direct-current component frequency bin and S2 frequency bins related to a frequency bin type, and both S1 and S2 are positive integers which provides the benefit of effectively receiving a target signal among signals input into a microphone array, a method of decreasing the amount of computation required for a multiple signal classification algorithm ([0003] of Kong et al.).
However, Fuchs et al. in combination with Ichikawa et al. and Kong et al. do not explicitly teach, but Smith et al. does teach:
comprises: mapping the S1 frequency bins to N acoustic bands, S1 being greater than or equal to N (see ¶ [0036]: “Additional processing of the audio signal may be required to accurately classify the event. For example, the feature extraction process may include computing a second spectrogram from the first spectrogram. In an illustrative embodiment, a mel-frequency based spectrogram is computed from the first spectrogram to approximate the perceptual scale of pitch for a human. Using a mel-frequency based spectrogram to process the first spectrogram simulates the processing of spectral information in about the same way as human hearing. The mel-frequency based spectrogram may be generated by mapping (e.g., binning) the first spectrogram to approximately 64 mel bins covering a range between about 125 and 7,500 Hz. Features of the MEL spectrogram may be framed into non-overlapping examples of 0.96 s, where each example covers 64 MEL bands and 96 frames of 10 ms each. In other embodiments, the number of mel bins and frequency range may be different. For example, the second spectrogram could cover within the range of human hearing (e.g., 20 Hz-20 kHz) or greater depending on the type of event being characterized. In yet other embodiments, a different feature extraction method may be used to isolate important information from the first spectrogram (e.g., by computing the root-mean-square (RMS) energy from each frame, computing mel-frequency cepstral coefficients, etc.).”); and
performing cepstrum processing on each acoustic band to obtain a target cepstrum coefficient corresponding to each acoustic band (see ¶ [0036] citation as in limitation above, more specifically: “…In yet other embodiments, a different feature extraction method may be used to isolate important information from the first spectrogram (e.g., by computing the root-mean-square (RMS) energy from each frame, computing mel-frequency cepstral coefficients, etc.).”).
Fuchs et al., Ichikawa et al., Kong et al., and Smith et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/audio processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. in combination with Ichikawa et al. and Kong et al. to incorporate the teachings of Smith et al. of mapping the S1 frequency bins to N acoustic bands, S1 being greater than or equal to N; and performing cepstrum processing on each acoustic band to obtain a target cepstrum coefficient corresponding to each acoustic band which provides the benefit of improving the accuracy of event classification ([0032] of Smith et al.).
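For illustration only, a minimal NumPy sketch of mapping S1 frequency bins onto N mel-style acoustic bands, in the spirit of the binning described in the Smith et al. citation; the filterbank construction, sampling rate, and dimensions below are illustrative assumptions rather than the cited implementation:

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_bands, n_bins, sr):
        # Triangular filters that map n_bins FFT bins onto n_bands mel (acoustic) bands.
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2)
        bin_pts = np.floor((n_bins - 1) * mel_to_hz(mel_pts) / (sr / 2.0)).astype(int)
        fb = np.zeros((n_bands, n_bins))
        for j in range(n_bands):
            left, center, right = bin_pts[j], bin_pts[j + 1], bin_pts[j + 2]
            for k in range(left, center):
                fb[j, k] = (k - left) / max(center - left, 1)
            for k in range(center, right + 1):
                fb[j, k] = (right - k) / max(right - center, 1)
        return fb

    # Illustrative assumptions: S1 = 257 frequency bins (512-point FFT), N = 64 acoustic bands.
    S1, N, sr = 257, 64, 16000
    power_spectrum = np.abs(np.fft.rfft(np.random.randn(512))) ** 2   # one frame, S1 bins
    band_energy = mel_filterbank(N, S1, sr) @ power_spectrum          # S1 bins -> N bands, S1 >= N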
Claims 5 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Fuchs et al. (US 20220223161 A1) and further in view of Ichikawa et al. (US 20110301945 A1), Kong et al. (US 20040220800 A1) and Smith et al. (US 20200066257 A1) as applied to claims 4 and 16 above, and further in view of Yao et al. (US 7966183 B1).
Regarding claims 5 and 17, Fuchs et al. in combination with Ichikawa et al., Kong et al., and Smith et al. teach the limitations as in claims 4 and 16, above.
Smith et al. further teaches:
5 and 17. The method/computer device according to claims 4 and 16,
wherein the N acoustic bands comprise an acoustic band j, j being a positive integer less than or equal to N (see ¶ [0036]: “Additional processing of the audio signal may be required to accurately classify the event. For example, the feature extraction process may include computing a second spectrogram from the first spectrogram. In an illustrative embodiment, a mel-frequency based spectrogram is computed from the first spectrogram to approximate the perceptual scale of pitch for a human. Using a mel-frequency based spectrogram to process the first spectrogram simulates the processing of spectral information in about the same way as human hearing. The mel-frequency based spectrogram may be generated by mapping (e.g., binning) the first spectrogram to approximately 64 mel bins covering a range between about 125 and 7,500 Hz. Features of the MEL spectrogram may be framed into non-overlapping examples of 0.96 s, where each example covers 64 MEL bands and 96 frames of 10 ms each. In other embodiments, the number of mel bins and frequency range may be different. For example, the second spectrogram could cover within the range of human hearing (e.g., 20 Hz-20 kHz) or greater depending on the type of event being characterized. In yet other embodiments, a different feature extraction method may be used to isolate important information from the first spectrogram (e.g., by computing the root-mean-square (RMS) energy from each frame, computing mel-frequency cepstral coefficients, etc.).”); and
the performing cepstrum processing on each acoustic band to obtain a target cepstrum coefficient corresponding to each acoustic band (see ¶ [0036] citation as in limitation above, more specifically: “…In yet other embodiments, a different feature extraction method may be used to isolate important information from the first spectrogram (e.g., by computing the root-mean-square (RMS) energy from each frame, computing mel-frequency cepstral coefficients, etc.).”)
Fuchs et al., Ichikawa et al., Kong et al., and Smith et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/audio processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. in combination with Ichikawa et al. and Kong et al. to incorporate the teachings of Smith et al. wherein the N acoustic bands comprise an acoustic band j, j being a positive integer less than or equal to N; and the performing cepstrum processing on each acoustic band to obtain a target cepstrum coefficient corresponding to each acoustic band which provides the benefit of improving the accuracy of event classification ([0032] of Smith et al.).
However, Fuchs et al. in combination with Ichikawa et al., Kong et al. and Smith et al. do not explicitly teach, but Yao et al. does teach:
comprises: obtaining band energy of the acoustic band j (see Col. 3, lines 5-50: “…apply a fast Fourier transform (e.g., 256-point FFT) to each frame of samples to convert to the spectral domain; compute the spectral energy density in each frame by squared absolute values of the transform; …”), and
performing logarithmic transform on the band energy of the acoustic band j to obtain logarithmic band energy of the acoustic band j (see Col. 3, lines 5-50, further: “(13) The section provides a brief example HMM recognizer (FIG. 1a top portion; FIG. 1b left portion). Presume triphone acoustic models which could be used for recognition as follows: sample input speech (e.g., at 8 kHz); partition the stream of samples into overlapping (windowed) frames (e.g., 160 samples per frame with 2/3 overlap); apply a fast Fourier transform (e.g., 256-point FFT) to each frame of samples to convert to the spectral domain; compute the spectral energy density in each frame by squared absolute values of the transform; apply a Mel frequency filter bank (e.g., 20 overlapping triangular filters) to the spectral energy density and integrate over each Mel subband to get a 20-component vector in the linear spectral energy domain for each frame; apply a logarithmic compression to convert to the log spectral energy domain; apply a 20-point discrete cosine transform (DCT) to decorrelate the 20-component log spectral vectors to convert to the cepstral domain with Mel frequency cepstral components (MFCC); take the 10 lowest frequency MFCCs as the feature vector for the frame plus, optionally, also include the rate of change (plus acceleration) of each component to give a 20- or 30-component feature vector with the rate of change (and acceleration) computed as differences from prior frames; compare the sequence of MFCC feature vectors for the frames to each of a set of models corresponding to a vocabulary of triphones for recognition; declare recognition of the triphone corresponding to the model with the highest score where the score for a model is the computed probability of observing the sequence of MFCC feature vectors for that model with a Viterbi type of computation; the probability computations use the model state transition probabilities together with the feature vector probability densities of the states (probability densities defined as mixtures of Gaussians allows for simple computations in the log probability domain). Note that the segmentation of the input sequence of MFCC feature vectors into phones is by the successive recognitions of triphones; silence and background noise typically are also represented as models in order to be recognized.”); and
performing discrete cosine transform on the logarithmic band energy of the acoustic band j to obtain a target cepstrum coefficient corresponding to the acoustic band j (see Col. 3, lines 5-50: “… apply a 20-point discrete cosine transform (DCT) to decorrelate the 20-component log spectral vectors to convert to the cepstral domain with Mel frequency cepstral components (MFCC);…”).
Fuchs et al., Ichikawa et al., Kong et al., Smith et al., and Yao et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/audio processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. in combination with Ichikawa et al., Kong et al., and Smith et al. to incorporate the teachings of Yao et al. of obtaining band energy of the acoustic band j; performing logarithmic transform on the band energy of the acoustic band j to obtain logarithmic band energy of the acoustic band j; and performing discrete cosine transform on the logarithmic band energy of the acoustic band j to obtain a target cepstrum coefficient corresponding to the acoustic band j which provides the benefit of improving performance (Col. 2, lines 50-52 of Yao et al.).
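For illustration only, the per-band log-energy and DCT steps described in the Yao et al. citation may be sketched as follows; the band energies below are random placeholders, and the sketch simply shows a logarithmic transform followed by a discrete cosine transform producing cepstrum coefficients:

    import numpy as np
    from scipy.fft import dct

    # Illustrative assumption: band energies for N = 20 acoustic bands (e.g. from a mel filterbank).
    band_energy = np.abs(np.random.randn(20)) + 1e-8

    # Logarithmic transform of each acoustic band's energy.
    log_band_energy = np.log(band_energy)

    # Discrete cosine transform of the log band energies yields the cepstrum coefficients
    # (an N-point DCT over the N acoustic bands).
    target_cepstrum = dct(log_band_energy, type=2, norm='ortho')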
Claims 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Fuchs et al. (US 20220223161 A1) and further in view of Ichikawa et al. (US 20110301945 A1) as applied to claim 1 above, and further in view of Borgstrom et al. (US 20230162758 A1).
Regarding claim 9, Fuchs et al. in combination with Ichikawa et al. teach the limitations as in claim 1, above.
Fuchs et al. further teaches:
9. The method according to claim 1,
the inputting the N target cepstrum coefficients, the (see ¶ [0038-0039]: “[0038] … The magnitude value |{tilde over (X)}(k, n)| can be a magnitude, an absolute value, or any norm of a spectral value obtained by applying a time-frequency transform like SIFT (Short-term Fourier transform), FFT or MDCT, to the decoded audio signal. [0039] Alternatively, the filter may be configured to obtain values {circumflex over (X)}(k, n) of the enhanced audio representation according to {circumflex over (X)}(k, n)=M(k, n)*{tilde over (X)}(k, n), wherein M(k, n) is a scaling value, wherein k is a frequency index (e.g. designating different frequency bins or frequency ranges), wherein n is a time index (e.g. designating different overlapping or non-overlapping frames), and wherein {tilde over (X)}(k, n) is a spectral value of the decoded audio representation.”
¶ [0046]: “It has been found that it is advantageous to provide logarithmic magnitudes of spectral values, amplitudes of spectral values or norms of spectral values as input signals of the neural network or of the machine-learning structure. It has been found that the sign or the phase of the spectral values is of subordinate importance for the adjustment of the filter, i.e. for the determination of the scaling values. In particular, it has been found that logarithmizing magnitudes of the spectral values of the decoded audio representation is particularly advantageous, since a dynamic range can be reduced. It has been found that a neural network or a machine-learning structure can typically better handle logarithmized magnitudes of the spectral values when compared to the spectral values themselves, since the spectral values typically have a high dynamic range. By using logarithmized values, it is also possible to use a simplified number representation in the (artificial) neural network or in the machine-learning structure, since it is often not needed to use a floating point number of representation. Rather, it is possible to design the neural network or the machine-learning structure using a fixed point number representation, which significantly reduces an implementation effort.”,
¶ [0096]: “[0096] The method comprises obtaining a plurality of scaling values (e.g. mask values, e.g. M(k, n)), which may, for example, be real valued and which may, for example, be non-negative, and which may, for example, be limited to a predetermined range, and which are associated with different frequency bins or frequency ranges (e.g. having frequency bin index or frequency range index k), on the basis of spectral values of the decoded audio representation which are associated with different frequency bins or frequency ranges (e.g. having frequency bin index or frequency range index k).”,
¶ [0153]: “Accordingly, the scaling 338 may, for example, multiply the spectral values which are input into the scaling 338 with the scaling values, wherein different scaling values are associated with different frequency bins or frequency ranges…”, and
¶ [0221-0226]: “[0221] In the following, some additional important points will be described. [0222] According to a first aspect, a mask-based post-filter to enhance the quality of the coded speech is used in embodiments according to the invention. [0223] a. The mask is real valued (or the scaling values are real-valued). It is estimated for each frequency bin by a machine-learning algorithm (or by a neural network) from the input features [0224] b. {circumflex over (X)}(k, n)=M.sub.est(k, n)*{tilde over (X)}(k, n) [0225] c. Where M.sub.est(k, n) is the estimated mask, {tilde over (X)}(k, n) is the magnitude value of coded speech and {tilde over (X)}(k, n) is the post-processed speech at frequency bin k and time index n [0226] d. The input features used currently are log magnitude spectrum but can also be any derivative of magnitude spectrum.”) comprises:
using the N target cepstrum coefficients, the (see ¶ [0038-0039, 0046, 0096, 0153 and 0221-0226] citations as in limitation above and further ¶ [0200-0201 and 0229]: “[0200] An FCNN is a simple neural network that has an input layer 610, one or more hidden layers 612a to 612d and an output layer 614. We implemented the FCNN in python with Keras [16] and used Tensorflow [17] as backend. In our experiments, we have used 4 hidden layers with 2048 units. All the 4 hidden layers used Rectified linear units (ReLU) as activation functions [18]. The output of hidden layers were normalized using batch normalization [19]. In order to prevent overfitting, we set the dropout [20] to 0.2. To train our FCNN, we used Adam optimizer [21] with learning rate 0.01 and the batch size used was 32. [0201] The dimension of the output layer 614 was 129. Since our FCNN estimates rel valued (or real valued) mask and these masks can any value between [0, ∞], we tested with both bounding the mask values and no bounding. When the mask values were unbounded, we used ReLU activation in our output layer. When the mask values were bounded, we either used bounded ReLU activation or sigmoid function and scaled the output of sigmoid activation by a certain scaling factor N. [0228] The estimated mask values lie, for example, in the range [0,∞]. In order to prevent such a large range, a threshold can optionally be set. In traditional speech enhancement algorithms, the mask is bounded to 1. In contrast we bound it to a threshold value that is greater than 1. This threshold value is determined by analyzing the mask distribution. Useful threshold values may, for example, lie anywhere between 2 to 10. [0229] a. Since the estimated mask values are, for example, bounded to a threshold and since the threshold valued is greater than 1, output layer can either be bounded rectified linear units ReLU or scaled sigmoid. [0230] b. When the machine learning algorithm is optimized using mask approximation MMSE (minimum mean square estimation optimization) method, the target mask (e.g. the target scaling values) can optionally be modified by either setting the mask values (e.g. the target scaling values) above the threshold in the target mask to 1 or can be set to threshold.”); and
inputting the hidden feature to the mask output layer, and performing, by the mask output layer, feature combination on the hidden feature to obtain the target mask corresponding to the target audio data frame (see ¶ [0046, 0096, 0200-0201, 0221-0226, and 0229] citations as in limitation(s) above, more specifically: “[0229] …In our experiments, we have used 4 hidden layers with 2048 units. All the 4 hidden layers used Rectified linear units (ReLU) as activation functions [18]. The output of hidden layers were normalized using batch normalization [19]. In order to prevent overfitting, we set the dropout [20] to 0.2. To train our FCNN, we used Adam optimizer [21] with learning rate 0.01 and the batch size used was 32. [0201] The dimension of the output layer 614 was 129. Since our FCNN estimates rel valued (or real valued) mask and these masks can any value between [0, ∞], we tested with both bounding the mask values and no bounding. When the mask values were unbounded, we used ReLU activation in our output layer. When the mask values were bounded, we either used bounded ReLU activation or sigmoid function and scaled the output of sigmoid activation by a certain scaling factor N. [0228] The estimated mask values lie, for example, in the range [0,∞]. In order to prevent such a large range, a threshold can optionally be set. In traditional speech enhancement algorithms, the mask is bounded to 1. In contrast we bound it to a threshold value that is greater than 1. This threshold value is determined by analyzing the mask distribution. Useful threshold values may, for example, lie anywhere between 2 to 10. [0229] a. Since the estimated mask values are, for example, bounded to a threshold and since the threshold valued is greater than 1, output layer can either be bounded rectified linear units ReLU or scaled sigmoid. [0230] b. When the machine learning algorithm is optimized using mask approximation MMSE (minimum mean square estimation optimization) method, the target mask (e.g. the target scaling values) can optionally be modified by either setting the mask values (e.g. the target scaling values) above the threshold in the target mask to 1 or can be set to threshold.”).
Ichikawa et al. further teaches:
inputting the N target cepstrum coefficients, the M first-order time derivatives, the M second-order time derivatives, and the dynamic spectrum feature into a target mask estimation model to obtain a target mask corresponding to the target audio data frame (see ¶ [0031]: “The feature vector calculation means which receives an output from the vowel enhancement means or the masking means as an input may extract any feature that can be calculated by a known calculation method, such as a combination of a cepstral coefficient such as MFCC and its delta (first order difference) and delta-delta (second-order difference), or LDA (Linear Discriminant Analysis), which is a linear transform of these.” and
¶ [0069]: “The feature vector calculation unit 335 receives an output from the masking unit 345 as an input and extracts a speech feature from the input. The feature vector calculation unit 335 outputs the extracted speech feature, together with the time-series data of the maximum CSP coefficient values output from the time-series data generation unit 330, as a speech feature vector. Here, the input from the masking unit 345 is a spectrum in which sound from a non-harmonic structure is weakened. The feature that the feature vector calculation unit 335 extracts from the input from the masking unit 345 can be any feature that can be calculated with a known calculation method, for example a combination of a cepstral coefficient such as an MFCC and its delta (first-order difference) or delta-delta (second-order difference), or linear transforms of these.”)
Fuchs et al. and Ichikawa et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. to incorporate the teachings of Ichikawa et al. of obtaining, based on the N target cepstrum coefficients, M first-order time derivatives and M second-order time derivatives that are associated with the target audio data frame and inputting the N target cepstrum coefficients, the M first-order time derivatives, the M second-order time derivatives, and the dynamic spectrum feature into a target mask estimation model to obtain a target mask corresponding to the target audio data frame which provides the benefit of being capable of improving the accuracy of speech recognition even under a very low SNR condition ([0011] of Ichikawa et al.).
However, Fuchs et al. in combination with Ichikawa et al. do not explicitly teach, but Borgstrom et al. does teach:
wherein the target mask estimation model comprises a mask estimation network layer and a mask output layer (see Fig. 4B (433-436: plurality of layers connected) and ¶ [0067-0068]: “[0067] The Mask Estimation Network [0068] In the b Net architecture of the example system 400, enhancement can be performed via attention masking in the embedding space defined by f.sub.enc so that interfering signal components can be appropriately attenuated. The goal of the mask estimation block 430 in FIG. 4A can be to generate a multiplicative mask, with outputs within the range [0,1], which can provide the desired attenuation. FIG. 4B illustrates a procedure 430 that can be used to generate the attention mask applied to the embedding features. The encoder 421 outputs (461 of FIG. 4C) can be cepstral normalized 431, forwarded though a multi-layer FCN 432 (e.g., including a plurality of FCN layers 433, 434, 435, 436, etc.), and finally scaled by a frame-level voice activity detection (VAD) term 439 to produce the attention masking elements. The individual components of the estimation procedure 430 are detailed below.”);
Fuchs et al., Ichikawa et al., and Borgstrom et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/audio processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. in combination with Ichikawa et al. to incorporate the teachings of Borgstrom et al. of wherein the target mask estimation model comprises a mask estimation network layer and a mask output layer which provides the benefit of improving the performance of automated speech systems (abstract of Borgstrom et al.).
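For illustration only, a minimal NumPy sketch of a mask estimation model consisting of a mask estimation network layer followed by a mask output layer; the dimensions, random weights, and sigmoid bounding below are illustrative assumptions and do not reproduce the cited networks of Fuchs et al. or Borgstrom et al.:

    import numpy as np

    rng = np.random.default_rng(0)

    def relu(x):
        return np.maximum(x, 0.0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Illustrative dimensions: a 100-dim target audio feature (e.g. cepstrum coefficients, their
    # first/second-order time derivatives, and a dynamic spectrum feature), 256 hidden units,
    # and 129 mask values (one per frequency bin).
    d_in, d_hidden, d_out = 100, 256, 129
    W1, b1 = 0.01 * rng.standard_normal((d_hidden, d_in)), np.zeros(d_hidden)
    W2, b2 = 0.01 * rng.standard_normal((d_out, d_hidden)), np.zeros(d_out)

    target_audio_feature = rng.standard_normal(d_in)

    # Mask estimation network layer: mask estimation on the target audio feature -> hidden feature.
    hidden_feature = relu(W1 @ target_audio_feature + b1)

    # Mask output layer: feature combination on the hidden feature -> bounded target mask.
    target_mask = sigmoid(W2 @ hidden_feature + b2)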
Regarding claim 10, Fuchs et al. in combination with Ichikawa et al. and Borgstrom et al. teach the limitations as in claim 9, above.
Borgstrom et al. further teaches:
10. The method according to claim 9,
wherein the mask estimation network layer comprises a first mask estimation network layer, a second mask estimation network layer, and a third mask estimation network layer that have a skip connection (see Fig. 4B (433-436: plurality of layers connected) and ¶ [0067-0068]: “[0067] The Mask Estimation Network [0068] In the b-Net architecture of the example system 400, enhancement can be performed via attention masking in the embedding space defined by f.sub.enc so that interfering signal components can be appropriately attenuated. The goal of the mask estimation block 430 in FIG. 4A can be to generate a multiplicative mask, with outputs within the range [0,1], which can provide the desired attenuation. FIG. 4B illustrates a procedure 430 that can be used to generate the attention mask applied to the embedding features. The encoder 421 outputs (461 of FIG. 4C) can be cepstral normalized 431, forwarded though a multi-layer FCN 432 (e.g., including a plurality of FCN layers 433, 434, 435, 436, etc.), and finally scaled by a frame-level voice activity detection (VAD) term 439 to produce the attention masking elements. The individual components of the estimation procedure 430 are detailed below.”); and
the inputting the target audio feature to the mask estimation network layer, and performing, by the mask estimation network layer, mask estimation on the target audio feature to obtain a hidden feature corresponding to the target audio feature (see Fig. 4B (433-436: plurality of layers connected) and ¶ [0067-0068] citations as in limitation(s) above. More specifically and/or further ¶ [0070]: “[0070] Mask Estimation: The normalized encoder features of Equation 7 can be applied to an FCN, as shown in FIG. 4D. The FCN can include a series of generalized convolutional blocks 433, each comprising a CNN filter 471, batch normalization 472, an activation 473, and a Squeeze and Excitation Network (SENet) 474. Each layer (e.g., FCN layers 433, 434, 435, 436, etc. of FIG. 4B) of the FCN can be a specific configuration of this generalized block 433. Table 1 specifies one non-limiting example set of layer parameters…”) comprises:
inputting the target audio feature to the first mask estimation network layer, the first mask estimation network layer outputting a first intermediate feature (see Fig. 4B (433-436: more specifically: 431-433) and ¶ [0067-0068 and 0070] citations as in limitation(s) above and further ¶ [0069 and 0071]: “[0069] … the CNN outputs are unit normalized across each filter by first subtracting a filter-dependent Global Mean 464 and element-wise dividing by the filter-dependent Global Standard Deviation 465…
[0071] As can be observed in Table 1, the first five layers exhibit increasing filter dilation rates, allowing the FCN to summarize increasing temporal contexts. The next four layers apply 1×1 CNN layers, and can be interpreted as improving the discriminative power of the overall network. Finally, the FCN can include a layer with channel-wise sigmoid activations, providing outputs within the range [0, 1], which are appropriate for multiplicative masking. Let h.sub.n,t ∈ custom-character.sup.N.sup.m denote the output vector of the 9.sup.th layer in Table 1, and let W.sub.mask ∈ custom-character.sup.N.sup.m.sup.×N.sup.e and b.sub.mask ∈ custom-character.sup.N.sup.e be the weight matrix and bias vector from the 10.sup.th layer. The output of the FCN is given by Equation 9: σ(W.sub.mask.sup.Th.sub.n,t+b.sub.mask), (Equation 9) where σ(.Math.) denotes the element-wise sigmoid function.”.);
performing feature splicing on the first intermediate feature and the target audio feature based on a skip connection between the first mask estimation network layer and the second mask estimation network layer to obtain a second intermediate feature (see Fig. 4B (433-436: more specifically: 433-434) and ¶ [0067-0071] citations as in limitation(s) above.),
inputting the second intermediate feature to the second mask estimation network layer, the second mask estimation network layer outputting a third intermediate feature (see Fig. 4B (433-436: more specifically: 434-435) and ¶ [0067-0071] citations as in limitation(s) above.);
performing feature splicing on the third intermediate feature, the target audio feature, and the first intermediate feature based on a skip connection between the first mask estimation network layer and the third mask estimation network layer, and a skip connection between the second mask estimation network layer and the third mask estimation network layer, to obtain a fourth intermediate feature (see Fig. 4B (433-436: more specifically: 435-436) and ¶ [0067-0071] citations as in limitation(s) above); and
inputting the fourth intermediate feature to the third mask estimation network layer, the third mask estimation network layer outputting the hidden feature corresponding to the target audio feature (see Fig. 4B (435-439) and ¶ [0067-0071] citations as in limitation(s) above.).
Fuchs et al., Ichikawa et al., and Borgstrom et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/audio processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. in combination with Ichikawa et al. to incorporate the teachings of Borgstrom et al. of wherein the mask estimation network layer comprises a first mask estimation network layer, a second mask estimation network layer, and a third mask estimation network layer that have a skip connection; and the inputting the target audio feature to the mask estimation network layer, and performing, by the mask estimation network layer, mask estimation on the target audio feature to obtain a hidden feature corresponding to the target audio feature comprises: inputting the target audio feature to the first mask estimation network layer, the first mask estimation network layer outputting a first intermediate feature; performing feature splicing on the first intermediate feature and the target audio feature based on a skip connection between the first mask estimation network layer and the second mask estimation network layer to obtain a second intermediate feature, inputting the second intermediate feature to the second mask estimation network layer, the second mask estimation network layer outputting a third intermediate feature; performing feature splicing on the third intermediate feature, the target audio feature, and the first intermediate feature based on a skip connection between the first mask estimation network layer and the third mask estimation network layer, and a skip connection between the second mask estimation network layer and the third mask estimation network layer, to obtain a fourth intermediate feature; and inputting the fourth intermediate feature to the third mask estimation network layer, the third mask estimation network layer outputting the hidden feature corresponding to the target audio feature which provides the benefit of improving the performance of automated speech systems (abstract of Borgstrom et al.).
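For illustration only, the skip-connection and feature-splicing pattern recited above (concatenating earlier features with later intermediate features before each subsequent layer) may be sketched as follows; the layer widths, random weights, and activation choices are illustrative assumptions and not the structure of Borgstrom et al.:

    import numpy as np

    rng = np.random.default_rng(1)

    def layer(x, d_out):
        # Stand-in mask estimation network layer: a random linear map with a ReLU activation.
        W = 0.01 * rng.standard_normal((d_out, x.shape[0]))
        return np.maximum(W @ x, 0.0)

    d = 64
    target_audio_feature = rng.standard_normal(d)

    # First mask estimation network layer.
    first_intermediate = layer(target_audio_feature, d)

    # Skip connection: splice (concatenate) the first intermediate feature with the input feature.
    second_intermediate = np.concatenate([first_intermediate, target_audio_feature])

    # Second mask estimation network layer.
    third_intermediate = layer(second_intermediate, d)

    # Skip connections from the input and the first layer: splice all three features.
    fourth_intermediate = np.concatenate([third_intermediate, target_audio_feature, first_intermediate])

    # Third mask estimation network layer outputs the hidden feature.
    hidden_feature = layer(fourth_intermediate, d)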
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Fuchs et al. (US 20220223161 A1) and further in view of Ichikawa et al. (US 20110301945 A1) as applied to claim 1 above, and further in view of Wang et al. (US 20200202869 A1).
Regarding claim 11, Fuchs et al. in combination with Ichikawa et al. teach the limitations as in claim 1, above.
However, Fuchs et al. in combination with Ichikawa et al. do not explicitly teach, but Wang et al. does teach:
11. The method according to claim 1, further comprising:
performing interpolation on the target mask to obtain an interpolation mask, a length of the interpolation mask being the same as that of the target audio data frame (see ¶ [0009]: “Regardless of the technique(s) utilized to generate a speaker embedding, implementations disclosed herein process spectrogram representations of audio data and the speaker embedding, using a trained voice filter model, to generate a predicted mask which can be used in isolating utterance(s) (if any) of a speaker corresponding to the speaker embedding. Voice filter models can include a variety of layers including: a convolutional neural network portion, a recurrent neural network portion, as well as a fully connected feed-forward neural network portion. A spectrogram of the audio data can be processed using the convolutional neural network portion to generate convolutional output. Additionally or alternatively, the convolutional output and a speaker embedding associated with the human speaker can be processed using the recurrent neural network portion to generate recurrent output. In many implementations, the recurrent output can be processed using the fully connected feed-forward neural network portion to generate a predicted mask. The spectrogram can be processed using the predicted mask, for example by convolving the spectrogram with the predicted mask, to generate a masked spectrogram. The masked spectrogram includes only the utterance(s) associated with the human speaker and excludes any background noise and/or additional human speaker(s) in the audio data. In many implementations, the masked spectrogram can be processed using an inverse transformation such as an inverse Fourier transform to generate the refined version of the audio data.”);
multiplying the interpolation mask with the target audio data frame, and performing inverse Fourier transform on a multiplication result to obtain target audio data that is obtained by performing noise suppression on the target audio data frame (see ¶ [0009] citation(s) as in limitation above, more specifically: “…. The spectrogram can be processed using the predicted mask, for example by convolving the spectrogram with the predicted mask, to generate a masked spectrogram …”); and
after noise suppression is performed on each audio data frame associated with the raw audio data, obtaining, based on target audio data corresponding to each audio data frame, enhanced audio data corresponding to the raw audio data (see ¶ [0009] citation(s) as in limitation above, more specifically: “…. The masked spectrogram includes only the utterance(s) associated with the human speaker and excludes any background noise and/or additional human speaker(s) in the audio data.…”).
Fuchs et al., Ichikawa et al., and Wang et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech/audio processing/enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Fuchs et al. in combination with Ichikawa et al. to incorporate the teachings of Wang et al. of performing interpolation on the target mask to obtain an interpolation mask, a length of the interpolation mask being the same as that of the target audio data frame; multiplying the interpolation mask with the target audio data frame, and performing inverse Fourier transform on a multiplication result to obtain target audio data that is obtained by performing noise suppression on the target audio data frame; and after noise suppression is performed on each audio data frame associated with the raw audio data, obtaining, based on target audio data corresponding to each audio data frame, enhanced audio data corresponding to the raw audio data which provides the benefit of a refined version of audio data and improved accuracy ([0011] of Wang et al.).
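For illustration only, the interpolation, multiplication, and inverse-transform steps recited above may be sketched as follows; the frame, mask values, and interpolation method (simple linear interpolation) are illustrative assumptions rather than the technique of Wang et al.:

    import numpy as np

    # Illustrative assumptions: a 512-sample frame represented by its frequency bins,
    # and a coarse 64-value target mask (e.g. one value per acoustic band).
    time_frame = np.random.randn(512)
    frame_bins = np.fft.rfft(time_frame)          # frequency-bin representation of the frame
    target_mask = np.random.rand(64)

    # Interpolate the target mask so its length matches the frame's frequency-bin length.
    interp_mask = np.interp(np.linspace(0.0, 1.0, frame_bins.shape[0]),
                            np.linspace(0.0, 1.0, target_mask.shape[0]),
                            target_mask)

    # Multiply, then apply the inverse Fourier transform to obtain the noise-suppressed target audio.
    target_audio = np.fft.irfft(interp_mask * frame_bins, n=time_frame.shape[0])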
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Regarding speech/audio processing/enhancement (pertinent to claims 1 and 12-13):
Fingscheidt et al. (US 20070198255 A1, ¶ [0011 and 0041])
Mandel et al. (US 20220358904 A1, ¶ [0024-0024 and 0065])
Ichikawa et al. (JP 2011253133 A, ¶ 6 of page 3 and ¶ 1-3 of page 5)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Keisha Y Castillo-Torres whose telephone number is (571)272-3975. The examiner can normally be reached Monday - Friday, 9:00 am - 4:00 pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached at (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
Keisha Y. Castillo-Torres
Examiner
Art Unit 2659
/Keisha Y. Castillo-Torres/Examiner, Art Unit 2659