Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claim Rejections - 35 USC § 103
1. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
2. Claims 1-3, 8-10, and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Jensen et al. (US 2023/0388721) in view of Zeghidour et al. (US 2023/0186927).
As to claim 1, Jensen teaches a method of target vocal enhancement, the method performed by at least one processor (Fig. 4 and [0044] – processor PRO) and comprising:
receiving an audio signal obtained from a microphone (Fig. 4; [0046] - The input unit may comprise an input transducer, e.g. a microphone, for converting an input sound to an electric input signal);
inputting the audio signal into a frequency-domain Kalman filter (FDKF) (Fig. 4; [0031] - The current electric input signals (y(n)) are the values of the electric input signals ym(n), m=1, . . . , M, at the respective microphones of the hearing aid. Hence y(n)=[y.sub.1(n), . . . , yM(n)]. The current electric input signals (y(n)) are e.g. provided in the time-frequency domain, as Y(k,l)=[Y1(k,l), . . . , YM(k,l)]; [0052] - Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks; [0225] - The calculation unit (CALC) may further comprise a Kalman filter (FIL) (or one or more Kalman filters) for filtering the individual or combined digitized amplified voltages);
inputting the audio signal and an output from the FDKF into a neural network (Fig. 4; [0031] - The current electric input signals (y(n)) are the values of the electric input signals ym(n), m=1, . . . , M, at the respective microphones of the hearing aid. Hence y(n)=[y.sub.1(n), . . . , yM(n)]. The current electric input signals (y(n)) are e.g. provided in the time-frequency domain, as Y(k,l)=[Y1(k,l), . . . , YM(k,l)]; [0052] - Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks; [0068] - The classification unit may be based on or comprise a neural network, e.g. a trained neural network; [0210] - The processor (PRO) is connected to the database (cf. Θ[DB] in FIG. 4) and configured to estimate ATF-vectors ATF*.sub.θ for the user based on the database Θ, the current electric input signals x.sub.m(n), m=1, . . . , M, (here m=1, 2), the current information (ϕ(n)) about the user's eyes, e.g. an eye-gaze signal, such as an eye-gaze direction, and the model of the acoustic propagation channels. ATF-vectors ATF* (cf. d* in FIG. 4) for the user may be determined by a number of different methods available in the art, e.g. maximum likelihood estimate (MLE) methods, cf. e.g. EP3413589A1. Other statistical methods may e.g. include Mean Squared Error (MSE), regression analysis (e.g. Least Squares (LS)), e.g. probabilistic methods (e.g. MLE), e.g. supervised learning (e.g. neural network algorithms); [0211] - The hearing aid (HD) of FIG. 4 comprises a forward (audio signal) path configured to process the electric input signals (y.sub.1, y.sub.2) and to provide an enhanced (processed) output signal for (OUT) being presented to the user);
estimating, based on the audio signal and the output from the FDKF, and removing feedback signals from the audio signal by the neural network ([0068] - The classification unit may be based on or comprise a neural network, e.g. a trained neural network; [0069] - The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is typically based on a linear time invariant filter to estimate the feedback path but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms, including some form of the Least Mean Square (LMS) or the Normalized LMS (NLMS) algorithms; [0211] - The hearing aid (HD) of FIG. 4 comprises a forward (audio signal) path configured to process the electric input signals (y.sub.1, y.sub.2) and to provide an enhanced (processed) output signal for (OUT) being presented to the user);
a codec (via signal processor; Fig. 4; [0212] - The processor (PRO) and the signal processor (SP) may form part of the same digital signal processor (or be independent units). The analysis filter banks (FB-A1, FB-A2), the processor (PRO), the signal processor (SP), the synthesis filter bank (FBS), and the voice activity detector (VAD) may form part of the same digital signal processor (or be independent units); [0213] - The signal processor (SP) is configured to apply one or more processing algorithms to the electric input signals (e.g. beamforming and compressive amplification) and to provide a processed output signal (OUT) for presentation to the user via the output transducer); and
outputting a version of the audio signal in which the target vocal signal is enhanced by removal of the feedback signals from the audio signal ([0069] - The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is typically based on a linear time invariant filter to estimate the feedback path but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms, including some form of the Least Mean Square (LMS) or the Normalized LMS (NLMS) algorithms; [0211] - The hearing aid (HD) of FIG. 4 comprises a forward (audio signal) path configured to process the electric input signals (y.sub.1, y.sub.2) and to provide an enhanced (processed) output signal for (OUT) being presented to the user).
Jensen does not explicitly teach recovering, by a codec receiving an output from the neural network, vocal quality of a target vocal signal; and outputting a version of the audio signal in which the target vocal signal is enhanced by recovery of the vocal quality by the codec.
Zeghidour teaches recovering, by a codec receiving an output from a neural network (encoder neural network 102), vocal quality of a target vocal signal (Figs. 1 & 2; [0026-0027], [0038] - using an entropy codec 302, into a compressed representation of the audio waveform 114. The entropy codec 302 can implement any appropriate lossless entropy coding, e.g., arithmetic coding, Huffman coding, etc., [0070] - By weighting appropriate loss terms with weight factors λ.sub.rec, λ.sub.adv and λ.sub.feat, the objective function 214 can emphasize certain properties, such as faithful reconstructions, fidelity, perceptual quality, etc. In some implementations, the weight factors are set to λ.rec=λ.adv=1 and λ.feat=100; [0052] - In some cases, the target waveform 204 is identical to the input waveform 202, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform 204 can also be modified with respect to the input waveform 202 to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples 116 with certain qualities. For instance, the target waveform 204 can be a speech enhanced version of the input waveform 202, such that the neural networks improve audio dialogue upon reconstruction of waveforms; [0094] - FIGS. 8A and 8B show an example of a fully convolutional neural network architecture for the encoder 102 and decoder 104 neural networks. C represents the number of channels and D is the dimensionality of the feature vectors 208. The architecture in FIGS. 8A and 8B is based on the SoundStream model developed by N. Zeghidour, A. Luebs, A. Omran, J. Skoglund and M. Tagliasacchi, “SoundStream: An End-to-End Neural Audio Codec,”; a codec can be from a neural network to enhance the speech to the vocal quality desired); and
outputting a version of the audio signal in which the target vocal signal is enhanced by recovery of the vocal quality by the codec ([0038] - entropy codec 302, into a compressed representation of the audio waveform 114. The entropy codec 302 can implement any appropriate lossless entropy coding, e.g., arithmetic coding, Huffman coding, etc.; [0052] - the target waveform 204 is identical to the input waveform 202, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform 204 can also be modified with respect to the input waveform 202 to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples 116 with certain qualities. For instance, the target waveform 204 can be a speech enhanced version of the input waveform 202, such that the neural networks improve audio dialogue upon reconstruction of waveforms; [0070]; [0091] - The system processes the input audio waveform for each training example using an encoder neural network, a plurality of vector quantizers, and a decoder neural network to generate a respective output audio waveform (704), where each vector quantizer is associated with a respective codebook. In some implementations, the encoder and/or decoder neural networks are conditioned on data that defines where the corresponding target audio waveform is the same as the input audio waveform or an enhanced version of the input audio waveform; [0094] - FIGS. 8A and 8B show an example of a fully convolutional neural network architecture for the encoder 102 and decoder 104 neural networks. C represents the number of channels and D is the dimensionality of the feature vectors 208. The architecture in FIGS. 8A and 8B is based on the SoundStream model developed by N. Zeghidour, A. Luebs, A. Omran, J. Skoglund and M. Tagliasacchi, “SoundStream: An End-to-End Neural Audio Codec,”; a codec can be from a neural network to enhance the speech to the vocal quality desired as an output).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jensen with the codec of Zeghidour for the purpose of enhancing a target vocal, thereby producing speech having the desired enhancements for the type of speech with little latency (Zeghidour; [0030], [0052]).
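For illustration only, the following sketch (not taken from Jensen or Zeghidour; all function and object names are hypothetical) shows how the combined pipeline mapped above might be orchestrated: a frequency-domain Kalman filter cancels the linear feedback path, a neural network removes residual feedback using both the raw signal and the FDKF output, and a neural codec reconstructs the target vocal to recover vocal quality.

```python
def enhance_target_vocal(mic_signal, fdkf, suppressor_nn, neural_codec):
    """Hypothetical sketch of the combined Jensen/Zeghidour pipeline.

    mic_signal    : time-domain samples from the microphone
    fdkf          : frequency-domain Kalman filter (linear feedback canceller)
    suppressor_nn : neural network that removes residual feedback
    neural_codec  : encoder/RVQ/decoder codec that recovers vocal quality
    """
    # Step 1: linear feedback-path cancellation in the frequency domain.
    fdkf_output = fdkf.cancel(mic_signal)
    # Step 2: the neural network receives both the raw signal and the FDKF output,
    # estimates the remaining feedback components, and removes them.
    feedback_free = suppressor_nn.remove_feedback(mic_signal, fdkf_output)
    # Step 3: the neural codec re-synthesizes the target vocal from the suppressed
    # signal, recovering vocal quality, and the enhanced version is output.
    tokens = neural_codec.encode(feedback_free)
    return neural_codec.decode(tokens)
```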
As to claims 2, 9 and 16, Jensen teaches the method according to claim 1, the apparatus according to claim 8 and the non-transitory computer readable medium according to claim 15, wherein the audio signal is obtained from the microphone in a hands-free Karaoke environment ([0046] - The input unit may comprise an input transducer, e.g. a microphone, for converting an input sound to an electric input signal. The input unit may comprise a wireless receiver for receiving a wireless signal comprising or representing sound and for providing an electric input signal representing said sound; [0090] - Use may be provided in a system comprising one or more hearing aids (e.g. hearing instruments), headsets, ear phones, active ear protection systems, etc., e.g. in handsfree telephone systems, teleconferencing systems (e.g. including a speakerphone), public address systems, karaoke systems, classroom amplification systems, etc.).
As to claims 3, 10 and 17, Jensen teaches the method according to claim 1, the apparatus according to claim 8 and the non-transitory computer readable medium according to claim 15, wherein the output from the FDKF is a version of the audio signal in which acoustic feedback cancellation (AFC) is implemented by iterative feedback to the FDKF in which the target vocal signal is estimated by short-time Fourier transform (STFT) and used by the FDKF to update filter weights of the FDKF ([0055] - The TF conversion unit may comprise a filter bank for filtering a (time varying) input signal and providing a number of (time varying) output signals each comprising a distinct frequency range of the input signal. The TF conversion unit may comprise a Fourier transformation unit (e.g. a Discrete Fourier Transform (DFT) algorithm, or a Short Time Fourier Transform (STFT) algorithm, or similar) for converting a time variant input signal to a (time variant) signal in the (time-)frequency domain; [0069] - The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is typically based on a linear time invariant filter to estimate the feedback path but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms; adaptive feedback cancellation uses updated filter weights over time including using an STFT), and
wherein the neural network implements a neural network adaptive feedback cancellation (NNAFC) based on STFT domain versions of the audio signal, the output from the FDKF, and a reference music signal ([0063] - The hearing aid may comprise a classification unit configured to classify the current situation based on input signals from (at least some of) the detectors, and possibly other inputs as well. In the present context 'a current situation' may be taken to be defined by one or more of; [0065] - b) the current acoustic situation (input level, feedback, etc.); [0068] - The classification unit may be based on or comprise a neural network, e.g. a trained neural network; [0069] - The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is typically based on a linear time invariant filter to estimate the feedback path but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms, including some form of the Least Mean Square (LMS) or the Normalized LMS (NLMS) algorithms. They both have the property to minimize the error signal in the mean square sense with the NLMS additionally normalizing the filter update with respect to the squared Euclidean norm of some reference signal; [0128] - The auxiliary device may be constituted by or comprise an audio gateway device adapted for receiving a multitude of audio signals (e.g. from an entertainment device, e.g. a TV or a music player, a telephone apparatus, e.g. a mobile telephone or a computer, e.g. a PC) and adapted for selecting and/or combining an appropriate one of the received audio signals (or combination of signals) for transmission to the hearing aid; [0178] - where λ.sub.X(k,l)=E[|X(k,l)|.sup.2] is the target speech signal power spectral density (PSD) at the reference microphone; [0179] - Finally, in order to evaluate CY(k,l,Θi) for a particular candidate RATF vector d(k,l,Θi) in practice, we follow the procedure described in [4]. Specifically, GV(k,l0) may be estimated from speech absence time-frequency tiles, where l0 denotes the latest time instant in the past with speech absence, while to estimate λ.sub.X(k,l) and λ.sub.V(k,l), ML estimators from [4] may be used; adaptive feedback cancellation uses a trained neural network and updated filter weights over time including using an STFT and a signal from a reference microphone).
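For illustration only, the following sketch (hypothetical; not code from Jensen) shows a per-bin NLMS-style update of feedback-path filter weights in the STFT domain, the kind of stochastic-gradient filter update described in Jensen's paragraph [0069].

```python
import numpy as np

def nlms_afc_update(W, X_frame, Y_frame, mu=0.1, eps=1e-8):
    """One STFT-frame update of an adaptive feedback canceller (hypothetical sketch).

    W       : complex feedback-path estimate per frequency bin, shape (num_bins,)
    X_frame : STFT of the loudspeaker (reference) signal for this frame, shape (num_bins,)
    Y_frame : STFT of the microphone signal for this frame, shape (num_bins,)
    Returns the updated weights and the error (feedback-cancelled) spectrum.
    """
    feedback_estimate = W * X_frame            # predicted feedback component
    E_frame = Y_frame - feedback_estimate      # error spectrum ~ target vocal estimate
    # NLMS step: normalize by the reference power in each bin to keep the update stable.
    W = W + mu * np.conj(X_frame) * E_frame / (np.abs(X_frame) ** 2 + eps)
    return W, E_frame

# Example usage with random spectra (257 bins, e.g. a 512-point STFT):
num_bins = 257
W = np.zeros(num_bins, dtype=complex)
X = np.random.randn(num_bins) + 1j * np.random.randn(num_bins)
Y = 0.3 * X + 0.01 * (np.random.randn(num_bins) + 1j * np.random.randn(num_bins))
W, E = nlms_afc_update(W, X, Y)
```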
As to claim 8, Jensen teaches an apparatus for target vocal enhancement (Fig. 4 & 5; [0044]), the apparatus comprising:
at least one memory ([0215] - memory (MEM)) configured to store computer program code; at least one processor (Fig. 4, processor (PRO)) configured to access the computer program code and operate as instructed by the computer program code (Fig. 4 & 5; [0117-0118]), the computer program code including:
receiving code configured to cause the at least one processor (PRO) to receive an audio signal obtained from a microphone (Fig. 4; [0046] - The input unit may comprise an input transducer, e.g. a microphone, for converting an input sound to an electric input signal);
inputting code configured to cause the at least one processor (PRO) to input the audio signal into a frequency-domain Kalman filter (FDKF) (Fig. 4; [0031] - The current electric input signals (y(n)) are the values of the electric input signals ym(n), m=1, . . . , M, at the respective microphones of the hearing aid. Hence y(n)=[y1(n), . . . , yM(n)]. The current electric input signals (y(n)) are e.g. provided in the time-frequency domain, as Y(k,l)=[Y1(k,l), . . . , YM(k,l)]; [0052] - Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks; [0225] - The calculation unit (CALC) may further comprise a Kalman filter (FIL) (or one or more Kalman filters) for filtering the individual or combined digitized amplified voltages);
further inputting code configured to cause the at least one processor (PRO) to input the audio signal and an output from the FDKF into a neural network (Fig. 4; [0031] - The current electric input signals (y(n)) are the values of the electric input signals ym(n), m=1, . . . , M, at the respective microphones of the hearing aid. Hence y(n)=[y1(n), . . . , yM(n)]. The current electric input signals (y(n)) are e.g. provided in the time-frequency domain, as Y(k,l)=[Y1(k,l), . . . , YM(k,l)]; [0052] - Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks; [0068] - The classification unit may be based on or comprise a neural network, e.g. a trained neural network; [0210] - The processor (PRO) is connected to the database (cf. Θ[DB] in FIG. 4) and configured to estimate ATF-vectors ATF*.sub.θ for the user based on the database Θ, the current electric input signals x.sub.m(n), m=1, . . . , M, (here m=1, 2), the current information (ϕ(n)) about the user's eyes, e.g. an eye-gaze signal, such as an eye-gaze direction, and the model of the acoustic propagation channels. ATF-vectors ATF* (cf. d* in FIG. 4) for the user may be determined by a number of different methods available in the art, e.g. maximum likelihood estimate (MLE) methods, cf. e.g. EP3413589A1. Other statistical methods may e.g. include Mean Squared Error (MSE), regression analysis (e.g. Least Squares (LS)), e.g. probabilistic methods (e.g. MLE), e.g. supervised learning (e.g. neural network algorithms); [0211] - The hearing aid (HD) of FIG. 4 comprises a forward (audio signal) path configured to process the electric input signals (y1, y2) and to provide an enhanced (processed) output signal for (OUT) being presented to the user);
estimating code configured to cause the at least one processor (PRO) to estimate, based on the audio signal and the output from the FDKF, and remove feedback signals from the audio signal by the neural network (Fig. 4; [0031] - The current electric input signals (y(n)) are the values of the electric input signals ym(n), m=1, . . . , M, at the respective microphones of the hearing aid. Hence y(n)=[y1(n), . . . , yM(n)]. The current electric input signals (y(n)) are e.g. provided in the time-frequency domain, as Y(k,l)=[Y1(k,l), . . . , YM(k,l)]; [0052] - Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks; [0068] - The classification unit may be based on or comprise a neural network, e.g. a trained neural network; [0069] - The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is typically based on a linear time invariant filter to estimate the feedback path but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms, including some form of the Least Mean Square (LMS) or the Normalized LMS (NLMS) algorithms; [0211] - The hearing aid (HD) of FIG. 4 comprises a forward (audio signal) path configured to process the electric input signals (y1, y2) and to provide an enhanced (processed) output signal for (OUT) being presented to the user);
a codec (via signal processor SP; Fig. 4; [0212] - The processor (PRO) and the signal processor (SP) may form part of the same digital signal processor (or be independent units). The analysis filter banks (FB-A1, FB-A2), the processor (PRO), the signal processor (SP), the synthesis filter bank (FBS), and the voice activity detector (VAD) may form part of the same digital signal processor (or be independent units); [0213] - The signal processor (SP) is configured to apply one or more processing algorithms to the electric input signals (e.g. beamforming and compressive amplification) and to provide a processed output signal (OUT) for presentation to the user via the output transducer); and
outputting code configured to cause the at least one processor to output a version of the audio signal in which the target vocal signal is enhanced by removal of the feedback signals from the audio signal (Fig. 4; [0031] - The current electric input signals (y(n)) are the values of the electric input signals ym(n), m=1, . . . , M, at the respective microphones of the hearing aid. Hence y(n)=[y1(n), . . . , yM(n)]. The current electric input signals (y(n)) are e.g. provided in the time-frequency domain, as Y(k,l)=[Y1(k,l), . . . , YM(k,l)]; [0052] - Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks; [0068] - The classification unit may be based on or comprise a neural network, e.g. a trained neural network; [0069] - The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is typically based on a linear time invariant filter to estimate the feedback path but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms, including some form of the Least Mean Square (LMS) or the Normalized LMS (NLMS) algorithms; [0211] - The hearing aid (HD) of FIG. 4 comprises a forward (audio signal) path configured to process the electric input signals (y1, y2) and to provide an enhanced (processed) output signal for (OUT) being presented to the user).
Jensen does not explicitly disclose recovering code configured to cause the at least one processor to recover, by a codec receiving an output from the neural network, vocal quality of a target vocal signal; and outputting code configured to cause the at least one processor to output a version of the audio signal in which the target vocal signal is enhanced by recovery of the vocal quality by the codec.
Zeghidour teaches recovering code configured to cause an at least one processor to recover, by a codec receiving an output from a neural network (encoder neural network 102), vocal quality of a target vocal signal (Fig. 1 & 2; [0026] - FIG. 1 depicts an example audio compression system 100 that can compress audio waveforms using an encoder neural network 102 and a residual vector quantizer 106. Similarly, FIG. 2 depicts an example audio decompression system 200 that can decompress compressed audio waveforms using a decoder neural network 104 and the residual vector quantizer 106; [0038] - The entropy codec 302 can implement any appropriate lossless entropy coding, e.g., arithmetic coding, Huffman coding, etc.; [0052] - In some cases, the target waveform 204 is identical to the input waveform 202, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform 204 can also be modified with respect to the input waveform 202 to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples 116 with certain qualities. For instance, the target waveform 204 can be a speech enhanced version of the input waveform 202, such that the neural networks improve audio dialogue upon reconstruction of waveforms; [0070] - By weighting appropriate loss terms with weight factors λ.sub.rec, λ.sub.adv and λ.sub.feat, the objective function 214 can emphasize certain properties, such as faithful reconstructions, fidelity, perceptual quality, etc.; [0094] - FIGS. 8A and 8B show an example of a fully convolutional neural network architecture for the encoder 102 and decoder 104 neural networks. C represents the number of channels and D is the dimensionality of the feature vectors 208. The architecture in FIGS. 8A and 8B is based on the SoundStream model developed by N. Zeghidour, A. Luebs, A. Omran, J. Skoglund and M. Tagliasacchi, "SoundStream: An End-to-End Neural Audio Codec"; [0099] - The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers; a codec can be from a neural network to enhance the speech to the vocal quality desired); and
outputting code configured to cause the at least one processor to output a version of the audio signal in which the target vocal signal is enhanced by recovery of the vocal quality by the codec (Fig. 1 & 2; [0038] - The entropy codec 302 can implement any appropriate lossless entropy coding, e.g., arithmetic coding, Huffman coding, etc.; [0052] - In some cases, the target waveform 204 is identical to the input waveform 202, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform 204 can also be modified with respect to the input waveform 202 to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples 116 with certain qualities. For instance, the target waveform 204 can be a speech enhanced version of the input waveform 202, such that the neural networks improve audio dialogue upon reconstruction of waveforms; [0070] - By weighting appropriate loss terms with weight factors λ.sub.rec, λ.sub.adv and λ.sub.feat, the objective function 214 can emphasize certain properties, such as faithful reconstructions, fidelity, perceptual quality, etc.; [0091] - The system processes the input audio waveform for each training example using an encoder neural network, a plurality of vector quantizers, and a decoder neural network to generate a respective output audio waveform (704), where each vector quantizer is associated with a respective codebook. In some implementations, the encoder and/or decoder neural networks are conditioned on data that defines where the corresponding target audio waveform is the same as the input audio waveform or an enhanced version of the input audio waveform; [0094] - FIGS. 8A and 8B show an example of a fully convolutional neural network architecture for the encoder 102 and decoder 104 neural networks. C represents the number of channels and D is the dimensionality of the feature vectors 208. The architecture in FIGS. 8A and 8B is based on the SoundStream model developed by N. Zeghidour, A. Luebs, A. Omran, J. Skoglund and M. Tagliasacchi, "SoundStream: An End-to-End Neural Audio Codec"; a codec can be from a neural network to enhance the speech to the vocal quality desired as an output).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jensen with the codec of Zeghidour for the purpose of enhancing a target vocal, thereby producing speech having the desired enhancements for the type of speech with little latency (Zeghidour; [0030], [0052]).
As to claim 15, Jensen teaches a non-transitory computer readable medium storing a program causing a computer ([0117-0118]) to:
receive an audio signal obtained from a microphone (Fig. 4; [0046] - The input unit may comprise an input transducer, e.g. a microphone, for converting an input sound to an electric input signal);
input the audio signal into a frequency-domain Kalman filter (FDKF) (Fig. 4; [0031] - The current electric input signals (y(n)) are the values of the electric input signals ym(n), m=1, . . . , M, at the respective microphones of the hearing aid. Hence y(n)=[y1(n), . . . , yM(n)]. The current electric input signals (y(n)) are e.g. provided in the time-frequency domain, as Y(k,l)=[Y1(k,l), . . . , YM(k,l)]; [0052] - Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks; [0225] - The calculation unit (CALC) may further comprise a Kalman filter (FIL) (or one or more Kalman filters) for filtering the individual or combined digitized amplified voltages);
input the audio signal and an output from the FDKF into a neural network (Fig. 4; [0031] - The current electric input signals (y(n)) are the values of the electric input signals ym(n), m=1, . . . , M, at the respective microphones of the hearing aid. Hence y(n)=[y1(n), . . . , yM(n)]. The current electric input signals (y(n)) are e.g. provided in the time-frequency domain, as Y(k,l)=[Y1(k,l), . . . , YM(k,l)]; [0052] - Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks; [0068] - The classification unit may be based on or comprise a neural network, e.g. a trained neural network; [0210] - The processor (PRO) is connected to the database (cf. Θ[DB] in FIG. 4) and configured to estimate ATF-vectors ATF*.sub.θ for the user based on the database Θ, the current electric input signals x.sub.m(n), m=1, . . . , M, (here m=1, 2), the current information (ϕ(n)) about the user's eyes, e.g. an eye-gaze signal, such as an eye-gaze direction, and the model of the acoustic propagation channels. ATF-vectors ATF* (cf. d* in FIG. 4) for the user may be determined by a number of different methods available in the art, e.g. maximum likelihood estimate (MLE) methods, cf. e.g. EP3413589A1. Other statistical methods may e.g. include Mean Squared Error (MSE), regression analysis (e.g. Least Squares (LS)), e.g. probabilistic methods (e.g. MLE), e.g. supervised learning (e.g. neural network algorithms); [0211] - The hearing aid (HD) of FIG. 4 comprises a forward (audio signal) path configured to process the electric input signals (y1, y2) and to provide an enhanced (processed) output signal for (OUT) being presented to the user);
estimate, based on the audio signal and the output from the FDKF, and remove feedback signals from the audio signal by the neural network (Fig. 4; [0031] - The current electric input signals (y(n)) are the values of the electric input signals ym(n), m=1, . . . , M, at the respective microphones of the hearing aid. Hence y(n)=[y1(n), . . . , yM(n)]. The current electric input signals (y(n)) are e.g. provided in the time-frequency domain as Y(k,l)=[Y1(k,l), . . . , YM(k,l)]; [0052] - Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks; [0068] - The classification unit may be based on or comprise a neural network, e.g. a trained neural network; [0069] - The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is typically based on a linear time invariant filter to estimate the feedback path but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms, including some form of the Least Mean Square (LMS) or the Normalized LMS (NLMS) algorithms; [0211] - The hearing aid (HD) of FIG. 4 comprises a forward (audio signal) path configured to process the electric input signals (y1, y2) and to provide an enhanced (processed) output signal for (OUT) being presented to the user);
a codec (via signal processor SP; Fig. 4; [0212] - The processor (PRO) and the signal processor (SP) may form part of the same digital signal processor (or be independent units). The analysis filter banks (FB-A1, FB-A2), the processor (PRO), the signal processor (SP), the synthesis filter bank (FBS), and the voice activity detector (VAD) may form part of the same digital signal processor (or be independent units); [0213] - The signal processor (SP) is configured to apply one or more processing algorithms to the electric input signals (e.g. beamforming and compressive amplification) and to provide a processed output signal (OUT) for presentation to the user via the output transducer); and
output a version of the audio signal in which the target vocal signal is enhanced by removal of the feedback signals from the audio signal (Fig. 4; [0031] - The current electric input signals (y(n)) are the values of the electric input signals ym(n), m=1, . . . , M, at the respective microphones of the hearing aid. Hence y(n)=[y1(n), . . . , yM(n)]. The current electric input signals (y(n)) are e.g. provided in the time-frequency domain, as Y(k,l)=[Y1(k,l), . . . , YM(k,l)]; [0052] - Some or all signal processing of the analysis path and/or the forward path may be conducted in the frequency domain, in which case the hearing aid comprises appropriate analysis and synthesis filter banks; [0068] - The classification unit may be based on or comprise a neural network, e.g. a trained neural network; [0069] - The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is typically based on a linear time invariant filter to estimate the feedback path but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms, including some form of the Least Mean Square (LMS) or the Normalized LMS (NLMS) algorithms; [0211] - The hearing aid (HD) of FIG. 4 comprises a forward (audio signal) path configured to process the electric input signals (y1, y2) and to provide an enhanced (processed) output signal for (OUT) being presented to the user).
Jensen does not explicitly disclose recover, by a codec receiving an output from the neural network, vocal quality of a target vocal signal; and output a version of the audio signal in which the target vocal signal is enhanced by recovery of the vocal quality by the codec.
Zeghidour teaches recover, by a codec receiving an output from a neural network (encoder neural network 102), vocal quality of a target vocal signal (Fig. 1 & 2; [0026] - FIG. 1 depicts an example audio compression system 100 that can compress audio waveforms using an encoder neural network 102 and a residual vector quantizer 106. Similarly, FIG. 2 depicts an example audio decompression system 200 that can decompress compressed audio waveforms using a decoder neural network 104 and the residual vector quantizer 106; [0038] - The entropy codec 302 can implement any appropriate lossless entropy coding, e.g., arithmetic coding, Huffman coding, etc.; [0052] - In some cases, the target waveform 204 is identical to the input waveform 202, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform 204 can also be modified with respect to the input waveform 202 to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples 116 with certain qualities. For instance, the target waveform 204 can be a speech enhanced version of the input waveform 202, such that the neural networks improve audio dialogue upon reconstruction of waveforms; [0070] - By weighting appropriate loss terms with weight factors λ.sub.rec, λ.sub.adv and λ.sub.feat, the objective function 214 can emphasize certain properties, such as faithful reconstructions, fidelity, perceptual quality, etc.; [0094] - FIGS. 8A and 8B show an example of a fully convolutional neural network architecture for the encoder 102 and decoder 104 neural networks. C represents the number of channels and D is the dimensionality of the feature vectors 208. The architecture in FIGS. 8A and 8B is based on the SoundStream model developed by N. Zeghidour, A. Luebs, A. Omran, J. Skoglund and M. Tagliasacchi, "SoundStream: An End-to-End Neural Audio Codec"; a codec can be from a neural network to enhance the speech to the vocal quality desired); and
output a version of the audio signal in which the target vocal signal is enhanced by recovery of the vocal quality by the codec (Fig. 1 & 2; [0038] - The entropy codec 302 can implement any appropriate lossless entropy coding, e.g., arithmetic coding, Huffman coding, etc.; [0052] - In some cases, the target waveform 204 is identical to the input waveform 202, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform 204 can also be modified with respect to the input waveform 202 to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples 116 with certain qualities. For instance, the target waveform 204 can be a speech enhanced version of the input waveform 202, such that the neural networks improve audio dialogue upon reconstruction of waveforms; [0070] - By weighting appropriate loss terms with weight factors λ.sub.rec, λ.sub.adv and λ.sub.feat, the objective function 214 can emphasize certain properties, such as faithful reconstructions, fidelity, perceptual quality, etc.; [0091] - The system processes the input audio waveform for each training example using an encoder neural network, a plurality of vector quantizers, and a decoder neural network to generate a respective output audio waveform (704), where each vector quantizer is associated with a respective codebook. In some implementations, the encoder and/or decoder neural networks are conditioned on data that defines where the corresponding target audio waveform is the same as the input audio waveform or an enhanced version of the input audio waveform; [0094] - FIGS. 8A and 8B show an example of a fully convolutional neural network architecture for the encoder 102 and decoder 104 neural networks. C represents the number of channels and D is the dimensionality of the feature vectors 208. The architecture in FIGS. 8A and 8B is based on the SoundStream model developed by N. Zeghidour, A. Luebs, A. Omran, J. Skoglund and M. Tagliasacchi, "SoundStream: An End-to-End Neural Audio Codec"; a codec can be from a neural network to enhance the speech to the vocal quality desired as an output).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jensen with the codec of Zeghidour for the purpose of enhancing a target vocal, thereby producing speech having the desired enhancements for the type of speech with little latency (Zeghidour; [0030], [0052]).
3. Claims 4-7, 11-14, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Jensen et al. (US 2023/0388721) and Zeghidour et al. (US 2023/0186927) in view of Zhang (US 2023/0276182).
As to claims 4, 11 and 18, Jensen and Zeghidour do not explicitly disclose the method according to claim 3, the apparatus according to claim 10, and the non-transitory computer readable medium according to claim 17, wherein the NNAFC comprises a two-layer Long Short-Term Memory (LSTM) network configured to estimate and suppress music and playback components in the audio signal based on at least two ratio masks.
Zhang is in the field of using a neural network to enhance a speech audio signal ([0002]) and teaches an NNAFC comprising a two-layer Long Short-Term Memory (LSTM) network configured to estimate and suppress music and playback components in an audio signal based on at least two ratio masks (Fig. 8; [0002] - In one embodiment, methods and systems are described that receive an audio signal from a microphone of a mobile device. The mobile device processes the audio signal via a neural network to obtain a speech-enhanced audio signal; [0020] - Speech enhancement (SE) is an audio signal processing technique that aims to improve the quality and intelligibility of speech signals corrupted by noise… Recently, the success of deep neural networks (DNNs) in automatic speech recognition led to investigation of DNNs for noise suppression for ASR and speech enhancement; [0021] - The present disclosure includes descriptions of embodiments that utilize a DNN to enhance sound processing. Although in hearing devices this commonly involves enhancing the user's perception of speech, such enhancement techniques can be used in specialty applications to enhance any type of sound whose signals can be characterized, such as music, animal noises (e.g., bird calls), machine noises, pure or mixed tones, etc.; [0026] - For example, if different DNNs 108 have different output vectors, then an output vector abstraction similar to the feature abstraction template 112 may be used to process and stream the output data downstream. Also, changing the DNN may trigger changes to other processing elements not shown, such as equalization, feedback cancellation, etc.; [0029] - For example, the number and type of hidden layers within each neural network 200, 202 may be different. The type of neural networks 200, 202 may also be different, e.g., feedforward, (vanilla) recurrent neural network (RNN), long short-term memory (LSTM), gated recurrent units (GRU), light gated recurrent units (LiGRU), convolutional neural network (CNN), spiking neural networks, etc. These different network types may involve different arrangements of state data in memory, different processing algorithms, etc.; [0056] - The output of the deep learning model 807 may be a real-valued, ideal ratio mask or phase sensitive mask or a complex-valued ideal ratio mask. The own voice detection may use a neural network on either or both devices 800, 802 for speaker verification. The sidechain processing 809 may include environment detection and background noise level estimation and use data from either device 800, 802; and it would have been obvious that an LSTM may use multiple layers and ratio masks to provide noise suppression and enhancement including from music and other components).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jensen with the LSTM of Zhang for the purpose of using a deep neural network for processing, thereby providing speech enhancement and noise suppression (Zhang; [0020], [0021]).
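For illustration only, the following sketch (hypothetical; not code from Zhang) shows a two-layer LSTM that estimates two ratio masks per time-frequency bin and applies them to suppress music and playback components, the configuration recited in claims 4, 11 and 18.

```python
import torch
import torch.nn as nn

class TwoLayerLstmMasker(nn.Module):
    """Hypothetical two-layer LSTM predicting two ratio masks per frequency bin."""

    def __init__(self, num_bins=257, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_bins, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        # Two masks per bin, each constrained to [0, 1] with a sigmoid.
        self.mask_head = nn.Linear(hidden, 2 * num_bins)
        self.num_bins = num_bins

    def forward(self, mag_spec):
        # mag_spec: (batch, frames, num_bins) magnitude spectrogram of the mixture.
        h, _ = self.lstm(mag_spec)
        masks = torch.sigmoid(self.mask_head(h))
        music_mask, playback_mask = masks.split(self.num_bins, dim=-1)
        # Suppress both estimated interference components from the mixture.
        enhanced = mag_spec * (1.0 - music_mask) * (1.0 - playback_mask)
        return enhanced, music_mask, playback_mask

# Example usage on a random 100-frame spectrogram:
model = TwoLayerLstmMasker()
spec = torch.rand(1, 100, 257)
enhanced, music_mask, playback_mask = model(spec)
```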
As to claims 5, 12, and 19, Jensen does not explicitly disclose the method according to claim 4, the apparatus according to claim 11, and the non-transitory computer readable medium according to claim 18, wherein the codec comprises a residual vector quantization (RVQ) codec module configured to transform an input speech signal into a compressed latent representation, quantize the latent representation, and reconstruct a speech signal from a quantized version of the latent representation.
Zeghidour is in the field of processing audio using neural networks (Title and Abstract) and teaches that the codec comprises a residual vector quantization (RVQ) codec module (vector quantizer 108) configured to transform an input speech signal (audio waveform 112 into the encoder neural network 102) into a compressed latent representation, quantize the latent representation, and reconstruct a speech signal from a quantized version of the latent representation (Fig. 1 shows the compression sequences using the vector quantizer 108, while Fig. 2 shows the decompression; [0011] - The neural network parameters of the decoder neural network are also simultaneously optimized to enable more accurate reconstruction of audio waveforms from quantized feature vectors generated using the updated codebooks of the vector quantizers; [0026] - FIG. 1 depicts an example audio compression system 100 that can compress audio waveforms using an encoder neural network 102 and a residual vector quantizer 106. Similarly, FIG. 2 depicts an example audio decompression system 200 that can decompress compressed audio waveforms using a decoder neural network 104 and the residual vector quantizer 106; [0031] - The audio waveform 112 is processed (e.g., encoded) by the encoder 102 to generate a sequence of feature vectors 208 representing the waveform 112. Feature vectors 208 (e.g., embeddings, latent representations) are compressed representations of waveforms that extract the most relevant information about their audio content; [0035] - For example, at the first vector quantizer 108, the quantizer 106 can receive the feature vector 208 and select a code vector from its codebook 110 to represent the feature vector 208 based on a smallest distance metric. A residual vector can be computed as the difference between the feature vector 208 and the code vector representing the feature vector 208. The residual vector can be received by the next quantizer 108 in the sequence to select a code vector from its codebook 110 to represent the residual vector based on a smallest distance metric; [0038] - The entropy codec 302 can implement any appropriate lossless entropy coding, e.g., arithmetic coding, Huffman coding, etc.; [0052] - In some cases, the target waveform 204 is identical to the input waveform 202, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform 204 can also be modified with respect to the input waveform 202 to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples 116 with certain qualities. For instance, the target waveform 204 can be a speech enhanced version of the input waveform 202, such that the neural networks improve audio dialogue upon reconstruction of waveforms; [0094] - FIGS. 8A and 8B show an example of a fully convolutional neural network architecture for the encoder 102 and decoder 104 neural networks. C represents the number of channels and D is the dimensionality of the feature vectors 208. The architecture in FIGS. 8A and 8B is based on the SoundStream model developed by N. Zeghidour, A. Luebs, A. Omran, J. Skoglund and M. Tagliasacchi, "SoundStream: An End-to-End Neural Audio Codec").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jensen with the residual vector quantization of Zeghidour for the purpose of reconstructing a speech signal, thereby providing more accurate and enhanced reconstructions (Zeghidour; [0011], [0052]).
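For illustration only, the following sketch (hypothetical; simplified relative to Zeghidour) shows the residual vector quantization scheme described in paragraph [0035]: each quantizer stage represents the residual left by the previous stage, and reconstruction sums the selected code vectors.

```python
import numpy as np

def rvq_encode(feature, codebooks):
    """feature: (dim,) vector; codebooks: list of (num_codes, dim) arrays."""
    indices, residual = [], feature.copy()
    for codebook in codebooks:
        # Pick the code vector closest to the current residual (smallest distance metric).
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices

def rvq_decode(indices, codebooks):
    # Reconstruction is the sum of the selected code vectors from each stage.
    return sum(codebook[idx] for idx, codebook in zip(indices, codebooks))

# Example usage: 3 quantizer stages, 8-dimensional features, 16 codes per stage.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 8)) for _ in range(3)]
x = rng.standard_normal(8)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
```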
As to claims 6, 13, and 20, Jensen does not explicitly disclose the method according to claim 5, the apparatus according to claim 12, and the non-transitory computer readable medium according to claim 19, wherein the input speech signal is the output of the neural network.
Zeghidour is in the field of processing audio using neural networks (Title and Abstract) and teaches that the input speech signal (112, 102) is the output of the neural network (encoder neural network 102; Fig. 1; [0026] - FIG. 1 depicts an example audio compression system 100 that can compress audio waveforms using an encoder neural network 102 and a residual vector quantizer 106. Similarly, FIG. 2 depicts an example audio decompression system 200 that can decompress compressed audio waveforms using a decoder neural network 104 and the residual vector quantizer 106; [0029] - The audio waveform 112 can originate from any suitable audio source. For example, the waveform 112 can be a recording from an external audio device (e.g., speech from a microphone); [0031] - The audio waveform 112 is processed (e.g., encoded) by the encoder 102 to generate a sequence of feature vectors 208 representing the waveform 112).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jensen with the input speech of Zeghidour for the purpose of receiving a signal from a neural network, thereby generating feature vectors that can be processed by the residual vector quantizer to refine quantization (Zeghidour; [0032]-[0034]).
As to claims 7 and 14, Jensen discloses the method according to claim 5 and the apparatus according to claim 13, further comprising training the NNAFC ([0068] - The classification unit may be based on or comprise a neural network, e.g. a trained neural network; [0069] - The hearing aid may comprise an acoustic (and/or mechanical) feedback control (e.g. suppression) or echo-cancelling system. Adaptive feedback cancellation has the ability to track feedback path changes over time. It is typically based on a linear time invariant filter to estimate the feedback path but its filter weights are updated over time. The filter update may be calculated using stochastic gradient algorithms, including some form of the Least Mean Square (LMS) or the Normalized LMS (NLMS) algorithm).
Jensen does not explicitly disclose wherein the RVQ codec module is a module trained jointly with the neural network.
Zeghidour is in the field of processing audio using neural networks (Title and Abstract) and teaches that the RVQ codec module is a module trained jointly with a neural network ([0011] - The compression/decompression systems include an encoder neural network, a set of vector quantizers, and a decoder neural network that are jointly trained (i.e., from "end-to-end"). Jointly training the respective neural network parameters of the encoder and decoder neural networks along with the codebooks of the vector quantizers enables the parameters of the compression/decompression systems to be adapted in unison to achieve more efficient audio compression than would otherwise be possible; [0015] - The compression/decompression systems can be trained to jointly perform both audio data compression and audio data enhancement, e.g., de-noising. That is, the compression and decompression systems can be trained to simultaneously enhance (e.g., de-noise) an audio waveform as part of compressing and decompressing the waveform without increasing overall latency; [0043] - The training system 300 can enable efficient general-purpose compression or tailored compression (e.g., speech-tailored) by utilizing a suitable set of training examples 116 and various training procedures. Specifically, the training system 300 can jointly train the encoder neural network 102 and decoder neural network 104 to efficiently encode and decode feature vectors 208 of various waveforms contained in the training examples 116. Furthermore, the training system 300 can train the RVQ 106 to efficiently quantize the feature vectors 208. In particular, each codebook 110 of each cascading vector quantizer 108 can be trained to minimize quantization error; [0089] - FIG. 7 is a flow diagram of an example process 700 for jointly training an encoder neural network, a decoder neural network and a residual vector quantizer. For convenience, the process 700 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the training system 300 of FIG. 3, appropriately programmed in accordance with this specification, can perform the process 700).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Jensen with the joint training of Zeghidour for the purpose of training modules of different purposes together, thereby allowing the modules to adapt in unison to enable more efficient processing with less latency (Zeghidour; [0011], [0015]).
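For illustration only, the following sketch (hypothetical; greatly simplified) shows joint, end-to-end training of an encoder, a residual vector quantizer, and a decoder with a reconstruction loss, corresponding to the joint training described in Zeghidour's paragraphs [0011] and [0089]; the adversarial and feature losses are omitted, and the quantizer is assumed to return an auxiliary (commitment-style) loss term.

```python
import torch

def joint_training_step(encoder, quantizer, decoder, optimizer, noisy_wave, target_wave):
    """One hypothetical training step updating encoder, RVQ codebooks, and decoder together."""
    latents = encoder(noisy_wave)                    # compressed latent representation
    quantized, commitment_loss = quantizer(latents)  # RVQ stage; assumed to return an aux loss
    reconstruction = decoder(quantized)              # reconstructed (optionally enhanced) waveform
    loss = torch.nn.functional.l1_loss(reconstruction, target_wave) + commitment_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # all three modules adapt in unison
    return float(loss.detach())
```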
Conclusion
4. Any inquiry concerning this communication or earlier communications from the examiner should be directed to QUYNH H NGUYEN whose telephone number is (571)272-7489. The examiner can normally be reached Monday-Thursday 7:30AM-5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ahmad Matar can be reached on 571-272-7488. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/QUYNH H NGUYEN/Primary Examiner, Art Unit 2693