DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority Date: 03/07/2023
Claims 1-20 are pending, and claims 1, 11 and 20 are independent claims.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5, 10-13 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Jia et al. Pat App No. US 20220068256 A1 (Jia) in view of Rosenberg et al. Pat App No. US 20230058447 A1 (Rosenberg), further in view of Sun Pat App No. CN 114842860 A (Sun).
Regarding Claim 1. Jia discloses an electronic device (Jia, par 0014, speech-enabled device), comprising:
circuitry (Jia, para 0041-0043, digital electronic and/or optical circuitry, integrated circuitry… special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit)) configured to:
receive a dataset associated with a speech recognition task (Jia, para 0023, A speech recognition system 140 receives an audio signal 202 as an input and transcribes that audio signal into a transcription 142 as an output);
receive a text dataset (Jia, para 0024, the TTS system 150 receives, as input, text 152 and converts the text 152 to an output of synthesized playback audio);
train a speech recognition model for the speech recognition task (Jia, para 0022, the device 110 is configured to perform speech recognition using a speech recognition system 140 … (e.g., using the self-training model 200));
train a voice conversion model for a voice conversion task based on the trained speech recognition model (Jia, para 0021-0022, The device 110 further includes an audio subsystem with an audio capturing device (e.g., a microphone) 116 for capturing and converting spoken utterances 12 …the device 110 is configured to perform … conversion of text to speech using a TTS system 150 … (e.g., using the self-training model 200); [i.e., “The device 110 …converting spoken utterances ... configured to perform… using the self-training model” as “train a voice conversion model…”]);
train a text-to-speech (TTS) model for a text-to-speech conversion task (Jia, para 0034, the method 400 trains a TTS model 200 using the first plurality of recorded speech samples 162a-n from the assortment of speakers. The trained TTS model 200 is configured to output synthetic speech 154 as an audible representation of a text input 152); and
execute a set of operations (Jia, para 0004, perform operations), which includes:
an operation to generate an augmented speech dataset based on an application of the trained voice conversion model on the dataset (Jia, para 0018, Since speech synthesis models do not always have the luxury of a large amount of recorded speech data, TTS systems have attempted to generate personalized TTS voices by fine-tuning a pre-trained base model. In other words, this conventional fine-tuning approach refers to the process of first training a base model of a TTS system using a large corpus of training data and then re-training the base model, which was trained on the large corpus of training data, with the limited amount of recorded speech data for the target speaker having the desired personalized TTS voice… the fine-tuning process is adapted to retrain the pre-trained base model using a training corpus of an adequate size to reduce overfitting that still results in a personalized TTS voice that intelligibly resembles the target speaker. With this approach, a TTS system may deploy a high fidelity attention-based TTS model to generate a personalized TTS voice);
an operation to finetune the TTS conversion model based on the augmented speech dataset; (Jia, para 0004, The operations further include re-training the trained TTS model using retraining speech data)
an operation to apply the finetuned TTS conversion model on the text dataset to generate speech samples corresponding to text samples in the text dataset; (Jia, para 0033-0034, During the fine tuning stage 320, the fine tuning retraining differs from conventional fine tuning for other personalized speech models in that, instead of retraining solely on a small amount of recorded speech samples 172 from the target speaker 14, the fine tuning stage 320 retrains the pre-trained base model jointly on a combination of data from the target speaker 14 (i.e., the low data regime 170) and the complete set of training data for the training the model 200 during the initial training stage 310 (e.g., the large data regime 160). By including a larger volume of training data during the fine tuning stage, the training process 300 may reduce or potentially avoid overfitting to the target speaker 14 and generalizing the model 200 to input texts 152 beyond the fine tuning stage 320 training data… the method 400 trains a TTS model 200 using the first plurality of recorded speech samples 162a-n from the assortment of speakers. The trained TTS model 200 is configured to output synthetic speech 154 as an audible representation of a text input 152. At operation 406, the method 400 retrains the trained TTS model 200 using retraining speech data 162, 172 that includes the second plurality of recorded speech samples 172a-n combined with the first plurality of recorded speech samples 162a-n from the assortment of speakers. Here, the retrained TTS model 200 is configured to output synthetic speech 154 with speaking characteristics resembling the target speaker 14)
an operation to apply the voice conversion model on the speech samples to generate augmented text-speech dataset; (Jia, par 0004, The operations include receiving a first plurality of recorded speech samples from an assortment of speakers and a second plurality of recorded speech samples from a target speaker where the assortment of speakers does not include the target speaker. The operation also include training a text-to-speech (TTS) model using the first plurality of recorded speech samples from the assortment of speakers)
an operation to finetune the trained voice conversion model based on the received dataset and the finetuned speech recognition model, (Jia, para 0030, In order to form the personalized speech model 200 specific to the target speaker 14, the personalized speech model 200 undergoes a training process 300 divided into two stages, an initial training stage 310 and a fine tuning training stage 320… In the fine tuning stage 320, the training process 300 trains the pre-trained base model 200 (i.e., trained from the initial training stage 310) to generate synthesized speech 154 resembling speech of the target speaker 14).
Jia does not specifically disclose an operation to finetune the trained speech recognition model based on the augmented text-speech dataset.
However, Rosenberg, in the same field of endeavor, discloses an operation to finetune the trained speech recognition model based on the augmented text-speech dataset, (Rosenberg, para 0036-0038, In some examples, the training process initially trains the TTS system 300 using available transcribed audio samples. In some examples, the available audio samples used to train the TTS system 300 include in-domain audio samples associated with the target domain. In other examples, the available audio samples used to train the TTS system 300 include out-of-domain audio samples that are distinct from the target domain. In these examples, the TTS system 300 is generating utterance of synthesized speech 306 in the target domain for input to the ASR model 200 during the pre-training stage despite the TTS system 300 being trained on transcribed out of domain audio samples. The TTS system 300 may be trained on a variation of in- and out-of-domain in some examples. In some examples, the training process 300 applies data augmentation to at least one of the sample utterances of synthetic speech 306. The data augmentation may include, without limitation, adding noise, manipulating timing (e.g., stretching), or adding reverberation to the corresponding speech representation. Data augmentation may add different synthesized recording conditions to the synthesized speech 306. During the pre-training stage, the ASR model 200 receives, as input, each utterance of synthetic speech).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Rosenberg with the method of Jia because this would enable training ASR models on larger training datasets, which improves the accuracy of the ASR model; such a large volume of training data can be obtained via synthesized speech and/or data-augmented speech that can be incorporated to increase the volume of training data used to train the ASR models (Rosenberg, para 0002).
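For illustration only, the following Python sketch shows the kind of waveform-level augmentation Rosenberg's para 0037 enumerates (additive noise, timing manipulation by stretching, and reverberation). The function name, the SNR range, the stretch factors, and the impulse-response shape are hypothetical placeholders chosen for this sketch, not values taken from Rosenberg.

import numpy as np

def augment_waveform(wave: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    # Assumes a float waveform. Additive noise at a randomly drawn
    # signal-to-noise ratio (placeholder range).
    snr_db = rng.uniform(10.0, 30.0)
    noise = rng.standard_normal(wave.shape)
    scale = np.sqrt((wave ** 2).mean() / ((10 ** (snr_db / 10.0)) * (noise ** 2).mean() + 1e-12))
    out = wave + scale * noise
    # Timing manipulation: crude stretch/compress via linear resampling
    # (a production system would use speed perturbation or a phase vocoder).
    rate = rng.uniform(0.9, 1.1)
    positions = np.arange(0.0, len(out), rate)
    out = np.interp(positions, np.arange(len(out)), out)
    # Reverberation: convolve with a synthetic exponentially decaying impulse response.
    ir_len = int(0.3 * sr)
    impulse = rng.standard_normal(ir_len) * np.exp(-np.linspace(0.0, 8.0, ir_len))
    out = np.convolve(out, impulse / (np.abs(impulse).sum() + 1e-12))[: len(out)]
    return out.astype(wave.dtype)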
Jia in view of Rosenberg does not specifically disclose wherein the set of operations is executed for a number of iterations until a loss associated with the voice conversion model is below a threshold loss.
However, Sun, in the same field of endeavor, discloses wherein the set of operations is executed for a number of iterations until a loss associated with the voice conversion model is below a threshold loss. (Sun, 13th page, 15th para, model iteration sub-unit, for when the loss value is greater than the preset loss threshold value, iteratively updating the initial conversion model through the reverse propagation algorithm until the loss value is less than or equal to the preset loss threshold value, obtaining the trained voice conversion model).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Sun with the method of Jia in view of Rosenberg because this would enable the model iteration unit to execute the loss comparison by comparing the loss value with the preset loss threshold value, and because Sun's vector quantization of the content feature vector realizes non-parallel voice conversion processing, which improves the accuracy of the voice conversion (Sun, Abstract and 13th page, 13th-14th para).
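To make the mapped control flow concrete, the following Python-style sketch arranges the claim 1 operations as a loop that terminates once the voice conversion loss falls below the preset threshold, as in Sun. Every function named here (train_asr, train_vc, apply, finetune, pair, vc_loss) is a hypothetical placeholder standing in for a claimed operation, not an API from Jia, Rosenberg, or Sun.

def training_pipeline(dataset, text_dataset, loss_threshold, max_iterations):
    asr = train_asr(dataset)              # train speech recognition model
    vc = train_vc(dataset, asr)           # train voice conversion model based on trained ASR
    tts = train_tts(dataset)              # train TTS model
    for _ in range(max_iterations):       # "a number of iterations"
        augmented_speech = apply(vc, dataset)                          # augmented speech dataset
        tts = finetune(tts, augmented_speech)                          # finetune TTS on augmented speech
        speech = apply(tts, text_dataset)                              # speech samples for the text samples
        augmented_text_speech = pair(text_dataset, apply(vc, speech))  # augmented text-speech dataset
        asr = finetune(asr, augmented_text_speech)                     # finetune ASR (per Rosenberg)
        vc = finetune(vc, dataset, asr)                                # finetune VC using finetuned ASR
        if vc_loss(vc, dataset) < loss_threshold:                      # per Sun: stop once loss < threshold
            break
    return asr, vc, tts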
Regarding Claim 2. Jia in view of Rosenberg, further in view of Sun discloses the electronic device according to claim 1, wherein the circuitry is further configured to:
Furthermore, Rosenberg teaches:
determine whether the dataset only includes the speech samples in the dataset (Rosenberg, para 0036-0038, the training process initially trains the TTS system 300 using available transcribed audio samples. In some examples, the available audio samples used to train the TTS system 300 include in-domain audio samples associated with the target domain. In other examples, the available audio samples used to train the TTS system 300 include out-of-domain audio samples that are distinct from the target domain. In these examples, the TTS system 300 is generating utterance of synthesized speech 306 in the target domain for input to the ASR model 200 during the pre-training stage despite the TTS system 300 being trained on transcribed out of domain audio samples. The TTS system 300 may be trained on a variation of in- and out-of-domain in some examples); and
generate the text samples corresponding to the speech samples based on application of the trained speech recognition model on the speech samples (Rosenberg, para 0035, The sequence of text may include graphemes or phonemes. The transcripts 320 may be sampled from a language model trained to generate text utterances in the target domain. The TTS system 330 may apply a speaker embedding, z, when converting the transcript 320 to obtain synthesized speech with a specific speaking style and prosody associated with the speaker embedding),
wherein the augmented text-speech dataset includes the text samples and the speech samples (Rosenberg, para 0035-0037, The transcripts 320 may be sampled from a language model trained to generate text utterances in the target domain… the TTS system 300 is generating utterance of synthesized speech 306 in the target domain for input to the ASR model 200 during the pre-training stage despite the TTS system 300 being trained on transcribed out of domain audio samples. The TTS system 300 may be trained on a variation of in- and out-of-domain in some examples… In some examples, the training process 300 applies data augmentation to at least one of the sample utterances of synthetic speech 306. The data augmentation may include, without limitation, adding noise, manipulating timing (e.g., stretching), or adding reverberation to the corresponding speech representation. Data augmentation may add different synthesized recording conditions to the synthesized speech 306).
Regarding Claim 3. Jia in view of Rosenberg, further in view of Sun discloses the electronic device according to claim 1, wherein the dataset includes at least one of voice recordings of a set of human speakers in one or more languages and text transcripts corresponding to the voice recordings. (Jia, para 0030, When training the model 200, the training data includes training examples where each training example includes a recorded speech sample and text corresponding to that recorded speech sample (e.g., a textual representation of the characters, words, or phrases spoken during the recorded speech sample). A recorded speech sample may be spoken utterances 12 by one or more respective speakers that have been recorded by an audio capturing device (e.g., the audio capturing device 116)).
Regarding Claim 10. Jia in view of Rosenberg, further in view of Sun discloses the electronic device according to claim 1, wherein the set of operations is further executed for the number of the iterations until a TTS loss associated with the TTS conversion model is below a threshold TTS loss. (Jia, para 0030, Referring to FIG. 3, the TTS model 200 is configured to generate synthesized speech 154 with speaking characteristics of the target speaker 14… With a textual representation of each recorded speech sample, each stage 310, 320 of the training process 300 trains the model 200 to generate an output of synthesized speech 154 that resembles the input training example. Often, to train to this resemblance, the training process 300 uses an optimization approach such that the training process 300 trains to minimize a loss function (e.g., the loss function described in “Transfer Learning from Speaker Verification to Multi-speaker Text-to-Speech Synthesis”)).
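For context only, one common form such a TTS loss can take is a mel-spectrogram reconstruction loss; the L1 formulation below is an assumption made for this sketch (the loss Jia actually points to is the one in the cited transfer-learning/Tacotron-style work).

import torch

def tts_loss(predicted_mels: torch.Tensor, target_mels: torch.Tensor) -> torch.Tensor:
    # L1 distance between predicted and ground-truth mel-spectrogram frames,
    # a common reconstruction term in Tacotron-style TTS training.
    return torch.mean(torch.abs(predicted_mels - target_mels))

# The claimed stopping condition then reduces to: continue the set of
# operations while tts_loss(...) >= threshold_tts_loss.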
Regarding Claim 11. A method, comprising:
in an electronic device (Jia, par 0014, speech-enabled device):
receiving a dataset associated with a speech recognition task (Jia, para 0023, A speech recognition system 140 receives an audio signal 202 as an input and transcribes that audio signal into a transcription 142 as an output);
receiving a text dataset (Jia, para 0024, the TTS system 150 receives, as input, text 152);
training a speech recognition model for the speech recognition task (Jia, para 0022, the device 110 is configured to perform speech recognition using a speech recognition system 140 … (e.g., using the self-training model 200));
training a voice conversion model for a voice conversion task based on the trained speech recognition model (Jia, para 0021-0022, The device 110 further includes an audio subsystem with an audio capturing device (e.g., a microphone) 116 for capturing and converting spoken utterances 12 …the device 110 is configured to perform … conversion of text to speech using a TTS system 150 … (e.g., using the self-training model 200); [i.e., “The device 110 …converting spoken utterances ... configured to perform… using the self-training model” as “train a voice conversion model…”]);
training a text-to-speech (TTS) model for a text-to-speech conversion task (Jia, para 0034, the method 400 trains a TTS model 200 using the first plurality of recorded speech samples 162a-n from the assortment of speakers. The trained TTS model 200 is configured to output synthetic speech 154 as an audible representation of a text input 152); and
executing a set of operations (Jia, para 0004, perform operations) comprising:
generating an augmented speech dataset based on an application of the trained voice conversion model on the dataset (Jia, para 0018, Since speech synthesis models do not always have the luxury of a large amount of recorded speech data, TTS systems have attempted to generate personalized TTS voices by fine-tuning a pre-trained base model. In other words, this conventional fine-tuning approach refers to the process of first training a base model of a TTS system using a large corpus of training data and then re-training the base model, which was trained on the large corpus of training data, with the limited amount of recorded speech data for the target speaker having the desired personalized TTS voice… the fine-tuning process is adapted to retrain the pre-trained base model using a training corpus of an adequate size to reduce overfitting that still results in a personalized TTS voice that intelligibly resembles the target speaker. With this approach, a TTS system may deploy a high fidelity attention-based TTS model to generate a personalized TTS voice);
finetuning the TTS conversion model based on the augmented speech dataset; (Jia, para 0004, The operations further include re-training the trained TTS model using retraining speech data)
applying the finetuned TTS conversion model on the text dataset to generate speech samples corresponding to text samples in the text dataset; (Jia, para 0033-0034, During the fine tuning stage 320, the fine tuning retraining differs from conventional fine tuning for other personalized speech models in that, instead of retraining solely on a small amount of recorded speech samples 172 from the target speaker 14, the fine tuning stage 320 retrains the pre-trained base model jointly on a combination of data from the target speaker 14 (i.e., the low data regime 170) and the complete set of training data for the training the model 200 during the initial training stage 310 (e.g., the large data regime 160). By including a larger volume of training data during the fine tuning stage, the training process 300 may reduce or potentially avoid overfitting to the target speaker 14 and generalizing the model 200 to input texts 152 beyond the fine tuning stage 320 training data… the method 400 trains a TTS model 200 using the first plurality of recorded speech samples 162a-n from the assortment of speakers. The trained TTS model 200 is configured to output synthetic speech 154 as an audible representation of a text input 152. At operation 406, the method 400 retrains the trained TTS model 200 using retraining speech data 162, 172 that includes the second plurality of recorded speech samples 172a-n combined with the first plurality of recorded speech samples 162a-n from the assortment of speakers. Here, the retrained TTS model 200 is configured to output synthetic speech 154 with speaking characteristics resembling the target speaker 14)
applying the voice conversion model on the speech samples to generate augmented text-speech dataset; (Jia, par 0004, The operations include receiving a first plurality of recorded speech samples from an assortment of speakers and a second plurality of recorded speech samples from a target speaker where the assortment of speakers does not include the target speaker. The operation also include training a text-to-speech (TTS) model using the first plurality of recorded speech samples from the assortment of speakers)
finetuning the trained voice conversion model based on the received dataset and the finetuned speech recognition model, (Jia, para 0030, In order to form the personalized speech model 200 specific to the target speaker 14, the personalized speech model 200 undergoes a training process 300 divided into two stages, an initial training stage 310 and a fine tuning training stage 320… In the fine tuning stage 320, the training process 300 trains the pre-trained base model 200 (i.e., trained from the initial training stage 310) to generate synthesized speech 154 resembling speech of the target speaker 14).
Jia does not specifically disclose finetuning the trained speech recognition model based on the augmented text-speech dataset; and
However, Rosenberg, in the same field of endeavor, discloses finetuning the trained speech recognition model based on the augmented text-speech dataset (Rosenberg, para 0036-0038, In some examples, the training process initially trains the TTS system 300 using available transcribed audio samples. In some examples, the available audio samples used to train the TTS system 300 include in-domain audio samples associated with the target domain. In other examples, the available audio samples used to train the TTS system 300 include out-of-domain audio samples that are distinct from the target domain. In these examples, the TTS system 300 is generating utterance of synthesized speech 306 in the target domain for input to the ASR model 200 during the pre-training stage despite the TTS system 300 being trained on transcribed out of domain audio samples. The TTS system 300 may be trained on a variation of in- and out-of-domain in some examples. In some examples, the training process 300 applies data augmentation to at least one of the sample utterances of synthetic speech 306. The data augmentation may include, without limitation, adding noise, manipulating timing (e.g., stretching), or adding reverberation to the corresponding speech representation. Data augmentation may add different synthesized recording conditions to the synthesized speech 306. During the pre-training stage, the ASR model 200 receives, as input, each utterance of synthetic speech).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Rosenberg with the method of Jia because this would enable training ASR models on larger training datasets, which improves the accuracy of the ASR model; such a large volume of training data can be obtained via synthesized speech and/or data-augmented speech that can be incorporated to increase the volume of training data used to train the ASR models (Rosenberg, para 0002).
Jia in view of Rosenberg does not specifically disclose wherein the set of operations is executed for a number of iterations until a loss associated with the voice conversion model is below a threshold loss.
However, Sun, in the same field of endeavor, discloses wherein the set of operations is executed for a number of iterations until a loss associated with the voice conversion model is below a threshold loss. (Sun, 13th page, 15th para, model iteration sub-unit, for when the loss value is greater than the preset loss threshold value, iteratively updating the initial conversion model through the reverse propagation algorithm until the loss value is less than or equal to the preset loss threshold value, obtaining the trained voice conversion model).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Sun with the method of Jia in view of Rosenberg because this would enable the model iteration unit to execute the loss comparison by comparing the loss value with the preset loss threshold value, and because Sun's vector quantization of the content feature vector realizes non-parallel voice conversion processing, which improves the accuracy of the voice conversion (Sun, Abstract and 13th page, 13th-14th para).
Regarding Claim 12. Jia in view of Rosenberg, further in view of Sun discloses the method according to claim 11, further comprising:
Furthermore, Rosenberg teaches:
determining whether the dataset only includes the speech samples in the dataset (Rosenberg, para 0036-0038, the training process initially trains the TTS system 300 using available transcribed audio samples. In some examples, the available audio samples used to train the TTS system 300 include in-domain audio samples associated with the target domain. In other examples, the available audio samples used to train the TTS system 300 include out-of-domain audio samples that are distinct from the target domain. In these examples, the TTS system 300 is generating utterance of synthesized speech 306 in the target domain for input to the ASR model 200 during the pre-training stage despite the TTS system 300 being trained on transcribed out of domain audio samples. The TTS system 300 may be trained on a variation of in- and out-of-domain in some examples); and
generating the text samples corresponding to the speech samples based on application of the trained speech recognition model on the speech samples (Rosenberg, para 0035, The sequence of text may include graphemes or phonemes. The transcripts 320 may be sampled from a language model trained to generate text utterances in the target domain. The TTS system 330 may apply a speaker embedding, z, when converting the transcript 320 to obtain synthesized speech with a specific speaking style and prosody associated with the speaker embedding),
wherein the augmented text-speech dataset includes the text samples and the speech samples (Rosenberg, para 0035-0037, The transcripts 320 may be sampled from a language model trained to generate text utterances in the target domain… the TTS system 300 is generating utterance of synthesized speech 306 in the target domain for input to the ASR model 200 during the pre-training stage despite the TTS system 300 being trained on transcribed out of domain audio samples. The TTS system 300 may be trained on a variation of in- and out-of-domain in some examples… In some examples, the training process 300 applies data augmentation to at least one of the sample utterances of synthetic speech 306. The data augmentation may include, without limitation, adding noise, manipulating timing (e.g., stretching), or adding reverberation to the corresponding speech representation. Data augmentation may add different synthesized recording conditions to the synthesized speech 306).
Regarding Claim 13. Jia in view of Rosenberg, further in view of Sun discloses the method according to claim 11, wherein the dataset includes at least one of voice recordings of a set of human speakers in one or more languages and text transcripts corresponding to the voice recordings (Jia, para 0030, When training the model 200, the training data includes training examples where each training example includes a recorded speech sample and text corresponding to that recorded speech sample (e.g., a textual representation of the characters, words, or phrases spoken during the recorded speech sample). A recorded speech sample may be spoken utterances 12 by one or more respective speakers that have been recorded by an audio capturing device (e.g., the audio capturing device 116)).
Regarding Claim 19. Jia in view of Rosenberg, further in view of Sun discloses the method according to claim 11, wherein the set of operations is further executed for the number of the iterations until a TTS loss associated with the TTS conversion model is below a threshold TTS loss. (Jia, para 0030, Referring to FIG. 3, the TTS model 200 is configured to generate synthesized speech 154 with speaking characteristics of the target speaker 14… With a textual representation of each recorded speech sample, each stage 310, 320 of the training process 300 trains the model 200 to generate an output of synthesized speech 154 that resembles the input training example. Often, to train to this resemblance, the training process 300 uses an optimization approach such that the training process 300 trains to minimize a loss function (e.g., the loss function described in “Transfer Learning from Speaker Verification to Multi-speaker Text-to-Speech Synthesis”)).
Regarding Claim 20. A non-transitory computer-readable medium having stored thereon, computer-executable instructions that, when executed by an electronic device, cause the electronic device to execute operations (Jia, para 0042, These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor), the operations comprising:
receiving a dataset associated with a speech recognition task (Jia, para 0023, A speech recognition system 140 receives an audio signal 202 as an input and transcribes that audio signal into a transcription 142 as an output);
receiving a text dataset (Jia, para 0024, the TTS system 150 receives, as input, text 152);
training a speech recognition model for the speech recognition task (Jia, para 0022, the device 110 is configured to perform speech recognition using a speech recognition system 140 … (e.g., using the self-training model 200));
training a voice conversion model for a voice conversion task based on the trained speech recognition model (Jia, para 0021-0022, The device 110 further includes an audio subsystem with an audio capturing device (e.g., a microphone) 116 for capturing and converting spoken utterances 12 …the device 110 is configured to perform … conversion of text to speech using a TTS system 150 … (e.g., using the self-training model 200); [i.e., “The device 110 …converting spoken utterances ... configured to perform… using the self-training model” as “train a voice conversion model…”]);
training a text-to-speech (TTS) model for a text-to-speech conversion task (Jia, para 0034, the method 400 trains a TTS model 200 using the first plurality of recorded speech samples 162a-n from the assortment of speakers. The trained TTS model 200 is configured to output synthetic speech 154 as an audible representation of a text input 152); and
executing a set of operations (Jia, para 0004, perform operations) comprising:
generating an augmented speech dataset based on an application of the trained voice conversion model on the dataset (Jia, para 0018, Since speech synthesis models do not always have the luxury of a large amount of recorded speech data, TTS systems have attempted to generate personalized TTS voices by fine-tuning a pre-trained base model. In other words, this conventional fine-tuning approach refers to the process of first training a base model of a TTS system using a large corpus of training data and then re-training the base model, which was trained on the large corpus of training data, with the limited amount of recorded speech data for the target speaker having the desired personalized TTS voice… the fine-tuning process is adapted to retrain the pre-trained base model using a training corpus of an adequate size to reduce overfitting that still results in a personalized TTS voice that intelligibly resembles the target speaker. With this approach, a TTS system may deploy a high fidelity attention-based TTS model to generate a personalized TTS voice);
finetuning the TTS conversion model based on the augmented speech dataset; (Jia, para 0004, The operations further include re-training the trained TTS model using retraining speech data)
applying the finetuned TTS conversion model on the text dataset to generate speech samples corresponding to text samples in the text dataset; (Jia, para 0033-0034, During the fine tuning stage 320, the fine tuning retraining differs from conventional fine tuning for other personalized speech models in that, instead of retraining solely on a small amount of recorded speech samples 172 from the target speaker 14, the fine tuning stage 320 retrains the pre-trained base model jointly on a combination of data from the target speaker 14 (i.e., the low data regime 170) and the complete set of training data for the training the model 200 during the initial training stage 310 (e.g., the large data regime 160). By including a larger volume of training data during the fine tuning stage, the training process 300 may reduce or potentially avoid overfitting to the target speaker 14 and generalizing the model 200 to input texts 152 beyond the fine tuning stage 320 training data… the method 400 trains a TTS model 200 using the first plurality of recorded speech samples 162a-n from the assortment of speakers. The trained TTS model 200 is configured to output synthetic speech 154 as an audible representation of a text input 152. At operation 406, the method 400 retrains the trained TTS model 200 using retraining speech data 162, 172 that includes the second plurality of recorded speech samples 172a-n combined with the first plurality of recorded speech samples 162a-n from the assortment of speakers. Here, the retrained TTS model 200 is configured to output synthetic speech 154 with speaking characteristics resembling the target speaker 14)
applying the voice conversion model on the speech samples to generate augmented text-speech dataset (Jia, par 0004, The operations include receiving a first plurality of recorded speech samples from an assortment of speakers and a second plurality of recorded speech samples from a target speaker where the assortment of speakers does not include the target speaker. The operation also include training a text-to-speech (TTS) model using the first plurality of recorded speech samples from the assortment of speakers);
finetuning the trained voice conversion model based on the received dataset and the finetuned speech recognition model (Jia, para 0030, In order to form the personalized speech model 200 specific to the target speaker 14, the personalized speech model 200 undergoes a training process 300 divided into two stages, an initial training stage 310 and a fine tuning training stage 320… In the fine tuning stage 320, the training process 300 trains the pre-trained base model 200 (i.e., trained from the initial training stage 310) to generate synthesized speech 154 resembling speech of the target speaker 14).
Jia does not specifically disclose finetuning the trained speech recognition model based on the augmented text-speech dataset.
However, Rosenberg, in the same field of endeavor, discloses finetuning the trained speech recognition model based on the augmented text-speech dataset (Rosenberg, para 0036-0038, In some examples, the training process initially trains the TTS system 300 using available transcribed audio samples. In some examples, the available audio samples used to train the TTS system 300 include in-domain audio samples associated with the target domain. In other examples, the available audio samples used to train the TTS system 300 include out-of-domain audio samples that are distinct from the target domain. In these examples, the TTS system 300 is generating utterance of synthesized speech 306 in the target domain for input to the ASR model 200 during the pre-training stage despite the TTS system 300 being trained on transcribed out of domain audio samples. The TTS system 300 may be trained on a variation of in- and out-of-domain in some examples. In some examples, the training process 300 applies data augmentation to at least one of the sample utterances of synthetic speech 306. The data augmentation may include, without limitation, adding noise, manipulating timing (e.g., stretching), or adding reverberation to the corresponding speech representation. Data augmentation may add different synthesized recording conditions to the synthesized speech 306. During the pre-training stage, the ASR model 200 receives, as input, each utterance of synthetic speech).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Rosenberg with the method of Jia because this would enable training ASR models on larger training datasets, which improves the accuracy of the ASR model; such a large volume of training data can be obtained via synthesized speech and/or data-augmented speech that can be incorporated to increase the volume of training data used to train the ASR models (Rosenberg, para 0002).
Jia in view of Rosenberg does not specifically disclose wherein the set of operations is executed for a number of iterations until a loss associated with the voice conversion model is below a threshold loss.
However, Sun, in the same field of endeavor, discloses wherein the set of operations is executed for a number of iterations until a loss associated with the voice conversion model is below a threshold loss. (Sun, 13th page, 15th para, model iteration sub-unit, for when the loss value is greater than the preset loss threshold value, iteratively updating the initial conversion model through the reverse propagation algorithm until the loss value is less than or equal to the preset loss threshold value, obtaining the trained voice conversion model).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Sun with the method of Jia in view of Rosenberg because this would enable the model iteration unit to execute the loss comparison by comparing the loss value with the preset loss threshold value, and because Sun's vector quantization of the content feature vector realizes non-parallel voice conversion processing, which improves the accuracy of the voice conversion (Sun, Abstract and 13th page, 13th-14th para).
Claims 4-5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Jia et al. Pat App No. US 20220068256 A1 (Jia) in view of Rosenberg et al. Pat App No. US 20230058447 A1 (Rosenberg), further in view of Sun, and further in view of Chen et al. Pat No. US 11990117 B2 (Chen).
Regarding Claim 4. Jia in view of Rosenberg, further in view of Sun discloses the electronic device according to claim 3, wherein the circuitry is further configured to:
generate mel-spectrograms corresponding to the voice recordings; (Jia, para 0027, The decoder 230 is configured as a neural network (e.g., an autoregressive recurrent neural network) to generate an output audio signal 232 (e.g., an output sequence mel-frequency spectrograms) of expressive speech)
feed inputs that include the mel-spectrograms and speaker embeddings associated with the mel-spectrograms to the trained voice conversion model or the finetuned voice conversion model to generate mel-spectrogram predictions; (Jia, para 0026-0027, With an attention mechanism 220, the model 200 may be able to generate an output sequence (e.g., a sequence of output log-mel spectrogram frames) based on additional inputs, such as the speaker embedding 204, that receive particular attention weights in order to generate the context vector 222. The decoder 230 is configured as a neural network (e.g., an autoregressive recurrent neural network) to generate an output audio signal 232 (e.g., an output sequence mel-frequency spectrograms) of expressive speech that includes the intended prosody and speaker characteristics associated with the voice of the target speaker 14. For instance, based on the context vector 222, the decoder 230 predicts a representation of a speech signal (e.g., a mel frame or spectrogram frame) from the encoded representation 212. In some examples, the decoder 230 includes an architecture similar to Tacotron or Tacotron 2 (See “Tacotron: Towards End-to-End Speech Synthesis,” by Y. Wang, et al., available at, e.g., https://arxiv.org/pdf/1703.10135.pdf and “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions…”)).
Jia in view of Rosenberg and Sun does not specifically disclose compute a speech consistency loss by use of the trained speech recognition model or the finetuned speech recognition model, wherein the speech consistency loss is computed based on the mel-spectrogram predictions and the inputs that include the mel-spectrograms, and the voice conversion model is trained or finetuned based on the computed speech consistency loss.
However, Chen, in the same field of endeavor, discloses:
compute a speech consistency loss by use of the trained speech recognition model or the finetuned speech recognition model (Chen, col 10, ln 14-31, For instance, the training process 300 may employ a consistent loss term module 350 configured to receive, at each of a plurality of time steps, the corresponding speech recognition results 312a, 312b output by the ASR model 200, and determine the consistency loss term 352 between the corresponding speech recognition results 312a, 312b at each of the plurality of time steps … The consistent loss term 352 provides an “unsupervised” loss term that is independent of the accuracy of the ASR model 200 and may be employed to update parameters of the ASR model 200 for promoting consistency between the first speech recognition result 312a recognized), wherein
the speech consistency loss is computed based on the mel-spectrogram predictions and the inputs that include the mel-spectrograms (Chen, col 13, ln 57-67, the ASR model 200 receives the mel-frequency spectrogram frames for the native synthesized speech representation 306a conditioned on the first conditioning input 306a (e.g., the speaker characteristics 306a of the native speaker of the Kannada language) and generates the first speech recognition result 312a. The ASR model also receives the mel-frequency spectrogram frames for the cross-lingual synthesized speech representation 306b conditioned on the second conditioning input 304b (e.g., the speaker characteristic 304b of the native speaker of the English language) and generates the second speech recognition result 312b… the training process 300 determines a consistent loss term 352 based on the first and second speech recognition results 312a, 312b. For instance, the training process 300 may employ a consistent loss term module 350 configured to receive, at each of a plurality of time steps, the corresponding speech recognition results 312a, 312b output by the ASR model 200, and determine the consistency loss term 352 between the corresponding speech recognition results 312a… The consistent loss term 352 provides an “unsupervised” loss term that is independent of the accuracy of the ASR model 200 and may be employed to update parameters of the ASR model 200 for promoting consistency between the first speech recognition result), and
the voice conversion model is trained or finetuned based on the computed speech consistency loss (Chen, col 4, ln 56 – col 5, ln 31, A consistent loss term module determines a consistent loss term based on a comparison of the first and second speech recognition results and the ASR model updates parameters based on the consistent loss term. …FIG. 1 illustrates an automated speech recognition (ASR) system 100 implementing an ASR model 200 that resides on a user device 102 of a user 104 and/or on a remote computing device 201 …The user device 102 includes an audio subsystem 108 configured to receive an utterance 106 spoken by the user 104 (e.g., the user device 102 may include one or more microphones for recording the spoken utterance 106) and convert the utterance 106).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Chen with the method of Jia in view of Rosenberg and Sun because this would enable determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term, which would improve cross-language speech synthesis (Chen, col 1, ln 16-54).
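For illustration, a minimal sketch of a speech consistency loss of the kind mapped above: a frozen ASR model scores both the input mel-spectrograms and the voice conversion model's mel-spectrogram predictions, and divergence between the two output distributions is penalized. The KL formulation and the asr_model interface (per-frame logits of shape batch x frames x vocab) are assumptions made for this sketch, not the formulation of Chen or of the claims.

import torch
import torch.nn.functional as F

def speech_consistency_loss(asr_model: torch.nn.Module,
                            input_mels: torch.Tensor,
                            predicted_mels: torch.Tensor) -> torch.Tensor:
    # Reference distribution: the (frozen) ASR model on the input mel-spectrograms.
    with torch.no_grad():
        ref_logits = asr_model(input_mels)
    # ASR output on the voice conversion model's predictions; gradients flow
    # back to the upstream VC model that produced predicted_mels.
    pred_logits = asr_model(predicted_mels)
    return F.kl_div(F.log_softmax(pred_logits, dim=-1),
                    F.softmax(ref_logits, dim=-1),
                    reduction="batchmean")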
Regarding Claim 5. Jia in view of Rosenberg, further in view of Sun, and further in view of Chen discloses the electronic device according to claim 4, wherein the speaker embeddings include information associated with a pitch, a loudness, and an intensity of a voice of a human speaker. (Jia, para 0026-0027, With an attention mechanism 220, the model 200 may be able to generate an output sequence (e.g., a sequence of output log-mel spectrogram frames) based on additional inputs, such as the speaker embedding 204, that receive particular attention weights in order to generate the context vector 222. The decoder 230 is configured as a neural network (e.g., an autoregressive recurrent neural network) to generate an output audio signal 232 (e.g., an output sequence mel-frequency spectrograms) of expressive speech that includes the intended prosody and speaker characteristics associated with the voice of the target speaker 14).
Regarding Claim 14. Jia in view of Rosenberg, further in view of Sun discloses the method according to claim 13, further comprising:
generating mel-spectrograms corresponding to the voice recordings; (Jia, para 0027, The decoder 230 is configured as a neural network (e.g., an autoregressive recurrent neural network) to generate an output audio signal 232 (e.g., an output sequence mel-frequency spectrograms) of expressive speech)
feeding inputs that include the mel-spectrograms and speaker embeddings associated with the mel-spectrograms to the trained voice conversion model or the finetuned voice conversion model to generate mel-spectrogram predictions; (Jia, para 0026-0027, With an attention mechanism 220, the model 200 may be able to generate an output sequence (e.g., a sequence of output log-mel spectrogram frames) based on additional inputs, such as the speaker embedding 204, that receive particular attention weights in order to generate the context vector 222. The decoder 230 is configured as a neural network (e.g., an autoregressive recurrent neural network) to generate an output audio signal 232 (e.g., an output sequence mel-frequency spectrograms) of expressive speech that includes the intended prosody and speaker characteristics associated with the voice of the target speaker 14. For instance, based on the context vector 222, the decoder 230 predicts a representation of a speech signal (e.g., a mel frame or spectrogram frame) from the encoded representation 212. In some examples, the decoder 230 includes an architecture similar to Tacotron or Tacotron 2 (See “Tacotron: Towards End-to-End Speech Synthesis,” by Y. Wang, et al., available at, e.g., https://arxiv.org/pdf/1703.10135.pdf and “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions…”)).
Jia in view of Rosenberg and Sun does not specifically disclose computing a speech consistency loss by use of the trained speech recognition model or the finetuned speech recognition model, wherein the speech consistency loss is computed based on the mel-spectrogram predictions and the inputs that include the mel-spectrograms, and the voice conversion model is trained or finetuned based on the computed speech consistency loss.
However, Chen, in the same field of endeavor, discloses:
computing a speech consistency loss by use of the trained speech recognition model or the finetuned speech recognition model (Chen, col 10, ln 14-31, For instance, the training process 300 may employ a consistent loss term module 350 configured to receive, at each of a plurality of time steps, the corresponding speech recognition results 312a, 312b output by the ASR model 200, and determine the consistency loss term 352 between the corresponding speech recognition results 312a, 312b at each of the plurality of time steps … The consistent loss term 352 provides an “unsupervised” loss term that is independent of the accuracy of the ASR model 200 and may be employed to update parameters of the ASR model 200 for promoting consistency between the first speech recognition result 312a recognized), wherein
the speech consistency loss is computed based on the mel-spectrogram predictions and the inputs that include the mel-spectrograms (Chen, col 13, ln 57-67, the ASR model 200 receives the mel-frequency spectrogram frames for the native synthesized speech representation 306a conditioned on the first conditioning input 306a (e.g., the speaker characteristics 306a of the native speaker of the Kannada language) and generates the first speech recognition result 312a. The ASR model also receives the mel-frequency spectrogram frames for the cross-lingual synthesized speech representation 306b conditioned on the second conditioning input 304b (e.g., the speaker characteristic 304b of the native speaker of the English language) and generates the second speech recognition result 312b… the training process 300 determines a consistent loss term 352 based on the first and second speech recognition results 312a, 312b. For instance, the training process 300 may employ a consistent loss term module 350 configured to receive, at each of a plurality of time steps, the corresponding speech recognition results 312a, 312b output by the ASR model 200, and determine the consistency loss term 352 between the corresponding speech recognition results 312a… The consistent loss term 352 provides an “unsupervised” loss term that is independent of the accuracy of the ASR model 200 and may be employed to update parameters of the ASR model 200 for promoting consistency between the first speech recognition result), and
the voice conversion model is trained or finetuned based on the computed speech consistency loss (Chen, col 4, ln 56 – col 5, ln 31, A consistent loss term module determines a consistent loss term based on a comparison of the first and second speech recognition results and the ASR model updates parameters based on the consistent loss term. …FIG. 1 illustrates an automated speech recognition (ASR) system 100 implementing an ASR model 200 that resides on a user device 102 of a user 104 and/or on a remote computing device 201 …The user device 102 includes an audio subsystem 108 configured to receive an utterance 106 spoken by the user 104 (e.g., the user device 102 may include one or more microphones for recording the spoken utterance 106) and convert the utterance 106).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Chen with the method of Jia in view of Rosenberg and Sun because this would enable determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term, which would improve cross-language speech synthesis (Chen, col 1, ln 16-54).
Claims 6, 9, 15 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Jia et al. Pat App No. US 20220068256 A1 (Jia) in view of Rosenberg et al. Pat App No. US 20230058447 A1 (Rosenberg), further in view of Sun, and further in view of Gupta et al. Pat No. US 12112752 B1 (Gupta).
Regarding Claim 6. Jia in view of Rosenberg, further in view of Sun discloses the electronic device according to claim 1, wherein the circuitry is further configured to:
Jia in view of Rosenberg and Sun does not specifically disclose compute the loss in terms of a word error rate (WER) based on application of the trained speech recognition model or the finetuned speech recognition model on a validation set of the dataset; and compare the determined WER with a threshold WER loss, wherein the speech recognition model is trained or finetuned based on the comparison.
However, Gupta, in the same field of endeavor, discloses:
compute the loss in terms of a word error rate (WER) based on application of the trained speech recognition model or the finetuned speech recognition model on a validation set of the dataset (Gupta, col 21, ln 4-15, Processing may continue at action 620, at which at least one cluster of account identifiers with performance data that is below a threshold performance metric may be determined. For example, after forming the clusters, the average performance metric score (or some other aggregated performance metric score or scores) associated with each cluster may be determined. A threshold performance metric may be determined. This threshold performance metric may be determined empirically and/or statistically as an outlier); and
compare the determined WER with a threshold WER loss, wherein the speech recognition model is trained or finetuned based on the comparison (Gupta, col 21, ln 14-25, This threshold performance metric may indicate poor performance by the relevant model of the natural language processing system (e.g., for ASR component 250 a WER that is above a particular threshold may indicate poor performance). Clusters with performance data that is below a threshold may be identified and data points (e.g., feature representations of the natural language inputs) of that cluster may be included in a training data set that may be used to retrain the relevant machine learning models of the natural language processing system).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Gupta with the method of Jia in view of Rosenberg and Sun because this would enable improving the relevant machine learning models, and such performance may be improved especially for previously underserved cohorts of individuals (Gupta, col 21, ln 25-28).
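For illustration, a minimal Python sketch of the mapped limitation: compute WER over a validation set using the standard Levenshtein formulation and compare it against a preset threshold. The names transcribe and validation_set are hypothetical placeholders for the speech recognition model's inference function and the held-out (audio, reference transcript) pairs; the threshold value is whatever the system designer presets.

def word_error_rate(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + insertions + deletions) / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def below_threshold_wer(transcribe, validation_set, threshold_wer: float) -> bool:
    # validation_set yields (audio, reference_transcript) pairs.
    rates = [word_error_rate(ref, transcribe(audio)) for audio, ref in validation_set]
    return (sum(rates) / max(len(rates), 1)) < threshold_wer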
Regarding Claim 9. Jia in view of Rosenberg, further in view of Sun discloses the electronic device according to claim 1.
Jia in view of Rosenberg and Sun does not specifically disclose wherein the set of operations is further executed for the number of the iterations until a Word Error Rate (WER) loss associated with the speech recognition model is below a threshold WER loss.
However, Gupta, in the same field of endeavor, discloses wherein the set of operations is further executed for the number of the iterations until a Word Error Rate (WER) loss associated with the speech recognition model is below a threshold WER loss (Gupta, col 21, ln 13-21, A threshold performance metric may be determined. This threshold performance metric may be determined empirically and/or statistically as an outlier. This threshold performance metric may indicate poor performance by the relevant model of the natural language processing system (e.g., for ASR component 250 a WER that is above a particular threshold may indicate poor performance). Clusters with performance data that is below a threshold may be identified and data points).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Gupta with the method of Jia in view of Rosenberg and Sun because this would enable improving the relevant machine learning models, and such performance may be improved especially for previously underserved cohorts of individuals (Gupta, col 21, ln 25-28).
Regarding Claim 15. Jia in view of Rosenberg, further in view of Sun discloses the method according to claim 11, further comprising:
Jia in view of Rosenberg and Sun does not specifically disclose computing the loss in terms of a word error rate (WER) based on application of the trained speech recognition model or the finetuned speech recognition model on a validation set of the dataset; and comparing the determined WER with a threshold WER loss, wherein the speech recognition model is trained or finetuned based on the comparison.
However, Gupta, in the same field of endeavor, discloses:
computing the loss in terms of a word error rate (WER) based on application of the trained speech recognition model or the finetuned speech recognition model on a validation set of the dataset (Gupta, col 21, ln 4-15, Processing may continue at action 620, at which at least one cluster of account identifiers with performance data that is below a threshold performance metric may be determined. For example, after forming the clusters, the average performance metric score (or some other aggregated performance metric score or scores) associated with each cluster may be determined. A threshold performance metric may be determined. This threshold performance metric may be determined empirically and/or statistically as an outlier); and
comparing the determined WER with a threshold WER loss, wherein the speech recognition model is trained or finetuned based on the comparison (Gupta, col 21, ln 14-25, This threshold performance metric may indicate poor performance by the relevant model of the natural language processing system (e.g., for ASR component 250 a WER that is above a particular threshold may indicate poor performance). Clusters with performance data that is below a threshold may be identified and data points (e.g., feature representations of the natural language inputs) of that cluster may be included in a training data set that may be used to retrain the relevant machine learning models of the natural language processing system).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Gupta with the method of Jia in view of Rosenberg and Sun because this would enable improving the relevant machine learning models, and such performance may be improved especially for previously underserved cohorts of individuals (Gupta, col 21, ln 25-28).
Regarding Claim 18. Jia in view of Rosenberg, further in view of Sun discloses the method according to claim 11.
Jia in view of Rosenberg and Sun does not specifically disclose wherein the set of operations is further executed for the number of the iterations until a Word Error Rate (WER) loss associated with the speech recognition model is below a threshold WER loss.
However, Gupta, in the same field of endeavor, discloses wherein the set of operations is further executed for the number of the iterations until a Word Error Rate (WER) loss associated with the speech recognition model is below a threshold WER loss (Gupta, col 21, ln 13-21, A threshold performance metric may be determined. This threshold performance metric may be determined empirically and/or statistically as an outlier. This threshold performance metric may indicate poor performance by the relevant model of the natural language processing system (e.g., for ASR component 250 a WER that is above a particular threshold may indicate poor performance). Clusters with performance data that is below a threshold may be identified and data points).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Gupta with the method of Jia in view of Rosenberg and Sun because this would enable improving the relevant machine learning models, and such performance may be improved especially for previously underserved cohorts of individuals (Gupta, col 21, ln 25-28).
Claims 7, 8, 16 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Jia et al. Pat App No. US 20220068256 A1 (Jia) in view of Rosenberg et al. Pat App No. US 20230058447 A1 (Rosenberg), further in view of Sun, and further in view of Rosenberg et al. Pat App No. US 20250095639 A1 (Rosenberg II) (Domestic Priority: 03/21/2022).
Regarding Claim 7. Jia in view of Rosenberg, further in view of Sun discloses the electronic device according to claim 1.
Jia in view of Rosenberg and Sun does not specifically disclose wherein the circuitry is further configured to freeze training parameters of the trained voice conversion model or the finetuned voice conversion model while the speech recognition model or the TTS conversion model is finetuned over the iterations.
However, Rosenberg II, in the same field of endeavor, discloses wherein the circuitry is further configured to freeze training parameters of the trained voice conversion model or the finetuned voice conversion model while the speech recognition model or the TTS conversion model is finetuned over the iterations (Rosenberg II, para 0054, for use in optimizing the VC model 400 to minimize both L1 and L2-norm squared distance between the source and target speech features 402, 480. Notably, implementing the trained ASR encoder as the content encoder 410 results in freezing the parameters of the content encoder 410 while training the VQ-VAE layer 420, decoder 450, and speaker classifier 460. That is, by freezing the parameters of the content encoder 410 that was trained on ASR loss may encourage the other components to learn different, better representations for the voice conversion task. Notably, the trained ASR encoder is more robust on noisy conditions).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Rosenberg II with the method of Jia in view of Rosenberg and Sun because this would enable improving the accuracy of the ASR model by training ASR models on larger training datasets into which synthesized speech and/or data-augmented speech can be incorporated to increase the volume of training data, and would also enable determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term (Rosenberg II, Abstract and para 0003).
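The parameter-freezing mechanism quoted above from Rosenberg II can be illustrated with a generic PyTorch sketch; this is not Rosenberg II's implementation, and vc_model and asr_model are hypothetical torch.nn.Module instances.

```python
# Illustrative PyTorch sketch only; shows the generic parameter-freezing
# mechanism the paragraph refers to, not Rosenberg II's actual implementation.
# Assumes vc_model and asr_model are torch.nn.Module instances.
import torch

def freeze(module: torch.nn.Module) -> None:
    """Exclude a module's parameters from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False

# Usage (hypothetical): freeze the trained/finetuned voice conversion model,
# then build an optimizer over only the still-trainable ASR (or TTS) weights.
#   freeze(vc_model)
#   optimizer = torch.optim.Adam(
#       (p for p in asr_model.parameters() if p.requires_grad), lr=1e-4)
```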
Regarding Claim 8. Jia in view of Rosenberg, further in view of Sun discloses the electronic device according to claim 1.
Jia in view of Rosenberg and Sun does not specifically disclose wherein the circuitry is further configured to freeze training parameters of the trained speech recognition model or the finetuned speech recognition model while the voice conversion model is finetuned over the iterations.
However, Rosenberg II, in the same field of endeavor, discloses wherein the circuitry is further configured to freeze training parameters of the trained speech recognition model or the finetuned speech recognition model while the voice conversion model is finetuned over the iterations (Rosenberg II, para 0008, operations that include receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding voice conversion synthetic speech representation by using a voice conversion model to convert the non-synthetic speech representation into the corresponding voice conversion synthetic representation of the corresponding utterance. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair in the set of training utterance pairs, the operations also include: generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the corresponding non-synthetic speech representation of the corresponding utterance; generating, for output by the speech recognition model, a second probability distribution over possible synthetic speech recognition hypotheses for the corresponding synthetic speech representation of the corresponding utterance; and determining a consistent loss term for the corresponding training utterance pair based on the first probability distribution over possible non-synthetic speech recognition hypotheses and the second probability distribution over possible synthetic speech recognition hypotheses. The operations also include updating parameters of the speech recognition model based on the consistent loss term determined at each of the plurality of output steps for each training utterance pair in the set of training utterance pairs; [i.e., "updating parameters of the speech recognition model based on the consistent loss term" includes "freezing/keeping the same the training parameters of the trained speech recognition model or the finetuned speech recognition model"]).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Rosenberg II with the method of Jia in view of Rosenberg and Sun because this would enable improving the accuracy of the ASR model by training ASR models on larger training datasets into which synthesized speech and/or data-augmented speech can be incorporated to increase the volume of training data, and would also enable determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term (Rosenberg II, Abstract and para 0003).
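For illustration, a generic consistency-loss term between the two per-step output distributions described in the quoted para 0008 might look as follows; the symmetric-KL formulation and the tensor shapes are assumptions rather than Rosenberg II's actual loss.

```python
# Illustrative PyTorch sketch only; a generic consistency-loss term between
# the ASR output distributions for a non-synthetic/synthetic utterance pair,
# loosely following the structure Rosenberg II para 0008 describes. The
# symmetric-KL choice and all tensor shapes are assumptions.
import torch
import torch.nn.functional as F

def consistency_loss(logits_nonsynthetic: torch.Tensor,
                     logits_synthetic: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between two per-step hypothesis distributions.

    Both inputs: raw logits of shape (output_steps, vocab_size).
    """
    log_p = F.log_softmax(logits_nonsynthetic, dim=-1)
    log_q = F.log_softmax(logits_synthetic, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")  # KL(p||q)
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")  # KL(q||p)
    return 0.5 * (kl_pq + kl_qp)
```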
Regarding Claim 16. Jia in view of Rosenberg, further in view of Sun discloses the method according to claim 11.
Jia in view of Rosenberg and Sun does not specifically disclose freezing training parameters of the trained voice conversion model or the finetuned voice conversion model while the speech recognition model or the TTS conversion model is finetuned over the iterations.
However, Rosenberg II, in the same field of endeavor, discloses freezing training parameters of the trained voice conversion model or the finetuned voice conversion model while the speech recognition model or the TTS conversion model is finetuned over the iterations (Rosenberg II, para 0054, for use in optimizing the VC model 400 to minimize both L1 and L2-norm squared distance between the source and target speech features 402, 480. Notably, implementing the trained ASR encoder as the content encoder 410 results in freezing the parameters of the content encoder 410 while training the VQ-VAE layer 420, decoder 450, and speaker classifier 460. That is, by freezing the parameters of the content encoder 410 that was trained on ASR loss may encourage the other components to learn different, better representations for the voice conversion task. Notably, the trained ASR encoder is more robust on noisy conditions).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Rosenberg II with the method of Jia in view of Rosenberg and Sun because this would enable improving the accuracy of the ASR model by training ASR models on larger training datasets into which synthesized speech and/or data-augmented speech can be incorporated to increase the volume of training data, and would also enable determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term (Rosenberg II, Abstract and para 0003).
Regarding Claim 17. Jia in view of Rosenberg, further in view of Sun discloses the method according to claim 11.
Jia in view of Rosenberg and Sun does not specifically disclose freezing training parameters of the trained speech recognition model or the finetuned speech recognition model while the voice conversion model is finetuned over the iterations.
However, Rosenberg II, in the same field of endeavor, discloses freezing training parameters of the trained speech recognition model or the finetuned speech recognition model while the voice conversion model is finetuned over the iterations (Rosenberg II, para 0008, operations that include receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding voice conversion synthetic speech representation by using a voice conversion model to convert the non-synthetic speech representation into the corresponding voice conversion synthetic representation of the corresponding utterance. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair in the set of training utterance pairs, the operations also include: generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the corresponding non-synthetic speech representation of the corresponding utterance; generating, for output by the speech recognition model, a second probability distribution over possible synthetic speech recognition hypotheses for the corresponding synthetic speech representation of the corresponding utterance; and determining a consistent loss term for the corresponding training utterance pair based on the first probability distribution over possible non-synthetic speech recognition hypotheses and the second probability distribution over possible synthetic speech recognition hypotheses. The operations also include updating parameters of the speech recognition model based on the consistent loss term determined at each of the plurality of output steps for each training utterance pair in the set of training utterance pairs; [i.e., "updating parameters of the speech recognition model based on the consistent loss term" includes "freezing/keeping the same the training parameters of the trained speech recognition model or the finetuned speech recognition model"]).
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of Rosenberg II with the method of Jia in view of Rosenberg and Sun because this would enable improving the accuracy of the ASR model by training ASR models on larger training datasets into which synthesized speech and/or data-augmented speech can be incorporated to increase the volume of training data, and would also enable determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term (Rosenberg II, Abstract and para 0003).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MULUGETA T. DUGDA whose telephone number is (703)756-1106. The examiner can normally be reached Mon - Fri, 4:30am - 7:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Paras D. Shah, can be reached at 571-270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MULUGETA TUJI DUGDA/Examiner, Art Unit 2653
/DOUGLAS GODBOLD/Primary Examiner, Art Unit 2655