DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In the amendment filed on 2/13/2026, Applicant amended claims 1, 5, 7-9, 16, and 17 and cancelled claims 6 and 14. Applicant also amended all independent claims to incorporate new limitations, including the limitations of cancelled claims 6 and 14. The pending claims are 1-5, 7-13, and 15-20.
Response to Arguments
Applicant’s arguments filed on 2/13/2026 have been fully considered but they are not persuasive.
Regarding the 35 U.S.C. 101 rejection, the examiner withdraws the rejection of claims 1-5 and 7-8.
The examiner maintains the 35 U.S.C. 101 rejection of claims 9-13 and 15-20. Applicant's request (Remark page 11) has been considered, but the claims are not subject matter eligible and no substantive arguments were provided.
The 35 U.S.C. 101 rejection has been updated below.
With regard to 35 U.S.C. 102, the amendments necessitate a new ground of rejection, and Applicant's arguments are moot with respect to the new ground of rejection. Regarding claim 9, Applicant argued that Adam fails to disclose all of the claimed features (Remark page 13, 2nd para). The examiner now relies on Adam in view of Gabrys et al. US 20230260502 A1 and further in view of Ljolje et al. US 12340793 B1; Applicant's arguments are therefore moot.
With regard to 35 U.S.C. 103, Applicant amended all independent claims to incorporate new limitations, and the limitations of claims 6 and 14 are incorporated into the independent claims.
Applicant Asserts: “… "identifying an expected speech characteristic for a respective synthetic speech recording," "generating, based on language science resources, an expected speech recording for the respective synthetic speech recording," "comparing the respective synthetic speech recording to the expected speech recording," "updating metadata of the respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match the expected speech recording," and "applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set."” (Remark page 13, second paragraph)
Examiner Note: Applicant incorporated the above limitations from claim 6 into the independent claims together with the new limitation "applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set." The examiner relies on Gabrys et al. US 20230260502 A1 to reject the limitations incorporated from claim 6, and additionally relies on Ljolje et al. US 12340793 B1 to reject "applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set." The rejection is therefore based on Adam in view of Gabrys and further in view of Ljolje. Please see the claim rejections below.
Applicant further Asserts: “Kobayashi does not remedy the shortcomings of Adam with respect to claim 1 and is not relied upon for this purpose in the Office Action. Kobayashi is directed to recording medium for receiving information on the emotion to synthesize the speech. (Kobayashi, Abstract.)” (Response page 16, 3rd paragraph)
Examiner Note: The examiner respectfully disagrees with this assertion. Kobayashi teaches (“[0047] … the emotion is expressed by changing such parameters as time duration, pitch or sound volume (sound intensity) depending on the emotion. …”)
Applicant further Asserts: “That is, as best understood by the Applicant, Ljolje groups voice samples into classes based on spectral representation and then augments each group to fall into another spectral representation class. Thus, Ljolje does not cure the deficiencies of Adam and Kobayashi.” (Remark page 17, first paragraph)
Examiner Note: Transforming audio at the waveform/spectrogram level is augmentation. Ljolje teaches augmentation techniques that transform a voice sample into various versions, such as male, female, and child-like versions. (“(29) The system applies 360 the transformations to the voice sample to generate augmented voice samples. For example, the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.). In embodiments, the system may be configured to apply the transformations to some of the voice samples, such as by selecting voice samples at random, by selecting a transformation to apply at random, or some combination thereof.”) (“(17) … data transformation module 198C further changes the tempo of the voice samples as part of the data transformation. …” col. 6, Lines 49-50) (“(28) … The transformations may further comprise changes in tempo and random spectral changes to a voice sample.” Col. 9, Lines 31-32)
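For illustration only, the following minimal sketch (an examiner illustration, not code from Ljolje) shows the kind of spectrogram-level transformation described above: the frequency axis of a magnitude spectrogram is compressed or expanded by a warp ratio to produce speaker-class variants of a sample. The function name, the placeholder spectrogram, and the ratio values are assumptions chosen for illustration.

    import numpy as np

    def warp_frequency_axis(spectrogram: np.ndarray, ratio: float) -> np.ndarray:
        # Expand or compress the frequency axis of a (num_bins, num_frames)
        # magnitude spectrogram by the given ratio; values are linearly
        # interpolated from the original bins.
        num_bins, num_frames = spectrogram.shape
        source_bins = np.arange(num_bins) / ratio
        warped = np.empty_like(spectrogram)
        for t in range(num_frames):
            warped[:, t] = np.interp(source_bins, np.arange(num_bins), spectrogram[:, t])
        return warped

    # Stand-in spectrogram; ratios are illustrative, not values from the reference.
    sample = np.abs(np.random.randn(128, 200))
    variant_a = warp_frequency_axis(sample, ratio=0.8)   # spectrally compressed variant
    variant_b = warp_frequency_axis(sample, ratio=1.2)   # spectrally expanded variant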
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 9-13 and 15-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim(s) 9 and 17, the limitations of “receiving”, “identifying”, “receiving”, “modifying”, “determining”, “identifying”, and “generating”, as drafted, are processes that, under the broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. Receiving audio/speech recordings is merely collecting data to train an ASR model to recognize speech. More specifically, a human can mentally perform a discriminative task such as recognizing speech in speech recognition, speaker recognition, or speaker verification. Similarly, a human can collect speech characteristics such as prosody, tone, and duration. A human can modify or change metadata such as speech characteristics, for example based on a GMM or HMM. After modifying the metadata, a human can determine that the data form an augmented data set. A human can change or edit speech characteristics and can compare and match expected data with reference data in speech recordings. A human can also recognize a subset of the augmented data, where the subset may contain male, female, or other differing voice-characteristic data. Finally, a human can generate a balanced data set that contains all the different subsets, such as male and female voice-characteristic classes. The claim also mentions training, but the claim does not specify how the model is trained or the steps of training.
The claim recites that the augmented data set will be used for training; however, the claim does not recite the training steps or how the model will be trained. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.
This judicial exception is not integrated into a practical application because the recitation of a “processing device” and “a non-transitory computer-readable storage medium comprising instructions” in claims 9 and 17-20 reads as generalized computer/processing components, based upon the claim interpretation wherein the structure is interpreted using [0113] and [0115] of the specification. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim(s) is/are directed to an abstract idea.
The claim(s) do(es) not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generalized computer components to perform the “receiving”, “identifying”, “receiving”, “modifying”, “determining”, “identifying”, and “generating” steps amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim(s) is/are not patent eligible.
With respect to claim(s) 12, the claim(s) recite(s) “wherein the speech-based discriminative task is one of: keyword spotting, wake word detection, phoneme spotting, emotion detection, transcription, natural language processing (NLP), or automatic speech recognition (ASR).” A human can mentally verify or detect a keyword or wake word. The claim(s) is/are directed to an abstract idea.
With respect to claim(s) 11, the claim(s) recite(s) “wherein the set of speech characteristics includes at least one of: prosody, duration, emotion, pitch, pace, emphasis, accents, or language.” A human knows prosody, duration, emotion, pitch, pace, emphasis, accents, and language as speech characteristics. The claim(s) is/are directed to an abstract idea. No additional elements are present in the claim.
With respect to claim(s) 13, the claim(s) recite(s) “wherein the language science resources include at least one of a phonemes library, an acoustic model, or a linguistic library.” The claim merely defines/narrows the language science resources as including different libraries. No additional elements are present in the claim.
With respect to claim(s) 10 and 18, the claim(s) recite(s) “wherein receiving, based on the set of speech characteristics and the plurality of natural speech recordings, the synthetic speech data set comprises: identifying, based on the set of speech characteristics, a subset of the plurality of natural speech recordings; configuring a speech generation engine in view of the set of speech characteristics; and inputting, into the speech generation engine, the subset of the plurality of natural speech recordings to generate the synthetic speech data set.” A human can perform data collection, and a human can identify speech characteristics that help to create synthetic speech; these are mental and analytical processes. The claim recites a speech generation engine as an additional element; such an element is well known in this technical area of art. Perucci et al. teaches in US 20210366460 A1 (“[0047] In one or more examples, each of the speech engine 104, the processing unit 106, and the neural network 108 may be embodied as a multi-core processor, a single core processor, or a combination of one or more multi-core processors and one or more single core processors. …”). The claim(s) is/are directed to an abstract idea.
With respect to claim(s) 19, the claim recites “wherein modifying, based on language science resources, metadata associated with each synthetic speech recording of the synthetic speech data set comprises: for each synthetic speech recording of the synthetic speech data set, identifying an expected speech characteristic for a respective synthetic speech recording; generating, based on language science resources, an expected speech recording for the respective synthetic speech recording; comparing the respective synthetic speech recording to the expected speech recording; and updating metadata of the respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match the expected speech recording.” A human can change or edit speech characteristics and can compare and match expected data with reference data in speech recordings. No additional elements are present in the claim.
With respect to claim(s) 7, the claim recites “determining, based on the modified synthetic speech data set, an augmented synthetic speech data set comprises: for each synthetic speech recording of the synthetic speech data set, identifying information from the metadata for modifying a respective synthetic speech recording; and modifying, based on the information, one or more characteristics of the respective synthetic speech recording to generate a corresponding augmented synthetic speech recording of the augmented synthetic speech data set.” A human can analyze speech data to determine whether the data have been modified, and by analyzing the speech data a human can recognize an augmented data set. No additional elements are present in the claim.
With respect to claim(s) 16, the claim recites “wherein generating, based on the subset of the augmented synthetic speech data set and the subset of the plurality of natural speech recordings, the balanced data set to train the discriminative model comprises: determining, based on the speech-based discriminative task, a distribution configuration; generating, based on the distribution configuration, a first subset of the balanced data set, wherein the first subset of the balanced data set comprises one or more augmented synthetic speech recordings of the subset of the augmented synthetic speech data set; generating, based on the distribution configuration, a second subset of the balanced data set, wherein the second subset of the balanced data set comprises one or more natural speech recordings of the subset of the plurality of natural speech recordings; and combining the first subset of the balanced data set and the second subset of the balanced data set to generate the balanced data set.” A human can create subsets of data based on how the data are distributed among the different subsets, and these subsets can be drawn from the augmented data set. No additional elements are present in the claim.
With respect to claim(s) 15 and 20, the claim(s) recite(s) “wherein modifying, based on language science resources, the synthetic speech data set comprises: identifying, for each synthetic speech recording of the synthetic speech data set, phonemes associated with a respective synthetic speech recording; determining phonemes associated with text of the respective synthetic speech recording; aligning the phonemes associated with the respective synthetic speech recording with the phonemes associated with the text of the respective synthetic speech recording; and responsive to failing to align the phonemes associated with the respective synthetic speech with the phonemes associated with the text of the respective synthetic speech recording, removing the respective synthetic speech recording from the synthetic speech data set.” A human can identify whether a phoneme corresponds to a synthetic recording, can align the phonemes associated with the speech with the phonemes associated with the text, and, if any phoneme is misaligned, can discard the synthetic recording from the synthetic speech data set.
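For illustration only, the following minimal sketch (an examiner illustration, assuming hypothetical helper functions, and not Applicant's or any reference's implementation) shows the kind of phoneme-alignment filtering recited in claims 15 and 20, with exact sequence equality standing in for a full forced-alignment check.

    from typing import Callable, Dict, List

    def filter_misaligned(
        recordings: List[Dict],                              # each item: {"audio": ..., "text": ...}
        phonemes_from_audio: Callable[[object], List[str]],  # hypothetical phoneme recognizer
        phonemes_from_text: Callable[[str], List[str]],      # hypothetical grapheme-to-phoneme converter
    ) -> List[Dict]:
        kept = []
        for rec in recordings:
            audio_phonemes = phonemes_from_audio(rec["audio"])
            text_phonemes = phonemes_from_text(rec["text"])
            # Keep the recording only if the phonemes recognized from the audio
            # align with the phonemes expected from its text; otherwise remove it.
            if audio_phonemes == text_phonemes:
                kept.append(rec)
        return kept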
These dependent claims do not integrate the judicial exception into a practical application and do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over
Adam et al. US 20230326445 A1 in view of Gabrys et al. US 20230260502 A1 and further in view of Ljolje et al. US 12340793 B1.
Regarding Claim 9, Adam teaches:
9. A system comprising: a processing device to perform operations comprising: Adam teaches (“[0130] At operation 802, the animated speech refinement system 230 processes the audio stream by an automated speech recognition (ASR) engine to identify base timing of one or more phonemes corresponding to the one or more spoken words, as discussed above.”) by Adam et al. US 20230326445 A1
generating, based on a plurality of natural speech recordings, a synthetic speech data set; (“[0119] Referring back to FIG. 5, during the training phase, the training data generation module 510 can generate synthetic audio streams and corresponding ground truth phoneme timing information for a large corpus of text data and voice data. FIG. 6 shows an example implementation of the training data generation module 510. The training data generation module 510 shown in FIG. 6 can include a text input module 610, a TTS module 620 and a phoneme module 630. In some examples, the training data generation module 510 can operate concurrently with the animated speech refinements system 230 to generate samples of training data to train the machine learning model module 540 on the fly.”) by Adam et al. US 20230326445 A1
generating, based on the synthetic speech data set and the plurality of natural speech recordings, a balanced data set for training a discriminative model to perform a speech-based discriminative task. Adam teaches (“[0098] The speech module 520 is configured to receive an audio stream that includes one or more words. The audio stream can be received by recording a user speaking the one or more words and generating an audio file. In some examples, the audio stream is received through a messaging system or chat system from another user. In some examples, the audio stream is downloaded from the Internet and received from one or more websites. In some examples, the audio stream is selected from a set of pre-recorded audio streams. In such cases, a user interface is presented to a user in which a plurality of audio stream listings are presented and identified by respective icons or options. In response to receiving a user selection of an icon or option, the corresponding audio stream of the plurality of audio streams is retrieved by the speech module 520. The speech module 520 provides the audio stream including the one or more words to the ASR module 530.”) (“[0099] In some examples, during training, the speech module 520 accesses a plurality of training data from the training data generation module 510. The training data can include exclusively synthesized speech and corresponding ground truth phoneme timing locations. In some examples, the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. During training, the training data is provided to the ASR module 530 and to the machine learning model module 540 to train the machine learning model to establish a relationship between a plurality of training base timings of a plurality of training phonemes and corresponding ground truth timing of the plurality of training phonemes generated by the speech module 520. In some examples, the speech module 520 randomly or pseudo-randomly selects a given training set or training audio stream generated by the training data generation module 510.”) by Adam et al. US 20230326445 A1
Adam does not explicitly teach modifying, based on language science resources, metadata associated with each synthetic speech recording of the synthetic speech data set, comprising: for each synthetic speech recording of the synthetic speech data set, identifying an expected speech characteristic for a respective synthetic speech recording; generating, based on language science resources, an expected speech recording for the respective synthetic speech recording; comparing the respective synthetic speech recording to the expected speech recording; and updating metadata of the respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match the expected speech recording.
Gabrys teaches:
for each synthetic speech recording of the synthetic speech data set, identifying an expected speech characteristic for a respective synthetic speech recording; Gabrys teaches (“[0023] Second, the trained single-speaker TTS component may be used to generate a synthetic parallel dataset for a multi-speaker corpus. The multi-speaker corpus may include samples recorded speech from multiple speakers along with transcripts corresponding to the recorded speech. The multi-speaker corpus may include, for example, example speech of multiple or many speakers, with samples covering many or all phonemes for a particular language. The corpus may include transcript(s) of what the speakers are saying. The single-speaker TTS component may process the transcript to generate synthesized speech that may be used as an input for training the voice-modifying model. The recorded speech may serve as target speech for the training (e.g., the target speech may be used to evaluate the output of the voice-modifying model during training). The synthesized speech and target speech (as well as voice characteristics determined from the recorded speech) form the synthetic parallel dataset used to pre-train the voice-modifying model.”) by Gabrys et al. US 20230260502 A1
generating, based on language science resources, an expected speech recording for the respective synthetic speech recording; Gabrys teaches (“[0022] First, a single-speaker TTS component may be trained with a large single-speaker corpus. The single-speaker corpus may include recorded speech and a corresponding transcript of the words spoken. The single-speaker TTS component may be trained to process the transcript to generate synthesized speech that approximates the recorded speech.”) (“[0023] Second, the trained single-speaker TTS component may be used to generate a synthetic parallel dataset for a multi-speaker corpus. The multi-speaker corpus may include samples recorded speech from multiple speakers along with transcripts corresponding to the recorded speech. The multi-speaker corpus may include, for example, example speech of multiple or many speakers, with samples covering many or all phonemes for a particular language. The corpus may include transcript(s) of what the speakers are saying. The single-speaker TTS component may process the transcript to generate synthesized speech that may be used as an input for training the voice-modifying model. The recorded speech may serve as target speech for the training (e.g., the target speech may be used to evaluate the output of the voice-modifying model during training). The synthesized speech and target speech (as well as voice characteristics determined from the recorded speech) form the synthetic parallel dataset used to pre-train the voice-modifying model.”) (“[0025] … The voice-modifying model may process synthesized speech and target voice characteristics to generate voice-modified speech. …”)
(“[0035] … The parallel dataset 130/140 may include the synthesized spectrogram data 182 (e.g., as generated by the TTS component 180 based on a transcript corresponding to the target speech), a target spectrogram 164 (representing a recording of the target speech), speaker embedding data (e.g., representing identifiable characteristics of the target speech), and/or frequency data 168 (e.g., representing pitch information corresponding to the target speech). …”) by Gabrys et al. US 20230260502 A1
comparing the respective synthetic speech recording to the expected speech recording; Gabrys teaches (“[0025] … The voice-modifying model may process synthesized speech and target voice characteristics to generate voice-modified speech. The voice-modified speech may be compared to the corresponding examples of the target voice. …”) by Gabrys et al. US 20230260502 A1
updating metadata of the respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match the expected speech recording;
Gabrys teaches (“[0044] The method 400 may include pre-training the voice modifier component 190 using the synthetic parallel dataset (stage 430). The voice modifier component 190 may receive the synthesized spectrogram data and the voice characteristic data generated during the stage 420, and train a trained model to output voice-modified spectrogram data approximating the corresponding target spectrogram data. For each speaker represented in the synthetic parallel dataset (e.g., having a respective target spectrogram, synthesized spectrogram, and voice characteristic data), a loss may be calculated between the voice-modified spectrogram data and the target spectrogram data, and parameters of the voice modifier component 190 models may be adjusted to reduce the calculated loss. The resulting pre-trained model may be fine-tuned to modify synthesized speech to generate voice-modified spectrogram data having voice characteristics similar to those of the target voice.”) (“[0050] The method 500 may include adapting the fundamental frequency (“f.sub.0”) of the synthesized speech to that of the target voice (stage 530). The synthesized speech may have a fundamental frequency (e.g., pitch and/or timbre), which may be constant or have some contour (e.g., an upward and/or downward contour). The speech feature extractor component 160 may determine a mean and/or variance of the fundamental frequency of the synthesized speech. The speech feature extractor component 160 may compare the fundamental frequency mean and/or variance of the synthesized speech to that of the target voice. The synthesized speech may thus be modified such that fundamental frequency mean and/or variance matches or approximates the target voice.”) (“[0051] The method 500 may include using the voice modifier component 190 to modify the synthesized spectrogram data according to the target speaker embedding (stage 540). The voice modifier component 190 may receive the synthesized speech (e.g., the predicted spectrogram) and the voice characteristic data of the target voice (e.g., the speaker embedding data). The voice modifier component 190 may process the input to generate voice-modified synthesized speech having voice characteristics similar to the target voice.”) by Gabrys et al. US 20230260502 A1
Gabrys is considered to be analogous to the claimed invention because it relates to a text-to-speech (TTS) system that may be configured to imitate characteristics of a target voice based on a limited dataset.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam to incorporate the teachings of Gabrys in order to include parametric synthesis as augmentation techniques.
One could have been motivated to do so because speech synthesis engine 718 may be configured to convert the input text into high-quality natural-sounding speech in an efficient manner. (“[0085] … The speech synthesis engine 718 may be configured to convert the input text into high-quality natural-sounding speech in an efficient manner. …”) by Gabrys et al. US 20230260502 A1
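For illustration only, the following minimal sketch (an examiner illustration, not Gabrys's code) shows the kind of fundamental-frequency adaptation quoted above from Gabrys [0050]: the mean and variance of the synthesized f0 contour are matched to those of the target voice, and the resulting offset/scale is the type of information that could be recorded as metadata used to augment a recording. The function name and sample contours are assumptions.

    import numpy as np

    def adapt_f0(synth_f0: np.ndarray, target_f0: np.ndarray) -> np.ndarray:
        # Shift and scale the synthesized f0 contour so that its mean and
        # variance match or approximate those of the target voice.
        synth_mean, synth_std = synth_f0.mean(), synth_f0.std()
        target_mean, target_std = target_f0.mean(), target_f0.std()
        normalized = (synth_f0 - synth_mean) / max(synth_std, 1e-8)
        return normalized * target_std + target_mean

    # Illustrative contours (Hz); real contours would come from a pitch tracker.
    synth = np.array([200.0, 210.0, 205.0, 195.0])
    target = np.array([120.0, 130.0, 125.0, 118.0])
    adapted = adapt_f0(synth, target)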
The combination does not explicitly teach applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set;
Ljolje teaches:
applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set; and
Ljolje teaches the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.) (“(29) The system applies 360 the transformations to the voice sample to generate augmented voice samples. For example, the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.). In embodiments, the system may be configured to apply the transformations to some of the voice samples, such as by selecting voice samples at random, by selecting a transformation to apply at random, or some combination thereof.”) (“(30) The system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples (e.g., average male version, average female version, average child version, higher-than average male version, lower-than average male version, higher-than average female version, lower-than average female version, etc.). The training dataset can then be then used to train a speech recognition model.”) (“(36) FIG. 5 is a conceptual illustration of applying transformations to voice samples grouped into classes of spectral representations. In illustration 500, spectral representations are visualized as spectrograms. A set of spectral representations 510 are split into a female speaker class 520 of spectral representations and a male speaker class 540 of spectral representations. The female speaker class 520 is expanded by 20%, shifting down the fundamental frequencies captured at each time interval to generate male voice augmentations 530 of the female speaker class 520. The male speaker class 540 is compressed by 20%, shifting up the fundamental frequencies captured at each time interval to generate female voice augmentations 550 of the male speaker class 540. As such, the shape of the corresponding spectrograms for each speaker class (and thus the real-world voice characteristics of their spectral representations) is maintained during the transformations.”) by Ljolje et al. US 12340793 B1
Ljolje is considered to be analogous to the claimed invention because it relates generally to the fields of acoustic modeling and speech recognition, and more specifically, to augmenting human speech data for training a recognition model.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam and Gabrys to incorporate the teachings of Ljolje in order to include a subset of the augmented synthetic speech data set in view of language science resources.
One could have been motivated to do so because accuracy of a recognition model is improved. (“(37) … Thus, the accuracy of a recognition model is improved when trained on a dataset enlarged with the augmented voice samples that are generated as described. …”) col. 10, lines 64-67 by Ljolje et al. US 12340793 B1
Regarding Claim 16, the combination teaches the system of claim 9 as identified above.
Adam further teaches:
16. The system of claim 9, wherein generating, based on the modified synthetic speech data set, the balanced data set from the discriminative model comprises. Adam teaches the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. Adam teaches (“[0098] The speech module 520 is configured to receive an audio stream that includes one or more words. The audio stream can be received by recording a user speaking the one or more words and generating an audio file. In some examples, the audio stream is received through a messaging system or chat system from another user. In some examples, the audio stream is downloaded from the Internet and received from one or more websites. In some examples, the audio stream is selected from a set of pre-recorded audio streams. In such cases, a user interface is presented to a user in which a plurality of audio stream listings are presented and identified by respective icons or options. In response to receiving a user selection of an icon or option, the corresponding audio stream of the plurality of audio streams is retrieved by the speech module 520. The speech module 520 provides the audio stream including the one or more words to the ASR module 530.”) (“[0099] In some examples, during training, the speech module 520 accesses a plurality of training data from the training data generation module 510. The training data can include exclusively synthesized speech and corresponding ground truth phoneme timing locations. In some examples, the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. During training, the training data is provided to the ASR module 530 and to the machine learning model module 540 to train the machine learning model to establish a relationship between a plurality of training base timings of a plurality of training phonemes and corresponding ground truth timing of the plurality of training phonemes generated by the speech module 520. In some examples, the speech module 520 randomly or pseudo-randomly selects a given training set or training audio stream generated by the training data generation module 510.”) (“[0119] Referring back to FIG. 5, during the training phase, the training data generation module 510 can generate synthetic audio streams and corresponding ground truth phoneme timing information for a large corpus of text data and voice data. FIG. 6 shows an example implementation of the training data generation module 510. The training data generation module 510 shown in FIG. 6 can include a text input module 610, a TTS module 620 and a phoneme module 630. In some examples, the training data generation module 510 can operate concurrently with the animated speech refinements system 230 to generate samples of training data to train the machine learning model module 540 on the fly.”) by Adam et al. US 20230326445 A1
The combination does not explicitly teach determining, based on the speech-based discriminative task, a distribution configuration.
determining, based on the speech-based discriminative task, a distribution configuration; Ljolje teaches (“(15) Spectral comparison module 198B determines spectral change ratios based on a comparison of warp distributions. Spectral comparison module 198B is configured to obtain, determine, or calculate warp distributions associated with each class of spectral representations used for grouping of voice samples 110 by spectral classification module 198A. As used herein a “warp value” may refer to a value indicating the spectral difference between a particular voice sample and a normalized voice sample. For example, a warp value may be the spectral difference or “warp” between a voice of a particular person having a particular vocal tract length and the voice of an average or median vocal tract length across a set of samples. As one example, a set of male, female, and child speakers may speak a transcript of words and phrases, and the normalized voice sample for the set of samples may be a hypothetical voice sample of an androgenous, average-aged speaker speaking the transcript. In one embodiment, spectral comparison 198B may obtain warp distributions for each class of spectral representations by determining a warp value for each voice sample 110 and plotting the warp values of each class as a gaussian distribution. The “peak warp value” may refer to the most frequently occurring warp value for a given class of spectral representations (e.g., the highest point on a histogram of the warp values for the class or the center of a gaussian distribution of the warp values for the class). For example, a peak warp value for a class of male speaker types may be centered at 1.1, with warp values within one standard deviation of the peak warp value sitting between 1.06 and 1.14. In one embodiment, the spectral comparison module 198B may determine the peak warp values associated with each class of spectral representation by applying the voice samples of each class into a trained acoustic model. For example, an acoustic model may be trained to receive a group of voice samples and estimate a peak warp value for the group of samples. In one embodiment, the trained acoustic model may be a vocal tract length normalization (VTLN) acoustic model, such as described, referenced, and incorporated in: “Low Latency Real-Time Vocal Tract Length Normalization”, by Andrej Ljolje, Vincent Goffin and Murat Saraclar, Proceedings: Text, Speech and Dialogue, 7th International Conference, TSD 2004, Brno, Czech Republic, September 2004. In embodiments, the peak warp value associated with a particular class of spectral representations may be identified as the target for transforming other voice samples into the particular class. For example, the peak warp value for male voice samples can be the target for augmenting female and child voice samples. As such, the difference in peak warp values for each class of spectral representations may be used to determine spectral differences (i.e., spectral change ratios) between speaker types (e.g., between male, female, and child voices). In one example, when plotting or determining a warp distribution for male voices, the peak warp value may be 1.1, while plotting for female voice the peak warp value may be 0.9. 
This may indicate about a 20% spectral difference between male (1.1) and female (0.9) voices, and thus a spectral change ratio of 20% compression for the spectral representations of male voice samples and 20% expansion for the spectral representations of female voice samples.” Entire col. 6) by Ljolje et al. US 12340793 B1
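For illustration only, the following minimal sketch (an examiner illustration using the numeric example from the quoted passage; one plausible reading of the arithmetic, not Ljolje's code) shows how peak warp values for two speaker classes yield the roughly 20% spectral change ratio described above.

    # Peak warp values taken from the numeric example quoted above (col. 6).
    male_peak_warp = 1.1
    female_peak_warp = 0.9

    # One plausible reading: the ratio between the class peaks gives the factor
    # for warping one class toward the other (roughly a 20% spectral change).
    male_to_female_factor = female_peak_warp / male_peak_warp   # ~0.82, i.e. ~20% compression
    female_to_male_factor = male_peak_warp / female_peak_warp   # ~1.22, i.e. ~20% expansion
    print(male_to_female_factor, female_to_male_factor)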
The combination does not explicitly teach generating, based on the distribution configuration, a first subset of the balanced data set, wherein the first subset of the balanced data set comprises one or more synthetic speech recordings of the modified synthetic speech data set;
Ljolje teaches:
generating, based on the distribution configuration, a first subset of the balanced data set, wherein the first subset of the balanced data set comprises one or more synthetic speech recordings of the modified synthetic speech data set; Ljolje teaches (“(28) The system determines 350 determines transformations based on the spectral change ratios determined at step 340. For example, to generate additional variations of the voice samples, spectral change ratios may be applied to the voice samples in each class of spectral representations in a manner that shifts its warp distribution towards the center of the warp distribution associated with another class. As one example, variations may include shifting/compressing the male voice samples by −20% to generate corresponding female voice samples, shifting/expanding the female voice samples by +20% to generate corresponding male voice samples, shifting the child voice samples to generate adult male and adult female versions of the child voice samples, and so on with each of the various groups of voice samples. The transformations may further comprise changes in tempo and random spectral changes to a voice sample.”) (“(29) The system applies 360 the transformations to the voice sample to generate augmented voice samples. For example, the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.). In embodiments, the system may be configured to apply the transformations to some of the voice samples, such as by selecting voice samples at random, by selecting a transformation to apply at random, or some combination thereof.”) by Ljolje et al. US 12340793 B1
generating, based on the distribution configuration, a second subset of the balanced data set, wherein the second subset of the balanced data set comprises one or more natural speech recordings of the plurality of natural speech recordings; and Ljolje teaches the original voice samples (i.e. second subset )(“ (30) The system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples (e.g., average male version, average female version, average child version, higher-than average male version, lower-than average male version, higher-than average female version, lower-than average female version, etc.). The training dataset can then be then used to train a speech recognition model”) by Ljolje et al. US 12340793 B1.
combining the first subset of the balanced data set and the second subset of the balanced data set to generate the balanced data set. Ljolje teaches the original voice samples (i.e., the second subset) (“(30) The system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples (e.g., average male version, average female version, average child version, higher-than average male version, lower-than average male version, higher-than average female version, lower-than average female version, etc.). The training dataset can then be then used to train a speech recognition model”) by Ljolje et al. US 12340793 B1. Ljolje is considered to be analogous to the claimed invention because it relates generally to the fields of acoustic modeling and speech recognition, and more specifically, to augmenting human speech data for training a recognition model.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam and Gabrys to further incorporate the teachings of Ljolje in order to include a subset of the augmented synthetic speech data set in view of language science resources.
One could have been motivated to do so because accuracy of a recognition model is improved. (“(37) … Thus, the accuracy of a recognition model is improved when trained on a dataset enlarged with the augmented voice samples that are generated as described. …”) col. 10, lines 64-67 by Ljolje et al. US 12340793 B1
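For illustration only, the following minimal sketch (an examiner illustration, not code from any cited reference) shows how a training set could be compiled by combining a first subset drawn from augmented/synthetic recordings with a second subset drawn from the original natural recordings according to a simple distribution configuration; the fraction parameter and dataset structure are assumptions.

    import random
    from typing import Dict, List

    def build_balanced_dataset(
        augmented: List[Dict],
        natural: List[Dict],
        augmented_fraction: float = 0.5,   # hypothetical distribution configuration
        total: int = 1000,
        seed: int = 0,
    ) -> List[Dict]:
        rng = random.Random(seed)
        n_augmented = int(total * augmented_fraction)
        n_natural = total - n_augmented
        first_subset = rng.choices(augmented, k=n_augmented)   # augmented recordings
        second_subset = rng.choices(natural, k=n_natural)      # original natural recordings
        combined = first_subset + second_subset
        rng.shuffle(combined)
        return combined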
Claims 10, 11, 12, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over
Adam, Gabrys and Ljolje in view of Kobayashi et al. US 20040019484 A1 and in view of Aher et al. US 20210319780 A1.
Regarding Claim 10, the combination teaches the system of claim 9 as identified above.
The combination does not explicitly teach identifying, based on the speech-based discriminative task, a set of speech characteristics;
Kobayashi teaches:
10. The system of claim 9, wherein generating, based on the plurality of natural speech recordings, the synthetic speech data set comprises: identifying, based on the speech-based discriminative task, a set of speech characteristics; Kobayashi teaches (“[0020] If the speech is to be synthesized for a meaningful sentence, seasoned with emotion, there is a risk that, except if control is made so that the prosodic characteristics of the language in question, such as accent positions, duration or loudness, are maintained, the hearer is unable to understand the meaning of the synthesized speech correctly.”) (“[0021] It is therefore an object of the present invention to provide a method and apparatus for speech synthesis, program, recording medium, method and apparatus for generating constraint information, and a robot apparatus, in which the emotion can be added to the synthesized speech as the prosodic characteristics of the language in question are maintained.”) by Kobayashi et al. US 20040019484 A1
Kobayashi is considered to be analogous to the claimed invention because it relates to a method and apparatus for speech synthesis, program, recording medium for receiving information on the emotion to synthesize the speech.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Gabrys and Ljolje to incorporate the teachings of Kobayashi in order to include receiving and modifying synthetic speech data based on speech characteristics.
One could have been motivated to do so because the device outputs meaningful synthesized speech with emotion added. (“[0043] The addition of the emotion expression to the uttered speech, as a function in e.g., a robot apparatus, simulating the human being, and which has the functions of outputting the meaningful synthesized speech, operates extremely effectively in promoting the intimacy between the robot apparatus and the human being.”) by Kobayashi et al. US 20040019484 A1
The combination does not explicitly teach selecting, based on the set of speech characteristics, a subset of the plurality of natural speech recordings;
Aher teaches:
selecting, based on the set of speech characteristics, a subset of the plurality of natural speech recordings; Aher teaches (“[0081] In an embodiment, at step 608, the voice application determines predicted prosodic characteristics of the response using a model, and modifies the predicted prosodic characteristics to generate the prosodic characteristics of the synthesized speech response. The model may, for example, include the results of a training model (e.g., training model 370 of FIG. 3), which may include correlations, probabilities, confidences, and other values indicative of the model output. In an embodiment, the voice application may select from among a plurality of versions of a word, each having a particular prosodic character, for the version that most matches the desired or predicted prosodic character. For example, the voice application may access a database that stores a plurality of audio files, each corresponding to a word, phrase, or grouping thereof, and may select the audio file having associated prosodic metrics that are most similar to the predicted metrics.”) by Aher et al. US 20210319780 A1
configuring a speech generation engine in view of the set of speech characteristics; and Aher teaches (“[0034] Prosodic engine 223 is configured to determine one or more prosodic metrics associated with a word, group of words, or a voice input. Prosodic engine 223 may include, for example, temporal and spectral analyzers for extracting information about an audio file. In an embodiment, prosodic engine 223 is configured to determine pitch values, note values, rate values, timber values, volume values, emotional metric values (e.g., based on prosodic metrics), any other suitable data, or any combination thereof. Prosodic engine 223 may, for example, apply one or more operations provided by an algorithm to extract metrics of the voice input.”) by Aher et al. US 20210319780 A1
generating, using the speech generation engine, the synthetic speech data set based on the subset of the plurality of natural speech recordings. (“[0036] Speech generator 225 is configured to synthesize and output the synthesized speech response to the voice input. In an embodiment, speech generator 225 includes a text-to-speech engine configured to identify a text string to be synthesized as a synthesized speech response. For example, speech generator 225 may generate audio output at a speaker or other audio device based on the text string and audio settings. For example, speech generator 225 may use one or more settings including prosodic metrics corresponding to each word or a group of words to specify voice details (e.g., male/female voice, accent, rate, emphasis, or other details), playback speed, or any other suitable settings that may affect the generated audio output.”) by Aher et al. US 20210319780 A1
Aher is considered to be analogous to the claimed invention because it relates to systems for managing responses to voice inputs, and, more particularly, to systems for generating more natural speech responses to voice inputs based on prosody.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Gabrys, Ljolje and Kobayashi, to incorporate the teachings of Aher in order to include receiving and modifying synthetic speech data based on speech characteristics.
One could have been motivated to do so because natural language understanding models are applied to the voice input to determine a more correct and accurate answer for the voice input. (“[0052] … In an embodiment, natural language understanding models are applied to the voice input to determine a more correct and accurate answer for the voice input. In an embodiment, module 350 determines prosodic character of a response. For example, an audio acoustic model for the answer may be provide by a text-to-speech module. In an embodiment, question and answer audio signals are submitted to training model 370, wherein the model is used to predict the right set of audio features to be applied for each phrase and word. In an embodiment, to improve naturalness, the predicted features are post-processed using interpolation to manage the prosodic character and prosodic transitions thereof (e.g., transitions between words of the generated response). …”) by Aher et al. US 20210319780 A1
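For illustration only, the following minimal sketch (an examiner illustration using a hypothetical SpeechGenerationEngine interface, not code from Adam, Kobayashi, or Aher) shows the flow recited in claim 10: selecting a subset of natural recordings that match a set of speech characteristics, configuring a generation engine with those characteristics, and producing the synthetic set from the selected subset.

    from typing import Dict, List

    class SpeechGenerationEngine:
        # Hypothetical engine; a real one would be a TTS or voice-conversion system.
        def __init__(self, characteristics: Dict[str, str]):
            self.characteristics = characteristics

        def synthesize(self, recording: Dict) -> Dict:
            # A real engine would produce audio; this stub only tags the output
            # with the configured characteristics.
            return {"text": recording["text"], "style": dict(self.characteristics)}

    def generate_synthetic_set(natural: List[Dict], characteristics: Dict[str, str]) -> List[Dict]:
        # Select only those natural recordings whose tags match the requested characteristics.
        subset = [r for r in natural if characteristics.items() <= r.get("tags", {}).items()]
        engine = SpeechGenerationEngine(characteristics)
        return [engine.synthesize(r) for r in subset]

    synthetic = generate_synthetic_set(
        natural=[{"text": "hello", "tags": {"emotion": "happy", "accent": "US"}}],
        characteristics={"emotion": "happy"},
    )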
Regarding Claim 11, the combination teaches the system of claim 10 as identified above.
Kobayashi further teaches:
11. The system of claim 10, wherein the set of speech characteristics includes at least one of: prosody, duration, emotion, pitch, pace, emphasis, accents, or language.
Kobayashi teaches (“[0047] Thus, in the embodiments of the present invention, the correlation between the emotion and the acoustic characteristics are modeled and speech utterance is made on the basis of these acoustic characteristics to express the emotion in the speech. Moreover, in the present embodiments, the emotion is expressed by changing such parameters as time duration, pitch or sound volume (sound intensity) depending on the emotion. At this time, the constraint information, which will be explained subsequently, is added to the parameters changed, so that the prosodic characteristics of the language of the text to be synthesized will be maintained, that is so that no changes will be made in the uttered speech contents.”) (“[0051] FIG. 3 shows the relation between the duration of each phoneme and the pitch;”) (“[0069] At the next step S2, prosodic data, representing the duration, pitch and loudness of the phoneme in question, is prepared, by statistical techniques, such as quantification class 1, using the information such as accent types extracted from the string of pronunciation symbols, number of accent phrases in the sentence, positions of the accents in the sentence, number of phonemes in the accent phrases or the types of the phonemes.”
) by Kobayashi et al. US 20040019484 A1
Kobayashi is considered to be analogous to the claimed invention because it relates to a method and apparatus for speech synthesis, program, recording medium for receiving information on the emotion to synthesize the speech.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Gabrys and Ljolje to incorporate the teachings of Kobayashi in order to include receiving and modifying synthetic speech data based on speech characteristics.
One could have been motivated to do so because the device outputs meaningful synthesized speech with emotion added. (“[0043] The addition of the emotion expression to the uttered speech, as a function in e.g., a robot apparatus, simulating the human being, and which has the functions of outputting the meaningful synthesized speech, operates extremely effectively in promoting the intimacy between the robot apparatus and the human being.”) by Kobayashi et al. US 20040019484 A1
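For illustration only, the following minimal sketch (an examiner illustration; the scaling values are hypothetical and not taken from Kobayashi) shows the idea in the passages quoted above: an emotion is expressed by scaling duration, pitch, and volume parameters, with a constraint that limits how far the prosody may be changed.

    EMOTION_SCALES = {
        "calm":  {"duration": 1.00, "pitch": 1.00, "volume": 1.00},
        "anger": {"duration": 0.90, "pitch": 1.15, "volume": 1.20},
        "sad":   {"duration": 1.15, "pitch": 0.90, "volume": 0.85},
    }

    def apply_emotion(prosody: dict, emotion: str, max_change: float = 0.25) -> dict:
        # Scale each prosodic parameter for the emotion, clamping the factor so
        # the change stays within the allowed constraint.
        scales = EMOTION_SCALES[emotion]
        adjusted = {}
        for name, value in prosody.items():
            factor = scales.get(name, 1.0)
            factor = min(max(factor, 1.0 - max_change), 1.0 + max_change)
            adjusted[name] = value * factor
        return adjusted

    print(apply_emotion({"duration": 0.12, "pitch": 180.0, "volume": 0.7}, "anger"))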
Regarding Claim 12, the combination teaches the system of claim 10 as identified above.
12. The system of claim 10, wherein the speech-based discriminative task is one of: keyword spotting, wake word detection, phoneme spotting, emotion detection, transcription, natural language processing (NLP), or automatic speech recognition (ASR). Adam teaches ASR model. (“[0098] … The speech module 520 provides the audio stream including the one or more words to the ASR module 530.”) (“[0099] In some examples, during training, the speech module 520 accesses a plurality of training data from the training data generation module 510. The training data can include exclusively synthesized speech and corresponding ground truth phoneme timing locations. In some examples, the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. During training, the training data is provided to the ASR module 530 …”) by Adam et al. US 20230326445 A1
Kobayashi further teaches emotion model. (“[0065] At a first step S1 in FIG. 1, the emotion condition of the emotion model of the speaking entity is discriminated. Specifically, the state of the emotion model (emotion condition) is changed depending on the surrounding environments (extraneous factors) or internal states (internal factors). As to the emotion states, it is discriminated which of the calm, anger, sadness, happiness and comfort is the prevailing emotion.”) by Kobayashi et al. US 20040019484 A1
Kobayashi is considered to be analogous to the claimed invention because it relates to a method and apparatus for speech synthesis, program, recording medium for receiving information on the emotion to synthesize the speech.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Gabrys and Ljolje to incorporate the teachings of Kobayashi in order to include receiving and modifying synthetic speech data based on speech characteristics.
One could have been motivated to do so because the device outputs meaningful synthesized speech with emotion added. (“[0043] The addition of the emotion expression to the uttered speech, as a function in e.g., a robot apparatus, simulating the human being, and which has the functions of outputting the meaningful synthesized speech, operates extremely effectively in promoting the intimacy between the robot apparatus and the human being.”) by Kobayashi et al. US 20040019484 A1
Regarding Claim 13, the combination teaches the system of claim 9 as identified above.
The combination does not explicitly teach wherein the language science resources include at least one of: a phonemes library, an acoustic model, or a linguistic library.
Kobayashi teaches:
13. The system of claim 9, wherein the language science resources include at least one of: a phonemes library, an acoustic model, or a linguistic library.
Kobayashi teaches (“[0047] Thus, in the embodiments of the present invention, the correlation between the emotion and the acoustic characteristics are modeled and speech utterance is made on the basis of these acoustic characteristics to express the emotion in the speech. …”) (“[0059] FIG. 11 is a block diagram showing the structure of a behavioral model library of the application layer;”) by Kobayashi et al. US 20040019484 A1
Kobayashi is considered to be analogous to the claimed invention because it relates to a method and apparatus for speech synthesis, program, recording medium for receiving information on the emotion to synthesize the speech.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Gabrys and Ljolje to incorporate the teachings of Kobayashi in order to include receiving and modifying synthetic speech data based on speech characteristics.
One could have been motivated to do so because the device can have respective independent behavioral models. (“[0162] The behavioral model library 80 is provided with respective independent behavioral models …”) by Kobayashi et al. US 20040019484 A1
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Adam, Gabrys and Ljolje in view of RAGHAVENDRA E VEERA et al. AU 2019202146 A1.
Regarding Claim 15, the combination teaches the system of claim 9 as identified above.
Adam further teaches:
15. The system of claim 9, wherein modifying, based on language science resources, the synthetic speech data set comprises: Adam teaches (“[0018] In some examples, the machine learning model is trained based on training data that includes synthesized (artificial) speech. The synthesized speech can be generated by a text-to-speech (TTS) system that receives a text file and outputs synthesized speech audio speaking words of the text file and ground truth phoneme locations of the spoken words. This audio can be processed by the ASR to generate a base alignment (timing) for the phoneme locations. The base timing can be processed by the machine learning adjustment model to generate a correction or the offset of the base alignment of the phoneme timing locations, by learning from the ground truth phoneme locations provided by the TTS to update one or more parameters of the machine learning model. By using the TTS to generate the training data, a large and robust collection of training data that includes synthesized speech and ground truth phoneme locations of the spoken words of the synthesized speech can be generated easily and efficiently by simply generating audio of a large corpus of text. In this way, speech does not need to be manually processed to accurately specify the phoneme locations as the TTS automatically generates the accurate phoneme locations of the synthesized speech.”) by Adam et al. US 20230326445 A1
identifying, for each synthetic speech recording of the synthetic speech data set, Adam teaches (“[0121] The phoneme module 630 can receive the text (e.g., a sample of text or transcription of text) from the text input module 610 and can extract phonemes from the sample of text. The phoneme module 630 can provide the phonemes extracted from the sample of text together with or separate from the sample of text to the TTS module 620. The phoneme module 630 also provide an identifier of the randomly selected voice or speaker for each sentence in the transcription.”) by Adam et al. US 20230326445 A1
phonemes associated with a respective synthetic speech recording; Adam teaches (“[0053] In some examples, the animated speech refinement system 230 trains the machine learning model by generating training data that includes multiple sets of synthesized audio stream or synthesized voices and their corresponding ground truth phoneme timing locations. The synthesized audio stream or synthesized voices can be generated by a text-to-speech system that can receive a large corpus of text files and can generate speech spoken by various voices using different embeddings. In some cases, the text-to-speech system can generate the synthesized speech by applying a TTS (or other neural network) to a text file and an embedding to generate an audio stream in which a speaker (associated with the embedding) speaks the words of the text file with an emotion or level of emotions provided by an emotion classification system or device. …”) by Adam et al. US 20230326445 A1
determining phonemes associated with text of the respective synthetic speech recording; Adam teaches (“[0053] In some examples, the animated speech refinement system 230 trains the machine learning model by generating training data that includes multiple sets of synthesized audio stream or synthesized voices and their corresponding ground truth phoneme timing locations. The synthesized audio stream or synthesized voices can be generated by a text-to-speech system that can receive a large corpus of text files and can generate speech spoken by various voices using different embeddings. In some cases, the text-to-speech system can generate the synthesized speech by applying a TTS (or other neural network) to a text file and an embedding to generate an audio stream in which a speaker (associated with the embedding) speaks the words of the text file with an emotion or level of emotions provided by an emotion classification system or device. In some examples, the text is normalized to generate a Mel spectrogram for the words of the text file, such as by mapping embedding vectors and translating the Mel spectrogram into an audio stream, such as using vocoder (e.g., a neural network). The audio stream can then be associated with phonemes timing details, including start and end of each phoneme and used as part of the training data to be processed by the ASR engine and to train the machine learning model to predict or estimate timing offsets to the timing provided by the ASR engine. In some examples, the training data audio streams include words of various text files spoken by any specified speaker with any specified emotion, such as neutral, joy, sad, anger, sleepy, disgust, surprise, fear, or any combination thereof.”) by Adam et al. US 20230326445 A1
aligning the phonemes associated with the respective synthetic speech recording with the phonemes associated with the text of the respective synthetic speech recording; and Adam teaches (“[0016] The disclosed techniques improve the quality of the resulting visual and audio match by providing an automated system that predicts alignment offsets of phonemes corresponding to an audio file timing recognized by an ASR engine. The predicted alignment offset is used to adjust the timing of the phonemes generated by the ASR to generate refined phoneme timing.”) (“[0018] In some examples, the machine learning model is trained based on training data that includes synthesized (artificial) speech. The synthesized speech can be generated by a text-to-speech (TTS) system that receives a text file and outputs synthesized speech audio speaking words of the text file and ground truth phoneme locations of the spoken words. This audio can be processed by the ASR to generate a base alignment (timing) for the phoneme locations. The base timing can be processed by the machine learning adjustment model to generate a correction or the offset of the base alignment of the phoneme timing locations, by learning from the ground truth phoneme locations provided by the TTS to update one or more parameters of the machine learning model. By using the TTS to generate the training data, a large and robust collection of training data that includes synthesized speech and ground truth phoneme locations of the spoken words of the synthesized speech can be generated easily and efficiently by simply generating audio of a large corpus of text. In this way, speech does not need to be manually processed to accurately specify the phoneme locations as the TTS automatically generates the accurate phoneme locations of the synthesized speech.”) (“[0096] As discussed below, during the training phase, the machine learning model module 540 is trained to estimate offsets for each phoneme generated by the ASR module 530 for a given audio stream. …”) (“[0101] During training, the machine learning model module 540 implements an artificial neural network or other machine learning technique or network. The machine learning model module 540 is trained to receive an audio stream processed by the ASR module 530, the transcription and/or the list of timestamps (or play positions) of the audio stream and corresponding phoneme for each timestamp in the list of timestamps from the ASR module 530. The machine learning model module 540 is trained to predict or estimate an offset, alignment, modification, or refinement for the phoneme timing information generated by the ASR module 530. The machine learning model module 540 adjusts or provides offsets to the list of timestamps (or play positions) of the audio stream and corresponding phoneme for each timestamp in the list of timestamps based on the predicted or estimated offset, alignment, modification, or refinement for the phoneme timing information. For example, the machine learning model module 540 can be trained to predict a first negative or positive offset (e.g., 5 millisecond) offset for a first type of phoneme and can be trained to predict a second negative or positive offset (e.g., 3 millisecond) offset for a second type of phoneme. 
The list of timestamps can be updated to add the negative or positive offset to the phoneme specified in the list of timestamps based on the output of the machine learning model module 540.”) (“[0103] … The machine learning model module 540 predicts or estimates a plurality of offsets or refinement information or data for each phoneme in the base phoneme locations corresponding to the given training audio stream. During training, the ground truth phoneme locations are then retrieved and compared with the predicted or estimated plurality of offsets to generate a loss. The loss is then used to update one or more parameters of the machine learning model module 540 and another set of training data is received and processed in a similar manner until a stopping criterion is reached.”) by Adam et al. US 20230326445 A1
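As a minimal illustrative Python sketch of the offset correction Adam describes in [0101] and [0103] (the names apply_offsets and timing_loss, and the example values, are assumptions for illustration and are not Adam's disclosure):

def apply_offsets(base_timings, predicted_offsets):
    # Shift each ASR-generated (phoneme, timestamp-in-ms) pair by the
    # offset predicted for that phoneme type.
    return [(ph, t + predicted_offsets.get(ph, 0.0)) for ph, t in base_timings]

def timing_loss(refined_timings, ground_truth):
    # Mean absolute timing error against the TTS ground-truth locations,
    # used during training to update the model's parameters.
    errors = [abs(t - gt) for (_, t), (_, gt) in zip(refined_timings, ground_truth)]
    return sum(errors) / len(errors)

base = [("AH", 120.0), ("B", 155.0), ("AW", 190.0)]   # base alignment from the ASR
offsets = {"AH": 5.0, "B": -3.0, "AW": 2.0}           # predicted per-phoneme offsets
truth = [("AH", 126.0), ("B", 151.0), ("AW", 193.0)]  # ground truth from the TTS
print(timing_loss(apply_offsets(base, offsets), truth))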
The combination does not explicitly teach responsive to failing to align the phonemes associated with the respective synthetic speech with the phonemes associated with the text of the respective synthetic speech recording, removing the respective synthetic speech recording from the synthetic speech data set.
RAGHAVENDRA E VEERA teaches:
responsive to failing to align the phonemes associated with the respective synthetic speech with the phonemes associated with the text of the respective synthetic speech recording, removing the respective synthetic speech recording from the synthetic speech data set. RAGHAVENDRA E VEERA et al. AU 2019202146 A1
teaches (“The present invention relates to a method and system for outlier identification to remove poor alignments in speech synthesis. The identification of poor alignment is based on fundamental frequency methods and group delay-based outlier detection methods, wherein instances of phonemes in a sentence are identified as outliers based on the above fundamental frequency and group delay methods and if the sentence has more than a given number of outliers, discarding the sentence from speech model training.”) by RAGHAVENDRA E VEERA et al. AU 2019202146 A1
RAGHAVENDRA E VEERA is considered to be analogous to the claimed invention because it relates to text-to-speech systems.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Gabrys and Ljolje to incorporate the teachings of RAGHAVENDRA E VEERA in order to remove poorly aligned synthetic speech recordings from the synthetic speech data set.
One could have been motivated to do so because removing poor alignments improves the synthesis quality of the text-to-speech system. (“[0002] A system and method are presented for outlier identification to remove poor alignments in speech synthesis. The quality of the output of a text-to-speech system directly depends on the accuracy of alignments of a speech utterance. The identification of mis-alignments and mis-pronunciations from automated alignments may be made based on fundamental frequency methods and group delay based outlier methods. The identification of these outliers allows for their removal, which improves the synthesis quality of the text-to-speech system.”) by RAGHAVENDRA E VEERA et al. AU 2019202146 A1
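A minimal illustrative Python sketch of the sentence-level filtering RAGHAVENDRA describes (filter_recordings, is_outlier, and max_outliers are hypothetical names used only for illustration): recordings whose phoneme alignments contain more than a given number of outliers are dropped from the data set.

def filter_recordings(recordings, is_outlier, max_outliers=2):
    # Keep a recording only if the number of phoneme instances flagged as
    # outliers (e.g., by F0 or group-delay criteria) stays within the limit.
    kept = []
    for rec in recordings:
        outliers = sum(1 for ph in rec["phonemes"] if is_outlier(ph))
        if outliers <= max_outliers:
            kept.append(rec)
    return kept

data = [{"id": 1, "phonemes": [0.1, 0.2, 3.5]}, {"id": 2, "phonemes": [2.9, 3.1, 3.6]}]
print(filter_recordings(data, is_outlier=lambda score: score > 3.0, max_outliers=1))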
Claims 1-5, 7, and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Adam et al. US 20230326445 A1 in view of Kobayashi et al. US 20040019484 A1, in view of Ljolje et al. US 12340793 B1, further in view of Finkelstein et al. US 20230018384 A1, and further in view of Gabrys et al. US 20230260502 A1.
Regarding Claim 1, Adam teaches:
1. A method comprising:
receiving a plurality of natural speech recordings to train a discriminative model to perform a speech-based discriminative task; FIG. 4 - 8A, Adam teaches the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. Adam teaches (“[0011] FIG. 8A-C is a flowchart illustrating example operations of the animated speech refinement system, according to some examples.”) (“[0088] message duration parameter”) (“[0015] Certain social networking systems allow users to request that audio streams be represented visually and spoken by an avatar. To do so, the audio stream is processed by an automated speech recognition (ASR) engine to identify phonemes and their respective timings with respect to the audio stream. …”) (“[0080] Training data 307 stores a plurality of audio streams that include words of various text files spoken by any specified speaker with any specified emotion, such as neutral, joy, sad, anger, sleepy, disgust, surprise, fear, or any combination thereof. The audio streams can be synthesized by a TTS that processes various text files and can include ground truth phoneme timing locations. Namely, the ground truth phoneme timing locations specify the play positions of each phoneme corresponding to a portion of an audio stream. Specifically, each timestamp or play position of the audio stream can be associated with one or more phonemes, which can be used to animate an avatar speaking (and/or performing gestures associated with) the audio stream.”) (“[0098] The speech module 520 is configured to receive an audio stream that includes one or more words. The audio stream can be received by recording a user speaking the one or more words and generating an audio file. In some examples, the audio stream is received through a messaging system or chat system from another user. In some examples, the audio stream is downloaded from the Internet and received from one or more websites. In some examples, the audio stream is selected from a set of pre-recorded audio streams. In such cases, a user interface is presented to a user in which a plurality of audio stream listings are presented and identified by respective icons or options. In response to receiving a user selection of an icon or option, the corresponding audio stream of the plurality of audio streams is retrieved by the speech module 520. The speech module 520 provides the audio stream including the one or more words to the ASR module 530.”) (“[0099] In some examples, during training, the speech module 520 accesses a plurality of training data from the training data generation module 510. The training data can include exclusively synthesized speech and corresponding ground truth phoneme timing locations. In some examples, the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. During training, the training data is provided to the ASR module 530 and to the machine learning model module 540 to train the machine learning model to establish a relationship between a plurality of training base timings of a plurality of training phonemes and corresponding ground truth timing of the plurality of training phonemes generated by the speech module 520. 
In some examples, the speech module 520 randomly or pseudo-randomly selects a given training set or training audio stream generated by the training data generation module 510.”) (“[0119] Referring back to FIG. 5, during the training phase, the training data generation module 510 can generate synthetic audio streams and corresponding ground truth phoneme timing information for a large corpus of text data and voice data. FIG. 6 shows an example implementation of the training data generation module 510. The training data generation module 510 shown in FIG. 6 can include a text input module 610, a TTS module 620 and a phoneme module 630. In some examples, the training data generation module 510 can operate concurrently with the animated speech refinements system 230 to generate samples of training data to train the machine learning model module 540 on the fly.”) by Adam et al. US 20230326445 A1
generating, using the subset of the plurality of natural speech recordings and the set of speech characteristics, a synthetic speech data set, Adam teaches (“[0119] Referring back to FIG. 5, during the training phase, the training data generation module 510 can generate synthetic audio streams and corresponding ground truth phoneme timing information for a large corpus of text data and voice data. FIG. 6 shows an example implementation of the training data generation module 510. The training data generation module 510 shown in FIG. 6 can include a text input module 610, a TTS module 620 and a phoneme module 630. In some examples, the training data generation module 510 can operate concurrently with the animated speech refinements system 230 to generate samples of training data to train the machine learning model module 540 on the fly.”) by Adam et al. US 20230326445 A1
Adam teaches:
generating, based on the subset of the augmented synthetic speech data set, a balanced data set for training the discriminative model; and
. Adam teaches the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. Adam teaches ASR module (.e. discriminative model) (“[0098] The speech module 520 is configured to receive an audio stream that includes one or more words. The audio stream can be received by recording a user speaking the one or more words and generating an audio file. In some examples, the audio stream is received through a messaging system or chat system from another user. In some examples, the audio stream is downloaded from the Internet and received from one or more websites. In some examples, the audio stream is selected from a set of pre-recorded audio streams. In such cases, a user interface is presented to a user in which a plurality of audio stream listings are presented and identified by respective icons or options. In response to receiving a user selection of an icon or option, the corresponding audio stream of the plurality of audio streams is retrieved by the speech module 520. The speech module 520 provides the audio stream including the one or more words to the ASR module 530.”) (“[0099] In some examples, during training, the speech module 520 accesses a plurality of training data from the training data generation module 510. The training data can include exclusively synthesized speech and corresponding ground truth phoneme timing locations. In some examples, the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. During training, the training data is provided to the ASR module 530 and to the machine learning model module 540 to train the machine learning model to establish a relationship between a plurality of training base timings of a plurality of training phonemes and corresponding ground truth timing of the plurality of training phonemes generated by the speech module 520. In some examples, the speech module 520 randomly or pseudo-randomly selects a given training set or training audio stream generated by the training data generation module 510.”) (“[0119] Referring back to FIG. 5, during the training phase, the training data generation module 510 can generate synthetic audio streams and corresponding ground truth phoneme timing information for a large corpus of text data and voice data. FIG. 6 shows an example implementation of the training data generation module 510. The training data generation module 510 shown in FIG. 6 can include a text input module 610, a TTS module 620 and a phoneme module 630. In some examples, the training data generation module 510 can operate concurrently with the animated speech refinements system 230 to generate samples of training data to train the machine learning model module 540 on the fly.”) by Adam et al. US 20230326445 A1
training, using the balanced data set, the discriminative model to perform the speech-based discriminative task. Adam teaches ASR module (.e. discriminative model) (“[0098] The speech module 520 is configured to receive an audio stream that includes one or more words. The audio stream can be received by recording a user speaking the one or more words and generating an audio file. In some examples, the audio stream is received through a messaging system or chat system from another user. In some examples, the audio stream is downloaded from the Internet and received from one or more websites. In some examples, the audio stream is selected from a set of pre-recorded audio streams. In such cases, a user interface is presented to a user in which a plurality of audio stream listings are presented and identified by respective icons or options. In response to receiving a user selection of an icon or option, the corresponding audio stream of the plurality of audio streams is retrieved by the speech module 520. The speech module 520 provides the audio stream including the one or more words to the ASR module 530.”) (“[0099] In some examples, during training, the speech module 520 accesses a plurality of training data from the training data generation module 510. The training data can include exclusively synthesized speech and corresponding ground truth phoneme timing locations. In some examples, the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. During training, the training data is provided to the ASR module 530 and to the machine learning model module 540 to train the machine learning model to establish a relationship between a plurality of training base timings of a plurality of training phonemes and corresponding ground truth timing of the plurality of training phonemes generated by the speech module 520. In some examples, the speech module 520 randomly or pseudo-randomly selects a given training set or training audio stream generated by the training data generation module 510.”) by Adam et al. US 20230326445 A1
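As a minimal illustrative Python sketch of the mixed training set Adam describes in [0099] (build_training_set and synth_fraction are hypothetical names used only for illustration):

import random

def build_training_set(synthetic, real, synth_fraction=0.7, size=1000):
    # Draw a mix of synthesized examples (with TTS ground-truth phoneme timings)
    # and real-world examples (with manually specified timings).
    n_synth = int(size * synth_fraction)
    picked = random.sample(synthetic, n_synth) + random.sample(real, size - n_synth)
    random.shuffle(picked)
    return picked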
Adam does not explicitly teach identifying, based on the speech-based discriminative task, a set of speech characteristics
Kobayashi teaches:
identifying, based on the speech-based discriminative task, a set of speech characteristics; Kobayashi teaches (“[0020] If the speech is to be synthesized for a meaningful sentence, seasoned with emotion, there is a risk that, except if control is made so that the prosodic characteristics of the language in question, such as accent positions, duration or loudness, are maintained, the hearer is unable to understand the meaning of the synthesized speech correctly.”) (“[0021] It is therefore an object of the present invention to provide a method and apparatus for speech synthesis, program, recording medium, method and apparatus for generating constraint information, and a robot apparatus, in which the emotion can be added to the synthesized speech as the prosodic characteristics of the language in question are maintained.”) (“[0022] In one aspect, the present invention provides a speech synthesis method for receiving information on the emotion to synthesize the speech, including a prosodic data forming step of forming prosodic data from a string of pronunciation marks which is based on an uttered text, uttered as speech, a constraint information generating step of generating the constraint information used for maintaining prosodical features of the uttered text, a parameter changing step of changing parameters of the prosodic data, in consideration of the constraint information, responsive to the information on the emotion, and a speech synthesis step of synthesizing the speech based on the prosodic data the parameters of which have been changed in the parameter changing step”) by Kobayashi et al. US 20040019484 A1
Kobayashi is considered to be analogous to the claimed invention because it relates to a method and apparatus for speech synthesis, program, recording medium for receiving information on the emotion to synthesize the speech.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam to incorporate the teachings of Kobayashi in order to include receiving and modifying synthetic speech data based on speech characteristics.
One could have been motivated to do so because the device can have respective independent behavioral models. (“[0162] The behavioral model library 80 is provided with respective independent behavioral models …”) by Kobayashi et al. US 20040019484 A1
The combination does not explicitly teach applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set, and identifying, based on the augmented synthetic speech data set, a subset of the augmented synthetic speech data set in view of language science resources;
Ljolje teaches:
applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set, Ljolje teaches the system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples. (“(29) The system applies 360 the transformations to the voice sample to generate augmented voice samples. For example, the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.). In embodiments, the system may be configured to apply the transformations to some of the voice samples, such as by selecting voice samples at random, by selecting a transformation to apply at random, or some combination thereof.”) (“(30) The system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples (e.g., average male version, average female version, average child version, higher-than average male version, lower-than average male version, higher-than average female version, lower-than average female version, etc.). The training dataset can then be then used to train a speech recognition model.”) by Ljolje et al. US 12340793 B1
Ljolje teaches:
identifying, based on the augmented synthetic speech data set, a subset of the augmented synthetic speech data set in view of language science resources;
Ljolje teaches voice samples are split into groups (e.g., high/low, female/male) (i.e. subset), such as by referencing gaussian distributions of spectra for female and male speech (or their parametrized representations) or by using a VTLN-trained acoustic model. Also, Ljolje taking a set of voice samples grouped into a male speaker class of representations, a transformation may be applied across the spectral representations for the set of voice samples in order to augment them into spectral representations that fit the distribution for a female speaker class. referencing gaussian distributions (i.e. in view of language science resources). (“(15) … In one embodiment, spectral comparison 198B may obtain warp distributions for each class of spectral representations by determining a warp value for each voice sample 110 and plotting the warp values of each class as a gaussian distribution. The “peak warp value” may refer to the most frequently occurring warp value for a given class of spectral representations (e.g., the highest point on a histogram of the warp values for the class or the center of a gaussian distribution of the warp values for the class). For example, a peak warp value for a class of male speaker types may be centered at 1.1, with warp values within one standard deviation of the peak warp value sitting between 1.06 and 1.14. In one embodiment, the spectral comparison module 198B may determine the peak warp values associated with each class of spectral representation by applying the voice samples of each class into a trained acoustic model. For example, an acoustic model may be trained to receive a group of voice samples and estimate a peak warp value for the group of samples. …” col. 5, lines 25-43) (“(17) … For example, taking a set of voice samples grouped into a male speaker class of representations, a transformation may be applied across the spectral representations for the set of voice samples in order to augment them into spectral representations that fit the distribution for a female speaker class. …” col. 6, lines 29-33) (“(22) In embodiments, voice samples are split into groups (e.g., high/low, female/male), such as by referencing gaussian distributions of spectra for female and male speech (or their parametrized representations) or by using a VTLN-trained acoustic model. …” col. 8, lines 4-8) (“(29) The system applies 360 the transformations to the voice sample to generate augmented voice samples. For example, the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.). In embodiments, the system may be configured to apply the transformations to some of the voice samples, such as by selecting voice samples at random, by selecting a transformation to apply at random, or some combination thereof.” Col. 9, lines 33-44) by Ljolje et al. US 12340793 B1
Ljolje is considered to be analogous to the claimed invention because it relates generally to the fields of acoustic modeling and speech recognition, and more specifically, to augmenting human speech data for training a recognition model.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam and Kobayashi to incorporate the teachings of Ljolje in order to include a subset of the augmented synthetic speech data set in view of language science resources.
One could have been motivated to do so because the accuracy of a recognition model is improved. (“(37) … Thus, the accuracy of a recognition model is improved when trained on a dataset enlarged with the augmented voice samples that are generated as described. …”) col. 10, lines 64-67 by Ljolje et al. US 12340793 B1
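A minimal illustrative Python sketch of the class-to-class augmentation Ljolje describes in (29) and (30) (augment_dataset, transforms, and the toy data are assumptions for illustration): every sample in one speaker class is transformed to fit another class, and the augmented copies are added back alongside the originals.

def augment_dataset(samples, transforms):
    # samples: list of (speaker_class, waveform) pairs.
    # transforms: {(source_class, target_class): function} mapping one class
    # of spectral representations onto another.
    augmented = list(samples)                      # keep the original voice samples
    for label, wav in samples:
        for (src, dst), fn in transforms.items():
            if src == label:
                augmented.append((dst, fn(wav)))   # add the transformed version
    return augmented

samples = [("male", [100, 110, 120]), ("female", [210, 220, 230])]
transforms = {("male", "female"): lambda w: [f * 1.2 for f in w],
              ("female", "male"): lambda w: [f / 1.2 for f in w]}
print(len(augment_dataset(samples, transforms)))   # 2 originals + 2 augmented copies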
The combination does not explicitly teach cloning a voice or generating speech that resembles a speaker's voice.
Finkelstein teaches:
identifying receiving, based on the set of speech characteristics a subset of the plurality of natural speech recordings, wherein each natural speech recording of the subset represents a speaker to be cloned; Finkelstein teaches (“[0037] FIG. 2A shows an example of the trained voice cloning system 200, 200a of the system 100. The trained voice cloning system 200a receives a training audio signal 102 corresponding to a reference utterance spoken by the targets speaker in a first accent/dialect and a corresponding transcription 106 of the reference utterance, and generates a training synthesized speech representation 202 that clones the voice of the target speaker in a second accent/dialect different than the first accent/dialect. …”) (“[0047] The untrained TTS system 300 includes a TTS model 400 and a synthesizer 150. The TTS model 400 includes an encoder portion 400a and a decoder portion 400b. The TTS model 400 may additionally include a variation layer. The encoder portion 400a is trained to learn how to encode the training synthesized speech representation 202 into a corresponding utterance embedding 204 that represents a prosody and/or the second accent/dialect captured by the training synthesized speech representation 202. During training, the decoder portion 400b is conditioned on the transcript 106 and the conditioning inputs (e.g., speaker embedding/identifiers 108 and accent/dialect identifier) …”) by Finkelstein et al. US 20230018384 A1
Finkelstein teaches:
wherein each synthetic speech recording of the synthetic speech data set is generated for a specified text in a voice that resembles a voice of a corresponding speaker of the subset of the plurality of natural speech recordings; Finkelstein teaches FIG, 1-6 (“[0046] FIG. 3 illustrates an example training process 301 for training the TTS system 300 on training synthesized speech representations 202 generated by the trained voice cloning system 200. The trained voice cloning system 200 obtains the training data 10 including training audio signals 102 and corresponding transcripts 106. Each training signal 102 may be associated with the conditioning inputs that include the speaker embedding/identifiers 108 and the accent/dialect identifier 109. Here, the training audio signals 102 of the training data 10 represent human speech in a first accent/dialect (e.g., American English). Based on the training audio signal 102 (and optionally the corresponding transcript), the trained voice cloning system 200 is configured to generate a training synthesized speech representation 202 including the voice of the target speaker in a second accent/dialect different than the first accent/dialect. The training synthesized speech representation 202 may include an audio waveform or a sequence of mel-frequency spectrograms. The trained voice cloning system 200 provides the training synthesized speech representation 202 for training the untrained TTS model 300.”) by Finkelstein et al. US 20230018384 A1
Finkelstein is considered to be analogous to the claimed invention because it relates to two-level text-to-speech systems using synthetic training data.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Kobayashi, and Ljolje to incorporate the teachings of Finkelstein in order to include the voice cloning feature.
One could have been motivated to do so because adding the predicted residual to the mel-frequency spectrogram generated by the linear projection improves the overall reconstruction. (“[0067] The convolutional post-net 540 with one or more convolutional layers processes the predicted mel-frequency spectrogram 502P for the time step to predict a residual 542 to add to the predicted mel-frequency spectrogram 502P at adder 550. This improves the overall reconstruction. Each convolutional layer except for the final convolutional layer may be followed by batch normalization and hyperbolic tangent (TanH) activations. The convolutional layers are regularized using dropout with a probability of, for example, 0.5. The residual 542 is added to the predicted mel-frequency spectrogram 502P generated by the linear projection 520, and the sum (i.e., the mel-frequency spectrogram 502) may be provided to the speech synthesizer 150. In some implementations, in parallel to the decoder portion 500 predicting mel-frequency spectrograms 502 for each time step, a concatenation of the output of the LSTM subnetwork 520, [the utterance embedding], and the portion of the training data 10 (e.g., a character embedding generated by a text encoder (not shown)) is projected to a scalar and passed through a sigmoid activation to predict the probability that the output sequence of mel frequency spectrograms 502 has completed. The output sequence mel-frequency spectrograms 502 corresponds to the training synthesized speech representation 202 for the training data 10 and includes the intended prosody and intended accent of the target speaker.”) by Finkelstein et al. US 20230018384 A1
The combination does not explicitly teach for each synthetic speech recording of the synthetic speech data set, identifying an expected speech characteristic for a respective synthetic speech recording.
Gabrys teaches:
for each synthetic speech recording of the synthetic speech data set, identifying an expected speech characteristic for a respective synthetic speech recording; Gabrys teaches (“[0023] Second, the trained single-speaker TTS component may be used to generate a synthetic parallel dataset for a multi-speaker corpus. The multi-speaker corpus may include samples recorded speech from multiple speakers along with transcripts corresponding to the recorded speech. The multi-speaker corpus may include, for example, example speech of multiple or many speakers, with samples covering many or all phonemes for a particular language. The corpus may include transcript(s) of what the speakers are saying. The single-speaker TTS component may process the transcript to generate synthesized speech that may be used as an input for training the voice-modifying model. The recorded speech may serve as target speech for the training (e.g., the target speech may be used to evaluate the output of the voice-modifying model during training). The synthesized speech and target speech (as well as voice characteristics determined from the recorded speech) form the synthetic parallel dataset used to pre-train the voice-modifying model.”) by Gabrys et al. US 20230260502 A1
generating, based on language science resources, an expected speech recording for the respective synthetic speech recording; Gabrys teaches (“[0035] … The parallel dataset 130/140 may include the synthesized spectrogram data 182 (e.g., as generated by the TTS component 180 based on a transcript corresponding to the target speech), a target spectrogram 164 (representing a recording of the target speech), speaker embedding data (e.g., representing identifiable characteristics of the target speech), and/or frequency data 168 (e.g., representing pitch information corresponding to the target speech). …”) by Gabrys et al. US 20230260502 A1
comparing the respective synthetic speech recording to the expected speech recording; Gabrys teaches (“[0025] … The voice-modifying model may process synthesized speech and target voice characteristics to generate voice-modified speech. The voice-modified speech may be compared to the corresponding examples of the target voice. …”) by Gabrys et al. US 20230260502 A1
updating metadata of the respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match the expected speech recording; Gabrys teaches (“[0050] The method 500 may include adapting the fundamental frequency (“f.sub.0”) of the synthesized speech to that of the target voice (stage 530). The synthesized speech may have a fundamental frequency (e.g., pitch and/or timbre), which may be constant or have some contour (e.g., an upward and/or downward contour). The speech feature extractor component 160 may determine a mean and/or variance of the fundamental frequency of the synthesized speech. The speech feature extractor component 160 may compare the fundamental frequency mean and/or variance of the synthesized speech to that of the target voice. The synthesized speech may thus be modified such that fundamental frequency mean and/or variance matches or approximates the target voice.”) (“[0051] The method 500 may include using the voice modifier component 190 to modify the synthesized spectrogram data according to the target speaker embedding (stage 540). The voice modifier component 190 may receive the synthesized speech (e.g., the predicted spectrogram) and the voice characteristic data of the target voice (e.g., the speaker embedding data). The voice modifier component 190 may process the input to generate voice-modified synthesized speech having voice characteristics similar to the target voice.”) .”) by Gabrys et al. US 20230260502 A1
Gabrys is considered to be analogous to the claimed invention because it relates to a text-to-speech (TTS) system may be configured to imitate characteristics of a target voice based on a limited dataset.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Kobayashi, Ljolje, and Finkelstein to incorporate the teachings of Gabrys in order to include parametric synthesis as an augmentation technique.
One could have been motivated to do so because the speech synthesis engine 718 may be configured to convert the input text into high-quality natural-sounding speech in an efficient manner. (“[0085] … The speech synthesis engine 718 may be configured to convert the input text into high-quality natural-sounding speech in an efficient manner. …”) by Gabrys et al. US 20230260502 A1
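A minimal illustrative Python sketch of the fundamental-frequency adaptation Gabrys describes in [0050] (adapt_f0, its arguments, and the contour values are hypothetical and used only for illustration): the synthesized pitch contour is rescaled so its mean and variance approximate those of the target voice.

import statistics

def adapt_f0(synth_f0, target_mean, target_std):
    # Shift and scale a synthesized F0 (pitch) contour toward the
    # target voice's mean and standard deviation.
    mean = statistics.mean(synth_f0)
    std = statistics.pstdev(synth_f0) or 1.0
    return [target_mean + (f - mean) * (target_std / std) for f in synth_f0]

contour = [180.0, 185.0, 190.0, 200.0]            # synthesized pitch track in Hz
print(adapt_f0(contour, target_mean=120.0, target_std=10.0))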
Regarding Claim 2, the combination teaches the method of claim 1 as identified above.
Adam further teaches:
2. The method of claim 1, wherein the speech-based discriminative task is one of: keyword spotting, wake word detection, phoneme spotting, emotion detection, transcription, natural language processing (NLP), or automatic speech recognition (ASR). Adam teaches an ASR model. (“[0098] … The speech module 520 provides the audio stream including the one or more words to the ASR module 530.”) (“[0099] In some examples, during training, the speech module 520 accesses a plurality of training data from the training data generation module 510. The training data can include exclusively synthesized speech and corresponding ground truth phoneme timing locations. In some examples, the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. During training, the training data is provided to the ASR module 530 …”) by Adam et al. US 20230326445 A1
Kobayashi further teaches emotion model. (“[0065] At a first step S1 in FIG. 1, the emotion condition of the emotion model of the speaking entity is discriminated. Specifically, the state of the emotion model (emotion condition) is changed depending on the surrounding environments (extraneous factors) or internal states (internal factors). As to the emotion states, it is discriminated which of the calm, anger, sadness, happiness and comfort is the prevailing emotion.”) by Kobayashi et al. US 20040019484 A1
Kobayashi is considered to be analogous to the claimed invention because it relates to a method and apparatus for speech synthesis, program, recording medium for receiving information on the emotion to synthesize the speech.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam to incorporate the teachings of Kobayashi in order to include receiving and modifying synthetic speech data based on speech characteristics.
One could have been motivated to do so because adding emotion to the uttered speech makes the device's synthesized output more meaningful. (“[0043] The addition of the emotion expression to the uttered speech, as a function in e.g., a robot apparatus, simulating the human being, and which has the functions of outputting the meaningful synthesized speech, operates extremely effectively in promoting the intimacy between the robot apparatus and the human being.”) by Kobayashi et al. US 20040019484 A1
Regarding Claim 3, the combination teaches the method of claim 1 as identified above.
Kobayashi teaches:
3. The method of claim 1, wherein the set of speech characteristics includes at least one of: prosody, duration, emotion, pitch, pace, emphasis, accents, or language. Kobayashi teaches (“[0047] Thus, in the embodiments of the present invention, the correlation between the emotion and the acoustic characteristics are modeled and speech utterance is made on the basis of these acoustic characteristics to express the emotion in the speech. Moreover, in the present embodiments, the emotion is expressed by changing such parameters as time duration, pitch or sound volume (sound intensity) depending on the emotion. At this time, the constraint information, which will be explained subsequently, is added to the parameters changed, so that the prosodic characteristics of the language of the text to be synthesized will be maintained, that is so that no changes will be made in the uttered speech contents.”) (“[0051] FIG. 3 shows the relation between the duration of each phoneme and the pitch;”) (“[0069] At the next step S2, prosodic data, representing the duration, pitch and loudness of the phoneme in question, is prepared, by statistical techniques, such as quantification class 1, using the information such as accent types extracted from the string of pronunciation symbols, number of accent phrases in the sentence, positions of the accents in the sentence, number of phonemes in the accent phrases or the types of the phonemes.”
) by Kobayashi et al. US 20040019484 A1
Kobayashi is considered to be analogous to the claimed invention because it relates to a method and apparatus for speech synthesis, program, recording medium for receiving information on the emotion to synthesize the speech.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam to incorporate the teachings of Kobayashi in order to include receiving and modifying synthetic speech data based on speech characteristics.
One could have been motivated to do so because adding emotion to the uttered speech makes the device's synthesized output more meaningful. (“[0043] The addition of the emotion expression to the uttered speech, as a function in e.g., a robot apparatus, simulating the human being, and which has the functions of outputting the meaningful synthesized speech, operates extremely effectively in promoting the intimacy between the robot apparatus and the human being.”) by Kobayashi et al. US 20040019484 A1
Regarding Claim 4, the combination teaches the method of claim 1 as identified above. Kobayashi further teaches:
4. The method of claim 1, wherein the language science resources include at least one of: a phonemes library, an acoustic model, or a linguistic library. Kobayashi teaches (“[0047] Thus, in the embodiments of the present invention, the correlation between the emotion and the acoustic characteristics are modeled and speech utterance is made on the basis of these acoustic characteristics to express the emotion in the speech. …”) (“[0059] FIG. 11 is a block diagram showing the structure of a behavioral model library of the application layer;”) by Kobayashi et al. US 20040019484 A1
Kobayashi is considered to be analogous to the claimed invention because it relates to a method and apparatus for speech synthesis, program, recording medium for receiving information on the emotion to synthesize the speech.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam to incorporate the teachings of Kobayashi in order to include receiving and modifying synthetic speech data based on speech characteristics.
One could have been motivated to do so because the device can have respective independent behavioral models. (“[0162] The behavioral model library 80 is provided with respective independent behavioral models …”) by Kobayashi et al. US 20040019484 A1
Regarding Claim 5, the combination teaches the method of claim 1 as identified above.
Ljolje further teaches:
5. The method of claim 1, wherein generating, using the subset of the plurality of natural speech recordings and based on the set of speech characteristics, the synthetic speech data set comprises:
Ljolje teaches (“(14) … In embodiments, the classes of spectral representations may correspond to a variety of speaker types having varying vocal tract lengths relative to one another, such as male speakers, female speakers, child speakers, among others. In some embodiments, the speaker types may include subtypes, such as male (high end, small male), male (average), male (low end, large male), female (high end, small female), female (average), female (low end, large female), androgenous, child (young age, small child), child (average), child (older age, larger child), etc. In one embodiment, the speaker types may include bass, tenor, alto, soprano, and the like. In one embodiment, spectral classification module 198A may group voice samples 110 into classes of spectral representations by identifying speaker type labels associated with each of the voice samples. For examples, data augmentation engine 198 may obtain voice samples 110 that are labeled according to gender, age, size of speaker, etc. In one embodiment, spectral classification module 198A may group voice samples 110 into classes of spectral representation …” col. 4, lines 35-65) by Ljolje et al. US 12340793 B1
configuring a speech generation engine in view of the set of speech characteristics; Ljolje teaches (“(15) Spectral comparison module 198B determines spectral change ratios based on a comparison of warp distributions. Spectral comparison module 198B is configured to obtain, determine, or calculate warp distributions associated with each class of spectral representations used for grouping of voice samples 110 by spectral classification module 198A. …”) (“(29) The system applies 360 the transformations to the voice sample to generate augmented voice samples. For example, the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.). In embodiments, the system may be configured to apply the transformations to some of the voice samples, such as by selecting voice samples at random, by selecting a transformation to apply at random, or some combination thereof.”) by Ljolje et al. US 12340793 B1
and inputting, into the speech generation engine, the subset of the plurality of natural speech recordings to generate the synthetic speech data set. Ljolje teaches (“(8) Audio processing system 199 processes audio files. The audio files may include media files containing voice recordings. Additionally, the media files may include text data associated with the voice recordings. The audio processing system 199 may be configured to recognize speech contained in the voice recordings (e.g., human speech), and may further convert the speech into text for one or more languages. The audio processing system 199 is further configured to collect voice samples 110, augment the voice samples, generate/compile training dataset(s) 120 from the voice samples, and to train/build a recognition model 197 using the training dataset 120. To perform these functions, the audio processing system 199 comprises a speech recognition engine 101, training engine 102, and data augmentation engine 198. For ease of describing the invention, speech recognition engine 101, training engine 102, and data augmentation engine 198 are shown as integrated into a single system, audio processing system 199; however, in some embodiments, they may each be separate and distinct systems (e.g., separate and remote servers). In embodiments, audio processing system 199 may comprise one or more computing devices that include the components of the machine depicted in FIG. 4.” Column 2, lines 30-50) (“(17) … By applying a spectral change ratio derived from a comparison of peak warp values between the distributions of two classes, data transformation module 198C generates a set of augmented voice samples mapping one class of representations to another (i.e., creating a new set of voice samples that fits a typical distribution for the speaker type that is targeted). …”) (“(37) … A range of transformations can be achieved to fit a particular target distribution, which may be different from the original male or female distribution, thus providing greater flexibility over existing augmentation techniques. …”col. 11. Lines 5-10) (“(9) Speech recognition engine 101 provides a speech recognition service. For example, audio data received from client device 103 through network 106 may be processed and translated into text for human speech. In embodiments, the client device 103 may establish a connection to speech recognition engine via speech recognition application 111, which may provide functionality for obtaining audio data through the client device 103 (e.g., retrieving from memory or obtaining through an audio input device of the client device 103) …” col. 2, lines 55-65) (“(14) … In embodiments, the classes of spectral representations may correspond to a variety of speaker types having varying vocal tract lengths relative to one another, such as male speakers, female speakers, child speakers, among others. In some embodiments, the speaker types may include subtypes, such as male (high end, small male), male (average), male (low end, large male), female (high end, small female), female (average), female (low end, large female), androgenous, child (young age, small child), child (average), child (older age, larger child), etc. In one embodiment, the speaker types may include bass, tenor, alto, soprano, and the like. In one embodiment, spectral classification module 198A may group voice samples 110 into classes of spectral representations by identifying speaker type labels associated with each of the voice samples. 
For examples, data augmentation engine 198 may obtain voice samples 110 that are labeled according to gender, age, size of speaker, etc. …” entire column 5) (“(15) Spectral comparison module 198B determines spectral change ratios based on a comparison of warp distributions. Spectral comparison module 198B is configured to obtain, determine, or calculate warp distributions associated with each class of spectral representations used for grouping of voice samples 110 by spectral classification module 198A. As used herein a “warp value” may refer to a value indicating the spectral difference between a particular voice sample and a normalized voice sample. For example, a warp value may be the spectral difference or “warp” between a voice of a particular person having a particular vocal tract length and the voice of an average or median vocal tract length across a set of samples. As one example, a set of male, female, and child speakers may speak a transcript of words and phrases, and the normalized voice sample for the set of samples may be a hypothetical voice sample of an androgenous, average-aged speaker speaking the transcript. In one embodiment, spectral comparison 198B may obtain warp distributions for each class of spectral representations by determining a warp value for each voice sample 110 and plotting the warp values of each class as a gaussian distribution. The “peak warp value” may refer to the most frequently occurring warp value for a given class of spectral representations (e.g., the highest point on a histogram of the warp values for the class or the center of a gaussian distribution of the warp values for the class). For example, a peak warp value for a class of male speaker types may be centered at 1.1, with warp values within one standard deviation of the peak warp value sitting between 1.06 and 1.14. In one embodiment, the spectral comparison module 198B may determine the peak warp values associated with each class of spectral representation by applying the voice samples of each class into a trained acoustic model. For example, an acoustic model may be trained to receive a group of voice samples and estimate a peak warp value for the group of samples. …” entire column 5) by Ljolje et al. US 12340793 B1
Ljolje is considered to be analogous to the claimed invention because it relates generally to the fields of acoustic modeling and speech recognition, and more specifically, to augmenting human speech data for training a recognition model.
Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam and Kobayashi to incorporate the teachings of Ljolje in order to include a subset of the augmented synthetic speech data set in view of language science resources.
One could have been motivated to do so because the accuracy of a recognition model is improved. (“(37) … Thus, the accuracy of a recognition model is improved when trained on a dataset enlarged with the augmented voice samples that are generated as described. …”) col. 10, lines 64-67 by Ljolje et al. US 12340793 B1
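A minimal illustrative Python sketch of deriving a spectral change ratio from per-class warp values, as Ljolje describes in (15) and (17) (peak_warp, spectral_change_ratio, and the warp values are assumptions for illustration): the ratio of the peak warp values of two classes drives the transformation from one class onto the other.

from collections import Counter

def peak_warp(warp_values, precision=2):
    # Most frequently occurring warp value for one class of spectral
    # representations (a simple stand-in for the peak of a gaussian fit).
    return Counter(round(w, precision) for w in warp_values).most_common(1)[0][0]

def spectral_change_ratio(source_warps, target_warps):
    # Ratio used to shift the source class's warp distribution onto the
    # peak of the target class's distribution.
    return peak_warp(target_warps) / peak_warp(source_warps)

male = [1.08, 1.10, 1.11, 1.10, 1.12]
female = [0.92, 0.90, 0.91, 0.90, 0.89]
print(spectral_change_ratio(male, female))   # roughly 0.82: warp male toward female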
Regarding Claim 7, the combination teaches the method of claim 1 as identified above.
Ljolje further teaches:
7. The method of claim 1, wherein applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set comprises: Ljolje teaches the female speaker class 520 is expanded by 20%, shifting down the fundamental frequencies captured at each time interval to generate male voice augmentations 530 of the female speaker class 520. The shape of the corresponding spectrograms for each speaker class (and thus the real-world voice characteristics of their spectral representations) is maintained during the transformations. (“(36) FIG. 5 is a conceptual illustration of applying transformations to voice samples grouped into classes of spectral representations. In illustration 500, spectral representations are visualized as spectrograms. A set of spectral representations 510 are split into a female speaker class 520 of spectral representations and a male speaker class 540 of spectral representations. The female speaker class 520 is expanded by 20%, shifting down the fundamental frequencies captured at each time interval to generate male voice augmentations 530 of the female speaker class 520. The male speaker class 540 is compressed by 20%, shifting up the fundamental frequencies captured at each time interval to generate female voice augmentations 550 of the male speaker class 540. As such, the shape of the corresponding spectrograms for each speaker class (and thus the real-world voice characteristics of their spectral representations) is maintained during the transformations.”) Examiner Note: shifting the fundamental frequencies of the spectrogram up or down at each time interval, and the transformation applied to do so, constitute augmentation techniques.
for each synthetic speech recording of the synthetic speech data set, identifying information from the metadata for modifying a respective synthetic speech recording; and Ljolje teaches (“(17) Data transformation module 198C determines and applies transformations to data to generate augmented data samples. In embodiments, data transformation module 198C applies a particular data transformation to voice samples grouped into a given class of spectral representations. For example, taking a set of voice samples grouped into a male speaker class of representations, a transformation may be applied across the spectral representations for the set of voice samples in order to augment them into spectral representations that fit the distribution for a female speaker class. In embodiments, data transformation module 198C may determine the transformations based on spectral change ratios determined by spectral comparison module 198B. Each spectral change ratio determines the change in frequency for shifting a warp distribution of a particular class of spectral representations to the center/peak of another class of spectral representations (e.g., shifting male to female, female to male, male to child, child to male, female to child, child to female, etc.). By applying a spectral change ratio derived from a comparison of peak warp values between the distributions of two classes, data transformation module 198C generates a set of augmented voice samples mapping one class of representations to another (i.e., creating a new set of voice samples that fits a typical distribution for the speaker type that is targeted). In one embodiment, data transformation module 198C further changes the tempo of the voice samples as part of the data transformation. In another embodiment, to add additional variance to the set of augmented voice samples, random noise or other random spectral changes may be added or incorporated into the transformations. Additional details regarding applying transformations to voice samples grouped into different classes of spectral representations are provided with respect to the description of FIG. 5, further below.”) by Ljolje et al. US 12340793 B1
modifying, based on the information, one or more characteristics of the respective synthetic speech recording using the one or more augmentation techniques to augment a corresponding Ljolje teaches the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.) (“(29) The system applies 360 the transformations to the voice sample to generate augmented voice samples. For example, the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.). In embodiments, the system may be configured to apply the transformations to some of the voice samples, such as by selecting voice samples at random, by selecting a transformation to apply at random, or some combination thereof.”) (“(30) The system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples (e.g., average male version, average female version, average child version, higher-than average male version, lower-than average male version, higher-than average female version, lower-than average female version, etc.). The training dataset can then be then used to train a speech recognition model.”) (“(36) FIG. 5 is a conceptual illustration of applying transformations to voice samples grouped into classes of spectral representations. In illustration 500, spectral representations are visualized as spectrograms. A set of spectral representations 510 are split into a female speaker class 520 of spectral representations and a male speaker class 540 of spectral representations. The female speaker class 520 is expanded by 20%, shifting down the fundamental frequencies captured at each time interval to generate male voice augmentations 530 of the female speaker class 520. The male speaker class 540 is compressed by 20%, shifting up the fundamental frequencies captured at each time interval to generate female voice augmentations 550 of the male speaker class 540. As such, the shape of the corresponding spectrograms for each speaker class (and thus the real-world voice characteristics of their spectral representations) is maintained during the transformations.”) by Ljolje et al. US 12340793 B1
Ljolje is considered to be analogous to the claimed invention because it relates generally to the fields of acoustic modeling and speech recognition, and more specifically, to augmenting human speech data for training a recognition model.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam and Kobayashi to incorporate the teachings of Ljolje in order to include a subset of the augmented synthetic speech data set in view of language science resources.
One of ordinary skill in the art would have been motivated to do so because the accuracy of a recognition model is improved. (“(37) … Thus, the accuracy of a recognition model is improved when trained on a dataset enlarged with the augmented voice samples that are generated as described. …”) col. 10, lines 64-67, by Ljolje et al. US 12340793 B1
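For illustration only, the following minimal Python sketch outlines a generic frequency-axis warping of a spectrogram of the general kind depicted in Ljolje's FIG. 5 as quoted above. The warp-factor convention, names, and values are the sketch's own and do not necessarily match Ljolje's implementation.

import numpy as np

def warp_spectrogram(spec, warp_factor):
    # Rescale the frequency axis of a magnitude spectrogram by warp_factor,
    # preserving its overall shape. In this sketch's convention, warp_factor
    # below 1.0 moves spectral content toward lower frequencies and above 1.0
    # toward higher frequencies (hypothetical convention).
    n_freq, n_frames = spec.shape
    bins = np.arange(n_freq)
    sample_points = bins / warp_factor  # source bin for each output bin
    warped = np.empty_like(spec)
    for t in range(n_frames):
        warped[:, t] = np.interp(sample_points, bins, spec[:, t], left=0.0, right=0.0)
    return warped

# Hypothetical spectrograms (257 frequency bins x 100 frames): shift one
# sample's spectral content down by roughly 20% and another's up by roughly 20%.
rng = np.random.default_rng(0)
shifted_down = warp_spectrogram(np.abs(rng.standard_normal((257, 100))), 0.8)
shifted_up = warp_spectrogram(np.abs(rng.standard_normal((257, 100))), 1.2)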
Regarding Claim 8, the combination teaches the method of claim 1 as identified above.
Adam further teaches:
8. The method of claim 1, wherein generating, based on the subset of the Adam teaches the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. Adam teaches (“[0098] The speech module 520 is configured to receive an audio stream that includes one or more words. The audio stream can be received by recording a user speaking the one or more words and generating an audio file. In some examples, the audio stream is received through a messaging system or chat system from another user. In some examples, the audio stream is downloaded from the Internet and received from one or more websites. In some examples, the audio stream is selected from a set of pre-recorded audio streams. In such cases, a user interface is presented to a user in which a plurality of audio stream listings are presented and identified by respective icons or options. In response to receiving a user selection of an icon or option, the corresponding audio stream of the plurality of audio streams is retrieved by the speech module 520. The speech module 520 provides the audio stream including the one or more words to the ASR module 530.”) (“[0099] In some examples, during training, the speech module 520 accesses a plurality of training data from the training data generation module 510. The training data can include exclusively synthesized speech and corresponding ground truth phoneme timing locations. In some examples, the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. During training, the training data is provided to the ASR module 530 and to the machine learning model module 540 to train the machine learning model to establish a relationship between a plurality of training base timings of a plurality of training phonemes and corresponding ground truth timing of the plurality of training phonemes generated by the speech module 520. In some examples, the speech module 520 randomly or pseudo-randomly selects a given training set or training audio stream generated by the training data generation module 510.”) (“[0119] Referring back to FIG. 5, during the training phase, the training data generation module 510 can generate synthetic audio streams and corresponding ground truth phoneme timing information for a large corpus of text data and voice data. FIG. 6 shows an example implementation of the training data generation module 510. The training data generation module 510 shown in FIG. 6 can include a text input module 610, a TTS module 620 and a phoneme module 630. In some examples, the training data generation module 510 can operate concurrently with the animated speech refinements system 230 to generate samples of training data to train the machine learning model module 540 on the fly.”) by Adam et al. US 20230326445 A1
determining, based on the speech-based discriminative task, a distribution configuration; Ljolje teaches (“(15) Spectral comparison module 198B determines spectral change ratios based on a comparison of warp distributions. Spectral comparison module 198B is configured to obtain, determine, or calculate warp distributions associated with each class of spectral representations used for grouping of voice samples 110 by spectral classification module 198A. As used herein a “warp value” may refer to a value indicating the spectral difference between a particular voice sample and a normalized voice sample. For example, a warp value may be the spectral difference or “warp” between a voice of a particular person having a particular vocal tract length and the voice of an average or median vocal tract length across a set of samples. As one example, a set of male, female, and child speakers may speak a transcript of words and phrases, and the normalized voice sample for the set of samples may be a hypothetical voice sample of an androgenous, average-aged speaker speaking the transcript. In one embodiment, spectral comparison 198B may obtain warp distributions for each class of spectral representations by determining a warp value for each voice sample 110 and plotting the warp values of each class as a gaussian distribution. The “peak warp value” may refer to the most frequently occurring warp value for a given class of spectral representations (e.g., the highest point on a histogram of the warp values for the class or the center of a gaussian distribution of the warp values for the class). For example, a peak warp value for a class of male speaker types may be centered at 1.1, with warp values within one standard deviation of the peak warp value sitting between 1.06 and 1.14. In one embodiment, the spectral comparison module 198B may determine the peak warp values associated with each class of spectral representation by applying the voice samples of each class into a trained acoustic model. For example, an acoustic model may be trained to receive a group of voice samples and estimate a peak warp value for the group of samples. In one embodiment, the trained acoustic model may be a vocal tract length normalization (VTLN) acoustic model, such as described, referenced, and incorporated in: “Low Latency Real-Time Vocal Tract Length Normalization”, by Andrej Ljolje, Vincent Goffin and Murat Saraclar, Proceedings: Text, Speech and Dialogue, 7th International Conference, TSD 2004, Brno, Czech Republic, September 2004. In embodiments, the peak warp value associated with a particular class of spectral representations may be identified as the target for transforming other voice samples into the particular class. For example, the peak warp value for male voice samples can be the target for augmenting female and child voice samples. As such, the difference in peak warp values for each class of spectral representations may be used to determine spectral differences (i.e., spectral change ratios) between speaker types (e.g., between male, female, and child voices). In one example, when plotting or determining a warp distribution for male voices, the peak warp value may be 1.1, while plotting for female voice the peak warp value may be 0.9. 
This may indicate about a 20% spectral difference between male (1.1) and female (0.9) voices, and thus a spectral change ratio of 20% compression for the spectral representations of male voice samples and 20% expansion for the spectral representations of female voice samples.” Entire col. 6) by Ljolje et al. US 12340793 B1
Ljolje teaches:
generating, based on the distribution configuration, a first subset of the balanced data set, wherein the first subset of the balanced data set comprises one or more augmented synthetic speech recordings of the subset of the augmented synthetic speech data set; Ljolje teaches (“(28) The system determines 350 determines transformations based on the spectral change ratios determined at step 340. For example, to generate additional variations of the voice samples, spectral change ratios may be applied to the voice samples in each class of spectral representations in a manner that shifts its warp distribution towards the center of the warp distribution associated with another class. As one example, variations may include shifting/compressing the male voice samples by −20% to generate corresponding female voice samples, shifting/expanding the female voice samples by +20% to generate corresponding male voice samples, shifting the child voice samples to generate adult male and adult female versions of the child voice samples, and so on with each of the various groups of voice samples. The transformations may further comprise changes in tempo and random spectral changes to a voice sample.”) (“(29) The system applies 360 the transformations to the voice sample to generate augmented voice samples. For example, the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.). In embodiments, the system may be configured to apply the transformations to some of the voice samples, such as by selecting voice samples at random, by selecting a transformation to apply at random, or some combination thereof.”) by Ljolje et al. US 12340793 B1
generating, based on the distribution configuration, a second subset of the balanced data set, Ljolje teaches the original voice samples (i.e., the second subset) (“(30) The system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples (e.g., average male version, average female version, average child version, higher-than average male version, lower-than average male version, higher-than average female version, lower-than average female version, etc.). The training dataset can then be then used to train a speech recognition model”) by Ljolje et al. US 12340793 B1.
wherein the second subset of the balanced data set comprises one or more natural speech recordings of the subset of the plurality of natural speech recordings; Ljolje teaches the original voice samples (i.e., the second subset) (“(30) The system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples (e.g., average male version, average female version, average child version, higher-than average male version, lower-than average male version, higher-than average female version, lower-than average female version, etc.). The training dataset can then be then used to train a speech recognition model”) by Ljolje et al. US 12340793 B1
and combining the first subset of the balanced data set and the second subset of the balanced data set to generate the balanced data set. Ljolje teaches the original voice samples (i.e., the second subset) (“(30) The system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples (e.g., average male version, average female version, average child version, higher-than average male version, lower-than average male version, higher-than average female version, lower-than average female version, etc.). The training dataset can then be then used to train a speech recognition model”) by Ljolje et al. US 12340793 B1
Ljolje is considered to be analogous to the claimed invention because it relates generally to the fields of acoustic modeling and speech recognition, and more specifically, to augmenting human speech data for training a recognition model.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam and Kobayashi to incorporate the teachings of Ljolje in order to include a subset of the augmented synthetic speech data set in view of language science resources.
One of ordinary skill in the art would have been motivated to do so because the accuracy of a recognition model is improved. (“(37) … Thus, the accuracy of a recognition model is improved when trained on a dataset enlarged with the augmented voice samples that are generated as described. …”) col. 10, lines 64-67, by Ljolje et al. US 12340793 B1
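For illustration only, the following minimal Python sketch compiles a combined training set from an augmented subset and an original (natural) subset according to a simple distribution configuration, in the general manner of compiling a training dataset described in the Ljolje passages quoted above. The configuration keys and recording identifiers are hypothetical and are not taken from any cited reference.

import random

def compile_balanced_dataset(augmented_synthetic, natural, config, seed=0):
    # Draw a first subset from augmented synthetic recordings and a second
    # subset from natural recordings according to a distribution configuration,
    # then combine them into a single training set (hypothetical illustration;
    # the keys "total_size" and "synthetic_fraction" are this sketch's own).
    rng = random.Random(seed)
    n_synth = int(config["total_size"] * config["synthetic_fraction"])
    n_natural = config["total_size"] - n_synth
    first_subset = rng.sample(augmented_synthetic, min(n_synth, len(augmented_synthetic)))
    second_subset = rng.sample(natural, min(n_natural, len(natural)))
    balanced = first_subset + second_subset
    rng.shuffle(balanced)
    return balanced

# Example with hypothetical recording identifiers.
synthetic_ids = [f"synthetic_{i:03d}.wav" for i in range(200)]
natural_ids = [f"natural_{i:03d}.wav" for i in range(200)]
balanced_set = compile_balanced_dataset(
    synthetic_ids, natural_ids, {"total_size": 100, "synthetic_fraction": 0.5})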
Claims 17 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Adam et al. US 20230326445 A1 in view of Ljolje et al. US 12340793 B1, further in view of Finkelstein et al. US 20230018384 A1, and further in view of Gabrys et al. US 20230260502 A1.
Regarding Claim 17, Adam teaches:
17. A non-transitory computer-readable storage medium comprising instructions that, when executed by a processing device, cause the processing device to perform operations comprising: Adam teaches (“[0171] “Computer-readable storage medium” refers to both machine-storage media and transmission media. Thus, the terms include both storage devices/media and carrier waves/modulated data signals. The terms “machine-readable medium,” “computer-readable medium” and “device-readable medium” mean the same thing and may be used interchangeably in this disclosure.”) (“[0174] “Non-transitory computer-readable storage medium” refers to a tangible medium that is capable of storing, encoding, or carrying the instructions for execution by a machine.”) by Adam et al. US 20230326445 A1
generating, using the subset of the plurality of natural speech recordings and the set of speech characteristics, a synthetic speech data set, Adam teaches (“[0119] Referring back to FIG. 5, during the training phase, the training data generation module 510 can generate synthetic audio streams and corresponding ground truth phoneme timing information for a large corpus of text data and voice data. FIG. 6 shows an example implementation of the training data generation module 510. The training data generation module 510 shown in FIG. 6 can include a text input module 610, a TTS module 620 and a phoneme module 630. In some examples, the training data generation module 510 can operate concurrently with the animated speech refinements system 230 to generate samples of training data to train the machine learning model module 540 on the fly.”) by Adam et al. US 20230326445 A1
for each synthetic speech recording of the synthetic speech data set, updating metadata of a respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match an expected speech recording for the respective synthetic speech; Adam teaches (“[0018] In some examples, the machine learning model is trained based on training data that includes synthesized (artificial) speech. The synthesized speech can be generated by a text-to-speech (TTS) system that receives a text file and outputs synthesized speech audio speaking words of the text file and ground truth phoneme locations of the spoken words. This audio can be processed by the ASR to generate a base alignment (timing) for the phoneme locations. The base timing can be processed by the machine learning adjustment model to generate a correction or the offset of the base alignment of the phoneme timing locations, by learning from the ground truth phoneme locations provided by the TTS to update one or more parameters of the machine learning model. By using the TTS to generate the training data, a large and robust collection of training data that includes synthesized speech and ground truth phoneme locations of the spoken words of the synthesized speech can be generated easily and efficiently by simply generating audio of a large corpus of text. In this way, speech does not need to be manually processed to accurately specify the phoneme locations as the TTS automatically generates the accurate phoneme locations of the synthesized speech.”) (“[0126] Specifically, during training of the machine learning model module 540, the training data is provided to the ASR module 530. The training data can include samples generated by the training data generation module 510, such as any combination of the audio stream and ground truth timing information for the phoneme associated with the audio stream and/or a text sample used to generate the audio stream in which a speaker speaks the words of the text sample together with the ground truth timing of the phonemes associated with the text sample. In some examples, the audio of the training data is fed into the ASR module 530 (or the audio and text sample is fed into a forced aligner which includes at least a portion of the ASR module 530). The output of the ASR module 530 or the forced aligner provides the base timing for the training data (e.g., the audio and the corresponding text sample). …”) by Adam et al. US 20230326445 A1
generating, based on the subset, a balanced data set to train Adam teaches (“[0098] The speech module 520 is configured to receive an audio stream that includes one or more words. The audio stream can be received by recording a user speaking the one or more words and generating an audio file. In some examples, the audio stream is received through a messaging system or chat system from another user. In some examples, the audio stream is downloaded from the Internet and received from one or more websites. In some examples, the audio stream is selected from a set of pre-recorded audio streams. In such cases, a user interface is presented to a user in which a plurality of audio stream listings are presented and identified by respective icons or options. In response to receiving a user selection of an icon or option, the corresponding audio stream of the plurality of audio streams is retrieved by the speech module 520. The speech module 520 provides the audio stream including the one or more words to the ASR module 530.”) (“[0099] In some examples, during training, the speech module 520 accesses a plurality of training data from the training data generation module 510. The training data can include exclusively synthesized speech and corresponding ground truth phoneme timing locations. In some examples, the training data includes a mix of synthesized speech and corresponding ground truth phoneme timing locations, and real-world speech files and manually specified ground truth phoneme timing locations. During training, the training data is provided to the ASR module 530 and to the machine learning model module 540 to train the machine learning model to establish a relationship between a plurality of training base timings of a plurality of training phonemes and corresponding ground truth timing of the plurality of training phonemes generated by the speech module 520. In some examples, the speech module 520 randomly or pseudo-randomly selects a given training set or training audio stream generated by the training data generation module 510.”) (“[0119] Referring back to FIG. 5, during the training phase, the training data generation module 510 can generate synthetic audio streams and corresponding ground truth phoneme timing information for a large corpus of text data and voice data. FIG. 6 shows an example implementation of the training data generation module 510. The training data generation module 510 shown in FIG. 6 can include a text input module 610, a TTS module 620 and a phoneme module 630. In some examples, the training data generation module 510 can operate concurrently with the animated speech refinements system 230 to generate samples of training data to train the machine learning model module 540 on the fly.”) by Adam et al. US 20230326445 A1
Adam does not explicitly teach applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set; and selecting, based on language science resources, a subset of the synthetic speech data set.
Ljolje teaches:
applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set; Ljolje teaches the system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples. (“(29) The system applies 360 the transformations to the voice sample to generate augmented voice samples. For example, the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.). In embodiments, the system may be configured to apply the transformations to some of the voice samples, such as by selecting voice samples at random, by selecting a transformation to apply at random, or some combination thereof.”) (“(30) The system 370 compiles a training dataset using the set of augmented voice samples. For example, the training dataset may include the original voice samples in addition to the various augmented versions of the voice samples (e.g., average male version, average female version, average child version, higher-than average male version, lower-than average male version, higher-than average female version, lower-than average female version, etc.). The training dataset can then be then used to train a speech recognition model.”) by Ljolje et al. US 12340793 B1
Ljolje further teaches:
selecting, based on language science resources, a subset of the synthetic speech data set; and Ljolje teaches that voice samples are split into groups (e.g., high/low, female/male) (i.e., a subset), such as by referencing gaussian distributions of spectra for female and male speech (or their parametrized representations) or by using a VTLN-trained acoustic model. Ljolje also teaches that, taking a set of voice samples grouped into a male speaker class of representations, a transformation may be applied across the spectral representations for the set of voice samples in order to augment them into spectral representations that fit the distribution for a female speaker class. Referencing gaussian distributions is read as selecting in view of language science resources. (“(15) … In one embodiment, spectral comparison 198B may obtain warp distributions for each class of spectral representations by determining a warp value for each voice sample 110 and plotting the warp values of each class as a gaussian distribution. The “peak warp value” may refer to the most frequently occurring warp value for a given class of spectral representations (e.g., the highest point on a histogram of the warp values for the class or the center of a gaussian distribution of the warp values for the class). For example, a peak warp value for a class of male speaker types may be centered at 1.1, with warp values within one standard deviation of the peak warp value sitting between 1.06 and 1.14. In one embodiment, the spectral comparison module 198B may determine the peak warp values associated with each class of spectral representation by applying the voice samples of each class into a trained acoustic model. For example, an acoustic model may be trained to receive a group of voice samples and estimate a peak warp value for the group of samples. …” col. 5, lines 25-43) (“(17) … For example, taking a set of voice samples grouped into a male speaker class of representations, a transformation may be applied across the spectral representations for the set of voice samples in order to augment them into spectral representations that fit the distribution for a female speaker class. …” col. 6, lines 29-33) (“(22) In embodiments, voice samples are split into groups (e.g., high/low, female/male), such as by referencing gaussian distributions of spectra for female and male speech (or their parametrized representations) or by using a VTLN-trained acoustic model. …” col. 8, lines 4-8) (“(29) The system applies 360 the transformations to the voice sample to generate augmented voice samples. For example, the system may be configured to apply the corresponding transformations for each class of spectral representations and generate versions of each voice sample that fit each of the classes (e.g., taking a large male voice sample and creating child-like versions, female versions, smaller male versions etc.). In embodiments, the system may be configured to apply the transformations to some of the voice samples, such as by selecting voice samples at random, by selecting a transformation to apply at random, or some combination thereof.” col. 9, lines 33-44) by Ljolje et al. US 12340793 B1
Ljolje is considered to be analogous to the claimed invention because it relates generally to the fields of acoustic modeling and speech recognition, and more specifically, to augmenting human speech data for training a recognition model.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam to incorporate the teachings of Ljolje in order to include a subset of the augmented synthetic speech data set in view of language science resources.
One of ordinary skill in the art would have been motivated to do so because the accuracy of a recognition model is improved. (“(37) … Thus, the accuracy of a recognition model is improved when trained on a dataset enlarged with the augmented voice samples that are generated as described. …”) col. 10, lines 64-67, by Ljolje et al. US 12340793 B1
The combination does not explicitly teach identifying speaker cloning features.
Finkelstein teaches:
identifying, based on a set of speech characteristics, a subset of a plurality of natural speech recordings, wherein each natural speech recording of the subset represents a speaker to be cloned; Finkelstein teaches (“[0037] FIG. 2A shows an example of the trained voice cloning system 200, 200a of the system 100. The trained voice cloning system 200a receives a training audio signal 102 corresponding to a reference utterance spoken by the targets speaker in a first accent/dialect and a corresponding transcription 106 of the reference utterance, and generates a training synthesized speech representation 202 that clones the voice of the target speaker in a second accent/dialect different than the first accent/dialect. …”) (“[0047] The untrained TTS system 300 includes a TTS model 400 and a synthesizer 150. The TTS model 400 includes an encoder portion 400a and a decoder portion 400b. The TTS model 400 may additionally include a variation layer. The encoder portion 400a is trained to learn how to encode the training synthesized speech representation 202 into a corresponding utterance embedding 204 that represents a prosody and/or the second accent/dialect captured by the training synthesized speech representation 202. During training, the decoder portion 400b is conditioned on the transcript 106 and the conditioning inputs (e.g., speaker embedding/identifiers 108 and accent/dialect identifier) …”) by Finkelstein et al. US 20230018384 A1
wherein each synthetic speech recording of the synthetic speech data set is generated for a specified text in a voice that resembles a voice of a corresponding speaker of the subset of the plurality of natural speech recordings; Finkelstein teaches (FIGS. 1-6) (“[0046] FIG. 3 illustrates an example training process 301 for training the TTS system 300 on training synthesized speech representations 202 generated by the trained voice cloning system 200. The trained voice cloning system 200 obtains the training data 10 including training audio signals 102 and corresponding transcripts 106. Each training signal 102 may be associated with the conditioning inputs that include the speaker embedding/identifiers 108 and the accent/dialect identifier 109. Here, the training audio signals 102 of the training data 10 represent human speech in a first accent/dialect (e.g., American English). Based on the training audio signal 102 (and optionally the corresponding transcript), the trained voice cloning system 200 is configured to generate a training synthesized speech representation 202 including the voice of the target speaker in a second accent/dialect different than the first accent/dialect. The training synthesized speech representation 202 may include an audio waveform or a sequence of mel-frequency spectrograms. The trained voice cloning system 200 provides the training synthesized speech representation 202 for training the untrained TTS model 300.”) by Finkelstein et al. US 20230018384 A1
Finkelstein is considered to be analogous to the claimed invention because it relates to two-level text-to-speech systems using synthetic training data.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam and Ljolje to incorporate the teachings of Finkelstein in order to include the voice cloning feature.
One of ordinary skill in the art would have been motivated to do so because adding the residual predicted by the convolutional post-net to the predicted mel-frequency spectrogram generated by the linear projection 520 improves the overall reconstruction. (“[0067] The convolutional post-net 540 with one or more convolutional layers processes the predicted mel-frequency spectrogram 502P for the time step to predict a residual 542 to add to the predicted mel-frequency spectrogram 502P at adder 550. This improves the overall reconstruction. Each convolutional layer except for the final convolutional layer may be followed by batch normalization and hyperbolic tangent (TanH) activations. The convolutional layers are regularized using dropout with a probability of, for example, 0.5. The residual 542 is added to the predicted mel-frequency spectrogram 502P generated by the linear projection 520, and the sum (i.e., the mel-frequency spectrogram 502) may be provided to the speech synthesizer 150. In some implementations, in parallel to the decoder portion 500 predicting mel-frequency spectrograms 502 for each time step, a concatenation of the output of the LSTM subnetwork 520, [the utterance embedding], and the portion of the training data 10 (e.g., a character embedding generated by a text encoder (not shown)) is projected to a scalar and passed through a sigmoid activation to predict the probability that the output sequence of mel frequency spectrograms 502 has completed. The output sequence mel-frequency spectrograms 502 corresponds to the training synthesized speech representation 202 for the training data 10 and includes the intended prosody and intended accent of the target speaker.”) by Finkelstein et al. US 20230018384 A1
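For illustration only, the following minimal PyTorch-style Python sketch shows a convolutional post-net predicting a residual that is added back to a predicted mel-frequency spectrogram, in the general manner described in the Finkelstein passage quoted above. The layer sizes and names are hypothetical and are not Finkelstein's actual implementation.

import torch
import torch.nn as nn

class PostNet(nn.Module):
    # Convolutional post-net that predicts a residual to be added back to the
    # predicted mel-frequency spectrogram (hypothetical layer sizes).
    def __init__(self, n_mels=80, channels=512, kernel_size=5, n_layers=5, dropout=0.5):
        super().__init__()
        blocks = []
        in_ch = n_mels
        for i in range(n_layers):
            out_ch = n_mels if i == n_layers - 1 else channels
            layers = [nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
                      nn.BatchNorm1d(out_ch)]
            if i < n_layers - 1:
                layers.append(nn.Tanh())  # tanh on all but the final convolutional layer
            layers.append(nn.Dropout(dropout))
            blocks.append(nn.Sequential(*layers))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, mel):  # mel: (batch, n_mels, frames)
        residual = mel
        for block in self.blocks:
            residual = block(residual)
        return mel + residual  # residual added to the predicted spectrogram

predicted_mel = torch.randn(2, 80, 120)  # hypothetical predicted mel-spectrogram
refined_mel = PostNet()(predicted_mel)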
The combination does not explicitly teach: for each synthetic speech recording of the synthetic speech data set, identifying an expected speech characteristic for a respective synthetic speech recording; generating, based on language science resources, an expected speech recording for the respective synthetic speech recording; comparing the respective synthetic speech recording to the expected speech recording; updating metadata of the respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match the expected speech recording; and applying one or more augmentation techniques to each synthetic speech recording of the synthetic speech data set.
Gabrys teaches:
for each synthetic speech recording of the synthetic speech data set, identifying an expected speech characteristic for a respective synthetic speech recording; Gabrys teaches (“[0023] Second, the trained single-speaker TTS component may be used to generate a synthetic parallel dataset for a multi-speaker corpus. The multi-speaker corpus may include samples recorded speech from multiple speakers along with transcripts corresponding to the recorded speech. The multi-speaker corpus may include, for example, example speech of multiple or many speakers, with samples covering many or all phonemes for a particular language. The corpus may include transcript(s) of what the speakers are saying. The single-speaker TTS component may process the transcript to generate synthesized speech that may be used as an input for training the voice-modifying model. The recorded speech may serve as target speech for the training (e.g., the target speech may be used to evaluate the output of the voice-modifying model during training). The synthesized speech and target speech (as well as voice characteristics determined from the recorded speech) form the synthetic parallel dataset used to pre-train the voice-modifying model.”) by Gabrys et al. US 20230260502 A1
generating, based on language science resources, an expected speech recording for the respective synthetic speech recording; Gabrys teaches (“[0035] … The parallel dataset 130/140 may include the synthesized spectrogram data 182 (e.g., as generated by the TTS component 180 based on a transcript corresponding to the target speech), a target spectrogram 164 (representing a recording of the target speech), speaker embedding data (e.g., representing identifiable characteristics of the target speech), and/or frequency data 168 (e.g., representing pitch information corresponding to the target speech). …”) by Gabrys et al. US 20230260502 A1
comparing the respective synthetic speech recording to the expected speech recording; Gabrys teaches (“[0025] … The voice-modifying model may process synthesized speech and target voice characteristics to generate voice-modified speech. The voice-modified speech may be compared to the corresponding examples of the target voice. …”) by Gabrys et al. US 20230260502 A1
updating metadata of the respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match the expected speech recording; Gabrys teaches (“[0050] The method 500 may include adapting the fundamental frequency (“f.sub.0”) of the synthesized speech to that of the target voice (stage 530). The synthesized speech may have a fundamental frequency (e.g., pitch and/or timbre), which may be constant or have some contour (e.g., an upward and/or downward contour). The speech feature extractor component 160 may determine a mean and/or variance of the fundamental frequency of the synthesized speech. The speech feature extractor component 160 may compare the fundamental frequency mean and/or variance of the synthesized speech to that of the target voice. The synthesized speech may thus be modified such that fundamental frequency mean and/or variance matches or approximates the target voice.”) (“[0051] The method 500 may include using the voice modifier component 190 to modify the synthesized spectrogram data according to the target speaker embedding (stage 540). The voice modifier component 190 may receive the synthesized speech (e.g., the predicted spectrogram) and the voice characteristic data of the target voice (e.g., the speaker embedding data). The voice modifier component 190 may process the input to generate voice-modified synthesized speech having voice characteristics similar to the target voice.”) by Gabrys et al. US 20230260502 A1
Gabrys is considered to be analogous to the claimed invention because it relates to a text-to-speech (TTS) system that may be configured to imitate characteristics of a target voice based on a limited dataset.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Ljolje, and Finkelstein to further incorporate the teachings of Gabrys in order to include parametric synthesis as an augmentation technique.
One of ordinary skill in the art would have been motivated to do so because the speech synthesis engine 718 may be configured to convert the input text into high-quality natural-sounding speech in an efficient manner. (“[0085] … The speech synthesis engine 718 may be configured to convert the input text into high-quality natural-sounding speech in an efficient manner. …”) by Gabrys et al. US 20230260502 A1
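For illustration only, the following minimal Python sketch adapts the fundamental frequency (f0) of synthesized speech toward a target voice's mean and variance, of the general kind described in Gabrys paragraph [0050] quoted above. All names and values are hypothetical and are not drawn from Gabrys's actual implementation.

import numpy as np

def adapt_f0(synth_f0, target_mean, target_std):
    # Shift and scale the voiced portion of a synthesized-speech fundamental
    # frequency (f0) contour so that its mean and standard deviation match a
    # target voice; 0 Hz frames are treated as unvoiced (hypothetical sketch).
    f0 = np.asarray(synth_f0, dtype=float)
    voiced = f0 > 0
    mean, std = f0[voiced].mean(), f0[voiced].std()
    adapted = f0.copy()
    adapted[voiced] = (f0[voiced] - mean) / (std + 1e-8) * target_std + target_mean
    return adapted

# Hypothetical contour: synthesized speech around 200 Hz adapted toward a
# target voice around 120 Hz with less variation.
rng = np.random.default_rng(0)
synth_contour = np.concatenate([np.zeros(10), 200.0 + 15.0 * rng.standard_normal(90)])
adapted_contour = adapt_f0(synth_contour, target_mean=120.0, target_std=10.0)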
Regarding Claim 19, the combination teaches the non-transitory computer-readable storage medium of claim 17 as identified above.
The combination does not explicitly teach wherein updating metadata of the respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match the expected speech recording for the respective synthetic speech.
Gabrys further teaches:
19. The non-transitory computer-readable storage medium of claim 17, wherein updating metadata of the respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match the expected speech recording for the respective synthetic speech comprises: Gabrys teaches (“[0050] The method 500 may include adapting the fundamental frequency (“f.sub.0”) of the synthesized speech to that of the target voice (stage 530). The synthesized speech may have a fundamental frequency (e.g., pitch and/or timbre), which may be constant or have some contour (e.g., an upward and/or downward contour). The speech feature extractor component 160 may determine a mean and/or variance of the fundamental frequency of the synthesized speech. The speech feature extractor component 160 may compare the fundamental frequency mean and/or variance of the synthesized speech to that of the target voice. The synthesized speech may thus be modified such that fundamental frequency mean and/or variance matches or approximates the target voice.”) (“[0051] The method 500 may include using the voice modifier component 190 to modify the synthesized spectrogram data according to the target speaker embedding (stage 540). The voice modifier component 190 may receive the synthesized speech (e.g., the predicted spectrogram) and the voice characteristic data of the target voice (e.g., the speaker embedding data). The voice modifier component 190 may process the input to generate voice-modified synthesized speech having voice characteristics similar to the target voice.”) by Gabrys et al. US 20230260502 A1
identifying an expected speech characteristic for the respective synthetic speech recording; Gabrys teaches (“[0023] Second, the trained single-speaker TTS component may be used to generate a synthetic parallel dataset for a multi-speaker corpus. The multi-speaker corpus may include samples recorded speech from multiple speakers along with transcripts corresponding to the recorded speech. The multi-speaker corpus may include, for example, example speech of multiple or many speakers, with samples covering many or all phonemes for a particular language. The corpus may include transcript(s) of what the speakers are saying. The single-speaker TTS component may process the transcript to generate synthesized speech that may be used as an input for training the voice-modifying model. The recorded speech may serve as target speech for the training (e.g., the target speech may be used to evaluate the output of the voice-modifying model during training). The synthesized speech and target speech (as well as voice characteristics determined from the recorded speech) form the synthetic parallel dataset used to pre-train the voice-modifying model.”) by Gabrys et al. US 20230260502 A1
generating, based on language science resources, an expected speech recording for the respective synthetic speech recording; Gabrys teaches (“[0035] … The parallel dataset 130/140 may include the synthesized spectrogram data 182 (e.g., as generated by the TTS component 180 based on a transcript corresponding to the target speech), a target spectrogram 164 (representing a recording of the target speech), speaker embedding data (e.g., representing identifiable characteristics of the target speech), and/or frequency data 168 (e.g., representing pitch information corresponding to the target speech). …”) by Gabrys et al. US 20230260502 A1
comparing the respective synthetic speech recording to the expected speech recording; and Gabrys teaches (“[0025] … The voice-modifying model may process synthesized speech and target voice characteristics to generate voice-modified speech. The voice-modified speech may be compared to the corresponding examples of the target voice. …”) by Gabrys et al. US 20230260502 A1
determining, based on the comparison, the information used to update the metadata of the respective synthetic speech recording. (“[0025] Finally, the pre-trained voice-modifying model may be fine-tuned using examples of the target voice. Fine-tuning the voice-modifying model may configure it to receive synthesized speech from the single-speaker TTS component and generate voice-modified speech that sounds like the target voice. For the fine-tuning, the single-speaker TTS component may process a transcript of the examples of the target voice to generate synthesized speech. The voice-modifying model may process synthesized speech and target voice characteristics to generate voice-modified speech. The voice-modified speech may be compared to the corresponding examples of the target voice. The voice-modifying model may be adjusted based on a difference between the voice-modified speech and the target voice. As a result of the fine-tuning, the voice-modifying model may be configured to receive synthesized audio from the single-speaker TTS component and output voice-modified audio with voice characteristics approximating the target voice characteristics; that is, to output synthesized speech that sounds as though it was spoken by the same speaker as target voice.”) (“[0036] … The frequency data 168 may help the model better absorb prosodic differences between synthesized speech and the target speaker. Thus, training of the voice modifier component 190 may be focused on modifying speaker-defining information rather than adjusting prosody information between source and target speakers.”) by Gabrys et al. US 20230260502 A1
Gabrys is considered to be analogous to the claimed invention because it relates to a text-to-speech (TTS) system that may be configured to imitate characteristics of a target voice based on a limited dataset.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Ljolje, and Finkelstein to incorporate the teachings of Gabrys in order to include updating metadata of the respective synthetic speech recording to include information used to augment the respective synthetic speech recording to match the expected speech recording.
One of ordinary skill in the art would have been motivated to do so because the speech synthesis engine 718 may be configured to convert the input text into high-quality natural-sounding speech in an efficient manner. (“[0085] … The speech synthesis engine 718 may be configured to convert the input text into high-quality natural-sounding speech in an efficient manner. …”) by Gabrys et al. US 20230260502 A1
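For illustration only, the following minimal Python sketch records, in a synthetic recording's metadata, the adjustment that would bring it closer to an expected recording, using f0 statistics as a stand-in for the compared speech characteristics. The field names are hypothetical and are not drawn from Gabrys or from the claims' actual implementation.

import numpy as np

def update_augmentation_metadata(recording, expected_f0_mean, expected_f0_std):
    # Compare a synthetic recording's f0 statistics with those expected for it
    # and record in its metadata the shift and scale that would be needed to
    # match the expected recording (hypothetical field names; a real system
    # might compare many more speech characteristics).
    f0 = np.asarray(recording["f0"], dtype=float)
    voiced = f0 > 0
    mean, std = f0[voiced].mean(), f0[voiced].std()
    recording.setdefault("metadata", {})["augmentation"] = {
        "f0_shift_hz": float(expected_f0_mean - mean),      # additive correction
        "f0_scale": float(expected_f0_std / (std + 1e-8)),  # variance correction
    }
    return recording

rng = np.random.default_rng(0)
rec = {"f0": np.concatenate([np.zeros(5), 200.0 + 10.0 * rng.standard_normal(50)])}
rec = update_augmentation_metadata(rec, expected_f0_mean=180.0, expected_f0_std=12.0)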
Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Adam, Ljolje, Finkelstein, and Gabrys in view of Aher et al. US 20210319780 A1.
Regarding Claim 18, the combination teaches the non-transitory computer-readable storage medium of claim 17 as identified above. The combination does not explicitly teach selecting, based on the set of speech characteristics, a subset of the plurality of natural speech recordings.
Aher teaches:
18. The non-transitory computer-readable storage medium of claim 17, wherein the processing device is to perform operations further comprising:
selecting, based on the set of speech characteristics, a subset of the plurality of natural speech recordings; Aher teaches (“[0081] In an embodiment, at step 608, the voice application determines predicted prosodic characteristics of the response using a model, and modifies the predicted prosodic characteristics to generate the prosodic characteristics of the synthesized speech response. The model may, for example, include the results of a training model (e.g., training model 370 of FIG. 3), which may include correlations, probabilities, confidences, and other values indicative of the model output. In an embodiment, the voice application may select from among a plurality of versions of a word, each having a particular prosodic character, for the version that most matches the desired or predicted prosodic character. For example, the voice application may access a database that stores a plurality of audio files, each corresponding to a word, phrase, or grouping thereof, and may select the audio file having associated prosodic metrics that are most similar to the predicted metrics.”) by Aher et al. US 20210319780 A1
configuring a speech generation engine in view of the set of speech characteristics; and Aher teaches (“[0034] Prosodic engine 223 is configured to determine one or more prosodic metrics associated with a word, group of words, or a voice input. Prosodic engine 223 may include, for example, temporal and spectral analyzers for extracting information about an audio file. In an embodiment, prosodic engine 223 is configured to determine pitch values, note values, rate values, timber values, volume values, emotional metric values (e.g., based on prosodic metrics), any other suitable data, or any combination thereof. Prosodic engine 223 may, for example, apply one or more operations provided by an algorithm to extract metrics of the voice input.”) by Aher et al. US 20210319780 A1
generating, using the speech generation engine, the synthetic speech data set based on the subset of the plurality of natural speech recordings. (“[0036] Speech generator 225 is configured to synthesize and output the synthesized speech response to the voice input. In an embodiment, speech generator 225 includes a text-to-speech engine configured to identify a text string to be synthesized as a synthesized speech response. For example, speech generator 225 may generate audio output at a speaker or other audio device based on the text string and audio settings. For example, speech generator 225 may use one or more settings including prosodic metrics corresponding to each word or a group of words to specify voice details (e.g., male/female voice, accent, rate, emphasis, or other details), playback speed, or any other suitable settings that may affect the generated audio output.”) by Aher et al. US 20210319780 A1
Aher is considered to be analogous to the claimed invention because it relates to systems for managing responses to voice inputs and, more particularly, to systems for generating more natural speech responses to voice inputs based on prosody.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Ljolje, Finkelstein, and Gabrys to further incorporate the teachings of Aher in order to include receiving and modifying synthetic speech data based on speech characteristics.
One of ordinary skill in the art would have been motivated to do so because natural language understanding models are applied to the voice input to determine a more correct and accurate answer for the voice input. (“[0052] … In an embodiment, natural language understanding models are applied to the voice input to determine a more correct and accurate answer for the voice input. In an embodiment, module 350 determines prosodic character of a response. For example, an audio acoustic model for the answer may be provide by a text-to-speech module. In an embodiment, question and answer audio signals are submitted to training model 370, wherein the model is used to predict the right set of audio features to be applied for each phrase and word. In an embodiment, to improve naturalness, the predicted features are post-processed using interpolation to manage the prosodic character and prosodic transitions thereof (e.g., transitions between words of the generated response). …”) by Aher et al. US 20210319780 A1
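For illustration only, the following minimal Python sketch selects the stored audio version whose prosodic metrics best match predicted metrics, of the general kind described in Aher paragraph [0081] quoted above. The metric vector and file names are hypothetical and are not drawn from Aher's actual implementation.

import numpy as np

def select_closest_version(predicted_metrics, candidate_versions):
    # Select, from stored audio versions of a word, the one whose prosodic
    # metrics (e.g., pitch, rate, volume) are most similar to the predicted
    # metrics, using Euclidean distance (hypothetical names and metrics).
    predicted = np.asarray(predicted_metrics, dtype=float)
    best_id, best_distance = None, float("inf")
    for version_id, metrics in candidate_versions.items():
        distance = float(np.linalg.norm(predicted - np.asarray(metrics, dtype=float)))
        if distance < best_distance:
            best_id, best_distance = version_id, distance
    return best_id

# Hypothetical prosodic metric vectors: [pitch_hz, rate_words_per_sec, volume_db]
versions = {"hello_calm.wav": [110.0, 2.5, -20.0], "hello_excited.wav": [180.0, 3.5, -12.0]}
chosen = select_closest_version([170.0, 3.2, -14.0], versions)  # "hello_excited.wav"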
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Adam, Ljolje, Finkelstein, and Gabrys in view of RAGHAVENDRA E VEERA et al. AU 2019202146 A1.
Regarding Claim 20, the combination teaches the non-transitory computer-readable storage medium of claim 17 as identified above.
Adam further teaches:
20. The non-transitory computer-readable storage medium of claim 17, wherein selecting, based on language science resources, the subset of the synthetic speech data set comprises: Adam teaches (“[0171] “Computer-readable storage medium” refers to both machine-storage media and transmission media. Thus, the terms include both storage devices/media and carrier waves/modulated data signals. The terms “machine-readable medium,” “computer-readable medium” and “device-readable medium” mean the same thing and may be used interchangeably in this disclosure.”) (“[0174] “Non-transitory computer-readable storage medium” refers to a tangible medium that is capable of storing, encoding, or carrying the instructions for execution by a machine.”) by Adam et al. US 20230326445 A1
identifying, for each synthetic speech recording of the synthetic speech data set, phonemes associated with a respective synthetic speech recording; Adam teaches (“[0121] The phoneme module 630 can receive the text (e.g., a sample of text or transcription of text) from the text input module 610 and can extract phonemes from the sample of text. The phoneme module 630 can provide the phonemes extracted from the sample of text together with or separate from the sample of text to the TTS module 620. The phoneme module 630 also provide an identifier of the randomly selected voice or speaker for each sentence in the transcription.”) (“[0053] In some examples, the animated speech refinement system 230 trains the machine learning model by generating training data that includes multiple sets of synthesized audio stream or synthesized voices and their corresponding ground truth phoneme timing locations. The synthesized audio stream or synthesized voices can be generated by a text-to-speech system that can receive a large corpus of text files and can generate speech spoken by various voices using different embeddings. In some cases, the text-to-speech system can generate the synthesized speech by applying a TTS (or other neural network) to a text file and an embedding to generate an audio stream in which a speaker (associated with the embedding) speaks the words of the text file with an emotion or level of emotions provided by an emotion classification system or device. …”) by Adam et al. US 20230326445 A1
determining phonemes associated with text of the respective synthetic speech recording; Adam teaches (“[0053] In some examples, the animated speech refinement system 230 trains the machine learning model by generating training data that includes multiple sets of synthesized audio stream or synthesized voices and their corresponding ground truth phoneme timing locations. The synthesized audio stream or synthesized voices can be generated by a text-to-speech system that can receive a large corpus of text files and can generate speech spoken by various voices using different embeddings. In some cases, the text-to-speech system can generate the synthesized speech by applying a TTS (or other neural network) to a text file and an embedding to generate an audio stream in which a speaker (associated with the embedding) speaks the words of the text file with an emotion or level of emotions provided by an emotion classification system or device. In some examples, the text is normalized to generate a Mel spectrogram for the words of the text file, such as by mapping embedding vectors and translating the Mel spectrogram into an audio stream, such as using vocoder (e.g., a neural network). The audio stream can then be associated with phonemes timing details, including start and end of each phoneme and used as part of the training data to be processed by the ASR engine and to train the machine learning model to predict or estimate timing offsets to the timing provided by the ASR engine. In some examples, the training data audio streams include words of various text files spoken by any specified speaker with any specified emotion, such as neutral, joy, sad, anger, sleepy, disgust, surprise, fear, or any combination thereof.”) by Adam et al. US 20230326445 A1
aligning the phonemes associated with the respective synthetic speech recording with the phonemes associated with the text of the respective synthetic speech recording; and Adam teaches (“[0016] The disclosed techniques improve the quality of the resulting visual and audio match by providing an automated system that predicts alignment offsets of phonemes corresponding to an audio file timing recognized by an ASR engine. The predicted alignment offset is used to adjust the timing of the phonemes generated by the ASR to generate refined phoneme timing.”) (“[0018] In some examples, the machine learning model is trained based on training data that includes synthesized (artificial) speech. The synthesized speech can be generated by a text-to-speech (TTS) system that receives a text file and outputs synthesized speech audio speaking words of the text file and ground truth phoneme locations of the spoken words. This audio can be processed by the ASR to generate a base alignment (timing) for the phoneme locations. The base timing can be processed by the machine learning adjustment model to generate a correction or the offset of the base alignment of the phoneme timing locations, by learning from the ground truth phoneme locations provided by the TTS to update one or more parameters of the machine learning model. By using the TTS to generate the training data, a large and robust collection of training data that includes synthesized speech and ground truth phoneme locations of the spoken words of the synthesized speech can be generated easily and efficiently by simply generating audio of a large corpus of text. In this way, speech does not need to be manually processed to accurately specify the phoneme locations as the TTS automatically generates the accurate phoneme locations of the synthesized speech.”) (“[0096] As discussed below, during the training phase, the machine learning model module 540 is trained to estimate offsets for each phoneme generated by the ASR module 530 for a given audio stream. …”) (“[0101] During training, the machine learning model module 540 implements an artificial neural network or other machine learning technique or network. The machine learning model module 540 is trained to receive an audio stream processed by the ASR module 530, the transcription and/or the list of timestamps (or play positions) of the audio stream and corresponding phoneme for each timestamp in the list of timestamps from the ASR module 530. The machine learning model module 540 is trained to predict or estimate an offset, alignment, modification, or refinement for the phoneme timing information generated by the ASR module 530. The machine learning model module 540 adjusts or provides offsets to the list of timestamps (or play positions) of the audio stream and corresponding phoneme for each timestamp in the list of timestamps based on the predicted or estimated offset, alignment, modification, or refinement for the phoneme timing information. For example, the machine learning model module 540 can be trained to predict a first negative or positive offset (e.g., 5 millisecond) offset for a first type of phoneme and can be trained to predict a second negative or positive offset (e.g., 3 millisecond) offset for a second type of phoneme. 
The list of timestamps can be updated to add the negative or positive offset to the phoneme specified in the list of timestamps based on the output of the machine learning model module 540.”) (“[0103] … The machine learning model module 540 predicts or estimates a plurality of offsets or refinement information or data for each phoneme in the base phoneme locations corresponding to the given training audio stream. During training, the ground truth phoneme locations are then retrieved and compared with the predicted or estimated plurality of offsets to generate a loss. The loss is then used to update one or more parameters of the machine learning model module 540 and another set of training data is received and processed in a similar manner until a stopping criterion is reached.”) by Adam et al. US 20230326445 A1
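For illustration only: the offset-refinement training described in Adam [0101] and [0103] can be summarized by a short sketch in which a small model predicts a per-phoneme timing offset, the prediction is compared with ground-truth offsets derived from the TTS phoneme locations, and the resulting loss updates the model parameters. The feature dimension, layer sizes, and tensor layout below are hypothetical assumptions, not Adam's implementation.

import torch
import torch.nn as nn

# Hypothetical sketch of phoneme-timing refinement: predict an offset for each
# phoneme in the ASR alignment, compare it with the ground-truth offset
# (TTS phoneme time minus ASR phoneme time), and update the model from the loss.
FEATURE_DIM = 16  # assumed per-phoneme feature size (phoneme id, ASR timing, ...)

model = nn.Sequential(nn.Linear(FEATURE_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

def training_step(phoneme_features, asr_times, ground_truth_times):
    """One update step; inputs are float tensors with one row per phoneme."""
    predicted_offsets = model(phoneme_features).squeeze(-1)  # (num_phonemes,)
    target_offsets = ground_truth_times - asr_times          # e.g., a few ms +/-
    loss = loss_fn(predicted_offsets, target_offsets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference, the refined timing would be: asr_times + model(features).squeeze(-1)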
The combination does not explicitly teach: responsive to failing to align the phonemes associated with the respective synthetic speech with the phonemes associated with the text of the respective synthetic speech recording, removing the respective synthetic speech recording from the synthetic speech data set.
RAGHAVENDRA E VEERA teaches:
responsive to failing to align the phonemes associated with the respective synthetic speech with the phonemes associated with the text of the respective synthetic speech recording, removing the respective synthetic speech recording from the synthetic speech data set to generate the subset.
RAGHAVENDRA E VEERA teaches (“The present invention relates to a method and system for outlier identification to remove poor alignments in speech synthesis. The identification of poor alignment is based on fundamental frequency methods and group delay-based outlier detection methods, wherein instances of phonemes in a sentence are identified as outliers based on the above fundamental frequency and group delay methods and if the sentence has more than a given number of outliers, discarding the sentence from speech model training.”) by RAGHAVENDRA E VEERA et al. AU 2019202146 A1
RAGHAVENDRA E VEERA is considered to be analogous to the claimed invention because it relates to text-to-speech systems.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Adam, Ljolje, Finkelstein, and Gabrys to incorporate the teachings of RAGHAVENDRA E VEERA in order to include a subset of the augmented synthetic speech data set in view of language science resources.
One could have been motivated to do so because removing poor alignments improves the synthesis quality of the text-to-speech system. (“[0002] A system and method are presented for outlier identification to remove poor alignments in speech synthesis. The quality of the output of a text-to-speech system directly depends on the accuracy of alignments of a speech utterance. The identification of mis-alignments and mis-pronunciations from automated alignments may be made based on fundamental frequency methods and group delay-based outlier methods. The identification of these outliers allows for their removal, which improves the synthesis quality of the text-to-speech system.”) by RAGHAVENDRA E VEERA et al. AU 2019202146 A1
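For illustration only: the outlier-based removal that RAGHAVENDRA E VEERA describes can be pictured with a short sketch that flags phoneme instances whose fundamental frequency deviates strongly from that phoneme's distribution and discards any recording with more than a given number of such outliers. The thresholds, data layout, and use of a simple z-score below are hypothetical; the reference also relies on group-delay features, which are omitted here.

# Hypothetical sketch of discarding poorly aligned recordings based on
# phoneme-level fundamental-frequency (F0) outliers.
MAX_OUTLIERS = 3   # assumed per-recording tolerance
Z_THRESHOLD = 3.0  # assumed deviation threshold

def count_f0_outliers(phoneme_f0, f0_stats):
    """phoneme_f0: (phoneme, mean F0) pairs for one recording.
    f0_stats: per-phoneme (mean, std) computed over the whole data set."""
    outliers = 0
    for phoneme, f0 in phoneme_f0:
        mean, std = f0_stats[phoneme]
        if std > 0 and abs(f0 - mean) / std > Z_THRESHOLD:
            outliers += 1
    return outliers

def filter_data_set(recordings, f0_stats):
    """Keep only recordings whose outlier count stays within the tolerance,
    generating the subset used for further training."""
    return [r for r in recordings
            if count_f0_outliers(r["phoneme_f0"], f0_stats) <= MAX_OUTLIERS]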
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FOUZIA HYE SOLAIMAN whose telephone number is (571)270-5656. The examiner can normally be reached M-F, 8 AM-5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Paras D. Shah can be reached at (571) 270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/F.H.S./Examiner, Art Unit 2653
/Paras D Shah/Supervisory Patent Examiner, Art Unit 2653
03/27/2026