Prosecution Insights
Last updated: April 19, 2026
Application No. 18/769,197

DETECTING SYNTHETIC SPEECH

Non-Final OA — §101, §103
Filed: Jul 10, 2024
Examiner: AZIZ, SHEZA ABDUL
Art Unit: 2657
Tech Center: 2600 — Communications
Assignee: SRI International
OA Round: 1 (Non-Final)
Grant Probability: Favorable
Expected OA Rounds: 1-2
Estimated Time to Grant: 2y 9m

Examiner Intelligence

Career Allow Rate: 0% (0 granted / 0 resolved; -62.0% vs TC avg)
Interview Lift: +0.0% (minimal lift; resolved cases with interview vs. without)
Avg Prosecution: 2y 9m (typical timeline)
Total Applications: 6 (career history across all art units; 6 currently pending)

Statute-Specific Performance

§101: 20.0% (-20.0% vs TC avg)
§103: 65.0% (+25.0% vs TC avg)
§102: 5.0% (-35.0% vs TC avg)
§112: 10.0% (-30.0% vs TC avg)

Black line = Tech Center average estimate • Based on career data from 0 resolved cases

Office Action

§101, §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Applicant claims the benefit of US Provisional Application No. 63/528,143, filed July 21, 2023. Claims 1-20 have been afforded the benefit of the July 21, 2023 filing date.

Information Disclosure Statement

The IDS filed 10/17/2024 and the IDS filed 1/5/2026 have been considered and placed in the application file.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 18-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because they recite a computer-readable storage media, which comprises digital signals per se. The applicant's specification does not provide a special definition for computer-readable storage media; thus, under its plain meaning, the term includes data signals per se as one potential form of the media. Data signals per se do not fall into one of the four statutory categories of invention. As such, they are non-statutory subject matter. In contrast, a claimed non-transitory computer-readable storage media excludes data signals from its scope and does fall into one of the four statutory categories of invention.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4, 7, 8, 9, 11, 14, 16, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Shor (US 11756572 B2) in view of Mont-Reynaud (US 10311858 B1).

Regarding claim 1, Shor discloses a method for detecting synthetic speech of a speaker in an audio clip, comprising: - [Column 4, Lines 26-32 "Implementations herein are directed toward detecting synthetic speech in audio data based on a self-supervised model that extracts audio features from the audio data and a shallow discriminator model that determines a probability that synthetic speech is present in the audio features and thus in the audio data"]; [Column 5, Lines 13-23 "In the example shown, the audio source 10 produces an utterance 119 that includes the speech "My name is Jane Smith." The audio feature extractor 220 receives audio data 120 characterizing the utterance 119 in the streaming audio 118 and generates, from the audio data 120, a plurality of audio feature vectors 212, 212a-n. Each audio feature vector 212 represents audio features (i.e., audio characteristics such as spectrograms (e.g., mel-frequency spectrograms and mel-frequency ceptstral coefficients (MFCCs)) of a chunk or portion of the audio data 120 (i.e., a portion of the streaming audio 118 or utterance 119)"].

Shor further discloses generating, using a deep learning model trained to distinguish between synthetic speech and authentic speech, one or more reference embeddings for the speaker, wherein the one or more reference embeddings characterize a first set of acoustic features; generating, using the deep learning model trained to distinguish between synthetic speech and authentic speech, a test embedding for an audio clip that characterizes a second set of acoustic features; and computing a score based on the test embedding and the one or more reference embeddings - [Column 5, Lines 31-41 "After generating the audio feature vectors 212, the audio feature extractor 210 sends the audio feature vectors 212 to a synthetic speech detector 220 that includes a shallow discriminator model 222. As discussed in more detail below, the shallow discriminator model 222 is a shallow neural network (i.e., with little to no hidden layers) that generates, based on each of the audio feature vector 212, a score 224 (FIG. 2) that indicates a presence of synthetic speech in the streaming audio 118 based on the corresponding audio features of each audio feature vector 212"]; and outputting, based on the score, an indication of whether the audio clip includes synthetic speech - [Column 5, Lines 41-48 "The synthetic speech detector 220 determines whether the score 224 (e.g., a probability score) satisfies a synthetic speech detection threshold. When the score 224 satisfies the synthetic speech detection threshold, the synthetic speech detector 220 determines that the speech (i.e., the utterance 119) in the streaming audio 118 captured by the user device 102 includes synthetic speech"].

However, Shor does not teach generating an embedding wherein the one or more reference embeddings characterize a first set of phonetic features. Mont-Reynaud does teach generating an embedding wherein the one or more reference embeddings characterize a first set of phonetic features - [Column 10, lines 33-39 "Phonetic features 337 such as phoneme sequences or lattices can be processed to derive extended features 255 such as phoneme length, or articulatory-phonetic features, such as the place and manner of articulation for consonants, vowel placement, and other articulatory features that may be used for determining value for the accent property in the user profile"]; [Column 8, lines 40-54 "The extracted feature set 257 comprises more than simple sequences of elements such as frames, states, phonemes, or words, but also may include relationships (such as alignments) or mappings between the elements at successive levels, sufficient to derive additional information. For example, when a speech transcription contains a certain word (in the textual features 347), the extracted feature set 257 also delimits the specific phoneme subsequence (from the phonetic features 337) that matches this word in the transcription; and each phoneme in turn can be mapped to the sequence of states or frames that it spans. Such a cross-referencing capability (between words and phonemes and acoustic features) is often useful for deriving extended features; the generation of extended features 250 is discussed later"]; [Column 10, lines 54-67, Column 11, lines 1-4 "FIG. 4 is block diagram that illustrates the flow of generating information for inclusion in the user profile, according to an implementation of the invention. In particular, FIG. 4 shows user profile generation module 265 integrating acoustic and phonetic features 425 with linguistic features 429 to generate the user profile 147. Acoustic and phonetic features 425 are a subset of the extended feature set 255 and are used by speech profile generation module 435 to determine speech profile characteristics 445. Linguistic features 429 are another subset of the extended feature set 255 and are used by language profile generation module 439 to determine language profile characteristics 449. These profile generation modules 435 and 439 may contain classifiers as described above to transform the feature vectors in the extended feature set 255 to profile characteristics. The speech profile characteristics 445 and language profile characteristics 449 are combined in the integrated user profile 467 by the profile integration module 457"]; [Column 4, lines 32-40 "Classifier: a function taking extracted and/or extended features as input and assigning a value to a user profile property. A classifier function is a software module that is trained as known in the fields of machine learning and pattern recognition. The training data includes known associations between features (usually called a feature vector) and corresponding user characteristics (usually called the ground truth). After training, a classifier can accept a feature vector as input and map it to the most probable property value for that input"].

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor with the teachings of Mont-Reynaud because generating embeddings with phonetic features as well as embeddings with acoustic features would significantly improve synthetic speech detection accuracy, robustness to voice variations, and adaptability to new, unseen synthetic voices, resulting in more accurate speech characterization and classification.
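For readers tracing the claim 1 limitations, a minimal sketch of the recited flow (reference embeddings, test embedding, score, thresholded indication) follows. The function names, cosine scoring rule, and threshold value are illustrative assumptions of ours, not the applicant's or Shor's actual implementation.

```python
# Hypothetical sketch of the claim 1 flow: score a test embedding against
# a speaker's reference embeddings and threshold the result.
import numpy as np

THRESHOLD = 0.5  # assumed detection threshold, not taken from the record

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_synthetic(test_emb: np.ndarray, reference_embs: list) -> bool:
    # Compare the test embedding against each authentic reference
    # embedding; low similarity to every reference suggests synthesis.
    score = max(cosine(test_emb, ref) for ref in reference_embs)
    return score < THRESHOLD  # True -> output "synthetic" indication
```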
Regarding claim 4, Shor discloses the method of claim 1, wherein outputting the indication comprises: based on the score satisfying a threshold, outputting an indication that the audio clip includes synthetic speech - [Column 5, lines 32-41 "After generating the audio feature vectors 212, the audio feature extractor 210 sends the audio feature vectors 212 to a synthetic speech detector 220 that includes a shallow discriminator model 222. As discussed in more detail below, the shallow discriminator model 222 is a shallow neural network (i.e., with little to no hidden layers) that generates, based on each of the audio feature vector 212, a score 224 (FIG. 2) that indicates a presence of synthetic speech in the streaming audio 118 based on the corresponding audio features of each audio feature vector 212"]; [Column 5, Lines 41-52 "The synthetic speech detector 220 determines whether the score 224 (e.g., a probability score) satisfies a synthetic speech detection threshold. When the score 224 satisfies the synthetic speech detection threshold, the synthetic speech detector 220 determines that the speech (i.e., the utterance 119) in the streaming audio 118 captured by the user device 102 includes synthetic speech. The synthetic speech detector 220 may determine that the utterance 119 includes synthetic speech even when a majority of the utterance 119 includes human-originated speech (i.e., a small portion of synthetic speech is interjected or interspersed with human-originated speech)"].

Regarding claim 7, Shor discloses the method of claim 1, wherein the first set of acoustic features and the second set of acoustic features correspond to features associated with characteristics of frequency components of audio signals - [Column 5, Lines 13-23 "In the example shown, the audio source 10 produces an utterance 119 that includes the speech "My name is Jane Smith." The audio feature extractor 220 receives audio data 120 characterizing the utterance 119 in the streaming audio 118 and generates, from the audio data 120, a plurality of audio feature vectors 212, 212a-n. Each audio feature vector 212 represents audio features (i.e., audio characteristics such as spectrograms (e.g., mel-frequency spectrograms and mel-frequency ceptstral coefficients (MFCCs)) of a chunk or portion of the audio data 120 (i.e., a portion of the streaming audio 118 or utterance 119)"].

Regarding claim 8, Shor does not disclose the method of claim 1, wherein the first set of phonetic features and the second set of phonetic features correspond to features associated with characteristics of phones or phonemes included in audio signals. However, Mont-Reynaud discloses the first set of phonetic features and the second set of phonetic features corresponding to features associated with characteristics of phones or phonemes included in audio signals - [Column 4, lines 5-19 "Phonetic features: The most common format for representing phonetic information is a phonetic sequence. A speech recognition system may identify several possible phonetic sequences for a user utterance. Weights may be applied to phonemes within each sequence according to some method of assigning probability. A score for each alternative phonetic sequence may be computed, and the most likely phonetic sequence may be selected based on the score. In some cases, multiple phonetic sequences are kept at the same time, often as a phoneme lattice. In addition, acoustic-phonetic information, including phoneme length (and HMM state info) is available and can also contribute to accent identification and detection. Articulatory-phonetic information includes a variety of secondary features: place and manner of articulation, vowel placement and more"]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor with the teachings of Mont-Reynaud because it would allow the extracted audio features to more explicitly capture phonetic characteristics of the speech signal, thereby improving the accuracy and robustness of tasks such as synthetic speech detection. This would have enhanced the system's ability to distinguish between genuine and synthesized speech using acoustic and phonetic features.

Regarding claim 9, Shor discloses the method of claim 1, further comprising: training, based on training data, a deep learning model to generate the one or more reference embeddings and generate the test embedding, wherein the training data includes sample speech clips labeled for authentic speech and synthetic speech - [Column 2, lines 3-16 "In some implementations, the shallow discriminator model includes one of a logistic regression model, a linear discriminant analysis model, or a random forest model. In some examples, the trained self-supervised model is trained on a first training dataset including only training samples of human-originated speech. The shallow discriminator model may be trained on a second training dataset including training samples of synthetic speech. The second training dataset may be smaller than the first training dataset. Optionally, the data processing hardware resides on the user device. The trained self-supervised model may include a representation model derived from a larger trained self-supervised model"].
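As a concrete illustration of the claim 9 training step, the sketch below fits a logistic-regression discriminator, one of the shallow-model options Shor names, on embeddings labeled authentic or synthetic. The data is random placeholder input and all names are assumptions for illustration, not the applicant's training pipeline.

```python
# Hedged sketch: train a shallow discriminator on labeled embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))     # placeholder clip embeddings
y = rng.integers(0, 2, size=200)   # labels: 0 = authentic, 1 = synthetic

clf = LogisticRegression(max_iter=1000).fit(X, y)
p_synthetic = clf.predict_proba(X[:1])[0, 1]  # probability score for a clip
```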
Regarding claim 11, Shor discloses a computing system comprising processing circuitry and memory for executing a machine learning system, the machine learning system configured to - [Column 1, lines 13-16 "A speech-enabled environment (e.g., home, workplace, school, automobile, etc.) allows a user to speak a query or a command out loud to a computer-based system that fields and answers the query and/or performs a function based on the command"]; [Column 2, lines 17-23 "Another aspect of the disclosure provides system for classifying whether audio data includes synthetic speech. The system includes data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that when executed on the data processing hardware cause the data processing hardware to perform operations"]: generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, one or more reference embeddings for the registered speaker, wherein the one or more reference embeddings characterize a first set of acoustic features - [Column 3, lines 60-64 "Generally, the speaker verification model will extract a verification speaker embedding from the input audio features and compare the verification speaker embedding with a reference speaker embedding for the authorized user"]; [Column 6, Lines 60-67 "Referring now to FIG. 2, schematic view 200 includes the audio feature extractor 210 executing a deep neural network 250. The deep neural network 250 may include any number of hidden layers that are configured to receive the audio data 120. In some implementations, the deep neural network 250 of the audio feature extractor 210 generates, from the audio data 120, the plurality of audio feature vectors 212, 212a-n (i.e., embeddings)"]; generate, using the deep learning model trained to distinguish between synthetic speech and authentic speech, a test embedding for an audio clip that characterizes a second set of acoustic features, and compute a score based on the test embedding and the one or more reference embeddings - [Column 5, lines 32-41 "After generating the audio feature vectors 212, the audio feature extractor 210 sends the audio feature vectors 212 to a synthetic speech detector 220 that includes a shallow discriminator model 222. As discussed in more detail below, the shallow discriminator model 222 is a shallow neural network (i.e., with little to no hidden layers) that generates, based on each of the audio feature vector 212, a score 224 (FIG. 2) that indicates a presence of synthetic speech in the streaming audio 118 based on the corresponding audio features of each audio feature vector 212"]; and output, based on the score, an indication of whether the audio clip includes synthetic speech - [Column 5, Lines 41-48 "The synthetic speech detector 220 determines whether the score 224 (e.g., a probability score) satisfies a synthetic speech detection threshold. When the score 224 satisfies the synthetic speech detection threshold, the synthetic speech detector 220 determines that the speech (i.e., the utterance 119) in the streaming audio 118 captured by the user device 102 includes synthetic speech"].

However, Shor does not teach generating an embedding wherein the one or more reference embeddings characterize a first set of phonetic features associated with the registered speaker. Mont-Reynaud does teach generating an embedding wherein the one or more reference embeddings characterize a first set of phonetic features associated with the registered speaker - [Column 10, lines 33-39 "Phonetic features 337 such as phoneme sequences or lattices can be processed to derive extended features 255 such as phoneme length, or articulatory-phonetic features, such as the place and manner of articulation for consonants, vowel placement, and other articulatory features that may be used for determining value for the accent property in the user profile"]; [Column 8, lines 40-54 "The extracted feature set 257 comprises more than simple sequences of elements such as frames, states, phonemes, or words, but also may include relationships (such as alignments) or mappings between the elements at successive levels, sufficient to derive additional information. For example, when a speech transcription contains a certain word (in the textual features 347), the extracted feature set 257 also delimits the specific phoneme subsequence (from the phonetic features 337) that matches this word in the transcription; and each phoneme in turn can be mapped to the sequence of states or frames that it spans. Such a cross-referencing capability (between words and phonemes and acoustic features) is often useful for deriving extended features; the generation of extended features 250 is discussed later"]; [Column 10, lines 54-67, Column 11, lines 1-4 "FIG. 4 is block diagram that illustrates the flow of generating information for inclusion in the user profile, according to an implementation of the invention. In particular, FIG. 4 shows user profile generation module 265 integrating acoustic and phonetic features 425 with linguistic features 429 to generate the user profile 147. Acoustic and phonetic features 425 are a subset of the extended feature set 255 and are used by speech profile generation module 435 to determine speech profile characteristics 445. Linguistic features 429 are another subset of the extended feature set 255 and are used by language profile generation module 439 to determine language profile characteristics 449. These profile generation modules 435 and 439 may contain classifiers as described above to transform the feature vectors in the extended feature set 255 to profile characteristics. The speech profile characteristics 445 and language profile characteristics 449 are combined in the integrated user profile 467 by the profile integration module 457"]; [Column 4, lines 32-40 "Classifier: a function taking extracted and/or extended features as input and assigning a value to a user profile property. A classifier function is a software module that is trained as known in the fields of machine learning and pattern recognition. The training data includes known associations between features (usually called a feature vector) and corresponding user characteristics (usually called the ground truth). After training, a classifier can accept a feature vector as input and map it to the most probable property value for that input"]; [Column 9, lines 25-35 "Alternatively, a non-registered user may only be known to persist within a session (or collection of sessions), and the user profile 147 will, by necessity, be based on a smaller number of interactions. In either case, standard speaker identification techniques may be applied to detect speaker changes, done to address the undesirable case in which the identity of the user (speaker) unexpectedly changes midsession. Such a speaker change reduces the validity of the statistics being gathered, and their applicability to the current user to the same user"].

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor with the teachings of Mont-Reynaud because generating embeddings with phonetic features as well as embeddings with acoustic features would significantly improve synthetic speech detection accuracy, robustness to voice variations, and adaptability to new, unseen synthetic voices, resulting in more accurate speech characterization and classification.

Regarding claim 14, Shor discloses the computing system of claim 11, wherein to output the indication, the machine learning system is configured to output, based on the score satisfying a threshold, an indication that the audio clip includes synthetic speech. Claim 14 is rejected for the same reasons as claim 4.

Regarding claim 16, Shor discloses the computing system of claim 11, wherein the first set of acoustic features and the second set of acoustic features correspond to features associated with characteristics of frequency components of audio signals. Claim 16 is rejected for the same reasons as claim 7.

Regarding claim 17, Shor in view of Mont-Reynaud does disclose the computing system of claim 11, wherein the first set of phonetic features and the second set of phonetic features correspond to features associated with characteristics of phones or phonemes included in audio signals. Claim 17 is rejected for the same reasons as claim 8.
Regarding claim 18, Shor discloses a computer-readable storage media comprising machine readable instructions for configuring processing circuitry to: - [Column 11, lines 22-36 "These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor"]: generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, one or more reference embeddings for the registered speaker, wherein the one or more reference embeddings characterize a first set of acoustic features; generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, a test embedding for an audio clip that characterizes a second set of acoustic features - [Column 6, line 66, Column 7, lines 1-5 "The shallow discriminator model 222 receives the plurality of audio feature vectors 212 simultaneously, sequentially, or concatenated together. The plurality of audio feature vectors 212 may undergo some processing between the audio feature extractor 210 and the shallow discriminator model 222"], which indicates several vectors are being generated; compute a score based on the test embedding and the one or more reference embeddings - [Column 5, lines 32-41 "After generating the audio feature vectors 212, the audio feature extractor 210 sends the audio feature vectors 212 to a synthetic speech detector 220 that includes a shallow discriminator model 222. As discussed in more detail below, the shallow discriminator model 222 is a shallow neural network (i.e., with little to no hidden layers) that generates, based on each of the audio feature vector 212, a score 224 (FIG. 2) that indicates a presence of synthetic speech in the streaming audio 118 based on the corresponding audio features of each audio feature vector 212"]; and output, based on the score, an indication of whether the audio clip includes synthetic speech - [Column 5, Lines 41-48 "The synthetic speech detector 220 determines whether the score 224 (e.g., a probability score) satisfies a synthetic speech detection threshold. When the score 224 satisfies the synthetic speech detection threshold, the synthetic speech detector 220 determines that the speech (i.e., the utterance 119) in the streaming audio 118 captured by the user device 102 includes synthetic speech"].

However, Shor does not teach generating an embedding wherein the one or more reference embeddings characterize a first set of phonetic features associated with the registered speaker. Mont-Reynaud does teach generating an embedding wherein the one or more reference embeddings characterize a first set of phonetic features associated with the registered speaker - [Column 10, lines 33-39 "Phonetic features 337 such as phoneme sequences or lattices can be processed to derive extended features 255 such as phoneme length, or articulatory-phonetic features, such as the place and manner of articulation for consonants, vowel placement, and other articulatory features that may be used for determining value for the accent property in the user profile"]; [Column 8, lines 40-54 "The extracted feature set 257 comprises more than simple sequences of elements such as frames, states, phonemes, or words, but also may include relationships (such as alignments) or mappings between the elements at successive levels, sufficient to derive additional information. For example, when a speech transcription contains a certain word (in the textual features 347), the extracted feature set 257 also delimits the specific phoneme subsequence (from the phonetic features 337) that matches this word in the transcription; and each phoneme in turn can be mapped to the sequence of states or frames that it spans. Such a cross-referencing capability (between words and phonemes and acoustic features) is often useful for deriving extended features; the generation of extended features 250 is discussed later"]; [Column 10, lines 54-67, Column 11, lines 1-4 "FIG. 4 is block diagram that illustrates the flow of generating information for inclusion in the user profile, according to an implementation of the invention. In particular, FIG. 4 shows user profile generation module 265 integrating acoustic and phonetic features 425 with linguistic features 429 to generate the user profile 147. Acoustic and phonetic features 425 are a subset of the extended feature set 255 and are used by speech profile generation module 435 to determine speech profile characteristics 445. Linguistic features 429 are another subset of the extended feature set 255 and are used by language profile generation module 439 to determine language profile characteristics 449. These profile generation modules 435 and 439 may contain classifiers as described above to transform the feature vectors in the extended feature set 255 to profile characteristics. The speech profile characteristics 445 and language profile characteristics 449 are combined in the integrated user profile 467 by the profile integration module 457"]; [Column 4, lines 32-40 "Classifier: a function taking extracted and/or extended features as input and assigning a value to a user profile property. A classifier function is a software module that is trained as known in the fields of machine learning and pattern recognition. The training data includes known associations between features (usually called a feature vector) and corresponding user characteristics (usually called the ground truth). After training, a classifier can accept a feature vector as input and map it to the most probable property value for that input"]; [Column 9, lines 25-35 "Alternatively, a non-registered user may only be known to persist within a session (or collection of sessions), and the user profile 147 will, by necessity, be based on a smaller number of interactions. In either case, standard speaker identification techniques may be applied to detect speaker changes, done to address the undesirable case in which the identity of the user (speaker) unexpectedly changes midsession. Such a speaker change reduces the validity of the statistics being gathered, and their applicability to the current user to the same user"].

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor with the teachings of Mont-Reynaud because generating embeddings with phonetic features as well as embeddings with acoustic features would significantly improve synthetic speech detection accuracy, robustness to voice variations, and adaptability to new, unseen synthetic voices, resulting in more accurate speech characterization and classification.

Claims 2, 3, 12, 13, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shor (US 11756572 B2) in view of Mont-Reynaud (US 10311858 B1), and further in view of Stork (US 5621858 A).
Regarding claim 2, Shor discloses the method of claim 1, wherein generating the one or more reference embeddings for the speaker comprises: extracting, based on one or more sample audio clips of the speaker speaking, the first set of acoustic features - [Column 5, lines 28-31 "The audio feature vectors 212 from the audio feature extractor 210 capture a large number of acoustic properties of the audio data 120 based on the self-supervised learning"]; [Column 12, lines 1-13 "To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input"]; and generating the one or more reference embeddings based on the enrollment feature vector, wherein the one or more reference embeddings include speaker specific information - [Column 3, lines 49-67, Column 4, lines 1-12 "In one example, an ASV system implementing a speaker verification model is used in conjunction with a hotword detection model so that an authorized user can invoke a speech-enabled device to wake-up and process subsequent spoken input from the user by speaking a predefined fixed phrase (e.g., a hotword, wake word, keyword, invocation phrase, etc.). In this example, the hotword detection model is configured to detect audio features characterizing the predefined fixed phrase in audio data and the speaker verification model is configured to verify that the audio features characterizing the predefined fixed phrase were spoken by the authorized user. Generally, the speaker verification model will extract a verification speaker embedding from the input audio features and compare the verification speaker embedding with a reference speaker embedding for the authorized user. Here, the reference speaker embedding can be previously obtained by having the particular user speaker the same predefined fixed phrase (e.g., during an enrollment process) and stored as part of a user profile for the authorized user. When the verification speaker embedding matches the reference speaker embedding, the hotword detected in the audio data is verified as being spoken by the authorized user to thereby permit the speech-enabled device to wake-up and process subsequent speech spoken by the authorized user"]; and wherein computing the score based on the one or more reference embeddings is more speaker aware in order to distinguish between synthetic speech and authentic speech of the speaker - [Column 5, lines 32-39 "After generating the audio feature vectors 212, the audio feature extractor 210 sends the audio feature vectors 212 to a synthetic speech detector 220 that includes a shallow discriminator model 222. As discussed in more detail below, the shallow discriminator model 222 is a shallow neural network (i.e., with little to no hidden layers) that generates, based on each of the audio feature vector 212, a score 224 (FIG. 2)"].

However, Shor does not teach extracting, based on one or more sample audio clips of the speaker speaking, the first set of phonetic features, and combining the first set of acoustic features and the first set of phonetic features. Mont-Reynaud does teach extracting, based on one or more sample audio clips of the speaker speaking, the first set of phonetic features, and combining the first set of acoustic features and the first set of phonetic features - [Column 10, lines 54-67, Column 11, lines 1-4 "...Acoustic and phonetic features 425 are a subset of the extended feature set 255 and are used by speech profile generation module 435 to determine speech profile characteristics 445. Linguistic features 429 are another subset of the extended feature set 255 and are used by language profile generation module 439 to determine language profile characteristics 449. These profile generation modules 435 and 439 may contain classifiers as described above to transform the feature vectors in the extended feature set 255 to profile characteristics. The speech profile characteristics 445 and language profile characteristics 449 are combined in the integrated user profile 467 by the profile integration module 457"]; [Column 4, lines 32-40 "Classifier: a function taking extracted and/or extended features as input and assigning a value to a user profile property. A classifier function is a software module that is trained as known in the fields of machine learning and pattern recognition. The training data includes known associations between features (usually called a feature vector) and corresponding user characteristics (usually called the ground truth). After training, a classifier can accept a feature vector as input and map it to the most probable property value for that input"]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor with the teachings of Mont-Reynaud because combining acoustic features with phonetic features allows the generated feature vectors to capture more detailed characteristics of the speech, improving reliability when comparing the generated embedding with the reference speaker embedding.

Shor in view of Mont-Reynaud does not teach combining the two vectors to generate an enrollment feature vector. However, Stork teaches combining two vectors to generate an enrollment feature vector - [Column 20, lines 15-19 "applying a corresponding pair of dynamic acoustic and a visual feature training vectors to the set of inputs of the time delay neural network classification apparatus and generating an output response vector"]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor in view of Mont-Reynaud with the teachings of Stork because combining vectors to generate a new vector offers significant advantages in simplifying complex systems; aggregating vectors into a single resultant vector allows for easier analysis and improves system performance.
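To make the enrollment-vector combination recited in claims 2, 12, and 19 concrete, here is a minimal sketch assuming simple concatenation of the acoustic and phonetic feature sets, averaged across sample clips to form a reference embedding. The shapes, names, and averaging step are our illustrative assumptions, not the applicant's or Stork's actual method.

```python
# Hedged sketch: combine acoustic + phonetic features into one
# enrollment feature vector, then pool across a speaker's sample clips.
import numpy as np

def enrollment_vector(acoustic: np.ndarray, phonetic: np.ndarray) -> np.ndarray:
    # Combination by concatenation, in the spirit of Stork's paired
    # acoustic/visual vector input (an assumption, not his exact scheme).
    return np.concatenate([acoustic, phonetic])

rng = np.random.default_rng(1)
clips = [(rng.random(40), rng.random(16)) for _ in range(3)]  # placeholder features
reference = np.mean([enrollment_vector(a, p) for a, p in clips], axis=0)
```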
Regarding claim 3, Shor discloses the method of claim 1, wherein generating the test embedding for the audio clip comprises: extracting, based on the audio clip, the second set of acoustic features - [Column 4, lines 56-58 "The user device 102 includes an audio feature extractor 210 configured to extract audio features from audio data 120 characterizing speech obtained by the user device 102. For example, the audio data 120 is captured from streaming audio 118 by the user device 102"]; [Column 5, lines 28-31 "The audio feature vectors 212 from the audio feature extractor 210 capture a large number of acoustic properties of the audio data 120 based on the self-supervised learning"]; [Column 12, lines 1-13 "To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input"]; generating a test feature vector; and generating the test embedding based on the test feature vector - [Column 3, lines 49-67, Column 4, lines 1-12 "In one example, an ASV system implementing a speaker verification model is used in conjunction with a hotword detection model so that an authorized user can invoke a speech-enabled device to wake-up and process subsequent spoken input from the user by speaking a predefined fixed phrase (e.g., a hotword, wake word, keyword, invocation phrase, etc.). In this example, the hotword detection model is configured to detect audio features characterizing the predefined fixed phrase in audio data and the speaker verification model is configured to verify that the audio features characterizing the predefined fixed phrase were spoken by the authorized user"]; [Column 3, lines 60-64 "Generally, the speaker verification model will extract a verification speaker embedding from the input audio features and compare the verification speaker embedding with a reference speaker embedding for the authorized user. Here, the reference speaker embedding can be previously obtained by having the particular user speaker the same predefined fixed phrase (e.g., during an enrollment process) and stored as part of a user profile for the authorized user. When the verification speaker embedding matches the reference speaker embedding, the hotword detected in the audio data is verified as being spoken by the authorized user to thereby permit the speech-enabled device to wake-up and process subsequent speech spoken by the authorized user"].

However, Shor does not teach combining the first set of acoustic features and the first set of phonetic features. Mont-Reynaud does teach combining the first set of acoustic features and the first set of phonetic features - [Column 10, lines 54-67, Column 11, lines 1-4 "...Acoustic and phonetic features 425 are a subset of the extended feature set 255 and are used by speech profile generation module 435 to determine speech profile characteristics 445. Linguistic features 429 are another subset of the extended feature set 255 and are used by language profile generation module 439 to determine language profile characteristics 449. These profile generation modules 435 and 439 may contain classifiers as described above to transform the feature vectors in the extended feature set 255 to profile characteristics. The speech profile characteristics 445 and language profile characteristics 449 are combined in the integrated user profile 467 by the profile integration module 457"]; [Column 4, lines 32-40 "Classifier: a function taking extracted and/or extended features as input and assigning a value to a user profile property. A classifier function is a software module that is trained as known in the fields of machine learning and pattern recognition. The training data includes known associations between features (usually called a feature vector) and corresponding user characteristics (usually called the ground truth). After training, a classifier can accept a feature vector as input and map it to the most probable property value for that input"]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor with the teachings of Mont-Reynaud because combining acoustic features with phonetic features allows the generated feature vectors to capture more detailed characteristics of the speech, improving reliability when comparing the generated embedding with the reference speaker embedding.

Shor in view of Mont-Reynaud does not teach combining the two features to generate a test feature vector. However, Stork teaches combining two vectors to generate a feature vector - [Column 20, lines 15-19 "applying a corresponding pair of dynamic acoustic and a visual feature training vectors to the set of inputs of the time delay neural network classification apparatus and generating an output response vector"]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor in view of Mont-Reynaud with the teachings of Stork because combining vectors to generate a new vector offers significant advantages in simplifying complex systems; aggregating vectors into a single resultant vector allows for easier analysis and improves system performance.

Regarding claim 12, Shor discloses the computing system of claim 11, wherein to generate the one or more reference embeddings for the speaker, the machine learning system is configured to: extract, based on one or more sample audio clips of the speaker speaking, the first set of acoustic features and the first set of phonetic features; combine the first set of acoustic features and the first set of phonetic features to generate an enrollment feature vector; and generate the one or more reference embeddings based on the enrollment feature vector, wherein the reference embeddings include speaker specific information, and wherein computing the score based on the one or more reference embeddings is more speaker aware in order to distinguish between synthetic speech and authentic speech of the speaker. Claim 12 is rejected for the same reasons as claim 2.

Regarding claim 13, Shor discloses the computing system of claim 11, wherein to generate the test embedding for the audio clip, the machine learning system is configured to: extract, based on the audio clip, the second set of acoustic features and the second set of phonetic features; combine the second set of acoustic features and the second set of phonetic features to generate a test feature vector; and generate the test embedding based on the test feature vector. Claim 13 is rejected for the same reasons as claim 3.
Regarding claim 19, Shor discloses the computer-readable storage media of claim 18, wherein to generate the one or more reference embeddings for the speaker, the processing circuitry is configured to: extract, based on one or more sample audio clips of the speaker speaking, the first set of acoustic features and the first set of phonetic features; combine the first set of acoustic features and the first set of phonetic features to generate an enrollment feature vector; and generate the one or more reference embeddings based on the enrollment feature vector, wherein the reference embeddings include speaker specific information, and wherein computing the score based on the one or more reference embeddings is more speaker aware in order to distinguish between synthetic speech and authentic speech of the speaker. Claim 19 is rejected for the same reasons as claim 2.

Regarding claim 20, Shor discloses the computer-readable storage media of claim 18, wherein to generate the test embedding for the audio clip, the processing circuitry is configured to: extract, based on the audio clip, the second set of acoustic features and the second set of phonetic features; combine the second set of acoustic features and the second set of phonetic features to generate a test feature vector; and generate the test embedding based on the test feature vector. Claim 20 is rejected for the same reasons as claim 3.

Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Shor (US 11756572 B2) in view of Liu (US 7454339 B2).

Regarding claim 5, Shor discloses the method of claim 1, wherein computing the score based on the test embedding and the one or more reference embeddings comprises: - [Column 5, lines 32-41 "After generating the audio feature vectors 212, the audio feature extractor 210 sends the audio feature vectors 212 to a synthetic speech detector 220 that includes a shallow discriminator model 222. As discussed in more detail below, the shallow discriminator model 222 is a shallow neural network (i.e., with little to no hidden layers) that generates, based on each of the audio feature vector 212, a score 224 (FIG. 2) that indicates a presence of synthetic speech in the streaming audio 118 based on the corresponding audio features of each audio feature vector 212"]; [Column 3, lines 60-64 "Generally, the speaker verification model will extract a verification speaker embedding from the input audio features and compare the verification speaker embedding with a reference speaker embedding for the authorized user"]; [Column 5, lines 23-31 "For example, each audio feature vector represents features for a 960 millisecond portion of the audio data 120. The portions may overlap. For instance, the audio feature extractor 210 generates eight audio feature vectors 212 (each representing 960 milliseconds of the audio data 120) for five seconds of audio data 120. The audio feature vectors 212 from the audio feature extractor 210 capture a large number of acoustic properties of the audio data 120 based on the self-supervised learning"]; [Column 6, line 66, Column 7, lines 1-5 "The shallow discriminator model 222 receives the plurality of audio feature vectors 212 simultaneously, sequentially, or concatenated together. The plurality of audio feature vectors 212 may undergo some processing between the audio feature extractor 210 and the shallow discriminator model 222"].

However, Shor does not disclose computing one or more log-likelihood ratios by comparing the test embedding to the one or more reference embeddings. Liu does disclose computing one or more log-likelihood ratios by comparing the test embedding to the one or more reference embeddings - [Column 3, lines 43-55 "FIG. 2 illustrates exemplary steps performed in order to effectively train acoustic models. For every training speech segment X in either true data set or competing data set for word W, a log-likelihood ratio score is determined at 100 using a true acoustic model for word W and an anti-acoustic model which represents words other than word W. An average of log-likelihood ratio scores over the true data set is then determined at 110. In the same way, an average of log-likelihood ratio scores over the competing data set is determined at 120. A difference is determined at 130 based on the two averages and parameters of the models are adjusted at 140. To optimize the difference, the process may be iterative as shown at 150"]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor with the teachings of Liu because applying such a scoring technique (log-likelihood ratios) to the embedding comparison mechanism quantitatively determines the similarity between a generated embedding and a reference embedding, enabling more accurate and flexible matching decisions.

Regarding claim 15, Shor in view of Liu does disclose the computing system of claim 11, wherein to compute the score, the machine learning system is configured to compute one or more log-likelihood ratios by comparing the test embedding to the one or more reference embeddings. Claim 15 is rejected for the same reasons as claim 5.
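For orientation on the log-likelihood-ratio scoring at issue in claims 5 and 15, here is a hedged sketch assuming Gaussian models of authentic and synthetic embeddings. This illustrates LLR scoring in general, not Liu's specific word-model formulation; all names and the Gaussian assumption are ours.

```python
# Hedged sketch: LLR between an "authentic" and a "synthetic" model.
import numpy as np
from scipy.stats import multivariate_normal

def llr_score(test_emb, auth_mean, synth_mean, cov):
    # log p(x | authentic) - log p(x | synthetic); positive favors authentic.
    ll_auth = multivariate_normal.logpdf(test_emb, mean=auth_mean, cov=cov)
    ll_synth = multivariate_normal.logpdf(test_emb, mean=synth_mean, cov=cov)
    return float(ll_auth - ll_synth)

d = 8
rng = np.random.default_rng(0)
score = llr_score(rng.normal(size=d), np.zeros(d), np.ones(d), np.eye(d))
```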
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Shor (US 11756572 B2) in view of Charlet (US 20040107099 A1).

Regarding claim 6, Shor discloses the method of claim 1, wherein computing the score comprises: computing a raw score based on a comparison of the test embedding to the one or more reference embeddings - [Column 5, lines 28-31 "The audio feature vectors 212 from the audio feature extractor 210 capture a large number of acoustic properties of the audio data 120 based on the self-supervised learning"]; [Column 6, line 66, Column 7, lines 1-5 "The shallow discriminator model 222 receives the plurality of audio feature vectors 212 simultaneously, sequentially, or concatenated together. The plurality of audio feature vectors 212 may undergo some processing between the audio feature extractor 210 and the shallow discriminator model 222"].

However, Shor does not teach computing, based on a calibration of the raw score, the score. Charlet teaches computing, based on a calibration of the raw score, the score - [Page 2, 0020 "The main object of the invention is to normalize the verification score so that it is compared to a decision threshold that is always pertinent, independently of the speaker, whilst assuring that the verification score evolves with the voice of the authorized speaker without having recourse to additional recordings of impostors. Consequently, in relation to the speech recognition device, another object of the invention is to reduce the memory space necessary for supplementary recordings of impostors whilst guaranteeing a more accurate and fast decision"]; [Page 2, 0021 "To achieve the above objects, a device for automatically recognizing the voice of a speaker authorized to access an application, comprising means for generating beforehand, during a learning phase, parameters of an acceptance voice model relative to a voice segment spoken by the authorized speaker and parameters of a rejection voice model, means for normalizing by means of normalization parameters a speaker verification score depending on the likelihood ratio between a voice segment to be tested and the acceptance model and rejection model, and means for comparing the normalized verification score to a first threshold in order to authorize access to the application by the speaker who spoke the voice segment to be tested only if the normalized verification score is at least as high as the first threshold. This device is characterized, according to the invention, in that it includes means for updating at least one of the normalization parameters as a function of a preceding value of said parameter and the speaker verification score on each voice segment test only if the normalized verification score is at least equal to a second threshold that is at least equal to the first threshold"]; [Page 2, 0024 "Thus the normalized score is updated on-line, as and when speaker verification attempts and therefore requests to access the application are made, so that the normalized score evolves with changes in the voice of the speaker. Updating as a function of at least one parameter and not a threshold means that the normalized decision score can be modified independently of the operating point required by the application"]; [Page 2, 0025 "The updated normalization parameter can be representative of the statistical mean value of the speaker verification score or of the standard deviation of the speaker verification score, or these two parameters are updated"]; [Page 2, 0026 "The updating of the normalized score is further improved if the device comprises means for updating at least one of the parameters of the acceptance model as a function of a preceding value of said model parameter only if the normalized verification score is at least equal to the second threshold"].

In this reference, the verification score is the "raw score" derived from likelihood values; the calibration is the normalization; and the normalized score is then used going forward. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor with the teachings of Charlet because applying a normalized similarity score to threshold-based decision making would improve robustness to varying input conditions and enable more consistent and reliable authentication outcomes.
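As a rough illustration of the calibration idea in claim 6, the sketch below shifts and scales a raw score by running statistics of past verification scores, in the spirit of Charlet's normalization so that one decision threshold stays meaningful. The exponential update rule, constants, and names are assumptions of ours, not Charlet's actual equations.

```python
# Hedged sketch: calibrate a raw score via running mean/std normalization.
def calibrate(raw_score: float, mean: float, std: float) -> float:
    # Normalized score compared against a fixed decision threshold.
    return (raw_score - mean) / max(std, 1e-6)

def update_stats(mean: float, std: float, raw_score: float, alpha: float = 0.1):
    # Exponential update of the normalization parameters after a trial
    # that was accepted (cf. Charlet's conditional on-line updating).
    new_mean = (1 - alpha) * mean + alpha * raw_score
    new_std = (1 - alpha) * std + alpha * abs(raw_score - new_mean)
    return new_mean, new_std
```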
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Shor (US 11756572 B2) in view of Arik (US 10796686 B2).

Regarding claim 10, Shor does not disclose the method of claim 9, wherein the deep learning model includes a residual network architecture. However, Arik teaches a deep learning model that includes a residual network architecture - [Column 16, lines 59-67, Column 17, lines 1-2 "FIG. 7 graphically depicts an example detailed Deep Voice 3 model architecture, according to embodiments of the present disclosure. In one or more embodiments, the model 700 uses a deep residual convolutional network to encode text and/or phonemes into per-timestep key 720 and value 722 vectors for an attentional decoder 730. In one or more embodiments, the decoder 730 uses these to predict the mel-band log magnitude spectrograms 742 that correspond to the output audio"]. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Shor with the teachings of Arik because residual neural network architectures are well known to improve the training and performance of deep neural networks. This modification would improve the stability and overall accuracy of the learned embeddings, thereby creating a more effective and reliable system.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHEZA ABDUL AZIZ, whose telephone number is (571) 272-9610. The examiner can normally be reached Monday-Friday, 7:30am-5pm, with alternate Fridays off.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Daniel Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DANIEL C WASHBURN/
Supervisory Patent Examiner, Art Unit 2657

Prosecution Timeline

Jul 10, 2024: Application Filed
Mar 25, 2026: Non-Final Rejection — §101, §103 (current)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: Favorable
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 0 resolved cases by this examiner. Grant probability derived from career allow rate.
