DETAILED ACTION
This communication is in response to the Amendments and Arguments filed on 12/10/2025.
Claims 1-18 and 20-21 are pending and have been examined.
All previous objections and rejections not mentioned in this Office Action have been withdrawn by the examiner.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments / Amendments
Applicant has amended independent claims 1, 16, and 20. The added limitations raise new grounds of rejection. Because Applicant's arguments are directed to the newly added limitations, the arguments are moot in view of the new grounds of rejection. Accordingly, new references have been applied.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-18 and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Park et al. (U.S. PG Pub No. 20250029632), hereinafter Park, in view of Biadsy et al. (U.S. PG Pub No. 20230360632), hereinafter Biadsy, and further in view of Li et al. (U.S. PG Pub No. 20160275947), hereinafter Li.
Regarding claims 1, 16, and 20, Park teaches:
(Claim 1) A computer-implemented method for audio transcription, the computer-implemented method comprising: (P0002, At least one embodiment pertains to processing resources used to perform and facilitate speaker identification, verification, diarization, and/or speech recognition and transcription. For example, at least one embodiment pertains to systems and techniques that facilitate efficient automated association of speech utterances with speakers in acoustic multi-speaker environments.)
(Claim 16) A system for automated speech recognition, the system comprising: one or more hardware processors configured to: (P0030, Speech conversion system.; P0061, FIG. 5 is a schematic view of an example computing device that may be used to implement the systems and methods described in this document.)
(Claim 20) A non-transitory computer-readable medium storing program instructions that, when executed by one or more hardware processors of a speech recognition system, cause the speech recognition system to perform operations comprising: (P0030, Speech conversion system.; P0062, The processor can process instructions for execution within the computing device, including instructions stored in the memory.; P0063, The memory stores information non-transitorily within the computing device.)
receiving an audio stream from a user device; (P0029, Audio processing server may include a desktop computer, a laptop computer, a smartphone, a tablet computer, a server, a wearable device. … Audio processing server may be configured to receive audio data that may be associated with any speech episode involving one or more speakers.)
identifying a particular user that speaks in the audio stream; (P0063, Audio processing model identifies speaker labels associating specific temporal intervals of audio data with respective speakers that produced speech of those temporal intervals.)
selecting at least a first vector from a first plurality of vectors generated by a first machine learning model, wherein the first vector represents speech characteristics that are based on a pronunciation or a voice feature of the particular user and that differ from baseline speech characteristics of one or more users used to train a third machine learning model for the audio transcription; (P0065, Audio processing model that can be used for efficient multi-channel multi-speaker identification, verification, and diarization. … Audio processing model may include a neural network that generates speaker embeddings for characterization of speech spoken at various time intervals.)
Park does not specifically teach:
adjusting one or more vectors of the third machine learning model based on a first biasing of the one or more vectors with the first vector that adjusts for differences in the pronunciation or the voice feature of the particular user and a second biasing of the one or more vectors with the second vector that adjusts for differences in the audio characteristics of the environment or the audio capture device; and
using the third machine learning model that is dynamically modified based on the first biasing and the second biasing to convert speech of the particular user into text.
Biadsy, however, teaches:
adjusting one or more vectors of the third machine learning model based on a first biasing of the one or more vectors with the first vector that adjusts for differences in the pronunciation or the voice feature of the particular user and a second biasing of the one or more vectors with the second vector that adjusts for differences in the audio characteristics of the environment or the audio capture device; and (P0056, At operation 406, the method includes receiving a speech conversion request that includes input audio data corresponding to an utterance spoken by the target speaker associated with the atypical speech. At operation 408, the method includes biasing, using the speaker embedding generated for the target speaker by the speaker embedding network, the speech conversion model to convert the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into an output canonical representation of the utterance spoken by the target speaker.)
using the third machine learning model that is dynamically modified based on the first biasing and the second biasing to convert speech of the particular user into text. (P0052, In other examples, the sub-model is disposed in a neural network layer of the decoder or between two neural network layers of the decoder.; P0058, The speech conversion model includes an automated speech recognition model configured to convert speech into text. In these examples, the output canonical representation includes a canonical textual representation of the utterance spoken by the target speaker.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to adjust vectors of a third machine learning model based on differences in the pronunciation or voice features of a user when converting speech to text. It would have been obvious to combine the references because speech recognition machine learning models have difficulty recognizing speech spoken by speakers with atypical speech patterns, and utilizing speaker speech characteristics encoded in an embedding achieves acceptable accuracy for atypical speech patterns. (Biadsy P0003).
Park in view of Biadsy does not specifically teach:
selecting at least a second vector from a second plurality of vectors generated by a second machine learning model, wherein the second vector represents audio characteristics of an environment or an audio capture device that affect a capture of the audio stream differently than baseline audio characteristics of one or more sample audio signals that are captured in different environments or with different audio capture devices that are used to train the third machine learning model;
adjusting one or more vectors of the third machine learning model based on a first biasing of the one or more vectors with the first vector that adjusts for differences in the pronunciation or the voice feature of the particular user and a second biasing of the one or more vectors with the second vector that adjusts for differences in the audio characteristics of the environment or the audio capture device; and
Li, however, teaches:
selecting at least a second vector from a second plurality of vectors generated by a second machine learning model, wherein the second vector represents audio characteristics of an environment or an audio capture device that affect a capture of the audio stream differently than baseline audio characteristics of one or more sample audio signals that are captured in different environments or with different audio capture devices that are used to train the third machine learning model; (P0018, Improving speech recognition across multiple environments. For example, the quality of speech recognition results often varies across quiet environments and noisy environments. … incorporating environment variables into components of a deep neural network (DNN) for use in speech recognition systems.; P0025, Where the VCDNN utilizes the SNR as the environment variable, the environment variable module 206 calculates, measures, and/or determines the SNR during speech capture.)
adjusting one or more vectors of the third machine learning model based on a first biasing of the one or more vectors with the first vector that adjusts for differences in the pronunciation or the voice feature of the particular user and a second biasing of the one or more vectors with the second vector that adjusts for differences in the audio characteristics of the environment or the audio capture device; and (P0027, The speech recognition decoder utilizes a VCDNN module to complete the recognition. … The VCDNN module receives the captured speech or feature vectors and processes the speech or vectors utilizing the respective VCDNN.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to adjust vectors of a third machine learning model based on differences in the audio characteristics of the environment when converting speech to text. It would have been obvious to combine the references because incorporating environment variables into components of a deep neural network (DNN) for use in speech recognition systems provides higher-quality results across multiple environments. (Li P0018).
Regarding claims 2 and 17, Park in view of Biadsy and further in view of Li teach claims 1 and 16.
Park further teaches:
identifying a second user that speaks in the audio stream during a time that is different than when the particular user speaks; (P0074, FIG. 5 is a flow diagram of an example method 500 of efficient multi-channel multi-speaker identification, verification, and/or diarization.; P0078, as illustrated with the callout block 532, processing the second set of ADCs to obtain the association of the speech to the one or more speakers may include partitioning the speech into one or more intervals. Individual intervals of the one or more intervals may be mapped to respective speakers that generated speech associated with the individual intervals (e.g., by assigning one or more speaker labels 250 in FIG. 2).)
selecting at least a third vector from the first plurality of vectors, wherein the third vector represents differences between speech characteristics for a different pronunciation or a different voice feature of the second user and the baseline speech characteristics; (P0065, Audio processing model may include a neural network that generates speaker embeddings for characterization of speech spoken at various time intervals.)
Park does not specifically teach:
adjusting the one or more vectors of the third machine learning model based on a third biasing of the one or more vectors with the third vector that adjusts for the differences between the speech characteristics of the second user and the baseline speech characteristics; and
converting speech of the second user into text based on the third biasing and the second biasing of the machine learning model.
Biadsy, however, teaches:
adjusting the one or more vectors of the third machine learning model based on a third biasing of the one or more vectors with the third vector that adjusts for the differences between the speech characteristics of the second user and the baseline speech characteristics; and (P0056, At operation 406, the method 400 includes receiving a speech conversion request that includes input audio data 102 corresponding to an utterance spoken 108 by the target speaker 104 associated with the atypical speech. At operation 408, the method 400 includes biasing, using the speaker embedding 350 generated for the target speaker 104 by the speaker embedding network 250, the speech conversion model 210 to convert the input audio data 102 corresponding to the utterance 108 spoken by the target speaker 104 associated with atypical speech into an output canonical representation 106, 120 of the utterance spoken by the target speaker.)
converting speech of the second user into text based on the third biasing and the second biasing of the machine learning model. (P0058, The speech conversion model 210 includes an automated speech recognition model 210b configured to convert speech into text. In these examples, the output canonical representation includes a canonical textual representation of the utterance 108 spoken by the target speaker 104.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to adjust a third machine learning model based on speech characteristics encoded within a vector to convert speech to text. It would have been obvious to combine the references because speech recognition machine learning models have difficulty recognizing speech spoken by speakers with atypical speech patterns, and utilizing speaker speech characteristics encoded in an embedding achieves acceptable accuracy for atypical speech patterns. (Biadsy P0003).
Regarding claims 3 and 18, Park in view of Biadsy and further in view of Li teach claims 1 and 17.
Park further teaches:
determining the audio capture device that records or encodes the audio stream; and (P0045, Audio processing pipeline may receive audio data captured by multiple audio sensors, e.g., microphones. Audio sensors may include distributed microphones and arrays of microphones, which may be arranged into any spatial pattern, e.g., a linear pattern, a circular (arc) pattern, a two-dimensional pattern (e.g., microphones placed on desks of a conference hall), a three-dimensional pattern (e.g., microphones placed at different heights of the conference hall), and/or the like. Microphones can include dynamic microphones, condenser microphones, ribbon microphones, unidirectional microphones, omnidirectional microphones, and/or any other types of microphones.)
wherein selecting the second vector comprises determining that the audio characteristics encoded within the second vector correspond to capture characteristics with which the audio capture device records or encodes the audio stream. (P0046, Audio data collected by audio sensors may undergo speech preprocessing and segmentation. For example, preprocessing may include audio filtering, denoising, amplification, dereverberation, and/or any other suitable enhancement.)
Regarding claim 4, Park in view of Biadsy and further in view of Li teach claim 3.
Park further teaches:
wherein the capture characteristics represent one or more adjustments that are made to the speech of the particular user in the audio stream as a result of recording or encoding the speech with the audio capture device. (P0045, Audio sensors may include distributed microphones and arrays of microphones, which may be arranged into any spatial pattern, e.g., a linear pattern, a circular (arc) pattern, a two-dimensional pattern (e.g., microphones placed on desks of a conference hall), a three-dimensional pattern (e.g., microphones placed at different heights of the conference hall), and/or the like. Microphones can include dynamic microphones, condenser microphones, ribbon microphones, unidirectional microphones, omnidirectional microphones, and/or any other types of microphones.; P0046, Audio data collected by audio sensors may undergo speech preprocessing and segmentation. For example, preprocessing may include audio filtering, denoising, amplification, dereverberation, and/or any other suitable enhancement.)
Regarding claim 5, Park in view of Biadsy and further in view of Li teach claims 1 and 16.
Park further teaches:
receiving identifying information with the audio stream; (P0023, An arbitrary number N0 of microphones may generate the corresponding number of audio data channels that are preprocessed (e.g., noise-filtered), digitized, and then processed by a channel clustering component that identifies a number N of channel clusters (where N≤N0) representative of distinct local audio environments. … The number N of combined channels need not be fixed in advance and may change (e.g., from 1 to N0) depending on a specific environment, placement of audio capture devices, and/or positioning of audio sources present in the environment.)
determining the environment in which the particular user is located based on the identifying information; and (P0051, FIG. 3A depicts a similarity matrix for an example non-limiting situation where N0=8 channels, numbered with circled numerals 1 . . . 8, have a certain physical arrangement in an audio environment where microphones associated with channels 1, 2, and 5 capturing substantially similar audio content, microphones associated with channels 4 and 7 similarly capturing a different audio content, and each of the remaining channels 3, 6, and 8 having distinct audio content that is dissimilar to the audio content captured by all other microphones.)
wherein selecting the second vector comprises determining that the audio characteristics encoded within the second vector correspond to acoustic characteristics of the environment. (P0052, Channel clustering can be performed by a trained machine learning model. … Prior to inputting the sets of audio data (e.g., {ej(f)} or {ej(t)}) into the channel clustering model, the audio data can first be converted into embeddings.; P0053, The N combined channels may be processed by a suitable embeddings model that applies a sliding window to the channel audio data.)
Regarding claim 6, Park in view of Biadsy and further in view of Li teach claim 5.
Park further teaches:
wherein the audio characteristics correspond to sounds that are added to the speech of the particular user in the audio stream based on environmental factors associated with the environment. (P0045, Audio sensors may capture not only a speech signal but also background noise, interference signals, e.g., emitted by TV devices, radio devices, alarm devices, and/or any other equipment, or sounds naturally occurring (e.g., sound of wind, water, birds, etc.).)
Regarding claim 7, Park in view of Biadsy and further in view of Li teach claim 1.
Park does not specifically teach:
wherein adjusting the one or more vectors of the third machine learning model comprises: modifying a vector of the third machine learning model, that represents a first sound for recognizing one or more letters in the speech according to a first set of speech characteristics, based on the first vector being encoded with a second set of speech characteristics that represent a second sound for recognizing the one or more letters in the speech.
Biadsy, however, teaches:
wherein adjusting the third machine learning model comprises: modifying a vector of the third machine learning model, that represents a first sound for recognizing one or more letters in the speech according to a first set of speech characteristics, based on the first vector being encoded with a second set of speech characteristics that represent a second sound for recognizing the one or more letters in the speech. (P0056, At operation 406, the method 400 includes receiving a speech conversion request that includes input audio data 102 corresponding to an utterance spoken 108 by the target speaker 104 associated with the atypical speech. At operation 408, the method 400 includes biasing, using the speaker embedding 350 generated for the target speaker 104 by the speaker embedding network 250, the speech conversion model 210 to convert the input audio data 102 corresponding to the utterance 108 spoken by the target speaker 104 associated with atypical speech into an output canonical representation 106, 120 of the utterance spoken by the target speaker.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to adjust a third machine learning model based on speech characteristics encoded within a vector to convert speech to text. It would have been obvious to combine the references because speech recognition machine learning models have difficulty recognizing speech spoken by speakers with atypical speech patterns, and utilizing speaker speech characteristics encoded in an embedding achieves acceptable accuracy for atypical speech patterns. (Biadsy P0003).
Regarding claim 8, Park in view of Biadsy and further in view of Li teach claim 1.
Park further teaches:
receiving a first audio sample of the particular user speaking and a second audio sample of a second user speaking; determining a first set of speech characteristics associated with the particular user speaking and a second set of speech characteristics associated with the second user speaking; and generating a first set of vectors of the first plurality of vectors that encode differences between the first set of speech characteristics and the baseline speech characteristics, and a second set of vectors of the first plurality of vectors that encode differences between the second set of speech characteristics and the baseline speech characteristics, wherein the first set of vectors comprises the first vector, and wherein the second set of vectors comprises the second vector. (P0022, Speaker embeddings (e.g., in an embedding or latent space) that can be used as digital fingerprints to identify a speaker. A speaker embedding may be viewed as a vector in a special or latent embeddings space. A well-designed and well-trained model generates embeddings for different utterances produced (spoken) by the same person that differ significantly less (in the embeddings space) than utterances produced by different people.; P0068, Audio processing model allows for obtaining fixed-size speaker embeddings from variable-duration speech utterances. A linear layer may generate logits that determine probabilities of speaker embeddings belonging to one of N classes (e.g., N speakers in the training database). During training, the linear layer may feed logits into a suitable loss function.)
Regarding claim 9, Park in view of Biadsy and further in view of Li teach claim 8.
Park further teaches:
associating the first set of vectors to one or more identifiers associated with the particular user; and associating the second set of vectors to one or more identifiers associated with the second user. (P0061, The mixed embeddings may then be processed by audio processing model trained to perform one or more of a speaker identification, speaker verification, diarization, and/or the like. In some embodiments, audio processing model identifies speaker labels associating specific temporal intervals of audio data with respective speakers that produced speech of those temporal intervals.)
Regarding claim 10, Park in view of Biadsy and further in view of Li teach claim 1.
Park further teaches:
receiving an audio sample of the particular user speaking; comparing the audio sample to a set of audio samples used in training the third machine learning model; determining differences between the speech characteristics of the particular user and speech characteristics associated with the set of audio samples; and generating the first vector to encode one or more of the differences. (P0053, Embeddings model represents the audio data in the sliding window via embeddings (feature vectors) that capture audio features of the audio data, e.g., spectral features, cadence, volume, and/or the like. An embedding should be understood as any suitable digital representation of an input data, e.g., as a vector (string) of any number D of components, which can have integer values or floating-point values. Embeddings can be considered as vectors or points in a D-dimensional embedding space. The dimensionality D of the embedding space (defined as part of embeddings model architecture) can be smaller than the size of the input data (the sets of audio spectrograms or frames). During training, embeddings model learns to associate similar sets of training audio spectrograms/frames with similar embeddings represented by points closely situated in the embedding space and further learns to associate dissimilar sets of training audio spectrograms/frames with points that are located further apart in the embedding space.)
Regarding claim 11, Park in view of Biadsy and further in view of Li teach claim 1.
Park does not specifically teach:
training the third machine learning model to recognize words in speech according to a first set of speech characteristics; and wherein adjusting the one or more vectors of the third machine learning model comprises biasing one or more of the first set of speech characteristics to recognize one or more of the words according to the speech characteristics of the particular user represented by the first vector.
Biadsy, however, teaches:
training the third machine learning model to recognize words in speech according to a first set of speech characteristics; and wherein adjusting the third machine learning model comprises biasing one or more of the first set of speech characteristics to recognize one or more of the words according to the speech characteristics encoded within the first vector. (P0015, A training process trains the speech conversion model end-to-end concurrently with the speaker embedding network by obtaining multiple sets of spoken training utterances and training the speech conversion model and the speaker embedding model concurrently on the multiple sets of spoken training utterances.; P0016, For each corresponding set of the multiple sets of spoken training utterances, training the speech conversion model includes biasing the speech conversion model for the corresponding set of the spoken training utterances using the respective personalization embedding that maps to the style cluster that includes the respective speaker embedding extracted from the corresponding set of the spoken training utterances.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to train a third machine learning model based on speech characteristics encoded within a vector to convert speech to text. It would have been obvious to combine the references because speech recognition machine learning models have difficulty recognizing speech spoken by speakers with atypical speech patterns, and utilizing speaker speech characteristics encoded in an embedding achieves acceptable accuracy for atypical speech patterns. (Biadsy P0003).
Regarding claim 12, Park in view of Biadsy and further in view of Li teach claim 1.
Park does not specifically teach:
training the third machine learning model with audio samples that are recorded or encoded with a first set of audio characteristics; and wherein adjusting the one or more vectors of the third machine learning model comprises: biasing one or more of the first set of audio characteristics that differ from the audio characteristics encoded within the second vector; and performing speech recognition that compensates for differences between the first set of audio characteristics used to train the third machine learning model and the audio characteristics encoded within the second vector in response to biasing the one or more of the first set of audio characteristics.
Biadsy, however, teaches:
training the third machine learning model with audio samples that are recorded or encoded with a first set of audio characteristics; and wherein adjusting the third machine learning model comprises: biasing one or more of the first set of audio characteristics that differ from the audio characteristics encoded within the second vector; and performing speech recognition that compensates for differences between the first set of audio characteristics used to train the third machine learning model and the audio characteristics encoded within the second vector in response to biasing the one or more of the first set of audio characteristics. (P0015, A training process trains the speech conversion model end-to-end concurrently with the speaker embedding network by obtaining multiple sets of spoken training utterances and training the speech conversion model and the speaker embedding model concurrently on the multiple sets of spoken training utterances.; P0016, For each corresponding set of the multiple sets of spoken training utterances, training the speech conversion model includes biasing the speech conversion model for the corresponding set of the spoken training utterances using the respective personalization embedding that maps to the style cluster that includes the respective speaker embedding extracted from the corresponding set of the spoken training utterances.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to train a third machine learning model based on speech characteristics encoded within a vector to convert speech to text. It would have been obvious to combine the references because speech recognition machine learning models have difficulty recognizing speech spoken by speakers with atypical speech patterns, and utilizing speaker speech characteristics encoded in an embedding achieves acceptable accuracy for atypical speech patterns. (Biadsy P0003).
Regarding claim 13, Park in view of Biadsy and further in view of Li teach claim 1.
Park does not specifically teach:
wherein adjusting the one or more vectors of the third machine learning model comprises: modifying a first vector value of the third machine learning model representing a particular speech characteristic with a different value that is specified for the same particular speech characteristic in the first vector; and modifying a second vector value of the third machine learning model representing a particular audio characteristic with a different value that is specified for the same particular audio characteristic in the second vector.
Biadsy, however, teaches:
wherein adjusting the third machine learning model comprises: modifying a first vector value of the third machine learning model representing a particular speech characteristic with a different value that is specified for the same particular speech characteristic in the first vector; and modifying a second vector value of the third machine learning model representing a particular audio characteristic with a different value that is specified for the same particular audio characteristic in the second vector. (P0056, At operation 406, the method 400 includes receiving a speech conversion request that includes input audio data 102 corresponding to an utterance spoken 108 by the target speaker 104 associated with the atypical speech. At operation 408, the method 400 includes biasing, using the speaker embedding 350 generated for the target speaker 104 by the speaker embedding network 250, the speech conversion model 210 to convert the input audio data 102 corresponding to the utterance 108 spoken by the target speaker 104 associated with atypical speech into an output canonical representation 106, 120 of the utterance spoken by the target speaker.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to adjust a third machine learning model based on speech characteristics encoded within a vector when converting speech to text. It would have been obvious to combine the references because speech recognition machine learning models have difficulty recognizing speech spoken by speakers with atypical speech patterns, and utilizing speaker speech characteristics encoded in an embedding achieves acceptable accuracy for atypical speech patterns. (Biadsy P0003).
Regarding claim 14, Park in view of Biadsy and further in view of Li teach claim 1.
Park further teaches:
identifying the audio capture device that captures the audio stream; and (P0023, An arbitrary number N0 of microphones may generate the corresponding number of audio data channels that are preprocessed (e.g., noise-filtered), digitized, and then processed by a channel clustering component that identifies a number N of channel clusters (where N≤N0) representative of distinct local audio environments. … The number N of combined channels need not be fixed in advance and may change (e.g., from 1 to N0) depending on a specific environment, placement of audio capture devices, and/or positioning of audio sources present in the environment.)
wherein selecting the second vector comprises selecting a vector from the second plurality of vectors that represents properties with which the audio capture device captures the audio stream. (P0052, Channel clustering can be performed by a trained machine learning model. … Prior to inputting the sets of audio data (e.g., {ej(f)} or {ej(t)}) into the channel clustering model, the audio data can first be converted into embeddings.; P0053, The N combined channels may be processed by a suitable embeddings model that applies a sliding window to the channel audio data.)
Regarding claim 15, Park in view of Biadsy and further in view of Li teach claim 1.
Park further teaches:
wherein the first machine learning model is a speaker adaptation model and the second machine learning model is an environment adaptation model. (P0065, Audio processing model that can be used for efficient multi-channel multi-speaker identification, verification, and diarization. … Audio processing model may include a neural network that generates speaker embeddings for characterization of speech spoken at various time intervals.; P0052, Channel clustering can be performed by a trained machine learning model. … Prior to inputting the sets of audio data (e.g., {ej(f)} or {ej(t)}) into the channel clustering model, the audio data can first be converted into embeddings.; P0053, The N combined channels may be processed by a suitable embeddings model that applies a sliding window to the channel audio data.)
Park does not specifically teach:
wherein the third machine learning model is a speech recognition model comprising vectors for detecting and transcribing spoken words to text.
Biadsy, however, teaches:
wherein the third machine learning model is a speech recognition model comprising vectors for detecting and transcribing spoken words to text. (P0040, A speech conversion model associated speech-to-text conversion system may include a speech-to-text conversion model (interchangeably referred to as an automated speech recognition (ASR) model) configured to perform speech recognition on the utterance of atypical speech by converting the input audio data into the canonical textual representation (i.e., transcription) of the utterance.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have a third machine learning model as a speech recognition model. It would have been obvious to combine the references because using a speech recognition model is a known technique to yield the predictable result of converting speech to text.
Regarding claim 21, Park in view of Biadsy and further in view of Li teach claim 1.
Park does not specifically teach:
dynamically modifying the baseline speech characteristics of the third learning model to account for differences with the speech characteristics of the particular user in response to the first biasing of the one or more vectors with the first vector of the first learning model; and
dynamically modifying the baseline audio characteristics of the third learning model to account for the differences with the audio characteristics of the audio signal captured in the audio stream in response to the second biasing of the one or more vectors with the second vector of the second learning model.
Biadsy, however, teaches:
dynamically modifying the baseline speech characteristics of the third learning model to account for differences with the speech characteristics of the particular user in response to the first biasing of the one or more vectors with the first vector of the first learning model; and (P0056, At operation 406, the method includes receiving a speech conversion request that includes input audio data corresponding to an utterance spoken by the target speaker associated with the atypical speech. At operation 408, the method includes biasing, using the speaker embedding generated for the target speaker by the speaker embedding network, the speech conversion model to convert the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into an output canonical representation of the utterance spoken by the target speaker.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify speech characteristics of the third learning model to account for differences with the speech characteristics of the particular user. It would have been obvious to combine the references because speech recognition machine learning models have difficulty recognizing speech spoken by speakers with atypical speech patterns, and utilizing speaker speech characteristics encoded in an embedding achieves acceptable accuracy for atypical speech patterns. (Biadsy P0003).
Park in view of Biadsy does not specifically teach:
dynamically modifying the baseline audio characteristics of the third learning model to account for the differences with the audio characteristics of the audio signal captured in the audio stream in response to the second biasing of the one or more vectors with the second vector of the second learning model.
Li, however, teaches:
dynamically modifying the baseline audio characteristics of the third learning model to account for the differences with the audio characteristics of the audio signal captured in the audio stream in response to the second biasing of the one or more vectors with the second vector of the second learning model. (P0027, The speech recognition decoder utilizes a VCDNN module to complete the recognition. In some examples, the VCDNN module may be incorporated into or used in conjunction with an acoustic model of the speech recognition decoder. The VCDNN module receives the captured speech or feature vectors and processes the speech or vectors utilizing the respective VCDNN. The VCDNN also incorporates the value of the environment variable.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify audio characteristics of the third learning model to account for differences with the audio characteristics. It would have been obvious to combine the references because incorporating environment variables into components of a deep neural network (DNN) for use in speech recognition systems provides higher-quality results across multiple environments. (Li P0018).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL WONSUK CHUNG whose telephone number is (571)272-1345. The examiner can normally be reached Monday - Friday, 7am-4pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, PIERRE-LOUIS DESIR can be reached at (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DANIEL W CHUNG/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659