DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-25 are pending and have been examined.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 12/14/2023, 12/20/2023, 12/20/2024, 2/11/2025, 2/17/2025, and 3/26/2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements have been considered by the examiner and are attached.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1, 2, 5-8, 11-16, 18, and 20-24 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1 the limitations of “obtaining speech audio to be encoded”, “applying the speech audio to an audio encoder that is part of a neural network audio codec system that includes the audio encoder and an audio decoder, wherein the audio encoder and the audio decoder have been trained in an end-to-end manner”, “encoding the speech audio with the audio encoder to generate embedding vectors that represent a snapshot of speech audio attributes over successive timeframes of the speech audio”, and “generating from the embedding vectors, codeword indices to entries in a codebook”, as drafted, are processes that, under the broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. More specifically, these limitations describe the mental process of a human obtaining speech audio, applying a set of rules or instructions to obtain a representation of the speech in the mind or using a pen or pencil, and generating codeword indices for the representation using a codebook. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Regarding claim 18 the limitations of “encode speech audio to generate embedding vectors that represent a snapshot of speech audio attributes over successive timeframes of the speech audio, and to generate from the embedding vectors, codeword indices to entries in a codebook” and “a communication interface configured to transmit a bit stream that includes the codeword indices”, as drafted, are processes that, under the broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. More specifically, these limitations describe the mental process of a human applying a set of rules or instructions to obtain a representation of speech audio in the mind or using a pen or pencil, generating codeword indices for the representation using a codebook, and writing the codeword indices on paper. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
Regarding claim 22 the limitations of “obtaining text to be converted to speech audio”, “converting the text to speech vectors of a default voice prosody”, “mapping the speech vectors of the default voice prosody to speech vectors of a target voice prosody that is different from the default voice prosody”, and “decoding the speech vectors of the target voice prosody to produce output speech audio in the target voice prosody”, as drafted, are processes that, under the broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. More specifically, these limitations describe the mental process of a human obtaining text, converting the text into phonetic representations with prosodic symbols in the mind, mapping the phonetic representations and prosodic symbols to the phonetic representations and prosodic symbols of a target prosody in the mind or on paper using a pen or pencil, and thinking of speech in the target prosody using the phonetic representations and prosodic symbols. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application because the recitation of an apparatus in claim 18 reads on generalized computer components, based upon the claim interpretation wherein the structure is interpreted using P0120-P0132 of the specification. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generalized computer components to (1) obtain speech audio, apply a set of rules or instructions to obtain a representation of the speech in the mind or using a pen or pencil, and generate codeword indices for the representation using a codebook; (2) apply a set of rules or instructions to obtain a representation of speech audio in the mind or using a pen or pencil, generate codeword indices for the representation using a codebook, and write the codeword indices on paper; and (3) obtain text, convert the text into phonetic representations with prosodic symbols in the mind, map the phonetic representations and prosodic symbols to those of a target prosody in the mind or on paper using a pen or pencil, and think of speech in the target prosody using the phonetic representations and prosodic symbols, amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claims are not patent eligible.
With respect to claim 2, the claim recites “encoding comprises encapsulating speech and background noise characteristics jointly or separately”, which reads on a human applying a set of rules or instructions to obtain a representation of speech that encapsulates speech and background noise characteristics, in the mind or using a pen or pencil. No additional limitations are present.
With respect to claim 5, the claim recites “wherein the speech audio is to be converted to text, further comprising: decoding the codeword indices associated with the embedding vectors to produce text for the speech audio”, which reads on a human utilizing speech representation to think of text in the mind. No additional limitations are present.
With respect to claim 6, the claim recites “wherein the speech audio contains speech in a first language and the codeword indices include a sequence of first codeword indices to the codebook for the first language, further comprising”, “mapping the sequence of first codeword indices to a sequence of second codeword indices to a codebook for a second language”, and “decoding the sequence of second codeword indices to produce an output audio stream of the speech audio in the second language” which reads on a human mapping codeword indices found in a codebook to another codebook in the mind. No additional limitations are present.
With respect to claim 7, the claim recites “wherein the speech audio contains speech of a first prosody and the codeword indices include a sequence of first codeword indices to the codebook for the first prosody, further comprising”, “mapping the sequence of first codeword indices to a sequence of second codeword indices to a codebook for a second prosody”, and “decoding the sequence of second codeword indices to produce an output audio stream of the speech audio in the second prosody” which reads on a human mapping codeword indices found in a codebook to another codebook in the mind. No additional limitations are present.
With respect to claim 8, the claim recites “converting the embedding vectors to language embeddings that are suitable to be provided as input to a large language model for a text generation task or for a discriminative task”, which reads on a human converting speech representation to another representation in the mind. No additional limitations are present.
With respect to claim 11, the claim recites “wherein encoding comprises encoding the speech audio at any bit rate within a rate range, in increments, based on use of a corresponding number of codeword indices included in an audio packet”, which reads on a human applying instructions to different increments of speech data in the mind or on paper using a pen or pencil. No additional limitations are present.
With respect to claim 12, the claim recites “transmitting multiple audio packets within a network packet”, which reads on a human organizing speech representations and writing the representations on paper using a pen or pencil. No additional limitations are present.
With respect to claim 13, the claim recites “wherein transmitting comprises transmitting network packets, each network packet including a most recent audio packet in a sequence, the most recent audio packet encoded at a first bit rate R0 and a plurality of L previous audio packets in the sequence encoded at bit rates R1, ..., RL, respectively, wherein bit rate R0 is greater than bit rates R1, ..., RL”, which reads on a human organizing speech representations according to the increments where instructions are applied to speech data. No additional limitations are present.
With respect to claim 14, the claim recites “wherein bit rates R1, ..., RL are each a second bit rate”, which reads on a human organizing speech representations according to the increments where instructions are applied to speech data. No additional limitations are present.
With respect to claim 15, the claim recites “wherein the first bit rate is 6 kbps and the second bit rate is 1 kbps”, which reads on a human organizing speech representations according to the increments where instructions are applied to speech data. No additional limitations are present.
With respect to claim 16, the claim recites “decoding, with the audio decoder, the codeword indices to recover the speech audio”, which reads on a human finding speech representation using codeword indices and using the speech representation to think of speech audio in the mind. No additional limitations are present.
With respect to claim 20, the claim recites “wherein the audio encoder generates speech embedding vectors representing speech semantics and stationary attributes including volume, pitch modulation, and accents nuances”, which reads on a human applying a set of rules or instructions to obtain a representation of speech audio in the mind or using a pen or pencil that includes speech semantics and stationary attributes. No additional limitations are present.
With respect to claim 21, the claim recites “wherein the speech audio contains speech of a first prosody and the codeword indices include a sequence of first codeword indices to the codebook for the first prosody, wherein the one or more processors are configured to” and “map the sequence of first codeword indices to a sequence of second codeword indices to a codebook for a second prosody”, which reads on a human mapping codeword indices found in a codebook to another codebook in the mind. No additional limitations are present.
With respect to claim 23, the claim recites “wherein converting comprises generating first speech vectors representing speech semantics and stationary attributes including volume, pitch modulation, and accents nuances associated with the default voice prosody”, which reads on a human applying a set of rules or instructions to obtain a representation of speech audio in the mind or using a pen or pencil that includes speech semantics and stationary attributes. No additional limitations are present.
With respect to claim 24, the claim recites “wherein mapping comprises mapping the first speech vectors to second speech vectors for the target voice prosody”, which reads on a human mapping the phonetic representation and prosodic symbol to phonetic representation and prosodic symbols of a target prosody in the mind or on paper using a pen or pencil. No additional limitations are present.
These dependent claims do not remedy the failure to integrate the judicial exception into a practical application, and they likewise fail to include additional elements that are sufficient to amount to significantly more than the judicial exception.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-4, 11, and 16-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zeghidour et al. (U.S. PG Pub No. 20230186927), hereinafter Zeghidour.
Regarding claim 1 Zeghidour teaches:
A method comprising: (Abstract, Methods, systems and apparatus.)
obtaining speech audio to be encoded; (P0029, The audio waveform can originate from any suitable audio source. For example, the waveform can be a recording from an external audio device (e.g., speech from a microphone).)
applying the speech audio to an audio encoder that is part of a neural network audio codec system that includes the audio encoder and an audio decoder, wherein the audio encoder and the audio decoder have been trained in an end-to-end manner; (P0031, The audio waveform is processed (e.g., encoded) by the encoder to generate a sequence of feature vectors representing the waveform. Feature vectors (e.g., embeddings, latent representations) are compressed representations of waveforms that extract the most relevant information about their audio content. … the encoder neural network can use multiple convolutional layers with increasing strides to generate feature vectors at the lower sampling rate (e.g., lower temporal resolution).; P0041, The QFVs can then be processed (e.g., decoded) by the decoder to generate an audio waveform. The decoder generally mirrors the processes of the encoder by outputting waveforms starting from (quantized) feature vectors.; P0043, The neural network architecture can be trained using a training system. The training system can enable efficient general-purpose compression or tailored compression (e.g., speech-tailored) by utilizing a suitable set of training examples and various training procedures.)
encoding the speech audio with the audio encoder to generate embedding vectors that represent a snapshot of speech audio attributes over successive timeframes of the speech audio; and generating from the embedding vectors, codeword indices to entries in a codebook. (P0031, The audio waveform is processed (e.g., encoded) by the encoder to generate a sequence of feature vectors representing the waveform.; P0035, At the first vector quantizer, the quantizer can receive the feature vector and select a code vector from its codebook to represent the feature vector based on a smallest distance metric. A residual vector can be computed as the difference between the feature vector and the code vector representing the feature vector.)
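For illustration only, the residual vector quantization that Zeghidour describes in P0035 (select the nearest code vector, then pass the residual on to the next quantizer) can be sketched as follows. This is a minimal sketch in Python/NumPy; the function and variable names are hypothetical and are not drawn from the reference.

    import numpy as np

    def rvq_encode(feature, codebooks):
        # Encode one feature vector with a cascade of vector quantizers.
        # feature:   (D,) embedding produced by the audio encoder
        # codebooks: list of (K, D) arrays, one codebook per quantizer stage
        # Returns one codeword index per stage.
        indices = []
        residual = feature
        for codebook in codebooks:
            # Select the code vector with the smallest distance to the residual.
            distances = np.linalg.norm(codebook - residual, axis=1)
            index = int(np.argmin(distances))
            indices.append(index)
            # The next stage quantizes the remaining quantization error.
            residual = residual - codebook[index]
        return indices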
Regarding claim 2 Zeghidour teaches claim 1 and further teaches:
encoding comprises encapsulating speech and background noise characteristics jointly or separately. (P0029, The audio waveform can originate from any suitable audio source. For example, the waveform can be a recording from an external audio device (e.g., speech from a microphone), a purely digital production (e.g., electronic music), or generic audio such as sound effects and background noise (e.g., white noise, room tone). In some implementations, the audio compression system can perform audio enhancement, e.g., suppressing unwanted background noise, simultaneously when compressing the waveform.)
Regarding claim 3 Zeghidour teaches claim 1 and further teaches:
wherein the audio encoder is trained to encode degraded speech content into speech vectors by ignoring artifacts and impairments, and wherein encoding comprises producing speech embedding vectors that are more condensed compared to embedding vectors encompassing speech distorted by artifacts and impairments. (P0029, The audio compression system can perform audio enhancement, e.g., suppressing unwanted background noise, simultaneously when compressing the waveform.; P0052, In some cases, the target waveform is identical to the input waveform, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform can also be modified with respect to the input waveform to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples with certain qualities. For instance, the target waveform can be a speech enhanced version of the input waveform, such that the neural networks improve audio dialogue upon reconstruction of waveforms. Alternatively or in addition, the target waveform can be a denoised version of the input waveform, which trains the networks to suppress background noise.)
Regarding claim 4 Zeghidour teaches claim 3 and further teaches:
wherein encoding comprises generating speech embedding vectors representing speech semantics and stationary attributes including volume, pitch modulation, and accents nuances. (P0052, In some cases, the target waveform is identical to the input waveform, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform can also be modified with respect to the input waveform to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples with certain qualities. For instance, the target waveform can be a speech enhanced version of the input waveform, such that the neural networks improve audio dialogue upon reconstruction of waveforms. Alternatively or in addition, the target waveform can be a denoised version of the input waveform, which trains the networks to suppress background noise.)
Regarding claim 11 Zeghidour teaches claim 1 and further teaches:
wherein encoding comprises encoding the speech audio at any bit rate within a rate range, in increments, based on use of a corresponding number of codeword indices included in an audio packet. (P0057, To adequately train the neural networks for variable (e.g., scalable) bit rates, the training system can select a particular number nq of vector quantizers to be used for each training example, such that the number of quantizers differs between training examples. For instance, the training system can sample nq uniformly at random in [1; Nq], for each training example, and only use the first i=1 . . . nq quantizers in the sequence. Consequently, the networks are trained to encode and decode audio waveforms for all target bitrates corresponding to the range nq=1. . . Nq and no architectural changes are necessary for the encoder or decoder.)
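The variable-bit-rate training of P0057, in which the number of quantizers nq is sampled uniformly at random in [1, Nq] per training example, can be sketched in the same style; the helper passed in is hypothetical (e.g., the rvq_encode sketch following the claim 1 analysis above).

    import random

    def encode_for_training(feature, codebooks, rvq_encode):
        # Quantizer dropout: sample nq uniformly at random in [1, Nq] and use
        # only the first nq quantizers, so a single model learns to operate at
        # every target bitrate in the range nq = 1..Nq.
        # rvq_encode: callable (feature, codebooks) -> list of codeword indices
        nq = random.randint(1, len(codebooks))
        return rvq_encode(feature, codebooks[:nq])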
Regarding claim 16 Zeghidour teaches claim 1 and further teaches:
decoding, with the audio decoder, the codeword indices to recover the speech audio. (P0041, The QFVs can then be processed (e.g., decoded) by the decoder to generate an audio waveform. The decoder generally mirrors the processes of the encoder by outputting waveforms starting from (quantized) feature vectors.)
Regarding claim 17 Zeghidour teaches claim 1 and further teaches:
wherein the audio encoder and audio decoder have been trained with generative and adversarial loss functions of one or more deep neural network models in an end-to-end manner using clean speech audio distorted by artifacts and impairments. (P0044, In general, the training system can utilize unsupervised learning algorithms, semi-supervised learning algorithms, supervised learning algorithms, or more elaborate combinations of these. For example, the training system can balance reconstruction losses with adversarial losses to enable audio compression that is both faithful and perceptually similar to the original audio on playback.; P0049, The training system receives a set of training examples. Each training example includes a respective input audio waveform and a corresponding target audio waveform that the neural networks are trained to reconstruct. That is, using the objective function, the target waveform can be compared with a resulting output audio waveform to evaluate performance of the neural networks.; P0090, The target audio waveform of one or more of the training examples can be an enhanced version of the input audio waveform, such as a denoised version of the input audio waveform.)
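The balancing of reconstruction and adversarial losses described in P0044 can be sketched as follows; the loss weights and names are hypothetical rather than values disclosed by the reference.

    import numpy as np

    def codec_training_loss(target, reconstructed, disc_scores, w_rec=1.0, w_adv=0.1):
        # Generative (reconstruction) term: keep the output faithful to the target.
        rec_loss = float(np.mean((target - reconstructed) ** 2))
        # Adversarial term: reward reconstructions the discriminator rates as real.
        adv_loss = float(-np.mean(disc_scores))
        # Balance faithfulness against perceptual similarity, per P0044.
        return w_rec * rec_loss + w_adv * adv_loss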
Regarding claim 18 Zeghidour teaches:
An apparatus comprising: (Abstract, Methods, systems and apparatus, including computer programs encoded on computer storage media.)
one or more processors configured to execute instructions for an audio encoder to encode speech audio to generate embedding vectors that represent a snapshot of speech audio attributes over successive timeframes of the speech audio, and to generate from the embedding vectors, codeword indices to entries in a codebook; and (P0099, apparatus, devices, and machines for processing data, including by way of example a programmable processor.; P0031, The audio waveform is processed (e.g., encoded) by the encoder to generate a sequence of feature vectors representing the waveform.; P0035, At the first vector quantizer, the quantizer can receive the feature vector and select a code vector from its codebook to represent the feature vector based on a smallest distance metric. A residual vector can be computed as the difference between the feature vector and the code vector representing the feature vector.)
a communication interface configured to transmit a bit stream that includes the codeword indices. (P0008, The compression system can generate a compressed representation of an input audio waveform and transmit the compressed representation to a destination over a data communication network, e.g., a local area network, a wide area network, or the internet.)
Regarding claim 19 Zeghidour teaches claim 18 and further teaches:
wherein the audio encoder is trained to encode degraded speech content into speech vectors by ignoring artifacts and impairments, and wherein the audio encoder generates speech embedding vectors that are more condensed compared to embedding vectors encompassing speech distorted by artifacts and impairments. (P0029, The audio compression system can perform audio enhancement, e.g., suppressing unwanted background noise, simultaneously when compressing the waveform.; P0052, In some cases, the target waveform is identical to the input waveform, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform can also be modified with respect to the input waveform to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples with certain qualities. For instance, the target waveform can be a speech enhanced version of the input waveform, such that the neural networks improve audio dialogue upon reconstruction of waveforms. Alternatively or in addition, the target waveform can be a denoised version of the input waveform, which trains the networks to suppress background noise.)
Regarding claim 20 Zeghidour teaches claim 18 and further teaches:
wherein the audio encoder generates speech embedding vectors representing speech semantics and stationary attributes including volume, pitch modulation, and accents nuances. (P0052, In some cases, the target waveform is identical to the input waveform, which can train the neural networks towards faithful and perceptually similar reconstructions. However, the target waveform can also be modified with respect to the input waveform to encourage more sophisticated functionalities, such as joint compression and enhancement. The nature of the enhancement can be determined by designing training examples with certain qualities. For instance, the target waveform can be a speech enhanced version of the input waveform, such that the neural networks improve audio dialogue upon reconstruction of waveforms. Alternatively or in addition, the target waveform can be a denoised version of the input waveform, which trains the networks to suppress background noise.)
Claims 22-25 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Finkelstein et al. (U.S. PG Pub No. 20230064749), hereinafter Finkelstein.
Regarding claim 22 Finkelstein teaches:
A method comprising: (P0005, A method for synthesizing an input text utterance into expressive speech having an intended prosody and a target voice.)
obtaining text to be converted to speech audio; (P0032, Input text.)
converting the text to speech vectors of a default voice prosody; (P0033, The first TTS model is trained to produce an intermediate speech representation.)
mapping the speech vectors of the default voice prosody to speech vectors of a target voice prosody that is different from the default voice prosody; and (P0033, The second TTS model is trained to reproduce the intended prosody captured by the intermediate speech representation and generate expressive speech having the intended prosody in the target voice. That is, the second TTS model generates expressive speech with the intended prosody and having speaker characteristics associated with the target voice. Here, the target voice may be associated with an actor that never spoke any of the training utterances possessing the intended prosody.)
decoding the speech vectors of the target voice prosody to produce output speech audio in the target voice prosody. (Col. 10, Lines 28-38, the input audio signal can include speech spoken by a first speaker and the output audio signal can represent the same semantic content as the input speech but spoken by a different speaker. As one example of this, the input audio signal can include speech spoken by a first speaker with a first accent of a natural language and the output audio signal can represent the same semantic content as the input speech but spoken by a different speaker with a different accent of the natural language.)
Regarding claim 23 Finkelstein teaches claim 22 and further teaches:
wherein converting comprises generating first speech vectors representing speech semantics and stationary attributes including volume, pitch modulation, and accents nuances associated with the default voice prosody. (P0010, The first TTS model generates a corresponding reference audio signal including a training synthesized speech representation that captures the intended prosody of the corresponding utterance of human speech.)
Regarding claim 24 Finkelstein teaches claim 23 and further teaches:
wherein mapping comprises mapping the first speech vectors to second speech vectors for the target voice prosody. (P0033, The second TTS model is trained to reproduce the intended prosody captured by the intermediate speech representation and generate expressive speech having the intended prosody in the target voice. That is, the second TTS model generates expressive speech with the intended prosody and having speaker characteristics associated with the target voice. Here, the target voice may be associated with an actor that never spoke any of the training utterances possessing the intended prosody.)
Regarding claim 25 Finkelstein teaches claim 23 and further teaches:
wherein decoding is performed with an audio decoder that is part of a neural network audio codec system that includes an audio encoder and the audio decoder, which has been trained end-to-end with generative and adversarial loss functions of one or more deep neural network models, using clean speech audio distorted by artifacts and impairments. (P0034, The second TTS model may correspond to a prosody transfer model that includes an encoder portion and a decoder portion. Here, the prosody transfer model may correspond to a variational autoencoder (VAE) architecture or a sequence-to-sequence feature prediction network architecture.; P0018, Second TTS model includes a second neural network architecture.; P0010, Training, by the data processing hardware, using the corresponding transcript of the training data, the decoder portion of the second TTS model by decoding the corresponding utterance embedding encoded by the encoder portion into a predicted output audio signal of expressive speech having the intended prosody; generating gradients/losses between the predicted output audio signal and the corresponding reference audio signal; and back-propagating the gradients/losses through the second TTS model.)
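For illustration only, the staged pipeline recited in claim 22 (converting, mapping, decoding) can be sketched as follows; every callable here is a hypothetical placeholder rather than a disclosure of Finkelstein.

    def text_to_target_prosody_speech(text, tts_default, prosody_mapper, audio_decoder):
        # converting: text -> speech vectors in a default voice prosody
        default_vectors = tts_default(text)
        # mapping: default-prosody vectors -> target-prosody vectors
        target_vectors = prosody_mapper(default_vectors)
        # decoding: target-prosody vectors -> output speech audio
        return audio_decoder(target_vectors)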
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Zeghidour in view of Asghar (U.S. Patent No. 6347297).
Regarding claim 5 Zeghidour teaches claim 1.
Zeghidour does not specifically teach:
wherein the speech audio is to be converted to text, further comprising: decoding the codeword indices associated with the embedding vectors to produce text for the speech audio.
Asghar, however, teaches:
wherein the speech audio is to be converted to text, further comprising: decoding the codeword indices associated with the embedding vectors to produce text for the speech audio. (Col. 2, Lines 47-60, The extracted features representing input signal are segmented into short-term input signal frames and considered to be stationary within each frame for 10 to 30 msec duration. The extracted features may be represented by a D-dimensional vector and compared with predetermined, stored reference patterns by the pattern similarity operation. Similarity between the input signal pattern and the stored reference patterns 208 is determined in pattern similarity operation using well-known vector quantization processes. The vector quantization process yields spectral distortion or distance measures to quantify the score of fitness or closeness between the representation of input signal and each of the stored reference patterns.; Col. 6, Lines 43-51, In addition to matrix and vector quantization, the speech recognition system may further utilize probabilistic classification processes to further enhance speech recognition accuracy. Matrix and vector quantizers serve as front end speech classifiers to provide observation sequences, in the forms of respective classification vectors, to respective HMMs in order to characterize the HMMs during training. Each of the HMMs are preferably trained for a single word and may be gender specific.; Col. 6, Lines 62-66, A neural network, such as an MLP neural network, enhances recognition accuracy by processing input data generated by the mixer by determining the probabilities of each vocabulary word matching input signal.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to decode the codeword indices to produce text. It would have been obvious to combine the references because quantization of input speech improves speech recognition. (Asghar, Abstract)
Claims 6, 7, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Zeghidour in view of Agostinelli et al. (U.S. Patent No. 11915689), hereinafter Agostinelli.
Regarding claim 6 Zeghidour teaches claim 1.
Zeghidour does not specifically teach:
mapping the sequence of first codeword indices to a sequence of second codeword indices to a codebook for a second language; and
decoding the sequence of second codeword indices to produce an output audio stream of the speech audio in the second language.
Agostinelli, however, teaches:
mapping the sequence of first codeword indices to a sequence of second codeword indices to a codebook for a second language; and (Col. 7, Lines 59-63, The system can also include an embedding neural network. In some examples where the context includes an input, the system can process the input to map the input to one or more embedding tokens, also referred to as audio embedding tokens.)
decoding the sequence of second codeword indices to produce an output audio stream of the speech audio in the second language. (Col. 10, Lines 25-29, The input audio signal can include speech in one natural language and the output audio signal can represent speech in a target, different natural language that is a translation of the input speech into the target language.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to map indices to produce an output in the second language. It would have been obvious to combine the references because mapping of sequences is a known technique to yield a predictable result of translating speech.
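For illustration only, the claimed codebook-to-codebook mapping can be sketched as a per-index translation; the mapping table is hypothetical, and a learned sequence model could serve the same role.

    def map_codeword_indices(first_indices, index_map):
        # Translate indices into a first-language codebook to indices
        # into a second-language codebook, entry by entry.
        return [index_map[i] for i in first_indices]

    # e.g., map_codeword_indices([3, 7, 7, 1], {1: 4, 3: 9, 7: 2}) -> [9, 2, 2, 4]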
Regarding claim 7 Zeghidour teaches claim 1.
Zeghidour does not specifically teach:
mapping the sequence of first codeword indices to a sequence of second codeword indices to a codebook for a second prosody; and
decoding the sequence of second codeword indices to produce an output audio stream of the speech audio in the second prosody.
Agostinelli, however, teaches:
mapping the sequence of first codeword indices to a sequence of second codeword indices to a codebook for a second prosody; and (Col. 7, Lines 59-63, The system can also include an embedding neural network. In some examples where the context includes an input, the system can process the input to map the input to one or more embedding tokens, also referred to as audio embedding tokens.)
decoding the sequence of second codeword indices to produce an output audio stream of the speech audio in the second prosody. (Col. 10, Lines 28-38, the input audio signal can include speech spoken by a first speaker and the output audio signal can represent the same semantic content as the input speech but spoken by a different speaker. As one example of this, the input audio signal can include speech spoken by a first speaker with a first accent of a natural language and the output audio signal can represent the same semantic content as the input speech but spoken by a different speaker with a different accent of the natural language.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to map indices to produce an output in the second prosody. It would have been obvious to combine the references because mapping of sequences is a known technique to yield a predictable result of changing prosody.
Regarding claim 21 Zeghidour teaches claim 18.
Zeghidour does not specifically teach:
wherein the speech audio contains speech of a first prosody and the codeword indices include a sequence of first codeword indices to the codebook for the first prosody, wherein the one or more processors are configured to: map the sequence of first codeword indices to a sequence of second codeword indices to a codebook for a second prosody.
Agostinelli, however, teaches:
wherein the speech audio contains speech of a first prosody and the codeword indices include a sequence of first codeword indices to the codebook for the first prosody, wherein the one or more processors are configured to: map the sequence of first codeword indices to a sequence of second codeword indices to a codebook for a second prosody. (Col. 7, Lines 59-63, The system can also include an embedding neural network. In some examples where the context includes an input, the system can process the input to map the input to one or more embedding tokens, also referred to as audio embedding tokens.; Col. 10, Lines 28-38, the input audio signal can include speech spoken by a first speaker and the output audio signal can represent the same semantic content as the input speech but spoken by a different speaker. As one example of this, the input audio signal can include speech spoken by a first speaker with a first accent of a natural language and the output audio signal can represent the same semantic content as the input speech but spoken by a different speaker with a different accent of the natural language.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to map indices to produce an output in the second prosody. It would have been obvious to combine the references because mapping of sequences is a known technique to yield a predictable result of changing prosody.
Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Zeghidour in view of Wang et al. (U.S. PG Pub No. 20240386881), hereinafter Wang.
Regarding claim 8 Zeghidour teaches claim 1.
Zeghidour does not specifically teach:
converting the embedding vectors to language embeddings that are suitable to be provided as input to a large language model for a text generation task or for a discriminative task.
Wang, however, teaches:
converting the embedding vectors to language embeddings that are suitable to be provided as input to a large language model for a text generation task or for a discriminative task. (P0037, The computing system can map each audio embedding of the plurality of audio embeddings to a textual embedding using the speech adapter. The textual embeddings the audio embeddings are mapped to can be any textual token embeddings in the textual embedding space of an LLM.; Fig. 2, example of a search discriminative task.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to convert embedding vectors to language embeddings as input to a large language model. It would have been obvious to combine the references because the conversion avoids potential ASR misrecognitions and weaknesses in processing domain-specific entities. (Wang, P0004)
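The speech adapter that Wang describes in P0037 can be sketched as a learned projection from the audio embedding space into an LLM's token embedding space; the dimensions and names below are hypothetical.

    import numpy as np

    class SpeechAdapter:
        # Map audio embeddings into an LLM's textual embedding space via a
        # learned linear projection (a minimal stand-in for Wang's adapter).
        def __init__(self, audio_dim, text_dim, seed=0):
            rng = np.random.default_rng(seed)
            self.W = rng.normal(scale=0.02, size=(audio_dim, text_dim))

        def __call__(self, audio_embeddings):
            # (T, audio_dim) -> (T, text_dim), suitable to prepend to text tokens
            return audio_embeddings @ self.W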
Regarding claim 9 Zeghidour teaches claim 8.
Zeghidour does not specifically teach:
wherein converting is performed by a translator that has been trained across a broad spectrum of speaker profiles and content diversity to account for speech audio that results in embedding vectors of diverse sizes, to be invariant to speaker-specific variations, and to use one or more transformer models that account for different length sequences.
Wang, however, teaches:
wherein converting is performed by a translator that has been trained across a broad spectrum of speaker profiles and content diversity to account for speech audio that results in embedding vectors of diverse sizes, to be invariant to speaker-specific variations, and to use one or more transformer models that account for different length sequences. (P0019, The present disclosure is directed to a joint speech and language model (“SLM”) that can map speech into a text token embedding space without speech information loss. This can be accomplished by utilizing blank filtering, which is a technique that is used to reduce speech data sequence length to the same order of magnitude as the text token embedding space.; P0020, The resulting filtered frames (e.g., only frames containing speech) therefore only provide semantic relevant information from the speech input to the speech encoder, which makes fine-tuning the model easier.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to convert embedding vectors to language embeddings as input to a large language model. It would have been obvious to combine the references because the conversion avoids potential ASR misrecognitions and weaknesses in processing domain-specific entities. (Wang, P0004)
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Zeghidour in view of Wang and further in view of Coucheiro Limeres (U.S. PG Pub No. 20250149032).
Regarding claim 10 Zeghidour in view of Wang teaches claim 9.
Zeghidour in view of Wang does not specifically teach:
providing an end-of-sequence token to indicate that conversion for a sequence of indices is complete.
Coucheiro Limeres, however, teaches:
providing an end-of-sequence token to indicate that conversion for a sequence of indices is complete. (P0023, This marking could occur, for example, by the use of additional tokens that indicate the start and end of a command.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to provide an end-of-sequence token. It would have been obvious to combine the references because the end-of-sequence token addresses any potential ambiguity that might arise with normal conversational speech. (Coucheiro Limeres, P0023)
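A minimal sketch of the end-of-sequence delimiter at issue in claim 10 follows; the sentinel value is hypothetical.

    EOS_TOKEN = -1  # hypothetical sentinel marking that conversion is complete

    def emit_with_eos(indices):
        # Append an end-of-sequence token so a consumer knows the
        # conversion for this sequence of indices is complete.
        return list(indices) + [EOS_TOKEN]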
Claims 12, 13, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Zeghidour in view of Vos et al. (U.S. PG Pub No. 20110077940), hereinafter Vos.
Regarding claim 12 Zeghidour teaches claim 11.
Zeghidour does not specifically teach:
transmitting multiple audio packets within a network packet.
Vos, however, teaches:
transmitting multiple audio packets within a network packet. (P0028, The output bitstream representing a speech signal and comprising a residual signal encoded at a first rate, the error correction data comprising the residual signal encoded at a second rate lower than the first rate.; P0087, The output FEC bitstream generated for payload n is buffered in the buffer 522 in order to piggyback it to the bitstream for payload n+1 or payload n+2.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to transmit multiple audio packets within a network packet. It would have been obvious to combine the references because packets can be lost on transmission and FEC information can be used to decode an output signal for the lost packet. (Vos P0009)
Regarding claim 13 Zeghidour in view of Vos teaches claim 12.
Zeghidour does not specifically teach:
transmitting network packets, each network packet including a most recent audio packet in a sequence, the most recent audio packet encoded at a first bit rate R0 and a plurality of L previous audio packets in the sequence encoded at bit rates R1, ..., RL, respectively, wherein bit rate R0 is greater than bit rates R1, ..., RL.
Vos, however, teaches:
transmitting network packets, each network packet including a most recent audio packet in a sequence, the most recent audio packet encoded at a first bit rate R0 and a plurality of L previous audio packets in the sequence encoded at bit rates R1, ..., RL, respectively, wherein bit rate R0 is greater than bit rates R1, ..., RL. (P0030, A first signal-processing module configured to encode a residual signal at a first bit rate; a first arithmetic encoder configured to generate an output bitstream based on the residual signal encoded at the first bit rate; and a second signal-processing module configured to encode the residual signal at a second bit rate that is lower than the first bit rate and to generate error correction data based on the residual signal encoded at the second bit rate.; P0032, The encoder may further comprise a buffer configured to delay the error correction bitstream relative to the output bit stream.; P0033, The encoder may further comprise a gain adjustment module configured to control the quantization gain used to encode the residual information at the second bit rate to thereby control the second bit rate.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to transmit audio packets with previous audio packets with different bit rates. It would have been obvious to combine the references because packets can be lost on transmission and FEC information can be used to decode an output signal for the lost packet. (Vos P0009)
Regarding claim 14 Zeghidour in view of Vos teaches claim 13.
Zeghidour does not specifically teach:
wherein bit rates R1, ..., RL are each a second bit rate.
Vos, however, teaches:
wherein bit rates R1, ..., RL are each a second bit rate. (P0030, A first signal-processing module configured to encode a residual signal at a first bit rate; a first arithmetic encoder configured to generate an output bitstream based on the residual signal encoded at the first bit rate; and a second signal-processing module configured to encode the residual signal at a second bit rate that is lower than the first bit rate and to generate error correction data based on the residual signal encoded at the second bit rate.; P0032, The encoder may further comprise a buffer configured to delay the error correction bitstream relative to the output bit stream.; P0033, The encoder may further comprise a gain adjustment module configured to control the quantization gain used to encode the residual information at the second bit rate to thereby control the second bit rate.)
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Zeghidour in view of Vos and further in view of "SoundStream: An End-to-End Neural Audio Codec" by Zeghidour et al.
Regarding claim 15 Zeghidour in view of Vos teaches claim 14.
Zeghidour does not specifically teach:
wherein the first bit rate is 6 kbps and the second bit rate is 1 kbps.
The SoundStream paper by Zeghidour, however, teaches:
wherein the first bit rate is 6 kbps and the second bit rate is 1 kbps. (V. Results, SoundStream operates at three different bitrates: i) low (3 kbps); ii) medium (6 kbps); iii) high (12 kbps). … We observe similar results when SoundStream operates at 6 kbps and 12 kbps.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to utilize 6 kbps and 1 kbps. It would have been obvious to combine the references because the quantization and transmission bit rates were obvious to try, as choosing from a finite number of identified, predictable solutions carried a reasonable expectation of success. (SoundStream, V. Results)
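For illustration only, the redundancy scheme at issue in claims 12-15 (a newest audio packet encoded at a higher bit rate R0, bundled with lower-rate copies of the L previous packets) can be sketched as follows; the encoder callable and the framing are hypothetical.

    def build_network_packet(frames, encode_at, rates_kbps=(6, 1, 1)):
        # frames:     raw audio frames, newest last
        # encode_at:  callable (frame, bitrate_kbps) -> bytes
        # rates_kbps: (R0, R1, ..., RL) with R0 greater than R1..RL;
        #             6 kbps / 1 kbps echoes the rates recited in claim 15.
        newest_first = frames[::-1][:len(rates_kbps)]
        return [encode_at(f, r) for f, r in zip(newest_first, rates_kbps)]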
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL WONSUK CHUNG whose telephone number is (571)272-1345. The examiner can normally be reached Monday-Friday (7am-4pm PT).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, PIERRE-LOUIS DESIR can be reached at (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DANIEL W CHUNG/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659