DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on December 19, 2024, January 14, 2025, April 15, 2025, and October 14, 2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 2-5, 7-8, 10, 21-24, and 27-33 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Tomar et al. (US 20180358005 A1).
As to claim 2, Tomar discloses a computer-implemented method for generating an acoustic representation of an audio signal [Paragraph 0001], the method comprising:
receiving a request to generate an audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window [“The decision fusion module processes inputs using a contextual learning component at a given output time.” Paragraph 0057];
obtaining a semantic representation of the audio signal that specifies a respective semantic token at each of a plurality of first time steps spanning the time window [“The system obtains a semantic representation, the semantic representation is a semantic token or vector for different times.” Paragraph 0043]; and
generating, using one or more generative neural networks [205 on FIG. 2] and conditioned on at least the semantic representation, the acoustic representation of the audio signal, the acoustic representation specifying a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window [“The text processing component is implemented using one or more neural networks.” Paragraph 0055], wherein the set of one or more respective acoustic tokens [Acoustic input 101 on FIG. 1] at each of the plurality of second time steps comprises a plurality of acoustic tokens [Confidence score] that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step [“The decision fusion module contains both the predicted action by the system and a confidence score for the prediction of an output at a running time.” Paragraphs 0059 and 0060],
the residual vector quantization encodes the embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers and one or more fine vector quantizers [“The decision fusion module can take into account a confidence in the predicted intent and predicted text to choose the outcome of the more confidence system or vector quantization system as the final output and utilize a decision matrix to perform the hierarchy to decide which of the predicted intent or predicted text to choose.” Paragraph 0048], and
the set of acoustic tokens at each second time step comprises, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer [“The semantic representation is a fixed length vector in which entries represent a vocal expression or acoustic referring to the relevant semantic that users refer to control a device by voice.” Paragraph 0050].
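For clarity of the record, the residual vector quantization recited in claim 2 operates by having each vector quantizer in the hierarchy encode the residual left by the quantizers before it, with coarse quantizers at the first positions and fine quantizers at the last positions. A minimal illustrative sketch follows; the codebook sizes, embedding dimension, and function name are hypothetical and are not drawn from Tomar or from the instant claims.

```python
import numpy as np

def residual_vector_quantize(embedding, codebooks):
    """Encode one embedding with a hierarchy of vector quantizers.

    codebooks: list of (vocab_size, dim) arrays, ordered coarse to fine.
    Returns one acoustic token (codebook index) per quantizer; each
    quantizer encodes the residual left by the preceding quantizers.
    """
    tokens = []
    residual = embedding.copy()
    for codebook in codebooks:
        # Pick the codeword closest to the current residual.
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        tokens.append(index)
        # The next (finer) quantizer sees only what this one missed.
        residual = residual - codebook[index]
    return tokens

# Hypothetical hierarchy: four quantizers with 16-entry vocabularies
# over dimension-8 embeddings, purely for illustration.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
tokens = residual_vector_quantize(rng.normal(size=8), codebooks)
print(tokens)  # one token index per quantizer, coarse to fine
```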
As to claim 3, Tomar discloses the method of claim 2, wherein each semantic token is selected from a vocabulary [Library 204 on FIG. 2] of semantic tokens and represents semantic content of the audio signal at the corresponding first time step [“The library contains a representation of the knowledge of the possible acoustic inputs that the system would recognize at a given time.” Paragraph 0054].
As to claim 4, Tomar discloses the method of claim 2, wherein the one or more respective acoustic tokens at each second time step represent acoustic properties of the audio signal at the corresponding second time step [“The library contains a representation of the knowledge of the possible acoustic inputs that the system would recognize at a given time.” Paragraph 0054].
As to claim 5, Tomar discloses the method of claim 2, further comprising: processing at least the acoustic representation using a decoder neural network to generate a prediction of the audio signal [“The decision fusion module can take into account a confidence in the predicted intent and predicted text to choose the outcome of the more confidence system or vector.” Paragraph 0048].
As to claim 7, Tomar discloses the method of claim 2, wherein the acoustic representation is a prediction of a ground truth acoustic representation that would be generated from outputs of an encoder neural network by processing the audio signal [“The decision fusion module can take into account a confidence in the predicted intent and predicted text to choose the outcome of the more confidence system or vector.” Paragraph 0048].
As to claim 8, Tomar discloses the method of claim 7, wherein the encoder neural network outputs a respective embedding at each of the plurality of second time steps, and wherein the ground truth acoustic representation is generated by applying quantization to each of the respective embeddings [“The decision fusion module can take into account a confidence in the predicted intent and predicted text to choose the outcome of the more confidence system or vector quantization system as the final output and utilize a decision matrix to perform the hierarchy to decide which of the predicted intent or predicted text to choose.” Paragraph 0048].
As to claim 10, Tomar discloses the method of claim 2, wherein the hierarchy comprises the one or more coarse vector quantizers at one or more first positions in the hierarchy and the one or more fine vector quantizers at one or more last positions in the hierarchy [“The decision fusion module can take into account a confidence in the predicted intent and predicted text to choose the outcome of the more confidence system or vector quantization system as the final output and utilize a decision matrix to perform the hierarchy to decide which of the predicted intent or predicted text to choose.” Paragraph 0048].
As to claim 21, Tomar discloses the method of claim 2, wherein obtaining a semantic representation of the audio signal comprises: generating the semantic representation auto-regressively using a third generative neural network [“The text processing component is implemented using one or more neural networks.” Paragraph 0055].
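For illustration, auto-regressive generation as recited in claim 21 produces one semantic token per first time step, each conditioned on all previously generated tokens. The sketch below uses a toy stand-in for the recited third generative neural network; the vocabulary size and the stand-in model are hypothetical.

```python
import numpy as np

def generate_semantic_tokens(model, num_steps, vocab_size, rng):
    """Auto-regressive generation: each token is sampled conditioned
    on the prefix of previously generated tokens, one step at a time.

    `model` is any callable mapping a token prefix to a probability
    distribution over the semantic vocabulary (a stand-in for a
    generative neural network).
    """
    tokens = []
    for _ in range(num_steps):
        probs = model(tokens)  # p(next token | prefix)
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens

# Toy stand-in model: uniform distribution regardless of prefix.
rng = np.random.default_rng(0)
uniform = lambda prefix: np.full(64, 1 / 64)
print(generate_semantic_tokens(uniform, num_steps=8, vocab_size=64, rng=rng))
```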
As to claim 22, Tomar discloses the method of claim 2, wherein the request specifies a context for the audio signal and the audio signal is conditioned on the context [“The decision module is including a contextual learning for contextual information to improve the accuracy of the system.” Paragraph 0051].
As to claim 23, Tomar discloses the method of claim 22, wherein the context specifies semantic properties of the audio signal and wherein obtaining a semantic representation of the audio signal comprises: generating the semantic representation conditioned on the context [“The text processing component is implemented using one or more neural networks.” Paragraph 0055].
As to claim 24, Tomar discloses the method of claim 22, wherein the context specifies acoustic properties of the audio signal, and wherein generating, using one or more generative neural networks and conditioned on at least the semantic representation, the acoustic representation of the audio signal comprises: generating, using one or more generative neural networks and conditioned on the semantic representation and the context, the acoustic representation of the audio signal [“The text processing component is implemented using one or more neural networks.” Paragraph 0055].
As to claim 27, Tomar discloses the method of claim 22, wherein the context comprises an audio input [“The context is an acoustic input.” Paragraph 0051].
As to claim 28, Tomar discloses the method of claim 22, wherein the context comprises visual data [“The context is a fixed length vector.” Paragraph 0050].
As to claim 29, Tomar discloses the method of claim 22, wherein the context comprises text data [“The context is a textual data.” Paragraph 0051].
As to claim 30, Tomar discloses the method of claim 2, wherein a number of first time steps and a number of second time steps that span the time window is less than the number of output time steps that span the time window [“The library contains a representation of the knowledge of the possible acoustic inputs that the system would recognize at a given time.” Paragraph 0054].
As to claim 31, Tomar discloses the method of claim 30, wherein the number of first time steps that span the time window is less than the number of second time steps that span the time window [“The library contains a representation of the knowledge of the possible acoustic inputs that the system would recognize at a given time.” Paragraph 0054].
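Claims 30 and 31 recite that the first (semantic) and second (acoustic) time-step counts spanning the window are each smaller than the output sample count, and that the semantic sequence is shorter than the acoustic sequence. A worked numeric sketch follows; all rates are hypothetical and are chosen only to illustrate the recited relationships.

```python
# Hypothetical rates, chosen only to illustrate claims 30-31; they are
# not taken from Tomar or from the instant specification.
window_seconds = 10
sample_rate_hz = 24_000      # output audio samples per second
semantic_rate_hz = 25        # first time steps (semantic tokens)
acoustic_rate_hz = 50        # second time steps (acoustic tokens)

output_steps = window_seconds * sample_rate_hz    # 240,000
first_steps = window_seconds * semantic_rate_hz   # 250
second_steps = window_seconds * acoustic_rate_hz  # 500

# Claim 30: both token sequences are shorter than the audio itself.
assert first_steps < output_steps and second_steps < output_steps
# Claim 31: semantic tokens are coarser in time than acoustic tokens.
assert first_steps < second_steps
```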
As to claim 32, Tomar discloses a system [FIG. 1] comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for generating an acoustic representation of an audio signal [Paragraph 0001], wherein the operations comprise:
receiving a request to generate an audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window [“The decision fusion module processes inputs using a contextual learning component at a given output time.” Paragraph 0057];
obtaining a semantic representation of the audio signal that specifies a respective semantic token at each of a plurality of first time steps spanning the time window [“The system obtains a semantic representation, the semantic representation is a semantic token or vector for different times.” Paragraph 0043]; and
generating, using one or more generative neural networks [205 on FIG. 2] and conditioned on at least the semantic representation, the acoustic representation of the audio signal, the acoustic representation specifying a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window [“The text processing component is implemented using one or more neural networks.” Paragraph 0055], wherein the set of one or more respective acoustic tokens [Acoustic input 101 on FIG. 1] at each of the plurality of second time steps comprises a plurality of acoustic tokens [Confidence score] that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step [“The decision fusion module contains both the predicted action by the system and a confidence score for the prediction of an output at a running time.” Paragraphs 0059 and 0060],
the residual vector quantization encodes the embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers and one or more fine vector quantizers [“The decision fusion module can take into account a confidence in the predicted intent and predicted text to choose the outcome of the more confidence system or vector quantization system as the final output and utilize a decision matrix to perform the hierarchy to decide which of the predicted intent or predicted text to choose.” Paragraph 0048], and
the set of acoustic tokens at each second time step comprises, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer [“The semantic representation is a fixed length vector in which entries represent a vocal expression or acoustic referring to the relevant semantic that users refer to control a device by voice.” Paragraph 0050].
As to claim 33, Tomar discloses one or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for generating an acoustic representation of an audio signal [Paragraph 0077], wherein the operations comprise:
receiving a request to generate an audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window [“The decision fusion module processes inputs using a contextual learning component at a given output time.” Paragraph 0057];
obtaining a semantic representation of the audio signal that specifies a respective semantic token at each of a plurality of first time steps spanning the time window [“The system obtains a semantic representation, the semantic representation is a semantic token or vector for different times.” Paragraph 0043]; and
generating, using one or more generative neural networks [205 on FIG. 2] and conditioned on at least the semantic representation, the acoustic representation of the audio signal, the acoustic representation specifying a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window [“The text processing component is implemented using one or more neural networks.” Paragraph 0055], wherein the set of one or more respective acoustic tokens [Acoustic input 101 on FIG. 1] at each of the plurality of second time steps comprises a plurality of acoustic tokens [Confidence score] that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step [“The decision fusion module contains both the predicted action by the system and a confidence score for the prediction of an output at a running time.” Paragraphs 0059 and 0060],
the residual vector quantization encodes the embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers and one or more fine vector quantizers [“The decision fusion module can take into account a confidence in the predicted intent and predicted text to choose the outcome of the more confidence system or vector quantization system as the final output and utilize a decision matrix to perform the hierarchy to decide which of the predicted intent or predicted text to choose.” Paragraph 0048], and
the set of acoustic tokens at each second time step comprises, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer [“The semantic representation is a fixed length vector in which entries represent a vocal expression or acoustic referring to the relevant semantic that users refer to control a device by voice.” Paragraph 0050].
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA /25, or PTO/AIA /26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 2-33 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-39 of U.S. Patent No. 12,020,138 B2. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the instant application are taught by the claims of the U.S. Patent.
Patented claim 1 recites a computer-implemented method for generating a prediction of an audio signal that includes the feature of the one or more respective acoustic tokens at each second time step representing acoustic properties of the audio signal at the corresponding second time step.
Pending claim 2 recites a computer-implemented method for generating an acoustic representation of an audio signal that includes the similar feature that the set of one or more respective acoustic tokens at each of the plurality of second time steps comprises a plurality of acoustic tokens that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step.
Therefore, patented claim 1 anticipates pending claim 2.
Pending claims 3-33 have limitations similar to those of patented claims 2-39, as shown in the table below.
Pending claims (instant application):
2. A computer-implemented method for generating an acoustic representation of an audio signal, the method comprising: receiving a request to generate an audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window; obtaining a semantic representation of the audio signal that specifies a respective semantic token at each of a plurality of first time steps spanning the time window; and generating, using one or more generative neural networks and conditioned on at least the semantic representation, the acoustic representation of the audio signal, the acoustic representation specifying a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window, wherein the set of one or more respective acoustic tokens at each of the plurality of second time steps comprises a plurality of acoustic tokens that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step, the residual vector quantization encodes the embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers and one or more fine vector quantizers, and the set of acoustic tokens at each second time step comprises, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer.
3. The method of claim 2, wherein each semantic token is selected from a vocabulary of semantic tokens and represents semantic content of the audio signal at the corresponding first time step.
4. The method of claim 2, wherein the one or more respective acoustic tokens at each second time step represent acoustic properties of the audio signal at the corresponding second time step.
5. The method of claim 2, further comprising: processing at least the acoustic representation using a decoder neural network to generate a prediction of the audio signal.
6. The method of claim 5, wherein the decoder neural network is a decoder neural network of a neural audio codec that has been trained jointly with an encoder neural network on an objective that measures reconstruction quality of predicted audio signals generated by the decoder neural network from acoustic representations generated using outputs generated by the encoder neural network.
7. The method of claim 2, wherein the acoustic representation is a prediction of a ground truth acoustic representation that would be generated from outputs of an encoder neural network by processing the audio signal.
8. The method of claim 7, wherein the encoder neural network outputs a respective embedding at each of the plurality of second time steps, and wherein the ground truth acoustic representation is generated by applying quantization to each of the respective embeddings.
9. The method of claim 8, wherein: the quantization is residual vector quantization that encodes each respective embedding using the hierarchy of the plurality of vector quantizers, and the set of one or more respective acoustic tokens at each second time step comprise, for each vector quantizer, a respective acoustic token that is a prediction of a ground truth acoustic token that would be generated by the vector quantizer from a ground truth embedding generated by the encoder neural network at the second time step.
10. The method of claim 2, wherein the hierarchy comprises the one or more coarse vector quantizers at one or more first positions in the hierarchy and the one or more fine vector quantizers at one or more last positions in the hierarchy.
11. The method of claim 2, wherein generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal comprises: generating, using a first generative neural network and for each of the one or more coarse vector quantizers in the hierarchy, the respective acoustic tokens for the second time steps for the coarse vector quantizer conditioned on at least the semantic representation.
12. The method of claim 11, wherein the first generative neural network is an auto-regressive neural network that is configured to generate the acoustic tokens auto-regressively according to a first generation order, and wherein each particular acoustic token for each particular coarse vector quantizer and at each particular second time step is conditioned on at least the semantic representation and any acoustic tokens that precede the particular acoustic token in the first generation order.
13. The method of claim 12, wherein each particular acoustic token for each particular coarse vector quantizer and at each particular second time step is preceded in the first generation order by (i) any acoustic token for any of the coarse vector quantizers at any second time step that precedes the particular second time step and (ii) any acoustic tokens at the particular second time step for any coarse vector quantizers that precede the particular coarse vector quantizer in the hierarchy.
14. The method of claim 11, wherein the first generative neural network has a decoder-only Transformer architecture or an encoder-decoder Transformer architecture.
15. The method of claim 11, wherein generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal comprises: generating, using a second generative neural network and for each of the one or more fine vector quantizers in the hierarchy, the respective acoustic tokens for the second time steps for the fine vector quantizer conditioned on the respective acoustic tokens for the second time steps for the one or more coarse vector quantizers in the hierarchy.
16. The method of claim 15, wherein the second generative neural network is not conditioned on the semantic representation.
17. The method of claim 15, wherein the second generative neural network is an auto-regressive neural network that is configured to generate the acoustic tokens auto-regressively according to a second generation order, and wherein each particular acoustic token for each particular fine vector quantizer and at each particular second time step is conditioned on (i) the respective acoustic tokens for at least a subset of the second time steps for the one or more coarse vector quantizers and (ii) at least a subset of the acoustic tokens that precede the particular acoustic token in the second generation order.
18. The method of claim 17, wherein each particular acoustic token for each particular fine vector quantizer and at each particular second time step is preceded in the second generation order by (i) any acoustic token for any of the fine vector quantizers at any second time step that precedes the particular second time step and (ii) any acoustic tokens at the particular second time step for any fine vector quantizers that precede the particular fine vector quantizer in the hierarchy.
19. The method of claim 17, wherein each particular acoustic token for each particular fine vector quantizer and at each particular second time step is conditioned on (i) the respective acoustic tokens for the one or more coarse vector quantizers that are at most a threshold number of second time steps before the second time step and (ii) any acoustic tokens that precede the particular second time step in the second generation order and that are at second time steps that are at most a threshold number of second time steps before the second time step.
20. The method of claim 15, wherein the second generative neural network has a decoder-only Transformer architecture or an encoder-decoder Transformer architecture.
21. The method of claim 2, wherein obtaining a semantic representation of the audio signal comprises: generating the semantic representation auto-regressively using a third generative neural network.
22. The method of claim 2, wherein the request specifies a context for the audio signal and the audio signal is conditioned on the context.
23. The method of claim 22, wherein the context specifies semantic properties of the audio signal and wherein obtaining a semantic representation of the audio signal comprises: generating the semantic representation conditioned on the context.
24. The method of claim 22, wherein the context specifies acoustic properties of the audio signal, and wherein generating, using one or more generative neural networks and conditioned on at least the semantic representation, the acoustic representation of the audio signal comprises: generating, using one or more generative neural networks and conditioned on the semantic representation and the context, the acoustic representation of the audio signal.
25. The method of claim 24, wherein the set of one or more respective acoustic tokens at each of the plurality of second time steps comprises a plurality of acoustic tokens that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step, the residual vector quantization encodes the embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers at one or more first positions in the hierarchy and one or more fine vector quantizers at one or more last positions in the hierarchy, and the set of acoustic tokens at each second time step comprises, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer, and wherein generating, using one or more generative neural networks and conditioned on the semantic representation and the context, an acoustic representation of the audio signal comprises: generating, using a first generative neural network and for each of the one or more coarse vector quantizers in the hierarchy, the respective acoustic tokens for the second time steps for the coarse vector quantizer conditioned on the semantic representation and the context.
26. The method of claim 24, wherein processing at least the acoustic representation using a decoder neural network to generate the prediction of the audio signal comprises: processing the acoustic representation and an acoustic representation of the context using the decoder neural network to generate the prediction of the audio signal.
27. The method of claim 22, wherein the context comprises an audio input.
28. The method of claim 22, wherein the context comprises visual data.
29. The method of claim 22, wherein the context comprises text data.
30. The method of claim 2, wherein a number of first time steps and a number of second time steps that span the time window is less than the number of output time steps that span the time window.
31. The method of claim 30, wherein the number of first time steps that span the time window is less than the number of second time steps that span the time window.
32. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for generating an acoustic representation of an audio signal, wherein the operations comprise: receiving a request to generate an audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window; obtaining a semantic representation of the audio signal that specifies a respective semantic token at each of a plurality of first time steps spanning the time window; and generating, using one or more generative neural networks and conditioned on at least the semantic representation, the acoustic representation of the audio signal, the acoustic representation specifying a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window, wherein the set of one or more respective acoustic tokens at each of the plurality of second time steps comprises a plurality of acoustic tokens that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step, the residual vector quantization encodes the embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers and one or more fine vector quantizers, and the set of acoustic tokens at each second time step comprises, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer.
33. One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for generating an acoustic representation of an audio signal, wherein the operations comprise: receiving a request to generate an audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window; obtaining a semantic representation of the audio signal that specifies a respective semantic token at each of a plurality of first time steps spanning the time window; and generating, using one or more generative neural networks and conditioned on at least the semantic representation, the acoustic representation of the audio signal, the acoustic representation specifying a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window, wherein the set of one or more respective acoustic tokens at each of the plurality of second time steps comprises a plurality of acoustic tokens that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step, the residual vector quantization encodes the embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers and one or more fine vector quantizers, and the set of acoustic tokens at each second time step comprises, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer.
Patented claims (U.S. Patent No. 12,020,138 B2):
1. A computer-implemented method for generating a prediction of an audio signal, the method comprising: receiving a request to generate an audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window; obtaining a semantic representation of the audio signal that specifies a respective semantic token at each of a plurality of first time steps spanning the time window, each semantic token being selected from a vocabulary of semantic tokens and representing semantic content of the audio signal at the corresponding first time step; generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal, the acoustic representation specifying a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window, the one or more respective acoustic tokens at each second time step representing acoustic properties of the audio signal at the corresponding second time step; and processing at least the acoustic representation using a decoder neural network to generate the prediction of the audio signal.
2. The method of claim 1, wherein the decoder neural network is a decoder neural network of a neural audio codec that has been trained jointly with an encoder neural network on an objective that measures reconstruction quality of predicted audio signals generated by the decoder neural network from acoustic representations generated using outputs generated by the encoder neural network.
3. The method of claim 1, wherein the acoustic representation is a prediction of a ground truth acoustic representation that would be generated from outputs of an encoder neural network by processing the audio signal.
4. The method of claim 3, wherein the encoder neural network outputs a respective embedding at each of the plurality of second time steps, and wherein the ground truth acoustic representation is generated by applying quantization to each of the respective embeddings.
5. The method of claim 4, wherein: the quantization is residual vector quantization that encodes each embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers at one or more first positions in the hierarchy and one or more fine vector quantizers at one or more last positions in the hierarchy, and the set of one or more respective acoustic tokens at each second time step comprise, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer and being a prediction of a ground truth acoustic token that would be generated by the vector quantizer from a ground truth embedding generated by the encoder neural network at the second time step.
6. The method of claim 1, wherein: the set of one or more respective acoustic tokens at each of the plurality of second time steps comprises a plurality of acoustic tokens that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step, the residual vector quantization encodes the embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers at one or more first positions in the hierarchy and one or more fine vector quantizers at one or more last positions in the hierarchy, and the set of acoustic tokens at each second time step comprises, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer.
7. The method of claim 6, wherein generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal comprises: generating, using a first generative neural network and for each of the one or more coarse vector quantizers in the hierarchy, the respective acoustic tokens for the second time steps for the coarse vector quantizer conditioned on at least the semantic representation.
8. The method of claim 7, wherein the first generative neural network is an auto-regressive neural network that is configured to generate the acoustic tokens auto-regressively according to a first generation order, and wherein each particular acoustic token for each particular coarse vector quantizer and at each particular second time step is conditioned on at least the semantic representation and any acoustic tokens that precede the particular acoustic token in the first generation order.
9. The method of claim 8, wherein each particular acoustic token for each particular coarse vector quantizer and at each particular second time step is preceded in the first generation order by (i) any acoustic token for any of the coarse vector quantizers at any second time step that precedes the particular second time step and (ii) any acoustic tokens at the particular second time step for any coarse vector quantizers that precede the particular coarse vector quantizer in the hierarchy.
10. The method of claim 7, wherein the first generative neural network has a decoder-only Transformer architecture or an encoder-decoder Transformer architecture.
11. The method of claim 7, wherein generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal comprises: generating, using a second generative neural network and for each of the one or more fine vector quantizers in the hierarchy, the respective acoustic tokens for the second time steps for the fine vector quantizer conditioned on the respective acoustic tokens for the second time steps for the one or more coarse vector quantizers in the hierarchy.
12. The method of claim 11, wherein the second generative neural network is not conditioned on the semantic representation.
13. The method of claim 11, wherein the second generative neural network is an auto-regressive neural network that is configured to generate the acoustic tokens auto-regressively according to a second generation order, and wherein each particular acoustic token for each particular fine vector quantizer and at each particular second time step is conditioned on (i) the respective acoustic tokens for at least a subset of the second time steps for the one or more coarse vector quantizers and (ii) at least a subset of the acoustic tokens that precede the particular acoustic token in the second generation order.
14. The method of claim 13, wherein each particular acoustic token for each particular fine vector quantizer and at each particular second time step is preceded in the second generation order by (i) any acoustic token for any of the fine vector quantizers at any second time step that precedes the particular second time step and (ii) any acoustic tokens at the particular second time step for any fine vector quantizers that precede the particular fine vector quantizer in the hierarchy.
15. The method of claim 13, wherein each particular acoustic token for each particular fine vector quantizer and at each particular second time step is conditioned on (i) the respective acoustic tokens for the one or more coarse vector quantizers that are at most a threshold number of second time steps before the second time step and (ii) any acoustic tokens that precede the particular second time step in the second generation order and that are at second time steps that are at most a threshold number of second time steps before the second time step.
16. The method of claim 11, wherein the second generative neural network has a decoder-only Transformer architecture or an encoder-decoder Transformer architecture.
17. The method of claim 1, wherein obtaining a semantic representation of the audio signal comprises: generating the semantic representation auto-regressively using a third generative neural network.
18. The method of claim 1, wherein the request specifies a context for the audio signal and the audio signal is conditioned on the context.
19. The method of claim 18, wherein the context specifies semantic properties of the audio signal and wherein obtaining a semantic representation of the audio signal comprises: generating the semantic representation conditioned on the context.
20. The method of claim 18, wherein the context specifies acoustic properties of the audio signal, and wherein generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal comprises: generating, using one or more generative neural networks and conditioned on the semantic representation and the context, an acoustic representation of the audio signal.
21. The method of claim 20, wherein the set of one or more respective acoustic tokens at each of the plurality of second time steps comprises a plurality of acoustic tokens that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step, the residual vector quantization encodes the embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers at one or more first positions in the hierarchy and one or more fine vector quantizers at one or more last positions in the hierarchy, and the set of acoustic tokens at each second time step comprises, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer, and wherein generating, using one or more generative neural networks and conditioned on the semantic representation and the context, an acoustic representation of the audio signal comprises: generating, using a first generative neural network and for each of the one or more coarse vector quantizers in the hierarchy, the respective acoustic tokens for the second time steps for the coarse vector quantizer conditioned on the semantic representation and the context.
22. The method of claim 20, wherein processing at least the acoustic representation using a decoder neural network to generate the prediction of the audio signal comprises: processing the acoustic representation and an acoustic representation of the context using the decoder neural network to generate the prediction of the audio signal.
23. The method of claim 18, wherein the context comprises an audio input.
24. The method of claim 18, wherein the context comprises visual data.
25. The method of claim 18, wherein the context comprises text data.
26. The method of claim 1, wherein a number of first time steps and a number of second time steps that span the time window is less than the number of output time steps that span the time window.
27. The method of claim 26, wherein the number of first time steps that span the time window is less than the number of second time steps that span the time window.
28. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising a method for generating a prediction of an audio signal, the method comprising: receiving a request to generate an audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window; obtaining a semantic representation of the audio signal that specifies a respective semantic token at each of a plurality of first time steps spanning the time window, each semantic token being selected from a vocabulary of semantic tokens and representing semantic content of the audio signal at the corresponding first time step; generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal, the acoustic representation specifying a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window, the one or more respective acoustic tokens at each second time step representing acoustic properties of the audio signal at the corresponding second time step; and processing at least the acoustic representation using a decoder neural network to generate the prediction of the audio signal.
29. The system of claim 28, wherein: the set of one or more respective acoustic tokens at each of the plurality of second time steps comprises a plurality of acoustic tokens that collectively represent a prediction of an output of a residual vector quantization applied to an embedding that represents acoustic properties of the audio signal at the second time step, the residual vector quantization encodes the embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers at one or more first positions in the hierarchy and one or more fine vector quantizers at one or more last positions in the hierarchy, and the set of acoustic tokens at each second time step comprises, for each vector quantizer, a respective acoustic token selected from the vocabulary for the vector quantizer.
30. The system of claim 29, wherein generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal comprises: generating, using a first generative neural network and for each of the one or more coarse vector quantizers in the hierarchy, the respective acoustic tokens for the second time steps for the coarse vector quantizer conditioned on at least the semantic representation.
31. The system of claim 30, wherein the first generative neural network is an auto-regressive neural network that is configured to generate the acoustic tokens auto-regressively according to a first generation order, and wherein each particular acoustic token for each particular coarse vector quantizer and at each particular second time step is conditioned on at least the semantic representation and any acoustic tokens that precede the particular acoustic token in the first generation order.
32. The system of claim 30, wherein generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal comprises: generating, using a second generative neural network and for each of the one or more fine vector quantizers in the hierarchy, the respective acoustic tokens for the second time steps for the fine vector quantizer conditioned on the respective acoustic tokens for the second time steps for the one or more coarse vector quantizers in the hierarchy.
33. The system of claim 32, wherein the second generative neural network is an auto-regressive neural network that is configured to generate the acoustic tokens auto-regressively according to a second generation order, and wherein each particular acoustic token for each particular fine vector quantizer and at each particular second time step is conditioned on (i) the respective acoustic tokens for at least a subset of the second time steps for the one or more coarse vector quantizers and (ii) at least a subset of the acoustic tokens that precede the particular acoustic token in the second generation order.
34. The system of claim 28, wherein obtaining a semantic representation of the audio signal comprises: generating the semantic representation auto-regressively using a third generative neural network.
35. The system of claim 28, wherein the request specifies a context for the audio signal and the audio signal is conditioned on the context.
36. The system of claim 35, wherein the context specifies semantic properties of the audio signal and wherein obtaining a semantic representation of the audio signal comprises: generating the semantic representation conditioned on the context.
37. The system of claim 35, wherein the context specifies acoustic properties of the audio signal, and wherein generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal comprises: generating, using one or more generative neural networks and conditioned on the semantic representation and the context, an acoustic representation of the audio signal.
38. The system of claim 35, wherein the context comprises an audio input.
39. One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising a method for generating a prediction of an audio signal, the method comprising: receiving a request to generate an audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window; obtaining a semantic representation of the audio signal that specifies a respective semantic token at each of a plurality of first time steps spanning the time window, each semantic token being selected from a vocabulary of semantic tokens and representing semantic content of the audio signal at the corresponding first time step; generating, using one or more generative neural networks and conditioned on at least the semantic representation, an acoustic representation of the audio signal, the acoustic representation specifying a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window, the one or more respective acoustic tokens at each second time step representing acoustic properties of the audio signal at the corresponding second time step; and processing at least the acoustic representation using a decoder neural network to generate the prediction of the audio signal.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892 form.
van den Oord et al. (US 20200126539 A1) discloses a method for performing speech recognition by generating a neural network output from an audio data input sequence, where the neural network output characterizes words spoken in the audio data input sequence.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GERALD GAUTHIER whose telephone number is (571)272-7539. The examiner can normally be reached 8:00 AM to 4:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, CAROLYN R EDWARDS, can be reached at (571) 270-7136. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/GERALD GAUTHIER/Primary Examiner, Art Unit 2692
January 15, 2026
/CAROLYN R EDWARDS/Supervisory Patent Examiner, Art Unit 2692