Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending.
Response to Arguments
Applicant’s arguments filed with respect to the amended independent claims (Claims 1, 18, and 20) and the previously cited prior art have been fully considered and are persuasive. Therefore, the rejection of Claims 1-20 under 35 U.S.C. 103 has been withdrawn. However, upon further consideration, a new ground of rejection is made in view of Chintalapudi et al. (US Patent 12,390,724 B2).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-8, 11, 15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chang (US Patent 11,915,690 B1) in view of Saxena (US Patent 12,182,506 B2), and further in view of Chintalapudi et al. (US Patent 12,390,724 B2).
Regarding Claims 1, 18 and 20, Chang teaches a method implemented using one or more processors (see Fig.2, Fig.17 (1704), Col.7, Line 14-18 and Col.34, Line 8-12), comprising:
receiving a stream of audio input frames (see Fig.2 (111), Fig.7 (111) and Col.12, Line 33-44, receiving a sequence of audio frames to be processed);
while the stream of audio input frames is received, tokenizing audio input frames received up to a current time step to generate a stream of audio input tokens (see Fig.2 (280) and Col.10, Line 62 - Col.11, Line 18, performing token embedding for the audio frames up to the current time frame in an autoregressive manner);
using a Transformer-based causal attention model to predict a stream of audio output tokens (see Fig.2 (230,232,234), Col.10, Line 62 – Col.11, Line 22 and Col.15, Line 13-18, using a self-attention transformer for encoding in an autoregressive manner where future frames are masked and not attended to), wherein using the Transformer-based causal attention model comprises iteratively applying the Transformer-based causal attention model (see Fig.2 (230), Fig.7 (730) and Col.10, Line 62 - Col.11, Line 22, encoding in an autoregressive manner) to:
at least some of the audio input tokens tokenized up to the current time step (see Fig.2 (230), Fig.7 (730), Col.11, Line 5-27 and Col.14, Line 57 – Col.15, Line 12, encoding the input frames up to the current time frame),
and detokenizing the stream of audio output tokens to generate a stream of audio output frames (see Fig.2 (145), Fig.7 (145), Fig.9 (920), Col.12, Line 33-44 and Col.17, Line 31-45, outputting acoustic data generated from the sequence of frames or tokens).
Chang fails to teach iteratively applying the Transformer-based causal attention model to at least some of the audio output tokens predicted up to the current time step, and causing the stream of audio output frames to be rendered using one or more speakers to generate audio output that is perceptible audibly.
Chang, however, teaches generating a sequence of output tokens or audio frames predicted up to the current time step (see Fig.2 (230), Fig.7 (730), Col.12, Line 33-44 and Col.15, Line 42-48, a sequence of output tokens generated from the transformer).
Saxena teaches iteratively applying a self-attention transformer to a sequence of output tokens one by one in order to generate the next output token (see Fig.5B (550,554,564) and Col.13, Line 36-56, an output token being fed back to the decoder of the transformer one by one to generate the next token in the sequence of output tokens).
Chintalapudi teaches outputting audio frames from an audio stream via a speaker of a gaming controller or headset (see Fig.9A (905), Fig.9B (955,960), Col.17, Line 32-40 and Col.18, Line 4-15).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the application, to include in Chang’s method the step of iteratively applying the Transformer-based causal attention model to at least some of the audio output tokens predicted up to the current time step, as taught by Saxena. The motivation would be to generate a sequence of output tokens for the entire input audio stream or the full utterance of a speaker.
It would have been further obvious to include in Chang’s method the step of causing the stream of audio output frames to be rendered using one or more speakers to generate audio output that is perceptible audibly, as taught by Chintalapudi. The motivation would be to output audio data to users of devices configured with speakers.
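For purposes of illustration only, the following is a minimal Python sketch of the kind of streaming loop recited in Claims 1, 18 and 20; the tokenizer, prediction function, and detokenizer below are hypothetical placeholders and are not drawn from Chang, Saxena, or Chintalapudi.

import numpy as np

# Hypothetical stand-ins; none of these functions come from the cited references.
def tokenize(frame):
    return [int(np.mean(frame) * 255) % 256]          # one toy "acoustic token" per frame

def predict_next_token(input_tokens, output_tokens):
    # A causal model may only attend to tokens already available at the current
    # time step: input tokens tokenized so far plus output tokens already predicted.
    return sum(input_tokens + output_tokens) % 256     # toy stand-in for attention

def detokenize(token):
    return np.full(160, token / 255.0)                 # toy 10 ms frame at 16 kHz

def stream(frames):
    input_tokens, output_tokens, output_frames = [], [], []
    for frame in frames:                               # stream of audio input frames
        input_tokens += tokenize(frame)                # tokenize up to the current time step
        next_tok = predict_next_token(input_tokens, output_tokens)
        output_tokens.append(next_tok)                 # autoregressive feedback of output tokens
        output_frames.append(detokenize(next_tok))     # stream of audio output frames
    return output_frames                               # would then be rendered via one or more speakers

frames = [np.random.rand(160) for _ in range(5)]
print(len(stream(frames)), "output frames generated")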
Regarding Claims 2 and 19, Chang teaches generating a sequence of input tokens and output tokens predicted up to the current time step (see Fig.2 (230), Col.11, Line 5-27 and Col.12, Line 33-44), but fails to disclose mixing audio input tokens with at least some of the audio output tokens predicted up to the current time step to generate a mixed stream of audio tokens.
Saxena, however, teaches using a self-attention transformer to combine output tokens with the incoming input token to generate a stream of tokens containing at least one input token and one output token (see Fig.5B (550,554,564) and Col.13, Line 36-56, the output tokens are being fed back to the decoder of the transformer to generate a sequence of tokens).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the application, to include in Chang’s method the step of mixing audio input tokens with at least some of the audio output tokens predicted up to the current time step to generate a mixed stream of audio tokens, as taught by Saxena. The motivation would be to generate a sequence of tokens corresponding to the entire input audio stream or the full utterance of a speaker.
Regarding Claim 3, Saxena further teaches wherein the mixed stream of audio tokens is iteratively processed using the Transformer-based causal attention model (see Fig.5B (550,554,564) and Col.13, Line 36-56, applying self-attention at the transformer by iteratively feeding the output tokens back to the decoder to generate the sequence of tokens).
Regarding Claim 4, Saxena further teaches interleaving audio input tokens of the stream of audio input tokens with at least some of the audio output tokens predicted up to the current time step (see Fig.5B (550,554,564) and Col.13, Line 36-56, feeding back the output tokens at the decoder of the transformer while processing the incoming input token to generate the sequence of tokens).
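For purposes of illustration only, a minimal Python sketch of interleaving input tokens with output tokens predicted up to the current time step follows; the strict one-to-one alternation used here is a hypothetical choice, not a pattern taught by Saxena.

def interleave(input_tokens, output_tokens):
    # Mix input tokens with the output tokens predicted so far into one stream.
    mixed = []
    for inp, out in zip(input_tokens, output_tokens):
        mixed.extend([inp, out])
    return mixed

print(interleave([1, 2, 3], [10, 20, 30]))  # [1, 10, 2, 20, 3, 30]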
Regarding Claim 5, Saxena further teaches wherein the Transformer-based causal attention model comprises a decoder-only transformer (see Col.14, Line 5-10).
Regarding Claim 6, Chang further teaches wherein the Transformer-based causal attention model comprises an encoder transformer and a decoder transformer operably coupled using cross attention (see Fig.2 (220,230,232), Col.7, Line 14-18 and Col.10, Line 37-41).
Regarding Claim 7, Chang further teaches wherein the decoder transformer attends to audio output tokens of the stream of audio output tokens (see Fig.2 (230) and Col.10, Line 37-41, the decoder takes the prior tokens as input in an autoregressive manner).
Regarding Claim 8, Chang further teaches wherein the encoder transformer attends to audio input tokens of the stream of audio input tokens (see Fig.2 (232,234,280) and Col.11, Line 5-11).
Regarding Claim 11, Chang further teaches wherein the stream of audio input tokens includes at least acoustic input tokens generated using a neural audio codec (see Fig.1 (135,140a), Col.5, Line 40-49 and Col.10, Line 41-44, the preprocessing component includes at least one encoder for encoding the microphone input data before processing at the transformer, where the acoustic tokens are generated).
Regarding Claim 15, Chang further teaches wherein the Transformer-based causal attention model comprises a first model used to process the at least some of the audio input tokens tokenized up to the current time step and at least some of the audio output tokens predicted up to the current time step to generate coarse acoustic tokens (see Fig.2 (230,232), Col.11, Line 5-27 and Col.12, Line 33-44, the first model being the encoder/decoder attention model for generating the input and output tokens), and a second model to process the coarse acoustic tokens to generate fine acoustic tokens (see Fig.2 (260,145) and Col.12, Line 2-10, using the Softmax model to generate the acoustic unit data as output).
Regarding Claim 17, Chang further teaches wherein during each iteration of the Transformer-based causal attention model, the Transformer-based causal attention model is applied to: a current audio state (see Fig.2 (230), Fig.7 (730) and Col.10, Line 62 – Col.11, Line 22, encoding in an autoregressive manner), wherein the current audio state was generated autoregressively based on one or more prior iterations of applying the Transformer-based causal attention model to prior audio input tokens (see Fig.2 (230), Fig.7 (730) and Col.10, Line 62 – Col.11, Line 22, encoding in an autoregressive manner); and one or more next audio input tokens (see Fig.2 (230), Fig.7 (730) and Col.10, Line 62 – Col.11, Line 22, encoding in an autoregressive manner).
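For purposes of illustration only, a minimal Python sketch of the iterative application described for Claim 17 follows; representing the current audio state as a running list of processed tokens is a hypothetical simplification, not the state representation of Chang.

def step(state, next_input_tokens):
    # One iteration: apply the model to the current audio state plus the next
    # audio input tokens, and carry the updated state to the next iteration.
    context = state + next_input_tokens
    predicted = sum(context) % 256        # toy stand-in for causal attention
    return context + [predicted], predicted

state = []
for chunk in ([3], [7], [11]):            # one or more next audio input tokens per iteration
    state, token = step(state, chunk)
    print(token)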
Claims 9 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Chang (US Patent 11,915,690 B1) in view of Saxena (US Patent 12,182,506 B2), further in view of Chintalapudi et al. (US Patent 12,390,724 B2), and further in view of Qian et al. (US Patent 10,699,700 B2).
Regarding Claim 9, Chang, Saxena and Chintalapudi teach the method of Claim 1, but they fail to teach wherein the Transformer-based causal attention model uses local attention.
Qian, however, teaches a transformer configured to use local attention that selectively attends to a context window of the input embedding sequence (see Fig.2 (213,214) and Col.8, Line 32-44).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the application, to configure the Transformer-based causal attention model to use local attention, as taught by Qian. The motivation would be to restrict each token’s attention to a fixed-size window of surrounding tokens for efficiently encoding the tokens.
Regarding Claim 10, Qian further teaches adjusting a future context length of the local attention to add a controllable lookahead (see Col.14, Line 21-34, the context window size can be configured to different sizes).
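For purposes of illustration only, a minimal Python sketch of a local attention mask with a controllable lookahead follows; the window and lookahead parameters are hypothetical and are not the specific configuration taught by Qian.

import numpy as np

def local_attention_mask(seq_len, window, lookahead):
    # True where position i may attend to position j: a fixed-size window of
    # `window` past positions plus a controllable `lookahead` of future positions.
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo = max(0, i - window)
        hi = min(seq_len, i + lookahead + 1)
        mask[i, lo:hi] = True
    return mask

print(local_attention_mask(6, window=2, lookahead=1).astype(int))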
Claims 12-14 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Chang (US Patent 11,915,690 B1) in view of Saxena (US Patent 12,182,506 B2), further in view of Chintalapudi et al. (US Patent 12,390,724 B2), and further in view of Agostinelli et al. (US Patent 11,915,689 B1).
Regarding Claim 12, Chang, Saxena and Chintalapudi teach the method of Claim 11 but fail to teach wherein the stream of audio input tokens further includes at least some semantic input tokens generated using a semantic tokenizer that is trained to capture, in the audio input tokens, semantic features of audio input frames.
Agostinelli, however, teaches generating semantic input tokens from an audio signal that represents semantic content of the audio signal (see Fig.1 (106,130) and Col.8, Line 16-27).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the application, to add to the method of Claim 11 the step of generating semantic input tokens using a semantic tokenizer that is trained to capture, in the audio input tokens, semantic features of the audio input frames, as taught by Agostinelli. The motivation would be to determine additional context information, such as phonetic and prosodic features, from the audio input.
Regarding Claim 13, Chang teaches wherein each of the audio input tokens includes an acoustic input token (see Fig.2 (230) and Col.10, Line 37-44) but Chang, Saxena and Chintalapudi fail to teach wherein each of the audio input tokens includes a semantic token.
Agostinelli, however, teaches generating semantic input tokens from an audio signal that represents semantic content of the audio signal (see Fig.1 (106,130) and Col.8, Line 16-27).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the application, to add to the method of Claim 12 the step of generating audio input tokens that include both acoustic and semantic tokens, as taught by Agostinelli. The motivation would be to determine additional context information, such as phonetic and prosodic features, from the audio input.
Regarding Claim 14, Chang teaches wherein each of the audio output tokens comprises a predicted acoustic output token (see Fig.2 (145), Fig.7 (145), Fig.9 (920), Col.12, Line 33-44 and Col.17, Line 31-45), and wherein the detokenizing comprises decoding the predicted acoustic output token (see Fig.2 (145), Fig.7 (145), Fig.9 (920), Col.12, Line 33-44 and Col.17, Line 31-45, outputting acoustic data generated from the sequence of frames or tokens), but Chang, Saxena and Chintalapudi fail to teach wherein each of the audio output tokens comprises a predicted semantic output token.
Agostinelli, however, teaches generating semantic input tokens from an audio signal that represents semantic content of the audio signal (see Fig.1 (106,130) and Col.8, Line 16-27).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the application, to add to the method of Claim 13 the step of generating predicted semantic output tokens, as taught by Agostinelli. The motivation would be to generate additional context information for predicting the output tokens.
Regarding Claim 16, Agostinelli further teaches wherein the semantic features include phonetic features of the audio input frames; prosodic features of the audio input frames; melodic features of the audio input frames; and rhythmic features of the audio input frames (see Fig.1 (106,130) and Col.8, Line 16-27).
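For purposes of illustration only, a minimal Python sketch of audio tokens that pair an acoustic token with a semantic token follows; the field names and values are hypothetical and are not drawn from Agostinelli or the claims.

from dataclasses import dataclass

@dataclass
class AudioToken:
    acoustic: int   # e.g., produced by a neural-audio-codec-style tokenizer
    semantic: int   # e.g., produced by a semantic tokenizer capturing phonetic/prosodic content

stream = [AudioToken(acoustic=a, semantic=s) for a, s in [(17, 3), (42, 3), (9, 5)]]
print([(t.acoustic, t.semantic) for t in stream])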
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VU B HANG whose telephone number is (571)272-0582.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan, can be reached at (571)272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/VU B HANG/Primary Examiner, Art Unit 2654