DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In response to this Office action, the Examiner respectfully requests that support be shown for language added to any original claims upon amendment and for any new claims. That is, indicate support for newly added claim language by specifically pointing to the page(s) and line number(s) in the specification and/or drawing figure(s). This will assist the Examiner in prosecuting this application.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 8, 10, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Kurata et al. (US 20220208179 A1, hereinafter Kurata) in view of Rosenberg et al. (US 20240029715 A1, hereinafter Rosenberg).
Claim 1: Kurata teaches a computer-implemented method (title and abstract, ln 1-17, fig. 1) for training an Automatic Speech Recognition (ASR) model (a recurrent neural network transducer (RNN-T) in fig. 1, the RNN-T being trained according to the algorithm in fig. 2), the method comprising:
training an automatic speech recognition ASR computer model (the algorithm for training the RNN-T in fig. 2, para 61, and as a computer-implemented method, abstract) based on full utterance training data (X = (x1, …, xT) in 210 in fig. 2, para 30), the ASR computer model having an audio encoder (an encoder 130 in fig. 1), a text predictor (predictor network 140 is for text, para 28), and a joint network (joiner 150 in fig. 1) which combines outputs of both the audio encoder and the text predictor (for combining the outputs of the encoder 130 and the predictor 140, para 28); and executing fine-tuning training on the trained ASR computer model (trained RNN-T as a teacher neural network configured to learn ASR and preparing for a student neural network, para 131), except that the execution of the fine-tuning training on the trained ASR computer model is at least by:
receiving full utterance data as input to a knowledge distillation framework;
executing, by the knowledge distillation framework, a chunking operation on the full utterance data to generate a plurality of data chunks corresponding to full utterances in the full utterance data; and
executing, by the knowledge distillation framework, a knowledge distillation operation with two encoder embeddings, wherein the two encoder embeddings comprise a first encoder embedding obtained from the full utterance data, and a second encoder embedding obtained from the data chunks corresponding to the full utterances in the full utterance data, wherein operational parameters of the trained ASR model are updated based on a loss determined from the first encoder embedding and second encoder embedding.
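For illustration only, outside the record of this Office action, the fine-tuning scheme recited above (chunking a full utterance and distilling between an embedding of the full utterance and an embedding of its chunks) might be sketched as follows; the toy linear encoder and all names (`chunk`, `encode`, etc.) are hypothetical and not drawn from any cited reference:

```python
import numpy as np

def chunk(utterance: np.ndarray, chunk_size: int) -> list:
    """Split a full-utterance feature sequence (T, D) into fixed-size chunks."""
    return [utterance[i:i + chunk_size] for i in range(0, len(utterance), chunk_size)]

def encode(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy 'encoder': a linear projection followed by mean pooling over time."""
    return np.tanh(features @ W).mean(axis=0)

rng = np.random.default_rng(0)
full_utt = rng.normal(size=(20, 8))      # one full utterance: 20 frames, 8 features
W = rng.normal(scale=0.1, size=(8, 4))   # shared encoder parameters

emb_full = encode(full_utt, W)                                 # first encoder embedding
chunks = chunk(full_utt, chunk_size=5)                         # chunking operation
emb_chunked = np.mean([encode(c, W) for c in chunks], axis=0)  # second encoder embedding

# A loss determined from the two embeddings; in an actual training loop the
# encoder parameters would be updated from the gradient of this loss.
loss = float(np.mean((emb_full - emb_chunked) ** 2))
print(f"distillation loss: {loss:.6f}")
```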
Rosenberg teaches an analogous field of endeavor by disclosing a computer-implemented method for training an ASR model (title and abstract, ln 1-15, training process 300 in fig. 3A, para 29, and training an encoder in the speech recognition model of fig. 2), and wherein Rosenberg teaches training an automatic speech recognition ASR computer model based on full utterance training data (training before fine-tuning the speech encoder 204 included in the ASR model in fig. 2, para 61, and as a computer-implemented method, para 4), the ASR computer model (fig. 2) having an audio encoder (encoder 210 in fig. 2), a text predictor (prediction network 220 in fig. 2), and a joint network which combines outputs of both the audio encoder and the text predictor (joint network 230 by taking outputs from the prediction network 220 and encoder 210 in fig. 2, para 26); and
wherein an execution of a fine-tuning training on the trained ASR computer model is disclosed (speech recognition system in fig. 2, and the training process 500 in figs. 5A-5C for the TTS model fine-tunes the pre-trained speech encoder 204 and text encoder 202, para 61) to be at least by:
receiving full utterance data as input (one training utterance pair of a reference speech 504 and the corresponding input text 502 to the reference speech 504 in figs. 5A-5C) to a knowledge distillation framework (including the alignment model 400, masking module 218, convolution block 212, etc., in fig. 5A, or the alignment model 400, speech encoder 204 and text encoder 202, modality loss 505, etc., in figs. 5B-5C, providing encoded audio and text as embeddings to the shared encoder 250 in figs. 5A-5C, wherein the alignment model 400 is shown in fig. 4);
executing, by the knowledge distillation framework, a chunking operation on the full utterance data (taking the input text 502 corresponding to the reference speech 504) to generate a plurality of data chunks corresponding to full utterances in the full utterance data (predicting continuous phoneme durations 422, as the chunking operation on the input text in fig. 4, para 39); and
executing, by the knowledge distillation framework, a knowledge distillation operation with two encoder embeddings (211m, 213m, and input to Quantizer 127 in fig. 5A, or encoded speech 514 and encoded text by taking chunks from the alignment model 400 in figs. 5B-5C), wherein the two encoder embeddings comprise a first encoder embedding obtained from the full utterance data (211, 213 to the quantizer 127 in fig. 5A, or encoded speech 514 by taking reference speech 504 in figs. 5B-5C), and a second encoder embedding obtained from the data chunks corresponding to the full utterances in the full utterance data (211m, 213m in fig. 5A, or encoded text 512 in figs. 5B-5C, and both are exchangeable, para 44), wherein operational parameters of the trained ASR model are updated based on a loss determined from the first encoder embedding and second encoder embedding (the parameters of the encoder 210 are trained through the contrastive loss in equation 3, para 45), for the benefits of broadening the application of the ASR (fitting a target language although the training language is different, abstract) and improving the quality of the ASR system (by using a pronunciation model in the alignment model that converts a script of the unspoken textual utterances in the target language into phonetic representations across multiple languages, para 38).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied the executing of the fine-tuning training on the trained ASR computer model performed at least by receiving the full utterance data as the input to the knowledge distillation framework; executing, by the knowledge distillation framework, the chunking operation on the full utterance data to generate the plurality of data chunks corresponding to full utterances in the full utterance data; and executing, by the knowledge distillation framework, the knowledge distillation operation with two encoder embeddings, wherein the two encoder embeddings comprise the first encoder embedding obtained from the full utterance data, and the second encoder embedding obtained from the data chunks corresponding to the full utterances in the full utterance data, wherein operational parameters of the trained ASR model are updated based on the loss determined from the first encoder embedding and second encoder embedding, as taught by Rosenberg, to the executing of the fine-tuning training on the trained ASR computer model in the computer-implemented method, as taught by Kurata, for the benefits discussed above.
Claim 10 has been analyzed and rejected according to claim 1 above, and the combination of Kurata and Rosenberg further teaches a computer program product comprising a computer readable storage medium having a computer readable program stored therein (Kurata, memory ROM 508, cache 506, RAM 510, etc., para 89, and Rosenberg, non-transitory memory 820 in fig. 8, para 80), wherein the computer readable program, when executed on a computing device, causes the computing device (Kurata, GPU 505, CPU 504, para 89, and Rosenberg, computer program product, para 81, implemented by FPGA, ASIC hardware, para 86) to implement the computer-implemented method of claim 1 (discussed in claim 1 above).
Claim 18 has been analyzed and rejected according to claims 1, 10 above.
Claim 8: the combination of Kurata and Rosenberg further teaches, according to claim 1 above, wherein each data chunk corresponds to a portion of a single word or a short utterance comprising multiple words but less than a corresponding full utterance in the full utterance data (Rosenberg, text chunks having a duration of word, word-piece, phoneme, and/or grapheme duration, para 34).
Claims 2, 4-6, 9, 11, 13-15, 17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kurata (above) in view of Rosenberg (above) and Leal et al. (US 20220309340 A1, hereinafter Leal).
Claim 2: the combination of Kurata and Rosenberg teaches the knowledge distillation framework having the encoder (Rosenberg, speech encoder 204 and text encoder 202 in figs. 5A-5C, as discussed in claim 1 above), according to claim 1 above, and a cross-entropy loss used for training the language model (Kurata, cross-entropy (CE) loss used for training the language model and used for initializing the encoder and predictor, para 45), except wherein a teacher encoder of the knowledge distillation framework generates the first encoder embedding from the full utterance data, and a student encoder of the knowledge distillation framework generates the second encoder embedding from the chunking data, and wherein the loss is based on a cross-entropy loss computed between the first encoder embedding from the teacher encoder and the second encoder embedding from the student encoder.
Leal teaches an analogous field of endeavor by disclosing a computer-implemented method for training an automatic speech recognition ASR model (title and abstract, ln 1-17, figs. 2A-2B, and implemented as a computer-implemented method, para 4), and wherein a teacher encoder (teacher models 210a-210n with different languages, each of the teacher models 210a-210n having the RNN-T model architecture, para 29, and the RNN-T model in fig. 3, with an encoder 310 in fig. 3) of a knowledge distillation framework (a framework in figs. 2A/2B including an RNN-T model as adaptive model 200 of the speech recognition system 140 in figs. 1-2, para 25) generates the first encoder embedding from the full utterance data (distillation processes 220 based on the sample database 150 in fig. 2A, para 27), and a student encoder of the knowledge distillation framework (multilingual student model 200 as an RNN-T model in fig. 2A, para 25) generates the second encoder embedding from the chunking data (output from the encoder of the RNN-T of model 200 based on the student training examples 154a-n as chunks of the audible signal in fig. 2A, para 27-28), and wherein the loss (a total loss as a combination including the distillation loss, para 28) is based on a distillation loss (para 28) computed between the first encoder embedding from the teacher encoder and the second encoder embedding from the student encoder (the distillation loss includes a decreasing function that may decrease the first RNN-T loss corresponding to the teacher models and increase the second RNN-T loss corresponding to the student ASR model over the instant of time, para 6), for the benefits of improving ASR performance (by increasing accuracy and robustness, para 3, by adapting the ASR to different natural languages, para 25, and by raising efficiency based on a distilled neural network of compact size, para 26).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied the knowledge distillation framework and wherein the teacher encoder of the knowledge distillation framework generates the first encoder embedding from the full utterance data, and the student encoder of the knowledge distillation framework generates the second encoder embedding from the chunking data, and wherein the loss is based on the distillation loss computed between the first encoder embedding from the teacher encoder and the second encoder embedding from the student encoder, as taught by Leal, to the knowledge distillation framework in the computer-implemented method, as taught by the combination of Kurata and Rosenberg, for the benefits discussed above.
However, the combination of Kurata, Rosenberg, and Leal does not explicitly teach that the distillation loss is cross-entropy loss.
An Official Notice is taken that cross-entropy loss is notoriously well-known as a standard loss function in the art for accelerating convergence, strong gradients to avoid slow learning steps, and handling large output categories in model optimization.
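For context only (not part of the record), the cross-entropy referenced above is the standard quantity H(p, q) = −Σᵢ pᵢ log qᵢ between a target distribution p and a predicted distribution q; a minimal numerical sketch with hypothetical values:

```python
import numpy as np

def cross_entropy(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Cross-entropy H(p, q) = -sum_i p_i * log(q_i) between two distributions."""
    return float(-np.sum(p * np.log(q + eps)))

teacher = np.array([0.7, 0.2, 0.1])   # e.g. a teacher's softened output distribution
student = np.array([0.6, 0.3, 0.1])   # e.g. a student's predicted distribution

print(f"H(teacher, student) = {cross_entropy(teacher, student):.4f}")
# H(p, q) is minimized (and equals the entropy of p) when q matches p exactly,
# which is why minimizing it pulls the student's distribution toward the teacher's.
print(f"H(teacher, teacher) = {cross_entropy(teacher, teacher):.4f}")
```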
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied the cross-entropy loss in the model optimization, as is well known in the art, to the distillation loss in the computer-implemented method, as taught by the combination of Kurata, Rosenberg, and Leal, for the benefits discussed above.
Claim 4: the combination of Kurata, Rosenberg, and Leal further teaches, according to claim 2 above, wherein the student encoder masks one or more portions of embeddings generated by the student encoder (Kurata, the student model and the teacher model, as discussed in claim 1 above, and Leal, the student RNN-T or adaptive model is trained by weights 222, 222b applied to the student training examples in fig. 2A, para 28), such that the first encoder embedding comprises a first portion having encodings corresponding to the data chunks (Leal, distillation weight 222, 222a from the teacher's encoder model in fig. 2A), and a second portion having masked encodings (from 220, 220b or student training examples by applying the training weight 222, 222b in fig. 2A, para 28).
Claim 5: the combination of Kurata, Rosenberg, and Leal further teaches, according to claim 2 above, wherein the knowledge distillation is executed at an intermediate layer of the teacher encoder and student encoder with a cross-layer knowledge distillation loss (Leal, by taking the distilling process of the teacher's RNN-T model from multiple teacher RNN-T models, each having its own encoder in the RNN-T model, and the encoder of the RNN-T as adaptive model 200 as the multilingual student model in fig. 2A, and the student model 200 is trained by both the distillation process 220 from the teacher's model and its own training process 220, 220b, para 27).
Claim 6: the combination of Kurata, Rosenberg, and Leal further teaches, according to claim 2 above, wherein the chunking operation further comprises swapping adjacent data chunks prior to inputting the data chunks into the student encoder (Kurata, the student model, discussed in claim 1 above, and Rosenberg, chunks including word-pieces and words, phonemes, and graphemes, and thus swapping adjacent data chunks is inherent for words to word-pieces or vice versa, for example, words swapped to word-pieces, or swapping words based on phonemes or graphemes, or vice versa, etc., para 30, and Leal, the student encoder in the RNN-T adaptive model 200 by taking the distillation processing from the teacher's model in fig. 2A).
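Purely as an illustration of the claim 6 limitation discussed above (not taken from any cited reference), swapping adjacent data chunks prior to input to a student encoder could look like the following hypothetical helper:

```python
def swap_adjacent_chunks(chunks: list) -> list:
    """Swap each pair of adjacent chunks: [c0, c1, c2, c3, c4] -> [c1, c0, c3, c2, c4].
    An odd trailing chunk is left in place."""
    swapped = list(chunks)
    for i in range(0, len(swapped) - 1, 2):
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    return swapped

print(swap_adjacent_chunks(["c0", "c1", "c2", "c3", "c4"]))
# → ['c1', 'c0', 'c3', 'c2', 'c4']
```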
Claim 9: the combination of Kurata, Rosenberg, and Leal further teaches, according to claim 1 above, wherein the operational parameters of the trained ASR model are operational parameters of the student encoder (Leal, the adaptive model RNN-T 200 is trained as student model in fig. 2A), and wherein the student encoder is deployed as the audio encoder of the ASR model after execution of the fine-tuning is complete (Leal, trained adaptive RNN-T model 200 in fig. 2A, and the RNN-T model has an encoder as the audio encoder of the ASR in fig. 3).
Claim 11 has been analyzed and rejected according to claims 10, 2 above.
Claim 13 has been analyzed and rejected according to claims 11, 4 above.
Claim 14 has been analyzed and rejected according to claims 11, 5 above.
Claim 15 has been analyzed and rejected according to claims 11, 6 above.
Claim 17 has been analyzed and rejected according to claims 10, 9 above.
Claim 19 has been analyzed and rejected according to claims 18, 2 above.
Claims 3 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Kurata (above) in view of Rosenberg (above), Leal (above), and Yang ("Knowledge Distillation for End-to-End Automatic Speech Recognition", Department of Engineering, University of Cambridge, Sidney Sussex College, MPhil dissertation in Machine Learning and Machine Intelligence, August 2021, pp. 1-78).
Claim 3: the combination of Kurata, Rosenberg, and Leal further teaches, according to claim 2 above, the cross-entropy loss (discussed in claim 2 above) and a transducer loss (Kurata, RNN-T loss, para 53, Rosenberg, total loss based on contrastive losses and supervised losses, para 32, and Leal, the distillation loss is based on a first RNN-T loss corresponding to the teacher ASR models and a second RNN-T loss corresponding to the student ASR model, para 6), and wherein the audio encoder of the ASR computer model shares one or more encoder parameters with the student encoder (Kurata, the student neural network model is prepared from the teacher neural network as an RNN-T, para 131, and Leal, the adaptive model as student RNN-T model 200 from the distilled teacher's RNN-T model, para 26), except explicitly teaching wherein the loss is an interpolated loss between a transducer loss and the cross-entropy loss.
Yang teaches an analogous field of endeavor by disclosing a method for training an Automatic Speech Recognition (ASR) model (title and 3.3 knowledge distillation (KD) in RNN-T, p. 25), and wherein an interpolated loss (loss function L in equation 3.5, p. 25) between a transducer loss (a CTC model loss Lctc in equation 3.7, p. 27) and the knowledge distillation loss (knowledge distillation loss Lkd) is disclosed (the interpolation via equation 3.7, p. 27), for the benefits of reducing computational complexity in a cost-saving manner (reducing memory cost and computation effort, 3.3.1 Full-lattice KD with Collapsed Distribution, p. 27).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the cross-entropy loss and the transducer loss in the computer-implemented method, as taught by the combination of Kurata, Rosenberg, and Leal, with calculated interpolated loss between the transducer loss and the knowledge distillation loss, as taught by Yang, for the benefits discussed above.
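As a purely illustrative aside, an interpolated loss of the kind discussed above is a simple weighted combination of the two loss terms; the weight `lam` and the numeric values below are hypothetical and not taken from Yang or the claims:

```python
def interpolated_loss(transducer_loss: float, kd_loss: float, lam: float = 0.5) -> float:
    """Linear interpolation between a transducer (or CTC) loss and a
    knowledge-distillation loss, weighted by lam in [0, 1]."""
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * transducer_loss + lam * kd_loss

# Hypothetical per-batch loss values:
print(round(interpolated_loss(2.4, 0.8, lam=0.25), 6))  # → 2.0
```

Setting `lam` to 0 recovers the pure transducer objective, and setting it to 1 recovers pure distillation, so the weight trades off fidelity to the labels against fidelity to the teacher.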
Claim 12 has been analyzed and rejected according to claims 10, 3 above.
Examiner Comment
In paragraph 0026 of applicant's specification, applicant clearly defines the following: "A computer readable storage medium, as that term is used in the present disclosure, is not to be construed as storage in the form of transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide, light pulses passing through a fiber optic cable, electrical signals communicated through a wire, and/or other transmission media".
Allowable Subject Matter
Claims 7, 16, and 20 are objected to as being dependent upon rejected base claims 1, 10, and 18, respectively, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LESHUI ZHANG whose telephone number is (571)270-5589. The examiner can normally be reached Monday-Friday, 6:30am-4:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vivian Chin can be reached at 571-272-7848. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LESHUI ZHANG/
Primary Examiner,
Art Unit 2695