DETAILED ACTION
This Office Action is in response to the correspondence filed by the applicant on 8/15/2024.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The Information Disclosure Statements (IDS) filed on 8/15/2024, 8/23/2025, and 1/28/2026 have been accepted and considered in this Office action and are in compliance with the provisions of 37 CFR 1.97.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-5, 7-9, and 11-15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by HSU (Hsu WN, Bolte B, Tsai YH, Lakhotia K, Salakhutdinov R, Mohamed A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing. 2021 Oct 26;29:3451-60.).
REGARDING CLAIM 1, HSU discloses a method for generating pseudo-labeled training data from unlabeled training data, the method comprising:
accessing a set of unlabeled speech data (Pg. 3451 2nd Col – “Pseudo-labeling (PL), also known as self-training, belongs to the family of semi-supervised learning techniques, and has been the dominant approach for utilizing unlabeled speech and audio with successful applications dating back to the mid-1990s [8]–[11]. PL starts with some supervised data to train a “teacher” model in one specific downstream task. Pseudo-labels are then generated for the unlabeled data using the teacher model.”);
generating pseudo-labels for the set of unlabeled speech data (Pg. 3451 2nd Col – “Pseudo-labels are then generated for the unlabeled data using the teacher model.”) by at least one of:
(1) extracting a set of intermediate outputs from an automatic speech recognition model based on applying the automatic speech recognition model to the set of unlabeled speech data (Pg. 3456 Section B. Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”),
clustering the set of intermediate outputs into different clusters, each cluster of the different clusters comprising a different sub-set of the set of intermediate outputs (Pg. 3456 Section B. Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”), and
generating a first set of pseudo-labels comprising cluster assignments associated with the different clusters and which correspond to the set of unlabeled speech data (Pg. 3452 Section II Method – “Inspired by this, we propose to use acoustic unit discovery models to provide frame-level targets. Let X denote a speech utterance X=[x1,…,xT] of T frames. Discovered hidden units are denoted with h(X)=Z=[z1,…,zT], where zt∈{1,…,C} is a C-class categorical variable and h is a clustering model, e.g. k-means.”; Pg. 3456 Section C. Pre-Training – “We train the Base model for two iterations on the 960 hours of LibriSpeech audio on 32 GPUs, with a batch size of at most 87.5 seconds of audio per GPU. The first iteration is trained for 250 k steps, while the second iteration is trained for 400 k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. … Instead of restarting the iterative process from clustering MFCC features, we extract features from the 9-th transformer layer of the second iteration Base HuBERT for clustering and use those labels for training these two models. Hence, these two models can also be seen as the third iteration models.”), or
(2) generating a set of decoded word sequences for the set of unlabeled speech data by applying the automatic speech recognition model to the set of unlabeled speech data, and
generating a second set of pseudo-labels associated with the set of unlabeled speech data by applying a hybrid automatic speech recognition model to both (i) the set of decoded word sequences and (ii) the set of unlabeled speech data; and
generating a pseudo-labeled training dataset by combining the set of unlabeled speech data with either (i) the first set of pseudo-labels (Pg. 3456 Section C. Pre-Training – “We train the Base model for two iterations on the 960 hours of LibriSpeech audio on 32 GPUs, with a batch size of at most 87.5 seconds of audio per GPU. The first iteration is trained for 250 k steps, while the second iteration is trained for 400 k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. … Instead of restarting the iterative process from clustering MFCC features, we extract features from the 9-th transformer layer of the second iteration Base HuBERT for clustering and use those labels for training these two models. Hence, these two models can also be seen as the third iteration models.”) or (ii) the second set of pseudo-labels.
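For illustration only, the option (1) steps mapped above (extracting intermediate features, clustering them, and pairing each utterance with its frame-level cluster assignments) can be sketched as follows. This is a minimal sketch and is not HSU's implementation: the random arrays stand in for intermediate transformer-layer features, the function names (fit_kmeans, assign_clusters, pseudo_label_dataset) and the small cluster count are hypothetical, and a plain Lloyd's k-means is assumed. The 10% sampling before fitting follows the quoted passage from Section B.

```python
import numpy as np

def assign_clusters(features, centroids):
    """Return the nearest-centroid index for every frame (frame-level pseudo-label)."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def fit_kmeans(features, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's k-means (an assumption); returns centroids of shape (num_clusters, dim)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from distinct, randomly chosen frames.
    init = rng.choice(len(features), num_clusters, replace=False)
    centroids = features[init].copy()
    for _ in range(num_iters):
        labels = assign_clusters(features, centroids)
        for k in range(num_clusters):
            members = features[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids

def pseudo_label_dataset(utterance_features, num_clusters, sample_fraction=0.1, seed=0):
    """Fit k-means on a random sample of frames, then label every utterance."""
    rng = np.random.default_rng(seed)
    all_frames = np.concatenate(utterance_features, axis=0)
    sample_size = min(len(all_frames),
                      max(num_clusters, int(sample_fraction * len(all_frames))))
    sample = all_frames[rng.choice(len(all_frames), sample_size, replace=False)]
    centroids = fit_kmeans(sample, num_clusters, seed=seed)
    # Combine each utterance with its per-frame cluster assignments.
    return [(feats, assign_clusters(feats, centroids)) for feats in utterance_features]

# Hypothetical stand-in for intermediate-layer features: 3 utterances of 8-D frames.
rng = np.random.default_rng(1)
utterances = [rng.normal(size=(t, 8)) for t in (50, 80, 70)]
dataset = pseudo_label_dataset(utterances, num_clusters=4)
```

Each element of dataset pairs one utterance's feature frames with one integer cluster assignment per frame, which corresponds to the combination of unlabeled data and the first set of pseudo-labels recited in the final step.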
REGARDING CLAIM 2, HSU discloses the method of claim 1, further comprising:
generating a pretrained language speech model by at least applying the pseudo-labeled training dataset to a speech processing model (Pg. 3456 Section C. Pre-Training – “The first iteration is trained for 250 k steps, while the second iteration is trained for 400 k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. … Next we train HuBERT Large and X-Large for one iteration on 60,000 hours of Libri-light audio on 128 and 256 GPUs, respectively, for 400 k steps. The batch sizes are reduced to 56.25 and 22.5 seconds of audio per GPU due to memory constraints. Instead of restarting the iterative process from clustering MFCC features, we extract features from the 9-th transformer layer of the second iteration Base HuBERT for clustering and use those labels for training these two models. Hence, these two models can also be seen as the third iteration models.”) for preparing the speech processing model to be trained with labeled speech data (Pg. 3456 Section D. Supervised Fine-Tuning and Decoding – “We fine-tune each model on 8 GPUs on the labeled splits described in Section IV-A. The batch sizes per GPU are at most 200/80/40 seconds of audio for Base/Large/X-Large models.”; Pg. 3457 2nd Col – “2) pre-training: first use unlabeled speech for pre-training a model, and then fine-tune the model on labeled data with a supervised training objective.”).
REGARDING CLAIM 3, HSU discloses the method of claim 2, wherein the speech processing model is an acoustic model (Pg. 3452 2nd Col – “Intuitively, the HuBERT model is forced to learn both acoustic and language models from continuous inputs.”; Pg. 3454 1st Col – “The final loss is computed as a weighted sum of the two terms: L=αLm+(1−α)Lu. In the extreme case when α=0, the loss is computed over the unmasked timesteps, which is similar to acoustic modeling in hybrid speech recognition systems [36]–[39]. In our setup, this limits the learning process to mimicking the clustering model.”).
REGARDING CLAIM 4, HSU discloses the method of claim 2, further comprising:
generating a trained speech processing model by at least applying labeled training data to the pretrained speech processing model to fine-tune the speech processing model (Pg. 3456 Section D. Supervised Fine-Tuning and Decoding – “We fine-tune each model on 8 GPUs on the labeled splits described in Section IV-A. The batch sizes per GPU are at most 200/80/40 seconds of audio for Base/Large/X-Large models.”; Pg. 3457 2nd Col – “2) pre-training: first use unlabeled speech for pre-training a model, and then fine-tune the model on labeled data with a supervised training objective.”); and
using the trained speech processing model to perform at least one of speech recognition, speaker recognition or a speech separation task (Pg. 3456 Section D. Supervised Fine-Tuning and Decoding – “We sweep over peak learning rate ([1e-5, 1e-4]), learning rate schedule (percentage of steps for linear ramp-up and decay), number of fine-tuning steps, freeze step, and waveform encoder output masking probability for each model size and fine-tuning split combination using the word error rate (WER) on the dev-other subset as a criterion for model selection. … where Y is the predicted text, |Y| is the length of the text, and w1 and w2 denote the language model weight and utterance length weight.”).
REGARDING CLAIM 5, HSU discloses the method of claim 4, wherein the automatic speech recognition model is previously trained on the labeled training data (Pg. 3456 Section D. Supervised Fine-Tuning and Decoding – “We fine-tune each model on 8 GPUs on the labeled splits described in Section IV-A. The batch sizes per GPU are at most 200/80/40 seconds of audio for Base/Large/X-Large models.”; Pg. 3457 2nd Col – “2) pre-training: first use unlabeled speech for pre-training a model, and then fine-tune the model on labeled data with a supervised training objective.”).
REGARDING CLAIM 7, HSU discloses the method of claim 2, wherein the second set of pseudo-labels comprises phoneme sequences (Pg. 3456 E. Metrics of Target Quality – “For analysis, we derive frame-level forced-aligned phonetic transcripts using a hybrid ASR system to measure the correlation between the k-means cluster assignments and the actual phonetic units. Given aligned frame-level phonetic labels [y1,…,yT] and k-means labels [z1,…,zT], the joint distribution between the two variables pyz(i,j) can be estimated by counting the occurrences: … where i denotes the i-th phoneme class and j denotes the j-th k-means label class.”; Pg. 3459 1st col – “Besides using continuous speech input rather than discrete units, we hypothesize that HuBERT achieves significantly better performance because its fewer k-means clusters of 100 or 500 help capture broad phonetic concepts without delving into inter/intra-speaker variation.”; In other words, each phoneme maps to a k-means cluster. Thus, the pseudo-labels (i.e., cluster assignments) comprise the phoneme sequences.).
REGARDING CLAIM 8, HSU discloses the method of claim 7, wherein the phoneme sequences are generated at a frame level (Pg. 3456 E. Metrics of Target Quality – “For analysis, we derive frame-level forced-aligned phonetic transcripts using a hybrid ASR system to measure the correlation between the k-means cluster assignments and the actual phonetic units. Given aligned frame-level phonetic labels [y1,…,yT] and k-means labels [z1,…,zT], the joint distribution between the two variables pyz(i,j) can be estimated by counting the occurrences: … where i denotes the i-th phoneme class and j denotes the j-th k-means label class.”).
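For illustration only, the counting estimate referenced in the quoted passage can be sketched as follows. The exact expression is elided in the quotation, so this sketch assumes the ordinary relative-frequency estimate (co-occurrence count divided by the number of frames T); the function name joint_distribution and the toy label sequences are hypothetical and do not come from HSU.

```python
import numpy as np

def joint_distribution(phone_labels, cluster_labels, num_phones, num_clusters):
    """Estimate p_yz(i, j) by counting frames where phone i and cluster j co-occur.

    Assumption: counts are normalised by the number of frames T.
    """
    counts = np.zeros((num_phones, num_clusters))
    # Unbuffered in-place add so repeated (i, j) pairs are each counted.
    np.add.at(counts, (phone_labels, cluster_labels), 1)
    return counts / len(phone_labels)

# Toy frame-level sequences of length T = 6 (hypothetical data).
y = np.array([0, 0, 1, 1, 1, 2])  # forced-aligned phoneme class per frame
z = np.array([3, 3, 0, 0, 1, 2])  # k-means cluster assignment per frame
p_yz = joint_distribution(y, z, num_phones=3, num_clusters=4)
```

In this toy case phoneme 0 always co-occurs with cluster 3, so the entry p_yz[0, 3] carries all of that phoneme's probability mass, which is the kind of phoneme-to-cluster correspondence the rejection relies on.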
REGARDING CLAIM 9, HSU discloses the method of claim 8, wherein the method includes training the speech processing model by at least performing phoneme-based masking to the pseudo-labeled training dataset (Pg. 3455 Fig. 1 – “The HuBERT approach predicts hidden cluster assignments of the masked frames (y2,y3,y4 in the figure) generated by one or more iterations of k-means clustering.”; Pg. 3452 1st Col -- “Intuitively, the HuBERT model is forced to learn both acoustic and language models from continuous inputs. First, the model needs to model unmasked inputs into meaningful continuous latent representations, which maps to the classical acoustic modeling problem. Second, to reduce the prediction error, the model needs to capture the long-range temporal relations between learned representations. One crucial insight motivating this work is the importance of consistency of the targets, not just their correctness, which enables the model to focus on modeling the sequential structure of input data. Our approach draws inspiration from the DeepCluster method for self-supervised visual learning [24]; however, HuBERT benefits from the masked prediction loss over speech sequences to represent their sequential structure.”; Pg. 3459 Section VI Conclusion – “This paper presents HuBERT, a speech representation learning approach that relies on predicting K-means cluster assignments of masked segments of continuous input.”; Note that the clusters correspond to frame-based phonemes.).
REGARDING CLAIM 11, HSU discloses the method of claim 1, wherein clustering the set of intermediate outputs comprises applying one of: a K-means clustering algorithm to the set of intermediate outputs or a spectral clustering algorithm (Pg. 3452 Section II Method – “Inspired by this, we propose to use acoustic unit discovery models to provide frame-level targets. Let X denote a speech utterance X=[x1,…,xT] of T frames. Discovered hidden units are denoted with h(X)=Z=[z1,…,zT], where zt∈{1,…,C} is a C-class categorical variable and h is a clustering model, e.g. k-means.”; Pg. 3456 Section B. Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”).
REGARDING CLAIM 12, HSU discloses the method of claim 1, wherein the cluster assignments are generated at a frame level (Pg. 3452 Section II Method – “Inspired by this, we propose to use acoustic unit discovery models to provide frame-level targets. Let X denote a speech utterance X=[x1,…,xT] of T frames. Discovered hidden units are denoted with h(X)=Z=[z1,…,zT], where zt∈{1,…,C} is a C-class categorical variable and h is a clustering model, e.g. k-means.”).
REGARDING CLAIM 13, HSU discloses the method of claim 1, wherein the set of intermediate outputs comprises hidden layer embeddings associated with one or more hidden layers of the automatic speech recognition model (Pg. 3456 Section B. Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”).
REGARDING CLAIM 14, HSU discloses the method of claim 1, wherein the method includes said generating pseudo-labels for the set of unlabeled speech data by:
extracting the set of intermediate outputs from the automatic speech recognition model, which is an end-to-end or hybrid automatic speech recognition model (Pg. 3452 1st Col -- “Intuitively, the HuBERT model is forced to learn both acoustic and language models from continuous inputs”; In other words, the pre-trained model HuBERT is an end-to-end model.), based on applying the automatic speech recognition model to the set of unlabeled speech data (Pg. 3456 Section B. Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”),
clustering the set of intermediate outputs into the different clusters (Pg. 3456 Section B. Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”), and
generating the first set of pseudo-labels comprising cluster assignments associated with the different clusters (Pg. 3452 Section II Method – “Inspired by this, we propose to use acoustic unit discovery models to provide frame-level targets. Let X denote a speech utterance X=[x1,…,xT] of T frames. Discovered hidden units are denoted with h(X)=Z=[z1,…,zT], where zt∈{1,…,C} is a C-class categorical variable and h is a clustering model, e.g. k-means.”; Pg. 3456 Section C. Pre-Training – “We train the Base model for two iterations on the 960 hours of LibriSpeech audio on 32 GPUs, with a batch size of at most 87.5 seconds of audio per GPU. The first iteration is trained for 250 k steps, while the second iteration is trained for 400 k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. … Instead of restarting the iterative process from clustering MFCC features, we extract features from the 9-th transformer layer of the second iteration Base HuBERT for clustering and use those labels for training these two models. Hence, these two models can also be seen as the third iteration models.”).
REGARDING CLAIM 15, HSU discloses a computing system configured for generating a pre-trained speech processing model using pseudo-labeled training data, the computing system comprising: one or more processors; and one or more hardware storage devices storing one or more computer-executable instructions that are executable by the one or more processors to configure the computing system (HSU Section IV Experimental Details) to at least perform the steps of claim 1; thus, claim 15 is rejected under the same rationale as claim 1.
Claims 1, 10, and 15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by XU (Xu Q, Baevski A, Likhomanenko T, Tomasello P, Conneau A, Collobert R, Synnaeve G, Auli M. Self-training and pre-training are complementary for speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021 Jun 6 (pp. 3030-3034). IEEE.).
REGARDING CLAIM 1, XU discloses a method for generating pseudo-labeled training data from unlabeled training data, the method comprising:
accessing a set of unlabeled speech data (Section 2.2 Self-training Approach – “We adopt the pseudo-labeling strategy of Kahn et al. (2020; [13]) and Synnaeve et al. (2020; [2]). This first trains an initial acoustic model on the available labeled data and then labels the unlabeled data with the initial model as well as a language model in a step we call pseudo-labeling.”);
generating pseudo-labels for the set of unlabeled speech data (Section 2.2 Self-training Approach – “We adopt the pseudo-labeling strategy of Kahn et al. (2020; [13]) and Synnaeve et al. (2020; [2]). This first trains an initial acoustic model on the available labeled data and then labels the unlabeled data with the initial model as well as a language model in a step we call pseudo-labeling.”) by at least one of:
(1) extracting a set of intermediate outputs from an automatic speech recognition model based on applying the automatic speech recognition model to the set of unlabeled speech data,
clustering the set of intermediate outputs into different clusters, each cluster of the different clusters comprising a different sub-set of the set of intermediate outputs, and
generating a first set of pseudo-labels comprising cluster assignments associated with the different clusters and which correspond to the set of unlabeled speech data, or
(2) generating a set of decoded word sequences for the set of unlabeled speech data by applying the automatic speech recognition model to the set of unlabeled speech data (XU Section 2.3 Combining the two Approaches – “To combine the approaches, we replace the initial model for pseudo-labeling with a pre-trained model. The resulting training pipeline is as follows: we first pre-train a wav2vec 2.0 model on the unlabeled data, fine-tune it on the available labeled data, use the model to label the unlabeled data, and finally use the pseudo-labeled data to train the final model. In our experiments, we also consider a variant where we fine-tune the original wav2vec 2.0 model on the pseudo-labeled data.”; Section 3.3 Self-Training – “We pseudo-label the audio data of either LS-960 or LV-60k using wav2vec 2.0 LARGE fine-tuned on different labeled data splits. For labeling, we follow the two-pass rescoring procedure of Synnaeve et al. (2020; [2]): first, we generate a list of candidate transcriptions by combining wav2vec 2.0 and the standard Librispeech 4-gram language model during beam-search with beam 800.”; In other words, the candidate transcriptions (i.e., a set of decoded word sequences) are generated first, as a step in generating the pseudo-labeled data.), and
generating a second set of pseudo-labels associated with the set of unlabeled speech data by applying a hybrid automatic speech recognition model to both (i) the set of decoded word sequences and (ii) the set of unlabeled speech data (XU Section 2.2 Self-training Approach – “This first trains an initial acoustic model on the available labeled data and then labels the unlabeled data with the initial model as well as a language model in a step we call pseudo-labeling. Finally, a new acoustic model is trained on the pseudo-labeled data as well as the original labeled data.”; Section 2.3 Combining the two Approaches – “To combine the approaches, we replace the initial model for pseudo-labeling with a pre-trained model. The resulting training pipeline is as follows: we first pre-train a wav2vec 2.0 model on the unlabeled data, fine-tune it on the available labeled data, use the model to label the unlabeled data, and finally use the pseudo-labeled data to train the final model. In our experiments, we also consider a variant where we fine-tune the original wav2vec 2.0 model on the pseudo-labeled data.”; Section 3.3 Self-Training – “We pseudo-label the audio data of either LS-960 or LV-60k using wav2vec 2.0 LARGE fine-tuned on different labeled data splits. For labeling, we follow the two-pass rescoring procedure of Synnaeve et al. (2020; [2]): first, we generate a list of candidate transcriptions by combining wav2vec 2.0 and the standard Librispeech 4-gram language model during beam-search with beam 800.”; In other words, the pseudo-labels are generated by applying the initial model (e.g., an acoustic model/wav2vec model) as well as a language model to label the unlabeled data. Using the acoustic model, phoneme sequences are generated, and using the language model, the word sequence is generated.); and
generating a pseudo-labeled training dataset by combining the set of unlabeled speech data with either (i) the first set of pseudo-labels or (ii) the second set of pseudo-labels (XU Section 2.2 Self-training Approach – “This first trains an initial acoustic model on the available labeled data and then labels the unlabeled data with the initial model as well as a language model in a step we call pseudo-labeling. Finally, a new acoustic model is trained on the pseudo-labeled data as well as the original labeled data.”; Section 2.3 Combining the two Approaches –“To combine the approaches, we replace the initial model for pseudo-labeling with a pre-trained model. The resulting training pipeline is as follows: we first pre-train a wav2vec 2.0 model on the unlabeled data, fine-tune it on the available labeled data, use the model to label the unlabeled data, and finally use the pseudo-labeled data to train the final model. In our experiments, we also consider a variant where we fine-tune the original wav2vec 2.0 model on the pseudo-labeled data.”; Section 3.3 Self-Training – “We pseudo-label the audio data of either LS-960 or LV-60k using wav2vec 2.0 LARGE fine-tuned on different labeled data splits. For labeling, we follow the two-pass rescoring procedure of Synnaeve et al. (2020; [2]): first, we generate a list of candidate transcriptions by combining wav2vec 2.0 and the standard Librispeech 4-gram language model during beam-search with beam 800.”).
REGARDING CLAIM 10, XU discloses the method of claim 1, wherein the second set of pseudo-labels comprises graphemic units (Section 3.4 Final Model – “We follow Synnaeve et al. (2020; [2]) and train a Transformer-based sequence to sequence model with log-Mel filterbank inputs after pseudo-labeling using wav2letter++ [37].”; One of ordinary skill in the art would know wav2letter++ generates graphemic units. [37] recites, “Many of the recent open-source ASR toolkits, including the one presented in this paper, rely on end-to-end acoustic modeling based on graphemes rather than phonemes.”).
REGARDING CLAIM 15, XU discloses a computing system configured for generating a pre-trained speech processing model using pseudo-labeled training data, the computing system comprising: one or more processors; and one or more hardware storage devices storing one or more computer-executable instructions that are executable by the one or more processors to configure the computing system (XU Section 3 Experimental Setup) to at least perform the steps of claim 1; thus, claim 15 is rejected under the same rationale as claim 1.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over HSU in view of XU (Xu Q, Baevski A, Likhomanenko T, Tomasello P, Conneau A, Collobert R, Synnaeve G, Auli M. Self-training and pre-training are complementary for speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021 Jun 6 (pp. 3030-3034). IEEE.).
REGARDING CLAIM 6, HSU discloses the method of claim 4.
HSU discloses wherein the automatic speech recognition model is previously trained on [a different set of labeled speech data than] the labeled training data (Pg. 3456 Section D. Supervised Fine-Tuning and Decoding – “We fine-tune each model on 8 GPUs on the labeled splits described in Section IV-A. The batch sizes per GPU are at most 200/80/40 seconds of audio for Base/Large/X-Large models.”; Pg. 3457 2nd Col – “2) pre-training: first use unlabeled speech for pre-training a model, and then fine-tune the model on labeled data with a supervised training objective.”).
HSU does not explicitly teach the [square-bracketed] limitations. In other words, HSU teaches training on the same labeled data, but does not explicitly teach training on a different set of labeled data.
XU discloses the [square-bracketed] limitations. XU discloses a method/system for speech recognition, wherein the automatic speech recognition model is previously trained on [a different set of labeled speech data than] the labeled training data (XU Section 3.3 Self-training – “We pseudo-label the audio data of either LS-960 or LV-60k using wav2vec 2.0 LARGE fine-tuned on different labeled data splits.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of HSU by substituting training a model on a different set of labeled data, as taught by XU, for training a model on the same set of labeled data.
Since each individual element and its function are shown in the prior art, albeit shown in separate references, the simple substitution of one known element for another producing a predictable result renders the claim obvious. For more on this combination rationale, see MPEP § 2143(B).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN C KIM whose telephone number is (571)272-3327. The examiner can normally be reached Monday to Friday 8:00 AM through 4:00 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JONATHAN C KIM/Primary Examiner, Art Unit 2655