Prosecution Insights
Last updated: May 29, 2026
Application No. 18/838,867

ADVANCED CLUSTERING FOR SELF-SUPERVISED LEARNING IN SPEECH RECOGNITION

Non-Final OA: §102, §103
Filed: Aug 15, 2024
Priority: Mar 24, 2022 (nonprovisional of PCT/CN2022/082664)
Examiner: KIM, JONATHAN C
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: Shujie Liu
OA Round: 1 (Non-Final)
Grant Probability: 74% (Favorable)
OA Rounds: 1-2
Est. Remaining: 8m
With Interview: 99%

Examiner Intelligence

Grants 74% (above average)
Career Allowance Rate: 74% (265 granted / 360 resolved), +11.6% vs TC avg
Interview Lift: +40.4% (strong), comparing resolved cases with and without an interview
Typical timeline
Avg Prosecution: 2y 5m
Currently pending: 12
Career history
Total Applications: 376 (across all art units)
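
The 74% career allowance rate appears to follow directly from the granted and resolved counts listed above. The short check below assumes the rate is simply granted divided by resolved, rounded to a whole percent; the page does not state its exact rounding rule, so treat that as an assumption.

```python
# Figures copied from the examiner summary above.
granted = 265
resolved = 360

# Assumption: allowance rate = granted / resolved, rounded to a whole percent.
allowance_rate = granted / resolved
allowance_pct = round(allowance_rate * 100)

print(f"{allowance_rate:.4f} -> {allowance_pct}%")  # prints "0.7361 -> 74%"
```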

Statute-Specific Performance

§101: 3.7% (-36.3% vs TC avg)
§103: 90.8% (+50.8% vs TC avg)
§102: 1.2% (-38.8% vs TC avg)
§112: 1.3% (-38.7% vs TC avg)
Tech Center average is an estimate. Based on career data from 360 resolved cases.
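
As a sanity check on the figures above, the Tech Center baseline implied by each row can be backed out from the examiner's rate and its stated difference. This sketch assumes each "vs TC avg" value is a percentage-point difference (examiner rate minus Tech Center average); the page does not define the metric, so the interpretation is an assumption.

```python
# (examiner rate %, difference vs TC avg) as listed above.
statute_stats = {
    "101": (3.7, -36.3),
    "103": (90.8, 50.8),
    "102": (1.2, -38.8),
    "112": (1.3, -38.7),
}

# Assumption: difference = examiner rate - Tech Center average, in points.
implied_tc_avg = {
    statute: round(rate - delta, 1)
    for statute, (rate, delta) in statute_stats.items()
}

for statute, baseline in implied_tc_avg.items():
    print(f"§{statute}: implied TC avg = {baseline}%")
```

Under that reading, every row implies the same 40.0% baseline, which is consistent with the page describing the Tech Center average as a single estimate.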

Office Action

§102 §103
DETAILED ACTION

This Office Action is in response to the correspondence filed by the applicant on 8/15/2024.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The Information Disclosure Statements (IDS) filed on 8/15/2024, 8/23/2025, and 1/28/2026 have been accepted and considered in this office action and are in compliance with the provisions of 37 CFR 1.97.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-5, 7-9, and 11-15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by HSU (Hsu WN, Bolte B, Tsai YH, Lakhotia K, Salakhutdinov R, Mohamed A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing. 2021 Oct 26;29:3451-60.).

REGARDING CLAIM 1, HSU discloses a method for generating pseudo-labeled training data from unlabeled training data, the method comprising: accessing a set of unlabeled speech data (Pg. 3451 2nd Col – “Pseudo-labeling (PL), also known as self-training, belongs to the family of semi-supervised learning techniques, and has been the dominant approach for utilizing unlabeled speech and audio with successful applications dating back to the mid-1990s [8]–[11]. PL starts with some supervised data to train a “teacher” model in one specific downstream task. Pseudo-labels are then generated for the unlabeled data using the teacher model.”); generating pseudo-labels for the set of unlabeled speech data (Pg.
3451 2nd Col – “Pseudo-labels are then generated for the unlabeled data using the teacher model.”) by at least one of: (1) extracting a set of intermediate outputs from an automatic speech recognition model based on applying the automatic speech recognition model to the set of unlabeled speech data (Pg. 3456 Section B. Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”), clustering the set of intermediate outputs into different clusters, each cluster of the different clusters comprising a different sub-set of the set of intermediate outputs (Pg. 3456 Section B. Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”), and generating a first set of pseudo-labels comprising cluster assignments associated with the different clusters and which correspond to the set of unlabeled speech data (Pg. 3452 Section II Method – “Inspired by this, we propose to use acoustic unit discovery models to provide frame-level targets. Let X denote a speech utterance X=[x1,…,xT] of T frames. 
Discovered hidden units are denoted with h(X)=Z=[z1,…,zT], where zt∈{1,…,C} is a C-class categorical variable and h is a clustering model, e.g. k-means.”; Pg. 3456 Section C. Pre-Training – “We train the Base model for two iterations on the 960 hours of LibriSpeech audio on 32 GPUs, with a batch size of at most 87.5 seconds of audio per GPU. The first iteration is trained for 250 k steps, while the second iteration is trained for 400 k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. … Instead of restarting the iterative process from clustering MFCC features, we extract features from the 9-th transformer layer of the second iteration Base HuBERT for clustering and use those labels for training these two models. Hence, these two models can also be seen as the third iteration models.”), or (2) generating a set of decoded word sequences for the set of unlabeled speech data by applying the automatic speech recognition model to the set of unlabeled speech data, and generating a second set of pseudo-labels associated with the set of unlabeled speech data by applying a hybrid automatic speech recognition model to both (i) the set of decoded word sequences and (ii) the set of unlabeled speech data; and generating a pseudo-labeled training dataset by combining the set of unlabeled speech data with either (i) the first set of pseudo-labels (Pg. 3456 Section C. Pre-Training – “We train the Base model for two iterations on the 960 hours of LibriSpeech audio on 32 GPUs, with a batch size of at most 87.5 seconds of audio per GPU. The first iteration is trained for 250 k steps, while the second iteration is trained for 400 k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. 
… Instead of restarting the iterative process from clustering MFCC features, we extract features from the 9-th transformer layer of the second iteration Base HuBERT for clustering and use those labels for training these two models. Hence, these two models can also be seen as the third iteration models.”) or (ii) the second set of pseudo-labels. REGARDING CLAIM 2, HSU discloses the method of claim 1, further comprising: generating a pretrained language speech model by at least applying the pseudo-labeled training dataset to a speech processing model (Pg. 3456 Section C. Pre-Training – “The first iteration is trained for 250 k steps, while the second iteration is trained for 400 k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. … Next we train HuBERT Large and X-Large for one iteration on 60,000 hours of Libri-light audio on 128 and 256 GPUs, respectively, for 400 k steps. The batch sizes are reduced to 56.25 and 22.5 seconds of audio per GPU due to memory constraints. Instead of restarting the iterative process from clustering MFCC features, we extract features from the 9-th transformer layer of the second iteration Base HuBERT for clustering and use those labels for training these two models. Hence, these two models can also be seen as the third iteration models.”) for preparing the speech processing model to be trained with labeled speech data (Pg. 3456 Section D. Supervised Fine-Tuning and Decoding – “We fine-tune each model on 8 GPUs on the labeled splits described in Section IV-A. The batch sizes per GPU are at most 200/80/40 seconds of audio for Base/Large/X-Large models.”; Pg. 3457 2nd Col – “2) pre-training: first use unlabeled speech for pre-training a model, and then fine-tune the model on labeled data with a supervised training objective.”). REGARDING CLAIM 3, HSU discloses the method of claim 2, wherein the speech processing model is an acoustic model (Pg. 
3452 2nd Col – “Intuitively, the HuBERT model is forced to learn both acoustic and language models from continuous inputs.”; Pg. 3454 1st Col – “The final loss is computed as a weighted sum of the two terms: L=αLm+(1−α)Lu. In the extreme case when α=0, the loss is computed over the unmasked timesteps, which is similar to acoustic modeling in hybrid speech recognition systems [36]–​[39]. In our setup, this limits the learning process to mimicking the clustering model.”). REGARDING CLAIM 4, HSU discloses the method of claim 2, further comprising: generating a trained speech processing model by at least applying labeled training data to the pretrained speech processing model to fine-tune the speech processing model (Pg. 3456 Section D. Supervised Fine-Tuning and Decoding – “We fine-tune each model on 8 GPUs on the labeled splits described in Section IV-A. The batch sizes per GPU are at most 200/80/40 seconds of audio for Base/Large/X-Large models.”; Pg. 3457 2nd Col – “2) pre-training: first use unlabeled speech for pre-training a model, and then fine-tune the model on labeled data with a supervised training objective.”); and using the trained speech processing model to perform at least one of speech recognition, speaker recognition or a speech separation task (Pg. 3456 Section D. Supervised fine-Tuning and Decoding – “We sweep over peak learning rate ([1e-5, 1e-4]), learning rate schedule (percentage of steps for linear ramp-up and decay), number of fine-tuning steps, freeze step, and waveform encoder output masking probability for each model size and fine-tuning split combination using the word error rate (WER) on the dev-other subset as a criterion for model selection. … where Y is the predicted text, |Y| is the length of the text, and w1 and w2 denote the language model weight and utterance length weight.”). REGARDING CLAIM 5, HSU discloses the method of claim 4, wherein the automatic speech recognition model is previously trained on the labeled training data (Pg. 
3456 Section D. Supervised Fine-Tuning and Decoding – “We fine-tune each model on 8 GPUs on the labeled splits described in Section IV-A. The batch sizes per GPU are at most 200/80/40 seconds of audio for Base/Large/X-Large models.”; Pg. 3457 2nd Col – “2) pre-training: first use unlabeled speech for pre-training a model, and then fine-tune the model on labeled data with a supervised training objective.”). REGARDING CLAIM 7, HSU discloses the method of claim 2, wherein the second set of pseudo-labels comprises phoneme sequences (Pg. 3456 E. Metrics of Target Quality – “For analysis, we derive frame-level forced-aligned phonetic transcripts using a hybrid ASR system to measure the correlation between the k-means cluster assignments and the actual phonetic units. Given aligned frame-level phonetic labels [y1,…,yT] and k-means labels [z1,…,zT], the joint distribution between the two variables pyz(i,j) can be estimated by counting the occurrences: … where i denotes the i-th phoneme class and j denotes the j-th k-means label class.”; Pg. 3459 1st col – “Besides using continuous speech input rather than discrete units, we hypothesize that HuBERT achieves significantly better performance because its fewer k-means clusters of 100 or 500 help capture broad phonetic concepts without delving into inter/intra-speaker variation.”; In other words, each phoneme maps to a cluster of the k-means. Thus, the pseudo-labels (i.e., cluster assignments) comprises the phoneme sequences.). REGARDING CLAIM 8, HSU discloses the method of claim 7, wherein the phoneme sequences are generated at a frame level (Pg. 3456 E. Metrics of Target Quality – “For analysis, we derive frame-level forced-aligned phonetic transcripts using a hybrid ASR system to measure the correlation between the k-means cluster assignments and the actual phonetic units. 
Given aligned frame-level phonetic labels [y1,…,yT] and k-means labels [z1,…,zT], the joint distribution between the two variables pyz(i,j) can be estimated by counting the occurrences: … where i denotes the i-th phoneme class and j denotes the j-th k-means label class.”). REGARDING CLAIM 9, HSU discloses the method of claim 8, wherein the method includes training the speech processing model by at least performing phoneme-based masking to the pseudo-labeled training dataset (Pg.3455 Fig. 1 – “The HuBERT approach predicts hidden cluster assignments of the masked frames (y2,y3,y4 in the figure) generated by one or more iterations of k-means clustering.”; Pg. 3452 1st Col -- “Intuitively, the HuBERT model is forced to learn both acoustic and language models from continuous inputs. First, the model needs to model unmasked inputs into meaningful continuous latent representations, which maps to the classical acoustic modeling problem. Second, to reduce the prediction error, the model needs to capture the long-range temporal relations between learned representations. One crucial insight motivating this work is the importance of consistency of the targets, not just their correctness, which enables the model to focus on modeling the sequential structure of input data. Our approach draws inspiration from the DeepCluster method for self-supervised visual learning [24]; however, HuBERT benefits from the masked prediction loss over speech sequences to represent their sequential structure.”; Pg. 3459 Section VI Conclusion – “This paper presents HuBERT, a speech representation learning approach that relies on predicting K-means cluster assignments of masked segments of continuous input.”; Note that the clusters correspond to frame-based phonemes.). 
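
The target-quality passage quoted for claims 7 and 8 describes estimating the joint distribution between frame-level phonetic labels and k-means labels by counting occurrences. The quotation elides the formula itself, so the sketch below is only one plausible reading: it assumes the estimate is the co-occurrence count divided by the number of frames. The function name `joint_distribution` and the toy label sequences are hypothetical and do not come from the office action or from HSU.

```python
from collections import Counter


def joint_distribution(phone_labels, cluster_labels):
    """Estimate p_yz(i, j) from two aligned frame-level label sequences.

    Assumption: the estimate is count(i, j) / T, where T is the number of
    frames. The quoted passage omits the formula, so this is illustrative.
    """
    if len(phone_labels) != len(cluster_labels):
        raise ValueError("label sequences must be aligned frame by frame")
    total = len(phone_labels)
    counts = Counter(zip(phone_labels, cluster_labels))
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical toy data: six frames with a phoneme label and a cluster id each.
phones = ["AH", "AH", "S", "S", "S", "T"]
clusters = [3, 3, 7, 7, 1, 1]

p_yz = joint_distribution(phones, clusters)
for (phone, cluster), prob in sorted(p_yz.items()):
    print(f"p(y={phone}, z={cluster}) = {prob:.3f}")
```

A table like this is what lets a reader judge how closely cluster assignments track phonetic units, which is the correlation the examiner relies on for claims 7 through 9.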
REGARDING CLAIM 11, HSU discloses the method of claim 1, wherein clustering the set of intermediate outputs comprises applying one of: a K-means clustering algorithm to the set of intermediate outputs or a spectral clustering algorithm (Pg. 3452 Section II Method – “Inspired by this, we propose to use acoustic unit discovery models to provide frame-level targets. Let X denote a speech utterance X=[x1,…,xT] of T frames. Discovered hidden units are denoted with h(X)=Z=[z1,…,zT], where zt∈{1,…,C} is a C-class categorical variable and h is a clustering model, e.g. k-means.”; Pg. 3456 Section B. Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”). REGARDING CLAIM 12, HSU discloses the method of claim 1, wherein the cluster assignments are generated at a frame level (Pg. 3452 Section II Method – “Inspired by this, we propose to use acoustic unit discovery models to provide frame-level targets. Let X denote a speech utterance X=[x1,…,xT] of T frames. Discovered hidden units are denoted with h(X)=Z=[z1,…,zT], where zt∈{1,…,C} is a C-class categorical variable and h is a clustering model, e.g. k-means.”). REGARDING CLAIM 13, HSU discloses the method of claim 1, wherein the set of intermediate outputs comprises hidden layer embeddings associated with one or more hidden layers of the automatic speech recognition model (Pg. 3456 Section B. 
Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”). REGARDING CLAIM 14, HSU discloses the method of claim 1, wherein the method includes said generating pseudo-labels for the set of unlabeled speech data by: extracting the set of intermediate outputs from the automatic speech recognition model, which is an end-to-end or hybrid automatic speech recognition model (Pg. 3452 1st Col -- “Intuitively, the HuBERT model is forced to learn both acoustic and language models from continuous inputs”; In other words, the pre-trained model HuBERT is an end-to-end model.), based on applying the automatic speech recognition model to the set of unlabeled speech data (Pg. 3456 Section B. Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”), clustering the set of intermediate outputs into the different clusters (Pg. 3456 Section B. 
Unsupervised Unit Discovery-- “To generate better targets for the subsequent iterations, we run k-means clustering with 500 clusters on the latent features extracted from the HuBERT model pre-trained in the previous iteration (not fine-tuned) at some intermediate transformer layer. Since the feature dimension at the transformer output is much higher than the MFCC features (768-D for HuBERT Base), we cannot afford to load the entire 960 h training split to the memory. So instead, we randomly sample 10% of the data for fitting the k-means model.”), and generating the first set of pseudo-labels comprising cluster assignments associated with the different clusters (Pg. 3452 Section II Method – “Inspired by this, we propose to use acoustic unit discovery models to provide frame-level targets. Let X denote a speech utterance X=[x1,…,xT] of T frames. Discovered hidden units are denoted with h(X)=Z=[z1,…,zT], where zt∈{1,…,C} is a C-class categorical variable and h is a clustering model, e.g. k-means.”; Pg. 3456 Section C. Pre-Training – “We train the Base model for two iterations on the 960 hours of LibriSpeech audio on 32 GPUs, with a batch size of at most 87.5 seconds of audio per GPU. The first iteration is trained for 250 k steps, while the second iteration is trained for 400 k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. … Instead of restarting the iterative process from clustering MFCC features, we extract features from the 9-th transformer layer of the second iteration Base HuBERT for clustering and use those labels for training these two models. Hence, these two models can also be seen as the third iteration models.”). 
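
The three steps mapped in claims 1 and 14 (extract intermediate features, cluster them, use cluster assignments as frame-level pseudo-labels) can be illustrated with a minimal, dependency-free k-means. This is a sketch under stated assumptions, not the HuBERT implementation: the 2-D toy "frames" stand in for transformer-layer features (768-D for HuBERT Base per the quoted passage), k is 2 rather than 500, the evenly spaced centroid initialization is a simplification chosen for determinism, and the names `fit_kmeans` and `pseudo_label` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def fit_kmeans(features, k, iterations=20):
    """Plain Lloyd's algorithm with evenly spaced initial centroids."""
    step = len(features) // k
    centroids = [list(features[i * step]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in features:
            groups[nearest(point, centroids)].append(point)
        for c, members in enumerate(groups):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return centroids


def pseudo_label(features, centroids):
    """Assign each frame the index of its nearest centroid."""
    return [nearest(point, centroids) for point in features]


# Hypothetical stand-in for intermediate-layer features, one vector per frame.
frames = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

centroids = fit_kmeans(frames, k=2)
labels = pseudo_label(frames, centroids)

# Pair each unlabeled frame with its cluster id to form the training set.
pseudo_labeled_dataset = list(zip(frames, labels))
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

The quoted passage fits k-means on a random 10% sample because the full feature set does not fit in memory; the sketch fits on all frames only because the toy data is tiny.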
REGARDING CLAIM 15, HSU discloses a computing system configured for generating a pre-trained speech processing model using pseudo-labeled training data, the computing system comprising: one or more processors; and one or more hardware storage devices storing one or more computer-executable instructions that are executable by the one or more processors to configure the computing system (HSU Section IV Experimental Details) to at least perform the steps of claim 1; thus, it is rejected under the same rationale.

Claims 1, 10, and 15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by XU (Xu Q, Baevski A, Likhomanenko T, Tomasello P, Conneau A, Collobert R, Synnaeve G, Auli M. Self-training and pre-training are complementary for speech recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021 Jun 6 (pp. 3030-3034). IEEE.).

REGARDING CLAIM 1, XU discloses a method for generating pseudo-labeled training data from unlabeled training data, the method comprising: accessing a set of unlabeled speech data (Section 2.2 Self-training Approach – “We adopt the pseudo-labeling strategy of Kahn et al. (2020; [13]) and Synnaeve et al. (2020; [2]). This first trains an initial acoustic model on the available labeled data and then labels the unlabeled data with the initial model as well as a language model in a step we call pseudo-labeling.”); generating pseudo-labels for the set of unlabeled speech data (Section 2.2 Self-training Approach – “We adopt the pseudo-labeling strategy of Kahn et al. (2020; [13]) and Synnaeve et al. (2020; [2]).
This first trains an initial acoustic model on the available labeled data and then labels the unlabeled data with the initial model as well as a language model in a step we call pseudo-labeling.”) by at least one of: (1) extracting a set of intermediate outputs from an automatic speech recognition model based on applying the automatic speech recognition model to the set of unlabeled speech data, clustering the set of intermediate outputs into different clusters, each cluster of the different clusters comprising a different sub-set of the set of intermediate outputs, and generating a first set of pseudo-labels comprising cluster assignments associated with the different clusters and which correspond to the set of unlabeled speech data, or (2) generating a set of decoded word sequences for the set of unlabeled speech data by applying the automatic speech recognition model to the set of unlabeled speech data (XU Section 2.3 Combining the two Approaches –“To combine the approaches, we replace the initial model for pseudo-labeling with a pre-trained model. The resulting training pipeline is as follows: we first pre-train a wav2vec 2.0 model on the unlabeled data, fine-tune it on the available labeled data, use the model to label the unlabeled data, and finally use the pseudo-labeled data to train the final model. In our experiments, we also consider a variant where we fine-tune the original wav2vec 2.0 model on the pseudo-labeled data.”; Section 3.3 Self-Training – “We pseudo-label the audio data of either LS-960 or LV-60k using wav2vec 2.0 LARGE fine-tuned on different labeled data splits. For labeling, we follow the two-pass rescoring procedure of Synnaeve et al. 
(2020; [2]): first, we generate a list of candidate transcriptions by combining wav2vec 2.0 and the standard Librispeech 4-gram language model during beam-search with beam 800.”; In other words, the candidate transcriptions (i.e., a set of decoded word sequences) are generated first for generating the pseudo-labeled data.), and generating a second set of pseudo-labels associated with the set of unlabeled speech data by applying a hybrid automatic speech recognition model to both (i) the set of decoded word sequences and (ii) the set of unlabeled speech data (XU Section 2.2 Self-training Approach – “This first trains an initial acoustic model on the available labeled data and then labels the unlabeled data with the initial model as well as a language model in a step we call pseudo-labeling. Finally, a new acoustic model is trained on the pseudo-labeled data as well as the original labeled data.”; Section 2.3 Combining the two Approaches – “To combine the approaches, we replace the initial model for pseudo-labeling with a pre-trained model. The resulting training pipeline is as follows: we first pre-train a wav2vec 2.0 model on the unlabeled data, fine-tune it on the available labeled data, use the model to label the unlabeled data, and finally use the pseudo-labeled data to train the final model. In our experiments, we also consider a variant where we fine-tune the original wav2vec 2.0 model on the pseudo-labeled data.”; Section 3.3 Self-Training – “We pseudo-label the audio data of either LS-960 or LV-60k using wav2vec 2.0 LARGE fine-tuned on different labeled data splits. For labeling, we follow the two-pass rescoring procedure of Synnaeve et al.
(2020; [2]): first, we generate a list of candidate transcriptions by combining wav2vec 2.0 and the standard Librispeech 4-gram language model during beam-search with beam 800.”; In other words, the pseudo-labels are generated by applying the initial model (e.g., an acoustic model/wav2vec model) as well as a language model to label the unlabeled data. Using the acoustic model, phoneme sequences are generated, and using the language model, the word sequence is generated.); and generating a pseudo-labeled training dataset by combining the set of unlabeled speech data with either (i) the first set of pseudo-labels or (ii) the second set of pseudo-labels (XU Section 2.2 Self-training Approach – “This first trains an initial acoustic model on the available labeled data and then labels the unlabeled data with the initial model as well as a language model in a step we call pseudo-labeling. Finally, a new acoustic model is trained on the pseudo-labeled data as well as the original labeled data.”; Section 2.3 Combining the two Approaches – “To combine the approaches, we replace the initial model for pseudo-labeling with a pre-trained model. The resulting training pipeline is as follows: we first pre-train a wav2vec 2.0 model on the unlabeled data, fine-tune it on the available labeled data, use the model to label the unlabeled data, and finally use the pseudo-labeled data to train the final model. In our experiments, we also consider a variant where we fine-tune the original wav2vec 2.0 model on the pseudo-labeled data.”; Section 3.3 Self-Training – “We pseudo-label the audio data of either LS-960 or LV-60k using wav2vec 2.0 LARGE fine-tuned on different labeled data splits. For labeling, we follow the two-pass rescoring procedure of Synnaeve et al. (2020; [2]): first, we generate a list of candidate transcriptions by combining wav2vec 2.0 and the standard Librispeech 4-gram language model during beam-search with beam 800.”).
REGARDING CLAIM 10, XU discloses the method of claim 1, wherein the second set of pseudo-labels comprises graphemic units (Section 3.4 Final Model – “We follow Synnaeve et al. (2020; [2]) and train a Transformer-based sequence to sequence model with log-Mel filterbank inputs after pseudo-labeling using wav2letter++ [37].”; One of ordinary skill in the art would know wav2letter++ generates graphemic units. [37] recites, “Many of the recent open-source ASR toolkits, including the one presented in this paper, rely on end-to-end acoustic modeling based on graphemes rather than phonemes.”).

REGARDING CLAIM 15, XU discloses a computing system configured for generating a pre-trained speech processing model using pseudo-labeled training data, the computing system comprising: one or more processors; and one or more hardware storage devices storing one or more computer-executable instructions that are executable by the one or more processors to configure the computing system (XU Section 3 Experimental Setup) to at least perform the steps of claim 1; thus, it is rejected under the same rationale.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over HSU, and in further view of XU (Xu Q, Baevski A, Likhomanenko T, Tomasello P, Conneau A, Collobert R, Synnaeve G, Auli M. Self-training and pre-training are complementary for speech recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021 Jun 6 (pp. 3030-3034). IEEE.).

REGARDING CLAIM 6, HSU discloses the method of claim 4. HSU discloses wherein the automatic speech recognition model is previously trained on [a different set of labeled speech data than] the labeled training data (Pg. 3456 Section D. Supervised Fine-Tuning and Decoding – “We fine-tune each model on 8 GPUs on the labeled splits described in Section IV-A. The batch sizes per GPU are at most 200/80/40 seconds of audio for Base/Large/X-Large models.”; Pg. 3457 2nd Col – “2) pre-training: first use unlabeled speech for pre-training a model, and then fine-tune the model on labeled data with a supervised training objective.”).

HSU does not explicitly teach the [square-bracketed] limitations. In other words, HSU teaches training on the same labeled data, but does not explicitly teach training on a different set of data.

XU discloses the [square-bracketed] limitations. XU discloses a method/system for speech recognition, wherein the automatic speech recognition model is previously trained on [a different set of labeled speech data than] the labeled training data (XU Section 3.3 Self-training – “We pseudo-label the audio data of either LS-960 or LV-60k using wav2vec 2.0 LARGE fine-tuned on different labeled data splits.”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of HSU to substitute training a model on a set of labeled data with training a model on a different set of labeled data, as taught by XU. Since each individual element and its function are shown in the prior art, albeit shown in separate references, the simple substitution of one known element for another producing a predictable result renders the claim obvious.
For more on this combination rationale, see MPEP § 2143(B).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN C KIM whose telephone number is (571)272-3327. The examiner can normally be reached Monday to Friday, 8:00 AM through 4:00 PM EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JONATHAN C KIM/
Primary Examiner, Art Unit 2655

Prosecution Timeline

Aug 15, 2024
Application Filed
Apr 02, 2026
Non-Final Rejection mailed — §102, §103
May 06, 2026
Applicant Interview (Telephonic)
May 08, 2026
Examiner Interview Summary
May 14, 2026
Response Filed

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12640148
Network Microphone Device With Command Keyword Conditioning
3y 6m to grant Granted May 26, 2026
Patent 12640150
VOICE-BASED CHATBOT POLICY OVERRIDE(S) FOR EXISTING VOICE-BASED CHATBOT(S)
2y 10m to grant Granted May 26, 2026
Patent 12614547
METHOD FOR CONTROLLING UTTERANCE OF UTTERANCE DEVICE, SERVER CONTROLLING UTTERANCE OF UTTERANCE DEVICE, UTTERANCE DEVICE, AND PROGRAM
2y 3m to grant Granted Apr 28, 2026
Patent 12609108
ADAPTIVE SELF-TRAINED COMPUTER ENGINES WITH ASSOCIATED DATABASES AND METHODS OF USE THEREOF
4y 8m to grant Granted Apr 21, 2026
Patent 12573391
Generating Contextual Responses for Out-of-coverage Requests for Assistant Systems
2y 11m to grant Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 74%
With Interview: 99% (+40.4%)
Median Time to Grant: 2y 5m (~8m remaining)
PTA Risk: Low
Based on 360 resolved cases by this examiner. Grant probability derived from career allowance rate.
