Last updated: May 29, 2026
Application No. 18/662,550
Abridged Multilingual Speech Models For Automatic Speech Recognition

Non-Final OA §103
Filed
May 13, 2024
Examiner
DUGDA, MULUGETA TUJI
Art Unit
2653
Tech Center
2600 — Communications
Assignee
Oracle International Corporation
OA Round
1 (Non-Final)
Interview Optional

— +19.3% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 82% grant rate with +19.3% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.
Based on 50 resolved cases, 2023–2026
Examiner Intelligence

DUGDA, MULUGETA TUJI View full profile →
Grants 82% — above average
Career Allowance Rate
41 granted / 50 resolved
+20.0% vs TC avg
Strong +19% interview lift
Without
With
+19.3%
Interview Lift
resolved cases with interview
Typical timeline
2y 11m
Avg Prosecution
10 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101
6.2%
-33.8% vs TC avg
§103
89.2%
+49.2% vs TC avg
§102
4.7%
-35.3% vs TC avg
Black line = Tech Center average estimate • Based on career data from 50 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
 
Claims 1-20 are pending, and claims 1, 8 and 15 are independent claims.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5, 8-10, 12, 15-17 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Wu et al. Pat App No. US 20250022457 A1 (Wu) in view of Mavandadi et al., "A Truly Multilingual First Pass and Monolingual Second Pass Streaming on-Device ASR System," 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 2023, pp. 838-845 (Year: 2022).

Regarding Claim 1, Wu discloses one or more non-transitory computer-readable media storing instructions which, when executed by one or more hardware processors, cause performance of operations (Wu, para 0045, processing units performing any of methods 400 and 500 may be executing instructions stored on a non-transitory computer-readable storage media) comprising: 
accessing a multilingual automatic speech recognition (ASR) model comprising a token embedding matrix corresponding to a plurality of language tokens, wherein the plurality of language tokens comprises at least (a) a first subset of language tokens associated with a first language and (b) a second subset of language tokens associated with a second language (Wu, para 0010-0013,  providing systems and techniques that use a cascaded diffusion model (e.g., a machine learning model that includes a diffusion process and receives, as input, output from another machine learning model) for multi-lingual semi-supervised ASR. The cascaded diffusion model may receive as training data a plurality of (speech, text) pairs… A vector representation of the speech data may be concatenated with a vector representation of the text data of a (speech, text) pair to form an input tensor (e.g., one or more vectors or tensors within an embedding space) for training the diffusion model… In some embodiments, the input speech data and the output text data are in the same language. In some embodiments, the input speech data is in a first language, and the output text data is in a second language (e.g., language translation). In some embodiments, the input speech data includes multiple languages, and the output text data similarly includes multiple languages (e.g., multi-lingual support); para 0040, Figure 3, The (speech, text) pair may include a speech portion 310 and a text portion 312. Speech portion 310 may be represented as a sequence of mel-spectrograms, where each frame of the mel-spectrogram includes a vector within an embedding space (e.g., an 80-dimension vector). Text portion 312 may be represented as a sequence of discrete tokens. Text portion 312 may be converted (e.g., embedding 320) from discrete tokens into text vectors 314;); 
	Wu does not specifically disclose generating a language-specific ASR model for the first language, retaining, from the multilingual ASR model, a first portion of the embedding matrix corresponding to the first subset of language tokens associated with the first language, removing, from the multilingual ASR model, a second portion of the embedding matrix corresponding to the second subset of language tokens associated with the second language, and applying a digital audio input, comprising spoken language in the first language, to the language-specific ASR model, to obtain a transcript of the digital audio input in the first language.
	However, Mavandadi, in the same field of endeavor, discloses 
generating a language-specific ASR model for the first language (Mavandadi, page 839, Figure 1, baseline monolingual architecture), at least by: 
retaining, from the multilingual ASR model, a first portion of the embedding matrix corresponding to the first subset of language tokens associated with the first language (Mavandadi, Figure 2 see Caption: Model uses separate non-causal encoders and decoders for each language. Either predicted or inferred LID is used to activate one encoder-decoder pair for each utterance. A multilingual decoder can be used to generate hypotheses directly based on the causal encoder outputs without needing LID; Mavandadi, page 840, left col, 1st para, We use the trained parameters from E0 to initialize the first pass encoders of our remaining experimental architectures and keep those parameters frozen during training in order to preserve the models’ capability to recognize multilingual speech using their first pass  ; [i.e., From Figure 1,  the Causal Encoder has all the Multilingual components of the different languages and it retains  everything and after going through LID, each Non-Causal Encoder and associated Decoder has the language specific tokens]); 
removing, from the multilingual ASR model, a second portion of the embedding matrix corresponding to the second subset of language tokens associated with the second language (Mavandadi, page 840, Figure 2, Fig. 2 Figure caption: Block diagram for E3 multilingual model. Model uses separate non-causal encoders and decoders for each language. Either predicted or inferred LID is used to activate one encoder-decoder pair for each utterance. A multilingual decoder can be used to generate hypotheses directly based on the causal encoder outputs without needing LID; Mavandadi, page 840, right col, 1st para, for E3, we convert the non-causal encoder to be language-dependent as well, resulting in the entire second pass being replicated per-language. Fig. 2 illustrates the E3 architecture.; [i.e., 1st Pass Hypothesis – Multilingual Decoder as “multilingual ASR model”; LID= language identification (LID) information; After 1st pass, switching with LID to individual specific languages like en-US, zh-TW …en-GB basically “REMOVES” all other languages except those specific language going through the Non-Casual Encoder and Decoder before the 2nd pass]  … …);
applying a digital audio input, comprising spoken language in the first language, to the language-specific ASR model, to obtain a transcript of the digital audio input in the first language (Mavandadi, page 843, (1) Decoders with Language Specific Non-Causal Encoders produce 2nd pass Hypo and (2) Multilingual Decoder produced 1st Pass Hypo. Experimental model WERs for different Architectures, including E3, E3+ and E31 are provided in Table 4 and discussed. In conclusion: "Our best system combines per-language decoders and second-pass encoders with a shared multilingual encoder to achieve a 5.5% reduction of WER relative to monolingual models”).  
	Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of Mavandadi in the method of Wu because this would enable a truly multilingual first-pass and monolingual second-pass streaming on-device ASR system based on the recently developed Cascaded Encoders model so that the ASR systems will be more accurate, with low latency, and effectively handle language switching in order to be useful for the 60% of the world population that speaks more than one language (Mavandadi, Abstract).
 
Regarding Claim 2, Wu in view of Mavandadi discloses the one or more media of claim 1, wherein the first subset of language tokens associated with the first language comprises tokens associated with a particular language script (Wu, para 0020-0021, In some embodiments, the speech portion of a (speech, text) pair and the text portion of the pair are in the same language. In some embodiments, the speech portion of a (speech, text) pair may be in a first language while the text portion is in a second language (e.g., language translation). In some embodiments, the speech portion includes speech in multiple languages and the text portion includes text in those same languages (e.g., multi-lingual support). Cascaded ASR module 120 may include a diffusion machine learning model that is trained using the (speech, text) pairs in training dataset 130. The speech input and the text input of a given (speech, text) pair may be combined to create a single input tensor for the diffusion model. In some embodiments, the speech data is represented as a sequence of mel-spectrograms, where each frame of the mel-spectrogram includes a vector within an embedding space (e.g., an 80-dimension vector). For example, a neural network (e.g., Wav2Vec) may be used to convert the input speech data into vectors within an embedding space. The text data may be a represented as a sequence of discrete tokens (e.g., words, phonemes, IPA symbols, etc.)).  
 
Regarding Claim 3, The one or more media of claim 1.
Mavandadi further teaches:
 wherein generating the language-specific ASR model further comprises: retaining, from the multilingual ASR model, a third portion of the embedding matrix corresponding to numerical characters (Mavandadi, page 840, see Figure 2; Causal Encoder is inherently converting text to vectors which are just numbers/numerical characters).  
 
Regarding Claim 5, The one or more media of claim 1, the operations further comprising: identifying the first subset of language tokens associated with the first language, at least by tokenizing a corpus of text written in the first language (Wu, para 0020-0023, In some embodiments, the speech portion of a (speech, text) pair may be in a first language while the text portion is in a second language (e.g., language translation). In some embodiments, the speech portion includes speech in multiple languages and the text portion includes text in those same languages (e.g., multi-lingual support). Cascaded ASR module 120 may include a diffusion machine learning model that is trained using the (speech, text) pairs in training dataset 130…The text data may be a represented as a sequence of discrete tokens (e.g., words, phonemes, IPA symbols, etc.). The text may be converted from discrete tokens into vectors within an embedding space (e.g., vectors each having 512 dimensions). In some embodiments, the embedding space of the text data may be different from the embedding space of the speech data. To convert the discrete tokens to vectors, an embedding mapping may be used, where each token is replaced by a vector within the embedding space. In some embodiments, the embedding mapping is performed using a lookup table. In some embodiments, the embedding mapping is performed using a neural network (e.g., Word2Vec). In some embodiments, the text data is converted from a first set of discrete tokens (e.g., words, phonemes) to a second set of discrete tokens (e.g., international phonetic alphabet (IPA) symbols). Then the second set of discrete tokens may be converted to the vector representation… The text portion may be represented as one or more vectors within an embedding space, so a linear model with a softmax layer may be used to translate (e.g., “round”) the vector representations back into discrete text tokens).  
 
Regarding Claim 8, Wu discloses a system comprising: 
one or more hardware processors (Wu, para 0045, processing units performing any of methods 400 and 500 may be executing instructions stored on a non-transitory computer-readable storage media. In at least one embodiment, any of methods 400 and 500 may be performed using multiple processor threads (e.g., CPU threads and/or GPU threads)); 
one or more non-transitory computer-readable media (Wu, para 0045, processing units performing any of methods 400 and 500 may be executing instructions stored on a non-transitory computer-readable storage media); and 
program instructions stored on the one or more non-transitory computer readable media which, when executed by the one or more hardware processors, cause the system to perform operations (Wu, para 0045, processing units performing any of methods 400 and 500 may be executing instructions stored on a non-transitory computer-readable storage media) comprising: 
accessing a multilingual automatic speech recognition (ASR) model comprising a token embedding matrix corresponding to a plurality of language tokens, wherein the plurality of language tokens comprises at least (a) a first subset of language tokens associated with a first language and (b) a second subset of language tokens associated with a second language (Wu, para 0010-0013,  providing systems and techniques that use a cascaded diffusion model (e.g., a machine learning model that includes a diffusion process and receives, as input, output from another machine learning model) for multi-lingual semi-supervised ASR. The cascaded diffusion model may receive as training data a plurality of (speech, text) pairs… A vector representation of the speech data may be concatenated with a vector representation of the text data of a (speech, text) pair to form an input tensor (e.g., one or more vectors or tensors within an embedding space) for training the diffusion model… In some embodiments, the input speech data and the output text data are in the same language. In some embodiments, the input speech data is in a first language, and the output text data is in a second language (e.g., language translation). In some embodiments, the input speech data includes multiple languages, and the output text data similarly includes multiple languages (e.g., multi-lingual support); para 0040, Figure 3, The (speech, text) pair may include a speech portion 310 and a text portion 312. Speech portion 310 may be represented as a sequence of mel-spectrograms, where each frame of the mel-spectrogram includes a vector within an embedding space (e.g., an 80-dimension vector). Text portion 312 may be represented as a sequence of discrete tokens. Text portion 312 may be converted (e.g., embedding 320) from discrete tokens into text vectors 314); 
	Wu does not specifically disclose generating a language-specific ASR model for the first language, at least by, retaining, from the multilingual ASR model, a first portion of the embedding matrix corresponding to the first subset of language tokens associated with the first language, removing, from the multilingual ASR model, a second portion of the embedding matrix corresponding to the second subset of language tokens associated with the second language, and applying a digital audio input, comprising spoken language in the first language, to the language-specific ASR model, to obtain a transcript of the digital audio input in the first language.
	However, Mavandadi, in the same field of endeavor, discloses 
generating a language-specific ASR model for the first language (Mavandadi, page 839, Figure 1, baseline monolingual architecture), at least by: 
retaining, from the multilingual ASR model, a first portion of the embedding matrix corresponding to the first subset of language tokens associated with the first language (Mavandadi, pages 839 and 840, Figure 2, Figure 2 Figure Caption: Model uses separate non-causal encoders and decoders for each language. Either predicted or inferred LID is used to activate one encoder-decoder pair for each utterance. A multilingual decoder can be used to generate hypotheses directly based on the causal encoder outputs without needing LID; Mavandadi, page 840, left col, 1st para, We use the trained parameters from E0 to initialize the first pass encoders of our remaining experimental architectures and keep those parameters frozen during training in order to preserve the models’ capability to recognize multilingual speech using their first pass); ; [i.e., From Figure 1,  the Causal Encoder has all the Multilingual components of the different languages and it retains  everything and after going through LID, each Non-Causal Encoder and associated Decoder has the language specific tokens]); 
removing, from the multilingual ASR model, a second portion of the embedding matrix corresponding to the second subset of language tokens associated with the second language (Mavandadi, page 840, Figure 2, , Fig. 2 Figure caption: Block diagram for E3 multilingual model. Model uses separate non-causal encoders and decoders for each language. Either predicted or inferred LID is used to activate one encoder-decoder pair for each utterance. A multilingual decoder can be used to generate hypotheses directly based on the causal encoder outputs without needing LID; Mavandadi, page 840, right col, 1st para, for E3, we convert the non-causal encoder to be language-dependent as well, resulting in the entire second pass being replicated per-language. Fig. 2 illustrates the E3 architecture.; [i.e., 1st Pass Hypothesis – Multilingual Decoder as “multilingual ASR model”; LID= language identification (LID) information; After 1st pass, switching with LID to individual specific languages like en-US, zh-TW …en-GB basically “REMOVES” all other languages except those specific language going through the Non-Casual Encoder and Decoder before the 2nd pass])
applying a digital audio input, comprising spoken language in the first language, to the language-specific ASR model, to obtain a transcript of the digital audio input in the first language (Mavandadi, page 843, (1) Decoders with Language Specific Non-Causal Encoders produce 2nd pass Hypo and (2) Multilingual Decoder produced 1st Pass Hypo. Experimental model WERs for different Architectures, including E3, E3+ and E31 are provided in Table 4 and discussed. In conclusion: "Our best system combines per-language decoders and second-pass encoders with a shared multilingual encoder to achieve a 5.5% reduction of WER relative to monolingual models”).  	Therefore, it would have been obvious for one having ordinary skill in the art before theeffective filing date of the claimed invention to incorporate the method of Mavandadi in the method of Wu because this would enable a truly multilingual first-pass and monolingual second-pass streaming on-device ASR system based on the recently developed Cascaded Encoders model so that the ASR systems will be more accurate, with low latency, and effectively handle language switching in order to be useful for the 60% of the world population that speaks more than one language (Mavandadi, Abstract).

Regarding Claim 9, Wu in view of Mavandadi discloses the system of claim 8, wherein the first subset of language tokens associated with the first language comprises tokens associated with a particular language script ((Wu, para 0020-0021, In some embodiments, the speech portion of a (speech, text) pair and the text portion of the pair are in the same language. In some embodiments, the speech portion of a (speech, text) pair may be in a first language while the text portion is in a second language (e.g., language translation). In some embodiments, the speech portion includes speech in multiple languages and the text portion includes text in those same languages (e.g., multi-lingual support). Cascaded ASR module 120 may include a diffusion machine learning model that is trained using the (speech, text) pairs in training dataset 130. The speech input and the text input of a given (speech, text) pair may be combined to create a single input tensor for the diffusion model. In some embodiments, the speech data is represented as a sequence of mel-spectrograms, where each frame of the mel-spectrogram includes a vector within an embedding space (e.g., an 80-dimension vector). For example, a neural network (e.g., Wav2Vec) may be used to convert the input speech data into vectors within an embedding space. The text data may be a represented as a sequence of discrete tokens (e.g., words, phonemes, IPA symbols, etc.))).  
 
Regarding Claim 10, Wu in view of Mavandadi discloses the system of claim 8.
Mavandadi further teaches:
wherein generating the language-specific ASR model further comprises: retaining, from the multilingual ASR model, a third portion of the embedding matrix corresponding to numerical characters (Mavandadi, page 840, see Figure 2: Causal Encoder is inherently converting text to vectors which are numbers/numerical characters).  
 
Regarding Claim 12, Wu in view of Mavandadi discloses the system of claim 8, the operations further comprising: identifying the first subset of language tokens associated with the first language, at least by tokenizing a corpus of text written in the first language (Wu, para 0020-0023, In some embodiments, the speech portion of a (speech, text) pair may be in a first language while the text portion is in a second language (e.g., language translation). In some embodiments, the speech portion includes speech in multiple languages and the text portion includes text in those same languages (e.g., multi-lingual support). Cascaded ASR module 120 may include a diffusion machine learning model that is trained using the (speech, text) pairs in training dataset 130… The text data may be a represented as a sequence of discrete tokens (e.g., words, phonemes, IPA symbols, etc.). The text may be converted from discrete tokens into vectors within an embedding space (e.g., vectors each having 512 dimensions). In some embodiments, the embedding space of the text data may be different from the embedding space of the speech data. To convert the discrete tokens to vectors, an embedding mapping may be used, where each token is replaced by a vector within the embedding space. In some embodiments, the embedding mapping is performed using a lookup table. In some embodiments, the embedding mapping is performed using a neural network (e.g., Word2Vec). In some embodiments, the text data is converted from a first set of discrete tokens (e.g., words, phonemes) to a second set of discrete tokens (e.g., international phonetic alphabet (IPA) symbols). Then the second set of discrete tokens may be converted to the vector representation… The text portion may be represented as one or more vectors within an embedding space, so a linear model with a softmax layer may be used to translate (e.g., “round”) the vector representations back into discrete text tokens).  
   
Regarding Claim 15, Wu discloses a method comprising: 
accessing a multilingual automatic speech recognition (ASR) model comprising a token embedding matrix corresponding to a plurality of language tokens, wherein the plurality of language tokens comprises at least (a) a first subset of language tokens associated with a first language and (b) a second subset of language tokens associated with a second language (Wu, para 0010-0013,  providing systems and techniques that use a cascaded diffusion model (e.g., a machine learning model that includes a diffusion process and receives, as input, output from another machine learning model) for multi-lingual semi-supervised ASR. The cascaded diffusion model may receive as training data a plurality of (speech, text) pairs… A vector representation of the speech data may be concatenated with a vector representation of the text data of a (speech, text) pair to form an input tensor (e.g., one or more vectors or tensors within an embedding space) for training the diffusion model… In some embodiments, the input speech data and the output text data are in the same language. In some embodiments, the input speech data is in a first language, and the output text data is in a second language (e.g., language translation). In some embodiments, the input speech data includes multiple languages, and the output text data similarly includes multiple languages (e.g., multi-lingual support); para 0040, Figure 3, The (speech, text) pair may include a speech portion 310 and a text portion 312. Speech portion 310 may be represented as a sequence of mel-spectrograms, where each frame of the mel-spectrogram includes a vector within an embedding space (e.g., an 80-dimension vector). Text portion 312 may be represented as a sequence of discrete tokens. Text portion 312 may be converted (e.g., embedding 320) from discrete tokens into text vectors 314)
	Wu does not specifically disclose generating a language-specific ASR model for the first language, at least by: retaining, from the multilingual ASR model, a first portion of the embedding matrix corresponding to the first subset of language tokens associated with the first language, removing, from the multilingual ASR model, a second portion of the embedding matrix corresponding to the second subset of language tokens associated with the second language, applying a digital audio input, comprising spoken language in the first language, to the language-specific ASR model, to obtain a transcript of the digital audio input in the first language, wherein the method is performed by at least one device including a hardware processor.
	However, Mavandadi, in the same field of endeavor, discloses 
generating a language-specific ASR model for the first language (Mavandadi, page 839, Figure 1, baseline monolingual architecture), at least by: 
retaining, from the multilingual ASR model, a first portion of the embedding matrix corresponding to the first subset of language tokens associated with the first language (Mavandadi, pages 839 and 840, , Figure 1 and Figure 2, Figure 2 Figure Caption, Model uses separate non-causal encoders and decoders for each language. Either predicted or inferred LID is used to activate one encoder-decoder pair for each utterance. A multilingual decoder can be used to generate hypotheses directly based on the causal encoder outputs without needing LID; Mavandadi, page 840, left col, 1st para, We use the trained parameters from E0 to initialize the first pass encoders of our remaining experimental architectures and keep those parameters frozen during training in order to preserve the models’ capability to recognize multilingual speech using their first pass; [i.e., From Figure 1,  the Causal Encoder has all the Multilingual components of the different languages and it retains  everything and after going through LID, each Non-Causal Encoder and associated Decoder has the language specific tokens]); 
removing, from the multilingual ASR model, a second portion of the embedding matrix corresponding to the second subset of language tokens associated with the second language (Mavandadi, page 840, Figure 2, Fig. 2 Figure caption: Block diagram for E3 multilingual model. Model uses separate non-causal encoders and decoders for each language. Either predicted or inferred LID is used to activate one encoder-decoder pair for each utterance. A multilingual decoder can be used to generate hypotheses directly based on the causal encoder outputs without needing LID; Mavandadi, page 840, right col, 1st para, for E3, we convert the non-causal encoder to be language-dependent as well, resulting in the entire second pass being replicated per-language. Fig. 2 illustrates the E3 architecture.; [i.e., 1st Pass Hypothesis – Multilingual Decoder as “multilingual ASR model”; LID= language identification (LID) information; After 1st pass, switching with LID to individual specific languages like en-US, zh-TW …en-GB basically “REMOVES” all other languages except those specific language going through the Non-Casual Encoder and Decoder before the 2nd pass]); 
applying a digital audio input, comprising spoken language in the first language, to the language-specific ASR model, to obtain a transcript of the digital audio input in the first language (Mavandadi, page 843, (1) Decoders with Language Specific Non-Causal Encoders produce 2nd pass Hypo and (2) Multilingual Decoder produced 1st Pass Hypo. Experimental model WERs for different Architectures, including E3, E3+ and E31 are provided in Table 4 and discussed. In conclusion: "Our best system combines per-language decoders and second-pass encoders with a shared multilingual encoder to achieve a 5.5% reduction of WER relative to monolingual models”), 
wherein the method is performed by at least one device including a hardware processor (Mavandadi, page 838, Advances in hardware [1, 2] and algorithms [3–5] have made ASR an accessible and reliable approach for a variety of hands-free tasks in daily routines. In addition, the growing prevalence of personal computing devices has made it possible to bring ASR to much of the world’s population if models and algorithms are efficient enough to operate on hardware with a diverse range of computation capabilities).
	Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of Mavandadi in the method of Wu because this would enable a truly multilingual first-pass and monolingual second-pass streaming on-device ASR system based on the recently developed Cascaded Encoders model so that the ASR systems will be more accurate, with low latency, and effectively handle language switching in order to be useful for the 60% of the world population that speaks more than one language (Mavandadi, Abstract).

Regarding Claim 16, Wu in view of Mavandadi discloses the method of claim 15, wherein the first subset of language tokens associated with the first language comprises tokens associated with a particular language script (Wu, para 0020-0021, In some embodiments, the speech portion of a (speech, text) pair and the text portion of the pair are in the same language. In some embodiments, the speech portion of a (speech, text) pair may be in a first language while the text portion is in a second language (e.g., language translation). In some embodiments, the speech portion includes speech in multiple languages and the text portion includes text in those same languages (e.g., multi-lingual support). Cascaded ASR module 120 may include a diffusion machine learning model that is trained using the (speech, text) pairs in training dataset 130. The speech input and the text input of a given (speech, text) pair may be combined to create a single input tensor for the diffusion model. In some embodiments, the speech data is represented as a sequence of mel-spectrograms, where each frame of the mel-spectrogram includes a vector within an embedding space (e.g., an 80-dimension vector). For example, a neural network (e.g., Wav2Vec) may be used to convert the input speech data into vectors within an embedding space. The text data may be a represented as a sequence of discrete tokens (e.g., words, phonemes, IPA symbols, etc.)).  
 
Regarding Claim 17, Wu in view of Mavandadi discloses the method of claim 15.
Mavandadi further teaches:
wherein generating the language-specific ASR model further comprises: retaining, from the multilingual ASR model, a third portion of the embedding matrix corresponding to numerical characters (Mavandadi, page 840, see Figure 2: Causal Encoder is inherently converting text to vectors which are numbers/numerical characters).  
 
Regarding Claim 19, Wu in view of Mavandadi discloses the method of claim 15, the method further comprising: identifying the first subset of language tokens associated with the first language, at least by tokenizing a corpus of text written in the first language (Wu, para 0020-0023, In some embodiments, the speech portion of a (speech, text) pair may be in a first language while the text portion is in a second language (e.g., language translation). In some embodiments, the speech portion includes speech in multiple languages and the text portion includes text in those same languages (e.g., multi-lingual support). Cascaded ASR module 120 may include a diffusion machine learning model that is trained using the (speech, text) pairs in training dataset 130… The text data may be a represented as a sequence of discrete tokens (e.g., words, phonemes, IPA symbols, etc.). The text may be converted from discrete tokens into vectors within an embedding space (e.g., vectors each having 512 dimensions). In some embodiments, the embedding space of the text data may be different from the embedding space of the speech data. To convert the discrete tokens to vectors, an embedding mapping may be used, where each token is replaced by a vector within the embedding space. In some embodiments, the embedding mapping is performed using a lookup table. In some embodiments, the embedding mapping is performed using a neural network (e.g., Word2Vec). In some embodiments, the text data is converted from a first set of discrete tokens (e.g., words, phonemes) to a second set of discrete tokens (e.g., international phonetic alphabet (IPA) symbols). Then the second set of discrete tokens may be converted to the vector representation… The text portion may be represented as one or more vectors within an embedding space, so a linear model with a softmax layer may be used to translate (e.g., “round”) the vector representations back into discrete text tokens).
Claims 4, 11 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wu in view of Mavandadi, and further in view of Bălan, Dragoș Alexandru, "Improving the State-of-the-Art Frisian ASR by fine-tuning Large-Scale Cross-Lingual Pre-Trained Models." PhD diss., 2023 (Balan).

Regarding Claim 4, Wu in view of Mavandadi disclose the one or more media of claim 1.
	Wu in view of Mavandadi do not specifically disclose wherein generating the language-specific ASR model further comprises: retaining, from the multilingual ASR model, a third portion of the embedding matrix corresponding to special characters used to direct operation of the multilingual ASR model.
However, Balan, in the same field of endeavor, discloses wherein generating the language-specific ASR model further comprises: retaining, from the multilingual ASR model, a third portion of the embedding matrix corresponding to special characters used to direct operation of the multilingual ASR model (Balan, page 30, 3rd-4th para, To fine-tune the framework for speech recognition, a linear projection is added on top which is trained using tokens, and a CTC loss which outputs sequences of tokens. Tokens in this case correspond to characters. Employing the CTC method, we make predictions of tokens for every frame in our audio features. These tokens are selected from a predefined vocabulary…The special tokens consist of the whitespace token, a blank token that helps with decoding words that contain two letters next to each other (such as ”namme”), or an unknown token for characters that are out of the vocabulary. After tokens have been predicted for each frame, the non-space tokens that appear consecutively are combined into a single character, resulting in a character sequence for each group of tokens separated by a space token. The CTC loss function’s role is to minimize the differences between the predicted and the target character sequences so that the model outputs accurate transcriptions). 
	Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of Balan in the method of Wu in view of Mavandadi because this would enhance Frisian ASR performance and address the challenges posed by its lowresource status by focusing on fine-tuning the XLS-R model, a large-scale cross-lingual pre-trained model, that has shown promising results in multilingual ASR tasks (Balan, Abstract).
 
Regarding Claim 11, Wu in view of Mavandadi discloses the system of claim 8.
Wu in view of Mavandadi do not specifically disclose wherein generating the language-specific ASR model further comprises: retaining, from the multilingual ASR model, a third portion of the embedding matrix corresponding to special characters used to direct operation of the multilingual ASR model.
However, Balan, in the same field of endeavor, discloses wherein generating the language-specific ASR model further comprises: retaining, from the multilingual ASR model, a third portion of the embedding matrix corresponding to special characters used to direct operation of the multilingual ASR model (Balan, page 30, 3rd-4th para, To fine-tune the framework for speech recognition, a linear projection is added on top which is trained using tokens, and a CTC loss which outputs sequences of tokens. Tokens in this case correspond to characters. Employing the CTC method, we make predictions of tokens for every frame in our audio features. These tokens are selected from a predefined vocabulary…The special tokens consist of the whitespace token, a blank token that helps with decoding words that contain two letters next to each other (such as ”namme”), or an unknown token for characters that are out of the vocabulary. After tokens have been predicted for each frame, the non-space tokens that appear consecutively are combined into a single character, resulting in a character sequence for each group of tokens separated by a space token. The CTC loss function’s role is to minimize the differences between the predicted and the target character sequences so that the model outputs accurate transcriptions ).  
	Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of Balan in the method of Wu in view of Mavandadi because this would enhance Frisian ASR performance and address the challenges posed by its lowresource status by focusing on fine-tuning the XLS-R model, a large-scale cross-lingual pre-trained model, that has shown promising results in multilingual ASR tasks (Balan, Abstract).

Regarding Claim 18, Wu in view of Mavandadi discloses the method of claim 15.
Wu in view of Mavandadi do not specifically disclose wherein generating the language-specific ASR model further comprises: retaining, from the multilingual ASR model, a third portion of the embedding matrix corresponding to special characters used to direct operation of the multilingual ASR model.
However, Balan, in the same field of endeavor, discloses wherein generating the language-specific ASR model further comprises: retaining, from the multilingual ASR model, a third portion of the embedding matrix corresponding to special characters used to direct operation of the multilingual ASR model (Balan, page 30, 3rd-4th para, To fine-tune the framework for speech recognition, a linear projection is added on top which is trained using tokens, and a CTC loss which outputs sequences of tokens. Tokens in this case correspond to characters. Employing the CTC method, we make predictions of tokens for every frame in our audio features. These tokens are selected from a predefined vocabulary…The special tokens consist of the whitespace token, a blank token that helps with decoding words that contain two letters next to each other (such as ”namme”), or an unknown token for characters that are out of the vocabulary. After tokens have been predicted for each frame, the non-space tokens that appear consecutively are combined into a single character, resulting in a character sequence for each group of tokens separated by a space token. The CTC loss function’s role is to minimize the differences between the predicted and the target character sequences so that the model outputs accurate transcriptions ).  
	Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of Balan in the method of Wu in view of Mavandadi because this would enhance Frisian ASR performance and address the challenges posed by its lowresource status by focusing on fine-tuning the XLS-R model, a large-scale cross-lingual pre-trained model, that has shown promising results in multilingual ASR tasks (Balan, Abstract).


Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Wu in view of Mavandadi, and further in view of MacWhinney, Brian, and Davida Fromm. "Language sample analysis with TalkBank: An update and review." Frontiers in communication 7 (2022): 865498 (MacWhinney)).

Regarding Claim 7, The one or more media of claim 1.
Wu in view of Mavandadi do not specifically disclose the operations further comprising: identifying the second subset of language tokens as tokens having a token reading direction different from a reading direction of the first language. 
However, MacWhinney, in the same field of endeavor, discloses the operations further comprising: identifying the second subset of language tokens as tokens having a token reading direction different from a reading direction of the first language (MacWhinney, page 5, right col, 2nd para, Entry of characters from languages that write from right to left is possible. However, combining right-to-left script with the left-to-right features of CHAT can be tricky. For that reason, we recommend the use of romanization for languages with right to left orthographies).  
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of MacWhinney in the method of Wu in view of Mavandadi because this would enable utilization of open and free methods that have been used for language sample analysis (LSA) with several clinical populations examine automatic and implementation of TalkBank methods that use ASR (automatic speech recognition), NLP (natural language processing), database technology, statistics in R and Python, and ML (machine learning) (MacWhinney, Abstract).

Regarding Claim 14, The system of claim 8.
Wu in view of Mavandadi do not specifically disclose the operations further comprising: identifying the second subset of language tokens as tokens having a token reading direction different from a reading direction of the first language.
However, MacWhinney, in the same field of endeavor, discloses the operations further comprising: identifying the second subset of language tokens as tokens having a token reading direction different from a reading direction of the first language (MacWhinney, page 5, right col, 2nd para, Entry of characters from languages that write from right to left is possible. However, combining right-to-left script with the left-to-right features of CHAT can be tricky. For that reason, we recommend the use of romanization for languages with right to left orthographies).  
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of MacWhinney in the method of Wu in view of Mavandadi because this would enable utilization of open and free methods that have been used for language sample analysis (LSA) with several clinical populations examine automatic and implementation of TalkBank methods that use ASR (automatic speech recognition), NLP (natural language processing), database technology, statistics in R and Python, and ML (machine learning) (MacWhinney, Abstract).
Allowable Subject Matter
Claims 6, 13 and 20 are objected to as being dependent upon rejected base claims, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The reasons for allowance are that the prior art of record do not specifically teach the limitations as recited in the mentioned claims


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MULUGETA T. DUGDA whose telephone number is (703)756-1106. The examiner can normally be reached Mon - Fri, 4:30am - 7:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Paras D. Shah can be reached at 571-270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MULUGETA TUJI DUGDA/Examiner, Art Unit 2653                                                                                                                                                                                                        
/Paras D Shah/Supervisory Patent Examiner, Art Unit 2653                                                                                                                                                                                                        
02/02/2026
Read full office action
Prosecution Timeline

May 13, 2024
Application Filed
Feb 04, 2026
Non-Final Rejection mailed — §103
Apr 10, 2026
Examiner Interview Summary
Apr 10, 2026
Applicant Interview (Telephonic)
Apr 13, 2026
Response Filed
Precedent Cases

Applications granted by this same examiner with similar technology

18/041,756
Patent 12620387
VOICE GENERATION METHOD AND APPARATUS, DEVICE, AND COMPUTER READABLE MEDIUM
3y 2m to grant Granted May 05, 2026
17/912,112
Patent 12597424
METHOD AND APPARATUS FOR DETERMINING SKILL FIELD OF DIALOGUE TEXT
3y 6m to grant Granted Apr 07, 2026
17/912,912
Patent 12592244
REDUCED-BANDWIDTH SPEECH ENHANCEMENT WITH BANDWIDTH EXTENSION
3y 6m to grant Granted Mar 31, 2026
17/662,896
Patent 12579366
DEVELOPMENT PLATFORM FOR FACILITATING THE OPTIMIZATION OF NATURAL-LANGUAGE-UNDERSTANDING SYSTEMS
3y 10m to grant Granted Mar 17, 2026
18/015,732
Patent 12573417
A COMPUTER-IMPLEMENTED METHOD OF PROVIDING DATA FOR AN AUTOMATED BABY CRY ASSESSMENT
3y 1m to grant Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

1-2
Expected OA Rounds
82%
Grant Probability
99%
With Interview (+19.3%)
2y 11m (~10m remaining)
Median Time to Grant
Low
PTA Risk
Based on 50 resolved cases by this examiner. Grant probability derived from career allowance rate.