Prosecution Insights
Last updated: April 19, 2026
Application No. 18/188,632

Streaming End-to-end Multilingual Speech Recognition with Joint Language Identification

Non-Final OA: §102, §103

Filed: Mar 23, 2023
Examiner: ISKENDER, ALVIN ALIK
Art Unit: 2654
Tech Center: 2600 (Communications)
Assignee: Google LLC
OA Round: 1 (Non-Final)

Grant Probability: 48% (Moderate); 99% with interview
Expected OA Rounds: 1-2
Time to Grant: 3y 4m

Examiner Intelligence

Grants 48% of resolved cases.

Career Allow Rate: 48% (12 granted / 25 resolved; -14.0% vs TC avg)
Interview Lift: +60.3% (strong lift in resolved cases with an interview)
Typical Timeline: 3y 4m average prosecution; 20 currently pending
Career History: 45 total applications across all art units

Statute-Specific Performance

§101: 15.6% (-24.4% vs TC avg)
§103: 53.0% (+13.0% vs TC avg)
§102: 25.8% (-14.2% vs TC avg)
§112: 5.4% (-34.6% vs TC avg)
Deltas are measured against Tech Center average estimates • Based on career data from 25 resolved cases
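As a quick consistency check on the statute-level figures above, the implied Tech Center baseline can be recovered from each rate and its delta. This assumes each delta is a simple difference between the examiner's rate and a single TC average estimate, which the dashboard suggests but does not state explicitly:

```python
# Recover the implied Tech Center average from each examiner rate and its delta
# (assumption: delta = examiner_rate - tc_avg).
rates  = {"101": 15.6, "103": 53.0, "102": 25.8, "112": 5.4}
deltas = {"101": -24.4, "103": 13.0, "102": -14.2, "112": -34.6}

tc_avg = {statute: round(rates[statute] - deltas[statute], 1) for statute in rates}
print(tc_avg)  # every statute backs out to a 40.0% TC average estimate
```

Notably, all four deltas are consistent with the same ~40% baseline, suggesting the dashboard compares every statute against one overall Tech Center allow-rate estimate rather than per-statute baselines.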

Office Action

Grounds of rejection: §102, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-3, 8, 11-13, and 18 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kannan (US 20200380215 A1).
Claim 1: Kannan teaches a multilingual automated speech recognition (ASR) model comprising:

a first encoder ([0010]: encoder network) configured to: receive, as input, a sequence of acoustic frames ([0025]: input acoustic frames); and generate, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames ([0031]: generate acoustic feature vector from acoustic frames);

a second encoder configured to: receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps; and generate, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature representation ([0028]: language vector generation);

a language identification (ID) predictor configured to: receive, as input, a concatenation of the first higher order feature representation generated by the first encoder at each of the plurality of output steps and the second higher order feature representation generated by the second encoder at each of the plurality of output steps ([0008]: concatenate the vectors to create an input vector); and generate, at each of the plurality of output steps, a language prediction representation ([0028]-[0029]: language identifier generates a vector representation of the language); and

a first decoder configured to: receive, as input, a concatenation of the second higher order feature representation generated by the second encoder at each of the plurality of output steps and the language prediction representation generated by the language ID predictor at each of the plurality of output steps ([0015]: higher-order feature representation from the input vector concatenated from acoustic and language vectors); and generate, at each of the plurality of output steps, a first probability distribution over possible speech recognition hypotheses ([0015]: use the feature representation to create a probability distribution).

Claim 2: Parent claim 1 is addressed above. Kannan further teaches a model comprising a second decoder configured to: receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps ([0038]: first higher order representation is passed through encoder network); and generate, at each of the plurality of output steps, a second probability distribution over possible speech recognition hypotheses ([0015]).

Claim 3: Parent claim 2 is addressed above. Kannan further teaches a model wherein the second decoder is further configured to generate partial speech recognition results based on the second probability distribution over possible speech recognition hypotheses ([0015]: generated transcription from the probability distribution).

Claim 8: Parent claim 1 is addressed above. Kannan further teaches a model wherein the first encoder, the second encoder, and the language ID predictor are jointly trained on a set of multilingual training utterances by: generating a first loss for the first encoder, generating a second loss for the second encoder, generating a third loss for the language ID predictor, and minimizing a weighted sum of the first loss, the second loss, and the third loss ([0040]-[0044]: language prediction losses).

Claims 11-13 and 18 are analogous to claims 1-3 and 8 addressed above, and are thus rejected in a similar fashion.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4-7 and 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over Kannan (US 20200380215 A1) in view of Huang (US 20230096821 A1).

Claim 4: Parent claim 1 is addressed above. Kannan does not teach, but Huang does teach, the model wherein the first decoder and the second decoder each comprise a corresponding prediction network followed by a corresponding joint network (Figure 2); the corresponding prediction networks of the first and second decoders have a same structure comprising one of a long short-term memory (LSTM)-based prediction network or V2 embedding look-up table ([0042]: LSTM layers); and the corresponding joint networks of the first and second decoders comprise a same structure (Figure 2, [0044]). Huang teaches a cascading encoding structure in its automatic speech recognition method. It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to incorporate cascading encoders taught by Huang in the method of Kannan because it increases the accuracy of rare words or long-tail proper nouns (see Huang [0030]).

Claim 5: Parent claim 1 is addressed above.
Kannan does not teach, but Huang does teach, the model wherein the second encoder generates the second higher order feature representation without receiving any of the acoustic frames as input (Figure 2A, [0007]: second encoder receives higher order feature representation as input). It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Huang with Kannan for a similar reason and motivation as described for claim 4 above.

Claim 6: Parent claim 1 is addressed above. Kannan does not teach, but Huang does teach, the model wherein the first encoder comprises a causal encoder comprising one of: a plurality of unidirectional long short-term memory (LSTM) layers, a plurality of conformer layers, or a plurality of transformer layers ([0042]: first encoder includes unidirectional LSTM layers). It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Huang with Kannan for a similar reason and motivation as described for claim 4 above.

Claim 7: Parent claim 1 is addressed above. Kannan does not teach, but Huang does teach, the model wherein the second encoder comprises a non-causal encoder comprising one of: one or more bi-directional long short-term memory (LSTM) layers, a plurality of conformer layers, or a plurality of transformer layers ([0042]: second encoder includes bidirectional LSTM layers). It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Huang with Kannan for a similar reason and motivation as described for claim 4 above.

Claims 14-17 are analogous to claims 4-7 addressed above, and are thus rejected in a similar manner.

Claims 9-10 and 19-20 are rejected under 35 U.S.C.
103 as being unpatentable over Kannan (US 20200380215 A1) in view of Ramanarayanan (US 11238844 B1).

Claim 9: Parent claim 8 is addressed above. Kannan does not teach the model wherein a language ID target token is added as a first token of a corresponding ground-truth transcription of each multilingual training utterance in the set of multilingual training utterances, the language ID target token identifying a language of the corresponding multilingual training utterance. However, Ramanarayanan does teach the method wherein a language ID target token is added as a first token of a corresponding ground-truth transcription of each multilingual training utterance in the set of multilingual training utterances, the language ID target token identifying a language of the corresponding multilingual training utterance (Columns 3-4, "Language Identification from Text": per-word language identification). It would have been obvious to one with ordinary skill in the art before the effective filing date to have used a multilingual language identification token because it allows for the transcription of code-switched language (see Ramanarayanan, Background).

Claim 10: Parent claim 8 is addressed above. Kannan does not teach the model wherein a language ID target token is added to each position where a code-switch occurs in a corresponding ground-truth transcription of each multilingual training utterance in the set of multilingual training utterances. However, Ramanarayanan does teach the method wherein a language ID target token is added to each position where a code-switch occurs in a corresponding ground-truth transcription of each multilingual training utterance in the set of multilingual training utterances.
(Columns 3-4, "Language Identification from Text": per-word language identification) It would have been obvious to one with ordinary skill in the art before the effective filing date to have used a multilingual language identification token because it allows for the transcription of code-switched language (see Ramanarayanan, Background).

Claims 19-20 are analogous to claims 9-10 above, and are thus rejected in a similar manner.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALVIN ISKENDER, whose telephone number is (703) 756-4565. The examiner can normally be reached M-F. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, HAI PHAN, can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ALVIN ISKENDER/
Examiner, Art Unit 2654

/HAI PHAN/
Supervisory Patent Examiner, Art Unit 2654
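To make the claim language easier to follow, the dataflow recited in claim 1 and the weighted multi-task objective recited in claim 8 can be traced with stand-in components. This is an illustrative sketch only, not Kannan's, Huang's, or the applicant's actual implementation; the dimensions, the `linear` stand-in blocks, and the loss values and weights are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(dim_in, dim_out):
    # Stand-in for a trained network block (hypothetical, for shape-tracing only).
    w = rng.standard_normal((dim_in, dim_out)) * 0.1
    return lambda x: np.tanh(x @ w)

D_FRAME, D_ENC1, D_ENC2, N_LANGS, N_HYP = 80, 64, 64, 4, 100

first_encoder  = linear(D_FRAME, D_ENC1)           # claim 1: first encoder
second_encoder = linear(D_ENC1, D_ENC2)            # cascaded: consumes only enc1 output
lid_predictor  = linear(D_ENC1 + D_ENC2, N_LANGS)  # sees a concatenation of both representations
first_decoder  = linear(D_ENC2 + N_LANGS, N_HYP)   # sees enc2 output + language prediction

frames = rng.standard_normal((10, D_FRAME))        # sequence of acoustic frames

h1   = first_encoder(frames)                              # first higher order representation
h2   = second_encoder(h1)                                 # second higher order representation
lid  = lid_predictor(np.concatenate([h1, h2], axis=-1))   # language prediction per output step
dist = first_decoder(np.concatenate([h2, lid], axis=-1))  # distribution over hypotheses
print(dist.shape)  # (10, 100): one distribution per output step

# Claim 8's objective: minimize a weighted sum of three losses
# (made-up loss values and weights, purely for illustration).
first_loss, second_loss, lid_loss = 0.9, 0.7, 0.4
w1, w2, w3 = 0.5, 0.3, 0.2
joint_loss = w1 * first_loss + w2 * second_loss + w3 * lid_loss
print(round(joint_loss, 2))  # 0.74
```

Note how the sketch surfaces the distinction the claims turn on: the language ID predictor consumes a concatenation of both encoder outputs, while the first decoder consumes the second encoder output concatenated with the language prediction, which is the streaming arrangement the examiner maps onto Kannan's concatenated input vector.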
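Claims 9 and 10 can likewise be illustrated with a small tokenization sketch. The `<en>`/`<es>` token names and the per-word language labels below are hypothetical stand-ins, not Ramanarayanan's actual format:

```python
def add_lid_tokens(words, langs):
    """Prepend a language ID token (claim 9) and insert one at each
    code-switch position (claim 10). `langs` is an assumed per-word label."""
    out, prev = [], None
    for word, lang in zip(words, langs):
        if lang != prev:  # first word of the utterance, or the language just switched
            out.append(f"<{lang}>")
            prev = lang
        out.append(word)
    return out

words = ["play", "la", "canción", "again"]
langs = ["en", "es", "es", "en"]
print(" ".join(add_lid_tokens(words, langs)))
# <en> play <es> la canción <en> again
```

The augmented transcript shows why the examiner cites per-word language identification for code-switched speech: each switch point in the ground truth carries an explicit language token for the model to predict.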

Prosecution Timeline

Mar 23, 2023
Application Filed
Mar 26, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12562244: COMBINING DOMAIN-SPECIFIC ONTOLOGIES FOR LANGUAGE PROCESSING
Granted Feb 24, 2026 (2y 5m to grant)
Patent 12531078: NOISE SUPPRESSION FOR SPEECH ENHANCEMENT
Granted Jan 20, 2026 (2y 5m to grant)
Patent 12505825: SPONTANEOUS TEXT TO SPEECH (TTS) SYNTHESIS
Granted Dec 23, 2025 (2y 5m to grant)
Patent 12456457: ALL DEEP LEARNING MINIMUM VARIANCE DISTORTIONLESS RESPONSE BEAMFORMER FOR SPEECH SEPARATION AND ENHANCEMENT
Granted Oct 28, 2025 (2y 5m to grant)
Patent 12407783: DOUBLE-MICROPHONE ARRAY ECHO ELIMINATING METHOD, DEVICE AND ELECTRONIC EQUIPMENT
Granted Sep 02, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 48% (99% with interview, +60.3%)
Median Time to Grant: 3y 4m
PTA Risk: Low
Based on 25 resolved cases by this examiner. Grant probability derived from career allow rate.
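The headline probability follows directly from the stated career counts; the sketch below assumes the dashboard simply rounds the raw ratio of granted to resolved cases:

```python
# Reproduce the grant probability from the examiner's stated career counts.
granted, resolved = 12, 25
grant_probability = granted / resolved
print(f"{grant_probability:.0%}")  # 48%, matching the career allow rate
```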
