Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
2. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
3. Claims 1-8, 11-17 and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Bijwadia (US 2024/0029719).
Regarding Claim 1:
Bijwadia discloses a method comprising:
performing a plurality of iterations of an outer processing loop to identify a plurality of content units (CUs) of a media item having a plurality of frames (Bijwadia: ¶[0043] discloses that the output labels may be wordpieces, graphemes or entire words, meaning it identifies content units. ¶[0004] discloses a sequence of audio frames, i.e., a plurality of frames), an individual iteration of the plurality of iterations of the outer processing loop comprising:
updating, using a first neural network (NN) and an identified non-blank CU, a state of the media item (Bijwadia: ¶[0045] discloses the joint network, and prediction network, including one or more neural network layers, the prediction network outputs a dense representation);
and performing one or more iterations of an inner processing loop (Bijwadia: Fig. 2 and ¶[0041]-[0042] disclose repeated iterative operations within each broader decoding process: the predictor state is updated from prior non-blank outputs, the joint network combines the predictor-side and encoder-side representations, and a Softmax classifier determines whether the output symbol is blank or non-blank; taken together, this means the speech recognition operates as a nested loop, as these operations must occur for each individual frame), an individual iteration of the one or more iterations of the inner processing loop comprising:
processing, using a second NN, the state of the media item and an individual frame of the plurality of frames to predict a CU of one or more CUs associated with the individual frame (Bijwadia: Fig. 2 Joint network 252 is a second neural network, the state of the media item is the dense representation from the prediction network, the individual frame corresponds to the encoder representation of the audio frame derived from the frame sequence, and the predicted CU is the predicted output symbol),
wherein the one or more iterations of the inner processing loop are performed until the predicted CU corresponds to a non-blank CU (Bijwadia: ¶[0041]-¶[0042] explicitly disclose blank and non-blank symbols and disclose feeding the non-blank symbols back into the prediction network; these are the same blank and non-blank CUs recited in the claim limitation)
and generating, using the identified plurality of CUs, a representation of the media item (Bijwadia: ¶[0007] and Fig. 2 elements 256-257 generate a transcription from the predicted symbols).
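For illustration of the nested-loop decoding mapped above, the following is a minimal sketch of greedy transducer decoding. It is not taken from Bijwadia; the names `predictor`, `joiner`, and `classifier` are hypothetical stand-ins for the prediction network 254, joint network 252, and Softmax layer 256.

```python
# Minimal, hypothetical sketch of greedy transducer decoding (not from
# Bijwadia): an outer loop per identified non-blank content unit (CU) and an
# inner loop over frames that runs until a non-blank CU is predicted.
BLANK = 0  # index reserved for the blank CU in this sketch

def greedy_transducer_decode(frames, predictor, joiner, classifier, bos=1):
    """Return the non-blank CUs identified for the given frame sequence."""
    state = predictor.init_state(bos)        # state of the media item
    cus = []                                 # identified non-blank CUs
    t = 0
    while t < len(frames):                   # outer loop: one non-blank CU each
        cu = BLANK
        while t < len(frames):               # inner loop: until a non-blank CU
            probs = classifier(joiner(state, frames[t]))
            cu = max(range(len(probs)), key=probs.__getitem__)
            if cu != BLANK:
                break                        # non-blank CU identified
            t += 1                           # blank: keep the state, next frame
        if cu == BLANK:
            break                            # frames exhausted on a blank
        cus.append(cu)
        state = predictor.update(state, cu)  # first NN updates the state
    return cus
```

On a blank prediction the state is maintained and the next frame is consumed (cf. the claim 6 mapping below), while a non-blank prediction updates the state and begins the next outer-loop iteration (cf. the claim 7 mapping). A production decoder would additionally cap the number of symbols emitted per frame.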
Regarding Claim 2:
Bijwadia further discloses the method of claim 1, wherein the updating the state of the media item comprises: processing, using the first NN, the state of the media item and the identified non-blank CU (Bijwadia: ¶[0041]-[0042] disclose that the prediction network processes previously identified non-blank symbols to generate an updated dense hidden representation, which is analogous to the claimed state of the media item. Specifically, Bijwadia states that the prediction network processes the sequence of non-blank symbols into a dense or hidden representation and explains that the past non-blank symbols are used to assist prediction of the next symbol. Fig. 2 likewise shows the feedback of the prior non-blank output into prediction network 254, with the output supplied onward to the joint network).
Regarding Claim 3:
Bijwadia further discloses the method of claim 2, wherein the first NN comprises a predictor NN (Bijwadia: ¶[0007], ¶[0012], and ¶[0042] disclose that the decoder includes a prediction network 254; that prediction network is the claimed predictor NN).
Regarding Claim 4:
Bijwadia further discloses the method of claim 1, wherein the second NN comprises a joiner NN, and wherein the processing the state of the media item and the individual frame comprises:
processing, using the joiner NN, the state of the media item and an encoder vector to generate a prediction vector, wherein the encoder vector is generated by applying an encoder NN to the individual frame (Bijwadia: ¶[0032]-[0035] disclose an audio encoder that receives the sequence of frames and generates encoded representations, including latent representation 243 and higher-order feature representations. A joint network 252 in the decoder is disclosed in ¶[0041], and ¶[0043] discloses that the joint network combines the encoder-side representation with the predictor-side dense representation to generate a probability distribution over next output symbols).
Regarding Claim 5:
Bijwadia further discloses the method of claim 4, wherein the processing the state of the media item and the individual frame further comprises: generating, using a classifier NN and the prediction vector, a plurality of probabilities that the individual frame is associated with at least one of: a plurality of vocabulary CUs, or a blank CU (Bijwadia: ¶[0042]-[0044] disclose that the joint network 252 combines the encoder representation and the predictor-side state and outputs a probability distribution over possible next output symbols. Bijwadia further teaches a final Softmax layer 256 that receives that distribution and selects the output label/symbol with the highest probability. Bijwadia also discloses that the decoder predicts a next symbol or a blank symbol. Therefore, Bijwadia teaches a classifier NN (the Softmax layer), a prediction vector (the output of the joint network/probability distribution), and the plurality of probabilities over vocabulary content units and blanks (the posterior probabilities over output labels, including the blank symbol)).
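For illustration of the joiner-plus-classifier mapping in claims 4 and 5, the following is a minimal numpy sketch; the weight matrices, shapes, and the convention that index 0 is the blank CU are invented for the example and are not taken from Bijwadia.

```python
# Hypothetical sketch: a joint (joiner) network combines the predictor-side
# state with an encoder vector, and a Softmax classifier converts the
# resulting prediction vector into probabilities over vocabulary CUs + blank.
import numpy as np

def joiner(state, enc_vec, W_pred, W_enc, W_out):
    # prediction vector from combined predictor-side and encoder-side inputs
    hidden = np.tanh(state @ W_pred + enc_vec @ W_enc)
    return hidden @ W_out                    # logits over vocabulary + blank

def classifier(logits):
    # Softmax: a plurality of probabilities (index 0 = blank CU, by convention)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
D, V = 8, 5                                  # hidden size, vocab size (incl. blank)
state, enc_vec = rng.normal(size=D), rng.normal(size=D)
W_pred, W_enc = rng.normal(size=(D, D)), rng.normal(size=(D, D))
W_out = rng.normal(size=(D, V))
probs = classifier(joiner(state, enc_vec, W_pred, W_enc, W_out))
print(probs.sum())                           # ~1.0: a valid distribution
```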
Regarding Claim 6:
Bijwadia further discloses the method of claim 1, wherein the individual iteration of the one or more iterations of the inner processing loop further comprises: maintaining, responsive to determining that the predicted CU corresponds to a blank CU, the state of the media item (Bijwadia: ¶[0042] distinguishes between blank and non-blank outputs and explains that the prediction network processes the sequence of non-blank symbols output so far. This means that where the current prediction is a blank symbol, no new non-blank symbol is added to the sequence processed by the prediction network, and the decoder-side state based on prior non-blank outputs is maintained rather than advanced by a newly identified non-blank unit).
Regarding Claim 7:
Bijwadia further discloses the method of claim 1, wherein a next iteration of the plurality of iterations of the outer processing loop is initiated responsive to identification of the non-blank CU (Bijwadia: ¶[0041] discloses that the prediction network receives the prior non-blank symbols 257 output so far, and those prior non-blank symbols are what advance the decoder history/state for the next cycle of prediction. That is, identification of a non-blank symbol feeds forward into the prediction network as the updated history for the next decoding cycle, which corresponds to the next outer-loop iteration being initiated responsive to identification of a non-blank CU).
Regarding claim 8:
Bijwadia further discloses the method of claim 1, wherein the media item comprises a speech utterance, and wherein the representation of the media item comprises a transcription of the speech utterance (Bijwadia: the input is a speech utterance represented as audio frames, and the output is a transcription).
Regarding Claim 11:
Claim 11 has been analyzed with regards to claim 1 (see rejection above) and is rejected for the same reasons of anticipation used above.
It is noted that Bijwadia discloses one or more processors at ¶[0051].
Regarding Claim 12:
Claim 12 has been analyzed with regards to claim 2 (see rejection above) and is rejected for the same reasons of anticipation used above.
Regarding Claim 13:
Claim 13 has been analyzed with regards to claim 4 (see rejection above) and is rejected for the same reasons of anticipation used above.
Regarding Claim 14:
Claim 14 has been analyzed with regards to claim 5 (see rejection above) and is rejected for the same reasons of anticipation used above.
Regarding Claim 15:
Claim 15 has been analyzed with regards to claim 6 (see rejection above) and is rejected for the same reasons of anticipation used above.
Regarding Claim 16:
Claim 16 has been analyzed with regards to claim 7 (see rejection above) and is rejected for the same reasons of anticipation used above.
Regarding Claim 17:
Claim 17 has been analyzed with regards to claim 8 (see rejection above) and is rejected for the same reasons of anticipation used above.
Regarding Claim 19:
Bijwadia further discloses the system of claim 11, wherein the system is comprised in at least one of:
an in-vehicle infotainment system for an autonomous or semi-autonomous machine;
a system for performing one or more simulation operations;
a system for performing one or more digital twin operations;
a system for performing light transport simulation;
a system for performing collaborative content creation for 3D assets;
a system for performing one or more deep learning operations;
a system implemented using an edge device;
a system for generating or presenting at least one of virtual reality content, mixed reality content, or augmented reality content;
a system implemented using a robot;
a system for performing one or more conversational AI operations;
a system implementing one or more large language models (LLMs);
a system implementing one or more language models;
a system implementing one or more vision language models (VLMs);
a system implementing one or more multi-modal language models;
a system for performing one or more generative AI operations;
a system for generating synthetic data;
a system incorporating one or more virtual machines (VMs);
a system implemented at least partially in a data center; or
a system implemented at least partially using cloud computing resources (Bijwadia: ¶[0026] discloses cloud computing implementations and, more generally, ¶[0056] discloses that the system may be implemented in a variety of applications and is not limited to any one form of application; any system capable of storing software for a speech recognition system falls within this interpretation).
Claim Rejections - 35 USC § 103
4. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
5. Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Bijwadia (US 2024/0029719) in view of Zhao (US 2021/0312905).
Regarding Claim 9:
Bijwadia discloses the method of claim 1, but does not disclose further comprising: setting, prior to a first iteration of the plurality of iterations of the outer processing loop, the identified non-blank CU to a default beginning-of-sequence (BOS) CU. However, Zhao discloses setting, prior to a first iteration of the plurality of iterations of the outer processing loop, the identified non-blank CU to a default beginning-of-sequence (BOS) CU (Zhao: ¶[0030] discloses initializing the prediction side of the RNN-T with a prior output token before decoding proceeds and expressly states that the prediction network 110 receives a previous output token as input. ¶[0031] further explains that the prediction network converts the previous non-blank output token to a high-level representation and that predicted symbols are fed back to the prediction network. Altogether, this teaches initializing decoding by supplying the prediction network 110 with a previous-output-token input prior to iterative decoding; because decoding proceeds frame by frame from the first frame and non-blank outputs update the prediction network (¶[0034]), the prior-output-token input to the prediction network before the first iteration corresponds to a default beginning-of-sequence content unit).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Bijwadia to set the non-blank CU to a default beginning-of-sequence CU, as taught by Zhao. Both references are from the same field of endeavor, speech recognition for converting spoken language into text on computing devices; e.g., both disclose computing systems for end-to-end speech recognition that make predictions on the proper output. The suggestion/motivation for doing so is that "the pretraining of the RNN-T also provide technical benefits of at least improved computing processing and memory usage by facilitating a more efficient and less memory intensive means of training RNN-T," as disclosed in ¶[0029] of Zhao.
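To illustrate the initialization mapped from Zhao, the following is a minimal, hypothetical sketch; the names `predictor` and `BOS` are assumptions of this example, not identifiers from Zhao or Bijwadia.

```python
# Hypothetical sketch: before the first outer-loop iteration, the
# "identified non-blank CU" is set to a default beginning-of-sequence (BOS)
# CU so the predictor has a well-defined input on the very first step.
BOS = 1  # default beginning-of-sequence content unit (index is arbitrary)

def start_decoding(frames, predictor):
    last_non_blank_cu = BOS                       # set prior to iteration 1
    state = predictor.init_state(last_non_blank_cu)
    return state  # the outer/inner loops of the claim 1 sketch then proceed
```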
6. Claims 10 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Bijwadia in view of Zhao and further in view of Cheng (US 2023/0317083).
Regarding Claim 10:
Bijwadia and Zhao disclose the method of claim 1, but do not disclose wherein the plurality of iterations of the outer processing loop to identify the plurality of CUs of the media item is performed in parallel to a second plurality of iterations of the outer processing loop performed to identify a second plurality of CUs of a second media item, and wherein the plurality of iterations and the second plurality of iterations comprise an equal number of calls to the first NN to identify an equal number of CUs.
However, Cheng discloses this limitation (Cheng: ¶[0036]-[0038] and ¶[0043]-[0048] disclose processing first and second speech audio in parallel using corresponding thread pools and temporally ordered probability sets for respective speech segments. ¶[0043]-[0045] and ¶[0053] further disclose storing multiple sets of probabilities derived from a single data segment in a data buffer and distributing these ordered sets of data to threads for candidate-word derivation and next-word selection in equal packets. Bijwadia already discloses the speech recognition framework that processes audio frames and generates recognition hypotheses using encoder/prediction/joint network components. Applying both teachings, one can incorporate Cheng's known parallel-processing technique into Bijwadia's speech recognition processing so that multiple media items or utterances can be decoded concurrently, because Cheng expressly teaches using parallel execution for separate speech audio and distributing temporally structured, equal-length data to derive word transcripts more efficiently).
Bijwadia, Zhao, and Cheng are combinable because they are all from the same field of endeavor, automated speech-to-text recognition; e.g., all disclose computing systems for end-to-end speech recognition that make predictions on the proper output. Bijwadia already discloses a speech recognition model with content units, a state of the media item(s), and multiple loops of iteration. However, Bijwadia does not disclose processing in parallel, equal amounts of content units, or an equal number of calls made in parallel to the neural network(s). It would have been obvious to one of ordinary skill in the art to modify Bijwadia and Zhao such that the plurality of iterations of the outer processing loop to identify the plurality of CUs of the media item is performed in parallel to a second plurality of iterations of the outer processing loop performed to identify a second plurality of CUs of a second media item, and such that the plurality of iterations and the second plurality of iterations comprise an equal number of calls to the first NN to identify an equal number of CUs. Cheng teaches a known solution: process separate speech inputs in parallel using thread pools, queues, and distributed, temporally organized speech probability data. This idea can be applied to the content units and the number of neural network calls in Bijwadia, as illustrated in the sketch below. The suggestion/motivation for doing so is disclosed by Cheng in ¶[0009]: "the preprocessing used to divide streamed speech audio and/or lengthy recorded speech audio into segments has seen comparatively little improvement." In other words, dividing streamed input efficiently can help reduce latency, which has been a known problem within this field of endeavor.
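As a rough illustration only (not the claimed method or Cheng's actual implementation; `decode_fn` is a hypothetical stand-in for the nested-loop decoding sketched under claim 1), two media items can be decoded concurrently with a thread pool:

```python
# Hypothetical sketch: decoding two media items concurrently using a thread
# pool, in the spirit of Cheng's parallel processing of separate speech audio.
from concurrent.futures import ThreadPoolExecutor

def decode_two_in_parallel(frames_a, frames_b, decode_fn):
    """Run decode_fn on both frame sequences in parallel threads."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(decode_fn, frames_a)
        future_b = pool.submit(decode_fn, frames_b)
        return future_a.result(), future_b.result()
```

Equalizing the number of first-NN calls across the two streams is what makes batched execution possible; a sketch of that appears under claim 20 below.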
Regarding Claim 18:
Claim 18 has been analyzed with regards to claim 9 (see rejection above) and is rejected for the same reasons of obviousness set forth above.
7. Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Bijwadia in view of Cheng.
Regarding Claim 20:
Bijwadia discloses a processing device comprising a processing circuitry to:
identify N calls for batch execution of a predictor network of a transducer speech-to-text model, at least (i) N non-blank units of a first speech utterance and (ii) N non-blank units of a second speech utterance (Bijwadia: ¶[0007], ¶[0012], and ¶[0041]-[0045] disclose the underlying transducer speech-to-text model, including the prediction network of the decoder. Bijwadia discloses that the speech recognition model includes a decoder with a prediction network and a joint network, where the prediction network processes prior non-blank symbols and generates dense representations used by the joint network to predict output symbols. Bijwadia also discloses speech recognition for speech utterances and generation of non-blank output symbols as part of the transcription process).
Bijwadia does not explicitly disclose identifying in parallel; however, Cheng discloses identifying in parallel (Cheng: ¶[0044]-[0045] and ¶[0053] disclose receiving a request to perform speech-to-text conversion of a first speech data set, dividing speech audio into multiple equal data segments, using an acoustic language model to derive words for a transcript, and doing so with multiple threads of a thread pool).
Bijwadia and Cheng are combinable because they are from the same field of endeavor, automated speech-to-text recognition; e.g., both disclose computing systems for end-to-end speech recognition that make predictions on the proper output. Bijwadia already discloses a speech recognition model with content units, a state of the media item(s), and multiple loops of iteration. However, Bijwadia does not disclose identifying in parallel or an equal number of calls made in parallel to the neural network(s). It would have been obvious to one of ordinary skill in the art to modify Bijwadia to identify, in parallel, the N calls for batch execution of the predictor network for the non-blank units of the first and second speech utterances, as recited in claim 20. Cheng teaches a known solution: process separate speech inputs in parallel using thread pools, queues, and distributed, temporally organized speech probability data; this idea can be applied to the predictor-network calls in Bijwadia, as illustrated in the sketch below. The suggestion/motivation for doing so is disclosed by Cheng in ¶[0009]: "the preprocessing used to divide streamed speech audio and/or lengthy recorded speech audio into segments has seen comparatively little improvement." In other words, dividing streamed input efficiently can help reduce latency, which has been a known problem within this field of endeavor.
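As a rough illustration (hypothetical names; not Bijwadia's or Cheng's code), batching the predictor across two utterances means each update step is one batched call, so both utterances see the same number N of predictor calls:

```python
# Hypothetical sketch: N batched predictor calls covering the non-blank
# units of two utterances at once (N calls total rather than 2N).
import numpy as np

def run_batched_predictor(units_a, units_b, predictor_step, state):
    """units_a and units_b each hold N non-blank units; one call per step."""
    assert len(units_a) == len(units_b)          # equal number of CUs: N each
    for cu_a, cu_b in zip(units_a, units_b):
        batch = np.array([cu_a, cu_b])           # both utterances in one batch
        state = predictor_step(state, batch)     # a single batched NN call
    return state
```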
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IAN SCOTT MCLEAN whose telephone number is (703)756-4599. The examiner can normally be reached "Monday - Friday 8:00-5:00 EST, off Every 2nd Friday".
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Hai Phan, can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/IAN SCOTT MCLEAN/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654