Last updated: May 29, 2026

Application No. 18/616,117

AUTOMATIC SPEECH RECOGNITION WITH TARGET WORD SPOTTING

Non-Final OA §102 §112

Filed

Mar 25, 2024

Priority

Dec 27, 2023 — provisional 63/615,010

Examiner

HOQUE, NAFIZ E

Art Unit

2693

Tech Center

2600 — Communications

Assignee

Nvidia Corporation

OA Round

1 (Non-Final)

Interview Optional

Examiner has a relatively high allowance rate (75%) and a +23.4% interview lift. A written response may suffice.

Based on 613 resolved cases, 2023–2026

Examiner Intelligence

HOQUE, NAFIZ E

Grants 75% — above average

Career Allowance Rate

461 granted / 613 resolved

+13.2% vs TC avg

Strong +23% interview lift

Interview Lift: +23.4% (resolved cases with an interview vs. without)

Typical timeline

3y 1m

Avg Prosecution

22 currently pending

Career history

632

Total Applications

across all art units

Statute-Specific Performance

§101: 4.2% (-35.8% vs TC avg)

§103: 70.9% (+30.9% vs TC avg)

§102: 14.8% (-25.2% vs TC avg)

§112: 5.3% (-34.7% vs TC avg)

Based on career data from 613 resolved cases

Office Action

§102 §112

DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claim 19 is rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the enablement requirement. The claim contains subject matter which was not described in the specification in such a way as to enable one skilled in the art to which it pertains, or with which it is most nearly connected, to make and/or use the invention.
Claim 14 recites “[a] system comprising: one or more processor to... audio data, ...spoken words...” and claim 19 recites “[the] system of claim 14, wherein the system is comprised in at least one of:” and then proceeds to list 17 different systems. However, the specification does not describe how the speech system is used in or applied to each of the different contexts, for example digital twin operations, light transport simulation, and creation of 3D assets. It is not clear how a speech system would be used in such systems.
Furthermore, the claim also lacks written description support for use in more than one system. Because claim 19 recites “at least one of”, the speech system could be used in an in-vehicle infotainment system, in digital twin operations, and on a robot simultaneously. The specification does not disclose how the speech system exists in multiple contexts simultaneously or how it would work across multiple complex systems. It may be possible for some combinations, but it is not clear how it is possible for all combinations. Appropriate correction is required.
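To put the "all combinations" concern in numbers, the short calculation below counts the combinations covered if "at least one of" the 17 listed systems is read as "any non-empty subset of the list". That reading is an assumption made here for illustration only, not a claim-construction ruling, and the variable names are hypothetical.

```python
# Count the combinations covered if "at least one of" 17 listed systems
# is read as "any non-empty subset" (an assumed reading, for illustration).
from math import comb

NUM_LISTED_SYSTEMS = 17  # number of systems the Office action says claim 19 lists

# Sum of C(17, k) for k = 1..17, i.e. every subset except the empty one.
non_empty_subsets = sum(
    comb(NUM_LISTED_SYSTEMS, k) for k in range(1, NUM_LISTED_SYSTEMS + 1)
)

# The closed form 2^n - 1 gives the same quantity.
assert non_empty_subsets == 2**NUM_LISTED_SYSTEMS - 1

print(non_empty_subsets)  # 131071
```

Under that reading, only 17 of the 131,071 combinations involve a single system; the rest involve two or more systems at once, which is the situation the rejection questions.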

The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 19 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Similarly, claim 19 recites “[the] system of claim 14, wherein the system is comprised in at least one of:” and then proceeds to list 17 different systems. It is not clear how the system would be simultaneously comprised in more than one, or all, of these seemingly incompatible environments. As an example, how would the speech system be used in an in-vehicle infotainment system, in digital twin operations, and on a robot simultaneously? A person of ordinary skill in the art would struggle to determine the metes and bounds of the claim. Furthermore, some of the listed systems carry a further “at least one of” limitation, which adds complexity, for example “at least one of virtual reality content, mixed reality content, or augmented reality content”. It is unclear how the speech system can be comprised in a system for all three and, in addition, in other systems such as one “performing light transport simulation”. Appropriate correction is required.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-3, 5, 7-11, 14-16, and 18-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Prabhavalkar et al. (US Pub 2020/0357387).
Regarding claim 1, Prabhavalkar discloses a method comprising:
applying an automatic speech recognition (ASR) model to audio data to generate an ASR output representative of a likelihood that the audio data comprises one or more spoken speech units (SUs) (see abstract);
generating, using the ASR output, a first score characterizing a likelihood that the audio data comprises a first word included in a dictionary (see abstract, para 0011 – first encoder/attention generates scores for vocabulary words; para 0036 - audio attention module 218);
generating, using the ASR output, a second score characterizing a likelihood that the audio data comprises a second word, the second word including a word from a plurality of target words that are identified based at least on a context of the audio data (see abstract, para 0011 – “set of bias phrases corresponding to a context of the utterance” and “the bias encoder is configured to receive data indicating the obtained set of bias phrases”; para 0036 -  bias attention module 228); and
predicting, using the first score and the second score, a spoken word associated with the audio data (see para 0011 – “a decoder configured to determine likelihoods of sequences of speech elements based on output of the first attention module and output of the bias attention module”; also see fig. 1, decoder 240).
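The four steps recited in claim 1 can be pictured with a toy sketch. It is neither the applicant's implementation nor Prabhavalkar's attention-based model; it assumes letters stand in for speech units (SUs), exactly one SU per time interval, made-up likelihoods, and hypothetical function names, dictionary entries, and target words.

```python
import math

# Toy ASR output: per time interval, a likelihood for each speech unit (SU).
# Letters stand in for SUs; all numbers are made up for illustration.
ASR_OUTPUT = [
    {"n": 0.6, "m": 0.4},
    {"e": 0.7, "i": 0.3},
    {"m": 0.6, "n": 0.4},
    {"o": 0.9, "a": 0.1},
]

def word_log_score(asr_output, word):
    """Log-likelihood of `word`, assuming exactly one SU per time interval."""
    if len(word) != len(asr_output):
        return -math.inf
    total = 0.0
    for frame, su in zip(asr_output, word):
        p = frame.get(su, 0.0)
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
    return total

def best_word(asr_output, words):
    """Return (word, score) for the highest-scoring word, or (None, -inf)."""
    scored = [(w, word_log_score(asr_output, w)) for w in words]
    return max(scored, key=lambda pair: pair[1], default=(None, -math.inf))

def predict_spoken_word(asr_output, dictionary, target_words):
    # First score: best word from the general dictionary.
    first_word, first_score = best_word(asr_output, dictionary)
    # Second score: best word from the context-specific target list.
    second_word, second_score = best_word(asr_output, target_words)
    # Prediction uses both scores; here, simply the higher one wins.
    return second_word if second_score > first_score else first_word

DICTIONARY = ["nino", "memo"]   # hypothetical dictionary entries
TARGET_WORDS = ["nemo"]         # hypothetical context-identified target word

print(predict_spoken_word(ASR_OUTPUT, DICTIONARY, TARGET_WORDS))  # nemo
```

With an empty target list the same call falls back to the best dictionary word ("memo" for this data), which shows the second score changing the prediction only when a target word fits the audio better.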
Regarding claim 2, Prabhavalkar discloses wherein the ASR output comprises, for one or more time intervals of the audio data:
a plurality of SU likelihoods, wherein an individual SU likelihood of the plurality of SU likelihoods characterizes a probability that a respective SU of a plurality of SUs was spoken during a respective time interval (para 0014 – “the first encoder includes a stacked, recurrent neural network (RNN) and/or the decoder includes a stacked, unidirectional RNN configured to compute a probability of a sequence of output tokens”; an RNN produces per-frame probabilities).
Regarding claim 3, Prabhavalkar discloses wherein the generating the second score comprises:
generating a plurality of hypotheses, wherein an individual hypothesis of the plurality of hypotheses:
associates a portion of the audio data with a respective hypothesized word of the plurality of target words (para 0015 – hypotheses for each bias phrase), and
assigns, based at least on the ASR output, a score to the respective hypothesized word, wherein the assigned score characterizes a likelihood that the portion of the audio data comprises the respective hypothesized word (para 0018 – “the bias encoder is configured to encode a corresponding bias context vector for each bias phrase in the set of bias phrases”); and
identifying, using the assigned scores, the second word as a most likely word represented in the portion of the audio data (para 0017 – “decoder configured to determine likelihoods of sequences of speech elements based on output of the first attention module and output of the bias attention module”).
Regarding claim 5, Prabhavalkar discloses wherein the generating the second score comprises performing a plurality of iterations, wherein an individual iteration of the plurality of iterations is associated with a respective time interval of a plurality of time intervals of the audio data (para 0015), and wherein the individual iteration comprises:
for a respective time interval of the plurality of time intervals, selecting, using the ASR output, from at least:
an SU of the second word, the SU being spoken during the respective time interval, or
a state of no SU being spoken during the respective time interval (para 0018 – “additional bias context vector represents an option to not bias the likelihoods of sequences of speech elements determined by the decoder toward any of the bias phrases”).
Regarding claim 7, Prabhavalkar discloses wherein the state of no SU being spoken is selected responsive to no SU of the second word having a likelihood of being spoken above a threshold likelihood (para 0018 – “additional bias context vector represents an option to not bias the likelihoods of sequences of speech elements determined by the decoder toward any of the bias phrases”).
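The per-interval selection recited in claims 5 and 7 can be sketched as follows. This is an illustrative toy under stated assumptions (letters as SUs, a fixed threshold of 0.5, a hypothetical `NO_SU` label and function name), not the applicant's method and not the "no-bias" context vector the Office action cites from Prabhavalkar.

```python
NO_SU = "<no-su>"  # hypothetical label for the "no SU being spoken" state

def select_per_interval(asr_output, target_word, threshold=0.5):
    """For each time interval, pick an SU of the target word or the no-SU state."""
    selections = []
    target_sus = sorted(set(target_word))  # sorted for deterministic tie-breaking
    for frame in asr_output:
        # Most likely SU of the target word in this interval.
        best_su = max(target_sus, key=lambda su: frame.get(su, 0.0))
        if frame.get(best_su, 0.0) > threshold:
            selections.append(best_su)
        else:
            # No SU of the target word clears the threshold.
            selections.append(NO_SU)
    return selections

frames = [
    {"n": 0.8, "x": 0.2},
    {"x": 0.9, "e": 0.1},
    {"e": 0.7, "m": 0.3},
]
print(select_per_interval(frames, "nemo"))  # ['n', '<no-su>', 'e']
```

Each loop pass corresponds to one iteration tied to one time interval, and the `else` branch is the threshold-based fallback described in claim 7.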
Regarding claim 8, Prabhavalkar discloses wherein the selecting from the at least the SU of the second word or the state of no SU comprises:
identifying a candidate SU (para 0012-0013);
obtaining, using the ASR output, a likelihood of the candidate SU being spoken during the respective time interval (para 0015); and
responsive to determining that the candidate SU matches the SU of the second word, enhancing the obtained likelihood of the candidate SU (para 0017 – “a decoder configured to determine likelihoods of sequences of speech elements based on output of the first attention module and output of the bias attention module”).
Regarding claim 9, Prabhavalkar discloses wherein the generating the second score comprises:
performing a plurality of iterations identified by a context graph associated with the plurality of target words, wherein the context graph comprises one or more root nodes associated with a starting SU of one or more target words of the plurality of target words (para 0016 - “each bias prefix in the list of bias prefixes represents an initial portion of one or more of the bias phrases in the set of bias phrases” – prefix represents starting portion such as root node).
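One common way to picture a context graph whose root nodes correspond to the starting SUs of the target words is a prefix tree. The sketch below assumes that structure, uses letters as SUs, and uses hypothetical target words and an `"<end>"` marker; the Office action does not say that either the application or Prabhavalkar uses this exact data structure.

```python
def build_context_graph(target_words):
    """Build a prefix tree; top-level keys are the root nodes (starting SUs)."""
    roots = {}
    for word in target_words:
        node = roots
        for su in word:
            # Reuse an existing child for a shared prefix, else create one.
            node = node.setdefault(su, {})
        node["<end>"] = word  # marks a complete target word
    return roots

graph = build_context_graph(["nemo", "nvidia", "riva"])
print(sorted(graph))       # ['n', 'r']
print(sorted(graph["n"]))  # ['e', 'v']
```

Target words that share a starting SU share a root node, which is the sense in which the Office action maps a bias prefix ("an initial portion of one or more of the bias phrases") onto a root node.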
Regarding claim 10, Prabhavalkar discloses wherein the ASR model is further to generate an additional ASR output comprising a plurality of recognized words in the audio data (for example para 0076).
Regarding claim 11, Prabhavalkar discloses further comprising:
replacing one or more words of the plurality of recognized words with the predicted spoken word (para 0017 – bias output can replace the standard recognition).
Regarding claims 14 and 20, see rejection of claim 1.
Regarding claim 15, see rejection of claim 3.
Regarding claim 16, see rejection of claim 5.
Regarding claim 18, see rejection of claim 8.
Regarding claim 19, Prabhavalkar discloses wherein the system is comprised in at least one of:
an in-vehicle infotainment system for an autonomous or semi-autonomous machine;
a system for performing one or more simulation operations;
a system for performing one or more digital twin operations;
a system for performing light transport simulation;
a system for performing collaborative content creation for 3D assets;
a system for performing one or more deep learning operations (para 0014 - RNN);
a system implemented using an edge device (para 0034 – “the speech recognition model 200 resides on a user device 106 associated with a user 102”);
a system for generating or presenting at least one of virtual reality content, mixed reality content, or augmented reality content;
a system implemented using a robot;
a system for performing one or more conversational AI operations (para 0009);
a system implementing one or more large language models (LLMs);
a system implementing one or more language models;
a system for performing one or more generative AI operations;
a system for generating synthetic data;
a system incorporating one or more virtual machines (VMs);
a system implemented at least partially in a data center (para 0034); or
a system implemented at least partially using cloud computing resources (para 0034).

Allowable Subject Matter
Claims 4, 6, 12-13, and 17 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.




Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NAFIZ E HOQUE whose telephone number is (571)270-1811. The examiner can normally be reached M-F 8-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ahmad Matar can be reached at (571)272-7488. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NAFIZ E HOQUE/           Primary Examiner, Art Unit 2693

Prosecution Timeline

Mar 25, 2024

Application Filed

Dec 27, 2025

Non-Final Rejection (signed) — §102, §112

Jan 30, 2026

Non-Final Rejection mailed — §102, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/484,512

Patent 12639866

PIPELINE FOR GENERATING EDITABLE GRAPHIC DESIGNS FROM NATURAL LANGUAGE PROMPTS

2y 7m to grant Granted May 26, 2026

18/480,039

Patent 12620393

TECHNOLOGIES FOR LEVERAGING MACHINE LEARNING TO PREDICT EMPATHY FOR IMPROVED CONTACT CENTER INTERACTIONS

2y 7m to grant Granted May 05, 2026

18/649,354

Patent 12619830

OPTIMIZING PERFORMANCE OF CONVERSATIONAL INTERFACE APPLICATIONS USING EXAMPLE FORGETTING

2y 0m to grant Granted May 05, 2026

18/695,752

Patent 12621386

INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD, AND PROGRAM

2y 1m to grant Granted May 05, 2026

18/384,428

Patent 12614041

NONVERBAL MESSAGE EXTRACTION AND GENERATION

2y 6m to grant Granted Apr 28, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2

Expected OA Rounds

75%

Grant Probability

99%

With Interview (+23.4%)

3y 1m (~11m remaining)

Median Time to Grant

Low

PTA Risk

Based on 613 resolved cases by this examiner. Grant probability derived from career allowance rate.
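The projections above can be sanity-checked with simple arithmetic. The sketch below assumes the interview lift is added in percentage points to the unrounded career allowance rate, and that time remaining is counted in whole months from the "Last updated" date; the page does not state either formula, so both are assumptions, and the variable names are hypothetical.

```python
# Figures taken from the page.
granted, resolved = 461, 613
interview_lift_pts = 23.4          # assumed to be additive percentage points

allowance_rate_pct = 100 * granted / resolved
with_interview_pct = min(100.0, allowance_rate_pct + interview_lift_pts)

# Month-granularity estimate of time remaining (assumption: days ignored).
filed = (2024, 3)                  # application filed Mar 2024
as_of = (2026, 5)                  # page last updated May 2026
median_months = 3 * 12 + 1         # "3y 1m" median time to grant

elapsed_months = (as_of[0] - filed[0]) * 12 + (as_of[1] - filed[1])
remaining_months = median_months - elapsed_months

print(round(allowance_rate_pct))   # 75
print(round(with_interview_pct))   # 99
print(remaining_months)            # 11
```

Under these assumptions the results line up with the 75% grant probability, the 99% with-interview figure, and the "~11m remaining" estimate shown on the page.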