Prosecution Insights
Last updated: April 19, 2026
Application No. 18/228,349

AUTOMATIC SPEECH RECOGNITION FOR INTERACTIVE VOICE RESPONSE SYSTEMS

Status: Non-Final OA (§103)
Filed: Jul 31, 2023
Examiner: ROBERTS, SHAUN A
Art Unit: 2655
Tech Center: 2600 — Communications
Assignee: Zoom Video Communications, Inc.
OA Round: 3 (Non-Final)
Grant Probability: 76% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 10m
With Interview: 86%

Examiner Intelligence

Career Allow Rate: 76% (491 granted / 647 resolved; +13.9% vs TC avg, above average)
Interview Lift: +10.3% (moderate; among resolved cases with an interview)
Avg Prosecution: 2y 10m (31 applications currently pending)
Total Applications: 678 (across all art units)
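The headline figure appears to be the plain granted-to-resolved ratio from the counts above; a minimal sketch of that arithmetic (an assumption about how the tool computes it, not a documented formula):

```python
# Sketch: career allow rate as granted / resolved, using the counts shown
# above. This is an assumption about how the 76% headline is computed.
granted = 491
resolved = 647

allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")  # 75.9%, displayed as 76%
```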

Statute-Specific Performance

Statute   Rate    vs TC Avg
§101       7.6%    -32.4%
§103      49.2%     +9.2%
§102      29.5%    -10.5%
§112       3.5%    -36.5%

Tech Center average is an estimate; based on career data from 647 resolved cases.
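The "vs TC avg" deltas read as percentage-point differences against the Tech Center baseline; a small sketch that back-computes the implied baseline under that assumption:

```python
# Sketch: recover the implied Tech Center baseline from each displayed
# (examiner rate, delta vs TC avg) pair, assuming the delta is a simple
# percentage-point difference. Figures are taken from the table above.
stats = {
    "§101": (7.6, -32.4),
    "§103": (49.2, +9.2),
    "§102": (29.5, -10.5),
    "§112": (3.5, -36.5),
}
for statute, (rate, delta) in stats.items():
    tc_avg = rate - delta
    print(f"{statute}: examiner {rate}% vs implied TC avg {tc_avg:.1f}%")
```

Under this assumption every pair implies the same 40.0% baseline, consistent with a single Tech Center average being drawn across all four statutes.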

Office Action

§103
DETAILED ACTION

Continued Examination Under 37 CFR 1.114

1. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 1/8/2026 has been entered.

Notice of Pre-AIA or AIA Status

2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

3. Claims 4, 11, and 18 have been cancelled.

Response to Arguments

4. Applicant's arguments filed have been fully considered but are moot based on the new grounds of rejection.

Claim Rejections - 35 USC § 103

5. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

6. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

7.
Claims 1, 5, 7-8, 12, 14-15, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Sun et al. ("FawAI ASR System for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge") in view of Hofer et al. (US 2016/0093292), in further view of Nagao (US 2015/0179177).

Regarding claim 1, Sun teaches:

A method comprising: receiving an audio input from a user (abstract; 2.1: contents of the speech…user's command; the commands involve navigation to a point of interest, making a phone call, controlling the air conditioner, and playing music; contact names);

determining, using a first trained model, a plurality of candidate commands (3.4: during decoding, n-best candidates are achieved from the CTC WFST beam search decoder), comprising; wherein the first trained model comprises a weighted finite state transducer ("WFST") (abstract; 3.4: WFST);

determining, using a second trained model, a recognized command from the plurality of candidate commands (3.4: the n-best candidates are rescored by the Attention Rescoring module to find the best candidate); and

identifying a corresponding valid command in a set of valid commands based on the recognized command (2.1: contents of the speech…user's command; contact names; abstract: automatic speech recognition systems; connectionist temporal classification/attention-based encoder-decoder architecture…based on a weighted finite state transducer; 4 Experiments: performance).
Sun does not specifically teach, but Hofer (US 2016/0093292) teaches, executing each available path within the first trained model [substantially in parallel] (abstract: method in a computing device for decoding a weighted finite state transducer for automatic speech recognition; 0001-0002: weighted score assigned to each transition/arc; fig. 4: state transitions; fig. 6; 0029; 0035). Hofer further explains at 0004:

"An HMM is a FSM with state transition probabilities and emissive (or observation) probabilities. A state transition probability of one state to another state represents the probability of transition from the one state to the other. An emissive probability for an observation is the probability that a state will 'emit,' or generate, a particular observation. These probability values may be discovered for a particular system by a training process that uses training data. This training data includes observations along with the known states that generated these observations. After training, a decoding process, using a set of new observations, may traverse through the HMM to discover the most likely set of states that generated these observations. For example, after an HMM modeling the acoustic features to phones of a language has been trained, a decoding process may be used on a new set of audio (i.e., spoken words/sounds) to discover the most likely states and transitions that generated these observations. These states and transitions are associated with various phones in the language. If using a WFST, the input labels of this HMM WFST would be the acoustic features (the observations), the output labels would be the phones, and the weights of each transition would be the state transition probabilities."
Hofer also teaches obtaining scores from the available paths (0001-0002: weighted score assigned to each transition/arc; fig. 4: state transition probabilities; fig. 6); comparing the scores to a predetermined threshold (0010: threshold; 0029; 39); and determining the plurality of candidate commands based on the scores satisfying the predetermined threshold (abstract: exceeds a score threshold; 0035: "After processing an utterance or section of speech, the token that remains and has the best score is the token that represents the path through the WFST with the most likely hypothesis of what was spoken.").

It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Hofer to enable the performance of the WFST (of Sun) for speech recognition. Sun already teaches the use of the WFST, and one would look to Hofer to implement it in this particular fashion, presenting a reasonable expectation of success of completing the recognition with the WFST, and for the benefit that computational complexity is further reduced, reducing decoding time and power consumption (Hofer 0030).

Hofer does not explicitly teach, but Nagao teaches, executing each available path within the first trained model substantially in parallel (0041: "In the process of searching a WFST, since a plurality of paths is searched in parallel, a plurality of tokens is managed at the same time. Moreover, a token holds the accumulation score of the path. Furthermore, a token holds a string of output symbols assigned in the paths which have been passed."). It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Nagao for an improved system allowing for quicker response times of the WFST, and ultimately of the recognition overall.

Regarding claim 5, Sun teaches the method of claim 1, wherein the second trained model comprises an attention-based decoder (abstract: attention-based encoder-decoder architecture; 3.4: Decoding; Attention Rescoring module).
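The claimed flow the rejection assembles from Sun, Hofer, and Nagao (score candidate paths, prune against a predetermined threshold, rescore the survivors) can be sketched as a toy stand-in; the commands, scores, and trivial "models" below are invented for illustration and come from none of the cited references:

```python
# Toy stand-in for the claimed pipeline: a "first model" scores candidate
# commands (standing in for parallel WFST path scores), a predetermined
# threshold prunes them to an n-best list, and a "second model" rescores
# the survivors. All commands, scores, and models are invented.

def first_model(audio: str) -> dict[str, float]:
    # Stand-in for scoring every available path in parallel.
    return {"call mom": 0.91, "call tom": 0.74, "play music": 0.12}

def second_model(candidates: list[str]) -> str:
    # Stand-in for attention-based rescoring of the n-best list.
    rescore = {"call mom": 0.95, "call tom": 0.60, "play music": 0.30}
    return max(candidates, key=lambda c: rescore.get(c, 0.0))

THRESHOLD = 0.5
scores = first_model("<audio frames>")
candidates = [cmd for cmd, s in scores.items() if s >= THRESHOLD]
recognized = second_model(candidates)
print(candidates, "->", recognized)  # ['call mom', 'call tom'] -> call mom
```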
Regarding claim 7, Sun teaches the method of claim 1, further comprising executing the corresponding valid command (abstract; 2.1: contents of the speech…user's command; the commands involve navigation to a point of interest, making a phone call, controlling the air conditioner, and playing music; contact names).

Regarding claim 8, Sun, Hofer, and Nagao teach a system comprising: a non-transitory computer-readable medium; and one or more processors communicatively coupled to the non-transitory computer-readable medium, the one or more processors configured to execute instructions stored in the non-transitory computer-readable medium to: receive an audio input; execute, by a first trained model, each available path within the first trained model substantially in parallel, wherein the first trained model comprises a weighted finite state transducer ("WFST"); obtain scores from the available paths; compare the scores to a predetermined threshold; determine a plurality of candidate commands based on the scores satisfying the predetermined threshold; determine, using a second trained model, a recognized command from the plurality of candidate commands; and identify a corresponding valid command in a set of valid commands based on the recognized command. Claim 8 is rejected for similar rationale and reasoning as claim 1. While the teachings of Sun are inherently performed using computer-based processing (Sun 3.3: train the model on a GPU server), to advance prosecution, Hofer more explicitly teaches a system comprising a non-transitory computer-readable medium and one or more processors communicatively coupled to the non-transitory computer-readable medium, the one or more processors configured to execute instructions stored in the non-transitory computer-readable medium (figs. 5-6; 0075; 0082-0083).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate the system, presenting a reasonable expectation of success of allowing the models of Sun to be utilized for the execution of the speech recognition.

Claim 12 recites limitations similar to claim 5 and is rejected for similar rationale and reasoning. Claim 14 recites limitations similar to claim 7 and is rejected for similar rationale and reasoning.

Regarding claim 15, Sun, Hofer, and Nagao teach a non-transitory computer-readable medium comprising processor-executable instructions configured to cause one or more processors to: receive an audio input; execute, by a first trained model, each available path within the first trained model substantially in parallel, wherein the first trained model comprises a weighted finite state transducer ("WFST"); obtain scores from the available paths; compare the scores to a predetermined threshold; determine a plurality of candidate commands based on the scores satisfying the predetermined threshold; determine, using a second trained model, a recognized command from the plurality of candidate commands; and identify a corresponding valid command in a set of valid commands based on the recognized command. Claim 15 is rejected for similar rationale and reasoning as claim 8. Claim 19 recites limitations similar to claim 5 and is rejected for similar rationale and reasoning.

8. Claims 6, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Sun in view of Hofer et al. (US 2016/0093292), in further view of Nagao, in further view of Mukherjee et al. (US 2021/0241354).
Regarding claim 6, Sun does not specifically teach, but Mukherjee teaches, the method of claim 1, wherein identifying the corresponding valid command comprises performing fuzzy matching using the recognized command and the set of valid commands (0050: "the fuzzy matching algorithm can include matching the recipe descriptor (e.g., recipe name) from the voice command against recipe names (e.g., recipe titles) in a database."). It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate fuzzy matching to obtain the most similar (valid) command for improved speech recognition and several technological improvements (Mukherjee 0097). Claim 13 recites limitations similar to claim 6 and is rejected for similar rationale and reasoning. Claim 20 recites limitations similar to claim 6 and is rejected for similar rationale and reasoning.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS, whose telephone number is (571) 270-7541. The examiner can normally be reached Monday-Friday, 9-5 EST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655

Prosecution Timeline

Jul 31, 2023: Application Filed
May 23, 2025: Non-Final Rejection (§103)
Aug 28, 2025: Response Filed
Sep 05, 2025: Final Rejection (§103)
Jan 08, 2026: Request for Continued Examination
Jan 23, 2026: Response after Non-Final Action
Feb 03, 2026: Non-Final Rejection (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586599 (granted Mar 24, 2026; 2y 5m to grant)
AUDIO SIGNAL PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM WITH MACHINE LEARNING AND FOR MICROPHONE MUTE STATE FEATURES IN A MULTI PERSON VOICE CALL

Patent 12586568 (granted Mar 24, 2026; 2y 5m to grant)
SYNTHETICALLY GENERATING INNER SPEECH TRAINING DATA

Patent 12573376 (granted Mar 10, 2026; 2y 5m to grant)
Dynamic Language and Command Recognition

Patent 12562157 (granted Feb 24, 2026; 2y 5m to grant)
GENERATING TOPIC-SPECIFIC LANGUAGE MODELS

Patent 12555562 (granted Feb 17, 2026; 2y 5m to grant)
VOICE SYNTHESIS FROM DIFFUSION GENERATED SPECTROGRAMS FOR ACCESSIBILITY
Based on this examiner's 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 76%
With Interview: 86% (+10.3%)
Median Time to Grant: 2y 10m
PTA Risk: High

Based on 647 resolved cases by this examiner. Grant probability derived from career allow rate.
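The "With Interview" projection looks like the base grant probability plus the interview lift in percentage points; a minimal sketch of that assumption (not a documented formula of this tool):

```python
# Sketch: "with interview" probability as base grant probability plus the
# interview lift in percentage points. This combination is assumed.
base_grant_prob = 76.0   # career allow rate, in %
interview_lift = 10.3    # percentage points, among interviewed cases

with_interview = base_grant_prob + interview_lift
print(f"With interview: {with_interview:.0f}%")  # With interview: 86%
```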
