Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 12/29/2025 is being considered by the examiner.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/26/2025 has been entered.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 21, 28 and 35 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. "Proportion," as cited in the above claims, does not have support or description in the as-filed Applicant Specification. A proportion implies a comparison of two ratios, and no two ratios are defined in the specification. The closest support comes from the as-filed Specification: "¶[0057] In at least one embodiment, these character probabilities can be analyzed 408 using one or more speech criteria to calculate a start time and an end time for one or more speech segments contained within this audio signal. In at least one embodiment, this can be performed using an end of speech (EOS) detector that uses speech criteria including length of a sliding window, and a percentage of blank characters predicted within that sliding window, for purposes of determining start and end of speech, where window sizes and percentage thresholds can differ for start and end time calculations. In at least one embodiment, a decoder can be used to transform 410 these character probabilities between start and end times into transcripts for corresponding speech segments. In at least one embodiment, these transcripts may then be stored or provided to an intended recipient, such as a voice-controlled device that is enabled to act upon a command contained within a given speech segment." The claimed subject matter relies on a relative comparison of two ratios of non-speech vs. speech characters to make the end-of-speech determination, and not on a percentage of non-speech characters in the window as outlined in the specification.
Claims 26, 33 and 39 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. "Ratio," as cited in the above claims, does not have support or description in the as-filed Applicant Specification. A ratio implies a comparison of two quantities, and no such ratio is defined in the specification. The closest support comes from the as-filed Specification: "¶[0057] In at least one embodiment, these character probabilities can be analyzed 408 using one or more speech criteria to calculate a start time and an end time for one or more speech segments contained within this audio signal. In at least one embodiment, this can be performed using an end of speech (EOS) detector that uses speech criteria including length of a sliding window, and a percentage of blank characters predicted within that sliding window, for purposes of determining start and end of speech, where window sizes and percentage thresholds can differ for start and end time calculations." The claimed subject matter relies on a relative comparison of non-speech vs. speech characters to make the end-of-speech determination, and not on a percentage of non-speech characters in the window as outlined in the specification.
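For illustration only, the distinction at issue between the two criteria can be sketched in code. The symbols and values below are hypothetical and are not drawn from the record; '_' stands in for a predicted blank (non-speech) character:

```python
# Hypothetical illustration of the two criteria discussed above.
# '_' marks a blank (non-speech) character predicted at a time step;
# any other symbol marks a speech character.

def blank_percentage(window):
    """Specification-style criterion (cf. ¶[0057]): percentage of
    blank characters predicted within the sliding window."""
    return window.count('_') / len(window)

def speech_to_nonspeech_ratio(window):
    """Claim-style criterion: ratio of speech characters to
    non-speech characters within the sliding window."""
    blanks = window.count('_')
    speech = len(window) - blanks
    return speech / blanks if blanks else float('inf')

window = list("HE__L___")  # 3 speech characters, 5 blanks
print(blank_percentage(window))           # 0.625
print(speech_to_nonspeech_ratio(window))  # 0.6
```

The two functions operate on the same window but are expressed over different quantities, which is the gap the rejection identifies.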
Response to Remarks
This Office Action is in response to Arguments/Remarks filed on 12/26/2025.
In light of the amendments, the examiner withdraws the claim objections.
In light of the amendments, the examiner relies on new citations to reject the independent claims, rendering the arguments moot.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 21, 23, 26, 28, 30, 33, 35 and 39 are rejected under 35 U.S.C. 103 as being unpatentable over Sharifi (US 8843369 B1) in view of Li (US 20220238104 A1).
With respect to claims 21, 28 and 35, Sharifi teaches:
(claim 21) A processor, comprising circuitry to: (Sharifi ¶Col11ll49-54 Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer);
(claim 28) A method, comprising:
(claim 35) A system, comprising one or more processors to: (Sharifi ¶Col11ll49-54 Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer)
use the one or more neural networks to indicate speech or non-speech characters corresponding to time steps within the one or more audio signals based, at least in part, on the extracted frequency-domain features (Sharifi¶Col6ll40-60 For example, if each audio window is three hundred milliseconds and the general endpointer 225 receives data indicating that there are ten audio frames that do not correspond to speech, then the general endpointer 225 may determine that there is an ending point before the audio window and a beginning point at the end audio window. In this instance, the general endpointer 225 determined that there were three seconds of non-speech and compared that to a threshold that was three seconds or less. If the threshold were greater than three seconds, then the general endpointer 225 would not have added beginning points and ending points at the beginning and end of three seconds of non-speech, ¶Col2ll59-63 The acoustic features include mel-frequency cepstral coefficients, filterbank energies, or fast Fourier transform frames. A duration of the initial portion of the received audio data is a particular amount of time)
determine a value that indicates a proportion of the speech characters to the non-speech characters within a sliding window of a number of time steps of the one or more audio signals; (¶Col6ll40-60 For example, if each audio window is three hundred milliseconds and the general endpointer 225 receives data indicating that there are ten audio frames that do not correspond to speech [ten audio frames define the non-speech characters as a percentage of the three hundred millisecond window. This is mapped to value], then the general endpointer 225 may determine that there is an ending point before the audio window and a beginning point at the end audio window).
compare the value that indicates a proportion of the speech characters to the non-speech characters to a threshold value corresponding to an end of speech determined by the length of the sliding window (¶Col6ll40-60 The general endpointer 225 receives data indicating whether a particular audio window corresponds to speech or non-speech. When there are a particular number of audio frames of a window that do not correspond to speech, then the general endpointer 225 determines that there is a beginning point or an ending point where the window that do not correspond to speech stop or start. For example, if each audio window is three hundred milliseconds and the general endpointer 225 receives data indicating that there are ten audio frames that do not correspond to speech [ten audio frames define the non-speech characters as a percentage of the three hundred millisecond window], then the general endpointer 225 may determine that there is an ending point before the audio window and a beginning point at the end audio window. In this instance, the general endpointer 225 determined that there were three seconds of non-speech and compared that to a threshold that was three seconds or less. If the threshold were greater than three seconds, then the general endpointer 225 would not have added beginning points and ending points at the beginning and end of three seconds of non-speech); and
identify the end of speech within the one or more audio signals based, at least in part, on the comparison, [[wherein at least one of the non-speech characters corresponds to noise within the one or more audio signals]] (¶Col6ll4-60 For example, if each audio window is three hundred milliseconds and the general endpointer 225 receives data indicating that there are ten audio frames that do not correspond to speech [ten audio frames define the non-speech characters as a percentage of the three hundred millisecond window], then the general endpointer 225 may determine that there is an ending point [end of speech] before the audio window and a beginning point at the end audio window. In this instance, the general endpointer 225 determined that there were three seconds of non-speech and compared that to a threshold that was three seconds or less. If the threshold were greater than three seconds, then the general endpointer 225 would not have added beginning points and ending points at the beginning and end of three seconds of non-speech.)
Sharifi does not explicitly disclose; however, Li teaches wherein at least one of the non-speech characters corresponds to noise within the one or more audio signals (Li ¶[0068] In the above formula, the numerator is the weighted sum of the maximum probability parameters that the each frame in the to-be-processed audio belongs to the candidate characters, a weight of the maximum probability parameter corresponds to the blank character (i.e., the ineffective probability) is 0, and a weight of the non-blank character (i.e., the effective probability) corresponding to the maximum probability parameter is 1; and the denominator is the number of the maximum probability parameters corresponding to the non-blank characters. For example, in the case where the to-be-processed audio does not have an effective probability (i.e., the denominator is 0), the target audio is judged as noise).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the endpointer of Sharifi to include the non-speech characters of Li in order to increase the accuracy of speech recognition results (Li, [0040]).
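The cited endpointer arithmetic can be sketched as follows. This is a hypothetical reading of the quoted passage only, with the frame duration and threshold assumed from the example in Col. 6; a production endpointer would track consecutive non-speech frames rather than a running total:

```python
# Hypothetical sketch of the frame-count endpointer described above:
# each audio frame is labeled speech (True) or non-speech (False),
# and an ending point is declared once the accumulated non-speech
# duration meets or exceeds a threshold.

FRAME_MS = 300        # assumed frame duration from the quoted example
THRESHOLD_MS = 3000   # assumed end-of-speech threshold (three seconds)

def reaches_endpoint(frame_is_speech):
    non_speech_ms = FRAME_MS * sum(1 for s in frame_is_speech if not s)
    return non_speech_ms >= THRESHOLD_MS

print(reaches_endpoint([False] * 10))  # ten non-speech frames (3000 ms): True
print(reaches_endpoint([False] * 9))   # only 2700 ms of non-speech: False
```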
With respect to claims 23 and 30, Sharifi teaches wherein the sliding window is to be slid along the one or more audio signals, from beginning to end (Sharifi ¶Col6ll40-60 The general endpointer 225 receives data from the general speech activity detector 220 and identifies beginning points and ending points of speech in the received data. Examiner Note: See also Fig. 1, which shows the general endpointer at the beginning and end of all speech activity).
With respect to claim 26 Sharifi teaches wherein the one or more neural networks are to identify the end of the one or more audio signals further based, at least in part, on a ratio of the speech characters to non-speech characters within the one or more audio signals (¶Col6ll4-60 The general endpointer 225 receives data indicating whether a particular audio window corresponds to speech or non-speech. When there are a particular number of audio frames of a window that do not correspond to speech, then the general endpointer 225 determines that there is a beginning point or an ending point where the window that do not correspond to speech stop or start. For example, if each audio window is three hundred milliseconds and the general endpointer 225 receives data indicating that there are ten audio frames that do not correspond to speech [ten audio frames define the non-speech characters as a percentage of the three hundred millisecond window], then the general endpointer 225 may determine that there is an ending point before the audio window and a beginning point at the end audio window. In this instance, the general endpointer 225 determined that there were three seconds of non-speech and compared that to a threshold that was three seconds or less. If the threshold were greater than three seconds, then the general endpointer 225 would not have added beginning points and ending points at the beginning and end of three seconds of non-speech).
With respect to claims 33 and 39 Sharifi teaches further comprising identifying the end of the one or more audio signals further based, at least in part, on a ratio of the speech characters to non-speech characters within the one or more audio signals (¶Col6ll4-60 The general endpointer 225 receives data indicating whether a particular audio window corresponds to speech or non-speech. When there are a particular number of audio frames of a window that do not correspond to speech, then the general endpointer 225 determines that there is a beginning point or an ending point where the window that do not correspond to speech stop or start. For example, if each audio window is three hundred milliseconds and the general endpointer 225 receives data indicating that there are ten audio frames that do not correspond to speech [ten audio frames define the non-speech characters as a percentage of the three hundred millisecond window], then the general endpointer 225 may determine that there is an ending point before the audio window and a beginning point at the end audio window. In this instance, the general endpointer 225 determined that there were three seconds of non-speech and compared that to a threshold that was three seconds or less. If the threshold were greater than three seconds, then the general endpointer 225 would not have added beginning points and ending points at the beginning and end of three seconds of non-speech).
Claims 22, 24, 27, 29, 31, 34, 36, 37 and 40 are rejected under 35 U.S.C. 103 as being unpatentable over Sharifi and Li in further view of Zhou (US 20200005765 A1).
With respect to claims 22, 29 and 36, Sharifi and Li do not explicitly disclose; however, Zhou teaches wherein the one or more circuits are further to analyze probabilities of each of the speech characters using a greedy decoder to generate a string of characters of individual time steps (Zhou ¶[0047] For the connectionist temporal classification (CTC), consider an entire neural network to be simply a function that takes in some input sequence of length T and outputs some output sequence y also of length T [generated string of characters], ¶[0048] Connectionist temporal classification (CTC) 172 utilizes an objective function that allows RNN 352 to be trained for sequence transcription tasks without requiring any prior alignment between the input and target sequences. The output layer contains a single unit for each of the transcription labels, such as characters or phonemes plus an extra unit referred to as the "blank" which corresponds to a null emission. Given a length T input sequence X, the output vectors yt are normalized with the softmax function, then interpreted as the probability of emitting the label or blank with index k at time t, and ¶[0058] Training with the defined objective is efficient, since both sampling and greedy decoding are cheap).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the endpointer of Sharifi to include the greedy decoder of Zhou in order to increase computational efficiency and improve speech recognition accuracy (Zhou ¶[0032]).
With respect to claims 24, 31 and 37, Sharifi and Li do not explicitly disclose; however, Zhou teaches wherein probabilities of each of the speech or non-speech characters are decoded up to the end of the one or more audio signals in order to generate one or more text transcripts of the one or more audio signals (Zhou ¶[0053] FIG. 4 shows an example whole transcription sampled [decoded] by the sampling module 125 from softmax probabilities generated by the RNN 352 after processing a speech sample annotated with a “HALO” transcription. The illustrated example would use CER as the evaluation metric. Another example could include words instead of characters, and calculate WER. In FIG. 4, the x axis shows the letters predicted for each 20 ms window, and the y axis lists the twenty-six letters of the alphabet and blank 472 and space 482. The bright red entries correspond to letters sampled by the sampling module 125. The sampled whole transcription is “HHHEE_LL_LLLOOO”. In some implementations, a collapsing module (not shown) enforces CTC collapsing rules and removes repeated letters and blanks to produce a final whole transcription “HELLO”).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the endpointer of Sharifi in view of the non-speech characters of Li to include the decoder of Zhou in order to increase computational efficiency and improve speech recognition accuracy (Zhou ¶[0032]).
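The greedy decoding and collapsing behavior quoted from Zhou ¶[0053] can be sketched as follows; the collapsing function below is a generic CTC rule written for illustration, not Zhou's code:

```python
# Greedy CTC collapsing sketch: merge repeated characters, then drop
# blanks. The input string is the sampled transcription quoted from
# Zhou ("HHHEE_LL_LLLOOO", with '_' as the blank symbol).

BLANK = '_'

def ctc_collapse(chars):
    out = []
    prev = None
    for c in chars:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return ''.join(out)

print(ctc_collapse("HHHEE_LL_LLLOOO"))  # HELLO
```

Note that the blank between the two runs of "L" prevents them from merging, which is how CTC collapsing preserves doubled letters.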
With respect to claims 27, 34 and 40, Sharifi and Li do not explicitly disclose; however, Zhou teaches wherein the one or more circuits are further to use a connectionist temporal classification (CTC) function with one or more neural networks to generate probabilities of each of the speech or non-speech characters based on features extracted from the one or more audio signals (Zhou ¶[0048] Connectionist temporal classification (CTC) 172 utilizes an objective function that allows RNN 352 to be trained for sequence transcription tasks without requiring any prior alignment between the input and target sequences. The output layer contains a single unit for each of the transcription labels, such as characters or phonemes plus an extra unit referred to as the "blank" which corresponds to a null emission. Given a length T input sequence X, the output vectors yt are normalized with the softmax function, then interpreted as the probability of emitting the label or blank with index k at time t).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the endpointer of Sharifi in view of the non-speech characters of Li to include the CTC function of Zhou in order to increase computational efficiency and improve speech recognition accuracy (Zhou ¶[0032]).
Claims 25, 32 and 38 are rejected under 35 U.S.C. 103 as being unpatentable over Sharifi and Li in further view of Gorny (US 20210020181 A1).
With respect to claims 25, 32 and 38, Li further teaches voice-controllable devices (Li ¶[0072] For example, a speech signal can be output after speech synthesis based on the semantic understanding, thereby realizing human-computer intelligent communication. For example, a response text [response text is generated based on voice] corresponding to the semantic understanding result can be generated based on the semantic understanding, and the speech signal can be synthesized according to the response text).
Sharifi and Li do not explicitly disclose; however, Gorny teaches wherein transcripts of the one or more audio signals are to be provided as input to one or more [[voice-controllable devices]] (Gorny ¶[0042] According to embodiments, transcription module 206 accesses local device audio data 214 and transcribes the audio data stored in local device audio data 214 into a local device text transcript, ¶[0006] In embodiments of the disclosed subject matter, the computer merges the audio transcription data from each of the two or more communication devices into a master audio transcript. The computer transmits the master audio transcript to each of the two or more communication devices).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the endpointer of Sharifi in view of the non-speech characters of Li to include the transcripts of Gorny in order to generate transcripts automatically and in real time (Gorny, [0018]).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA whose telephone number is (408) 918-7675. The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached on (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ATHAR N PASHA/Primary Examiner, Art Unit 2657