Prosecution Insights
Last updated: April 19, 2026
Application No. 17/570,725

METHOD FOR FACILITATING SPEECH ACTIVITY DETECTION FOR STREAMING SPEECH RECOGNITION

Non-Final OA §103
Filed: Jan 07, 2022
Examiner: MAUNG, THOMAS H
Art Unit: 2692
Tech Center: 2600 (Communications)
Assignee: Gnani Innovations Private Limited
OA Round: 5 (Non-Final)
Grant Probability: 63% (Moderate)
Expected OA Rounds: 5-6
Time to Grant: 2y 11m
Grant Probability With Interview: 99%

Examiner Intelligence

Grants 63% of resolved cases.

Career Allow Rate: 63% (242 granted / 382 resolved; +1.4% vs TC avg)
Interview Lift: +38.2% (strong), across resolved cases with interview
Typical Timeline: 2y 11m average prosecution; 24 applications currently pending
Career History: 406 total applications, across all art units
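As a sanity check, the headline figures in this panel can be reproduced from the raw counts it displays (the variable names below are ours, not the dashboard's):

```python
# Recompute the examiner panel's headline figures from its raw counts.
granted, resolved = 242, 382
total_applications = 406

allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")  # ≈ 63.4%, shown as 63%

# Pending count is total applications minus resolved cases.
pending = total_applications - resolved
print(f"Currently pending: {pending}")  # 24, matching the card above
```
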

Statute-Specific Performance

§101: 6.4% (-33.6% vs TC avg)
§103: 54.5% (+14.5% vs TC avg)
§102: 13.7% (-26.3% vs TC avg)
§112: 12.9% (-27.1% vs TC avg)

Tech Center averages are estimates. Based on career data from 382 resolved cases.
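The per-statute deltas are stated relative to the Tech Center average; back-solving from the numbers above shows the implied baseline works out to the same value (about 40%) for every statute. A quick check, using only the figures shown:

```python
# Back out the implied Tech Center baseline from each statute's
# (examiner rate, delta vs TC avg) pair, both in percent.
stats = {
    "101": (6.4, -33.6),
    "103": (54.5, +14.5),
    "102": (13.7, -26.3),
    "112": (12.9, -27.1),
}
for statute, (rate, delta) in stats.items():
    tc_avg = rate - delta  # TC average = examiner rate minus delta
    print(f"§{statute}: implied TC average ≈ {tc_avg:.1f}%")  # 40.0% each
```
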

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 03/03/2026 have been fully considered but they are not persuasive. However, in light of the amendments, the Examiner has applied an additional reference to address the amendments.

Claim 1 is essentially claiming, in part: (1) classifying a textual form of an audio signal (into a class of words and/or punctuation); (2) further predicting whether each word is a start word, a middle word, or an end word; and (3) based on the prediction, deactivating or activating audio recording. Based on Applicant's previous statements, the Examiner interprets the predefined class to include a words class and/or a punctuations class.

Minkin teaches transcription of an audio signal at the word level for every word in a transcription ([0090], [0092]), with the textual representation comprising at least one word and at least one pause ([0011]). Note also that features can be determined based on the textual representation 222 ([0192]). Regarding the attributes, Minkin further teaches, as provided in the rejection below: [0179] The server 106 may be configured to determine a first in-use set of features 512 for the first in-use segment 502 similarly to how the server 106 is configured to determine the plurality of sets of features 320 for the plurality of segments 300. [0180] Once the first in-use set of features 512 is determined, the server 106 may then input the first in-use set of features 512 into the NN 400.

In response to Applicant's argument that the Examiner's conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning.
But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant's disclosure, such a reconstruction is proper. See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971).

In response to Applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-7, and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Minkin (US 2020/0193987) in view of Aher et al. (US 2021/0327419) and Muchlinski et al. (US 2019/0043529).

Claim 1

Minkin teaches a system enabling automatic speech recording, said system comprising a processor that executes a set of executable instructions stored in a memory, upon whose execution the processor causes the system to:

receive a set of data packets from an audio device, said set of data packets corresponding to an audio signal, wherein said audio signal is recorded or streamed by a speech recognition engine ([0048]: As illustrated in FIG. 1, the user 102 may be uttering voice-based commands to the device 104. The device 104 is configured to record a digital audio signal 160 while the user 102 is uttering the voice-based command in a form of a user utterance 150. In other words, the device 104 is configured to record the digital audio signal 160 in real-time while the user 102 is uttering the user utterance 150 in proximity to the device 104. [0066] and [0084] teach that an electronic device such as the device 104 or server 106 provides a speech recognition algorithm; see also [0057]);

convert, by the speech recognition engine, said audio signal into textual form ([0214]: In other embodiments, the electronic device may be providing at least some of the digital audio signal 160 to the ASR algorithm for determining the textual representation of the user utterance 150.);

extract, by a classification engine, a first set of attributes from the textual form, said first set of attributes pertaining to any or a combination of a set of predefined class of words and punctuations for every input word converted by the speech recognition engine ([0091]: In this example, the server 106 may input the digital audio signal 202 and the textual representation 222 into the ASA algorithm which is configured to automatically "time-align" the words from the textual representation 222 so as to obtain time intervals of the digital audio signal 202 during which the respective words from the textual representation 222 are uttered. [0179]: The server 106 may be configured to determine a first in-use set of features 512 for the first in-use segment 502 similarly to how the server 106 is configured to determine the plurality of sets of features 320 for the plurality of segments 300. [0180]: Once the first in-use set of features 512 is determined, the server 106 may then input the first in-use set of features 512 into the NN 400.);

predict, by the classification engine, a second set of attributes (see 522 of FIG. 5, for example) from the first set of attributes, said second set of attributes pertaining to the set of predefined class of words and punctuations at any or a combination of the beginning of the sentence, within the sentence, and at the end of the sentence ([0180]: Once the first in-use set of features 512 is determined, the server 106 may then input the first in-use set of features 512 into the NN 400. The NN 400 is configured to output a first in-use output value 522. Let it be assumed that the first in-use output value 522, as illustrated in FIG. 5, is "0.1" (or 10%, for example). This means that the NN 400 may determine that there is a probability of "0.1" (or 10%) that the user utterance 150 of the user 102 has ended during the first in-use segment 502 of the digital audio signal 160. The Examiner notes this indicates the classification engine predicts by generating a second set of attributes, namely a probability pertaining to the previously defined class, such as the utterance ending.);

based on the predicted second set of attributes, facilitate, by an ML engine, deactivation or activation of a switching mechanism, wherein the switching mechanism controls the activation or deactivation of recording or streaming of the audio signal (see the comparison of a predicted feature to a threshold to determine whether to continue or stop recording, similar to the process performed for the top half of FIG. 5 and [0180]; [0186]: As such, the server 106 may determine that the second in-use output value 524 is superior to the pre-determined prediction threshold 550.
This means that the probability (determined by the NN 400) that the user utterance 150 has ended during the second in-use segment 504 is high enough for the IPA processing system 108 to determine that the user utterance 150 has ended during the second in-use segment 504. [0187]: In response to determining that the user utterance 150 has ended during the second in-use segment 504 of the digital audio signal 160, the IPA processing system 108 may generate a trigger for the device 104 to stop recording the digital audio signal 160 since the user 102 stopped uttering.).

Minkin may not clearly detail prediction related to different parts of the sentence, specifically the beginning of the sentence and within the sentence, wherein the second set of attributes comprises, for every input word converted by the speech recognition engine, a classification of whether the input word belongs to a class of words pertaining to a start word, a middle word, or an end word.

The analogous art Aher teaches a word detection device that predicts different parts of the sentence and a classification of whether the input word belongs to a class of words pertaining to a start word, a middle word, or an end word ([0030]: device 202 is equipped with the capability to detect the beginning and ending of a phrase. [0031]: In implementations using model-based prediction, such as with the use of HMM or LSTM models, the model is trained to predict whether the uttered word is a start of the sentence, an intermediate word, or the last word of the sentence.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate word detection as taught by Aher with the utterance identification system of Minkin, because doing so would have provided a model that is trained with, and can therefore predict, features such as, without limitation, question tags, WH ("what") words, articles, part-of-speech tags, intonations, syllables, or any other suitable language attributes ([0031] of Aher).

Minkin in view of Aher discloses the system as claimed in claim 1, except wherein the ML engine is configured to detect, predict, and discard word viruses. Muchlinski teaches wherein the ML engine is configured to detect, predict, and discard word viruses ([0042]: In another embodiment, acoustic model 208 is a pruned deep neural network having the number of outputs reduced or pruned such that only a subset of available outputs (e.g., as determined during set-up and/or training) are provided or activated. As discussed further herein, in some embodiments, output layer 407 may be pruned such that only predetermined output nodes (and associated scores 214) are provided, such that a subset of available states or scores are implemented via neural network 400. Similarly, output nodes 522 corresponding to spoken noise audio units 502 provide probability scores for spoken noise audio units 502, such that each models or represents different spoken noise audio units 502 but all model or represent spoken noise audio units 502. For example, spoken noise audio units 502 include audio units that are recognized as spoken by a human but are not recognized as spoken language. See also [0077]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate speech classification of audio with the acoustic model as taught by Muchlinski with the utterance identification system of Minkin and Aher, because doing so would have provided a way to achieve high-quality, low-resource speech/non-speech classification ([0003] of Muchlinski).

Claim 2

Minkin of the combination teaches the system as claimed in claim 1, wherein said audio signal pertains to a conversation between at least one user and a computing device (see FIG. 1).

Claim 4

Muchlinski of the combination teaches wherein, on reaching the end of the sentence, the execution of the speech recognition engine is ended or deactivated by the switching mechanism, wherein the switching mechanism is configured to return control again to the speech recognition engine comprising a voice activity detector ([0037]: For example, voice activity detection module 207 may provide a low-power always-listening capability for system 200. For example, upon activation by initiation signal 217, audio data 211 may be continuously monitored for speech detection until controller 206 determines speech has been detected, as indicated by speech indicator 215, and buffer wake indicator 216 and/or system command 218 are provided, or until a determination is made by voice activity detection module 207 to reenter a sleep mode or low power state or the like.).
Claim 5

Minkin of the combination teaches the system as claimed in claim 1, wherein the ML engine is configured by a plurality of training data comprising a set of predefined class of words and punctuations, wherein the ML engine learns and self-trains from the plurality of training data to facilitate auto activation and deactivation of the recording or streaming of the audio signal ([0019]: The electronic device is configured to use the sets of features and the respective labels for training a Neural Network (NN) to predict during which segment of the digital audio signal the user utterance has ended. [0135]: First, the NN 400 is trained in the training phase. During the training phase, a large number of training iterations may be performed by the server 106 on the NN 400. Broadly speaking, during a given training iteration, the NN 400 is inputted with sets of features associated with a common digital audio signal and, in a sense, "learns" which of these sets of features corresponds to a segment of that digital audio signal during which a user utterance in that digital audio signal has ended (using the adjusted end-of-utterance moment in time corresponding to the timestamp 350 depicted in FIG. 3 as a proxy thereof).).

Claims 6-7 and 9-10

These claims recite substantially the same limitations as those provided in claims 1-2 and 4-5, respectively, and are therefore rejected for the same reasons.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to THOMAS H. MAUNG, whose telephone number is (571) 270-5690. The examiner can normally be reached Monday-Friday, 9am-6pm EST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Carolyn R. Edwards, can be reached at (571) 270-7136. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/THOMAS H MAUNG/
Primary Examiner, Art Unit 2692
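For orientation, the control flow the rejected claim 1 describes (classify each transcribed word as a start, middle, or end word, then toggle recording accordingly) can be sketched roughly as follows. This is an illustrative toy, not the application's actual model: the punctuation heuristic and all names are placeholders standing in for the claimed ML engine.

```python
# Toy sketch of the claimed switching mechanism: classify each transcribed
# word as start/middle/end of a sentence, and deactivate recording once an
# end word is predicted. The classifier below is a placeholder heuristic.

def classify_word(word: str, position: int, total: int) -> str:
    """Stand-in for the claimed classification engine (first set of
    attributes -> start/middle/end prediction)."""
    if position == 0:
        return "start"
    if position == total - 1 and word.endswith((".", "?", "!")):
        return "end"
    return "middle"

def should_keep_recording(transcript: list[str]) -> bool:
    """Switching mechanism: stop recording when an end word is seen."""
    for i, word in enumerate(transcript):
        if classify_word(word, i, len(transcript)) == "end":
            return False
    return True

print(should_keep_recording(["turn", "on", "the", "lights."]))  # False: end word seen
print(should_keep_recording(["turn", "on", "the"]))             # True: utterance ongoing
```

A real implementation would replace the heuristic with a trained model (the OA discusses HMM/LSTM classifiers in Aher and an end-of-utterance probability threshold in Minkin), but the recording switch driven by a per-word prediction is the structure at issue.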

Prosecution Timeline

Jan 07, 2022: Application Filed
Feb 09, 2024: Non-Final Rejection (§103)
Jun 13, 2024: Response Filed
Jul 27, 2024: Final Rejection (§103)
Dec 02, 2024: Response after Non-Final Action
Dec 31, 2024: Request for Continued Examination
Jan 08, 2025: Response after Non-Final Action
Apr 01, 2025: Non-Final Rejection (§103)
Jul 07, 2025: Response Filed
Aug 30, 2025: Final Rejection (§103)
Mar 03, 2026: Request for Continued Examination
Mar 05, 2026: Response after Non-Final Action
Mar 11, 2026: Non-Final Rejection (§103) (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602446: DATA COMMUNICATION SYSTEM (granted Apr 14, 2026; 2y 5m to grant)
Patent 12602196: Audio Playback Adjustment (granted Apr 14, 2026; 2y 5m to grant)
Patent 12585653: PARSING IMPLICIT TABLES (granted Mar 24, 2026; 2y 5m to grant)
Patent 12586562: ANIMATED SPEECH REFINEMENT USING MACHINE LEARNING (granted Mar 24, 2026; 2y 5m to grant)
Patent 12578918: STREAMING AUDIO TO DEVICE CONNECTED TO EXTERNAL DEVICE (granted Mar 17, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 63%
With Interview: 99% (+38.2%)
Median Time to Grant: 2y 11m
PTA Risk: High
Based on 382 resolved cases by this examiner. Grant probability derived from career allow rate.
