Prosecution Insights
Last updated: April 19, 2026
Application No. 18/037,546

METHOD AND APPARATUS FOR RECOGNIZING VOICE, ELECTRONIC DEVICE AND MEDIUM

Non-Final OA: §101, §103
Filed: May 17, 2023
Examiner: ARMSTRONG, ANGELA A
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: BEIJING YOUZHUJU NETWORK TECHNOLOGY CO., LTD.
OA Round: 1 (Non-Final)
Grant Probability: 75% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 11m
Grant Probability With Interview: 84%

Examiner Intelligence

Career Allow Rate: 75% (above average; 478 granted / 641 resolved; +12.6% vs TC avg)
Interview Lift: +9.5% (moderate, roughly a +10% lift, measured on resolved cases with interview)
Avg Prosecution: 3y 11m (typical timeline)
Total Applications: 666 across all art units; 25 currently pending
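
A worked check of these headline figures, assuming the reported percentages are simple rounded ratios and that the interview lift is additive in percentage points (the report does not state its conventions):

```python
# Worked check of the headline examiner statistics from the raw counts above.
# Rounding conventions are an assumption; the report does not state them.
granted, resolved = 478, 641

allow_rate = granted / resolved               # 0.7457... -> reported as 75%
interview_lift = 0.095                        # +9.5 points with an interview
with_interview = allow_rate + interview_lift  # 0.8407... -> reported as 84%

print(f"Career allow rate: {allow_rate:.1%}")      # -> 74.6%
print(f"With interview:    {with_interview:.1%}")  # -> 84.1%
```

478/641 rounds to the reported 75%, and adding the +9.5% lift lands on the reported 84%.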

Statute-Specific Performance

§101: 21.9% (-18.1% vs TC avg)
§103: 43.7% (+3.7% vs TC avg)
§102: 14.8% (-25.2% vs TC avg)
§112: 7.7% (-32.3% vs TC avg)

Each delta is measured against a Tech Center average estimate (shown as a black line in the original chart); the four deltas all imply an estimated TC average of 40.0% per statute. Based on career data from 641 resolved cases.

Office Action

Grounds of rejection: §101, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This Office Action is in response to the preliminary amendment filed May 17, 2023. Claims 9, 14, and 15 have been amended. Claim 13 has been cancelled. Claims 16-20 have been added. Claims 1-12 and 14-20 are pending.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on 5/4/2023, 4/1/2024, 2/5/2025, and 11/6/2025 are being considered by the examiner.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-6, 8-12, and 14-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Claims 1, 14, and 15 are directed to a method, device, and computer readable medium reciting limitations for acquiring an audio data to be recognized, the audio data to be recognized comprising a speech segment; determining a start and end time corresponding to the speech segment which is comprised in the audio data to be recognized; extracting at least one speech segment from the audio data to be recognized based on the determined start and end time; and performing speech recognition on the at least one extracted speech segment to generate recognition text corresponding to the audio data to be recognized.

The limitation for acquiring an audio data to be recognized, the audio data to be recognized comprising a speech segment, is a data gathering step that can be achieved by a person hearing the audio. The feature for determining a start and end time corresponding to the speech segment which is comprised in the audio data to be recognized can be achieved by the person listening to the audio and, using pen and paper, noting the start and end times of portions of the audio. The limitation for extracting at least one speech segment from the audio data to be recognized based on the determined start and end time can be achieved by the person selecting a timing portion of the audio that should be recognized. The feature for performing speech recognition on the at least one extracted speech segment to generate recognition text corresponding to the audio data to be recognized can be achieved by the person listening to the selected portion of the audio, listening to that portion again, understanding what was spoken, and using pen and paper to generate the corresponding text heard in the audio.

The recited limitations are directed to a process that, under its broadest reasonable interpretation, covers performance of the limitations in the mind but for the recitation of the generic computer, apparatus, computer program product, and generic computer components. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
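
Stripped of the legal framing, the independent claims recite a four-step pipeline: acquire audio, locate the start and end times of speech, extract those segments, and recognize them. A minimal Python sketch of that pipeline, purely as illustration before the eligibility analysis continues below; the vad and asr callables (and the Segment type) are hypothetical stand-ins, not APIs from the application or the cited art:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds

def recognize(audio, sample_rate, vad, asr) -> str:
    """Illustrative four-step pipeline; vad and asr are hypothetical callables.

    vad(audio, sample_rate) -> list[Segment]  # step 2: start/end times of speech
    asr(clip) -> str                          # step 4: transcribe one segment
    """
    segments = vad(audio, sample_rate)                  # determine start/end times
    clips = [audio[int(s.start * sample_rate):int(s.end * sample_rate)]
             for s in segments]                         # step 3: extract segments
    return " ".join(asr(clip) for clip in clips)        # recognition text
```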
This judicial exception is not integrated into a practical application because the recited generic computing device, processor, memory, computer readable medium, and program instructions amount to no more than mere instructions to apply the exception using generic computer components. Accordingly, the elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea and are not patent eligible.

The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as indicated with respect to integration of the abstract idea into a practical application, the additional elements of the generic computing device, processor, memory, computer readable medium, and program instructions to perform the various steps amount to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claims are not patent eligible.

Dependent claims 2-3, 5-6, 8-12, and 16-20 do not integrate the judicial exception into a practical application and do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The limitations of the dependent claims are directed to steps of organizing or manipulating functions and commands for the transcribed text, performing mathematical calculations for detecting voice/speech, utilizing mathematical algorithms for recognizing speech, mental processing steps for keyword spotting, and implementing extra-solution activity for displaying outputs or data.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8, 14-16, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Tian et al. (US Patent Application Publication No. 2020/0075012), hereinafter Tian, in view of Schalkwyk et al. (US Patent Application Publication No. 2015/0340034), hereinafter Schalkwyk.
Regarding claims 1, 14, and 15, Tian teaches a method, device (including processor and memory), and computer readable medium with program instructions [para 0005-0011] comprising steps for acquiring an audio data to be recognized, the audio data to be recognized comprising a speech segment [para 0027; 0031 – microphone array and speech processing device 103]; determining a start and end time corresponding to the speech segment which is comprised in the audio data to be recognized [para 0009; 0034; 0061-0070; 0075-0077 – performing a speech sound start and end point analysis to determine start and end time points of the speech sounds in the audio signal]; extracting at least one speech segment from the audio data to be recognized based on the determined start and end time [para 0009; 0034; 0061-0070; 0075-0077 – performing a speech sound start and end point analysis to determine start and end time points of the speech sounds in the audio signal]; and performing speech recognition on the at least one extracted speech segment corresponding to the audio data to be recognized [para 0027; 0078 – speech recognition].

Tian fails to specifically teach that the recognized speech is converted to text. In a similar field of endeavor, Schalkwyk teaches a speech recognition system that receives an input and generates a predicted output based on the received input. In particular, the input is an audio input, e.g., a recording of a speech input spoken by a user, and the predicted output is a predicted text transcript for the speech input [para 0013]. One having ordinary skill in the art would have recognized the advantages of implementing the text generation processing suggested by Schalkwyk in the system of Tian, for the purpose of providing an indication to the user that the words that were spoken were correctly recognized, thereby ensuring the system is accurately operating and performing the user’s spoken intents.

Regarding claim 2, the combination of Tian and Schalkwyk teaches the method according to claim 1, wherein the determining a start and end time corresponding to the speech segment which is comprised in the audio data to be recognized comprises: extracting an audio frame feature of the audio data to be recognized to generate a first audio frame feature [Tian para 0068]; determining a probability that an audio frame corresponding to the first audio frame feature belongs to a speech [Tian para 0034; 0061-0070]; and generating, based on comparison between the determined probability and a predetermined threshold, the start and end time corresponding to the speech segment [Tian para 0061-0070].

Regarding claim 3, the combination of Tian and Schalkwyk teaches the method according to claim 2, wherein the determining a probability that an audio frame corresponding to the first audio frame feature belongs to a speech comprises: inputting the first audio frame feature into a pre-trained speech detection model, and generating the probability that the audio frame corresponding to the first audio frame feature belongs to the speech [Tian’s neural network VAD model – para 0071-0073].
Regarding claim 4, the combination of Tian and Schalkwyk teaches the method according to claim 3, wherein the speech detection model is obtained by training [Tian’s neural network VAD model training – para 0071-0073] through the following steps: acquiring a first training sample set, wherein first training samples in the first training sample set comprise first sample audio frame features and corresponding sample labeling information, the first sample audio frame features are obtained by extracting features of first sample audios, the sample labeling information is used for representing a category to which the first sample audios belong, and the category comprises a speech [Tian’s neural network VAD model training – para 0071-0073]; acquiring an initial speech detection model for classification [Tian’s neural network VAD model training – para 0071-0073]; and taking the first sample audio frame features in the first training sample set as inputs of the initial speech detection model, taking labeling information corresponding to the input first audio frame features as desired outputs, so as to obtain the speech detection model by training [Tian’s neural network VAD model training – para 0071-0073].

Regarding claim 5, the combination of Tian and Schalkwyk teaches the method according to claim 2, wherein the generating, based on comparison between the determined probability and a predetermined threshold, the start and end time corresponding to the speech segment comprises: using a preset sliding window to select probability values corresponding to a first number of audio frames [para 0009; 0034; 0061-0070; 0075-0077 – performing a speech sound start and end point analysis to determine start and end time points of the speech sounds in the audio signal]; determining a statistical value of the selected probability values [Tian para 0009; 0034; 0061-0070; 0075-0077 – performing a speech sound start and end point analysis to determine start and end time points of the speech sounds in the audio signal using probability thresholds]; and generating, in response to determining that the statistical value is greater than the predetermined threshold, the start and end time corresponding to the speech segment based on an audio segment consisting of the first number of audio frames corresponding to the selected probability values [para 0009; 0034; 0061-0070; 0075-0077 – performing a speech sound start and end point analysis to determine start and end time points of the speech sounds in the audio signal using probability threshold].
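
As mapped above, claims 2 and 5 reduce to smoothing per-frame speech probabilities with a sliding window and comparing the windowed statistic to a threshold to emit start and end times. A self-contained sketch of that logic; the window size, threshold, and 10 ms frame hop are arbitrary example values, not parameters from Tian or the application:

```python
def speech_segments(frame_probs, win=5, threshold=0.5, frame_sec=0.01):
    """Turn per-frame speech probabilities into (start, end) times in seconds.

    Illustration of sliding-window thresholding; win, threshold, and the
    10 ms frame hop are example values only.
    """
    segments, start = [], None
    for i in range(len(frame_probs) - win + 1):
        mean = sum(frame_probs[i:i + win]) / win   # statistical value of window
        if mean > threshold and start is None:
            start = i                              # speech segment begins
        elif mean <= threshold and start is not None:
            segments.append((start * frame_sec, (i + win - 1) * frame_sec))
            start = None                           # speech segment ends
    if start is not None:                          # segment runs to end of audio
        segments.append((start * frame_sec, len(frame_probs) * frame_sec))
    return segments

# Example: a burst of high-probability frames yields one segment.
print(speech_segments([0.1, 0.2, 0.9, 0.95, 0.9, 0.9, 0.9, 0.2, 0.1], win=3))
```

Running the example prints [(0.01, 0.08)]: the burst of high-probability frames is emitted as a single start/end pair.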
Regarding claims 6, 16, and 19, the combination of Tian and Schalkwyk teaches wherein the performing speech recognition on the at least one extracted speech segment to generate recognition text corresponding to the audio data to be recognized comprises: extracting a frame feature of a speech from the at least one extracted speech segment to generate a second audio frame feature [Tian’s speech processing device 103; Schalkwyk Fig 1; Fig 2; para 0012-0030]; inputting the second audio frame feature into a pre-trained acoustic model, so as to obtain a second number of phoneme sequences to be matched corresponding to the second audio frame feature, and a corresponding score [Schalkwyk’s speech recognition processing – Fig 1; Fig 2; para 0012-0030]; inputting the second number of phoneme sequences to be matched into a pre-trained language model, so as to obtain text to be matched corresponding to the second number of phoneme sequences to be matched, and a corresponding score [Schalkwyk’s speech recognition processing – Fig 1; Fig 2; para 0012-0030]; selecting, based on the scores respectively corresponding to the obtained phoneme sequences to be matched and the text to be matched, text to be matched as matching text corresponding to the at least one speech segment from the obtained text to be matched [Schalkwyk’s speech recognition processing – Fig 1; Fig 2; para 0012-0030]; and generating, based on the selected matching text, the recognition text corresponding to the audio data to be recognized [Schalkwyk’s speech recognition processing – Fig 1; Fig 2; para 0012-0030].

Regarding claim 7, the combination of Tian and Schalkwyk teaches the method according to claim 6, wherein the acoustic model is obtained by training through the following steps: acquiring a second training sample set, wherein second training samples in the second training sample set comprise second sample audio frame features and corresponding sample text, the second sample audio frame features are obtained by extracting features of second sample audio, and the sample text is used for representing contents of the second sample audios [Schalkwyk’s speech recognition model training – Fig 3; para 0034-0041]; acquiring an initial acoustic model; taking the second sample audio frame features in the second training sample set as inputs of the initial acoustic model, taking phonemes indicated by the sample text corresponding to the input second sample audio frame features as desired outputs, and pre-training the initial acoustic model on the basis of a first training criterion, wherein the first training criterion is generated on the basis of an audio frame sequence [Schalkwyk’s speech recognition model training – Fig 3; para 0034-0041]; converting the phonemes indicated by second sample text into phoneme labels for a second training criterion by using a predetermined window function, wherein the second training criterion is generated on the basis of an audio frame [Schalkwyk’s speech recognition model training – Fig 3; para 0034-0041]; and taking the second sample audio frame features in the second training sample set as inputs of the pre-trained initial acoustic model, taking the phoneme labels corresponding to the input second sample audio frame features as desired outputs, and using the second training criterion to train the pre-trained initial acoustic model, so as to obtain the acoustic model [Schalkwyk’s speech recognition model training – Fig 3; para 0034-0041].
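
Claims 6-8 describe a conventional two-stage decode: an acoustic model proposes scored phoneme sequences, a language model turns them into scored text candidates, and a weighted summation of the two scores (mapped for claim 8 just below) picks the transcript. A schematic Python sketch; the model callables and the 0.6/0.4 weights are hypothetical, not from the application or Schalkwyk:

```python
def best_transcript(feature, acoustic_model, language_model, w_am=0.6, w_lm=0.4):
    """Two-stage decode with weighted score fusion (illustrative weights).

    acoustic_model(feature)  -> list of (phoneme_seq, am_score)
    language_model(phonemes) -> (text, lm_score)
    Both callables are hypothetical stand-ins.
    """
    candidates = []
    for phonemes, am_score in acoustic_model(feature):
        text, lm_score = language_model(phonemes)
        total = w_am * am_score + w_lm * lm_score   # weighted summation of scores
        candidates.append((total, text))
    return max(candidates)[1]                       # highest total score wins
```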
Regarding claim 8, the combination of Tian and Schalkwyk teaches the method according to claim 6, wherein the selecting, based on the scores respectively corresponding to the obtained phoneme sequences to be matched and the text to be matched, text to be matched as matching text corresponding to the at least one speech segment from the obtained text to be matched, comprises: performing weighted summation on the scores respectively corresponding to the obtained phoneme sequences to be matched and the text to be matched to generate a total score corresponding to each piece of text to be matched [Schalkwyk’s speech recognition processing – Fig 1; Fig 2; para 0012-0030]; and selecting, from the obtained text to be matched, text to be matched with a highest total score as the matching text corresponding to the at least one speech segment [Schalkwyk’s speech recognition processing – Fig 1; Fig 2; para 0012-0030].

Claims 9-12, 17-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Tian in view of Schalkwyk as applied to claims 1, 14, and 15 above, and further in view of Ganapathiraju et al. (US Patent No. 9,672,815), hereinafter Ganapathiraju.

Regarding claims 9-12, 17-18, and 20, the combination of Tian and Schalkwyk teaches acquiring a video file [Tian’s image capturing data 104]; extracting an audio track from the video file to generate audio data to be recognized [Tian para 0009; 0034; 0061-0070; 0075-0077 – performing a speech sound start and end point analysis to determine start and end time points of the speech sounds in the audio signal; 0027; 0078 – speech recognition]; and determining whether a word in a preset word set exists in the recognition text [Schalkwyk’s speech recognition processing – Fig 1; Fig 2; para 0012-0030]. The combination of Tian and Schalkwyk fails to specifically teach sending, in response to determining that the word in the preset word set exists in the recognition text, the video file to be reviewed and the recognition text to a target terminal.

In a similar field of endeavor, Ganapathiraju teaches keyword spotting in speech recognition processing, where keywords are set by user preference, speech input is recognized, and the user-set keywords are detected; once detected, a report is generated to inform a user the keyword was detected [col. 2, lines 20-46]. Ganapathiraju specifically teaches the system is advantageous in improving real-time speech recognition and increasing the efficacy of the recognizer in multiple domains or languages [col. 1, line 65 to col. 2, line 6]. One having ordinary skill in the art would have recognized the advantages of implementing the keyword spotting and reporting techniques suggested by Ganapathiraju in the Tian/Schalkwyk system, and providing the audio/video data to a user for review, for the purpose of improving real-time speech recognition and increasing the efficacy of the recognizer in multiple domains or languages, and to ensure any issues or problems users experience with the system can be monitored and adequately resolved, thereby improving the system and enhancing the user’s experience.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANGELA A ARMSTRONG whose telephone number is (571)272-7598. The examiner can normally be reached M, T, Th, F 11:30-8:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

ANGELA A. ARMSTRONG
Primary Examiner
Art Unit 2659

/ANGELA A ARMSTRONG/
Primary Examiner, Art Unit 2659
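
On the technical substance of the second §103 ground above: claims 9-12 add extracting a video's audio track, transcribing it, checking the transcript against a preset word set, and sending flagged files to a reviewing terminal. A sketch of that gist; transcribe and send_to_terminal are hypothetical stand-ins for the claimed recognition and delivery steps:

```python
def review_if_flagged(video_file, preset_words, transcribe, send_to_terminal):
    """Keyword spotting over a video transcript; helpers are hypothetical.

    transcribe(video_file) -> str   # extract the audio track and recognize it
    send_to_terminal(file, text)    # forward the flagged file + text for review
    """
    text = transcribe(video_file)                  # audio track -> recognition text
    hits = [w for w in preset_words if w in text]  # words from the preset set
    if hits:                                       # a preset word was found
        send_to_terminal(video_file, text)         # send for human review
    return hits
```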

Prosecution Timeline

May 17, 2023: Application Filed
Jan 10, 2026: Non-Final Rejection under §101 and §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602547: DOMAIN ADAPTING GRAPH NETWORKS FOR VISUALLY RICH DOCUMENTS (granted Apr 14, 2026; 2y 5m to grant)
Patent 12596879: METHOD AND SYSTEM FOR IDENTIFYING CITATIONS WITHIN REGULATORY CONTENT (granted Apr 07, 2026; 2y 5m to grant)
Patent 12585892: AUTO-TRANSLATION OF CUSTOMIZED ASSISTANT (granted Mar 24, 2026; 2y 5m to grant)
Patent 12555491: Inclusive Intelligence for Digital Workplace (granted Feb 17, 2026; 2y 5m to grant)
Patent 12547843: SYSTEMS AND METHODS FOR GENERALIZED ENTITY MATCHING (granted Feb 10, 2026; 2y 5m to grant)

Study what changed to get past this examiner. Based on the 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 75%
With Interview: 84% (+9.5%)
Median Time to Grant: 3y 11m
PTA Risk: Low
Based on 641 resolved cases by this examiner. Grant probability derived from career allow rate.
