Prosecution Insights
Last updated: April 19, 2026
Application No. 18/673,609

Speech Recognition Method, Speech Recognition Apparatus, and System

Non-Final OA: §101, §102, §103
Filed: May 24, 2024
Examiner: BOGGS JR., JAMES
Art Unit: 2657
Tech Center: 2600 — Communications
Assignee: Shenzhen Yinwang Intelligent Technologies Co., Ltd.
OA Round: 1 (Non-Final)

Grant Probability: 60% (Moderate)
OA Rounds: 1-2
To Grant: 3y 3m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 60% (64 granted / 107 resolved; -2.2% vs TC avg)
Interview Lift: +38.8% among resolved cases with an interview (strong)
Avg Prosecution: 3y 3m (typical timeline); currently 28 pending
Total Applications: 135 (career history, across all art units)
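The headline figures are related by simple arithmetic, and the interview lift appears to be expressed in percentage points rather than as a relative increase. A minimal sketch of that reading; how the dashboard actually derives the "with interview" figure is an assumption, not source data:

```python
# Counts taken from the page; the derivation below is an assumed
# reconstruction of the dashboard's arithmetic, not its documented method.
granted, resolved = 64, 107
allow_rate_pts = round(100 * granted / resolved, 1)   # 59.8 -> shown as 60%

# If "+38.8% interview lift" means percentage points, the numbers line up:
with_interview_pts = allow_rate_pts + 38.8            # 98.6 -> shown as 99%
print(f"allow rate {allow_rate_pts}% -> with interview ~{with_interview_pts:.0f}%")
```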

Statute-Specific Performance

§101: 12.4% (-27.6% vs TC avg)
§102: 16.2% (-23.8% vs TC avg)
§103: 48.5% (+8.5% vs TC avg)
§112: 18.1% (-21.9% vs TC avg)

Tech Center averages are estimates. Based on career data from 107 resolved cases.

Office Action

§101 §102 §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference signs mentioned in the description: “500”, “501”, “502”, “503”, “504”, “505”, and “506” in paragraphs 0159 through 0180, 0222, and 0231 through 0232. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Objections

Claims 7, 15 and 22 are objected to because of the following informalities: In claim 7, line 7, “is “ending”” should read “being “ending””. In claim 15, line 6, “is “ending”” should read “being “ending””. In claim 22, line 6, “is “ending”” should read “being “ending””. Appropriate correction is required.

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: “an obtainer configured to obtain first audio data” in claim 9.

Because this claim limitation is being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it is being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this limitation interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation to avoid it being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation recites sufficient structure to perform the claimed function so as to avoid it being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 17 – 22 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because the broadest reasonable interpretation of a “computer program product comprising computer-executable instructions that are stored on a computer-readable storage media” can encompass non-statutory transitory forms of signal transmission, such as a propagating electrical or electromagnetic signal per se.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 – 2, 6 – 10, 14 – 18 and 21 – 22 are rejected under 35 U.S.C. 102(a)(1) and 102(a)(2) as being anticipated by Liu et al. (US Patent No. 9,437,186), hereinafter Liu.

Regarding claim 1, Liu discloses a speech recognition method, comprising:

obtaining first audio data comprising a plurality of audio frames (Column 8, lines 4-6, "The ASR module 314 may include an acoustic front end (AFE), not shown. The AFE transforms audio data into data for processing by the speech recognition engine."; Column 8, lines 11-15, "The AFE may reduce noise in the audio data and divide the digitized audio data into frames representing time intervals for which the AFE determines a set of values, called a feature vector, representing the features/qualities of the utterance portion within the frame.");

extracting sound categories of the audio frames and semantics of the audio frames based on relationships between energies of the audio frames and preset energy thresholds (Column 4, lines 30-48, "Audio detection processing for endpoint determination may be performed by determining an energy level of the audio input. In some embodiments, the endpointing/audio detection may include a low-power digital signal processor (or other type of processor) configured to determine an energy level (such as a volume, intensity, amplitude, etc.) of an obtained audio input and for comparing the energy level of the audio input to an energy level threshold. The energy level threshold may be set according to user input, or may be set by a device. In some embodiments, the endpointing/audio detection may be further configured to determine that the audio input has an energy level satisfying a threshold for at least a threshold duration of time. In such embodiments, high-energy audio inputs of relatively short duration, which may correspond to sudden noises that are relatively unlikely to include speech, may be ignored. The endpointing/audio detection may compare the energy level to the energy level threshold (and optionally to the threshold duration) to determine whether the energy level threshold is met."; Column 4, lines 49-64, "If the endpointing/audio detection determines that the obtained audio input has an energy level satisfying an energy level threshold it may process audio input to determine whether the audio input includes speech. In some embodiments, the endpointing/audio detection works in conjunction with digital signal processing to implement one or more techniques to determine whether the audio input includes speech. Some embodiments may apply voice activity detection (VAD) techniques, such as harmonicity detection. Such techniques may determine whether speech is present in an audio input based on various quantitative aspects of the audio input, such as the spectral slope between one or more frames of the audio input; the energy levels of the audio input in one or more spectral bands; the signal-to-noise ratios of the audio input in one or more spectral bands; or other quantitative aspects."; Column 11, lines 27-44, "In a semantic interpretation process (which may be part of traditional NLU processing), which commonly takes places after ASR processing, semantic tagging is a process of recognizing and identifying specific important words of an ASR output and assigning a tag to those words, where the tag is a classification of the associated word. The tags may be called entities or named entities. Some words in a phrase may be considered less important, thus not considered for a named entity and may not receive a tag or may be given a catchall or default tag such as “Unknown” or “DontCare.” The tagging process may also be referred to as named entity recognition (NER). In this aspect, the semantic information and tags are built in to the models used to perform ASR processing such that the semantic information (which may be less comprehensive than tags available in a post-ASR semantic tagging process) is output with the text as a result of the ASR process."; Comparing the energy level of the audio input to an energy level threshold to determine whether speech is present in an audio input, and performing semantic tagging of the speech, reads on extracting sound categories of the audio frames and semantics of the audio frames based on relationships between energies of the audio frames and preset energy thresholds.); and

obtaining a speech ending point of the first audio data based on the sound categories and the semantics (Column 3, lines 58-65, "In a further implementation according to the disclosure, semantic information in the user's speech may be used to help determine the end of an utterance. That is, an ASR processor may be configured so that semantic tags or other indicators may be included as part of the ASR output and used to determine, as the user's speech is recognized, whether the user's utterance has reached a logical stopping point."; Column 11, line 67 - Column 12, line 2, "Thus semantic information in the user's speech may be used to help determine the ending of an utterance, instead of basing it on non-speech audio frames only."; Determining the ending of an utterance using semantic information and non-speech audio frames reads on obtaining a speech ending point of the first audio data based on the sound categories and the semantics.).

Regarding claim 2, Liu discloses the speech recognition method as claimed in claim 1. Liu further discloses: wherein after obtaining the first audio data, the method further comprises responding to an instruction corresponding to second audio data that is prior to the speech ending point (Column 2, lines 27-43, "The downstream component may be any number of components or applications that operate on ASR output. Although many such downstream applications are envisioned for these techniques, for purposes of illustration this description will use an NLU process and application as the NLU process illustrates the benefits of early ASR output as described below. For example, the NLU process may take ASR output and determine, for example, the actions (sometimes referred to as an “application response” or “app response”) based on the recognized speech of the early ASR output. The app response based on the early ASR output may not be immediately activated but may be held pending successful comparison of the early ASR output to the final ASR output. If the comparison is successful, then the app response is immediately available for execution (rather than having to wait for NLU processing in a typically staged process), thus improving latency from a user's perspective."; Performing an application response based on the recognized speech of the early automatic speech recognition output reads on responding to an instruction corresponding to second audio data that is prior to the speech ending point.).

Regarding claim 6, Liu discloses the speech recognition method as claimed in claim 1. Liu further discloses: wherein the audio frames comprise a first audio frame and a second audio frame, wherein the first audio frame includes the semantics, wherein the second audio frame is subsequent to the first audio frame in the audio frames (Column 11, line 65 - Column 12, line 9, "Early or final endpoints, or both, may be adjusted in a system according to the disclosure. Thus semantic information in the user's speech may be used to help determine the ending of an utterance, instead of basing it on non-speech audio frames only. The threshold of the number of non-speech frames may be dynamically changed based on the semantic meaning the speech that has been recognized so far. Thus, the ASR module 314 may determine a likelihood that an utterance includes a complete command and use that utterance to adjust the threshold of non-speech frames for determining the end of the utterance."; Semantic information in the user's speech reads on a first audio frame including the semantics, and non-speech audio frames following speech reads on a second audio frame subsequent to the first audio frame.); and wherein obtaining the speech ending point based on the sound categories comprises: obtaining the speech ending point based on the semantics and a first sound category of the second audio frame (Column 11, line 65 - Column 12, line 9, "Early or final endpoints, or both, may be adjusted in a system according to the disclosure. Thus semantic information in the user's speech may be used to help determine the ending of an utterance, instead of basing it on non-speech audio frames only. The threshold of the number of non-speech frames may be dynamically changed based on the semantic meaning the speech that has been recognized so far. Thus, the ASR module 314 may determine a likelihood that an utterance includes a complete command and use that utterance to adjust the threshold of non-speech frames for determining the end of the utterance."; Determining the ending of an utterance using semantic information in speech and non-speech audio frames following speech reads on obtaining the speech ending point based on the semantics and a first sound category of the second audio frame.).

Regarding claim 7, Liu discloses the speech recognition method as claimed in claim 6. Liu further discloses: wherein speech endpoint categories comprise "speaking", "thinking", and "ending" (Column 12, lines 38-46, "The pauses, P1 and P2, may be tagged as “Non-Speech” and be associated with respective durations or numbers of non-speech frames, or time ranges, as a result of the phrase and semantics. Some pauses, such as P2 in FIG. 6A, may represent a tag for the end of a command. The end of a command may be inserted based on whether the ASR model recognizes that the preceding words have formed a complete command."; Column 12, lines 36-60, "During runtime the ASR module 314 in FIG. 3, receives the textual input, compares the input with its available models and as part of ASR processing, outputs tags along with the associated processed speech with each word or pause (i.e. non-speech frames). As described above with respect to FIG. 2, and further illustrated in FIGS. 6 and 7, this semantic information in the user's speech, and represented by the semantic tags output by the ASR, may be used to help determine the end of an utterance. The amount of time after non-speech is detected until the ASR is activated to produce speech results may be dynamically adjusted based on tags in the language model that provide the semantic information appropriate to more accurately determine the end of an utterance for a given user. Based on training data, the language model may be adjusted to reflect the likelihood that the ASR should await more speech to process a complete utterance. For example, if the ASR module 314 is processing the word “songs” and it knows that the word directly following is “songs” is “by” it may be more likely to continue processing in order to get to and apply the tag “ArtistName” to the words “Michael Jackson” as in FIG. 6B. The tag for pause P5 after “Jackson” may signal the end of the phrase/command and prompt generation of the ASR output without waiting for a traditional number of silent frames."; Detecting speech frames reads on a speech endpoint category of "speaking", detecting non-speech frames and determining that the automatic speech recognition should await more speech to process a complete utterance reads on a speech endpoint category of "thinking", and detecting non-speech frames and recognizing that the preceding words have formed a complete command reads on a speech endpoint category of "ending".), and wherein obtaining the speech ending point based on the semantics and the first sound category comprises: determining a first speech endpoint category of the first audio data based on the semantics and the first sound category and obtaining the speech ending point in response to the first speech endpoint category is "ending" (Column 11, line 65 - Column 12, line 9, "Early or final endpoints, or both, may be adjusted in a system according to the disclosure. Thus semantic information in the user's speech may be used to help determine the ending of an utterance, instead of basing it on non-speech audio frames only. The threshold of the number of non-speech frames may be dynamically changed based on the semantic meaning the speech that has been recognized so far. Thus, the ASR module 314 may determine a likelihood that an utterance includes a complete command and use that utterance to adjust the threshold of non-speech frames for determining the end of the utterance."; Determining the ending of an utterance using semantic information in speech and non-speech audio frames following speech reads on determining a first speech endpoint category of the first audio data based on the semantics and the first sound category and obtaining the speech ending point in response to the first speech endpoint category is "ending".).

Regarding claim 8, Liu discloses the speech recognition method as claimed in claim 7. Liu further discloses: wherein determining the first speech endpoint category comprises: processing the semantics and the first sound category using a speech endpoint classification model (Column 3, line 65 - Column 4, line 2, "The amount of time after non-speech is detected until the ASR terminates its processing may be dynamically adjusted based on tags in an ASR model (described below) that provide the semantic information appropriate to more accurately determine the end of an utterance for a given user."; An automatic speech recognition model that provides the semantic information appropriate to more accurately determine the end of an utterance reads on a speech endpoint classification model.) to obtain the first speech endpoint category, wherein the speech endpoint classification model is based on a speech sample and an endpoint category label of the speech sample, wherein a first format of the speech sample corresponds to a second format of the semantics and the first sound category, and wherein an endpoint category in the endpoint category label corresponds to the first speech endpoint category (Column 12, lines 36-60, "During runtime the ASR module 314 in FIG. 3, receives the textual input, compares the input with its available models and as part of ASR processing, outputs tags along with the associated processed speech with each word or pause (i.e. non-speech frames). As described above with respect to FIG. 2, and further illustrated in FIGS. 6 and 7, this semantic information in the user's speech, and represented by the semantic tags output by the ASR, may be used to help determine the end of an utterance. The amount of time after non-speech is detected until the ASR is activated to produce speech results may be dynamically adjusted based on tags in the language model that provide the semantic information appropriate to more accurately determine the end of an utterance for a given user. Based on training data, the language model may be adjusted to reflect the likelihood that the ASR should await more speech to process a complete utterance. For example, if the ASR module 314 is processing the word “songs” and it knows that the word directly following is “songs” is “by” it may be more likely to continue processing in order to get to and apply the tag “ArtistName” to the words “Michael Jackson” as in FIG. 6B. The tag for pause P5 after “Jackson” may signal the end of the phrase/command and prompt generation of the ASR output without waiting for a traditional number of silent frames."; Detecting speech and using semantic tags to determine that processing of an utterance should continue reads on obtaining the first speech endpoint category, wherein the speech endpoint classification model is based on a speech sample and an endpoint category label of the speech sample, wherein a first format of the speech sample corresponds to a second format of the semantics and the first sound category, and wherein an endpoint category in the endpoint category label corresponds to the first speech endpoint category.).

Regarding claim 9, arguments analogous to claim 1 are applicable. In addition, Liu discloses a speech recognition apparatus (Column 2, lines 54-57, “As illustrated, a speech recognition process 100 may be implemented on a client or local device 102 such as a smart phone or other local device”), comprising: an obtainer configured to obtain first audio data comprising a plurality of audio frames (Column 8, lines 4-6, "The ASR module 314 may include an acoustic front end (AFE), not shown. The AFE transforms audio data into data for processing by the speech recognition engine."; Column 8, lines 11-15, "The AFE may reduce noise in the audio data and divide the digitized audio data into frames representing time intervals for which the AFE determines a set of values, called a feature vector, representing the features/qualities of the utterance portion within the frame."); and a processor (Column 5, lines 23-26, “As discussed above, any or all of the modules may be embodied in one or more general-purpose microprocessors, or in one or more special-purpose digital signal processors or other dedicated microprocessing hardware.”) configured to perform the steps of claim 1.

Regarding claim 10, arguments analogous to claim 2 are applicable.

Regarding claim 14, arguments analogous to claim 6 are applicable.

Regarding claim 15, arguments analogous to claim 7 are applicable.

Regarding claim 16, arguments analogous to claim 8 are applicable.

Regarding claim 17, arguments analogous to claim 1 are applicable. In addition, Liu discloses a computer program product comprising computer-executable instructions that are stored on a computer-readable storage media (Column 5, lines 47-51, “The memory can further include computer program instructions that an application processing module and/or processing unit in the device 102 executes in order to implement one or more embodiments of a speech recognition system with distributed endpointing according to the disclosure.”) and that, when executed by a processor, cause a speech recognition apparatus (Column 2, lines 54-57, “As illustrated, a speech recognition process 100 may be implemented on a client or local device 102 such as a smart phone or other local device”) to perform the steps of claim 1.

Regarding claim 18, arguments analogous to claim 2 are applicable.

Regarding claim 21, arguments analogous to claim 6 are applicable.

Regarding claim 22, arguments analogous to claim 7 are applicable.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4 – 5, 12 – 13 and 19 – 20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Rajan et al. (US Patent Application Publication No. 2002/0198704), hereinafter Rajan.

Regarding claim 4, Liu discloses the speech recognition method as claimed in claim 1, but does not specifically disclose: wherein the sound categories comprise "speech", "neutral", and "silence", wherein the preset energy thresholds comprise a first energy threshold and a second energy threshold, wherein the first energy threshold is greater than the second energy threshold, wherein a first sound category of the sound categories and of a first audio frame in the audio frames and with a first energy that is greater than or equal to the first energy threshold is "speech", wherein a second sound category of the sound categories and of a second audio frame in the audio frames and with a second energy that is less than the first energy threshold and is greater than the second energy threshold is "neutral", and wherein a third sound category of a third audio frame in the audio frames and with a third energy that is less than or equal to the second energy threshold is "silence".

Rajan teaches: wherein the sound categories comprise "speech", "neutral", and "silence", wherein the preset energy thresholds comprise a first energy threshold and a second energy threshold, wherein the first energy threshold is greater than the second energy threshold, wherein a first sound category of the sound categories and of a first audio frame in the audio frames and with a first energy that is greater than or equal to the first energy threshold is "speech", wherein a second sound category of the sound categories and of a second audio frame in the audio frames and with a second energy that is less than the first energy threshold and is greater than the second energy threshold is "neutral", and wherein a third sound category of a third audio frame in the audio frames and with a third energy that is less than or equal to the second energy threshold is "silence" (Paragraph 0038, lines 1-9, "In this embodiment, two threshold values are actually determined and stored within the threshold memory 39--a coarse threshold value which is used to indicate the start of the signal which is clearly not background noise and a fine threshold value which is used to determine the start point of speech more accurately. In this embodiment, the fine threshold value is the 0.01 percentile energy value discussed above and the coarse threshold value is the 0.05 percentile level."; Paragraph 0039, lines 9-21, "The speech/noise decision unit 38 then compares the energy values calculated for each block of samples (as determined by the block energy determining unit 35) with the threshold energy levels stored in the threshold memory 39. If the residual energy value for the current block being processed is below the thresholds, then the decision unit 38 decides that the corresponding audio corresponds to background noise. However, once the speech/noise decision unit 38 determines that there are a number of consecutive blocks (e.g. five consecutive blocks) whose residual energy values exceed the coarse threshold, then the decision unit 38 determines that the corresponding audio is speech."; Paragraph 0046, lines 1-12, "In the above embodiment, the speech/noise decision unit used two threshold values in determining whether or not the incoming audio was speech or noise. As those skilled in the art will appreciate, other decision strategies may be used. For example, the decision unit may decide that the input audio corresponds to speech as soon as a predetermined threshold value has been exceeded, however, such an embodiment is not preferred because it is susceptible to false detection of speech due to spurious short sounds or noises. Similarly, when detecting the end of speech, both the fine threshold and the coarse threshold could be used rather than just the fine threshold."; A fine threshold value reads on a first energy threshold, a coarse threshold value reads on a second energy threshold, determining that audio corresponds to background noise when the residual energy value for the block being processed is below the thresholds reads on the sound category "silence", determining the start point of speech with the fine threshold value reads on the sound category "speech", and indicating the start of the signal which is clearly not background noise with the coarse threshold value reads on the sound category "neutral".).

Rajan is considered to be analogous to the claimed invention because it is in the same field of speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Rajan to determine that audio corresponds to background noise when the residual energy value for the block being processed is below a coarse threshold value, indicating the start of the signal which is clearly not background noise with the coarse threshold value, and determine the start point of speech with the fine threshold value. Doing so would allow for detecting the boundary between speech and noise (Rajan; Paragraph 0006, lines 1-14).

Regarding claim 5, Liu in view of Rajan discloses the speech recognition method as claimed in claim 4. Rajan further teaches: further comprising determining the first energy threshold and the second energy threshold based on a second energy of background sound of the first audio data (Paragraph 0036, lines 1-15, "In this embodiment, one second of background noise is used in the training algorithm which, with the 16 kHz sampling rate, means that approximately 16,000 background noise samples are processed in the maximum likelihood analysis unit 31. Further, in this embodiment, the block energy determining unit 35 divides the residual error values determined for these samples into non-overlapping blocks of approximately eighty samples. Therefore, the block energy determining unit determines approximately 200 energy values for the training background noise. During the training routine, the energy values determined by the block energy determining unit 35 are passed via the switch 36 to a histogram analysis unit 37 which analyses the energy values to determine appropriate threshold values for use in detecting speech."; Determining appropriate threshold values for use in detecting speech based on the energy values of background noise samples reads on determining the first energy threshold and the second energy threshold based on a second energy of background sound of the first audio data.).

Rajan is considered to be analogous to the claimed invention because it is in the same field of speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu in view of Rajan to further incorporate the teachings of Rajan to determine appropriate threshold values for use in detecting speech based on the energy values of background noise samples. Doing so would allow for detecting the boundary between speech and noise (Rajan; Paragraph 0006, lines 1-14).

Regarding claim 12, arguments analogous to claim 4 are applicable.

Regarding claim 13, arguments analogous to claim 5 are applicable.

Regarding claim 19, arguments analogous to claim 4 are applicable.

Regarding claim 20, arguments analogous to claim 5 are applicable.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

Maas et al. (US Patent No. 12,211,517)
Doshi et al. (US Patent No. 10,943,606)
Wein (US Patent No. 9,984,706)
Braho et al. (US Patent No. 9,076,459)
Hetherington et al. (US Patent No. 8,554,564)
Zhao et al. (US Patent No. 6,480,823)
Chan et al. (US Patent Application Publication No. 2006/0200346)

Any inquiry concerning this communication or earlier communications from the examiner should be directed to James Boggs whose telephone number is (571) 272-2968. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Daniel Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JAMES BOGGS/
Examiner, Art Unit 2657
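The §103 discussion above (claim 4, Liu in view of Rajan) turns on classifying frames against two energy thresholds and deriving those thresholds from a background-noise sample. A minimal sketch of that scheme; the frame size, percentile choices, and function names are illustrative assumptions, not values taken from either reference:

```python
import numpy as np

FRAME = 160  # 10 ms at 16 kHz (assumed frame size)

def frame_energies(samples: np.ndarray, frame: int = FRAME) -> np.ndarray:
    """Mean energy of each non-overlapping frame."""
    n = len(samples) // frame
    return (samples[: n * frame].reshape(n, frame) ** 2).mean(axis=1)

def thresholds_from_background(noise: np.ndarray) -> tuple[float, float]:
    """Derive a higher ("fine") and lower ("coarse") threshold from the
    energy distribution of background noise, in the spirit of Rajan's
    training step; the percentiles here are arbitrary placeholders."""
    e = frame_energies(noise)
    return float(np.percentile(e, 99)), float(np.percentile(e, 95))

def label_frames(audio: np.ndarray, first: float, second: float) -> list[str]:
    """Map each frame to the three sound categories recited in claim 4."""
    labels = []
    for e in frame_energies(audio):
        if e >= first:
            labels.append("speech")    # at or above the first threshold
        elif e > second:
            labels.append("neutral")   # strictly between the two thresholds
        else:
            labels.append("silence")   # at or below the second threshold
    return labels
```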
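Likewise, the §102 mapping rests on Liu-style semantic endpointing, in which the run of trailing non-speech frames needed to declare an "ending" shrinks once the recognized text already looks like a complete command. A sketch under the same caveat: the completeness heuristic and frame counts below are invented placeholders, not parameters from Liu.

```python
BASE_TRAILING_SILENCE = 50   # frames to wait after an incomplete phrase (assumed)
SHORT_TRAILING_SILENCE = 15  # frames to wait after a complete command (assumed)

def looks_complete(tags: list[str]) -> bool:
    # Placeholder for a real semantic model: treat the phrase as complete
    # unless the last tag suggests more speech is coming (e.g. a dangling "by").
    return bool(tags) and tags[-1] not in {"Preposition", "Conjunction"}

def endpoint(frames: list[tuple[bool, str]]) -> int | None:
    """frames: (is_speech, semantic_tag) per frame. Returns the index of the
    speech ending point, or None if no endpoint was reached."""
    silence_run, tags = 0, []
    for i, (is_speech, tag) in enumerate(frames):
        if is_speech:
            silence_run = 0   # speech resets the trailing-silence counter
            tags.append(tag)
        else:
            silence_run += 1
            # Dynamically shrink the required silence once the phrase is complete.
            needed = (SHORT_TRAILING_SILENCE if looks_complete(tags)
                      else BASE_TRAILING_SILENCE)
            if silence_run >= needed:
                return i      # the "ending" endpoint category is reached
    return None
```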

Prosecution Timeline

May 24, 2024: Application Filed
Jul 10, 2024: Response after Non-Final Action
Feb 25, 2026: Non-Final Rejection — §101, §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586600: Streaming Vocoder (granted Mar 24, 2026; 2y 5m to grant)
Patent 12573406: Voice Authentication Based on Acoustic and Linguistic Machine Learning Models (granted Mar 10, 2026; 2y 5m to grant)
Patent 12572752: Dynamic Content Generation Method (granted Mar 10, 2026; 2y 5m to grant)
Patent 12562170: Biometric Authentication Device, Biometric Authentication Method, and Recording Medium (granted Feb 24, 2026; 2y 5m to grant)
Patent 12554931: Method and System of Improving Communication Skills for High Client Conversation Rate (granted Feb 17, 2026; 2y 5m to grant)

Study what changed to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 60%
With Interview: 99% (+38.8%)
Median Time to Grant: 3y 3m
PTA Risk: Low

Based on 107 resolved cases by this examiner. Grant probability derived from career allow rate.
