Last updated: May 29, 2026
Application No. 18/526,712
METHODS AND APPARATUSES FOR SPEECH ENHANCEMENT

Final Rejection §101 §102 §103
Filed
Dec 01, 2023
Examiner
HUTCHESON, CODY DOUGLAS
Art Unit
2659
Tech Center
2600 — Communications
Assignee
Comcast Cable Communications LLC
OA Round
2 (Final)
This examiner grants 63% of cases after interview (+51.7% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 27 resolved cases, 2023–2026
Examiner Intelligence

HUTCHESON, CODY DOUGLAS
Grants 63% of resolved cases
Career Allowance Rate
17 granted / 27 resolved
+1.0% vs TC avg
Strong +52% interview lift
Typical timeline
2y 8m
Avg Prosecution
21 currently pending
Statute-Specific Performance

§101: 11.4% (-28.6% vs TC avg)
§103: 82.9% (+42.9% vs TC avg)
§102: 3.8% (-36.2% vs TC avg)
§112: 1.9% (-38.1% vs TC avg)
Based on career data from 27 resolved cases
Office Action

§101 §102 §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
	1. Regarding the rejection of claim 8 under 35 U.S.C. § 112, Applicant has amended claim 8 to depend on claim 5, providing sufficient antecedent basis. Accordingly, the rejection under 35 U.S.C. § 112 is withdrawn.

2. Regarding the rejection of claims 1-11 under 35 U.S.C. § 101, Applicant's arguments filed 12/10/2025 have been fully considered but they are not persuasive. 

Applicant first argues that the claims do not recite abstract ideas under Step 2A Prong 1. Specifically, Applicant argues that the claims do not recite mental processes (pgs. 11-12 of Remarks) and that the claims do not recite mathematical concepts (pgs. 12-13 of Remarks). The Examiner respectfully disagrees with both of these arguments. The claims as currently written recite both mental processes and mathematical concepts. The claims recite limitations which can be performed mentally with the aid of pen and paper. A person can listen to audio signals and determine how likely it is that what they hear contains speech. Furthermore, the claims do not merely involve mathematical concepts (as argued at pgs. 12-13), but instead recite mathematical calculations under the broadest reasonable interpretation of the claims. The limitations of “determining” a set of TF samples, “determining” TF losses to be applied to TF samples based on a speech probability estimate, and “generating” an output signal by applying TF losses to TF samples all amount to recitations of mathematical calculations (e.g., time-frequency domain conversion calculations, TF loss computations, and frequency bin loss adjustment calculations). Therefore, the claims recite abstract ideas under Step 2A Prong 1.
Applicant further argues that the claims integrate the judicial exception into a practical application under Step 2A Prong 2 (see pgs. 13-16). Specifically, Applicant argues that the claims have been incorrectly analyzed under Step 2A Prong 2 and are instead subject matter eligible, arguing that the claims are directed to improvements to a technology and are directed to a specific technological process for digitally processing an input audio signal involving a specific signal-processing architecture (pg. 14, 2nd and 3rd para. and pg. 15, 3rd para.). The Examiner respectfully disagrees with these arguments. Under Step 2A Prong 2 analysis, the claims as a whole do not integrate the judicial exception into a practical application. The only additional element present in claim 1 that does not fall under a mathematical concept or mental process is “a computing device”, which amounts to merely applying the recited abstract ideas using a generic computer. The claims do not recite any additional structure besides this generic “computing device”. When this additional limitation is analyzed along with the claims as a whole, the claimed invention is not integrated into a practical application via an improvement to technology. The claims do not recite any limitations which impose any meaningful limits on practicing the abstract ideas or that tie the abstract ideas to a technical improvement. Further, the claims do not recite any technical components/modules which carry out the claimed method, and thus do not recite a specific signal-processing architecture as argued by Applicant. Thus, even when viewed in combination with the claims as a whole, the additional limitations do not integrate the judicial exception into a practical application under Step 2A Prong 2.
Hence, Applicant’s arguments are not persuasive.

3. Regarding the rejection of claims 1-4, 9, and 10 under 35 U.S.C. § 102, Applicant's arguments filed 12/10/2025 have been fully considered but they are not persuasive. 

Applicant argues that the cited prior art does not teach or suggest each and every element of the claimed invention. Specifically, Applicant argues that the Thyssen reference does not specifically disclose the limitation of “determining, based on the set of TF samples, a speech probability estimate that speech is present for each TF sample of the set of TF samples” (see pgs. 16-18). The Examiner respectfully disagrees with this argument. The Thyssen reference reads on the BRI of the claimed limitation. Thyssen teaches GMM modeling for computing, for each frame and/or frequency bin, a probability that the frame or frequency bin contains audio from a desired source of speech (para. 0115 “By using GMM modeling, a probability 307 that a particular frame of first signal 340 is from a desired source (e.g., speech) and/or a probability that the particular frame of first signal 340 is from a non-desired source (e.g., an interfering source, such as stationary background noise) may be determined for each frame and/or frequency bin.”). This statistical modeling is performed on feature vectors on a per-frequency bin basis (para. 0115 “SSNR feature statistical modeling component 310 may be configured to model feature vector 305 on a per-frame basis and/or per-frequency bin basis.”). Performing a per-frequency bin statistical modeling operation to determine a per-frequency bin probability value reflecting how likely that bin is to belong to a desired speech source reads on the BRI of “determining, based on the set of TF samples, a speech probability estimate that speech is present for each TF sample of the set of TF samples”.
Hence, Applicant’s arguments are not persuasive.

4. Regarding the rejections under 35 U.S.C. § 103, Applicant's arguments filed 12/10/2025 have been fully considered but they are not persuasive for reasons analogous to those discussed above.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

5. Claims 1-11 and 30-51 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 

Regarding claims 1, 30, and 41, “A method”, “An apparatus”, and “One or more non-transitory computer-readable media” are recited, each of which is directed to one of the four statutory categories of invention (process, machine, and article of manufacture) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts which fall into the category of abstract idea (Step 2A Prong 1: YES).
	The following limitations recited in claim 1, and the analogous limitations recited in independent claims 30 and 41, under their broadest reasonable interpretation, recite mental processes or mathematical concepts:
receiving, …, an input signal comprising speech and non-speech: a person listens to audio comprising speech and non-speech sounds
determining, based on the input signal, a set of time-frequency (TF) samples of the input signal: determining time-frequency (TF) samples of the input signal is a mathematical concept
determining, based on the set of TF samples, a speech probability estimate that speech is present for each TF sample of the set of TF samples: a person analyzes the TF samples and determines a probability that there is speech in the sample
determining, based on the speech probability estimate for each TF sample of the set of TF samples, one or more TF losses to be applied to one or more TF samples of the set of TF samples: determining losses to apply amounts to a mathematical concept.
and generating, based on the one or more TF losses applied to the one or more TF samples, an output signal, wherein the output signal comprises less non-speech than the input signal: applying the losses to the samples to generate an output signal amounts to a mathematical concept.
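For orientation, the sequence of steps listed above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the applicant's implementation and not Thyssen's. The function names, the frame and hop sizes, the logistic probability rule, the noise estimate taken from the first frames, and the 20 dB maximum loss are all hypothetical choices made for illustration.

```python
import numpy as np

FRAME, HOP = 256, 128      # hypothetical frame length and hop (50% overlap)
PAD = FRAME - HOP          # edge padding so every input sample sits in two frames
WINDOW = np.hanning(FRAME)

def stft(x):
    """TF samples: one complex value per (frame, frequency bin)."""
    x = np.pad(x, PAD)
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Weighted overlap-add inverse of stft, with the edge padding trimmed."""
    frames = np.fft.irfft(spec, n=FRAME, axis=1) * WINDOW
    n = HOP * (spec.shape[0] - 1) + FRAME
    out, norm = np.zeros(n), np.zeros(n)
    for i, f in enumerate(frames):
        out[i * HOP:i * HOP + FRAME] += f
        norm[i * HOP:i * HOP + FRAME] += WINDOW ** 2
    return (out / np.maximum(norm, 1e-8))[PAD:n - PAD]

def speech_probability(spec, noise_frames=5):
    """Assumed rule: logistic function of per-bin SNR against an initial noise estimate."""
    power = np.abs(spec) ** 2
    noise_psd = power[:noise_frames].mean(axis=0) + 1e-12
    snr_db = 10.0 * np.log10(power / noise_psd + 1e-12)
    return 1.0 / (1.0 + np.exp(-(snr_db - 6.0) / 2.0))

def tf_losses_db(prob, max_loss_db=20.0):
    """Assumed rule: more loss where speech is less likely."""
    return max_loss_db * (1.0 - prob)

def enhance(x):
    spec = stft(x)                        # set of TF samples
    prob = speech_probability(spec)       # speech probability per TF sample
    loss_db = tf_losses_db(prob)          # TF loss per TF sample
    gain = 10.0 ** (-loss_db / 20.0)      # a loss in dB becomes a linear gain <= 1
    return istft(spec * gain)             # output with less non-speech energy
```

Calling `enhance(x)` on a one-dimensional array whose length is a multiple of `HOP` returns an array of the same length. The sketch assumes the first few frames contain only noise, which is a simplification chosen to keep the example short.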

Claims 1, 30, and 41 do not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). The only additional limitations are “…by a computing device” (claim 1), “An apparatus comprising: one or more processors; and a memory storing processor-executable instructions that, when executed by the one or more processors, cause the apparatus to” (claim 30), and “One or more non-transitory computer-readable media storing processor-executable instructions that, when executed by at least one processor, cause the at least one processor to” (claim 41), which amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, the mere instructions to implement the judicial exception using a generic computer do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, claims 1, 30, and 41 are directed to an abstract idea (Step 2A: YES).
Claims 1, 30, and 41 do not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the only additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, the mere instructions to implement the judicial exception using a generic computer do not amount to significantly more than the judicial exception as they do not provide an inventive concept. Therefore, claims 1, 30, and 41 are not patent eligible.

Regarding dependent claims 2-11, 31-40, and 42-51, “The method”, “The apparatus”, and “The one or more non-transitory computer-readable media” are recited, each of which is directed to one of the four statutory categories of invention (process, machine, and article of manufacture) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite further mental processes or mathematical concepts which fall into the category of abstract idea (Step 2A Prong 1: YES).
	The following limitations recited in claims 2-11, and the analogous limitations recited in claims 31-40 and 42-51, under their broadest reasonable interpretation, recite mental processes or mathematical concepts:

Claims 2, 31, and 42:
wherein the speech probability estimate for the each TF sample of the set of TF samples is indicative of the speech being present in the each TF sample: a person determines the estimate by determining how likely speech is present in each sample.
Claims 2, 31, and 42 contain no additional elements.

Claims 3, 32, and 43:
wherein the non-speech comprises stationary noise and non-stationary noise: a person listens to audio which has stationary noise (e.g. white noise), and non-stationary noise (e.g. wind)
Claims 3, 32, and 43 contain no additional elements.

Claims 4, 33, and 44:
wherein the speech probability estimate further distinguishes the speech from the stationary noise and the non-stationary noise: a person determines the estimate to decide which samples contain speech and which samples contain noise.
Claims 4, 33, and 44 contain no additional elements.

Claims 5, 34, and 45:
generating a labelled data set that comprises one or more input features and one or more indications indicative of the speech being present: a person writes down a data set of features and indications of speech being present.
providing the labelled data set…to determine the speech probability estimate: a person uses the data to learn how to determine the estimate.
Claims 5, 34, and 45 contain the additional limitation “to a machine learning model, wherein the machine learning model is configured to…”, which amounts to mere instructions to implement the judicial exception using a generic computer.

Claims 6, 35, and 46:
receiving…a set of speech samples; applying, based on each of the set of speech samples, a speech weight to the each of the set of speech samples; receiving,…a set of non-speech samples; applying, based on each of the set of non-speech sample, a non-speech weight to the each of the set of non-speech samples; generating, by combining the speech weighted set of speech samples and the non-speech weighted set of non-speech samples, a speech augmented set; and extracting one or more input features from the speech augmented set: generating a combined speech weighted set by combining the speech weighted set and non-speech weighted set, and extracting input features from the speech augmented set amounts to mathematical concepts.
Claims 6, 35, and 46 contain the additional limitation “by the computing device”, which amounts to mere instructions to implement the judicial exception using a generic computer.

Claims 7, 36, and 47:
wherein the one or more extracted input features comprise Mel Frequency Cepstrum Coefficients (MFCCs), Phonemes, Senones, and Mel Spectrogram: a person can write down phonemes they hear. Determining MFCCs, Senones, and a Mel Spectrogram amounts to mathematical concepts.
Claims 7, 36, and 47 contain no additional elements.

Claims 8, 37, and 48:
receiving…a set of speech samples; applying, based on each of the set of speech samples, a speech weight to the each of the set of speech samples; receiving,…a set of non-speech samples; applying, based on each of the set of non-speech sample, a non-speech weight to the each of the set of non-speech samples; determining, based on the speech weighted set of speech samples and the non-speech weighted set of non-speech sample, the one or more indications indicative of the speech present: weighting a set of speech samples and a set of non-speech samples and using this to determine the one or more indications amounts to a mathematical concept.
Claims 8, 37, and 48 contain the additional limitation “by the computing device”, which amounts to mere instructions to implement the judicial exception using a generic computer.

Claims 9, 38, and 49:
determining, based on at least one of a priori signal to noise (SNR) ratio, the speech probability estimate, or a posteriori SNR, the one or more TF losses to be applied to the each of the set of TF samples: determining a priori/a posteriori SNRs to determine TF losses amounts to a mathematical concept.
Claims 9, 38, and 49 contain no additional elements.

Claims 10, 39, and 50:
wherein each of the set of time-frequency samples comprises a frequency bin narrowly filtered based on a frequency domain: determining samples comprising a frequency bin narrowly filtered amounts to a mathematical concept.
Claims 10, 39, and 50 contain no additional elements.

Claims 11, 40, and 51:
wherein the input signal comprises one or more pulse code modulation (PCM) signals: using PCM signals as the input signal amounts to a mathematical concept.
Claims 11, 40, and 51 contain no additional elements.

	Claims 2-11, 31-40, and 42-51 do not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). As discussed above, the only additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, the mere instructions to implement the judicial exception using a generic computer do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, claims 2-11, 31-40, and 42-51 are directed to an abstract idea (Step 2A: YES).
	Claims 2-11, 31-40, and 42-51 do not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the only additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, the mere instructions to implement the judicial exception using a generic computer do not amount to significantly more than the judicial exception as they do not provide an inventive concept. Therefore, claims 2-11, 31-40, and 42-51 are not patent eligible.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


6. Claims 1-4, 9-10, 30-33, 38-39, 41-44, and 49-50 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Thyssen & Borgstrom (US 2015/0071461 A1, hereinafter Thyssen).

Regarding claim 1, Thyssen discloses receiving, by a computing device (Fig. 9), an input signal comprising speech and non-speech (para. 0194 “As shown in FIG. 4, the method of flowchart 400 begins at step 402, where an audio signal is received that comprises at least a desired source component and at least one interfering source type. For example, with reference to FIG. 3C, back-end SCS component receives first signal 340.”); determining, based on the input signal, a set of time-frequency (TF) samples of the input signal (para. 0069 “In the case of frequency-dependent feature vectors, the notation x_{n,m}(k) represents the k^{th} element of a feature vector corresponding to time index n and frequency channel m.”; para. 0121 “For example, as shown in FIG. 3E, plot 347 represents a time domain input waveform representing first signal 340 (which includes both speech and car noise), plot 349 represents a time-frequency plot of first signal 340”); determining, based on the set of TF samples, a speech probability estimate that speech is present for each TF sample of the set of TF samples (para. 0115 “By using GMM modeling, a probability 307 that a particular frame of first signal 340 is from a desired source (e.g., speech) and/or a probability that the particular frame of first signal 340 is from a non-desired source (e.g., an interfering source, such as stationary background noise) may be determined for each frame and/or frequency bin.”); determining, based on the speech probability estimate for each TF sample of the set of TF samples, one or more TF losses to be applied to one or more TF samples of the set of TF samples (para. 0122 “Probability 307 is provided to multi-noise source gain component 332. As will be described below, probability 307 may be used to determine optimal gain 325, which is used to suppress stationary noise (and/or other types of interfering sources) present in first signal 340 on a per-frame basis and/or per-frequency bin basis.”; para. 0188 “In accordance with an embodiment where multi-noise source gain component 332 is configured to determine optimal gain 325 on a per-frequency bin basis, multi-noise source gain component 332 provides a respective optimal gain value for each frequency bin.”); generating, based on the one or more TF losses applied to the one or more TF samples, an output signal, wherein the output signal comprises less non-speech than the input signal (para. 0189 “Gain application component 346 may be configured to suppress noise (e.g., stationary noise, non-stationary noise and/or residual echo) present in first signal 340 by applying optimal gain 325 to provide noise-suppressed signal 344.”).

Regarding claim 2, Thyssen discloses wherein the speech probability estimate for the each TF sample of the set of TF samples is indicative of the speech being present in the each TF sample (para. 0115 “By using GMM modeling, a probability 307 that a particular frame of first signal 340 is from a desired source (e.g., speech) and/or a probability that the particular frame of first signal 340 is from a non-desired source (e.g., an interfering source, such as stationary background noise) may be determined for each frame and/or frequency bin.”). 
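The kind of per-bin probability that the quoted passage attributes to GMM modeling can be illustrated with a two-component, one-dimensional mixture evaluated by Bayes' rule. This is only a sketch: the component means, the shared standard deviation, the prior, and the function names are hypothetical and are not taken from Thyssen, which the record quotes only at the level of "GMM modeling".

```python
import numpy as np

def log_gaussian(x, mean, std):
    """Log of a 1-D Gaussian density, evaluated element-wise."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

def speech_posterior(feature, speech=(15.0, 5.0), noise=(0.0, 5.0), prior_speech=0.5):
    """P(desired source | feature) for every frame and frequency bin.

    `feature` is an array of per-bin SNR-like features in dB; `speech` and
    `noise` are hypothetical (mean, std) pairs for the two mixture components.
    """
    log_s = np.log(prior_speech) + log_gaussian(feature, *speech)
    log_n = np.log(1.0 - prior_speech) + log_gaussian(feature, *noise)
    # Bayes' rule in the log domain: p_s / (p_s + p_n)
    return 1.0 / (1.0 + np.exp(log_n - log_s))
```

With these assumed parameters a bin whose feature is near 0 dB receives a probability close to 0, a bin near 20 dB receives a probability close to 1, and the midpoint between the two component means receives 0.5.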

Regarding claim 3, Thyssen discloses wherein the non-speech comprises stationary noise and non-stationary noise (para. 0108 “Back-end SCS component 300 is configured to suppress multiple types of interfering sources (e.g., stationary noise, non-stationary noise, residual echo, etc.) present in a first signal 340.”).

Regarding claim 4, Thyssen discloses wherein the speech probability estimate further distinguishes the speech from the stationary noise and the non-stationary noise (para. 0121 “For example, as shown in FIG. 3E, plot 347 represents a time domain input waveform representing first signal 340 (which includes both speech and car noise), plot 349 represents a time-frequency plot of first signal 340”; para. 0204 “As shown in FIG. 5, the method of flowchart 500 begins at step 502, where one or more first characteristics associated with a first type of interfering source in an audio signal are determined. In accordance with an embodiment, the first type of interfering source is stationary noise. In accordance with such an embodiment, the first characteristic(s) include an SNR regarding the stationary noise with respect to the audio signal and a first measure of probability indicative of a probability that the audio signal is from a desired source with respect to the stationary noise.”; para. 0206 “At step 504, one or more second characteristics associated with a second type of interfering source in an audio signal are determined. In accordance with an embodiment, the second type of interfering source is non-stationary noise. In accordance with such an embodiment, the second characteristic(s) include an SNR regarding the non-stationary noise with respect to the audio signal and a second measure of probability indicative of a probability that the audio signal is from a desired source with respect to the non-stationary noise.”).

Regarding claim 9, Thyssen discloses determining, based on at least one of a priori signal to noise (SNR) ratio, the speech probability estimate, or a posteriori SNR, the one or more TF losses to be applied to the each of the set of TF samples (para. 0113 “SSNR feature extraction component 308 may be configured to extract one or more SNR feature(s) from first signal 340 based on stationary noise estimate 301 on a per-frame basis and/or per-frequency bin basis to obtain an SNR feature vector 305. …In accordance with another embodiment, the estimate of the SNR feature(s) is equivalent to the a priori SNR that is estimated simply as the posteriori SNR minus one (assuming statistical independence between interfering and desired sources).”; para. 0115 “SSNR feature statistical modeling component 310 may be configured to model feature vector 305 on a per-frame basis and/or per-frequency bin basis. In accordance with an embodiment, SSNR feature statistical modeling component 310 models SNR feature vector 305 using GMM modeling. By using GMM modeling, a probability 307 that a particular frame of first signal 340 is from a desired source (e.g., speech) and/or a probability that the particular frame of first signal 340 is from a non-desired source (e.g., an interfering source, such as stationary background noise) may be determined for each frame and/or frequency bin.”; para. 0122 “Probability 307 is provided to multi-noise source gain component 332. As will be described below, probability 307 may be used to determine optimal gain 325, which is used to suppress stationary noise (and/or other types of interfering sources) present in first signal 340 on a per-frame basis and/or per-frequency bin basis.”).
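The relationship quoted from para. 0113, in which the a priori SNR is estimated as the a posteriori SNR minus one, can be written out as a short Python sketch. Only that relationship comes from the quoted text. The Wiener-style gain rule, the gain floor, the way the speech probability blends the two, and all names are assumptions made for illustration and are not asserted to be Thyssen's "optimal gain 325".

```python
import numpy as np

def snr_based_gain(noisy_power, noise_power, speech_prob, floor=0.1):
    """Per-bin gain from a posteriori SNR, a priori SNR, and speech probability."""
    # a posteriori SNR: observed power over estimated noise power
    post_snr = noisy_power / np.maximum(noise_power, 1e-12)
    # a priori SNR estimated as a posteriori SNR minus one, clipped at zero
    prior_snr = np.maximum(post_snr - 1.0, 0.0)
    # assumed Wiener-style gain rule (not taken from the reference)
    wiener = prior_snr / (1.0 + prior_snr)
    # assumed blend: trust the Wiener gain where speech is likely, else the floor
    return speech_prob * wiener + (1.0 - speech_prob) * floor

def gain_to_loss_db(gain):
    """Express a linear gain no greater than 1 as a non-negative loss in dB."""
    return -20.0 * np.log10(gain)
```

For example, a bin with observed power 11, noise power 1, and speech probability 1 has an a priori SNR of 10 and a gain of 10/11, which corresponds to a loss of under 1 dB.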

Regarding claim 10, Thyssen discloses wherein each of the set of TF samples comprises a frequency bin narrowly filtered based on a frequency domain (para. 0120 “As an example, only a single feature is used (per frequency bin in the frequency domain), with a mild smoothing.”; para. 0115 “By using GMM modeling, a probability 307 that a particular frame of first signal 340 is from a desired source (e.g., speech) and/or a probability that the particular frame of first signal 340 is from a non-desired source (e.g., an interfering source, such as stationary background noise) may be determined for each frame and/or frequency bin.”).

Regarding claim 30, claim 30 is an apparatus claim with limitations similar to those in method claim 1, and is thus rejected under similar rationale.
Additionally, Thyssen discloses An apparatus (Fig. 9, 900) comprising: one or more processors (Fig. 9, ‘CPU’ 902); and a memory (Fig. 9, 906) storing processor-executable instructions that, when executed by the one or more processors, cause the apparatus to (para. 0237 “CPU 902 further includes a program sequencer 916, a program memory (PM) data address generator 918 and a data memory (DM) data address generator 920. Program sequencer 916 may be configured to manage program structure and program flow by generating an address of an instruction to be fetched from program memory 906. Program sequencer 916 may also be configured to fetch instruction(s) from instruction cache 922, which may store an N number of recently-executed instructions, where N is a positive integer.”; para. 0235 “FIG. 9 depicts a block diagram of a processor circuit 900 in which … any methods, algorithms, and functions described herein, may be implemented.”).

Regarding claims 31, 32, 33, 38, and 39, these claims are rejected for analogous reasons to claims 2, 3, 4, 9, and 10, respectively.

Regarding claim 41, claim 41 is a non-transitory computer-readable media claim with limitations similar to those in method claim 1, and is thus rejected under similar rationale.
Additionally, Thyssen discloses One or more non-transitory computer-readable media storing processor-executable instructions that, when executed by at least one processor, cause the at least one processor to (para. 0238 “Such computer-readable storage media may, for example, store computer program logic, e.g., program modules, comprising computer executable instructions that, when executed by one or more processor circuits, provide and/or maintain one or more aspects of functionality described herein with reference to the figures, as well as any and all components, steps and functions therein and/or further embodiments described herein.”).

Regarding claims 42, 43, 44, 49, and 50, these claims are rejected for analogous reasons to claims 2, 3, 4, 9, and 10, respectively.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


7. Claims 5-6, 8, 34-35, 37, 45-46, and 48 are rejected under 35 U.S.C. 103 as being unpatentable over Thyssen in view of Sivaraman et al. (US 2022/0084509 A1, hereinafter Sivaraman).

Regarding claim 5, Thyssen discloses a machine learning model, wherein the machine learning model is configured to determine the speech probability estimate (para. 0115 “By using GMM modeling, a probability 307 that a particular frame of first signal 340 is from a desired source (e.g., speech) and/or a probability that the particular frame of first signal 340 is from a non-desired source (e.g., an interfering source, such as stationary background noise) may be determined for each frame and/or frequency bin.”).
Thyssen does not specifically disclose generating a labelled data set that comprises one or more input features and one or more indications indicative of the speech being present; and providing the labelled data set to [a machine learning model…]
Sivaraman teaches generating a labelled data set that comprises one or more input features and one or more indications indicative of the speech being present (para. 0075 “In some embodiments, the analytics server 102 employs supervised training to train the machine-learning models of the machine-learning architecture, where the analytics database 104 and/or the call center database 112 contains labels associated with the training call data or enrollment call data. The labels indicate, for example, the expected data for the training call data or enrollment call data.”; para. 0086 “The server, or certain layers of the machine-learning architecture, may perform one or more data augmentation operations on the input audio signal (e.g., training audio signal, enrollment audio signal). The data augmentation operations generate certain types of degradation for the input audio signal, thereby generating corresponding simulated audio signals from the input audio signal. ...”; para. 0093 “In step 204, the server trains the machine-learning architecture by applying the sub-architectures (e.g., speaker separation engine, noise suppression engine, speaker-embedding engine) on the training signals. The server trains the speech separation engine and noise suppression engine to extract spectro-temporal masks (e.g., speaker mask, noise mask) and generate features of an output signal (e.g., noisy target speaker signal, enhanced speaker signal).”); and providing the labelled data set to a machine learning model (para. 0093 “In step 204, the server trains the machine-learning architecture by applying the sub-architectures (e.g., speaker separation engine, noise suppression engine, speaker-embedding engine) on the training signals. The server trains the speech separation engine and noise suppression engine to extract spectro-temporal masks (e.g., speaker mask, noise mask) and generate features of an output signal (e.g., noisy target speaker signal, enhanced speaker signal).”).
Thyssen and Sivaraman are considered to be analogous to the claimed invention as they both are in the same field of speech enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thyssen to incorporate the teachings of Sivaraman in order to generate a labelled data set that comprises one or more input features and one or more indications indicative of the speech being present, and to provide the labelled data set to a machine learning model to determine the speech probability estimate. Doing so would be beneficial, as this would force the machine-learning architecture to evaluate and adjust for various types of degradation present in the original training data signals (Sivaraman, para. 0086).

Regarding claim 6, Thyssen in view of Sivaraman discloses receiving, by the computing device, a set of speech samples (Sivaraman: para. 0085 “Certain steps of the method 200 include obtaining the input audio signals and/or pre-processing the input audio signals (e.g., training audio signal, enrollment audio signal, inbound audio signal) based upon the particular operational phase (e.g., training phase, enrollment phase, deployment phase).”); applying, based on each of the set of speech samples, a speech weight to the each of the set of speech samples (para. 0031 “As an example of the speech separation engine operations, the input audio signal containing a speech mixture signal x(t) may be represented as: x(t)=s_tar(t)+αs_interf(t)+n(t), where s_tar(t) is the target speaker's signal; s_interf(t) is an interfering speaker's signal; n(t) is the noise; and α is a scaling factor according to the SDR of the given training signal.”; speech weight for speech sample s_tar(t) is 1); receiving, by the computing device, a set of non-speech samples (para. 0091 “The data augmentation operations are not limited to generating speech mixtures. Before or after generating the simulated signals containing the speech mixtures, the server additionally or alternatively performs the data augmentation operations for non-speech background noises.”); applying, based on each of the set of non-speech sample, a non-speech weight to the each of the set of non-speech samples (para. 0091 “For instance, the server may add background noises randomly selected from a large noise corpus to the simulated audio signal comprising the speech mixture. The server may apply these background noises to the simulated audio signal at SNRs ranging from, for example, 5 dB to 30 dB, though such range is not limiting on possible embodiments;”); generating, by combining the speech weighted set of speech samples and the non-speech weighted set of non-speech samples, a speech augmented set (para. 0086 “The data augmentation operations generate certain types of degradation for the input audio signal, thereby generating corresponding simulated audio signals from the input audio signal.”); and extracting one or more input features from the speech augmented set (para. 0093 “In step 204, the server trains the machine-learning architecture by applying the sub-architectures (e.g., speaker separation engine, noise suppression engine, speaker-embedding engine) on the training signals. The server trains the speech separation engine and noise suppression engine to extract spectro-temporal masks (e.g., speaker mask, noise mask) and generate features of an output signal (e.g., noisy target speaker signal, enhanced speaker signal).”).
Thyssen and Sivaraman are considered to be analogous to the claimed invention as they both are in the same field of speech enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thyssen to incorporate the teachings of Sivaraman in order to receive a set of speech and non-speech samples, to apply a speech and non-speech weight respectively to the samples, to combine the speech weighted set of speech samples and the non-speech weighted set of non-speech samples to generate a speech augmented set, and to extract one or more input features from the speech augmented set. Doing so would be beneficial, as this would force the machine-learning architecture to evaluate and adjust for various types of degradation present in the original training data signals (Sivaraman, para. 0086).
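The weighting and combining steps discussed for claim 6 can be pictured with a minimal sketch. It assumes the speech weight is 1 (as the Examiner reads para. 0031) and that the non-speech weight is chosen to reach a target SNR, such as a value in the 5 dB to 30 dB range quoted from para. 0091. The function names `power`, `noise_weight_for_snr`, and `mix` and the toy signals are hypothetical illustrations; they are not taken from Thyssen, Sivaraman, or the application.

```python
import math


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def noise_weight_for_snr(speech, noise, snr_db, speech_weight=1.0):
    """Gain to apply to the noise so the weighted mixture has the requested SNR."""
    p_noise = power(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power; SNR is undefined")
    p_speech = power(speech) * speech_weight ** 2
    # SNR_dB = 10 * log10(p_speech / (g^2 * p_noise)), solved for g.
    return math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))


def mix(speech, noise, snr_db, speech_weight=1.0):
    """Combine weighted speech and weighted noise into one augmented signal."""
    g = noise_weight_for_snr(speech, noise, snr_db, speech_weight)
    mixture = [speech_weight * s + g * n for s, n in zip(speech, noise)]
    return mixture, g


# Toy signals standing in for a speech sample set and a non-speech sample set.
speech = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [0.5 if t % 2 == 0 else -0.5 for t in range(100)]
mixture, gain = mix(speech, noise, snr_db=10.0)
```

In this sketch the "speech augmented set" is simply `mixture`; any feature extraction would operate on it afterward.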

Regarding claim 8, Thyssen in view of Sivaraman discloses receiving, by the computing device, a set of speech samples (Sivaraman: para. 0085 “Certain steps of the method 200 include obtaining the input audio signals and/or pre-processing the input audio signals (e.g., training audio signal, enrollment audio signal, inbound audio signal) based upon the particular operational phase (e.g., training phase, enrollment phase, deployment phase).”); applying, based on each of the set of speech samples, a speech weight to the each of the set of speech samples (para. 0031 “As an example of the speech separation engine operations, the input audio signal containing a speech mixture signal x(t) may be represented as: x(t)=s_tar(t)+αs_interf(t)+n(t), where s_tar(t) is the target speaker's signal; s_interf(t) is an interfering speaker's signal; n(t) is the noise; and α is a scaling factor according to the SDR of the given training signal.”; speech weight for speech sample s_tar(t) is 1); receiving, by the computing device, a set of non-speech samples (para. 0091 “The data augmentation operations are not limited to generating speech mixtures. Before or after generating the simulated signals containing the speech mixtures, the server additionally or alternatively performs the data augmentation operations for non-speech background noises.”); applying, based on each of the set of non-speech sample, a non-speech weight to the each of the set of non-speech samples (para. 0091 “For instance, the server may add background noises randomly selected from a large noise corpus to the simulated audio signal comprising the speech mixture. The server may apply these background noises to the simulated audio signal at SNRs ranging from, for example, 5 dB to 30 dB, though such range is not limiting on possible embodiments;”); determining, based on the speech weighted set and the non-speech weighted set of non-speech sample, the one or more indications indicative of the speech present (para. 0093 “In step 204, the server trains the machine-learning architecture by applying the sub-architectures (e.g., speaker separation engine, noise suppression engine, speaker-embedding engine) on the training signals. The server trains the speech separation engine and noise suppression engine to extract spectro-temporal masks (e.g., speaker mask, noise mask) and generate features of an output signal (e.g., noisy target speaker signal, enhanced speaker signal).”).
Thyssen and Sivaraman are considered to be analogous to the claimed invention as they both are in the same field of speech enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thyssen to incorporate the teachings of Sivaraman in order to receive a set of speech and non-speech samples, to apply a speech and non-speech weight respectively to the samples, and to determine based on the speech weighted set of speech samples and the non-speech weighted set of non-speech samples the one or more indications indicative of the speech present. Doing so would be beneficial, as this would force the machine-learning architecture to evaluate and adjust for various types of degradation present in the original training data signals (Sivaraman, para. 0086).

Regarding claims 34, 35, and 37, these claims are rejected for analogous reasons to claims 5, 6, and 8, respectively.
Regarding claims 45, 46, and 48, these claims are rejected for analogous reasons to claims 5, 6, and 8, respectively.

6. Claims 7, 36, and 47 are rejected under 35 U.S.C. 103 as being unpatentable over Thyssen in view of Sivaraman and in further view of Ahoei et al. (US 11,977,816 B1, hereinafter Ahoei).

Regarding claim 7, Thyssen in view of Sivaraman discloses wherein the one or more extracted input features comprise Mel Frequency Cepstrum Coefficients (MFCCs)…(para. 0093 “In step 204, the server trains the machine-learning architecture by applying the sub-architectures (e.g., speaker separation engine, noise suppression engine, speaker-embedding engine) on the training signals.”; para. 0029 “The speech separation engine receives an input audio signal containing a mixture of speaker signals and one or more types of noise (e.g., additive noise, reverberation). The speech separation engine extracts low-level spectral features, such as such as mel-frequency cepstrum coefficients (MFCCs), and receives a voiceprint for a target speaker (sometimes called an “inbound voiceprint” or “target voiceprint”) generated by the speaker-embedding engine.”).
Thyssen and Sivaraman are considered to be analogous to the claimed invention as they both are in the same field of speech enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thyssen to incorporate the teachings of Sivaraman in order to have the one or more input features include MFCCs. Doing so would be beneficial, as MFCCs are a set of features which are frequently used for voice recognition (NPL Deruty, Intuitive understanding of MFCCs, pg. 1, 1st para.) which would be indicative of speech being present in an audio signal.
Thyssen in view of Sivaraman does not specifically disclose [wherein the one or more extracted input features comprise…] Phonemes, Senones, and Mel Spectrogram.
Ahoei teaches wherein the one or more extracted input features comprise Phonemes (Col. 28 Lines 1-7 “The preprocessing component 720 may transform the text data 715 into, for example, a symbolic linguistic representation, which may include linguistic context features such as phoneme data, punctuation data, syllable-level features, word-level features, and/or emotion, speaker, accent, or other features for processing by the TTS system 680.”), Senones (Col. 32 Lines 55-60 “The speech recognition engine 858 may use the acoustic model(s) 853 to attempt to match received audio feature vectors to words or subword acoustic units. An acoustic unit may be a senone, phoneme, phoneme in context, syllable, part of a syllable, syllable in context, or any other such portion of a word.”), and Mel Spectrogram (Col. 28 Lines 62-65 “This symbolic linguistic representation may be sent to the TTS model 780 for conversion into audio data (e.g., in the form of Mel-spectrograms or other frequency content data format).”).
Thyssen, Sivaraman, and Ahoei are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thyssen in view of Sivaraman to incorporate the teachings of Ahoei in order to have the one or more input features include phonemes, senones, and mel spectrogram. Utilizing phonemes would be beneficial as phonetic information can be used to guide the speech enhancement process to achieve better denoising performance (NPL Lu et al., Incorporating Broad Phonetic Information for Speech Enhancement, pg. 4, Conclusion). Furthermore, utilizing senones would be beneficial as senones carry higher-level information relating to human perception which aids in speech enhancement tasks (NPL Wang et al., A Cross-Task Transfer Learning Approach to Adapting Deep Speech Enhancement Models to Unseen Background Noise Using Paired Senone Classifiers, pg. 3 section 4.2 1st para., and pg. 4 Conclusion). Furthermore, utilizing mel spectrograms would be beneficial as they provide a concise snapshot of an audio signal while better reflecting how humans perceive amplitude and frequency compared to a normal spectrogram (NPL Doshi, Audio Deep Learning Made Simple (Part 2): Why Mel Spectrograms perform better, pg. 4 section “Spectrograms”; pg. 6 section “Mel Spectrograms”).
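The perceptual point made about mel spectrograms can be illustrated with a short sketch of the mel frequency scale. It assumes one widely used conversion, mel = 2595 * log10(1 + f/700); other conventions exist, and nothing here is taken from Ahoei, Doshi, or the application. The helper names `hz_to_mel`, `mel_to_hz`, and `mel_band_edges` are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_bands):
    """Edge frequencies (Hz) of n_bands bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# Ten bands between 0 Hz and 8 kHz: narrow at low frequencies, wide at high ones.
edges = mel_band_edges(0.0, 8000.0, 10)
```

Because the bands are evenly spaced in mels, their width in Hz grows with frequency, which is the sense in which a mel spectrogram allocates more resolution to the lower frequencies.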

Regarding claims 36 and 47, both claims are rejected for analogous reasons to claim 7.

7. Claims 11, 40, and 51 are rejected under 35 U.S.C. 103 as being unpatentable over Thyssen in view of Vilkamo et al. (US 2023/0402050 A1, hereinafter Vilkamo).

Regarding claim 11, Thyssen does not specifically disclose wherein the input signal comprises one or more pulse code modulation (PCM) signals.
Vilkamo teaches wherein the input signal comprises one or more pulse code modulation (PCM) signals (para. 0081 “The audio signals 205 can be provided to the processor 103 in any suitable format. In some examples the audio signals 205 can be provided to the processor 103 in a digital format. The digital format could comprise pulse code modulation (PCM) or any other suitable type of format.”).
Thyssen and Vilkamo are considered to be analogous to the claimed invention as they both are in the same field of speech enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Thyssen to incorporate the teachings of Vilkamo in order to have the input signal comprise one or more pulse code modulation (PCM) signals. Doing so would be beneficial, as pulse-code modulation is a noise-resistant method for transmitting audio signals (NPL Plonus, Electronics and Communications for Scientists and Engineers, Chapter 9, pg. 370, section “Pulse Code Modulation (PCM)”).
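For readers unfamiliar with the format, a minimal sketch of linear PCM follows. It assumes signed 16-bit samples and a symmetric scale factor of 32767, which is one common choice; Vilkamo does not specify a bit depth, and the function names `float_to_pcm16` and `pcm16_to_float` are hypothetical.

```python
def float_to_pcm16(samples):
    """Quantize floats in [-1.0, 1.0] to signed 16-bit integer codes."""
    codes = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip out-of-range input
        codes.append(int(round(x * 32767)))
    return codes


def pcm16_to_float(codes):
    """Map signed 16-bit integer codes back to floats in [-1.0, 1.0]."""
    return [c / 32767 for c in codes]


original = [0.0, 0.25, -0.5, 0.999, 1.5]
codes = float_to_pcm16(original)
restored = pcm16_to_float(codes)
```

Each sample becomes one integer code, so the only error on in-range input is the rounding step, which is bounded by the quantization step size.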

Regarding claims 40 and 51, both claims are rejected for analogous reasons to claim 11.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Kaskari (US 2024/0304204 A1): per-frequency bin calculations of probabilities of speech for subsequent speech enhancement (para. 0039-0041, para. 0068)
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CODY DOUGLAS HUTCHESON whose telephone number is (703)756-1601. The examiner can normally be reached M-F 8:00AM-5:00PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571)-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/CODY DOUGLAS HUTCHESON/Examiner, Art Unit 2659       

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659
Prosecution Timeline

Dec 01, 2023: Application Filed
Apr 22, 2024: Response after Non-Final Action
Dec 10, 2025: Non-Final Rejection mailed — §101, §102, §103
Mar 10, 2026: Response Filed
Apr 23, 2026: Final Rejection mailed — §101, §102, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/094,556 (Patent 12626715): ROLE SEPARATION METHOD, ELECTRONIC DEVICE, AND COMPUTER STORAGE MEDIUM. 3y 4m to grant; granted May 12, 2026.
18/421,318 (Patent 12614036): INTELLIGENT DETECTION OF BIAS WITHIN AN ARTIFICIAL INTELLIGENCE MODEL. 2y 3m to grant; granted Apr 28, 2026.
18/330,472 (Patent 12603096): VOICE ENHANCEMENT METHODS AND SYSTEMS. 2y 10m to grant; granted Apr 14, 2026.
18/545,677 (Patent 12591750): GENERATIVE LANGUAGE MODEL UNLEARNING. 2y 3m to grant; granted Mar 31, 2026.
18/163,230 (Patent 12579447): TECHNIQUES FOR TWO-STAGE ENTITY-AWARE DATA AUGMENTATION. 3y 1m to grant; granted Mar 17, 2026.
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 63%
With Interview (+51.7%): 99%
Median Time to Grant: 2y 8m (~2m remaining)
PTA Risk: Moderate
Based on 27 resolved cases by this examiner. Grant probability derived from career allowance rate.