DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
1. Regarding the rejections under 35 U.S.C. § 103, Applicant's arguments filed 12/15/2025 have been fully considered but they are not persuasive.
Applicant argues that the claimed invention would not have been obvious over any combination of the cited prior art. Specifically, Applicant argues that the cited prior art fails to disclose the claimed limitation of “the machine-learning model trained for the phase component is smaller than the second machine-learning model trained for the magnitude component”. The Examiner respectfully disagrees. Regarding the above limitation, the combination of Le Roux et al. (US 2020/0058314) and Aslan (US 2017/0132528) reads on the broadest reasonable interpretation (BRI) of this feature.

First, Le Roux discloses two machine learning models, a first machine learning model for generating phase codes (Fig. 5, 542) and a second machine learning model for generating magnitude codes (Fig. 5, 540), with both models being trained (Fig. 4, training of “Enhancement Network” 454 containing the aforementioned models). This reads on the BRI of “the machine-learning model trained for the phase component … [and] the second machine-learning model trained for the magnitude component”. The only feature that Le Roux does not disclose is that the machine learning model trained for the phase component is smaller than the machine learning model trained for the magnitude component. Under the BRI of the claim, this feature requires that a first ML model be smaller in some way (e.g., fewer layers, nodes, or parameters) than a second ML model.

Aslan teaches this feature. Specifically, Aslan teaches that a first model (Fig. 1, 102) is smaller in size than a second model (Fig. 1, 100) by having fewer parameters (para. 0044 “For example, the first (teacher) model 100 of FIG. 1 can comprise a large, complex ensemble of machine learning models that is often too large and/or slow to be used at run-time in particular scenarios. 
Meanwhile, the second (student) model 102 can comprise a much smaller machine learning model (e.g., a neural net with 1000 times fewer parameters than the first model 100) that has the size and/or speed that is advantageous at run-time in particular scenarios.”).

Regarding Applicant’s argument that it would not have been obvious to combine Aslan because it is specifically “directed to having models mimicking larger models for the same task” (see pg. 9, 1st para. of Remarks), the Examiner disagrees, as Aslan teaches that the tasks for the first and second models need not be the same (para. 0025 “In some implementations, the first model 100 and the second model 102 can be trained in parallel so that each model learns a task. The task learned by the first model 100 can be the same task as the task learned by the second model 102, or each model 100 and 102 can learn related (or complimentary) tasks, meaning that the tasks can differ slightly between the models 100 and 102.”). Therefore, the combination of Le Roux and Aslan teaches “the machine-learning model trained for the phase component is smaller than the second machine-learning model trained for the magnitude component”.
Hence, Applicant’s arguments are not persuasive.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
2. Claims 1-3, 5, 8-12, 14, 16-19, 21-22, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Le Roux et al. (PGPUB No. 2020/0058314, hereinafter Le Roux) in view of Liang et al. (NPL “Pruning and Quantization for Deep Neural Network Acceleration: A Survey”, hereinafter Liang) and further in view of Aslan et al. (PGPUB No. 2017/0132528, hereinafter Aslan).
Regarding claim 1, Le Roux discloses:
A method comprising: transforming, by a transform module (Fig. 5, 565 “Speech Estimation” module) from a time domain (Fig. 5, 505 “Noisy speech”; Paragraph 0003 “…the noisy speech can be obtained with a far-field microphone….”) into a frequency domain (Fig. 5, 582 “STFT”; Paragraph 0076 “…together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582…”) a sound signal (Fig. 5, “Noisy speech”) into a transformed sound signal (Fig. 5, output of 582 “STFT”) comprising a magnitude component and a phase component (Output of a Short-time Fourier transform (STFT) comprises a magnitude and phase component), wherein the phase component uses a first data format size (method implemented via processor coupled to memory: para. 0041 “According to another embodiment of the present disclosure, a method for audio signal processing having a hardware processor coupled with a memory, wherein the memory has stored instructions and other data, and when executed by the hardware processor carry out some steps of the method.”; the memory stores data in a first data format size);
generating, using the phase component with a machine learning model trained for the phase component (Paragraph 0076 “For each time frame and each frequency in a time-frequency domain… the phase softmax layer 542 …represent probabilities that the corresponding value in the phase codebook should be selected as the filter phase 578” “A filter phase computation module 552 can use these probabilities …to obtain a filter phase 578.”; para. 0075 “FIG. 4 is a flow diagram illustrating training of an audio signal processing system 400 for speech enhancement, according to embodiments of the present disclosure. …A noisy input speech signal 405 including a mixture of speech and noise and the corresponding clean signals 461 for the speech and noise are sampled from the training set of clean and noisy audio 401. The noisy input signal 405 is processed by an enhancement network 454 to compute a filter 460 for the target signal, using stored network parameters 452. A speech estimation module 465 then multiplies at each time-frequency bin the time-frequency representation of the noisy speech 405 with the filter 460 to obtain a time-frequency representation of the enhanced speech, and inverts that time-frequency representation of the enhanced speech to obtain the enhanced speech signal 490. An objective function computation module 463 computes an objective function by computing a distance between the clean speech and the enhanced speech. The objective function can be used by a network training module 457 to update the network parameters 452.”), a quantized mask (Quantized mask generated comprising a phase value for each time-frequency bin: para. 
0073 “For each time-frequency bin, the one or more phase codes 272 are used to select or combine phase-related values corresponding to the one or more phase codes within a phase codebook 280 to obtain a filter phase 278 for that time-frequency bin.”) that uses a second data format size…(method implemented via processor coupled to memory: para. 0041 “According to another embodiment of the present disclosure, a method for audio signal processing having a hardware processor coupled with a memory, wherein the memory has stored instructions and other data, and when executed by the hardware processor carry out some steps of the method.”; the memory stores data in a second data format size);
generating, using the magnitude component with a second machine learning model trained for the magnitude component, a second mask…(para. 0076 “For each time frame and each frequency in a time-frequency domain, …the magnitude softmax layer 540 … represent probabilities that the corresponding value in the magnitude codebook should be selected as the filter magnitude 574. A filter magnitude computation module 550 can use these probabilities …to obtain a filter magnitude 574.”; para. 0075 “FIG. 4 is a flow diagram illustrating training of an audio signal processing system 400 for speech enhancement, according to embodiments of the present disclosure. …A noisy input speech signal 405 including a mixture of speech and noise and the corresponding clean signals 461 for the speech and noise are sampled from the training set of clean and noisy audio 401. The noisy input signal 405 is processed by an enhancement network 454 to compute a filter 460 for the target signal, using stored network parameters 452. A speech estimation module 465 then multiplies at each time-frequency bin the time-frequency representation of the noisy speech 405 with the filter 460 to obtain a time-frequency representation of the enhanced speech, and inverts that time-frequency representation of the enhanced speech to obtain the enhanced speech signal 490. An objective function computation module 463 computes an objective function by computing a distance between the clean speech and the enhanced speech. The objective function can be used by a network training module 457 to update the network parameters 452.”);
filtering, by an artificial intelligence (AI) module (Filter is generated by AI module 554 “Enhancement Network”), the phase component of the transformed sound signal (Fig. 5, 576 “Filter”, contains a phase filtering component 578) by applying, to the phase component (para. 0076 “A filter combination module 560 combines the filter magnitudes 574 and the filter phases 578, for example by multiplying them, to obtain a filter 576. A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other, to obtain an enhanced spectrogram, which is inverted in a speech reconstruction module 588 to obtain an enhanced speech 590.”), the quantized mask (Paragraph 0076 “For each time frame and each frequency in a time-frequency domain… the phase softmax layer 542 …represent probabilities that the corresponding value in the phase codebook should be selected as the filter phase 578” “A filter phase computation module 552 can use these probabilities …to obtain a filter phase 578.”; Page 8, Paragraph 0073 discusses that for every discrete time-frequency bin, the filter phase for that bin is set to one of a finite set of values in the phase codebook; Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other…”; Multiplying time-frequency bins of noisy speech by filter phases 578 amounts to applying a quantized mask to the phase component) and the magnitude component of the transformed sound signal (Fig. 5, 576 “Filter”, contains a magnitude filtering component 574) by applying, to the magnitude component (para. 
0076 “A filter combination module 560 combines the filter magnitudes 574 and the filter phases 578, for example by multiplying them, to obtain a filter 576. A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other, to obtain an enhanced spectrogram, which is inverted in a speech reconstruction module 588 to obtain an enhanced speech 590.”), the second mask (para. 0076 “For each time frame and each frequency in a time-frequency domain, …the magnitude softmax layer 540 … represent probabilities that the corresponding value in the magnitude codebook should be selected as the filter magnitude 574. A filter magnitude computation module 550 can use these probabilities …to obtain a filter magnitude 574.”; Page 8, Paragraph 0073 discusses that for every discrete time-frequency bin, the filter magnitude for that bin is set to one of a finite set of values in the magnitude codebook; Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other…”; Multiplying time-frequency bins of noisy speech by filter magnitudes 574 amounts to applying the second mask to the magnitude component);
and generating, by the transform module (Fig. 5, 565 “Speech Estimation” module), a filtered sound signal (Fig. 5, 590 “Enhanced Speech”) by transforming, from the frequency domain into the time domain (Fig. 5, 588 “Speech Reconstruction”; Paragraph 0076 “…obtain an enhanced spectrogram, which is inverted in a speech reconstruction module 588 to obtain an enhanced speech 590...”), the transformed sound signal comprising the filtered magnitude component and the filtered phase component (Fig. 5, the signal being transformed from the frequency domain into the time domain is the enhanced spectrogram obtained after filtering the time-frequency representation with “Filter” 576).
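For illustration only (not part of the claim mapping or the prosecution record), the signal flow mapped above can be sketched numerically: a time-domain frame is transformed into magnitude and phase components, per-bin masks are applied to each component, and the result is transformed back into the time domain. A single FFT frame stands in for a full STFT, and all names are invented for this sketch.

```python
import numpy as np

# Illustrative stand-in for the mapped pipeline: "mag_mask" and "phase_mask"
# play the roles of the filter magnitudes (574) and filter phases (578);
# a real system would apply them per time-frequency bin of an STFT.

def enhance_frame(noisy_frame, mag_mask, phase_mask):
    spectrum = np.fft.rfft(noisy_frame)        # time domain -> frequency domain
    magnitude = np.abs(spectrum)               # magnitude component
    phase = np.angle(spectrum)                 # phase component
    # Apply the masks: scale the magnitude and shift the phase in each bin.
    enhanced = (magnitude * mag_mask) * np.exp(1j * (phase + phase_mask))
    # Frequency domain -> time domain (inverse transform).
    return np.fft.irfft(enhanced, n=len(noisy_frame))

# Identity masks (unit magnitude scaling, zero phase shift) leave the frame
# unchanged up to floating-point error.
frame = np.linspace(-1.0, 1.0, 16)
out = enhance_frame(frame, np.ones(9), np.zeros(9))
```

A length-16 real frame yields 9 frequency bins under `np.fft.rfft`, which is why the toy masks have 9 entries.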
Le Roux does not specifically disclose [a quantized mask that uses a second data format size] that is smaller than the first data format size.
Liang teaches [a quantized mask that uses a second data format size] that is smaller than the first data format size (pg. 15 section 4.1 1st para. “There are many methods to quantize a given network. Generally, they are formulated as Equation 12 where s is a scalar that can be calculated using various methods. g(.) is the clamp function applied to floating-point values Xr performing the quantization…”; 2nd para. “The min-max method is given by Equation 14 where [m,M] are the bounds for the minimum and maximum values of the parameters, respectively. N is the maximum representable number derived from the bit-width (e.g., 256=2^8 in case of 8-bit).” See Eq. 14, floating-point values Xr quantized to a smaller data format size (e.g., N=256 for 8-bit)).
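For illustration only (not part of the claim mapping), min-max quantization of the kind surveyed by Liang can be sketched as follows; the notation is simplified relative to Liang's Eq. 14, and the toy mask values are invented.

```python
import numpy as np

# Sketch of min-max quantization: float32 values spanning [m, M] are mapped
# onto an 8-bit integer grid with N = 2**8 = 256 levels, shrinking the data
# format size from 4 bytes to 1 byte per value.

def quantize_min_max(x, bits=8):
    m, M = float(x.min()), float(x.max())
    levels = 2 ** bits                                  # N = 256 for 8-bit
    scale = (M - m) / (levels - 1)
    q = np.round((x - m) / scale).astype(np.uint8)      # second data format size
    return q, scale, m

mask = np.linspace(0.0, 1.0, 5, dtype=np.float32)       # toy float32 mask
q, scale, m = quantize_min_max(mask)
```

Reconstructing `q * scale + m` recovers the original values to within half a quantization step, which is the precision traded away for the smaller format.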
Le Roux and Liang are considered to be analogous to the claimed invention, as they are both in the same field of applying quantization for machine learning applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Le Roux to incorporate the teachings of Liang in order to have a quantized mask that uses a second data format size which is smaller than the first data format size. Doing so would be beneficial, as using a low precision acceleration framework as taught in Liang would lead to accelerated computation (pg. 22, para. 1).
Le Roux in view of Liang additionally discloses the machine-learning model trained for the phase component and the second machine-learning model trained for the magnitude component (see above claim mapping); however, Le Roux in view of Liang does not specifically disclose [wherein the machine-learning model trained for the phase component] is smaller than [the second machine-learning model trained for the magnitude component].
Aslan teaches a method for jointly training machine learning models (Paragraph 0020) which involves a first machine learning model (Fig. 1, 102 “Student ML Model”) and a second machine learning model (Fig. 1, 100 “Teacher ML Model”) which can be trained using speech data (Fig. 1, 104 “Training Data”; Paragraph 0024 “The training data 104 can be stored in a database or repository of any suitable data, such as…speech data”). Aslan further teaches wherein the first machine-learning model is smaller than the second machine-learning model (Paragraph 0044 “…the first (teacher) model 100 of FIG. 1 can comprise a large, complex ensemble of machine learning models…the second (student) model 102 can comprise a much smaller machine learning model”).
Le Roux, Liang, and Aslan are considered to be analogous to the claimed invention as they are all in the same field of utilizing machine learning models for speech applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Le Roux in view of Liang to incorporate the teachings of Aslan in order to make the machine-learning model trained for the phase component smaller than the second machine-learning model trained for the magnitude component. Doing so would lead to the machine-learning model trained for the phase component occupying less memory, leading to a faster runtime and less resource utilization (Paragraph 0044).
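For illustration only (not part of the claim mapping), the sense in which one model is "smaller" than another under the BRI discussed above can be made concrete by counting dense-layer parameters; the layer widths below are invented, not taken from the references.

```python
# Hypothetical comparison: a model is "smaller" when it has fewer layers,
# nodes, or parameters. Here we count parameters of two toy dense networks.

def n_params(widths):
    # Weights (w_in * w_out) plus biases (w_out) for each consecutive layer pair.
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))

magnitude_model_widths = [257, 512, 512, 257]  # hypothetical larger (second) model
phase_model_widths = [257, 64, 257]            # hypothetical smaller (first) model

is_smaller = n_params(phase_model_widths) < n_params(magnitude_model_widths)
```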
Regarding claim 2, Le Roux in view of Liang and Aslan discloses dequantizing the quantized mask by converting from the second data format size to the first data format size (Liang: Page 16, Equation (16) describes quantizing a mask WR to obtain a quantized version WQ; Page 24, Equation (39) describes the quantized mask WQ being dequantized (by dividing by (n-1) and multiplying by Wmax ) to obtain “quantize-dequantized weights” (Page 24, Paddle Paragraph 1); the dequantized mask is then applied to features F; pg. 16 2nd para. “De-quantizing converts the quantized value Oq back to floating-point Or using the feature scales sf and weights scales sw”); and applying the dequantized mask to the phase component to filter the phase component (Liang teaches dequantized mask (see above mapping); Le Roux discloses application of mask to phase component: Paragraph 0076 “For each time frame and each frequency in a time-frequency domain… the phase softmax layer 542 …represent probabilities that the corresponding value in the phase codebook should be selected as the filter phase 578” “A filter phase computation module 552 can use these probabilities …to obtain a filter phase 578.”; Page 8, Paragraph 0073 discusses that for every discrete time-frequency bin, the filter phase for that bin is set to one of a finite set of values in the phase codebook; Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other…”; Multiplying time-frequency bins of noisy speech by filter phases 578 amounts to applying a quantized mask to the phase component).
Le Roux, Liang, and Aslan are considered to be analogous to the claimed invention as they are all in the same field of utilizing machine learning models for speech applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Liang in order to dequantize a quantized mask by converting from the second data format size to the first data format size before applying it to filter the phase component. Doing so would be beneficial, as using a low precision acceleration framework as taught in Liang would lead to accelerated computation (Page 22, Paragraph 1).
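For illustration only (not part of the claim mapping), the claim 2 operation can be sketched as follows: a quantized uint8 mask (the second, smaller data format) is converted back to float32 (the first data format) and then applied to the phase component per bin. The scale, offset, and mask values are invented for this sketch.

```python
import numpy as np

# Dequantization sketch: an 8-bit grid over [-pi, pi] is mapped back to
# float32 via a min-max-style scale and offset before the mask is applied.

scale, offset = 2 * np.pi / 255.0, -np.pi                 # grid parameters
quantized_mask = np.array([0, 128, 255], dtype=np.uint8)  # second data format size

# Convert from the second data format size to the first (uint8 -> float32).
dequantized_mask = quantized_mask.astype(np.float32) * scale + offset

# Apply the dequantized mask to a toy phase component (per-bin phase shift).
phase = np.zeros(3, dtype=np.float32)
filtered_phase = phase + dequantized_mask
```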
Regarding claim 3, Le Roux in view of Liang and Aslan discloses wherein the AI module generates the quantized mask using the machine-learning model (Le Roux: Fig. 5, quantized mask (Filter Phases 578) are generated using outputs (572 Phase codes) from the machine-learning model (542 “Phase softmax”) contained within the AI Module (Enhancement Network 554)).
Regarding claim 5, Le Roux in view of Liang and Aslan discloses further comprising filtering, by the AI module, the phase component in parallel with filtering the magnitude component (Le Roux: Fig. 5, AI module 554 computes the outputs of models 540 “Magnitude softmax” and 542 “Phase softmax” in parallel; further, “Filter magnitude computation” 550 and “Filter phase computation” 552 are shown to be carried out in parallel; Paragraph 0084 “A filter combination module 560 combines the filter magnitudes 574 and the filter phases 578, for example by multiplying them, to obtain a filter 576”; the filter applied to the noisy speech contains both magnitude and phase filters, and thus filters the noisy magnitude and phase components in parallel).
Regarding claim 8, Le Roux in view of Liang and Aslan discloses wherein the machine-learning model is trained to filter noise (Le Roux: Fig. 4, 454 “Enhancement Network”; Paragraph 0075 “Fig. 4 is a flow diagram illustrating training of an audio signal processing system 400 for speech enhancement…An objective function computation module 463 computes an objective function by computing a distance between the clean speech and the enhanced speech. The objective function can be used by a network training module 457 to update the network parameters 452.” Enhancement Network 454 is trained to filter noise by learning network parameters 452; the machine learning model of claim 1 (542 “Phase softmax”) is part of the Enhancement Network).
Regarding claim 9, Le Roux in view of Liang and Aslan discloses wherein transforming the sound signal from the time domain into the frequency domain uses a Fourier transform (Le Roux: Fig. 5, 582 transforms the signal using a short-time Fourier transform “STFT”) and transforming the filtered sound signal from the frequency domain into the time domain uses an inverse Fourier transform (Le Roux: Paragraph 0063 “For example, this time-frequency representation can be a short-time Fourier transform, in which case the obtained time-frequency representation of an enhanced audio signal can be processed by inverse short-time Fourier transform to obtain a time-domain enhanced audio signal.”).
Regarding claim 10, Le Roux discloses:
A system comprising: a physical memory (Fig. 7A, 710 “Memory”; Paragraph 0090 “The memory 710 can include random access memory (RAM), read only memory (ROM), flash memory, or any other suitable memory systems. The memory 710 can be a volatile memory unit or units, and/or a non-volatile memory unit or units. The memory 710 may also be another form of computer-readable medium, such as a magnetic or optical disk.”);
at least one physical processor (Fig. 7A, 709 “Processor”; Paragraph 0091 “The instructions, when executed by one or more processing devices (for example, processor 709), perform one or more methods, such as those described above.”);
a transform circuit (Paragraph 0108 “Further, embodiments of the present disclosure and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.”) configured to transform, from a time domain (Fig. 5, 505 “Noisy speech”; Paragraph 0003 “…the noisy speech can be obtained with a far-field microphone….”) into a frequency domain (Fig. 5, 582 “STFT”; Paragraph 0076 “…together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582…”), a sound signal (Fig. 5 “Noisy speech”) into a transformed sound signal (Fig. 5, output of 582 “STFT”) comprising a first feature component and a second feature component (Output of a Short-time Fourier transform (STFT) comprises a first feature component (phase) and a second feature component (magnitude)), wherein the first feature component uses a first data format size (method implemented via processor coupled to memory: para. 0041 “According to another embodiment of the present disclosure, a method for audio signal processing having a hardware processor coupled with a memory, wherein the memory has stored instructions and other data, and when executed by the hardware processor carry out some steps of the method.”; the memory stores data in a first data format size);
and an artificial intelligence (AI) circuit (Paragraph 0108 “Further, embodiments of the present disclosure and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.”) configured to: generate, using the first feature component with a first machine-learning model trained for the first feature component (Paragraph 0076 “For each time frame and each frequency in a time-frequency domain… the phase softmax layer 542 …represent probabilities that the corresponding value in the phase codebook should be selected as the filter phase 578” “A filter phase computation module 552 can use these probabilities …to obtain a filter phase 578.”; para. 0075 “FIG. 4 is a flow diagram illustrating training of an audio signal processing system 400 for speech enhancement, according to embodiments of the present disclosure. …A noisy input speech signal 405 including a mixture of speech and noise and the corresponding clean signals 461 for the speech and noise are sampled from the training set of clean and noisy audio 401. The noisy input signal 405 is processed by an enhancement network 454 to compute a filter 460 for the target signal, using stored network parameters 452. A speech estimation module 465 then multiplies at each time-frequency bin the time-frequency representation of the noisy speech 405 with the filter 460 to obtain a time-frequency representation of the enhanced speech, and inverts that time-frequency representation of the enhanced speech to obtain the enhanced speech signal 490. An objective function computation module 463 computes an objective function by computing a distance between the clean speech and the enhanced speech. 
The objective function can be used by a network training module 457 to update the network parameters 452.”), a quantized mask (Quantized mask generated comprising a phase value for each time-frequency bin: para. 0073 “For each time-frequency bin, the one or more phase codes 272 are used to select or combine phase-related values corresponding to the one or more phase codes within a phase codebook 280 to obtain a filter phase 278 for that time-frequency bin.”) that uses a second data format size…(method implemented via processor coupled to memory: para. 0041 “According to another embodiment of the present disclosure, a method for audio signal processing having a hardware processor coupled with a memory, wherein the memory has stored instructions and other data, and when executed by the hardware processor carry out some steps of the method.”; the memory stores data in a second data format size);
generate, using the second feature component with a second machine-learning model trained for the second feature component, a second mask…( para. 0076 “For each time frame and each frequency in a time-frequency domain, …the magnitude softmax layer 540 … represent probabilities that the corresponding value in the magnitude codebook should be selected as the filter magnitude 574. A filter magnitude computation module 550 can use these probabilities …to obtain a filter magnitude 574.”; para. 0075 “FIG. 4 is a flow diagram illustrating training of an audio signal processing system 400 for speech enhancement, according to embodiments of the present disclosure. …A noisy input speech signal 405 including a mixture of speech and noise and the corresponding clean signals 461 for the speech and noise are sampled from the training set of clean and noisy audio 401. The noisy input signal 405 is processed by an enhancement network 454 to compute a filter 460 for the target signal, using stored network parameters 452. A speech estimation module 465 then multiplies at each time-frequency bin the time-frequency representation of the noisy speech 405 with the filter 460 to obtain a time-frequency representation of the enhanced speech, and inverts that time-frequency representation of the enhanced speech to obtain the enhanced speech signal 490. An objective function computation module 463 computes an objective function by computing a distance between the clean speech and the enhanced speech. The objective function can be used by a network training module 457 to update the network parameters 452.”);
filter the first feature component of the transformed sound signal (Fig. 5, 576 “Filter”, contains a first feature (phase) filtering component 578) by applying, to the first feature component (para. 0076 “A filter combination module 560 combines the filter magnitudes 574 and the filter phases 578, for example by multiplying them, to obtain a filter 576. A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other, to obtain an enhanced spectrogram, which is inverted in a speech reconstruction module 588 to obtain an enhanced speech 590.”), the quantized mask (Paragraph 0076 “For each time frame and each frequency in a time-frequency domain… the phase softmax layer 542 …represent probabilities that the corresponding value in the phase codebook should be selected as the filter phase 578” “A filter phase computation module 552 can use these probabilities …to obtain a filter phase 578.”; Page 8, Paragraph 0073 discusses that for every discrete time-frequency bin, the filter phase for that bin is set to one of a finite set of values in the phase codebook; Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other…”; Multiplying time-frequency bins of noisy speech by filter phases 578 amounts to applying a quantized mask to the phase component); and filter the second feature component of the transformed sound signal (Fig. 
5, 576 “Filter”, contains a second feature (magnitude) filtering component 574; Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other…”) by applying, to the second feature component (para. 0076 “A filter combination module 560 combines the filter magnitudes 574 and the filter phases 578, for example by multiplying them, to obtain a filter 576. A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other, to obtain an enhanced spectrogram, which is inverted in a speech reconstruction module 588 to obtain an enhanced speech 590.”), the second mask (para. 0076 “For each time frame and each frequency in a time-frequency domain, …the magnitude softmax layer 540 … represent probabilities that the corresponding value in the magnitude codebook should be selected as the filter magnitude 574. A filter magnitude computation module 550 can use these probabilities …to obtain a filter magnitude 574.”; Page 8, Paragraph 0073 discusses that for every discrete time-frequency bin, the filter magnitude for that bin is set to one of a finite set of values in the magnitude codebook; Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other…”; Multiplying time-frequency bins of noisy speech by filter magnitudes 574 amounts to applying the second mask to the magnitude component);
wherein the transform circuit is further configured to generate a filtered sound signal (Fig. 5, 590 “Enhanced Speech”) by transforming, from the frequency domain into the time domain (Fig. 5, 588 “Speech Reconstruction”; Paragraph 0076 “…obtain an enhanced spectrogram, which is inverted in a speech reconstruction module 588 to obtain an enhanced speech 590...”), the transformed sound signal comprising the filtered first feature component and the filtered second feature component (Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other, to obtain an enhanced spectrogram”; Fig. 5, the signal being transformed from frequency domain to time domain is the enhanced spectrogram, which contains first feature (phase) and second feature (magnitude) components).
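For orientation only, the filter-combination and inversion operations characterized in the mapping above (combining filter magnitudes and filter phases into one complex filter, multiplying it with the noisy spectrum, and inverting to the time domain) can be sketched numerically. This is an illustrative toy on a single frame; the variable names are stand-ins and the sketch is not drawn from the Le Roux reference itself:

```python
import numpy as np

# Toy single-frame illustration of the filter-combination step:
# combine stand-in filter magnitudes and filter phases into one complex
# filter, multiply with the noisy spectrum, then invert to the time domain.
noisy = np.random.randn(64)
X = np.fft.rfft(noisy)                     # "transformed sound signal"

filt_mag = np.full(X.shape, 0.5)           # stand-in for filter magnitudes
filt_phase = np.zeros(X.shape)             # stand-in for filter phases
filt = filt_mag * np.exp(1j * filt_phase)  # combine by multiplication

enhanced_spec = X * filt                   # apply filter to the spectrum
enhanced = np.fft.irfft(enhanced_spec, n=len(noisy))  # back to time domain
```

With a unit phase filter and a constant 0.5 magnitude filter, the output is simply the input attenuated by half, confirming that the magnitude and phase components are filtered jointly through one complex multiplication.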
Le Roux does not specifically disclose [a quantized mask that uses a second data format size] that is smaller than the first data format size.
Liang teaches [a quantized mask that uses a second data format size] that is smaller than the first data format size (pg. 15 section 4.1 1st para. “There are many methods to quantize a given network. Generally, they are formulated as Equation 12 where s is a scalar that can be calculated using various methods. g(.) is the clamp function applied to floating-point values X_r performing the quantization…”; 2nd para. “The min-max method is given by Equation 14 where [m,M] are the bounds for the minimum and maximum values of the parameters, respectively. N is the maximum representable number derived from the bit-width (e.g., 256=2^8 in case of 8-bit).” See Eq. 14, floating-point values X_r quantized to a smaller data format size (e.g., N=256 for 8-bit)).
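Purely as an illustration of the min-max scheme quoted above (a generic sketch in the spirit of Liang's Equation 14, not code from the reference; the function and variable names are hypothetical):

```python
import numpy as np

def minmax_quantize(x, bits=8):
    """Min-max quantization sketch: clamp/scale floating-point values
    into N = 2**bits discrete levels (e.g., N = 256 for 8-bit)."""
    m, M = x.min(), x.max()          # [m, M]: bounds of the parameters
    N = 2 ** bits                    # maximum representable number
    scale = (M - m) / (N - 1)        # step between adjacent levels
    q = np.round((x - m) / scale).astype(np.uint8)  # 8-bit codes
    return q, scale, m

mask = np.linspace(0.0, 1.0, 16, dtype=np.float32).reshape(4, 4)
q, scale, m = minmax_quantize(mask)
# q occupies 1 byte per entry vs. 4 bytes per float32 entry
```

The quantized array uses a second, smaller data format (8-bit integers) than the original first format (32-bit floating point), which is the sense in which the second data format size is smaller than the first.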
Le Roux and Liang are considered to be analogous to the claimed invention as
they both are in the same field of applying quantization for machine learning applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Le Roux to incorporate the teachings of Liang in order to have a quantized mask that uses a second data format size which is smaller than the first data format size. Doing so would be beneficial, as using a low precision acceleration framework as taught in Liang would lead to accelerated computation (pg. 22, para. 1).
Le Roux in view of Liang additionally discloses the machine-learning model trained for the first feature component and the second machine-learning model trained for the second feature component (see above claim mapping); however, Le Roux in view of Liang does not specifically disclose [wherein the machine-learning model trained for the first feature component] is smaller than [the second machine-learning model trained for the second feature component].
Aslan teaches a method for jointly training machine learning models (Paragraph 0020) which involves a first machine learning model (Fig. 1, 102 “Student ML Model”) and a second machine learning model (Fig. 1, 100 “Teacher ML Model”) which can be trained using speech data (Fig. 1, 104 “Training Data”; Paragraph 0024 “The training data 104 can be stored in a database or repository of any suitable data, such as…speech data”). Aslan further teaches wherein the first machine-learning model is smaller than the second machine-learning model (Paragraph 0044 “…the first (teacher) model 100 of FIG. 1 can comprise a large, complex ensemble of machine learning models…the second (student) model 102 can comprise a much smaller machine learning model”).
Le Roux, Liang, and Aslan are considered to be analogous to the claimed invention as they are all in the same field of utilizing machine learning models for speech applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Le Roux in view of Liang to incorporate the teachings of Aslan in order to make the machine-learning model trained for the phase component smaller than the second machine-learning model trained for the magnitude component. Doing so would lead to the first machine learning model occupying less memory, leading to a faster runtime and less resource utilization (Paragraph 0044).
Regarding claim 11, Le Roux in view of Liang and Aslan discloses dequantizing the quantized mask by converting from the second data format size to the first data format size (Liang: Page 16, Equation (16) describes quantizing a mask W_R to obtain a quantized version W_Q; Page 24, Equation (39) describes the quantized mask W_Q being dequantized (by dividing by (n-1) and multiplying by W_max) to obtain “quantize-dequantized weights” (Page 24, Paddle Paragraph 1); the dequantized mask is then applied to features F; pg. 16 2nd para. “De-quantizing converts the quantized value O_q back to floating-point O_r using the feature scales s_f and weights scales s_w”); and applying the dequantized mask to the first feature component (Liang teaches dequantized mask (see above mapping); Le Roux discloses application of mask to first feature component: Paragraph 0076 “For each time frame and each frequency in a time-frequency domain… the phase softmax layer 542 …represent probabilities that the corresponding value in the phase codebook should be selected as the filter phase 578” “A filter phase computation module 552 can use these probabilities …to obtain a filter phase 578.”; Page 8, Paragraph 0073 discusses that for every discrete time-frequency bin, the filter phase for that bin is set to one of a finite set of values in the phase codebook; Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other…”; Multiplying time-frequency bins of noisy speech by filter phases 578 amounts to applying the mask to the phase component).
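As an illustration of the dequantize-then-apply sequence characterized above (a generic sketch, not code from Liang or Le Roux; the scale, offset, and names are hypothetical stand-ins):

```python
import numpy as np

def dequantize(q, scale, m):
    """Invert the quantization: convert 8-bit codes (second, smaller
    data format) back to float32 (first, larger data format)."""
    return (q.astype(np.float32) * scale) + m

# hypothetical 8-bit phase-mask codes spanning [-pi, pi]
scale, m = 2 * np.pi / 255, -np.pi
q = np.array([[0, 128], [255, 64]], dtype=np.uint8)

phase_mask = dequantize(q, scale, m)               # back to float32
spectrogram = np.ones((2, 2), dtype=np.complex64)  # stand-in spectrum
filtered = spectrogram * np.exp(1j * phase_mask)   # apply to phase component
```

The round trip converts the stored 8-bit codes back to 32-bit floating point before the multiplication, which is the sense in which the mask is dequantized from the second data format size to the first before being applied.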
Le Roux, Liang, and Aslan are considered to be analogous to the claimed invention as they are all in the same field of utilizing machine learning models for speech applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Liang in order to dequantize a quantized mask by converting from the second data format size to the first data format size before applying it to filter the phase component. Doing so would be beneficial, as using a low precision acceleration framework as taught in Liang would lead to accelerated computation (Page 22, Paragraph 1).
Regarding claim 12, Le Roux in view of Liang and Aslan discloses wherein the AI circuit is further configured to filter the first feature component in parallel with filtering the second feature component (Le Roux: Fig. 5, AI module 554 generates outputs to models 540 “Magnitude softmax” and 542 “Phase softmax” in parallel; further “Filter magnitude computation” 550 and “Filter phase computation” 552 are shown to be carried out in parallel; Paragraph 0084 “A filter combination module 560 combines the filter magnitudes 574 and the filter phases 578, for example by multiplying them, to obtain a filter 576”; Filter applied to noisy speech contains both magnitude and phase filters, and thus filters the noisy magnitude and phase components in parallel).
Regarding claim 14, Le Roux in view of Liang and Aslan discloses wherein the first feature component corresponds to a phase component and the second feature component corresponds to a magnitude component (Le Roux: Output of a short-time Fourier transform (STFT) comprises a first feature component (phase) and a second feature component (magnitude)).
Regarding claim 16, Le Roux in view of Liang and Aslan discloses wherein the first machine-learning model is trained to filter noise from the first feature component and the second machine-learning model is trained to filter noise from the second feature component (Le Roux: Fig. 4, 454 “Enhancement Network”; Paragraph 0075 “Fig. 4 is a flow diagram illustrating training of an audio signal processing system 400 for speech enhancement…An objective function computation module 463 computes an objective function by computing a distance between the clean speech and the enhanced speech. The objective function can be used by a network training module 457 to update the network parameters 452.” Enhancement Network 454 is trained to filter noise by learning network parameters 452; the first machine learning model (542 “Phase softmax”) filters noise from the first feature (phase) component, and the second machine learning model (540 “Magnitude softmax”) filters noise from the second feature (magnitude) component).
Regarding claim 17, claim 17 is a non-transitory computer readable medium claim with limitations similar to those in claim 1, and thus is rejected under similar rationale.
Additionally, Le Roux discloses A non-transitory computer-readable medium comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to (Paragraph 0042 “According to another embodiment of the present disclosure, a non-transitory computer readable storage medium embodied thereon a program executable by a hardware processor for performing a method.”).
Regarding claim 18, claim 18 is rejected for analogous reasons to claim 2.
Regarding claim 19, claim 19 is rejected for analogous reasons to claim 5.
Regarding claim 21, Le Roux in view of Liang and Aslan discloses wherein the magnitude component uses a third data format size (Le Roux, method implemented via processor coupled to memory: para. 0041 “According to another embodiment of the present disclosure, a method for audio signal processing having a hardware processor coupled with a memory, wherein the memory has stored instructions and other data, and when executed by the hardware processor carry out some steps of the method.”; the memory stores data in a third data format size) and the second mask is a second quantized mask (Quantized mask generated comprising a magnitude value for each time-frequency bin: para. 0073 “For each time-frequency bin, the one or more magnitude codes 270 are used to select or combine magnitude values corresponding to the one or more magnitude codes within a magnitude codebook 158 to obtain a filter magnitude 274 for that time-frequency bin.”) that uses a fourth data format size that is smaller than the third data format size (Liang, pg. 15 section 4.1 1st para. “There are many methods to quantize a given network. Generally, they are formulated as Equation 12 where s is a scalar that can be calculated using various methods. g(.) is the clamp function applied to floating-point values X_r performing the quantization…”; 2nd para. “The min-max method is given by Equation 14 where [m,M] are the bounds for the minimum and maximum values of the parameters, respectively. N is the maximum representable number derived from the bit-width (e.g., 256=2^8 in case of 8-bit).” See Eq. 14, floating-point values X_r quantized to a smaller data format size (e.g., N=256 for 8-bit)).
Le Roux, Liang, and Aslan are considered to be analogous to the claimed invention as they are all in the same field of utilizing machine learning models for speech applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Liang in order to have a second quantized mask that uses a fourth data format size which is smaller than the third data format size. Doing so would be beneficial, as using a low precision acceleration framework as taught in Liang would lead to accelerated computation (pg. 22, para. 1).
Regarding claim 22, Le Roux in view of Liang and Aslan discloses wherein the second feature component uses a third data format size (Le Roux, method implemented via processor coupled to memory: para. 0041 “According to another embodiment of the present disclosure, a method for audio signal processing having a hardware processor coupled with a memory, wherein the memory has stored instructions and other data, and when executed by the hardware processor carry out some steps of the method.”; the memory stores data in a third data format size) and the second mask is a second quantized mask (Quantized mask generated comprising a magnitude value for each time-frequency bin: para. 0073 “For each time-frequency bin, the one or more magnitude codes 270 are used to select or combine magnitude values corresponding to the one or more magnitude codes within a magnitude codebook 158 to obtain a filter magnitude 274 for that time-frequency bin.”) that uses a fourth data format size that is smaller than the third data format size (Liang, pg. 15 section 4.1 1st para. “There are many methods to quantize a given network. Generally, they are formulated as Equation 12 where s is a scalar that can be calculated using various methods. g(.) is the clamp function applied to floating-point values X_r performing the quantization…”; 2nd para. “The min-max method is given by Equation 14 where [m,M] are the bounds for the minimum and maximum values of the parameters, respectively.
N is the maximum representable number derived from the bit-width (e.g., 256=2^8 in case of 8-bit).” See Eq. 14, floating-point values X_r quantized to a smaller data format size (e.g., N=256 for 8-bit)).
Le Roux, Liang, and Aslan are considered to be analogous to the claimed invention as they are all in the same field of utilizing machine learning models for speech applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Liang in order to have a second quantized mask that uses a fourth data format size which is smaller than the third data format size. Doing so would be beneficial, as using a low precision acceleration framework as taught in Liang would lead to accelerated computation (pg. 22, para. 1).
Regarding claim 24, claim 24 is rejected for analogous reasons to claim 21.
3. Claims 7, 15, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Le Roux in view of Liang and Aslan, and further in view of Wang et al. (PGPUB No. 2023/0306980, hereinafter Wang).
Regarding claim 7, Le Roux in view of Liang and Aslan discloses:
transforming the sound signal (Le Roux: Fig. 5, “Noisy speech”) into the transformed signal (Le Roux: Fig. 5, output of 582 “STFT”) further comprises:
transforming, by the transform module (Le Roux: Fig. 5, 565 “Speech Estimation” module) … from the time domain (Le Roux: Fig. 5, 505 “Noisy speech”; Paragraph 0003 “…the noisy speech can be obtained with a far-field microphone….”) into the frequency domain (Le Roux: Fig. 5, 582 “STFT”; Paragraph 0076 “…together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582…”)
filtering the phase component further comprises filtering, by the AI module (Le Roux: Filter is generated by AI module 554 “Enhancement Network”), the phase component (Le Roux: Fig. 5, 576 “Filter”, contains a phase filtering component 578; Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other…”; Multiplying time-frequency bins of noisy speech by filter phases 578 amounts to applying a quantized mask to the phase component) … by applying the machine-learning model (Le Roux: Fig. 5, filter phases 578 derived from machine learning model (542 “Phase softmax”) within AI module 554);
generating the filtered sound signal further comprises:
transforming, by the transform module (Le Roux: Fig. 5, 565 “Speech Estimation” module) … from the frequency domain into the time domain (Le Roux: Fig. 5, 588 “Speech Reconstruction”; Paragraph 0076 “…obtain an enhanced spectrogram, which is inverted in a speech reconstruction module 588 to obtain an enhanced speech 590...”)
Le Roux in view of Liang and Aslan does not specifically disclose:
splitting, by the transform module, the sound signal into overlapping segments;
[transforming, by the transform module,] the overlapping segments [from the time domain into the frequency domain;]
[filtering, by the AI module, the phase component] of each of the overlapping segments [by applying the machine-learning model;]
[transforming, by the transform module], the overlapping segments from the [frequency domain into the time domain];
reconstructing, by the transform module, the segments into the filtered sound signal.
Wang teaches a system for audio signal enhancement (Wang, Abstract), which takes as input an audio signal (Fig. 1A 102 “Input Y”; Paragraph 0061 “…system 104 is configured to receive an input mixture of audio signals 102…”) into a transform module (Fig. 1A 104 “Speech Enhancement System”), to obtain an enhanced audio signal output (Fig. 1A 106 “Output Enhanced Signal”; Paragraph 0062 “The signal enhancement system 104 uses deep learning to perform processing of the input mixture of audio signals …to ensure providing of a low latency, but high quality enhanced signal 106, S,”). Wang further teaches splitting, by the transform module (Fig. 1B 114 “Partition using first sliding window method” within a transform module 104), the sound signal into overlapping segments (Fig. 2A 114; Paragraph 0077 “The first window function 200 is applied, such as by multiplying, with the input mixture of audio signals 102, resulting in the partitioning of the input mixture of audio signals 102 into a sequence of overlapping frames:”); and transforming, by the transform module (Fig. 2C; Paragraph 0087 “FIG. 2C illustrates an exemplar method 230 for performing audio signal enhancement using the signal enhancement system 104”), the overlapping segments from the time domain into the frequency domain (Fig. 2C, 116a and 116b; Paragraph 0088 “…the input mixture of audio signals 102 at the input is first partitioned 114 into a sequence of input overlapping frames 202-208…” Paragraph 0089 “The processing at the enhancement 116 step further includes, at 116a, applying an analysis window for performing the STFT, and then at 116b performing the DFT for each frame of the input overlapping frames 202-208 to convert the time-frequency domain signals in each frame into frequency domain for neural network based enhancement.”) .
Wang further teaches an AI module (Fig. 2C 116c) and filtering, by the AI module, each of the overlapping segments (Fig. 3A 234 “Frequency Domain Online Filtering”; Paragraph 0099 “… frequency domain online linear filtering component 234 to generate a frequency domain filtering output 234a for the intermediate representation 232a of each frame of the enhanced overlapping frames.”) by applying the machine-learning model (AI module in Fig. 2C 116c contains machine learning model, whose architecture is given in Fig. 3B);
Wang further teaches transforming, by the transform module (Fig. 1B, 118 “Combine using second sliding window method” within transform module 104) the overlapping segments from the frequency domain into the time domain (Fig. 2C 116d “Inverse DFT”; Paragraph 0090 “As a result of the frame-online DNN based processing at 116c, final enhanced frames are obtained which are subjected to iDFT operation to convert frequency domain signals in each frame back into time-frequency domain for reconstruction of samples in each frame.”); and reconstructing, by the transform module, the segments into the filtered sound signal (Fig. 2C 118; Paragraph 0090 “The reconstruction is performed such as using the combining 118 step…The overlapped frames 118c are then added at 118d to reconstruct the samples in the output signal 106”).
Le Roux, Liang, Aslan, and Wang are considered to be analogous to the claimed invention as they are all in the same field of utilizing machine learning models for speech applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Le Roux in view of Liang and Aslan to incorporate the teachings of Wang in order to split the sound signal into overlapping segments, filter each overlapping segment, and reconstruct the overlapping segments into a filtered sound signal. Doing so would reduce the algorithmic latency of the audio processing (Paragraph 0011, Paragraph 0013).
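As a purely illustrative sketch of the partition/combine steps characterized in the Wang mapping above (a generic overlap-add scheme, not Wang's specific windows or code; the window length, hop size, and names are hypothetical):

```python
import numpy as np

def frames_overlap_add(x, win_len=8, hop=4):
    """Split a signal into overlapping windowed frames, then
    reconstruct it by overlap-add (partition and combine steps)."""
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i*hop : i*hop + win_len] * win
                       for i in range(n_frames)])
    # (a per-frame DFT -> enhancement -> inverse-DFT stage would go here)
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for i, f in enumerate(frames):
        out[i*hop : i*hop + win_len] += f * win       # overlap-add
        norm[i*hop : i*hop + win_len] += win ** 2     # window compensation
    return out / np.maximum(norm, 1e-8)

x = np.sin(np.linspace(0, 6.28, 64))
y = frames_overlap_add(x)   # interior samples reconstruct the input
```

With the analysis and synthesis windows compensated, the interior of the signal is reconstructed exactly, showing how overlapping segments can be filtered frame-by-frame and recombined into a single filtered output.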
Regarding claim 15, Le Roux in view of Liang and Aslan discloses:
the transform circuit is further configured to transform the sound signal into the transformed sound signal by:
transforming (Le Roux: Fig. 5, 565 “Speech Estimation” module) … from the time domain (Le Roux: Fig. 5, 505 “Noisy speech”; Paragraph 0003 “…the noisy speech can be obtained with a far-field microphone….”) into the frequency domain (Le Roux: Fig. 5, 582 “STFT”; Paragraph 0076 “…together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582…”)
the AI circuit is further configured to filter the first feature component by filtering the first feature component … by applying the quantized mask (Le Roux: Fig. 5, 576 “Filter”, contains a phase filtering component 578; Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other…”; Multiplying time-frequency bins of noisy speech by filter phases 578 amounts to applying a quantized mask to the phase component) from the machine-learning model (Le Roux: Fig. 5, filter phases 578 derived from machine learning model (542 “Phase softmax”) within AI module 554);
the AI circuit is further configured to filter the second feature component by filtering the second feature component (Le Roux: Fig. 5, 576 “Filter”, contains a second feature (magnitude) filtering component 574; Paragraph 0076 “A speech estimation module 565 uses a spectrogram estimation module 584 to process the filter 576 together with a time-frequency representation of the noisy speech 505 such as the short-time Fourier transform 582, for example by multiplying them with each other…”) …by applying the second machine-learning model (Le Roux: Fig. 5, 540 “Magnitude softmax”);
the transform circuit is further configured to generate the filtered sound signal by:
transforming … (Le Roux: Fig. 5, 565 “Speech Estimation” module) … from the frequency domain into the time domain (Fig. 5, 588 “Speech Reconstruction”; Paragraph 0076 “…obtain an enhanced spectrogram, which is inverted in a speech reconstruction module 588 to obtain an enhanced speech 590...”)
Le Roux in view of Liang and Aslan does not specifically disclose:
splitting the sound signal into overlapping segments;
[transforming] the overlapping segments [from the time domain into the frequency domain;]
[filtering the first feature component] of each of the overlapping segments [by applying the quantized mask from the first machine-learning model;]
[filtering the second feature component] of each of the overlapping segments [by applying the second machine-learning model;]
[transforming] the overlapping segments from the [frequency domain into the time domain];
reconstructing the segments into the filtered sound signal.
Wang teaches a system for audio signal enhancement (Wang, Abstract), which takes as input an audio signal (Fig. 1A 102 “Input Y”; Paragraph 0061 “…system 104 is configured to receive an input mixture of audio signals 102…”) into a transform module (Fig. 1A 104 “Speech Enhancement System”), to obtain an enhanced audio signal output (Fig. 1A 106 “Output Enhanced Signal”; Paragraph 0062 “The signal enhancement system 104 uses deep learning to perform processing of the input mixture of audio signals …to ensure providing of a low latency, but high quality enhanced signal 106, S,”). Wang further teaches splitting (Fig. 1B 114 “Partition using first sliding window method” within a transform module 104) the sound signal into overlapping segments (Fig. 2A 114; Paragraph 0077 “The first window function 200 is applied, such as by multiplying, with the input mixture of audio signals 102, resulting in the partitioning of the input mixture of audio signals 102 into a sequence of overlapping frames:”); and transforming (Fig. 2C; Paragraph 0087 “FIG. 2C illustrates an exemplar method 230 for performing audio signal enhancement using the signal enhancement system 104”), the overlapping segments from the time domain into the frequency domain (Fig. 2C, 116a and 116b; Paragraph 0088 “…the input mixture of audio signals 102 at the input is first partitioned 114 into a sequence of input overlapping frames 202-208…” Paragraph 0089 “The processing at the enhancement 116 step further includes, at 116a, applying an analysis window for performing the STFT, and then at 116b performing the DFT for each frame of the input overlapping frames 202-208 to convert the time-frequency domain signals in each frame into frequency domain for neural network based enhancement.”) .
Wang further teaches filtering each of the overlapping segments (Fig. 3A 234 “Frequency Domain Online Filtering”; Paragraph 0099 “… frequency domain online linear filtering component 234 to generate a frequency domain filtering output 234a for the intermediate representation 232a of each frame of the enhanced overlapping frames.”) by applying the machine-learning model (AI module in Fig. 2C 116c contains machine learning model, whose architecture is given in Fig. 3B);
Wang further teaches transforming (Fig. 1B, 118 “Combine using second sliding window method” within transform module 104) the overlapping segments from the frequency domain into the time domain (Fig. 2C 116d “Inverse DFT”; Paragraph 0090 “As a result of the frame-online DNN based processing at 116c, final enhanced frames are obtained which are subjected to iDFT operation to convert frequency domain signals in each frame back into time-frequency domain for reconstruction of samples in each frame.”); and reconstructing the segments into the filtered sound signal (Fig. 2C 118; Paragraph 0090 “The reconstruction is performed such as using the combining 118 step…The overlapped frames 118c are then added at 118d to reconstruct the samples in the output signal 106”).
Le Roux, Liang, Aslan, and Wang are considered to be analogous to the claimed invention as they are all in the same field of utilizing machine learning models for speech applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Le Roux in view of Liang and Aslan to incorporate the teachings of Wang in order to split the sound signal into overlapping segments, filter each overlapping segment, and reconstruct the overlapping segments into a filtered sound signal. Doing so would reduce the algorithmic latency of the audio processing (Paragraph 0011, Paragraph 0013).
Regarding claim 23, claim 23 is rejected for analogous reasons to claim 7.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Afouras et al. (NPL The Conversation: Deep Audio-Visual Speech Enhancement): speech enhancement of noisy speech via a magnitude subnetwork and a phase subnetwork (Fig. 1)
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CODY DOUGLAS HUTCHESON whose telephone number is (703)756-1601. The examiner can normally be reached M-F 8:00AM-5:00PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CODY DOUGLAS HUTCHESON/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659