DETAILED ACTION
This communication is in response to the Application filed on 04/23/2024. Claims 1-20 are pending and have been examined. Claims 1, 12, and 20 are independent. This Application was published as U.S. Pub. No. 20240363132.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 11/11/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Priority
Applicant’s claim for the benefit of provisional application 63/461,665, filed on 04/25/2023, is acknowledged.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 16-19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claims 16-19 contain the trademark/trade name Tensilica®. Where a trademark or trade name is used in a claim as a limitation to identify or describe a particular material or product, the claim does not comply with the requirements of 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph. See Ex parte Simpson, 218 USPQ 1020 (Bd. App. 1982). The claim scope is uncertain since the trademark or trade name cannot be used properly to identify any particular material or product. A trademark or trade name is used to identify a source of goods, and not the goods themselves. Thus, a trademark or trade name does not identify or describe the goods associated with the trademark or trade name. In the present case, the trademark/trade name is used to identify/describe a brand name or a company; accordingly, the identification/description is indefinite.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-6 and 8-20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al., (US Pat No. 11,727,926, hereinafter, Liu) in view of Defossez et al., ("Music source separation in the waveform domain." arXiv preprint arXiv:1911.13254 (2019), hereinafter, Defossez).
Regarding Claim 1,
Liu discloses a computer-implemented method comprising:
receiving audio data including a known noisy acoustic signal, the known noisy acoustic signal including a known clean acoustic signal and at least one known additive noise (Liu, Fig.1, col.3, lls.13-29, "…the user device 110 and/or remote system 120 receives (120) first audio data that includes representations of both an utterance 104 of a user 102 and noise 108 from a noise source 106...");
transforming the audio data into frequency-domain data (Liu, col.3, lls.30-41, "…the user device 110 and/or remote system 120 may further process the audio data to, for example, convert time-domain audio data into frequency domain audio data (via, for example, a Fourier transform)..."); and
training a convolutional neural network based on the frequency-domain data and at least one of the known clean acoustic signal or the known additive noise (Liu, col.5, lls.4-15, "…noise-reduction component may be trained to process received audio data that includes a representation of both an utterance and of noise to determine output audio data that includes a representation of the utterance and reduced noise..."),
wherein the convolutional neural network is configured to: output a frequency multiplicative mask to be applied to the frequency-domain data to estimate the known clean acoustic signal (Liu, Figs.2 and 4, col.15, lls.1-34, "…The noise-reduction component 222 may further include a decoder 414 for processing the encoder output data 412 and the RNN output data 422 to produce mask data 424…"; Figs. 5 and 6 illustrate the encoder and decoder layers where dense layers of encoder/decoder, respectively, perform two-dimensional convolution; col.18, lls.33-36, "…Each dense layer 502 may perform an AxB two-dimensional convolution..."; col.19, lls.20-24, "…Each dense layer 604 may perform a transpose AxB two-dimensional convolution...").
Liu discloses a multi-layer convolutional encoder and decoder with U-Net skip connections, but does not explicitly disclose the limitation, "include an encoder configured to upsample the frequency-domain data into a feature space." However, Defossez discloses an encoder configured to upsample the frequency-domain data into a feature space (Defossez, Figure 2, 4. The Demucs Architecture, "…Demucs takes a stereo mixture as input and outputs a stereo estimate for each source…It is an encoder/decoder architecture composed of a convolutional encoder, a bidirectional LSTM, and a convolutional decoder, with the encoder and decoder linked with skip U-Net connections...The number of channels in the input mixture is 2, while we use…as the number of output channels for the first encoder block…the final number of channels is CL = 2048...").
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the system and method for speech enhancement and noise reduction of Liu with the waveform-to-waveform source separation model, with a U-Net structure and bidirectional LSTM, of Defossez, with a reasonable expectation of success, to improve speech/noise separation while compressing the model to a small size and maintaining accuracy and naturalness (Defossez, Abstract).
Regarding Claim 2,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Liu further discloses multiplying the frequency multiplicative mask to the frequency-domain data to estimate the known clean acoustic signal (Liu, Fig.4A, col.15, lls.6-22, "…A complex multiplication component 426 may process the mask data 424 and the delayed input data 428 to determine the output data 430. The mask data 424 may be a vector and/or series of vectors comprising complex numbers of the form a +bi..."; col.5, lls.4-15, "…output audio data that includes a representation of the utterance and reduced noise…noise reduction refers to reducing a magnitude of the volume of the representation of the noise represented in the audio data. This reduction in magnitude includes reducing the magnitude to zero...").
Regarding Claim 3,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Defossez further discloses spatial dimensions specified by width and height of the frequency-domain data remain the same before and after a performance of at least one 2-dimensional convolutional layer of the encoder (Defossez, Fig.2, 4.1 Convolutional auto-encoder, "…The number of channels in the input mixture is 2, while we use…as the number of output channels for the first encoder block…the final number of channels is CL = 2048..."; i.e., note that the upsampling at the encoder involves increasing the number of channels).
Regarding Claim 4,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Defossez further discloses the convolutional neural network includes a decoder configured to downsample the feature space into the frequency multiplicative mask (Defossez, Fig.2, 4.1 Convolutional auto-encoder, "…The decoder is mostly the inverse of the encoder...").
Regarding Claim 5,
Liu in view of Defossez discloses the computer-implemented method of claim 4. Defossez further discloses spatial dimensions specified by width and height of the frequency-domain data remain the same before and after a performance of at least one 2-dimensional convolutional layer of the decoder (Defossez, Fig.2, 4.1 Convolutional auto-encoder, "…The decoder is mostly the inverse of the encoder...").
Regarding Claim 6,
Liu in view of Defossez discloses the computer-implemented method of claim 1 further comprising:
Liu further discloses constructing the convolutional neural network, including a plurality of neurons arranged in a plurality of layers including encoding layers and decoding layers wherein the encoding layers and decoding layers include 2-dimensional convolutional layers (Liu, Figs.2 and 4, col.15, lls.1-34, "…The noise-reduction component 222 may further include a decoder 414 for processing the encoder output data 412 and the RNN output data 422 to produce mask data 424…"; Figs. 5 and 6 illustrate the encoder and decoder layers where dense layers of encoder/decoder, respectively, perform two-dimensional convolution; col.18, lls.33-36, "…Each dense layer 502 may perform an AxB two-dimensional convolution..."; col.19, lls.20-24, "…Each dense layer 604 may perform a transpose AxB two-dimensional convolution...").
Regarding Claim 8,
Liu in view of Defossez discloses the computer-implemented method of claim 6. Defossez discloses a first layer of the plurality of layers is configured to encode frequencies in the frequency-domain data into a higher-dimension feature space in comparison with an original dimension of the frequency-domain data, and a second layer of the plurality of layers is configured to decode the feature space to a lower dimension in comparison with the higher-dimension feature space (Defossez, Figure 2, 4. The Demucs Architecture, "…Demucs takes a stereo mixture as input and outputs a stereo estimate for each source…It is an encoder/decoder architecture composed of a convolutional encoder, a bidirectional LSTM, and a convolutional decoder, with the encoder and decoder linked with skip U-Net connections...The number of channels in the input mixture is 2, while we use…as the number of output channels for the first encoder block…the final number of channels is CL = 2048...Decoder...The decoder is mostly the inverse of the encoder...").
Regarding Claim 9,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Liu further discloses providing the trained convolutional neural network to a wearable or portable audio device wherein the audio device is capable of (Liu, Fig.1, col.2, lls.45-66, "…The device 110 may capture audio that represents both desired audio, such as the utterance 104, and undesired audio, such as the noise 108…The device 110 may contain a noise-reduction component and a number of other components..."):
receiving real-time audio data, transforming the real-time audio data into real-time frequency-domain data (Liu, Fig.1, col.3, lls.13-29, "…the user device 110 and/or remote system 120 receives (120) first audio data that includes representations of both an utterance 104 of a user 102 and noise 108 from a noise source 106..."; col.3, lls.30-41, "…the user device 110 and/or remote system 120 may further process the audio data to, for example, convert time-domain audio data into frequency domain audio data (via, for example, a Fourier transform)..."),
outputting a real-time frequency multiplicative mask using the trained convolutional neural network and the real-time audio data (Liu, Figs.2 and 4, col.15, lls.1-34, "…The noise-reduction component 222 may further include a decoder 414 for processing the encoder output data 412 and the RNN output data 422 to produce mask data 424…"; Figs. 5 and 6 illustrate the encoder and decoder layers where dense layers of encoder/decoder, respectively, perform two-dimensional convolution; col.18, lls.33-36, "…Each dense layer 502 may perform an AxB two-dimensional convolution..."; col.19, lls.20-24, "…Each dense layer 604 may perform a transpose AxB two-dimensional convolution..."), and applying the real-time frequency multiplicative mask to the real-time frequency-domain data (Liu, Fig.4A, col.15, lls.6-22, "…A complex multiplication component 426 may process the mask data 424 and the delayed input data 428 to determine the output data 430. The mask data 424 may be a vector and/or series of vectors comprising complex numbers of the form a +bi..."; col.5, lls.4-15, "…output audio data that includes a representation of the utterance and reduced noise…noise reduction refers to reducing a magnitude of the volume of the representation of the noise represented in the audio data. This reduction in magnitude includes reducing the magnitude to zero...").
Regarding Claim 10,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Liu further discloses the frequency multiplicative mask is a phase-aware complex ratio mask (Liu, col.15, lls.8-14, "…A complex multiplication component 426 may process the mask data 424 and the delayed input data 428 to determine the output data 430. The mask data 424 may be a vector and/or series of vectors comprising complex numbers of the form a +bi, wherein a denotes the real part of each number and wherein b denotes the imaginary part of each number..."; col.17, lls.38-40, "…The input data 402 and/or mask data 424 may, as described herein, be divided into complex data such as magnitude data and phase data...").
Regarding Claim 11,
Liu in view of Defossez discloses the computer-implemented method of claim 1. Liu further discloses the known noisy acoustic signal is a known noisy speech signal and the known clean acoustic signal is a known clean speech signal (Liu, Fig.1, col.3, lls.13-29, "…the user device 110 and/or remote system 120 receives (120) first audio data that includes representations of both an utterance 104 of a user 102 and noise 108 from a noise source 106..."; col.5, lls.4-15, "…output audio data that includes a representation of the utterance and reduced noise…noise reduction refers to reducing a magnitude of the volume of the representation of the noise represented in the audio data. This reduction in magnitude includes reducing the magnitude to zero (i.e., user utterance with zero magnitude noise = clean speech signal)...").
Claim 12 is a system claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. Additionally,
Liu further discloses a system comprising: a combination of a high fidelity digital signal processor (HiFi DSP) paired with a neural processing unit (NPU) for real-time audio processing and executed on the combination of the HiFi DSP paired with the NPU (Liu, Figs. 2 and 3, col.8, lls.36-39, "…one or more of the speech-processing systems 292, which may be used to determine which, if any, of the ASR 250, NLU 260, and/or TTS 280 components..."; col.12, lls.22-66, "…audio data from two or more microphones 301 may be processed by the analysis filter bank 304 (and/or other components)…The analysis filterbank 304 may perform a Fourier transform, such as a fast Fourier transform (FFT), and may include one or more uniform discrete Fourier transform (DFT) filterbanks..."; col.13, ll.4 - col.14, ll.33, "…The acoustic-echo cancellation component 306 may subtract reference audio data 312 from the frequency-domain audio data...The user device 110 may perform a number of other audio functions, such as automatic gain control (AGC), filtering (high-, low-, and/or band-pass filtering), echo suppression, and/or beamforming. Beamforming...A synthesis filterbank 310 may be used to convert the frequency-domain data back to time-domain output audio data 316..."; under the broadest reasonable interpretation, the user devices or remote systems employ the speech-processing systems 292 (i.e., DSP) and the noise-reduction component 222 (i.e., NPU)); and
estimating a noise suppressed version of the input audio data (Liu, Fig.4A, col.15, lls.6-22, "…A complex multiplication component 426 may process the mask data 424 and the delayed input data 428 to determine the output data 430. The mask data 424 may be a vector and/or series of vectors comprising complex numbers of the form a +bi..."; col.5, lls.4-15, "…output audio data that includes a representation of the utterance and reduced noise…noise reduction refers to reducing a magnitude of the volume of the representation of the noise represented in the audio data. This reduction in magnitude includes reducing the magnitude to zero...").
…
Rationale for combination is similar to that provided for Claim 1.
Claim 13 is a system claim with limitations similar to the limitations of Claim 4 and is rejected under similar rationale.
Claim 14 is a system claim with limitations similar to the limitations of Claim 6 and is rejected under similar rationale.
Regarding Claim 15,
Liu in view of Defossez discloses the system of claim 13.
Defossez further discloses the decoding layer is configured to increase a number of channels (Defossez, Fig.2 (a), See Decoder block, Cin=4096 and Cout = 4*2).
Regarding Claim 16,
Liu in view of Defossez discloses the system of claim 12.
Liu further discloses the HiFi DSP is of the Tensilica® HiFi DSP family (Liu, Figs. 2 and 3, one or more of the speech-processing systems 292, the ASR 250, NLU 260, and/or TTS 280 components, the analysis filter bank 304, and the acoustic-echo cancellation component 306, which are made executable in combination with hardware processors (e.g., microcontrollers, systems on a chip, field programmable gate arrays, digital signal processors, graphics processing units, general processing units; Liu, col.21, lls.36-62), would be interpreted as a DSP).
Regarding Claim 17,
Liu in view of Defossez discloses the system of claim 16.
Liu further discloses the HiFi DSP is the HiFi 5 DSP of the Tensilica® HiFi DSP family (Liu, Figs. 2 and 3, one or more of the speech-processing systems 292, the ASR 250, NLU 260, and/or TTS 280 components, the analysis filter bank 304, and the acoustic-echo cancellation component 306 in combination with hardware processors would be interpreted as a DSP).
Regarding Claim 18,
Liu in view of Defossez discloses the system of claim 12.
Liu further discloses the NPU is of the Tensilica® neural network engine (NNE) family (Liu, Figs. 2 and 3, Liu discloses the noise-reduction component 222 in combination with hardware processors, which would be interpreted as an NNE under BRI).
Regarding Claim 19,
Liu in view of Defossez discloses the system of claim 18.
Liu further discloses the NPU is NNE 110 of the Tensilica® NNE family (Liu, Figs. 2 and 3, Liu discloses the noise-reduction component 222 in combination with hardware processors, which would be interpreted as an NNE under BRI).
Claim 20 is a computer-readable storage device claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. Additionally,
Liu further discloses a computer-readable storage device storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method (Liu, Figs. 8 and 9, col.20, lls.62-67, "…the user device(s) 110 and/or the remote system(s) 120 may include their own dedicated processors, memory, and/or storage...processor(s) (804/904), memory (806/906)..."; "…non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure...").
…
Rationale for combination is similar to that provided for Claim 1.
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Defossez, further in view of Borgstrom et al., (US Pub No. 2023/0162758, hereinafter, Borgstrom).
Regarding Claim 7,
Liu in view of Defossez discloses the computer-implemented method of claim 6. Liu further discloses each of the encoding layers and the decoding layers includes a 2-dimensional convolution (Liu, Figs. 5 and 6, col.18, lls.33-36, "…Each dense layer 502 may perform an AxB two-dimensional convolution..."; col.19, lls.20-24, "…Each dense layer 604 may perform a transpose AxB two-dimensional convolution..."), but does not explicitly disclose the limitations of a batch normalization and a rectified linear unit activation following each encoding/decoding layer.
Defossez discloses a rectified linear unit activation (Defossez, 4.1 Convolutional auto-encoder, "Encoder…input channels, output channels and ReLU activation, followed by a convolution with kernel size 1, 2Ci output channels and gated linear units (GLU) as activation function...Decoder...The decoder is mostly the inverse of the encoder...a ReLU activation…").
Neither Liu nor Defossez explicitly discloses the batch normalization following the convolution layers. However, Borgstrom, in the analogous field of endeavor, discloses a batch normalization (Borgstrom, Fig.4D, par [020], "…the mask estimator can include a multi-layer fully convolutional network (FCN). The FCN can include a series of convolutional blocks. Each series can include a CNN filter process, a batch normalization process, an activation process...").
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the compact system and method for speech enhancement and noise reduction of Liu in view of Defossez with the batch normalization of the convolutional blocks within the mask-estimation fully convolutional network of Borgstrom, with a reasonable expectation of success, to improve the intelligibility of speech observed in acoustically adverse environments, as well as to lower the cognitive load required during listening (Borgstrom, paras [011-013]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Andreev et al., (US Pat No. 12,400,675, hereinafter, Andreev) discloses a system for audio waveform processing based on a generative adversarial network (GAN) generator that applies two-dimensional U-Net convolutional blocks to the mel-spectrogram and a learnable spectral masking module (SpectralMaskNet) (Andreev, Summary, col.2, ll.64 - col.4, ll.42).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JANGWOEN LEE whose telephone number is (703)756-5597. The examiner can normally be reached Monday-Friday 8:00 am - 5:00 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, BHAVESH MEHTA can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JANGWOEN LEE/Examiner, Art Unit 2656
/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656