Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 02/04/2026 has been entered.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-4, 6, 13-19, 21-24, and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al (20210142148) in view of Zhang et al (20210168554).
As per claim 1, Wang et al (20210142148) teaches a method for end-to-end speech enhancement based on a neural network (as, using neural networks to perform end-to-end processing – para 0003 and para 0024) comprising:
obtaining, by a server or a terminal device (para 0020 – tablet computer/server),
a time-domain smoothing (as, global average pooling operation – para 0043-0044) feature (as, compressing features – para 0043) of an original speech signal by performing feature extraction on the original speech signal using a time-domain convolution kernel (as, using a time-frequency structure – para 0026; and a convolution kernel – para 0031);
and obtaining, by the server or the terminal device, an enhanced speech signal by performing combined feature extraction on the original speech signal and the time-domain smoothing feature of the original speech signal (as, re-generating an output signal – figure 7, by taking a multilayer perceptron, and combining – fig. 7, subblock s263, s265-s267; with a potential application being sound source separation – para 0045, last 2 sentences).
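For illustration only, the claimed flow as mapped above can be sketched as follows (a minimal, hypothetical PyTorch sketch with invented names such as EndToEndEnhancer; it assumes a 1-D smoothing convolution followed by a combining network, and is not Wang's or the applicant's exact architecture):

    import torch
    import torch.nn as nn

    class EndToEndEnhancer(nn.Module):
        # Hypothetical sketch: a 1-D "smoothing" convolution extracts a
        # time-domain smoothing feature; a second stage performs combined
        # feature extraction on the raw signal plus the smoothing feature.
        def __init__(self, kernel_size=16):
            super().__init__()
            # time-domain convolution kernel used for feature extraction
            self.smooth = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2)
            # combined feature extraction over [original, smoothing feature]
            self.combine = nn.Sequential(
                nn.Conv1d(2, 16, 9, padding=4), nn.ReLU(),
                nn.Conv1d(16, 1, 9, padding=4),
            )

        def forward(self, x):            # x: (batch, 1, samples)
            s = self.smooth(x)           # time-domain smoothing feature
            s = s[..., : x.shape[-1]]    # align lengths for concatenation
            y = self.combine(torch.cat([x, s], dim=1))
            return y                     # enhanced speech signal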
As per claim 1, Wang et al (20210142148) teaches incorporating a time-domain smoothing algorithm into a deep neural network as a one-dimensional convolution module (as, one-dimensional neural network signal separator – para 0048), with a time-domain smoothing feature (as, operating on input parameters – para 0007, para 0025, which operate on well-known parameters of the STFT). However, Wang et al (20210142148) does not explicitly teach noise smoothing as part of an explicitly defined feature (Wang et al (20210142148) teaches operating in the time domain on ‘well-known features’, but does not explicitly define noise smoothing). Zhang et al (20210168554) teaches the use of time-recursive average noise estimation to reduce the noise in sound signals – para 0204. Therefore, it would have been obvious to one of ordinary skill in the art of sound separation/extraction to modify the sound processing of Wang et al (20210142148) with additional noise averaging/estimation, as taught by Zhang et al (20210168554), because it would advantageously reduce the noise artifacts in the recorded sound (see para 0204, last half).
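For reference, time-recursive average noise estimation of the general kind cited from Zhang et al (20210168554), para 0204, is conventionally written as N_t = a*N_{t-1} + (1-a)*|X_t|^2. A minimal numpy sketch follows (illustrative only; the smoothing factor alpha is chosen arbitrarily and is not taken from the reference):

    import numpy as np

    def recursive_noise_estimate(frames, alpha=0.95):
        # frames: (num_frames, num_bins) magnitude spectra.
        # Time-recursive averaging: each frame's noise estimate is a
        # weighted blend of the previous estimate and the current frame.
        noise = np.zeros(frames.shape[1])
        estimates = []
        for mag in frames:
            noise = alpha * noise + (1.0 - alpha) * mag ** 2
            estimates.append(noise.copy())
        return np.array(estimates)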
As per claim 2, the combination of Wang et al (20210142148) in view of Zhang et al (20210168554) teaches the method for end-to-end speech enhancement according to claim 1, wherein obtaining a time-domain smoothing feature of an original speech signal by performing feature extraction on the original speech signal using a time-domain convolution kernel comprises:
determining a time-domain smoothing parameter matrix according to a convolution sliding window and a time-domain smoothing factor (as, Wang et al (20210142148), interpolating using nearest neighbor, in the time domain – page 3, second column, lines 5-10);
obtaining a weight matrix of the time-domain convolution kernel by performing a product operation on the time-domain smoothing parameter matrix (as, Wang et al (20210142148), weight matrix with product calculations – lines 17-40 – see interpolation referrals, and multiplication/addition of the weights);
and obtaining the time-domain smoothing feature of the original speech signal by performing a convolution operation on the weight matrix of the time-domain convolution kernel and the original speech signal (as, Wang et al (20210142148), using the weights of the convolution kernel – para 0031; and page 3, second column, after equation (1), the explanation of the weights, and after equation 3).
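Read on its face, claim 2 recites building a smoothing parameter matrix from a sliding-window length and a smoothing factor, deriving kernel weights from it by a product operation, and convolving the weights with the signal. One way to realize that reading is sketched below (a hypothetical sketch assuming exponential-smoothing weights; the function names and the geometric-decay form are the examiner's illustration, not the applicant's formulation):

    import numpy as np

    def smoothing_kernel(window, beta):
        # time-domain smoothing parameter matrix from a convolution
        # sliding window of length `window` and smoothing factor `beta`
        params = beta ** np.arange(window)        # geometric decay
        weights = (1.0 - beta) * params           # product operation -> weights
        return weights / weights.sum()            # normalize to unit gain

    def smoothing_feature(signal, window=8, beta=0.7):
        w = smoothing_kernel(window, beta)
        # convolution of the kernel's weight vector with the original signal
        return np.convolve(signal, w, mode="same")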
As per claim 3, the combination of Wang et al (20210142148) in view of Zhang et al (20210168554) teaches the method for end-to-end speech enhancement according to claim 2, wherein determining a time-domain smoothing parameter matrix according to a convolution sliding window and a time-domain smoothing factor comprises:
initializing a plurality of time-domain smoothing factors; and obtaining the time-domain smoothing parameter matrix based on a preset convolution sliding window and the plurality of time-domain smoothing factors (as, Wang et al (20210142148), fixed convolution functions – para 0042, operating/calculating pooling functions – para 0043-0044; and interpolations as well – para 0034).
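Claim 3 extends the claim 2 construction to a plurality of smoothing factors. A hypothetical extension of the sketch above, stacking one parameter row per initialized factor (the specific betas are arbitrary illustration values):

    import numpy as np

    def smoothing_parameter_matrix(window=8, betas=(0.5, 0.7, 0.9)):
        # One row per initialized time-domain smoothing factor; columns
        # span the preset convolution sliding window.
        rows = [(1.0 - b) * b ** np.arange(window) for b in betas]
        return np.stack(rows)          # shape: (num_factors, window)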
As per claim 4, the combination of Wang et al (20210142148) in view of Zhang et al (20210168554) teaches the method for end-to-end speech enhancement according to claim 1, wherein obtaining an enhanced speech signal by performing combined feature extraction on the original speech signal and the time-domain smoothing feature of the original speech signal comprises:
obtaining a speech signal to be enhanced by combining the original speech signal and the time-domain smoothing feature of the original speech signal (as, Wang et al (20210142148), re-generating an output signal – figure 7, by taking a multilayer perceptron, and combining – fig. 7, subblock s263, s265-s267; with a potential application being sound source separation – para 0045, last 2 sentences);
training a weight matrix of the time-domain convolution kernel by using a back propagation algorithm with the speech signal to be enhanced as an input of a deep neural network (as, Wang et al (20210142148), projecting features back to a time domain, using weights from the kernel – para 0042); and
obtaining the enhanced speech signal by performing combined feature extraction on the speech signal to be enhanced according to the weight matrix obtained by training (as, Wang et al (20210142148), generating the enhanced speech signal after feature extraction and weighting – see Figure 7, obtaining a mask, filtering, using a new convolution kernel with input from the neural network/multilayer perceptron, with an output signal).
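A hypothetical training loop for the claimed back-propagation step, assuming the EndToEndEnhancer sketch shown under claim 1 and an MSE objective against a clean reference (the claims do not specify a loss function until claim 5; the objective here is the examiner's illustration only):

    import torch

    model = EndToEndEnhancer()   # hypothetical model from the claim 1 sketch
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    def train_step(noisy, clean):      # both: (batch, 1, samples)
        opt.zero_grad()
        enhanced = model(noisy)
        loss = torch.nn.functional.mse_loss(enhanced, clean)
        loss.backward()                # back propagation trains the kernel
        opt.step()                     # weight matrix (and combiner) update
        return loss.item()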
As per claim 6, the combination of Wang et al (20210142148) in view of Zhang et al (20210168554) teaches the method for end-to-end speech enhancement according to claim 4, wherein obtaining the enhanced speech signal by performing combined feature extraction on the speech signal to be enhanced according to the weight matrix obtained by training comprises:
obtaining a first time-domain feature map by performing a convolution operation on the weight matrix obtained by training and an original speech signal in the speech signal to be enhanced (as, Wang et al (20210142148), using the Wave-U-Net – analyzing the time-frequency domain, and convolutional neural network – para 0026 and para 0042);
obtaining a second time-domain feature map by performing a convolution operation on the weight matrix obtained by training and a smoothing feature in the speech signal to be enhanced (as, Wang et al (20210142148), weight matrix with product calculations – lines 17-40 – see interpolation referrals, and multiplication/addition of the weights); and obtaining the enhanced speech signal by combining the first time-domain feature map and the second time-domain feature map (as, Wang et al (20210142148), combining the high-level and low-level calculations into a single decoded signal – para 0034, page 3, second column, after equation #3).
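Claim 6's two-branch structure can be sketched as follows (hypothetical; it assumes the trained weight vector is applied separately to the raw signal and to its smoothing feature, with the two feature maps then summed, which is one reading of "combining" rather than a mapping of the references):

    import numpy as np

    def enhance(signal, smooth_feat, trained_w):
        # first time-domain feature map: trained weights * original signal
        fmap1 = np.convolve(signal, trained_w, mode="same")
        # second time-domain feature map: trained weights * smoothing feature
        fmap2 = np.convolve(smooth_feat, trained_w, mode="same")
        # combine the two feature maps into the enhanced signal
        return fmap1 + fmap2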
As per claim 15, the combination of Wang et al (20210142148) in view of Zhang et al (20210168554) teaches the method for end-to-end speech enhancement according to claim 1, wherein obtaining a time-domain smoothing feature of an original speech signal by performing feature extraction on the original speech signal using a time-domain convolution kernel comprises:
performing speech enhancement on phase information and amplitude information in the original speech signal by inputting the original speech signal into a deep neural network for time-varying feature extraction (as, Wang et al (20210142148), operating on input parameters – para 0007, para 0025, which operate on well-known parameters of the STFT – short-time Fourier transform, which, by definition, carries phase and amplitude information – see also para 0003).
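As the mapping notes, an STFT carries both quantities by definition. For reference, a standard numpy computation of the amplitude and phase of an STFT (illustrative only; frame and hop sizes are arbitrary):

    import numpy as np

    def stft_mag_phase(signal, frame=512, hop=256):
        # Frame the signal, window it, and take the FFT of each frame;
        # the complex STFT splits into amplitude and phase information.
        n = 1 + (len(signal) - frame) // hop
        frames = np.stack([signal[i*hop : i*hop + frame] for i in range(n)])
        spec = np.fft.rfft(frames * np.hanning(frame), axis=1)
        return np.abs(spec), np.angle(spec)   # amplitude, phase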
As per claim 16, the combination of Wang et al (20210142148) in view of Zhang et al (20210168554) teaches the method for end-to-end speech enhancement according to claim 1, wherein the original speech signal is represented by a one-dimensional vector (as, Wang et al (20210142148), one dimensional neural network signal separator – para 0048).
Claims 13, 17, 18, 19, and 21 are computer-readable storage medium claims that perform the steps of method claims 1-4, 6, 15, and 16; as such, claims 13, 17-19, and 21 are similar in scope and content to claims 1-4, 6, 15, and 16, and are therefore rejected under a similar rationale as presented against claims 1-4, 6, 15, and 16 above. Furthermore, Wang et al (20210142148) teaches storage mediums storing the instructional steps – para 0024.
Claims 14, 22-24, and 26 are electronic device claims performing the steps found in method claims 1-4, 6, 15, and 16; as such, claims 14, 22-24, and 26 are similar in scope and content to claims 1-4, 6, 15, and 16, and are therefore rejected under a similar rationale as presented against claims 1-4, 6, 15, and 16 above. Furthermore, Wang et al (20210142148) teaches a processor accessing memory and executing the stored instructions – para 0007.
Claim(s) 5, 20, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al (20210142148) in view of Zhang et al (20210168554), and further in view of Arik et al (20190355347).
As per claims 5, 20, and 25, the combination of Wang et al (20210142148) in view of Zhang et al (20210168554) teaches the method for end-to-end speech enhancement according to claim 4, wherein training a weight matrix of the time-domain convolution kernel by using a back propagation algorithm with the speech signal to be enhanced as an input of a deep neural network comprises (as mapped above, in claims 1 and 4):
inputting the speech signal to be enhanced into the deep neural network (Wang et al (20210142148), para 0003); but does not explicitly teach:
“and constructing a time-domain loss function; and training the weight matrix of the time-domain convolution kernel by using an error back propagation algorithm according to the time-domain loss function” (Wang mentions loss only in passing, in para 0030). Arik et al (20190355347) explicitly teaches the use/calculation of loss functions during the execution of the short-time Fourier transform (see para 0070-0075), as well as envelope loss during the convolution operators (para 0076-0080). Therefore, it would have been obvious to one of ordinary skill in the art of convolutional processing of audio signals to further detail the STFT processing in Wang et al (20210142148) in view of Zhang et al (20210168554) with loss functions for the STFT parameters, as well as the loss function tied to the convolutional calculations, as taught by Arik et al (20190355347), because it would advantageously quantify “how close”, or the “quality” of, the measurement compared to ground-truth information (Arik et al (20190355347), para 0078).
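For context, a time-domain loss of the kind the claim recites can be as simple as a sample-wise error on the waveform, while Arik et al's cited paragraphs describe STFT-based losses. A hedged sketch of both notions follows (illustrative only; hyperparameters and function names are the examiner's, and waveforms are assumed to be shaped (batch, samples)):

    import torch

    def time_domain_loss(enhanced, clean):
        # sample-wise L1 error on the raw waveforms
        return torch.mean(torch.abs(enhanced - clean))

    def stft_magnitude_loss(enhanced, clean, n_fft=512):
        # STFT-magnitude error of the general kind Arik et al (20190355347)
        # computes during STFT processing (paras 0070-0075); illustrative.
        win = torch.hann_window(n_fft)
        E = torch.stft(enhanced, n_fft, window=win, return_complex=True)
        C = torch.stft(clean, n_fft, window=win, return_complex=True)
        return torch.mean(torch.abs(E.abs() - C.abs()))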
Response to Arguments
Applicant's amendments to the abstract have overcome that rejection, which has been withdrawn. Applicant's arguments filed 02/04/2026 have been fully considered but are moot in view of the new grounds of rejection. Examiner notes the introduction of the Zhang et al (20210168554) reference to address the new claim limitations directed toward noise averaging to reduce noise in the sound signal.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Please see related art listed on the PTO-892 form.
Furthermore, the following references were found that teach claim/specification features:
Catanzaro et al (20170148431) teaches end-to-end speech recognition using convolutional kernels in CTC-RNN models – para 0043-0045.
Mesgarani et al (20190066713) teaches speech separation using neural networks (para 0079-0080) processing time-frequency bins (para 0081-0082), and smoothing/cleaning the speech signal via spectrogram processing (para 0087).
Tashev et al (20190318755) teaches end-to-end models operating on spectrograms in both the time and frequency domains (para 0035), using neural networks processing convolutional kernels (para 0036).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Opsasnick, telephone number (571)272-7623, who is available Monday-Friday, 9am-5pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mr. Richemond Dorvil, can be reached at (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
/Michael N Opsasnick/Primary Examiner, Art Unit 2658 02/18/2026