Prosecution Insights
Last updated: April 19, 2026
Application No. 18/577,586

SPEECH ENHANCEMENT

Non-Final OA §103
Filed: Jan 08, 2024
Examiner: PATEL, YOGESHKUMAR G
Art Unit: 2691
Tech Center: 2600 (Communications)
Assignee: Dolby Laboratories Licensing Corporation
OA Round: 1 (Non-Final)
Grant Probability: 83% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 4m
With Interview: 86%

Examiner Intelligence

Career Allow Rate: 83% (538 granted / 650 resolved; +20.8% vs TC avg, above average)
Interview Lift: +3.4% (minimal), measured across resolved cases with interview
Avg Prosecution: 2y 4m typical; 17 applications currently pending
Total Applications: 667 across all art units

Statute-Specific Performance

§101: 4.7% (-35.3% vs TC avg)
§102: 14.4% (-25.6% vs TC avg)
§103: 61.9% (+21.9% vs TC avg)
§112: 14.2% (-25.8% vs TC avg)
Tech Center averages are estimates, based on career data from 650 resolved cases.
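As a quick sanity check on the table, the Tech Center baselines implied by the deltas can be recovered by subtracting each stated offset from the examiner's rate. A minimal sketch, assuming the "vs TC avg" figures are simple arithmetic offsets (the dashboard does not state how they are computed); note that every statute happens to imply the same 40.0% baseline:

```python
# Hypothetical back-calculation of the Tech Center averages implied by the
# table above (examiner rate minus the stated "vs TC avg" delta); the
# dictionaries simply restate the displayed figures.
examiner = {"§101": 4.7, "§102": 14.4, "§103": 61.9, "§112": 14.2}
delta = {"§101": -35.3, "§102": -25.6, "§103": 21.9, "§112": -25.8}
tc_avg = {k: round(v - delta[k], 1) for k, v in examiner.items()}
print(tc_avg)  # {'§101': 40.0, '§102': 40.0, '§103': 40.0, '§112': 40.0}
```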

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-36 were originally cancelled. Species 1: Claims 37-47 are withdrawn from consideration [Not Elected]. Species 2: Claims 48-56 are elected.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action: "A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made."

Claims 48-56 are rejected under 35 U.S.C. 103 as being unpatentable over Sargsyan et al. (US 2020/0066296) in view of Tan et al. (WO 2020/042707 A1).

Regarding Claim 48, Sargsyan discloses a method for enhancing audio signals (Fig. 1; ¶0004: a method of speech enhancement in which the neural network contains a structure and parameters based on previous training using predefined noise data and clean speech data to result in a known ratio mask), comprising: obtaining, by a control system, a distorted audio signal (Sargsyan ¶0030 discloses several overlapped frames of noisy speech); generating, by the control system, a frequency-domain representation of the distorted audio signal (Sargsyan ¶0030 discloses computing Fourier coefficients of each frame); and providing, by the control system, the frequency-domain representation to a trained machine learning model (Sargsyan ¶0030 discloses concatenating them with a noise model and taking them as an input for a neural network [NN]).

Sargsyan may not explicitly disclose wherein the trained machine learning model comprises a convolutional neural network (CNN) comprising a plurality of convolutional layers and a recurrent element, wherein an output of the recurrent element is provided to a subset of the plurality of convolutional layers; determining, by the control system, an enhancement mask based on an output of the trained machine learning model; generating, by the control system, a spectrum of an enhanced audio signal based at least in part on the enhancement mask and the distorted audio signal; and generating, by the control system, the enhanced audio signal based on the spectrum of the enhanced audio signal.

However, Tan (title, abstract, Figs. 1-13) teaches wherein the trained machine learning model comprises a convolutional neural network (CNN) (Tan ¶0021 discloses that the convolutional recurrent neural network model is constructed by training a pre-collected speech training set using the convolutional recurrent neural network; ¶0027 discloses that the speech training set is composed of background noise collected in daily environments, various types of male and female voices, and speech signals mixed at specific signal-to-noise ratios) comprising a plurality of convolutional layers and a recurrent element (Tan ¶0022 discloses that the convolutional neural network is a convolutional encoder-decoder structure, wherein the encoder includes a set of convolutional layers and pooling layers, the structure of the decoder is the same as that of the encoder in reverse order, and the output of the encoder is connected to the input of the decoder; ¶0120 discloses that the CRN model has five convolutional layers in the encoder, five deconvolutional layers in the decoder, and two LSTM layers between the encoder and decoder), wherein an output of the recurrent element is provided to a subset of the plurality of convolutional layers (Tan ¶0095 discloses that the convolution operation in a CNN is based on a two-dimensional structure; its definition of the local receptive field is that each low-level feature is only related to a subset of the input, such as the topological neighborhood); determining, by the control system, an enhancement mask based on an output of the trained machine learning model (Tan ¶0082 discloses that the acoustic features obtained in step S110 are used as input to a convolutional recurrent neural network model, in which iterative calculations are performed to calculate the ratio mask for the acoustic features; ¶0083 discloses that the IRM [Ideal Ratio Mask] is used as the target of the iterative calculation, and the IRM of each time-frequency cell in the spectrum can be expressed by the equation in ¶0085; ¶0086 discloses that by predicting an ideal ratio mask during supervised training, and then using the ratio mask to mask acoustic features, a denoised speech signal can be obtained; ¶0087 discloses step S130: use a ratio mask to mask the acoustic features; ¶0088 discloses that in step S140 the masked acoustic features are synthesized with the phase of the single-channel sound signal to obtain a denoised speech signal); generating, by the control system, a spectrum of an enhanced audio signal based at least in part on the enhancement mask and the distorted audio signal (Tan ¶0089 discloses that once trained, the CRN [Convolutional Recurrent Network] can be used for speech denoising applications; using a trained neural network for a specific application is called inference; during the inference phase, each layer of the CRN model processes the noisy signal, and the T-F mask derived from the inference results is used to weight [mask] the amplitude of the noisy speech to produce a clearer enhanced speech signal than the original noisy input); and generating, by the control system, the enhanced audio signal based on the spectrum of the enhanced audio signal (Tan ¶0090 discloses that the masked spectral amplitude vector, along with the phase of the single-channel audio signal, is sent to the inverse Fourier transform to derive the speech signal in the corresponding time domain).

Sargsyan and Tan are analogous art, as they pertain to speech enhancement.
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the speech enhancement system taught by Sargsyan to use a pre-trained convolutional recurrent neural network model as taught by Tan, since this scheme greatly reduces the number of neural network parameters, the amount of data storage, and the system bandwidth requirement, while achieving good noise reduction performance and greatly improving the real-time performance of single-channel speech noise reduction (Tan, ¶0091).

Regarding Claim 49, Sargsyan in view of Tan discloses the method of claim 48, wherein obtaining the frequency-domain representation of the distorted audio signal comprises: generating an initial frequency-domain representation of the distorted audio signal (Sargsyan ¶0030 discloses several overlapped frames of noisy speech; claims 1-3 recite performing a discrete Fourier transform on each frame of first, second, and third subsets of said plurality of frames to provide a plurality of frequency-domain outputs); and applying a filter that represents filtering of a human cochlea to the initial frequency-domain representation of the distorted audio signal to generate the frequency-domain representation of the distorted audio signal (Sargsyan ¶0035 discloses that the input of the NN is computed using bark scale band features; Figs. 1, 2).

Regarding Claim 50, Sargsyan in view of Tan discloses the method of claim 49, but Sargsyan may not explicitly disclose wherein the plurality of convolutional layers comprise a first subset of convolutional layers with increasing dilation values and a second subset of convolutional layers with decreasing dilation values. However, Tan (title, abstract, Figs. 1-13) teaches this limitation (Tan ¶0021 discloses that the convolutional recurrent neural network model is constructed by training a pre-collected speech training set using the convolutional recurrent neural network; ¶0027 discloses that the speech training set is composed of background noise collected in daily environments, various types of male and female voices, and speech signals mixed at specific signal-to-noise ratios; ¶0095 discloses that the topological local constraints within a convolutional layer make the weight matrix very sparse, so the two layers connected by the convolution operation only have local connections). This claim recites implementation details of the CNN architecture; in the absence of any remarkable associated technical effect, the claim is not regarded as inventive. Sargsyan and Tan are analogous art, as they pertain to speech enhancement. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Sargsyan's speech enhancement system as taught by Tan, for the same reasons given for claim 48 (Tan, ¶0091).

Regarding Claim 51, Sargsyan in view of Tan discloses the method of claim 50, but Sargsyan may not explicitly disclose wherein an output of a convolutional layer of the first subset of convolutional layers is passed to a convolutional layer of the second subset of convolutional layers having a same dilation value. However, Tan (title, abstract, Figs. 1-13) teaches this limitation (Tan ¶0095 discloses that the convolution operation in a CNN is based on a two-dimensional structure; its definition of the local receptive field is that each low-level feature is only related to a subset of the input, such as the topological neighborhood). This claim recites implementation details of the CNN architecture; in the absence of any remarkable associated technical effect, the claim is not regarded as inventive. Sargsyan and Tan are analogous art, as they pertain to speech enhancement. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Sargsyan's speech enhancement system as taught by Tan, for the same reasons given for claim 48 (Tan, ¶0091).

Regarding Claim 52, Sargsyan in view of Tan discloses the method of claim 50, but Sargsyan may not explicitly disclose wherein the output of the recurrent element is provided to the second subset of convolutional layers. However, Tan (title, abstract, Figs. 1-13) teaches this limitation (Tan ¶0095, as cited for claim 51). This claim recites implementation details of the CNN architecture; in the absence of any remarkable associated technical effect, the claim is not regarded as inventive. Sargsyan and Tan are analogous art, as they pertain to speech enhancement.
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Sargsyan's speech enhancement system as taught by Tan, for the same reasons given for claim 48 (Tan, ¶0091).

Regarding Claim 53, Sargsyan in view of Tan discloses the method of claim 48, but Sargsyan may not explicitly disclose wherein the output of the recurrent element is provided to the subset of the plurality of convolutional layers by reshaping the output of the recurrent element. However, Tan (title, abstract, Figs. 1-13) teaches this limitation (Tan ¶0093 discloses step S121: combine the convolutional neural network with a recurrent neural network with long short-term memory to obtain a convolutional recurrent neural network; ¶0095 discloses that the convolution operation in a CNN is based on a two-dimensional structure, and its definition of the local receptive field is that each low-level feature is only related to a subset of the input, such as the topological neighborhood). This claim recites implementation details of the CNN architecture; in the absence of any remarkable associated technical effect, the claim is not regarded as inventive. Sargsyan and Tan are analogous art, as they pertain to speech enhancement. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Sargsyan's speech enhancement system as taught by Tan, for the same reasons given for claim 48 (Tan, ¶0091).

Regarding Claim 54, Sargsyan in view of Tan discloses the method of claim 48, but Sargsyan may not explicitly disclose wherein the recurrent element is at least one of a gated recurrent unit (GRU), a long short-term memory (LSTM) network, or an Elman recurrent neural network (RNN). However, Tan (title, abstract, Figs. 1-13) teaches this limitation (Tan ¶0097 discloses that a recurrent neural network with long short-term memory [LSTM] is a type of temporal recurrent neural network; ¶0108 discloses that convolutional recurrent neural networks [CRNs] are obtained by combining convolutional neural networks with recurrent neural networks that have long short-term memory, and that a CRN combines the characteristics of a CNN and an LSTM, enabling efficient speech denoising while significantly reducing the number of neural network parameters and effectively decreasing the size of the CRN). Sargsyan and Tan are analogous art, as they pertain to speech enhancement. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Sargsyan's speech enhancement system as taught by Tan, for the same reasons given for claim 48 (Tan, ¶0091).

Regarding Claim 55, Sargsyan in view of Tan discloses the method of claim 48, wherein generating the enhanced audio signal comprises multiplying the enhancement mask by the frequency-domain representation of the distorted audio signal (Sargsyan ¶0040 discloses that the systems and methods then multiply successive frames by the scale coefficient in order to scale the energy of the signal to the target energy, and that post-processing tools also include ratio mask modification; ¶0061 discloses first creating the input matrix that will be input to the NN by taking the ratio mask predicted by the NN for the previous frame and multiplying it by the amplitudes of the Fourier coefficients of the previous frame of the noisy speech audio).

Regarding Claim 56, Sargsyan in view of Tan discloses the method of claim 48, wherein the distorted audio signal is a live-captured audio signal and/or includes one or more of reverberation or noise (Sargsyan ¶0060 discloses processing noisy speech audio to obtain speech enhancement, where the noisy speech audio can be audio that has not previously been used in training the NN; ¶0026 discloses that the methods and systems can further be used for noise-robust automatic speech recognition [i.e., speech to text] and for improved machine understanding of intents in human speech [Alexa, Google Home, etc.]).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL, whose telephone number is (571) 272-3957. The examiner can normally be reached 7:30 AM-4 PM PST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Duc Nguyen, can be reached at (571) 272-7503. The fax number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/YOGESHKUMAR PATEL/
Primary Examiner, Art Unit 2691
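For readers who want to see the claimed method as the Office Action characterizes it, the sketch below restates claims 48-55 in code: an STFT front end, a convolutional encoder with increasing dilation, a recurrent element whose reshaped output feeds a decoder with decreasing dilation and same-dilation skip connections, and a ratio mask multiplied against the noisy spectrum before inversion. This is a minimal PyTorch sketch, not the applicant's or the cited references' actual model; all layer counts, channel widths, and dilation values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DilatedCRN(nn.Module):
    """Illustrative CNN + recurrent model (cf. claims 48, 50-54)."""
    def __init__(self, bands: int = 129, hidden: int = 64):
        super().__init__()
        # First subset: convolutional layers with increasing dilation (claim 50).
        self.enc = nn.ModuleList(
            nn.Conv1d(bands if d == 1 else hidden, hidden,
                      kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4)
        )
        # Recurrent element; claim 54 permits a GRU, LSTM, or Elman RNN.
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        # Second subset: decreasing dilation (claim 50); each layer also receives
        # the encoder output with the same dilation value (claim 51).
        self.dec = nn.ModuleList(
            nn.Conv1d(hidden * 2, hidden,
                      kernel_size=3, padding=d, dilation=d)
            for d in (4, 2, 1)
        )
        self.out = nn.Conv1d(hidden, bands, kernel_size=1)

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: (batch, bands, frames) magnitude spectrogram.
        x, skips = mag, []
        for layer in self.enc:
            x = torch.relu(layer(x))
            skips.append(x)
        # Reshape (transpose) conv features for the LSTM and back, so the
        # recurrent output feeds the second conv subset (claims 52-53).
        r, _ = self.rnn(x.transpose(1, 2))
        x = r.transpose(1, 2)
        for layer, skip in zip(self.dec, reversed(skips)):
            x = torch.relu(layer(torch.cat([x, skip], dim=1)))
        # Sigmoid keeps the estimated ratio mask in [0, 1], matching the IRM
        # target commonly defined as sqrt(S^2 / (S^2 + N^2)).
        return torch.sigmoid(self.out(x))

def enhance(noisy: torch.Tensor, model: nn.Module, n_fft: int = 256) -> torch.Tensor:
    """Claims 48 and 55: STFT -> model -> mask x spectrum -> inverse STFT."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy, n_fft, window=window, return_complex=True)
    mask = model(spec.abs().unsqueeze(0)).squeeze(0)
    return torch.istft(spec * mask, n_fft, window=window, length=noisy.shape[-1])

# Example: enhance one second of (random) 16 kHz audio.
enhanced = enhance(torch.randn(16000), DilatedCRN(bands=129))
```

The skip wiring in forward() is one plausible reading of claims 51-52; the actual claim scope is defined by the claim language, not by this sketch.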

Prosecution Timeline

Jan 08, 2024: Application Filed
Jan 25, 2026: Non-Final Rejection (§103)
Mar 20, 2026: Interview Requested
Apr 01, 2026: Applicant Interview (Telephonic)
Apr 01, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12598426: CHANGE OF A MODE FOR CAPTURING IMMERSIVE AUDIO (2y 5m to grant; granted Apr 07, 2026)
Patent 12596525: METHOD TO DETERMINE INTENDED DIRECTION OF A VOCAL COMMAND AND TARGET FOR VOCAL INTERACTION (2y 5m to grant; granted Apr 07, 2026)
Patent 12592675: AUDIO DEVICE WITH MICROPHONE AND MEDIA MIXING (2y 5m to grant; granted Mar 31, 2026)
Patent 12593010: COMMUNICATION ASSEMBLY (2y 5m to grant; granted Mar 31, 2026)
Patent 12587448: AI-BASED NETWORK TROUBLESHOOTING WITH EXPERT FEEDBACK (2y 5m to grant; granted Mar 24, 2026)
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
With Interview: 86% (+3.4%)
Median Time to Grant: 2y 4m
PTA Risk: Low
Based on 650 resolved cases by this examiner; grant probability is derived from the career allow rate.
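A minimal sketch of how the headline figures reconcile, assuming (as stated above) that grant probability is simply the examiner's career allow rate and that the interview lift is additive; the variable names are illustrative:

```python
granted, resolved = 538, 650                  # from the examiner's career data
allow_rate = granted / resolved               # 0.8277 -> shown as "83%"
interview_lift = 0.034                        # the "+3.4%" interview lift
with_interview = allow_rate + interview_lift  # 0.8617 -> shown as "86%"
print(f"{allow_rate:.0%} baseline, {with_interview:.0%} with interview")
```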
