Prosecution Insights
Last updated: April 19, 2026
Application No. 18/431,496

GENERATION OF A PERSONALIZED SPEECH REPRESENTATION WITHIN AN AUDIO ENHANCEMENT MODEL

Status: Non-Final Office Action (§103)
Filed: Feb 02, 2024
Examiner: WITHEY, THEODORE JOHN
Art Unit: 2655
Tech Center: 2600 — Communications
Assignee: Microsoft Technology Licensing, LLC
OA Round: 1 (Non-Final)

Grant Probability: 44% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 2y 11m
With Interview: 90%

Examiner Intelligence

Career Allow Rate: 44% (grants 44% of resolved cases; 10 granted / 23 resolved), -18.5% vs Tech Center average
Interview Lift: +46.9% on resolved cases with interview (strong)
Typical Timeline: 2y 11m average prosecution; 39 applications currently pending
Career History: 62 total applications across all art units

Statute-Specific Performance

§101: 22.0% (-18.0% vs TC avg)
§103: 48.6% (+8.6% vs TC avg)
§102: 17.1% (-22.9% vs TC avg)
§112: 12.0% (-28.0% vs TC avg)
Tech Center averages are estimates; based on career data from 23 resolved cases.
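The reported deltas are consistent with a Tech Center average allowance rate of roughly 40% on each statute. A minimal sketch of that arithmetic (the helper is hypothetical and not part of this report; the rates are copied from the table above):

```python
# Hypothetical helper: recover the implied Tech Center average from the
# examiner's per-statute allowance rate and the reported delta vs. TC average.
examiner_rates = {"101": 22.0, "103": 48.6, "102": 17.1, "112": 12.0}   # percent
deltas_vs_tc = {"101": -18.0, "103": 8.6, "102": -22.9, "112": -28.0}   # points

for statute, rate in examiner_rates.items():
    tc_avg = rate - deltas_vs_tc[statute]   # delta = examiner rate - TC average
    print(f"§{statute}: examiner {rate:.1f}% vs TC average ≈ {tc_avg:.1f}%")
```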

Office Action

§103
DETAILED ACTION

The examiner would like to note that the claims have been deemed to contain eligible subject matter under 35 U.S.C. 101 because the concepts of producing encodings of audio to be used to generate a representation of speech characteristics, further used to suppress noise from a second signal based on the generated speech representation is not something which can reasonably be performed mentally, nor is there recitation of explicit mathematical operations for performing these functions. Further, claims 19-20, directed to a computer-readable storage medium, are eligible as [0087] of the instant application excludes a signal interpretation of this medium.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement(s) submitted on 03/20/2024 is/are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement(s) is/are being considered by the examiner.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-4, 13, 18-19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gopalakrishnan et al. (US-8965005-B1), hereinafter Gopalakrishnan in view of Ramos et al. (US-20230298593-A1), hereinafter Ramos.

Regarding claim 1, Gopalakrishnan discloses: a method comprising: obtaining a first microphone signal that includes speech by a user ([Fig. 5, Receive first audio 505], [Col. 6, Lines 16-17] the audio signal is a speech audio signal (e.g., from a mobile phone)); and, obtaining a second microphone signal that includes speech by the user ([Fig. 5, Receive second audio 515], [In view of the previous mapping of Gopalakrishnan disclosing the audio signals to be speech, indicating the second audio signal can also reasonably be considered to be representative of speech]).
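For orientation, the claimed arrangement summarized in the eligibility discussion above (encode a first signal, derive a speech representation from the encodings, then condition enhancement of a second signal on that representation) can be sketched roughly as follows. This is a minimal, hypothetical PyTorch illustration: the module names, sizes, frame-averaged embedding, and mask-based output are assumptions for illustration only, not the applicant's disclosed model or the mechanism of any cited reference.

```python
import torch
import torch.nn as nn

class PersonalizedEnhancer(nn.Module):
    """Hypothetical sketch of the claim-1 flow: encode, embed, condition, mask."""

    def __init__(self, n_features: int = 257, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)   # recurrent encoder
        self.mask_head = nn.Sequential(nn.Linear(hidden + hidden, n_features), nn.Sigmoid())

    def embed_speaker(self, first_signal_feats: torch.Tensor) -> torch.Tensor:
        # "One or more first encodings" produced from the first microphone signal ...
        encodings, _ = self.encoder(first_signal_feats)        # (batch, frames, hidden)
        # ... averaged over frames to form a fixed speech representation (cf. claim 4).
        return encodings.mean(dim=1)                           # (batch, hidden)

    def enhance(self, second_signal_feats: torch.Tensor, speaker_rep: torch.Tensor) -> torch.Tensor:
        encodings, _ = self.encoder(second_signal_feats)       # (batch, frames, hidden)
        rep = speaker_rep.unsqueeze(1).expand(-1, encodings.size(1), -1)
        conditioned = torch.cat([encodings, rep], dim=-1)      # frame-wise conditioning
        mask = self.mask_head(conditioned)                     # (batch, frames, n_features)
        return mask * second_signal_feats                      # suppress non-target sources

# Toy usage on random "spectrogram" features (1 utterance, 257 bins per frame).
model = PersonalizedEnhancer()
enrollment = torch.randn(1, 50, 257)   # first microphone signal (enrollment)
noisy_call = torch.randn(1, 80, 257)   # second microphone signal
rep = model.embed_speaker(enrollment)
enhanced = model.enhance(noisy_call, rep)
print(enhanced.shape)  # torch.Size([1, 80, 257])
```

In this sketch the frame average stands in for the hidden-state representation recited in claim 4; a pre-computed enrollment embedding (e.g., an x-vector style speaker embedding as described for Ramos) could plausibly occupy the same conditioning slot.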
Gopalakrishnan does not disclose: inputting the first microphone signal into a trained audio enhancement model, the trained audio enhancement model producing one or more first encodings from the first microphone signal; generating a representation of speech characteristics of the user based at least on the one or more first encodings; inputting the second microphone signal into the trained audio enhancement model with the representation of the speech characteristics of the user; wherein inputting the representation of the speech characteristics of the user adapts the trained audio enhancement model to suppress sound sources in the second microphone signal other than the speech of the user. Ramos discloses: inputting the first microphone signal into a trained audio enhancement model ([0120] inputting the corrupted audio sample into an encoder module of the ML model, [0073] using the trained ML model to perform sound enhancement of audio signals [In view of the previously disclosed first microphone signal of Gopalakrishnan which could be used as the “corrupted”, i.e. by noise, audio of Ramos without a change in functionality to Ramos. Further, consider the enrolment for generating speaker embeddings of [0153], indicating the enrolment signal to be a first microphone signal as compared to the corrupted signal]), the trained audio enhancement model producing one or more first encodings from the first microphone signal ([0120] concatenating a vector with each frame of the corrupted audio sample after processing by the encoder module, to generate a modified corrupted audio sample, [Concatenating output from an encoder, indicating encodings as output, for multiple frames indicates one or more first encodings]); generating a representation of speech characteristics of the user based at least on the one or more first encodings ([0109] the speaker embedding vector is pre-computed using a speech recognition model, such as an X-Vector model, from utterances of a target user during an enrolment phase, [Generating a speaker embedding vector from a speech recognition x-vector model, indicating the components of the output vector to be encodings, further indicates the speaker embedding to be a representation of speech characteristics based at least on one or more first encodings, i.e. the speech components input to the x-vector model, wherein the speaker embedding(s)/encodings could be gathered from the corrupted, i.e. first, audio without a change in functionality to Ramos]); inputting the second microphone signal into the trained audio enhancement model with the representation of the speech characteristics of the user ([Fig. 2, Filter Block], [0109] Before passing the input to the filter block, each time frame is concatenated with a vector representing the speech profile of the target user, [Concatenating the speech profile, i.e. representation of speech characteristics of the user, and speech signal before filtering indicates both the second microphone signal, i.e. that containing speech (also reasonably representing the corrupted speech signal, e.g. as the first and second microphone signals are defined in the same way, it is reasonable to assume they are the same signal), and representation of speech characteristics are passed into the filter, i.e. 
part of the audio enhancement model]); wherein inputting the representation of the speech characteristics of the user adapts the trained audio enhancement model to suppress sound sources in the second microphone signal other than the speech of the user ([0121] When the speaker embedding vector exists during a training round, the vector is the speaker embedding vector and the ML model switches to perform personalised noise removal. In this case, the model learns to remove ambient noise and/or babble noise from the output enhanced audio sample, while maintaining the speech of the target user). Gopalakrishnan and Ramos are considered analogous art within adaptive noise cancellation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Gopalakrishnan to incorporate the teachings of Ramos, because of the novel way to concatenate a speaker embedding vector with a noisy audio signal only after processing by encoder and decoder modules, reducing required computing resources for processing which improves performance on resource-constrained device such as smart phones (Ramos, [0011]). Gopalakrishnan further discloses: obtaining an enhanced second microphone signal from the trained audio enhancement model ([Fig. 5, 520], [Col. 11, Lines 24-25] processing logic adjusts the second audio signal to compensate for the noisy environment based on the noise characteristics, [Adjusting a signal to compensate for noise indicates the resultant signal is enhanced and performing this operation indicates a required obtaining of the result for output (see below element)]); and outputting the enhanced second microphone signal ([Fig. 6, 640], [Col. 12, Lines 7-8] At block 640, processing logic outputs the adjusted second audio signal to speakers). Regarding claim 2, Gopalakrishnan in view of Ramos discloses: the method of claim 1. Ramos further discloses: wherein the one or more first encodings are obtained from a hidden state in the trained audio enhancement model ([Fig. 2, TDS Cell], [0107] Encoder and Decoder. Both encoder and decoder are composed of four time-depth-separable convolution (TDS) cells, [Output from the encoder passing through multiple TDS cells indicates any of them and/or individual layers within each cell can represent a hidden state]). Regarding claim 3, Gopalakrishnan in view of Ramos discloses: the method of claim 2. Ramos further discloses: the hidden state being produced in a recurrent layer of the trained audio enhancement model ([Fig. 2, Output from encoder into Filter Block], [0109] A filter block with N recurrent layers, [Viewing the flow of the system of Fig. 2, output from the encoder (the examiner believes the lower “decoder” is incorrectly labeled and should be “encoder” as [0106] defines Fig. 2 to be representing “an encoder and decoder style network with intermediate filter blocks”) passed into the filter block containing a plurality of RNNs indicates any of them can represent a hidden state in a recurrent layer. Passing output from the filter block into the decoder indicates the input to still be in the form of an encoding]). Regarding claim 4, Gopalakrishnan in view of Ramos discloses: the method of claim 3. 
Ramos further discloses: the representation being an average of the hidden state of the recurrent layer over multiple audio frames ([0129] concatenating a speaker embedding vector with each frame of the noisy audio signal after processing by the encoder module, [Concatenation being performed on a plurality of frames to be passed into the recurrent neural network layers, see Fig. 2, indicates the output from the filter block is an average of the hidden state of the recurrent later over multiple audio frames as recurrent neural networks are dependent upon previous information for processing current information, i.e. feedback loops, suggesting an averaging of input data with feedback as output]). Regarding claim 13, Gopalakrishnan discloses: a system comprising: a processor ([Col. 4, Lines 32-33] a hardware module such as a chipset (commonly referred to as a voice processor)); and a storage medium storing instruction which ([Col. 9, Lines 28-29] storage medium 316 on which is stored one or more sets of instructions), when executed by the processor, cause the system to: obtain a microphone signal that includes speech by the user ([Fig. 5, Receive first audio 505], [Col. 6, Lines 16-17] the audio signal is a speech audio signal (e.g., from a mobile phone)). Gopalakrishnan does not disclose: receive a representation of speech characteristics of a user, the representation being generated from encodings produced by a trained audio enhancement model from one or more audio signals that include speech by the user; input the microphone signal into the trained audio enhancement model with the representation of the speech characteristics of the user; and obtain an enhanced microphone signal from the trained audio enhancement model, wherein inputting the representation of the speech characteristics of the user adapts the trained audio enhancement model to suppress sound sources other than the speech of the user. Ramos discloses: receive a representation of speech characteristics of a user, the representation being generated from encodings produced by a trained audio enhancement model from one or more audio signals that include speech by the user ([0109] the speaker embedding vector is pre-computed using a speech recognition model, such as an X-Vector model, from utterances of a target user during an enrolment phase, [Generating a speaker embedding vector from a speech recognition x-vector model, indicating the components of the output vector to be encodings, further indicates the speaker embedding to be a representation of speech characteristics based at least on one or more first encodings, i.e. the speech components input to the x-vector model]); input the microphone signal into the trained audio enhancement model with the representation of the speech characteristics of the user ([Fig. 2, Filter Block], [0109] Before passing the input to the filter block, each time frame is concatenated with a vector representing the speech profile of the target user, [Concatenating the speech profile, i.e. representation of speech characteristics of the user, and speech signal before filtering indicates both the second microphone signal, i.e. that containing speech (also reasonably representing the corrupted speech signal, e.g. as the first and second microphone signals are defined in the same way, it is reasonable to assume they are the same signal), and representation of speech characteristics are passed into the filter, i.e. 
part of the audio enhancement model]); and obtain an enhanced microphone signal from the trained audio enhancement model ([Fig. 5, 520], [Col. 11, Lines 24-25] processing logic adjusts the second audio signal to compensate for the noisy environment based on the noise characteristics, [Adjusting a signal to compensate for noise indicates the resultant signal is enhanced and performing this operation indicates a required obtaining of the result for output (see below element)]), wherein inputting the representation of the speech characteristics of the user adapts the trained audio enhancement model to suppress sound sources other than the speech of the user ([0121] When the speaker embedding vector exists during a training round, the vector is the speaker embedding vector and the ML model switches to perform personalised noise removal. In this case, the model learns to remove ambient noise and/or babble noise from the output enhanced audio sample, while maintaining the speech of the target user). Gopalakrishnan and Ramos are considered analogous art within adaptive noise cancellation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Gopalakrishnan to incorporate the teachings of Ramos, because of the novel way to concatenate a speaker embedding vector with a noisy audio signal only after processing by encoder and decoder modules, reducing required computing resources for processing which improves performance on resource-constrained device such as smart phones (Ramos, [0011]). Regarding claim 18, Gopalakrishnan in view of Ramos discloses: the system of claim 13. Ramos further discloses: wherein the instructions, when executed by the processor, cause the system to: send the enhanced microphone signal to a device of another user that is participating in a call with the user ([0131] transmitting the audio signal after processing by the trained ML model to another participant in the audio call. Thus, the ‘cleaned-up’ audio signal is transmitted to the listener(s) in the audio call (instead of the noisy signal)) or to a server that sends the enhanced microphone signal to the device of the another user ([The examiner would like to note that, due to the disjunctive nature of the claim, this element does not require a mapping. Further, consider the “remote server” of Ramos which performs signal processing indicating this server could be used to send the clean audio to the other user in a call]). Regarding claim 19, Gopalakrishnan discloses: a computer-readable storage medium storing instructions which, when executed by a computing device ([Col. 9, Lines 30-35] computer readable storage medium 316, system memory 306 and/or within the processing device(s) 330 during execution thereof by the computer system 300), cause the computing device to perform acts comprising: obtaining a microphone signal that includes speech by the user ([Fig. 5, Receive first audio 505], [Col. 6, Lines 16-17] the audio signal is a speech audio signal (e.g., from a mobile phone)). 
Gopalakrishnan does not disclose: receiving a representation of speech characteristics of a user, the representation being generated from encodings produced by a trained audio enhancement model from one or more audio signals that include speech by the user; inputting the microphone signal into the trained audio enhancement model with the representation of the speech characteristics of the user; and obtaining an enhanced microphone signal from the trained audio enhancement model, wherein inputting the representation of the speech characteristics of the user adapts the trained audio enhancement model to suppress sound sources other than the speech of the user. Ramos discloses: receiving a representation of speech characteristics of a user, the representation being generated from encodings produced by a trained audio enhancement model from one or more audio signals that include speech by the user ([0109] the speaker embedding vector is pre-computed using a speech recognition model, such as an X-Vector model, from utterances of a target user during an enrolment phase, [Generating a speaker embedding vector from a speech recognition x-vector model, indicating the components of the output vector to be encodings, further indicates the speaker embedding to be a representation of speech characteristics based at least on one or more first encodings, i.e. the speech components input to the x-vector model]); inputting the microphone signal into the trained audio enhancement model with the representation of the speech characteristics of the user ([Fig. 2, Filter Block], [0109] Before passing the input to the filter block, each time frame is concatenated with a vector representing the speech profile of the target user, [Concatenating the speech profile, i.e. representation of speech characteristics of the user, and speech signal before filtering indicates both the second microphone signal, i.e. that containing speech (also reasonably representing the corrupted speech signal, e.g. as the first and second microphone signals are defined in the same way, it is reasonable to assume they are the same signal), and representation of speech characteristics are passed into the filter, i.e. part of the audio enhancement model]); and obtaining an enhanced microphone signal from the trained audio enhancement model ([Fig. 5, 520], [Col. 11, Lines 24-25] processing logic adjusts the second audio signal to compensate for the noisy environment based on the noise characteristics, [Adjusting a signal to compensate for noise indicates the resultant signal is enhanced and performing this operation indicates a required obtaining of the result for output (see below element)]), wherein inputting the representation of the speech characteristics of the user adapts the trained audio enhancement model to suppress sound sources other than the speech of the user ([0121] When the speaker embedding vector exists during a training round, the vector is the speaker embedding vector and the ML model switches to perform personalised noise removal. In this case, the model learns to remove ambient noise and/or babble noise from the output enhanced audio sample, while maintaining the speech of the target user). Gopalakrishnan and Ramos are considered analogous art within adaptive noise cancellation. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Gopalakrishnan to incorporate the teachings of Ramos, because of the novel way to concatenate a speaker embedding vector with a noisy audio signal only after processing by encoder and decoder modules, reducing required computing resources for processing which improves performance on resource-constrained device such as smart phones (Ramos, [0011]). Claim(s) 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gopalakrishnan in view of Ramos, further in view of Yu et al. (US-20230403505-A1), hereinafter Yu. Regarding claim 5, Gopalakrishnan in view of Ramos discloses: the method of claim 3. Gopalakrishnan in view of Ramos does not disclose: the recurrent layer being a gated recurrent unit. Yu discloses: the recurrent layer being a gated recurrent unit ([0076] a gated recurrent unit (GRU) encoder layer (e.g., a GRU encoder layer with 257 hidden units)). Gopalakrishnan, Ramos, and Yu are considered analogous art within adaptive noise cancellation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Gopalakrishnan in view of Ramos to incorporate the teachings of Yu, because of the novel way to unify training of echo cancellation and automatic gain control, improving echo and noise suppression in audio signals (Yu, [0001]). Claim(s) 6-10, 14-15, 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gopalakrishnan in view of Ramos, further in view of Zhang et al. (US-10622009-B1), hereinafter Zhang. Regarding claim 6, Gopalakrishnan in view of Ramos discloses: the method of claim 3. Gopalakrishnan in view of Ramos does not disclose: receiving a first far end signal associated with the first microphone signal; and, inputting the first far end signal into the trained audio enhancement model with the first microphone signal, the one or more first encodings being produced by the trained audio enhancement model from the first microphone signal and the first far end signal. Zhang discloses: receiving a first far end signal associated with the first microphone signal ([Col. 12, Lines 35-40] double-talk conditions 250 are present (e.g., near-end speech and far-end speech, represented by near-end speech data 212b and far-end speech data 214b), [Determining a double talk condition, wherein a near-end speech reasonably tracks to a first microphone signal, indicates a required reception of a first far end signal and both signals being present at one time indicates an association between the two signals]); and, inputting the first far end signal into the trained audio enhancement model with the first microphone signal ([Fig. 6, Double-talk Detection 130], [Wherein the system displayed in Fig. 6 of Zhang contains acoustic echo cancellation 120, residual echo suppression 122, and noise reduction 624, after receiving both first far-end and microphone signals 602/606 respectively, indicating the system of Fig. 6 to reasonably track to a trained audio enhancement model as previously disclosed in Ramos]), the one or more first encodings being produced by the trained audio enhancement model from the first microphone signal and the first far end signal ([Fig. 7A, Feature Extraction 714/722], [Col. 
19, Lines 25-30] The resulting accumulated/processed speech audio data for the utterance (from beginpoint to endpoint) may then be represented in a single feature vector, [Col. 27, Lines 1-5] the feature vector data 808 may be a single vector representing audio qualities of the input utterance. For example, the single vector may be created using an encoder, [Extracting features from near-end, i.e. first microphone, and far-end signals to be represented in a vector indicates the components of the vector to be encodings]). Gopalakrishnan, Ramos, and Zhang are considered analogous art within adaptive noise cancellation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Gopalakrishnan in view of Ramos to incorporate the teachings of Zhang, because of the novel way to operate noise cancellation and apply noise cancellation parameters differently based on the type of audio present in received audio data, improving the quality of multiple types of speech without attenuating or degrading the target speech (Zhang, [Col. 3, Lines 40-50]). Regarding claim 7, Gopalakrishnan in view of Ramos, further in view of Zhang discloses: the method of claim 6. Zhang further discloses: receiving a second far end signal associated with the second microphone signal ([Col. 12, Lines 35-40] double-talk conditions 250 are present (e.g., near-end speech and far-end speech, represented by near-end speech data 212b and far-end speech data 214b), [Determining a double talk condition, wherein a near-end speech reasonably tracks to a first microphone signal, indicates a required reception of a first far end signal and both signals being present at one time indicates an association between the two signals. Further, as the first and second microphone signals are currently defined to be containing the same information, they can be considered to be the same signal; therefore, a first far-end signal associated with a first microphone signal can also be a second far-end signal associated with a second microphone signal]); and, inputting the second far end signal into the trained audio enhancement model with the second microphone signal and the representation of the speech characteristics of the user ([Fig. 10A, Inputting Near-end reference signal 602, offline speech data 1008, and far-end reference signal 606 into double-talk detection 1040], [A near-end signals tracks to a microphone signal, offline speech data having features extracted, wherein feature vectors of Zhang are disclosed to be representing audio qualities (see Col. 27), indicating the feature vector representing offline speech data to reasonably map to speech characteristics of the user. Further, wherein the system displayed in Fig. 6 of Zhang contains acoustic echo cancellation 120, residual echo suppression 122, and noise reduction 624, after receiving second far-end and microphone signals 602/606 respectively and offline speech data representing characteristics, indicating the system of Fig. 6 to reasonably track to a trained audio enhancement model as previously disclosed in Ramos]), the trained audio enhancement model producing the enhanced second microphone signal from the second microphone signal, the second far end signal, and the representation of the speech characteristics of the user ([Fig. 
10A, Output Signal 632], [Generating an output signal after performing noise cancellation based upon a collection of microphone signal, far-end signal, and representation of speech characteristics indicates the output signal to be an enhances second microphone signal]). Regarding claim 8, Gopalakrishnan in view of Ramos, further in view of Zhang discloses: the method of claim 7. Zhang further discloses: the trained audio enhancement model performing temporal alignment of the first microphone signal to the first far end signal and the second microphone signal to the second far end signal ([Col. 32, Lines 55-65] Double-talk detection may include performing frame level feature extraction and/or utterance level feature extraction using the feature extraction component 714 and/or the feature extraction component 722. The frame level feature extraction may determine which frame of a universal background model (UBM) the frame corresponds to. The UBM may be a Gaussian mixture model, a deep neural network, etc. The utterance level feature extraction may analyze aligned speech frames to derive feature vectors of fixed length (i.e., feature vector data 808), [In view of double-talk containing near and far end signals, see Fig. 2, indicating an alignment of microphone signal, i.e. near-end, and far-end signals. Further, aligning signals for “frame level feature extraction” indicates the temporal alignment to be on a frame-basis. Further still, as the first and second signals are defined to be the same, the same operations could be applied using the method of Zhang to second near and far end signals, i.e. the same as first near and far end signals, without a change in functionality to Zhang. Consider the plurality of input signals of Fig. 1]). Regarding claim 9, Gopalakrishnan in view of Ramos, further in view of Zhang discloses: the method of claim 7. Ramos further discloses: wherein the enhanced second microphone signal is obtained by applying masks produced by the trained audio enhancement model to the second microphone signal ([0106] The ML model (also referred to herein as “PSE-Net”) accepts (i) a noisy spectrogram and (ii) speaker-embedding of a target user as inputs and outputs a time-frequency mask, which is then applied on the input audio spectrogram to obtain the enhanced spectrogram, [Applying to a mask to spectrogram indicates that spectrogram to be representative of a second microphone signal; therefore, the masks are applied to the second microphone signal]). Regarding claim 10, Gopalakrishnan in view of Ramos, further in view of Zhang discloses: the method of claim 9. Ramos further discloses: wherein the trained audio enhancement model attenuates at least one of noise, distortions, or echoes present in the second microphone signal ([0148] PSE-Net variants can effectively suppress both babble and ambient noise, [Wherein the signal having suppressed babble and ambient noise can be the second microphone signal without extending beyond the scope of Ramos]) or extends bandwidth of the second microphone signal ([The examiner would like to note that due to the disjunctive nature of the claim, this element does not require a mapping]). Regarding claim 14, Gopalakrishnan in view of Ramos discloses: the system of claim 13. 
Gopalakrishnan in view of Ramos does not disclose: wherein the instructions, when executed by the processor, cause the system to: obtain a far end signal associated with the microphone signal; and input the far end signal into the trained audio enhancement model with the microphone signal, wherein the trained audio enhancement model aligns the microphone signal with the far end signal prior to processing resulting features with a recurrent layer. Zhang discloses: obtain a far end signal associated with the microphone signal ([Col. 12, Lines 35-40] double-talk conditions 250 are present (e.g., near-end speech and far-end speech, represented by near-end speech data 212b and far-end speech data 214b), [Determining a double talk condition, wherein a near-end speech reasonably tracks to a microphone signal]); and input the far end signal into the trained audio enhancement model with the microphone signal ([Fig. 6, Double-talk Detection 130], [Wherein the system displayed in Fig. 6 of Zhang contains acoustic echo cancellation 120, residual echo suppression 122, and noise reduction 624, after receiving both first far-end and microphone signals 602/606 respectively, indicating the system of Fig. 6 to reasonably track to a trained audio enhancement model as previously disclosed in Ramos]), wherein the trained audio enhancement model aligns the microphone signal with the far end signal prior to processing resulting features with a recurrent layer ([Col. 32, Lines 55-65] Double-talk detection may include performing frame level feature extraction and/or utterance level feature extraction using the feature extraction component 714 and/or the feature extraction component 722. The frame level feature extraction may determine which frame of a universal background model (UBM) the frame corresponds to. The UBM may be a Gaussian mixture model, a deep neural network, etc. The utterance level feature extraction may analyze aligned speech frames to derive feature vectors of fixed length (i.e., feature vector data 808), [Col. 33, Lines 39-45] for a certain frame's worth of audio data that comes in, the feature extraction components 714/722 may combine that frame's worth of data to the previous data received for the particular utterance. The particular method of accumulation may vary, including using an arithmetic component, a recurrent neural network (RNN), [In view of double-talk containing near and far end signals, see Fig. 2, indicating an alignment of microphone signal, i.e. near-end, and far-end signals. Further, analyzing aligned frames indicates a required alignment before analysis, i.e. feature extraction using a recurrent neural network (clearly containing at least one recurrent layer)]). Gopalakrishnan, Ramos, and Zhang are considered analogous art within adaptive noise cancellation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Gopalakrishnan in view of Ramos to incorporate the teachings of Zhang, because of the novel way to operate noise cancellation and apply noise cancellation parameters differently based on the type of audio present in received audio data, improving the quality of multiple types of speech without attenuating or degrading the target speech (Zhang, [Col. 3, Lines 40-50]). Regarding claim 15, Gopalakrishnan in view of Ramos, further in view of Zhang discloses: the system of claim 14. 
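As a point of reference for the alignment limitation discussed for claims 8 and 14 (lining up microphone and far-end frames before features reach a recurrent layer), a generic delay-compensation step might look like the following NumPy sketch. The cross-correlation approach and every name here are illustrative assumptions, not the specific mechanism of Zhang or of the application.

```python
import numpy as np

def align_far_end(mic: np.ndarray, far_end: np.ndarray, max_delay: int = 4000) -> np.ndarray:
    """Hypothetical helper: delay-compensate the far-end reference so its frames
    line up with the microphone signal before feature extraction."""
    # Estimate the echo-path delay as the lag that maximizes cross-correlation.
    corr = np.correlate(mic[: 2 * max_delay], far_end[:max_delay], mode="valid")
    delay = int(np.argmax(corr))
    # Shift the far-end reference by the estimated delay and pad to the mic length.
    aligned = np.concatenate([np.zeros(delay), far_end])[: len(mic)]
    if len(aligned) < len(mic):
        aligned = np.pad(aligned, (0, len(mic) - len(aligned)))
    return aligned

# Toy usage: a far-end signal that appears in the microphone 120 samples later.
rng = np.random.default_rng(0)
far = rng.standard_normal(8000)
mic = np.concatenate([np.zeros(120), 0.5 * far])[:8000] + 0.01 * rng.standard_normal(8000)
aligned = align_far_end(mic, far, max_delay=500)
print(aligned.shape)  # (8000,)
```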
Ramos further discloses: wherein the encodings are produced in the recurrent layer of the trained audio enhancement model ([Fig. 2, Output from encoder into Filter Block], [0109] A filter block with N recurrent layers, [Viewing the flow of the system of Fig. 2, output from the encoder (the examiner believes the lower “decoder” is incorrectly labeled and should be “encoder” as [0106] defines Fig. 2 to be “an encoder and decoder style network with intermediate filter blocks”) passed into the filter block containing a plurality of RNNs indicates any of them can represent a hidden state in a recurrent layer. Passing output from the filter block into the decoder indicates the input to still be in the form of an encoding, further indicating encodings produced in the recurrent layers]). Regarding claim 20, Gopalakrishnan in view of Ramos discloses: the computer-readable storage medium of claim 19. Ramos further discloses: wherein the representation is an embedding ([0092] The personalisation may be performed by conditioning the output of a source extractor network on a voice profile of the target user, represented by a fixed-size embedding vector). Gopalakrishnan in view of Ramos does not disclose: the acts further comprise: refreshing the embedding based at least on the microphone signal, the embedding being generated and subsequently refreshed based at least on a state of a recurrent, convolutional, or transformer layer of the trained audio enhancement model. Zhang discloses: the acts further comprise: refreshing the embedding based at least on the microphone signal ([Col. 37, Lines 20-23] far-end reference signal 606 may be input to feature extraction component 1070 to generate third feature data and the third feature data may be used to train and update the GMM 1072, [Wherein the far-end speech influences the microphone signal, see Fig. 1, indicating an updated far-end signal will have updated associated feature data, i.e. embeddings, as required to update the model receiving the embeddings, based on at least the microphone signal (containing the far-end signal as noise)]), the embedding being generated and subsequently refreshed based at least on a state of a recurrent, convolutional, or transformer layer of the trained audio enhancement model ([Col. 37, Lines 24-26] For example, the device 110 may include a VAD detector and may update the GMM 1072 when far-end speech is detected by the VAD detector, as discussed above with regard to FIGS. 7A-7B and 9B, [Updating, i.e. performing feature extraction, based on voice activity being detected indicates the update is based on the recurrent, convolutional, or transformer layers being in an inactive/standby state, i.e. waiting to receive the updated extracted features, as the operation only continues if far-end voice activity is detected]). Gopalakrishnan, Ramos, and Zhang are considered analogous art within adaptive noise cancellation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Gopalakrishnan in view of Ramos to incorporate the teachings of Zhang, because of the novel way to operate noise cancellation and apply noise cancellation parameters differently based on the type of audio present in received audio data, improving the quality of multiple types of speech without attenuating or degrading the target speech (Zhang, [Col. 3, Lines 40-50]). Claim(s) 11-12, 16-17 is/are rejected under 35 U.S.C. 
103 as being unpatentable over Gopalakrishnan in view of Ramos, further in view of Zhang, further in view of Yu. Regarding claim 11, Gopalakrishnan in view of Ramos, further in view of Zhang discloses: the method of claim 10. Gopalakrishnan in view of Ramos, further in view of Zhang does not disclose: wherein the trained audio enhancement model produces a concatenation of the representation of the speech characteristics of the user with features representing current frames of the second microphone signal and the second far end signal. Yu discloses: wherein the trained audio enhancement model produces a concatenation of the representation of the speech characteristics of the user with features representing current frames of the second microphone signal and the second far end signal ([0075] the real and imaginary parts of both the microphone and far-end reference channels are concatenated, [0077] the LPS of YAEC and Xecho are computed, which are then concatenated with the acoustic feature from the linear FiLM projection layer in the first stage to serve as the second stage input feature, [LPS tracks to log power spectrum, indicating the LPS of YAEC, i.e. the power of the input microphone signal, and the LPS of Xecho, i.e. the far-end reference signal, (see [0076] for signal definitions) being concatenated with an acoustic feature, wherein the acoustic features are used to separate target speech from noise (see [0043]), tracks to an overall concatenation of representation of speech characteristics, i.e. acoustic feature, features of current frames of the microphone signal, i.e. LPS of YAEC, and far-end signal, i.e. LPS of Xecho]). Gopalakrishnan, Ramos, Zhang, and Yu are considered analogous art within adaptive noise cancellation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Gopalakrishnan in view of Ramos, further in view of Zhang to incorporate the teachings of Yu, because of the novel way to unify training of echo cancellation and automatic gain control, improving echo and noise suppression in audio signals (Yu, [0001]). Regarding claim 12, Gopalakrishnan in view of Ramos, further in view of Zhang, further in view of Yu discloses: the method of claim 11. Yu further discloses: wherein the trained audio enhancement model produces a projection of the concatenation into a corresponding dimension of a flattened feature map ([Fig. 4, Linear layer projection after concatenation 405], [0075] the lower triangular matrix is flattened and the real and imaginary parts of both the microphone and far-end reference channels are concatenated), the projection being fed into the recurrent layer ([Fig. 4, GRU encoder which receives the flattened concatenation after passing through the FILM layer adding the speaker characteristic information]). Regarding claim 16, Gopalakrishnan in view of Ramos, further in view of Zhang discloses: the system of claim 15. Gopalakrishnan in view of Ramos, further in view of Zhang does not disclose: wherein the trained audio enhancement model produces a concatenation of the representation of the speech characteristics of the user with features representing current frames of the microphone signal and the far end signal prior to processing the concatenation via the recurrent layer. 
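Purely as an illustration of the limitation just recited (the speaker representation concatenated with current-frame microphone and far-end features, projected to a flattened feature dimension, then fed to a recurrent layer), a hypothetical fragment could look like this; the dimensions and layer choices are assumptions, not Yu's architecture or the applicant's model.

```python
import torch
import torch.nn as nn

n_mic, n_far, n_spk, hidden = 257, 257, 128, 256

# Hypothetical layers: a linear projection of the concatenation, then a GRU
# (cf. the gated recurrent unit recited in claim 17).
project = nn.Linear(n_mic + n_far + n_spk, hidden)
gru = nn.GRU(hidden, hidden, batch_first=True)

# Current-frame features for one utterance: 80 frames each of microphone and
# far-end features, plus a per-utterance speaker representation broadcast to
# every frame before concatenation.
mic_feats = torch.randn(1, 80, n_mic)
far_feats = torch.randn(1, 80, n_far)
speaker_rep = torch.randn(1, n_spk).unsqueeze(1).expand(-1, 80, -1)

concat = torch.cat([mic_feats, far_feats, speaker_rep], dim=-1)   # frame-wise concatenation
projected = project(concat)                                       # projection to flattened feature dim
out, _ = gru(projected)                                            # fed into the recurrent layer
print(out.shape)  # torch.Size([1, 80, 256])
```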
Yu discloses: wherein the trained audio enhancement model produces a concatenation of the representation of the speech characteristics of the user with features representing current frames of the microphone signal and the far end signal prior to processing the concatenation via the recurrent layer ([0075] the real and imaginary parts of both the microphone and far-end reference channels are concatenated, [0077] the LPS of YAEC and Xecho are computed, which are then concatenated with the acoustic feature from the linear FiLM projection layer in the first stage to serve as the second stage input feature, [LPS tracks to log power spectrum, indicating the LPS of YAEC, i.e. the power of the input microphone signal, and the LPS of Xecho, i.e. the far-end reference signal, (see [0076] for signal definitions) being concatenated with an acoustic feature, wherein the acoustic features are used to separate target speech from noise (see [0043]), tracks to an overall concatenation of representation of speech characteristics, i.e. acoustic feature, features of current frames of the microphone signal, i.e. LPS of YAEC, and far-end signal, i.e. LPS of Xecho. Further, wherein all these operations are performed before being passed into GRU encoder, see Fig. 4, indicating the concatenation is produced prior to processing the concatenation via the recurrent layer]). Gopalakrishnan, Ramos, Zhang, and Yu are considered analogous art within adaptive noise cancellation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Gopalakrishnan in view of Ramos, further in view of Zhang to incorporate the teachings of Yu, because of the novel way to unify training of echo cancellation and automatic gain control, improving echo and noise suppression in audio signals (Yu, [0001]). Regarding claim 17, Gopalakrishnan in view of Ramos, further in view of Zhang, further in view of Yu discloses: the system of claim 16. Yu further discloses: wherein the recurrent layer is a gated recurrent unit ([0076] a gated recurrent unit (GRU) encoder layer (e.g., a GRU encoder layer with 257 hidden units)). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Chang et al. 
(US-20240105199-A1) discloses “A multi-channel based noise and echo signal integrated cancellation device using deep neural network according to an embodiment comprises a plurality of microphone encoders that receive a plurality of microphone input signals including an echo signal, and a speaker's voice signal, convert the plurality of microphone input signals into a plurality of conversion information, and output the plurality of conversion information, a channel convert unit for compressing the plurality of pieces of conversion information and converting them into first input information having a size of a single channel and outputting the converted first input information, a far-end signal encoder that receives a far-end signal, converts the far-end signal into second input information, and outputs the converted second input information, an attention unit outputting weight information by applying an attention mechanism to the first input information and the second input information, a pre-learned first artificial neural network taking third input information, which is the sum information of the weight information and the second input information, as input information, and first output information including mask information for estimating the voice signal from the second input information as output information and a voice signal estimator configured to output an estimated voice signal obtained by estimating the voice information based on the first output information and the second input information” (abstract). See entire document. Chhetri et al. (US-12272369-B1) discloses “A system configured to improve audio processing by performing dereverberation and noise reduction during a communication session. In some examples, the system may include a deep neural network (DNN) configured to perform speech enhancement, which is located after an Acoustic Echo Cancellation (AEC) component. For example, the DNN may process isolated audio data output by the AEC component to jointly mitigate additive noise and reverberation. In other examples, the system may include a DNN configured to perform acoustic interference cancellation, which may jointly mitigate additive noise, reverberation, and residual echo, removing the need to perform residual echo suppression processing. The DNN is configured to process complex-valued spectrograms corresponding to the isolated audio data and/or estimated echo data generated by the AEC component” (abstract). See entire document for processing of near and far end signals. Lund et al. (US-20240005930-A1) discloses “A method for personalized bandwidth extension in an audio device. The method comprises obtaining an input microphone signal with a first bandwidth, obtaining a first user parameter indicative of one or more characteristics of a user of the audio device, determining, based on the first user parameter, a bandwidth extension model, and generating an output signal with a second bandwidth by applying the determined bandwidth extension model to the input microphone signal” (abstract). See entire document for user-specific audio processing. Any inquiry concerning this communication or earlier communications from the examiner should be directed to THEODORE JOHN WITHEY whose telephone number is (703)756-1754. The examiner can normally be reached Monday - Friday, 8am-5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. 
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/THEODORE WITHEY/
Examiner, Art Unit 2655

/ANDREW C FLANDERS/
Supervisory Patent Examiner, Art Unit 2655

Prosecution Timeline

Feb 02, 2024: Application Filed
Oct 16, 2025: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591744: METHOD FOR TRAINING SEMANTIC REPRESENTATION MODEL, DEVICE AND STORAGE MEDIUM (granted Mar 31, 2026; 2y 5m to grant)
Patent 12536994: APPARATUS FOR CLASSIFYING SOUNDS BASED ON NEURAL CODE IN SPIKING NEURAL NETWORK AND METHOD THEREOF (granted Jan 27, 2026; 2y 5m to grant)
Patent 12475330: METHOD FOR IDENTIFYING NOISE SAMPLES, ELECTRONIC DEVICE, AND STORAGE MEDIUM (granted Nov 18, 2025; 2y 5m to grant)
Patent 12417759: SPEECH RECOGNITION USING CADENCE PATTERNS (granted Sep 16, 2025; 2y 5m to grant)
Patent 12412580: Sound Extraction System and Sound Extraction Method (granted Sep 09, 2025; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 44%
With Interview: 90% (+46.9%)
Median Time to Grant: 2y 11m
PTA Risk: Low
Based on 23 resolved cases by this examiner. Grant probability derived from career allow rate.
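The projected figures follow from the examiner statistics above: the 44% baseline is the career allow rate (10 of 23 resolved cases), and the 90% with-interview figure is consistent with that baseline plus the reported +46.9% interview lift. A minimal sketch of the arithmetic (the additive combination and rounding are illustrative assumptions; the underlying with/without-interview case counts are not given in this report):

```python
granted, resolved = 10, 23
career_allow_rate = granted / resolved          # ≈ 0.435, shown as 44%
interview_lift = 0.469                          # reported +46.9% lift

with_interview = career_allow_rate + interview_lift
print(f"baseline {career_allow_rate:.1%}, with interview ≈ {with_interview:.1%}")
# baseline 43.5%, with interview ≈ 90.4%
```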
