Prosecution Insights
Last updated: April 19, 2026
Application No. 18/730,041

SINGLE-CHANNEL SPEECH ENHANCEMENT USING ULTRASOUND

Status: Non-Final OA (§103)
Filed: Jul 18, 2024
Examiner: PASHA, ATHAR N
Art Unit: 2657
Tech Center: 2600 (Communications)
Assignee: The Regents of the University of California
OA Round: 1 (Non-Final)

Grant Probability: 90% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 8m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 90% (138 granted / 154 resolved; +27.6% vs TC avg, above average)
Interview Lift: +17.0% (strong; allow rate with vs. without an interview, among resolved cases with an interview)
Typical Timeline: 2y 8m average prosecution; 18 applications currently pending
Career History: 172 total applications across all art units
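
As a sanity check, the headline figures above follow directly from the raw counts, assuming "allow rate" simply means granted over resolved; a few lines of Python:

granted, resolved = 138, 154
allow_rate = granted / resolved            # 0.896, displayed as 90%
tc_avg = allow_rate - 0.276                # implied TC average from the +27.6% delta
print(f"career allow rate: {allow_rate:.1%}")   # 89.6%
print(f"implied TC average: {tc_avg:.1%}")      # ~62.0%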

Statute-Specific Performance

Statute   Rate    vs TC avg
§101      21.9%   -18.1%
§103      49.4%   +9.4%
§102      16.9%   -23.1%
§112       5.2%   -34.8%

TC averages are estimates. Based on career data from 154 resolved cases.

Office Action (Non-Final, §103)
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 7/18/2024 is being considered by the examiner.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 2, 4, 5, 9, 12, 13, 15, 16, 20, 23 are rejected under 35 U.S.C. 103 as being unpatentable over Pishehvar (US 20220366927 A1) in view of Babayoff (US 20150304789 A1).

With respect to claims 1, 12 and 23, Pishehvar teaches: (Claim 1) (Original) A method comprising: (Claim 12) (Original) A system comprising: at least one processor (Pishehvar ¶[0069]: "Aspects of the deep learning system described herein may be implemented in a data processing system, for example, by a network computer, network server, tablet computer, smartphone, laptop computer, desktop computer, other consumer electronic devices or other data processing systems. In particular, the operations described for the deep learning system are digital signal processing operations performed by a processor that is executing instructions stored in one or more memories."); and (Claim 23) (Original) A non-transitory computer-readable storage medium including instruction which when executed by at least one processor causes operations comprising (Pishehvar ¶[0070]: "The processes and blocks described herein are not limited to the specific examples described and are not limited to the specific orders used as examples herein. Rather, any of the processing blocks may be re-ordered, combined or removed, performed in parallel or in serial, as necessary, to achieve the results set forth above. The processing blocks associated with implementing the audio processing system may be performed by one or more programmable processors executing one or more computer programs stored on a non-transitory computer readable storage medium."): at least one memory including instruction which when executed by the at least one processor causes operations comprising (Pishehvar ¶[0069], quoted above):

receiving, by a machine learning model, first data corresponding to noisy audio including audio of a target speaker of interest proximate to a microphone (Pishehvar ¶[0045]: "The machine learning model 305 may further leverage multi-modal signals 331 to further improve the robustness of enhanced target speech signal 325. For example, visual information of the target speaker provided by a camera may augment the audio information [first set of features] provided by multi-channel signal 303 to facilitate the encoder and the separation network to better discriminate between the target speech and speech from interfering speakers or background noise. The multi-modal signals 331 may also improve the estimates of the VAD. In one aspect, multi-modal signals 331 may include DOA information of the target speaker from a camera tracking the target speaker's face, visual signals of the target speaker's lip movements or facial expressions [second set of features] captured by the camera"; ¶[0046]: "An encoder module 307 may map segments of multi-channel signal 303 and segments of multi-modal signals 331 to high-dimensional internal representations using encoder basis functions. The mapped features of multi-channel signal 303 and multi-modal signals 331 may be fused to generate fused feature space 309");

receiving, by the machine learning model, second data corresponding to articulatory gestures sensed by the microphone which also detected the noisy audio, [[wherein the second data corresponding to the articulatory gestures comprises one or more Doppler data indicative of Doppler associated with the articulatory gestures of the target speaker while speaking the audio]] (Pishehvar ¶[0045]-[0046], quoted above);

generating, by the machine learning model, a first set of features for the first data and a second set of features for the second data (Pishehvar ¶[0045]-[0046], quoted above);

combining, by the machine learning model, the first set of features for the first data and the second set of features for the second data to form an output representative of the audio of the target speaker that reduces, based on the combined first and second features, noise and/or interference related to at least one other speaker and/or related to at least one other source of audio (Pishehvar ¶[0045]-[0046], quoted above); and

providing, by the machine learning model, the output representative of the audio of the target speaker (Pishehvar ¶[0045]: "The machine learning model 305 may further leverage multi-modal signals 331 to further improve the robustness of enhanced target speech signal 325").

Pishehvar does not explicitly disclose, but Babayoff teaches, wherein the second data corresponding to the articulatory gestures comprises one or more Doppler data indicative of Doppler associated with the articulatory gestures of the target speaker while speaking the audio (Babayoff ¶[0158]: "In such embodiments, the sound discriminator module 620 is connectable to the processing utility 650 or directly to the acoustic transducer system 610 and is operable for utilizing the acoustic transducer system 610 for sending an ultrasound beam/waveform (e.g. at discrete frequency) towards the location of the user. When such waveform hits the user's face/head, it is reflected back but it is however Doppler modulated by movements of the face/head. Specifically, when the user is talking and/or moving his mouth, the reflected ultrasound will be Doppler modulated by movement of mouth and throat. To this end, the sound discriminator module 620 may be connectable to an ultrasonic sensitive microphone (e.g. 642 or other) which is capable of detecting the Doppler modulated reflection of the transmitted ultrasound beams. The sound discriminator module 620 may also be connectable to a microphone in the audible range microphone (e.g. 642 or other) operable for detecting audible sounds (e.g. including that of the user). Sound discriminator module 620 may be adapted to process the audible sound detected together with the Doppler modulated reflection for filtering the audible sounds based on a correlation of the audible sound with the Doppler reflected sounds. This technique enables to discriminate the user's voice which is relatively correlated with the Doppler ultrasound reflections since the ultrasound beam is directed/focused at the user.").

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar to include the articulatory gestures of Babayoff in order to aid in noise reduction.
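
The mapped claims describe a two-branch model: embed the noisy-audio features and the Doppler-gesture features separately, then combine the embeddings to produce an output representative of the target speaker. A minimal PyTorch sketch of that flow, not the applicant's or Pishehvar's actual network; the layer sizes, the mask-style output, and all names below are illustrative assumptions:

import torch
import torch.nn as nn

class FusionEnhancer(nn.Module):
    def __init__(self, freq_bins: int = 257, embed: int = 128):
        super().__init__()
        # First set of features: noisy audible speech (time-frequency input).
        self.audio_branch = nn.Conv1d(freq_bins, embed, kernel_size=3, padding=1)
        # Second set of features: ultrasound Doppler of articulatory gestures.
        self.doppler_branch = nn.Conv1d(freq_bins, embed, kernel_size=3, padding=1)
        # Fusion: concatenate per-frame embeddings, predict a spectral mask.
        self.fuse = nn.Conv1d(2 * embed, freq_bins, kernel_size=1)

    def forward(self, noisy_spec, doppler_spec):
        a = torch.relu(self.audio_branch(noisy_spec))      # (B, embed, T)
        d = torch.relu(self.doppler_branch(doppler_spec))  # (B, embed, T)
        mask = torch.sigmoid(self.fuse(torch.cat([a, d], dim=1)))
        return mask * noisy_spec  # output representative of the target speaker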
With respect to claims 2 and 13, Babayoff further teaches: further comprising: emanating, via a loudspeaker, ultrasound towards at least the target speaker, wherein the ultrasound is reflected by the articulatory gestures and detected by the microphone (Babayoff ¶[0158], quoted above).

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar to include the articulatory gestures of Babayoff in order to aid in noise reduction.

With respect to claims 4 and 15, Babayoff further teaches wherein the ultrasound comprises a plurality of continuous wave (CW) single frequency tones (Babayoff ¶[0055]: "For example FIG. 1B illustrates schematically the results of a concurrent transmission of two ultrasound signals/waves with respective frequencies f1=40 KHz and f2=42 KHz through a non-linear medium (air in this case). The air propagates the 40 and 42 KHz frequencies, but also produces the following frequencies: 80 and 84 KHz (the harmonics), 82 KHz (the sum) and 2 KHz (the difference). However, only the latter frequency |f1−f2|=2 KHz is audible (i.e. heard by humans) as the rest of the frequencies are in the ultrasound regime. Modulating a carrier frequency in the ultrasonic regime (e.g. with frequency f1=40 KHz) may be amplitude modulated at the input (e.g. utilizing Double Side Band Amplitude Modulation (AM-DSB)) with an audible tone (for example a single tone at 2 KHz), which will create the spectrum lines of 40 KHz and 42 KHz (also 38 KHz as this is double side band modulation) in the frequency domain. Based on the self-demodulation characteristic of the air/non-linear medium, the AM modulated signal will be demodulated to reproduce the 2 KHz tone which the human ear can hear (typically also producing the native frequencies, harmonics, and sum of the native frequencies).").

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar to include the articulatory gestures of Babayoff in order to aid in noise reduction.
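
Babayoff's ¶[0055] numbers can be checked directly. A few lines of Python reproducing the intermodulation products of the two CW tones (the hearing-range bounds are the usual 20 Hz to 20 kHz approximation, an assumption of this sketch):

f1, f2 = 40_000, 42_000  # Hz, the two CW ultrasound tones from ¶[0055]
products = {
    "harmonic of f1": 2 * f1,    # 80 kHz
    "harmonic of f2": 2 * f2,    # 84 kHz
    "sum": f1 + f2,              # 82 kHz
    "difference": abs(f1 - f2),  # 2 kHz, the only audible product
}
for name, f in products.items():
    audible = 20 <= f <= 20_000  # rough human hearing range
    print(f"{name}: {f / 1000:.0f} kHz, audible={audible}")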
With respect to claims 5 and 16, Babayoff further teaches wherein the articulatory gestures comprise gestures associated with the target speaker's speech including mouth gestures, lip gestures, tongue gestures, jaw gestures, vocal cord gestures, and/or other speech related organs (Babayoff ¶[0158], quoted above).

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar to include the articulatory gestures of Babayoff in order to aid in noise reduction.

With respect to claims 9 and 20, Babayoff further teaches: further comprising: receiving a single stream of data obtained from the microphone (Babayoff ¶[0158], quoted above); and preprocessing the single stream to extract the first data comprising noisy audio and to extract the second data comprising the articulatory gestures (Babayoff ¶[0158], quoted above).

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar to include the articulatory gestures of Babayoff in order to aid in noise reduction.
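
Claims 9 and 20 turn on one microphone stream carrying both signals. A hedged sketch of such a preprocessing split using SciPy, assuming a 40 kHz CW carrier and a 96 kHz capture rate (both illustrative; the claims fix neither value):

import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 96_000        # sample rate high enough to capture a 40 kHz carrier
CARRIER = 40_000   # assumed CW ultrasound tone

def split_stream(x: np.ndarray):
    lo = butter(6, 8_000, btype="low", fs=FS, output="sos")
    hi = butter(6, [CARRIER - 1_000, CARRIER + 1_000], btype="bandpass",
                fs=FS, output="sos")
    audible = sosfiltfilt(lo, x)     # first data: noisy audible speech
    ultrasound = sosfiltfilt(hi, x)  # second data: Doppler-modulated band
    # Shift the carrier band to baseband so Doppler shifts appear near DC
    # (a further low-pass would isolate the baseband Doppler term).
    t = np.arange(len(x)) / FS
    doppler = ultrasound * np.exp(-2j * np.pi * CARRIER * t)
    return audible, doppler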
Claims 3, 6, 14, 17 are rejected under 35 U.S.C. 103 as being unpatentable over Pishehvar and Babayoff, in further view of Zhou (US 20200309930 A1).

With respect to claims 3 and 14, neither Pishehvar nor Babayoff explicitly discloses, but Zhou teaches: receiving an indication of an orientation of a user equipment including the microphone and the loudspeaker (Zhou ¶[0271]: "During such data augmentation process shown in FIG. 11A, the system extracts lightweight visual features of the face to compliment acoustic ones. The vision techniques serve two purposes: 1) the detection of facial landmarks 155 which are later used as basic visual features; and 2) tracking the user's face 9 on the smartphone screen 105 so that the user can hold the device 7 within some 'valid' zone (thus, distance and orientation are accounted for) during data collection and processing of visual and/or acoustics data for later authentication of the user 9."); and selecting, using the received indication, the machine learning model (Zhou ¶[0314]: "Finally, in step 207 of FIG. 11B, the system synthesizes the acoustics data and outputs for system input of a unique echo-signature print, representation or pattern associated with identifying and authenticating a particular user 3 (analogous to a fingerprint of the particular user 3). Such unique echo-signature print is input into a pre-trained CNN model and/or SVM model used in SVM prediction 131 and/or two-factor authentication 121 (as shown in example FIG. 7).").

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar, in view of the articulatory gestures of Babayoff, to include the equipment orientation of Zhou in order to aid in background noise removal (Zhou ¶[0194]).

With respect to claims 6 and 17, neither Pishehvar nor Babayoff explicitly discloses, but Zhou teaches, wherein the generating, by the machine learning model, the first set of features for the first data and the second set of features for the second data further comprises: using a first set of convolutional layers to provide feature embedding for the first data, wherein the first data is in a time-frequency domain; and using a second set of convolutional layers to provide feature embedding for the second data, wherein the second data is in the time-frequency domain (Zhou ¶[0044]: "FIG. 7A illustrates an example overview of the workflow associated with generating an image-based CNN and acoustics-based CNN, by extracting and fusing both acoustics features and image features in generating a classifier, in accordance with an embodiment of the disclosed echo-signature authentication system and method.").

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar, in view of the articulatory gestures of Babayoff, to include the convolutional layers of Zhou in order to aid in background noise removal (Zhou ¶[0194]).

Claims 7, 18 are rejected under 35 U.S.C. 103 as being unpatentable over Pishehvar, Babayoff and Zhou, in further view of Zhou_2 (US 20220335925 A1).

With respect to claims 7 and 18, none of Pishehvar, Babayoff and Zhou explicitly discloses, but Zhou_2 teaches, wherein the first set of features and the second set of features are combined in the time-frequency domain while maintaining time alignment between the first and second set of features (Zhou_2 ¶[0036]: "A timestamp (starting time and ending time) can be associated to each phoneme (for example, using the Montreal Forced Aligner to convert audio to MFCC features, and create alignment between MFCC features and phonemes). For this, the output contains: 1) a sequence of phonemes 2) the timestamp/duration of each phoneme.").

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar, in view of the articulatory gestures of Babayoff and the convolutional layers of Zhou, to include the alignment of Zhou_2 in order to remove background noise (Zhou_2 ¶[0029]).

Claims 8 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Pishehvar, Babayoff, Zhou and Zhou_2, in further view of Chakrabarty (US 20240214729 A1).

With respect to claims 8 and 19, none of Pishehvar, Babayoff, Zhou and Zhou_2 explicitly discloses, but Chakrabarty teaches, wherein the machine learning model includes one or more fusion layers to combine, in a frequency domain, the first set of features for the first data and the second set of features for the second data (Chakrabarty ¶[0080]: "Given the frequency domain feature vector for each microphone channel, the first module 160 in this block, CB1, typically comprises multiple convolution layers that compute a non-linear combination of the frequency domain features corresponding to at least two microphones for each frequency bin of the frequency domain feature representation separately. The output of the module CB1 160 may, e.g., be referred to as frequency-domain feature embedding vector. In an embodiment, only elements of the frequency domain feature vector that are associated with the same frequency bin are combined. In the case where N microphones (with N equal to or larger than two) are considered, the number of layers in this processing module is typically designed to be (N−1), similar to the design choice in [5]. This is based on the finding that to account for all the microphone pairs (two microphone combinations) for a given microphone array, (N−1) layers are used. This implies that the output of the first module CB1 of the DOA estimation block represents aggregated information of the DOA related information from all microphone input channels for that particular frequency bin.").

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar, in view of the articulatory gestures of Babayoff, the convolutional layers of Zhou and the alignment of Zhou_2, to include the fusion layers of Chakrabarty in order to perform noise reduction and speaker identification (Chakrabarty ¶[0003]).
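
Claims 6/17, 7/18 and 8/19 together describe per-modality convolutional embeddings in the time-frequency domain, combined without disturbing time alignment and mixed within each frequency bin. A minimal sketch of that combination, with all shapes, layer counts and names assumed for illustration (none of the cited references' code):

import torch
import torch.nn as nn

class PerBinFusion(nn.Module):
    def __init__(self, ch: int = 16):
        super().__init__()
        # Claims 6/17 style: one set of conv layers per modality; inputs
        # are time-frequency representations shaped (B, 1, F, T).
        self.audio_conv = nn.Conv2d(1, ch, kernel_size=3, padding=1)
        self.gesture_conv = nn.Conv2d(1, ch, kernel_size=3, padding=1)
        # Claims 8/19 style mixing: a 1x1 conv combines channels within
        # each (bin, frame) cell, so frequency bins are fused separately
        # and the frame-level time alignment (claims 7/18) is untouched.
        self.fusion = nn.Conv2d(2 * ch, 1, kernel_size=1)

    def forward(self, audio_tf, gesture_tf):
        a = torch.relu(self.audio_conv(audio_tf))
        g = torch.relu(self.gesture_conv(gesture_tf))
        return self.fusion(torch.cat([a, g], dim=1))  # (B, 1, F, T)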
Claims 10 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Pishehvar and Babayoff, in further view of Kimura (US 20140321668 A1).

With respect to claims 10 and 21, Pishehvar and Babayoff do not explicitly disclose, but Kimura teaches: further comprising: correcting [[the]]a phase of the output representative of the audio of the target speaker (Kimura ¶[0055]: "The phase correcting unit 701, while maintaining the frequency amplitude characteristics of the input signal 101, corrects only its phase characteristics in such a manner that the phase characteristics become nearly equal to the phase characteristics of the HPF 702, and supplies the resultant signal 703 to the first multiplier 705 and excessive input estimating unit 102.").

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar, in view of the articulatory gestures of Babayoff, to include the phase correction of Kimura in order to correct phase following noise cancellation (Kimura ¶[0055]).
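
Kimura's phase correcting unit adjusts phase while "maintaining the frequency amplitude characteristics". The textbook building block with exactly that property is an all-pass filter; a short sketch of the property, not Kimura's design (the coefficient is an arbitrary assumption):

import numpy as np
from scipy.signal import lfilter

def allpass_first_order(x: np.ndarray, a: float = 0.5) -> np.ndarray:
    # H(z) = (a + z^-1) / (1 + a z^-1): unit magnitude at every frequency,
    # so only the phase of x is altered.
    return lfilter([a, 1.0], [1.0, a], x)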
Claims 11 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Pishehvar and Babayoff, in further view of Biswas (US 20210327445 A1) and Lee (US 20220343939 A1).

With respect to claims 11 and 22, Pishehvar and Babayoff do not explicitly disclose, but Biswas teaches, wherein during training of the machine learning model, a generator comprising the machine learning model is used to output a noise-reduced representation of audible speech of the target speaker, and a discriminator is used to receive as a first input the noise-reduced representation of audible speech of the target speaker, and receive as a second input a noisy representation of audible speech of the target speaker (Biswas ¶[0057]: "Such a Generator may be a Generator trained in a Generative Adversarial Network setting (GAN setting). A GAN setting generally includes the Generator G and a Discriminator D which are trained by an iterative process. During training in the Generative Adversarial Network setting, the Generator G generates enhanced audio data, x*, based on a random noise vector, z, and raw audio data derived from original audio data, x, that has been coded at a low bitrate and decoded, respectively. The random noise vector may, however, be set to z=0, which was found to be best for coding artifact reduction. Training may be performed without the input of a random noise vector, z. In addition, metadata may be input into the Generator for modifying enhanced audio data in a coded audio feature space. In this, during training, the generation of enhanced audio data may be conditioned based on the metadata. The Generator G tries to output enhanced audio data, x*, that is indistinguishable from the original audio data, x. The Discriminator D is one at a time fed with the generated enhanced audio data, x*, and the original audio data, x, and judges in a fake/real manner whether the input data are enhanced audio data, x*, or original audio data, x. In this, the Discriminator D tries to discriminate the original audio data, x, from the enhanced audio data, x*. During the iterative process, the Generator G then tunes its parameters to generate better and better enhanced audio data, x*, as compared to the original audio data, x, and the Discriminator D learns to better judge between the enhanced audio data, x*, and the original audio data, x. This adversarial learning process may be described by the following equation (1):").

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar, in view of the articulatory gestures of Babayoff, to include the generator and discriminator of Biswas in order to enhance the quality of the raw audio data (Biswas ¶[0084]).

Lee further teaches: output, using a cross-modal similarity metric, a cross-modal indication of similarity to train the machine learning model (Lee ¶[0053]: "In one example, similarities 440 output from attention module 430 may be in the form of a heat map for each frame of the input video clip 402 to indicate pixels within the frame having the highest similarities with a particular phrase"; ¶[0068]: "As such, the disclosure, by training the neural network model with the text and audio signal as input data (this may be called cross-modality learning) may distinguish between audio signals or text having similarities. For example, in order to learn 'Tom' which is a name of a person, by training the neural network model by masking at least one of text elements 'T', 'o', 'm' constituting the text 'Tom', audio element corresponding to 'T', audio element corresponding to 'o', and an audio element corresponding to 'm', the neural network model of the disclosure may output the audio signal 'Tom' as the text 'Tom' and may prevent error of outputting 'tomb' which has a phonetic similarity"; ¶[0067]: "When it is identified that the plurality of audio elements constituting the outputted second audio signal correspond to a plurality of text elements constituting the second text after learning, the neural network model may terminate learning in operation S1600-Y. Even prior to learning, if the plurality of audio elements constituting the outputted second audio signal correspond to a plurality of text elements constituting the second text, the neural network model may terminate the learning procedure without performing learning.").

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the noisy audio processing of Pishehvar, in view of the articulatory gestures of Babayoff and the generator and discriminator of Biswas, to include the cross-modal indication of Lee in order to remove noise from an audio signal (Lee ¶[0073]).
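
Biswas's GAN setting maps onto a standard adversarial training step. A generic sketch, assuming D ends in a sigmoid and G, D are placeholder nn.Module instances (not Biswas's architecture or loss, whose equation (1) is not reproduced in the office action):

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, noisy, clean):
    # Discriminator step: original audio x -> "real", enhanced x* -> "fake".
    with torch.no_grad():
        enhanced = G(noisy)
    real_pred, fake_pred = D(clean), D(enhanced)
    d_loss = F.binary_cross_entropy(real_pred, torch.ones_like(real_pred)) \
           + F.binary_cross_entropy(fake_pred, torch.zeros_like(fake_pred))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: tune G so that D judges enhanced audio as real.
    fake_pred = D(G(noisy))
    g_loss = F.binary_cross_entropy(fake_pred, torch.ones_like(fake_pred))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()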
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA, whose telephone number is (408) 918-7675. The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Daniel Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ATHAR N PASHA/
Primary Examiner, Art Unit 2657

Prosecution Timeline

Jul 18, 2024: Application Filed
Feb 07, 2026: Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596882: COMPLIANCE DETECTION USING NATURAL LANGUAGE PROCESSING
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12586563: Method, System and Apparatus for Understanding and Generating Human Conversational Cues
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12579173: SYSTEMS AND METHODS FOR DYNAMICALLY PROVIDING INTELLIGENT RESPONSES
Granted Mar 17, 2026 (2y 5m to grant)

Patent 12566921: GAZETTEER INTEGRATION FOR NEURAL NAMED ENTITY RECOGNITION
Granted Mar 03, 2026 (2y 5m to grant)

Patent 12547844: INTELLIGENT MODEL SELECTION SYSTEM FOR STYLE-SPECIFIC DIGITAL CONTENT GENERATION
Granted Feb 10, 2026 (2y 5m to grant)
Based on this examiner's 5 most recent grants; study what changed in each case to get past this examiner.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 90% (99% with interview, +17.0%)
Median Time to Grant: 2y 8m
PTA Risk: Low

Based on 154 resolved cases by this examiner. Grant probability is derived from the career allow rate.

Free tier: 3 strategy analyses per month