Last updated: May 29, 2026
Application No. 18/541,788
SPLIT-AND-MERGE FRAMEWORK FOR AUDIO CONTENT PROCESSING

Final Rejection: §101, §103
Filed: Dec 15, 2023
Examiner: SHAIKH, ZEESHAN MAHMOOD
Art Unit: 2658
Tech Center: 2600 (Communications)
Assignee: Paypal Inc.
OA Round: 2 (Final)
Interview Optional

An interview has already been conducted in this application's prosecution history. This examiner has a 53% grant rate with a +52.8% interview lift. Since an interview has already been tried, the recommendation is a written response with narrowed claims based on precedent claim evolution patterns.
Based on 34 resolved cases, 2023–2026
Examiner Intelligence

SHAIKH, ZEESHAN MAHMOOD
Career Allowance Rate: 53% of resolved cases granted (18 granted / 34 resolved), -9.1% vs TC avg
Interview Lift: +52.8% (strong)
Avg Prosecution (typical timeline): 3y 1m
21 currently pending
Statute-Specific Performance

§101: 6.9% (-33.1% vs TC avg)
§103: 88.4% (+48.4% vs TC avg)
§102: 4.8% (-35.2% vs TC avg)
Tech Center average is an estimate. Based on career data from 34 resolved cases.
Office Action

DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
This communication is responsive to the applicant’s amendment dated 2/9/2026.  The applicant amended claims 1-4, 8-10, 15, and 18.  

Response to Arguments
Applicant's arguments with respect to 35 U.S.C. 101 (see Remarks, pg. 8, line 17 – pg. 10, line 23) filed 2/9/2026 have been fully considered but they are not persuasive. The applicant has amended the limitations to include predicting events by analyzing vocal features, modifying vocal features based on correlations, and classifying the modified vocal features.  As shown below, the examiner interprets these limitations as mental processes using generic computer components.  The examiner fails to see how these steps are an improvement to a computer or technical field.  Therefore, the 35 U.S.C. 101 rejection is maintained.  
Applicant’s arguments with respect to 35 U.S.C. 102 and 35 U.S.C. 103 (see Remarks, pg. 10, line 24 – pg. 11, line 29) for claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Given the amendments, a new ground of rejection is provided below.

Claim Rejections - 35 USC § 101
    35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

    Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Independent claim 1 recites, “splitting an audio content into a vocal portion and a background portion”, “extracting vocal features from the vocal portion and extracting background features from the background portion”, “predicting an event associated with the audio content based on analyzing the vocal features independent of the background features”, “determining one or more correlations between the vocal portion of the audio content and the background portion of the audio content based on the vocal features and the background features”, “modifying the vocal features based on the one or more correlations, wherein the modifying comprises accentuating at least a portion of the vocal features when the one or more correlations indicate that the background features supports an occurrence of the event or suppressing the at least the portion of the vocal features when the one or more correlations indicate that the background features refutes the occurrence of the event” and “classifying the audio content based on analyzing the modified vocal features and the background features collectively”. 
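The accentuate-or-suppress step recited above can be illustrated with a minimal Python sketch. It is illustrative only and is neither the applicant's implementation nor a claim construction. The claim does not say how correlations are computed, so the Pearson correlation, the `support_threshold`, and the `boost` and `damp` gains below are all assumptions, and every function name is hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length feature vectors (assumed measure)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / (var_x * var_y) ** 0.5


def modify_vocal_features(vocal, background, support_threshold=0.5,
                          boost=1.5, damp=0.5):
    """Scale vocal features up when the background correlates with them
    (treated here as "supporting" the predicted event), down otherwise."""
    r = pearson(vocal, background)
    gain = boost if r >= support_threshold else damp
    return [v * gain for v in vocal], r


# Hypothetical feature vectors: one supporting and one refuting background.
accentuated, r_support = modify_vocal_features([1, 2, 3, 4], [2, 4, 6, 8])
suppressed, r_refute = modify_vocal_features([1, 2, 3, 4], [8, 6, 4, 2])
```

In this sketch a single scalar correlation drives one gain for the whole vector; a per-feature or per-segment gain would be an equally plausible reading.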
	The limitation of splitting audio, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “a non-transitory memory” and “one or more hardware processors”, nothing in the claim precludes the step from practically being performed in the mind. For example, “splitting” in the context of this claim encompasses identifying portions of audio, which a human can do by listening and identifying portions of audio. Next, the limitation of extracting vocal features, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “extracting” in the context of this claim encompasses feature extraction, which a human can do by identifying portions of audio. Next, the limitation of predicting an event associated with audio content, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “predicting” in the context of this claim encompasses analyzing audio data, which a human can do in the mind. Next, the limitation of determining correlations, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “determining” in the context of this claim encompasses feature extraction analysis, which a human can do in the mind or with a pen and paper. Next, the limitation of modifying vocal features, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “modifying” in the context of this claim encompasses censoring audio content, which a human can do in the mind. Lastly, the limitation of classifying audio content, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “classifying” in the context of this claim encompasses labeling audio, which a human can do in the mind.
	The judicial exception is not integrated into a practical application. In particular, the claim only recites the additional elements of using “a non-transitory memory” and “one or more hardware processors” to perform the recited limitations. These elements are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using generic computer components. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
	The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above with respect to the integration of the abstract idea into a practical application, the additional elements of using “a non-transitory memory” and “one or more hardware processors” to perform the recited limitations amounts to no more than mere instructions to apply the exception using generic computer components.  Mere instructions to apply an exception using generic computer components cannot provide an inventive concept.  The claim is not patent eligible.   
	Dependent claims 2-7 are also rejected for the same reasons provided in independent claim 1 above.  The dependent claim, including the further recited limitation, does not integrate the abstract idea into a practical application and the additional elements, taken individually and in combination do not contribute to an inventive concept.  In other words, the dependent claim is directed to an abstract idea without significantly more.  
	Independent claim 8 recites, “dividing audio data associated with a digital content into a first audio track and a second audio track”, “extracting a first plurality of audio features from the first audio track and extracting a second plurality of audio features from the second audio track”, “predicting an event associated with the digital content based on analyzing the first plurality of audio features”, “determining one or more correlations between the first audio track and the second audio track based on the first plurality of audio features and the second plurality of audio features”, “modifying the first plurality of audio features based on the one or more correlations, wherein the modifying comprises emphasizing at least a subset of the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features is consistent with an occurrence of the event or de-emphasizing the at least the subset of the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features is inconsistent with the occurrence of the event”, and “classifying the digital content based on analyzing the modified first plurality of audio features and the second plurality of audio features”. 
	The limitation of dividing audio data, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. Nothing in the claim precludes the step from practically being performed in the mind. For example, “dividing” in the context of this claim encompasses identifying portions of audio, which a human can do by listening and identifying portions of audio. Next, the limitation of extracting audio features, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. Nothing in the claim precludes the step from practically being performed in the mind. For example, “extracting” in the context of this claim encompasses feature extraction, which a human can do by identifying portions of audio. Next, the limitation of predicting an event associated with digital content, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “predicting” in the context of this claim encompasses analyzing audio data, which a human can do in the mind. Next, the limitation of determining correlations, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. Nothing in the claim precludes the step from practically being performed in the mind. For example, “determining” in the context of this claim encompasses feature extraction analysis, which a human can do in the mind or with a pen and paper. Next, the limitation of modifying audio features, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “modifying” in the context of this claim encompasses censoring audio content, which a human can do in the mind. Lastly, the limitation of classifying digital content, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. Nothing in the claim precludes the step from practically being performed in the mind. For example, “classifying” in the context of this claim encompasses labeling audio, which a human can do in the mind.
	The judicial exception is not integrated into a practical application. The claims fail to recite specific technical improvements or inventive implementation detail that would provide significantly more than the abstract idea. Accordingly, the claims do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
	The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above with respect to the integration of the abstract idea into a practical application, the additional elements of using “a non-transitory memory” and “one or more hardware processors” to perform the recited limitations amounts to no more than mere instructions to apply the exception using generic computer components.  Mere instructions to apply an exception using generic computer components cannot provide an inventive concept.  The claim is not patent eligible.   
	Dependent claims 9-14 are also rejected for the same reasons provided in independent claim 8 above.  The dependent claim, including the further recited limitation, does not integrate the abstract idea into a practical application and the additional elements, taken individually and in combination do not contribute to an inventive concept.  In other words, the dependent claim is directed to an abstract idea without significantly more.  
	Independent claim 15 recites “splitting audio data into a first portion and a second portion”, “extracting a first plurality of audio features from the first portion of the audio data and extracting a second plurality of audio features from the second portion of the audio data”, “predicting an occurrence of an event based on analyzing the first plurality of audio features”, “comparing the first plurality of audio features with the second plurality of audio features”, “determining, based on the comparing, one or more correlations between the first portion of the audio data and the second portion of the audio”, “modifying the first plurality of audio features based on the one or more correlations, wherein the modifying comprises highlighting the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features supports the occurrence of the event or suppressing the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features does not support the occurrence of the event”, and “classifying the audio data based on analyzing the modified first plurality of audio features and the second plurality of audio features”.  
	The limitation of splitting audio, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “a non-transitory machine-readable medium”, nothing in the claim precludes the step from practically being performed in the mind. For example, “splitting” in the context of this claim encompasses identifying portions of audio, which a human can do by listening and identifying portions of audio. Next, the limitation of extracting audio features, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “extracting” in the context of this claim encompasses feature extraction, which a human can do by identifying portions of audio. Next, the limitation of predicting an event associated with audio content, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “predicting” in the context of this claim encompasses analyzing audio data, which a human can do in the mind. Next, the limitation of comparing audio features, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “comparing” in the context of this claim encompasses feature extraction analysis, which a human can do in the mind or with a pen and paper. Next, the limitation of determining correlations, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “determining” in the context of this claim encompasses feature extraction analysis, which a human can do in the mind or with a pen and paper. Next, the limitation of modifying audio features, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “modifying” in the context of this claim encompasses censoring audio content, which a human can do in the mind. Lastly, the limitation of classifying audio content, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting the elements listed above, nothing in the claim precludes the step from practically being performed in the mind. For example, “classifying” in the context of this claim encompasses labeling audio, which a human can do in the mind.
The judicial exception is not integrated into a practical application. In particular, the claim only recites the additional element of using “a non-transitory machine-readable medium” to perform the recited limitations. This element is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.  As discussed above with respect to the integration of the abstract idea into a practical application, the additional elements of using “a non-transitory machine-readable medium” to perform the recited limitations amounts to no more than mere instructions to apply the exception using generic computer components.  Mere instructions to apply an exception using generic computer components cannot provide an inventive concept.  The claim is not patent eligible.   
Dependent claims 16-20 are also rejected for the same reasons provided in independent claim 15 above.  The dependent claim, including the further recited limitation, does not integrate the abstract idea into a practical application and the additional elements, taken individually and in combination do not contribute to an inventive concept.  In other words, the dependent claim is directed to an abstract idea without significantly more.  

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-10 and 13-19 are rejected under 35 U.S.C. 103 as being unpatentable over Sargsyan et al. US 20200066296 A1 (hereinafter Sargsyan) in view of Liu et al. US 20230104070 A1 (hereinafter Liu).	

Regarding independent claim 1, Sargsyan teaches a system, comprising:
a non-transitory memory (FIG. 15, 1506); and
one or more hardware processors coupled with the non-transitory memory and configured to read instructions from the non-transitory memory to cause the system to perform operations comprising (FIG. 15, 1506):
splitting an audio content into a vocal portion and a background portion ([0005] “receiving an audio file comprising a combination of voice data and noise data having a first bandwidth; dividing said audio file into a plurality of frames”, [0025] “The methods and systems disclosed herein can be provided as an online service which receives audio or video file, cleans it from background noise and returns the resulting file back”);
extracting vocal features from the vocal portion and extracting background features from the background portion (FIG. 3, [0070] “During data collection prior to train (see train process) the systems and methods extract VAD of clean speech based on k-mean algorithm and use this feature to calculate NMA based on voiceless frames of the mix”; [0045] “At each step of training data creation (see train process in chart), the systems and methods take a randomly picked noise recording and a randomly picked speech recording, and extract raw data of these audios”);
determining one or more correlations between the vocal portion of the audio content and the background portion of the audio content based on the vocal features and the background features ([0076] “The systems and methods take a 8 kHz .wav file and calculate its power spectrum and phase of overlapped frames. Overlapping frames allow keeping correlation between neighboring frames”, examiner interprets overlapping frames to include vocal portions and background portions); and
classifying the audio content based on analyzing the modified vocal features and the background features collectively. ([0030] using overlapped frames to calculate voice coefficients to update noise model.  The examiner interprets this as a means of classifying the audio content based off correlations; [0074] “The systems and methods evaluate the overall performance of the models using, for example, the following metrics: STOI (Short-Time Objective Intelligibility), PESQ (perceptual evaluation of speech quality, version ITU-T P.862), SNR (speech to noise ratio), SIR (speech to interference ratio). All of these metrics work based on reference audio (clean speech) and enhanced audio”).
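The overlapped-frame computation quoted from Sargsyan [0076] (power spectrum and phase of overlapped frames) can be sketched as follows. This is a minimal illustration under stated assumptions, not Sargsyan's code: the frame length, hop size, and the naive DFT are choices made here for self-containment, and the function names are hypothetical.

```python
import cmath
import math


def overlapped_frames(samples, frame_len, hop):
    """Split samples into frames of frame_len that start every hop samples.
    With hop < frame_len, neighboring frames share samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def dft(frame):
    """Naive O(n^2) discrete Fourier transform, adequate for a short sketch."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def power_and_phase(frame):
    """Return the per-bin power spectrum and phase of one frame."""
    spectrum = dft(frame)
    power = [abs(c) ** 2 for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return power, phase


# Toy signal: 8 samples, 4-sample frames, 50% overlap.
frames = overlapped_frames(list(range(8)), frame_len=4, hop=2)
power, phase = power_and_phase([1.0, 1.0, 1.0, 1.0])
```

Note that the correlation preserved here is between neighboring frames of one signal, which is the point on which the examiner's mapping to correlations between vocal and background portions relies on interpretation.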
	Sargsyan fails to teach predicting an event associated with the audio content based on analyzing the vocal features independent of the background features; modifying the vocal features based on the one or more correlations, wherein the modifying comprises accentuating at least a portion of the vocal features when the one or more correlations indicate that the background features supports an occurrence of the event or suppressing the at least the portion of the vocal features when the one or more correlations indicate that the background features refutes the occurrence of the event.
However, Liu teaches predicting an event associated with the audio content based on analyzing the vocal features independent of the background features ([0052] “the system 10 can be deployed in a hearing aid, for example, to aid in picking up the sound of others (e.g., a voice of a conversation partner or a desired signal source) in the far field in order to enhance playback”, examiner here interprets the conversation as the event);
modifying the vocal features based on the one or more correlations, wherein the modifying comprises accentuating at least a portion of the vocal features when the one or more correlations indicate that the background features supports an occurrence of the event or suppressing the at least the portion of the vocal features when the one or more correlations indicate that the background features refutes the occurrence of the event (FIG. 1, [0044-0045] “This process P2A can also be beneficial in scenarios where multiple users 15 (FIG. 1) will be talking and it is desirable to enhance speech from two or more of those users 15… a) filtering the reference signal to generate a noise estimate signal and b) subtracting the noise estimate signal from the primary signal. In certain of these cases, the process further includes enhancing the spectral amplitude of the primary signal 210 based on the noise estimate signal to provide an output signal”; [0017] “removing from the primary signal components that correlate to the reference signal includes filtering the reference signal to generate a noise estimate signal and subtracting the noise estimate signal from the primary signal”);
	Sargsyan and Liu are considered to be analogous to the claimed invention because both are in the same field of audio enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech enhancement and noise suppression techniques of Sargsyan with the technique of modifying vocal features based on correlations taught by Liu in order to improve beamforming in audio devices (see Liu [0001]).
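The operation quoted from Liu [0017] and [0044-0045], filtering the reference signal to generate a noise estimate and subtracting that estimate from the primary signal, can be sketched in a few lines. This is a hedged illustration only: the quoted passages do not state how the filter is obtained, so the fixed FIR `taps` below are an assumption, and the function names are hypothetical.

```python
def fir_filter(signal, taps):
    """Causal FIR filter: out[n] = sum over k of taps[k] * signal[n - k]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def subtract_noise_estimate(primary, reference, taps):
    """Filter the reference into a noise estimate, then remove it from primary."""
    noise_estimate = fir_filter(reference, taps)
    return [p - n for p, n in zip(primary, noise_estimate)]


# Toy data: the primary signal is speech plus half of the reference noise.
speech = [1.0, 0.0, -1.0, 0.0]
reference = [0.2, 0.4, 0.2, 0.0]
primary = [s + 0.5 * r for s, r in zip(speech, reference)]
cleaned = subtract_noise_estimate(primary, reference, taps=[0.5])
```

The sketch removes components of the primary signal that correlate with the reference. It contains no event prediction and no conditional accentuate-or-suppress branch, which is worth keeping in mind when comparing it with the claim language.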

Regarding claim 2, Sargsyan in view of Liu teaches all of the limitations of claim 1, upon which claim 2 depends. 
Additionally, Sargsyan teaches wherein the classifying comprises classifying the audio content as a first audio type, and wherein the operations further comprise: incorporating, into the audio content, a signal indicating the first audio type ([0005] “taking an inverse fast Fourier transform of the frequency spectrum to provide an audio signal; repeating said bandwidth expansion process for a subsequent frame of said plurality of frames; and outputting an audio file having the second bandwidth based on the audio signals for the plurality of frames”).

Regarding claim 3, Sargsyan in view of Liu teaches all of the limitations of claim 1, upon which claim 3 depends. 
	Additionally, Sargsyan teaches wherein the classifying comprises determining that a first segment of the audio content comprises audio data corresponding to a first audio type, and wherein the operations further comprise: modifying the first segment of the audio content based on the first audio type ([0065] “the systems and methods generate the ratio mask for the current frame. Modifying this ratio mask with special smoothing functions, it is multiplied with amplitudes of current frame's Fourier coefficients”).
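The ratio-mask step quoted from Sargsyan [0065] can be sketched as follows, assuming a per-bin mask in the range 0 to 1. The quoted passage mentions only "special smoothing functions", so the three-point moving average below is a stand-in chosen for illustration, and the function names are hypothetical.

```python
def smooth_mask(mask):
    """Three-point moving average across neighboring frequency bins
    (an assumed stand-in for the unspecified smoothing functions)."""
    smoothed = []
    for i in range(len(mask)):
        window = mask[max(0, i - 1):i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed


def apply_ratio_mask(amplitudes, mask):
    """Multiply each Fourier amplitude of the current frame by its smoothed mask value."""
    return [a * m for a, m in zip(amplitudes, smooth_mask(mask))]


# Toy frame: three frequency bins, with the mask favoring the middle bin.
masked = apply_ratio_mask([2.0, 3.0, 2.0], [0.0, 1.0, 0.0])
unchanged = apply_ratio_mask([2.0, 4.0, 6.0], [1.0, 1.0, 1.0])
```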

Regarding claim 4, Sargsyan in view of Liu teaches all of the limitations of claim 3, upon which claim 4 depends. 
	Additionally, Sargsyan teaches wherein the modifying the first segment comprises removing the first segment from the audio content ([0024] “The systems and methods disclosed herein will remove background noise”).

	Regarding claim 5, Sargsyan in view of Liu teaches all of the limitations of claim 1, upon which claim 5 depends. 
	Additionally, Sargsyan teaches wherein the operations further comprise: augmenting the vocal portion and the background portion ([0081] “Speech enhancement and noise suppression system 1500 also includes an audio processing manager 1508 that manages the processing of various audio data and audio signals…” ).

Regarding claim 6, Sargsyan in view of Liu teaches all of the limitations of claim 1, upon which claim 6 depends. 
Additionally, Sargsyan teaches wherein the operations further comprise: segmenting the vocal portion into a plurality of vocal segments; and segmenting the background portion into a plurality of background segments ([0005] “dividing said audio file into a plurality of frames”, [0025] “The methods and systems disclosed herein can be provided as an online service which receives audio or video file, cleans it from background noise and returns the resulting file back”),
wherein the determining the one or more correlations comprises determining a corresponding correlation score between a first vocal segment in the plurality of vocal segments and each corresponding background segment in the plurality of background segments ([0074] “The systems and methods evaluate the overall performance of the models using, for example, the following metrics: STOI (Short-Time Objective Intelligibility), PESQ (perceptual evaluation of speech quality, version ITU-T P.862), SNR (speech to noise ratio), SIR (speech to interference ratio)”).
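Of the metrics listed in Sargsyan [0074], the signal-to-noise ratio is simple enough to sketch. The version below uses one common textbook definition, computed from a clean reference and an enhanced signal. Sargsyan's exact formula is not given in the quoted passage, so this definition and the function name are assumptions.

```python
import math


def snr_db(reference, enhanced):
    """SNR in decibels: power of the clean reference over power of the residual
    (reference minus enhanced). Returns infinity for a perfect match."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, enhanced))
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


# Toy data: the enhanced signal is uniformly 10% below the reference.
score = snr_db([1.0, 1.0, 1.0, 1.0], [0.9, 0.9, 0.9, 0.9])
```

A metric of this kind yields a single quality score for a whole signal pair, whereas the claim limitation calls for a correlation score between a vocal segment and each background segment.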

	Regarding claim 7, Sargsyan in view of Liu teaches all of the limitations of claim 6, upon which claim 7 depends. 
Additionally, Sargsyan teaches wherein the operations further comprise: determining a correlation between the first voice segment and a first corresponding background segment from the plurality of background segments based on the corresponding correlation score ([0076] “Train process: The systems and methods take a 8 kHz .wav file and calculate its power spectrum and phase of overlapped frames. Overlapping frames allow keeping correlation between neighboring frames”).

	Regarding independent claim 8, Sargsyan teaches a method, comprising: 
dividing audio data associated with a digital content into a first audio track and a second audio track ([0005] “dividing said audio file into a plurality of frames; for a first frame of said plurality of frames”, [Claim 6] “identifying a second subset of the plurality of frames of the mixed data, the second subset including a second plurality of frames”); 
extracting a first plurality of audio features from the first audio track and extracting a second plurality of audio features from the second audio track ([0070] “During data collection prior to train (see train process) the systems and methods extract VAD of clean speech based on k-mean algorithm and use this feature to calculate NMA based on voiceless frames of the mix”); 
determining one or more correlations between the first audio track and the second audio track based on the first plurality of audio features and the second plurality of audio features ([0076] “The systems and methods take a 8 kHz .wav file and calculate its power spectrum and phase of overlapped frames. Overlapping frames allow keeping correlation between neighboring frames.”); and 
classifying the digital content based on analyzing the modified first plurality of audio features and the second plurality of audio features ([0030] using overlapped frames to calculate voice coefficients to update noise model. The examiner interprets this as a means of classifying the audio content based on correlations; [0074] “The systems and methods evaluate the overall performance of the models using, for example, the following metrics: STOI (Short-Time Objective Intelligibility), PESQ (perceptual evaluation of speech quality, version ITU-T P.862), SNR (speech to noise ratio), SIR (speech to interference ratio). All of these metrics work based on reference audio (clean speech) and enhanced audio”).
	Sargsyan fails to teach predicting an event associated with the digital content based on analyzing the first plurality of audio features; modifying the first plurality of audio features based on the one or more correlations, wherein the modifying comprises emphasizing at least a subset of the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features is consistent with an occurrence of the event or de-emphasizing the at least the subset of the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features is inconsistent with the occurrence of the event.
	However, Liu teaches predicting an event associated with the digital content based on analyzing the first plurality of audio features ([0052] “the system 10 can be deployed in a hearing aid, for example, to aid in picking up the sound of others (e.g., a voice of a conversation partner or a desired signal source) in the far field in order to enhance playback”, examiner here interprets the conversation as the event);
	modifying the first plurality of audio features based on the one or more correlations, wherein the modifying comprises emphasizing at least a subset of the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features is consistent with an occurrence of the event or de-emphasizing the at least the subset of the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features is inconsistent with the occurrence of the event (FIG. 1, [0044-0045] “This process P2A can also be beneficial in scenarios where multiple users 15 (FIG. 1) will be talking and it is desirable to enhance speech from two or more of those users 15… a) filtering the reference signal to generate a noise estimate signal and b) subtracting the noise estimate signal from the primary signal. In certain of these cases, the process further includes enhancing the spectral amplitude of the primary signal 210 based on the noise estimate signal to provide an output signal”; [0017] “removing from the primary signal components that correlate to the reference signal includes filtering the reference signal to generate a noise estimate signal and subtracting the noise estimate signal from the primary signal”);
	Sargsyan and Liu are considered analogous to the claimed invention because both are in the same field of audio enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech enhancement and noise suppression techniques of Sargsyan with the technique of modifying vocal features based on correlations taught by Liu in order to improve beamforming in audio devices (see Liu [0001]).

Regarding claim 9, Sargsyan in view of Liu teaches all of the limitations of claim 8, upon which claim 9 depends. 
Additionally, Sargsyan teaches in response to determining the occurrence of the event based on the one or more correlations emphasizing the at least the subset of the first plurality of audio features ([0032] “The noise model is updated using the successive frames having 0 as a VAD output”, examiner interprets VAD output as the occurrence of an event; [0062] “Then, the systems and methods take the logarithm and obtain the clean speech features. Further, the systems and methods use the ratio mask to get a noise model (approximation of noise LPS features) for the previous frame”; [0076] “The systems and methods take a 8 kHz .wav file and calculate its power spectrum and phase of overlapped frames. Overlapping frames allow keeping correlation between neighboring frames”; [0079] “A system for speech enhancement and noise suppression may include a processor configured to implement a method for speech enhancement and noise suppression”).

	Regarding claim 10, Sargsyan in view of Liu teaches all of the limitations of claim 9, upon which claim 10 depends. 
Additionally, Sargsyan teaches wherein the classifying the digital content is further based on the occurrence of the event ([0005] “performing a fast Fourier transform to obtain audio features corresponding to the combination of voice data and noise data…”; [0030] “The systems and methods then concatenate them with a noise model and take as an input for a neural network… the systems and methods generate a ratio mask… clean voice coefficients are computed using ratio masks and update the noise model”).
	
	Regarding claim 13, Sargsyan in view of Liu teaches all of the limitations of claim 8, upon which claim 13 depends. 
Additionally, Sargsyan teaches wherein the extracting the first plurality of audio features from the first portion of the audio data comprises: 
extracting a first portion of the first plurality of audio features from the first audio track using a first machine learning model ([0070] “During data collection prior to train (see train process) the systems and methods extract VAD of clean speech based on k-mean algorithm and use this feature to calculate NMA based on voiceless frames of the mix”; [0098] “Combination of statistical and machine learning methods”, examiner interprets these as different learning models); and 
extracting a second portion of the first plurality of audio features from the first audio track using a second machine learning model different from the first machine learning model ([0045] “the systems and methods take a randomly picked noise recording and a randomly picked speech recording, and extract raw data of these audios. The level of noise is randomly changed, and sum of the speech data and the noise data in order to create mix data”).

Regarding claim 14, Sargsyan in view of Liu teaches all of the limitations of claim 8, upon which claim 14 depends. 
Additionally, Sargsyan teaches segmenting the first audio track into a first plurality of audio segments ([0005] “taking an inverse fast Fourier transform of the frequency spectrum to provide an audio signal; repeating said bandwidth expansion process for a subsequent frame of said plurality of frames; and outputting an audio file having the second bandwidth based on the audio signals for the plurality of frames”, examiner interprets expansion process for a subsequent frame as segmenting); and 
segmenting the second audio track into a second plurality of audio segments, wherein the determining the one or more correlations comprises determining a corresponding correlation score between a first audio segment in the first plurality of audio segments and each corresponding audio segment in the second plurality of audio segments ([0005] “dividing said audio file into a plurality of frames”, [0074] the listed metrics determine correlations between various audio segments).
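
For illustration only, the recited per-segment scoring (one score between a first audio segment and each segment of the second plurality) can be sketched as below. Neither the claim language nor the cited paragraphs name a particular correlation measure, so Pearson correlation over per-segment feature vectors is an assumption here, as are the function names `pearson` and `correlation_scores` and the sample feature values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length feature vectors (assumed measure)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / (norm_x * norm_y)


def correlation_scores(first_segment, second_segments):
    """Score one first-track segment against every second-track segment."""
    return [pearson(first_segment, seg) for seg in second_segments]


# Hypothetical feature vectors, one per segment.
first = [1.0, 2.0, 3.0, 4.0]
second = [
    [2.0, 4.0, 6.0, 8.0],  # rises with the first segment
    [4.0, 3.0, 2.0, 1.0],  # falls as the first segment rises
    [5.0, 5.0, 5.0, 5.0],  # constant
]
scores = correlation_scores(first, second)
print(scores)
```

With these sample vectors the scores come out close to 1, close to -1, and exactly 0.0 for the constant segment, which is the kind of per-pair value a later emphasize or de-emphasize step could consume.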

	Regarding independent claim 15, Sargsyan teaches a non-transitory machine-readable medium having stored thereon machine- readable instructions executable to cause a machine to perform operations comprising: 
splitting audio data into a first portion and a second portion ([0005] “receiving an audio file comprising a combination of voice data and noise data having a first bandwidth; dividing said audio file into a plurality of frames”, [0025] “The methods and systems disclosed herein can be provided as an online service which receives audio or video file, cleans it from background noise and returns the resulting file back”); 
extracting a first plurality of audio features from the first portion of the audio data and extracting a second plurality of audio features from the second portion of the audio data (FIG. 3, [0070] “During data collection prior to train (see train process) the systems and methods extract VAD of clean speech based on k-mean algorithm and use this feature to calculate NMA based on voiceless frames of the mix”); 
comparing the first plurality of audio features with the second plurality of audio features ([0104] “Implementation of speech enhancement evaluation scores (benchmarks)—the systems and methods are using PESQ, MOS, STOI, POLQA, SNR, SIR scores to evaluate the model performance and to compare the results with other models”); 
determining, based on the comparing, one or more correlations between the first portion of the audio data and the second portion of the audio data ([0076]); and
classifying the audio data based on analyzing the modified first plurality of audio features and the second plurality of audio features ([0005]; [0030]; [0074]).
	Sargsyan fails to teach predicting an occurrence of an event based on analyzing the first plurality of audio features; modifying the first plurality of audio features based on the one or more correlations, wherein the modifying comprises highlighting the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features supports the occurrence of the event or suppressing the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features does not support the occurrence of the event.
However, Liu teaches predicting an occurrence of an event based on analyzing the first plurality of audio features ([0052] “the system 10 can be deployed in a hearing aid, for example, to aid in picking up the sound of others (e.g., a voice of a conversation partner or a desired signal source) in the far field in order to enhance playback”, examiner here interprets the conversation as the event);   
modifying the first plurality of audio features based on the one or more correlations, wherein the modifying comprises highlighting the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features supports the occurrence of the event or suppressing the first plurality of audio features when the one or more correlations indicate that the second plurality of audio features does not support the occurrence of the event (FIG. 1, [0044-0045] “This process P2A can also be beneficial in scenarios where multiple users 15 (FIG. 1) will be talking and it is desirable to enhance speech from two or more of those users 15… a) filtering the reference signal to generate a noise estimate signal and b) subtracting the noise estimate signal from the primary signal. In certain of these cases, the process further includes enhancing the spectral amplitude of the primary signal 210 based on the noise estimate signal to provide an output signal”; [0017] “removing from the primary signal components that correlate to the reference signal includes filtering the reference signal to generate a noise estimate signal and subtracting the noise estimate signal from the primary signal”)
Sargsyan and Liu are considered analogous to the claimed invention because both are in the same field of audio enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech enhancement and noise suppression techniques of Sargsyan with the technique of modifying vocal features based on correlations taught by Liu in order to improve beamforming in audio devices (see Liu [0001]).

Regarding claim 16, Sargsyan in view of Liu teaches all of the limitations of claim 15, upon which claim 16 depends. 
Additionally, Sargsyan teaches wherein the operations further comprise: 
segmenting the first portion of the audio data into a first plurality of audio segments ([0005] “taking an inverse fast Fourier transform of the frequency spectrum to provide an audio signal; repeating said bandwidth expansion process for a subsequent frame of said plurality of frames; and outputting an audio file having the second bandwidth based on the audio signals for the plurality of frames”, examiner interprets expansion process for a subsequent frame as segmenting; [0025] cleaning audio portion from background noise, first and second segment); and
segmenting the second portion of the audio data into a second plurality of audio segments, wherein the determining the one or more correlations comprises determining a corresponding correlation score between a first audio segment in the first plurality of audio segments and each corresponding audio segment in the second plurality of audio segments ([0005] “dividing said audio file into a plurality of frames”, [0074] the listed metrics determine correlations between various audio segments; [0076] “The systems and methods take a 8 kHz .wav file and calculate its power spectrum and phase of overlapped frames. Overlapping frames allow keeping correlation between neighboring frames.”).

Regarding claim 17, Sargsyan in view of Liu teaches all of the limitations of claim 16, upon which claim 17 depends. 
Additionally, Sargsyan teaches wherein the operations further comprise: determining a correlation between the first audio segment and a particular corresponding audio segment from the second plurality of audio segments based on the corresponding correlation score ([0076] “Train process: The systems and methods take a 8 kHz .wav file and calculate its power spectrum and phase of overlapped frames. Overlapping frames allow keeping correlation between neighboring frames”).

Regarding claim 18, Sargsyan in view of Liu teaches all of the limitations of claim 17, upon which claim 18 depends. 
Additionally, Sargsyan teaches wherein the classifying the audio data is further based on classification of the event ([0062] “Then, the systems and methods take the logarithm and obtain the clean speech features. Further, the systems and methods use the ratio mask to get a noise model (approximation of noise LPS features) for the previous frame”; [0076] “Train process: The systems and methods take a 8 kHz .wav file and calculate its power spectrum and phase of overlapped frames. Overlapping frames allow keeping correlation between neighboring frames”).

Regarding claim 19, Sargsyan in view of Liu teaches all of the limitations of claim 15, upon which claim 19 depends. 
Additionally, Sargsyan teaches wherein the operations further comprise: incorporating corresponding temporal information into each audio feature in the first plurality of audio features based on one or more other audio features in the first plurality of audio features ([0040] “For noise suppression models, post processing tools include a moving average rescaling.  The systems and methods compute the average energy of a signal, which changes in time…”; [0078] “Inference (test process): The described systems and methods construct a wideband audio signal…”).

Claims 11-12 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Sargsyan in view of Liu, as shown above in claim 1, in further view of Valin et al. US 11521637 B1 (hereinafter Valin).

Regarding claim 11, Sargsyan in view of Liu teaches all of the limitations of claim 8, upon which claim 11 depends.  
Sargsyan in view of Liu fails to teach incorporating, using a gated recurrent unit (GRU), temporal information into the first plurality of audio features.
However, Valin teaches incorporating, using a gated recurrent unit (GRU), temporal information into the first plurality of audio features (FIG. 5, [Column 8, line 49-54] “The model 427 may receive features f as discussed above with regard to FIG. 4 at a 128 fully connected layer 511, which may then pass to two convolutional layers, a 512, 1×5 convolutional layer 513 followed by a 512, 1×3 convolutional layer 515, and 512 gated recurrent unit (GRU) layers 517, 519, 521”).
	Sargsyan, Liu, and Valin are considered analogous to the claimed invention because all are in the same field of audio enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech enhancement and noise suppression techniques of Sargsyan in view of Liu with the technique of using a GRU to incorporate features taught by Valin in order to improve post-filtering for ratio masks as a part of audio enhancement (see Valin [Abstract]).

	Regarding claim 12, Sargsyan in view of Liu teaches all of the limitations of claim 8, upon which claim 12 depends.  
Sargsyan in view of Liu fails to teach wherein the first plurality of audio features comprises at least one of a word feature, a sentiment feature, or a tone feature.
However, Valin teaches wherein the first plurality of audio features comprises at least one of a word feature, a sentiment feature, or a tone feature (FIG. 4, [Column 3, line 52-56] “various other features of audio data 102 determined from signal deconstruction and analysis 110 may also be provided for signal reconstruction 130 (e.g., as illustrated in FIG. 4 below)”; [Column 8, line 21-23] “Feature extraction 425 may provide a feature set f for determining the ideal ratio mask of spectrum bands at deep neural network model 427”).
Sargsyan, Liu, and Valin are considered analogous to the claimed invention because all are in the same field of audio enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech enhancement and noise suppression techniques of Sargsyan in view of Liu with the technique of incorporating various audio features taught by Valin in order to improve post-filtering for ratio masks as a part of audio enhancement (see Valin [Abstract]).

	Regarding claim 20, Sargsyan in view of Liu teaches all of the limitations of claim 15, upon which claim 20 depends.  
Sargsyan in view of Liu fails to teach wherein the first plurality of audio features comprises at least one of a text feature, a sentiment feature, or a tone feature.
However, Valin teaches wherein the first plurality of audio features comprises at least one of a text feature, a sentiment feature, or a tone feature (FIG. 4, [Column 3, line 52-56] “various other features of audio data 102 determined from signal deconstruction and analysis 110 may also be provided for signal reconstruction 130 (e.g., as illustrated in FIG. 4 below)”; [Column 8, line 21-23] “Feature extraction 425 may provide a feature set f for determining the ideal ratio mask of spectrum bands at deep neural network model 427”).
	Sargsyan, Liu, and Valin are considered analogous to the claimed invention because all are in the same field of audio enhancement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech enhancement and noise suppression techniques of Sargsyan in view of Liu with the technique of incorporating various audio features taught by Valin in order to improve post-filtering for ratio masks as a part of audio enhancement (see Valin [Abstract]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Gay et al. (US 20110288858 A1) teaches a signal processing apparatus, system and software product for audio modification/substitution of a background noise generated during an event including, but not limited to, substituting or partially substituting a noise signal from one or more microphones by a pre-recorded noise, and/or selecting one or more noise signals from a plurality of microphones for further processing in real-time or near real-time broadcasting.
Scheuregger et al. (US 20240249715 A1) teaches systems and techniques for dynamically augmenting voice content with audio content. An example technique includes obtaining, via at least one microphone communicatively coupled to a loudspeaker device, voice content within an environment. Text content corresponding to the voice content is determined. At least one audio content is determined, based at least in part on the text content. The voice content within the environment is dynamically augmented with the at least one audio content. The augmenting of the voice content includes outputting, via a transducer of the loudspeaker device, the at least one audio content in the environment as the voice content is output in the environment.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
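
The date arithmetic in the preceding paragraph can be sketched as follows. This is a minimal illustration and not a docketing tool: it assumes the Apr 16, 2026 mailing date listed in the prosecution timeline on this page, uses hypothetical helper names (`add_months`, `reply_deadlines`), and ignores both the advisory-action exception described above and any adjustment for periods ending on a weekend or federal holiday.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Same day-of-month `months` later, clamped to the end of short months."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reply_deadlines(mailing_date):
    """Key dates measured from the mailing date of a final action (simplified)."""
    return {
        # A first reply filed by this date can trigger the advisory-action rule.
        "two_month_mark": add_months(mailing_date, 2),
        # End of the three-month shortened statutory period.
        "shortened_statutory_period": add_months(mailing_date, 3),
        # Six-month outer limit for reply.
        "statutory_maximum": add_months(mailing_date, 6),
    }


# Mailing date assumed from the timeline entry for the final rejection.
deadlines = reply_deadlines(date(2026, 4, 16))
for label, when in deadlines.items():
    print(label, when.isoformat())
```

Under those assumptions the three dates are June 16, July 16, and October 16, 2026. Replies filed between the three-month and six-month dates would additionally require an extension under 37 CFR 1.136(a).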
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ZEESHAN SHAIKH whose telephone number is (703)756-1730. The examiner can normally be reached Monday-Friday 7:30AM-5:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ZEESHAN MAHMOOD SHAIKH/ Examiner, Art Unit 2658

/RICHEMOND DORVIL/ Supervisory Patent Examiner, Art Unit 2658
Prosecution Timeline

Dec 15, 2023: Application Filed
Nov 10, 2025: Non-Final Rejection mailed (§101, §103)
Dec 24, 2025: Interview Requested
Jan 13, 2026: Examiner Interview Summary
Jan 13, 2026: Applicant Interview (Telephonic)
Feb 09, 2026: Response Filed
Apr 16, 2026: Final Rejection mailed (§101, §103)
May 15, 2026: Interview Requested
Precedent Cases

Applications granted by this same examiner with similar technology

17/974,851 (Patent 12633299): LINEAR PREDICTION CODING PARAMETER CODING METHOD AND CODING APPARATUS. 3y 6m to grant; granted May 19, 2026.
17/992,340 (Patent 12579373): SYSTEM AND METHOD FOR SYNTHETIC TEXT GENERATION TO SOLVE CLASS IMBALANCE IN COMPLAINT IDENTIFICATION. 3y 3m to grant; granted Mar 17, 2026.
17/915,465 (Patent 12555575): Wakeup Indicator Monitoring Method, Apparatus and Electronic Device. 3y 4m to grant; granted Feb 17, 2026.
17/682,177 (Patent 12518090): LOGICAL ROLE DETERMINATION OF CLAUSES IN CONDITIONAL CONSTRUCTIONS OF NATURAL LANGUAGE. 3y 10m to grant; granted Jan 06, 2026.
17/820,285 (Patent 12511318): MULTI-SYSTEM-BASED INTELLIGENT QUESTION ANSWERING METHOD AND APPARATUS, AND DEVICE. 3y 4m to grant; granted Dec 30, 2025.

Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 53%
With Interview (+52.8%): 99%
Median Time to Grant: 3y 1m (~8m remaining)
PTA Risk: Moderate

Based on 34 resolved cases by this examiner. Grant probability derived from career allowance rate.