DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on January 14, 2026 has been entered.
Response to Amendment
The amendment filed January 14, 2026 has been entered. Claims 1, 3, 8, 10, 15, and 17 have been amended. Claims 1-20 are pending and have been examined.
Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5, 8-12, and 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Gopala et al. (US Pat. Pub. No. 2021/0074305 A1, hereinafter Gopala), in view of Applebaum et al. (US Pat. Pub. No. 2005/0171774 A1, hereinafter Applebaum).
Regarding claim 1, Gopala discloses a method for detecting audio deepfakes through acoustic prosodic modeling, comprising: extracting, using a feature extraction technique to enable machine learning via a machine learning model that comprises multiple machine learning layers (Gopala, [0114]: “the audio analysis techniques may use one or more convolutional neural networks. A large convolutional neural network may include, e.g., 60 M parameters and 650,000 neurons. The convolutional neural network may include, e.g., eight learned layers with weights, including, e.g., five convolutional layers and three fully connected layers with a final 1000-way softmax or normalized exponential function that produces a distribution over the 1000 class labels. Some of the convolution layers may be followed by max-pooling layers.”), one or more prosodic features from an audio sample based on a prosodic feature group comprising at least a first feature and a second feature, the one or more prosodic features indicative of one or more prosodic characteristics associated with human speech (Gopala, [0071]: "identification engine 124 may perform analysis and classification of the audio content associated with the video. Notably, identification engine 124 may determine a representation of the audio content by performing a transformation on the audio content."; [0071]: "other transformations and/or representations may be used, such as audio feature-extraction techniques, including: pitch detection, tonality, harmonicity, spectral centroid, pitch contour, prosody analysis (e.g., pauses, disfluences), syntax analysis, lexicography analysis, principal component analysis, or another feature extraction technique that determines a group of basis features, at least a subset of which allow discrimination of fake or real audio content."); and classifying the audio sample as a deepfake audio sample or an organic audio sample by applying the machine learning model to the one or more prosodic features, wherein the machine learning model is configured as a classification-based detector for audio deepfakes (Gopala, [0073]: "identification engine 124 may classify, based at least in part on an output of the predetermined neural network, the audio content as being fake or real, where the fake audio content is, at least in part, computer-generated."). However, Gopala fails to expressly recite a prosodic feature group comprising at least a jitter feature and a shimmer feature, the one or more prosodic features indicative of one or more prosodic characteristics associated with human speech, the jitter feature indicative of vocal jitter associated with a frequency variation in the audio sample, and the shimmer feature indicative of vocal shimmer associated with amplitude variation in the audio sample.
Applebaum teaches a prosodic feature group comprising at least a jitter feature and a shimmer feature, the one or more prosodic features indicative of one or more prosodic characteristics associated with human speech, the jitter feature indicative of vocal jitter associated with a frequency variation in the audio sample, and the shimmer feature indicative of vocal shimmer associated with amplitude variation in the audio sample (Applebaum, [0023]: “Yet another set of glottal source parameters that may be extracted in accordance with the present invention, jitter and shimmer, is characteristic of the glottal folds of an individual. The vibration of the glottis is fairly consistent and periodic, however there is a chaotic element as the glottal folds come into physical contact. This causes slight perturbations in the pitch period and the pressure wave amplitude on a period to period basis. These are called, respectively, jitter and shimmer. Given a single extracted glottal pulse waveform, one can measure the period and amplitude. Then for a sequence of pulses, one can compute a variance about a moving average. Alternatively, another measure of jitter and shimmer can be computed as a ratio of autocorrelation coefficients A[n]/A[0], where n corresponds to the fundamental period.”; here, the period variance is seen as a jitter feature indicative of vocal jitter associated with a frequency variation due to the inherent relationship between the period and frequency of a wave.).
Gopala and Applebaum are analogous arts because they both belong to the same field of speaker authentication. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the fake audio content identification method of Gopala to incorporate the teachings of Applebaum to include jitter and shimmer as features in the prosodic feature group. Jitter and shimmer are speech variations caused by physical contact of a person’s glottal folds (Applebaum, [0023]), and as such are difficult for an imposter to fake (Applebaum, [0003]). Detecting jitter and shimmer allows the system to better differentiate between real and fake voices.
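For illustration only, and not as part of the claim mapping of record, Applebaum's description in [0023] (measure the period and amplitude of each glottal pulse, then compute a variance about a moving average) can be sketched as follows; the function names and the window size of 5 are hypothetical choices, not taken from the reference:

```python
def _var_about_moving_avg(values, window=5):
    # Mean squared deviation of each value from a centered moving average,
    # i.e., a "variance about a moving average" per Applebaum [0023].
    half = window // 2
    total = 0.0
    for i, v in enumerate(values):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        total += (v - sum(values[lo:hi]) / (hi - lo)) ** 2
    return total / len(values)

def jitter_shimmer(periods, amplitudes, window=5):
    # Jitter: perturbation of the pitch period (frequency variation).
    # Shimmer: perturbation of the pulse amplitude (amplitude variation).
    return (_var_about_moving_avg(periods, window),
            _var_about_moving_avg(amplitudes, window))
```

A perfectly periodic pulse train yields zero jitter and zero shimmer; period-to-period perturbations raise the respective measure.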
Regarding claim 2, the rejection of claim 1 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the classifying the audio sample comprises: identifying the audio sample as the deepfake audio sample in response to the one or more prosodic features of the audio sample failing to correspond to a predefined organic audio classification measure as determined by the machine learning model (Gopala, [0073]: "in some embodiments ‘classification’ may involve the use of a threshold (such as a value between 0 and 1, e.g., 0.5) that the output of the predetermined neural network is compared to in order to decide whether given audio content associated with a video is real or fake. Note that the audio content may be allegedly associated with the given individual and the threshold may correspond to the given individual"; [0078]: "the classification may be performed using a classifier or a regression model that was trained using a supervised learning technique (such as a support vector machine, a classification and regression tree, logistic regression, LASSO, linear regression and/or another linear or nonlinear supervised-learning technique) and a training dataset with additional (real and/or synthetic) audio content.").
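For illustration only, and not as part of the record, the thresholded decision described in Gopala [0073] (comparing the neural network's output, a value between 0 and 1, to a threshold such as 0.5) can be sketched as follows; the polarity of the score (higher meaning fake) is an assumption made here for concreteness:

```python
def classify_audio(score, threshold=0.5):
    # Gopala [0073]: compare the predetermined neural network's scalar
    # output (between 0 and 1) to a threshold, which may correspond to
    # the given individual, to decide whether the audio is real or fake.
    return "fake" if score >= threshold else "real"
```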
Regarding claim 3, the rejection of claim 1 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the prosodic feature group comprises a pitch feature, an intonation feature, a fundamental frequency feature, a rhythm feature, a stress feature, a harmonic-to-noise ratio feature, or one or more metrics features related to the one or more audio samples (Gopala, [0071]: "other transformations and/or representations may be used, such as audio feature-extraction techniques, including: pitch detection, tonality, harmonicity, spectral centroid, pitch contour, prosody analysis (e.g., pauses, disfluences), syntax analysis, lexicography analysis, principal component analysis, or another feature extraction technique that determines a group of basis features, at least a subset of which allow discrimination of fake or real audio content.").
Regarding claim 4, the rejection of claim 1 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the machine learning model is a neural network model (Gopala, [0072]: "identification engine 124 may analyze the representation using a predetermined neural network (such as a convolutional neural network, a recurrent neural network, one or more multi-layer perceptrons, a combination of the neural networks, or, more generally, a neural network that is trained to discriminate between fake and real audio content).").
Regarding claim 5, the rejection of claim 1 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the machine learning model is a multilayer perceptron (MLP) model (Gopala, [0072]: "identification engine 124 may analyze the representation using a predetermined neural network (such as a convolutional neural network, a recurrent neural network, one or more multi-layer perceptrons, a combination of the neural networks, or, more generally, a neural network that is trained to discriminate between fake and real audio content).").
Regarding claim 8, Gopala discloses an apparatus for detecting audio deepfakes through acoustic prosodic modeling, the apparatus comprising at least one processor and at least one memory including program code, the at least one memory and the program code configured to, with the at least one processor, cause the apparatus to at least (Gopala, [0005]: “This computer system may include: a computation device (such as a processor); and memory that stores program instructions that are executed by the computation device.”): extract, using a feature extraction technique to enable machine learning via a machine learning model that comprises multiple machine learning layers (Gopala, [0114]: “the audio analysis techniques may use one or more convolutional neural networks. A large convolutional neural network may include, e.g., 60 M parameters and 650,000 neurons. The convolutional neural network may include, e.g., eight learned layers with weights, including, e.g., five convolutional layers and three fully connected layers with a final 1000-way softmax or normalized exponential function that produces a distribution over the 1000 class labels. Some of the convolution layers may be followed by max-pooling layers.”), one or more prosodic features from an audio sample based on a prosodic feature group comprising at least a first feature and a second feature, the one or more prosodic features indicative of one or more prosodic characteristics associated with human speech (Gopala, [0071]: "identification engine 124 may perform analysis and classification of the audio content associated with the video. Notably, identification engine 124 may determine a representation of the audio content by performing a transformation on the audio content."; [0071]: "other transformations and/or representations may be used, such as audio feature-extraction techniques, including: pitch detection, tonality, harmonicity, spectral centroid, pitch contour, prosody analysis (e.g., pauses, disfluences), syntax analysis, lexicography analysis, principal component analysis, or another feature extraction technique that determines a group of basis features, at least a subset of which allow discrimination of fake or real audio content."); and classify the audio sample as a deepfake audio sample or an organic audio sample by applying the machine learning model to the one or more prosodic features, wherein the machine learning model is configured as a classification-based detector for audio deepfakes (Gopala, [0073]: "identification engine 124 may classify, based at least in part on an output of the predetermined neural network, the audio content as being fake or real, where the fake audio content is, at least in part, computer-generated."). However, Gopala fails to expressly recite a prosodic feature group comprising at least a jitter feature and a shimmer feature, the one or more prosodic features indicative of one or more prosodic characteristics associated with human speech, the jitter feature indicative of vocal jitter associated with a frequency variation in the audio sample, and the shimmer feature indicative of vocal shimmer associated with amplitude variation in the audio sample.
Applebaum teaches a prosodic feature group comprising at least a jitter feature and a shimmer feature, the one or more prosodic features indicative of one or more prosodic characteristics associated with human speech, the jitter feature indicative of vocal jitter associated with a frequency variation in the audio sample, and the shimmer feature indicative of vocal shimmer associated with amplitude variation in the audio sample (Applebaum, [0023]: “Yet another set of glottal source parameters that may be extracted in accordance with the present invention, jitter and shimmer, is characteristic of the glottal folds of an individual. The vibration of the glottis is fairly consistent and periodic, however there is a chaotic element as the glottal folds come into physical contact. This causes slight perturbations in the pitch period and the pressure wave amplitude on a period to period basis. These are called, respectively, jitter and shimmer. Given a single extracted glottal pulse waveform, one can measure the period and amplitude. Then for a sequence of pulses, one can compute a variance about a moving average. Alternatively, another measure of jitter and shimmer can be computed as a ratio of autocorrelation coefficients A[n]/A[0], where n corresponds to the fundamental period.”; here, the period variance is seen as a jitter feature indicative of vocal jitter associated with a frequency variation due to the inherent relationship between the period and frequency of a wave.).
Gopala and Applebaum are analogous arts because they both belong to the same field of speaker authentication. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the fake audio content identification method of Gopala to incorporate the teachings of Applebaum to include jitter and shimmer as features in the prosodic feature group. Jitter and shimmer are speech variations caused by physical contact of a person’s glottal folds (Applebaum, [0023]), and as such are difficult for an imposter to fake (Applebaum, [0003]). Detecting jitter and shimmer allows the system to better differentiate between real and fake voices.
Regarding claim 9, the rejection of claim 8 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the at least one memory and the program code are configured to, with the at least one processor, further cause the apparatus to at least: identify the audio sample as the deepfake audio sample in response to the one or more prosodic features of the audio sample failing to correspond to a predefined organic audio classification measure as determined by the machine learning model (Gopala, [0073]: "in some embodiments ‘classification’ may involve the use of a threshold (such as a value between 0 and 1, e.g., 0.5) that the output of the predetermined neural network is compared to in order to decide whether given audio content associated with a video is real or fake. Note that the audio content may be allegedly associated with the given individual and the threshold may correspond to the given individual"; [0078]: "the classification may be performed using a classifier or a regression model that was trained using a supervised learning technique (such as a support vector machine, a classification and regression tree, logistic regression, LASSO, linear regression and/or another linear or nonlinear supervised-learning technique) and a training dataset with additional (real and/or synthetic) audio content.").
Regarding claim 10, the rejection of claim 8 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the prosodic feature group comprises a pitch feature, an intonation feature, a fundamental frequency feature, a rhythm feature, a stress feature, a harmonic-to-noise ratio feature, or one or more metrics features related to the one or more audio samples (Gopala, [0071]: "other transformations and/or representations may be used, such as audio feature-extraction techniques, including: pitch detection, tonality, harmonicity, spectral centroid, pitch contour, prosody analysis (e.g., pauses, disfluences), syntax analysis, lexicography analysis, principal component analysis, or another feature extraction technique that determines a group of basis features, at least a subset of which allow discrimination of fake or real audio content.").
Regarding claim 11, the rejection of claim 8 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the machine learning model is a neural network model (Gopala, [0072]: "identification engine 124 may analyze the representation using a predetermined neural network (such as a convolutional neural network, a recurrent neural network, one or more multi-layer perceptrons, a combination of the neural networks, or, more generally, a neural network that is trained to discriminate between fake and real audio content).").
Regarding claim 12, the rejection of claim 8 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the machine learning model is a multilayer perceptron (MLP) model (Gopala, [0072]: "identification engine 124 may analyze the representation using a predetermined neural network (such as a convolutional neural network, a recurrent neural network, one or more multi-layer perceptrons, a combination of the neural networks, or, more generally, a neural network that is trained to discriminate between fake and real audio content).").
Regarding claim 15, Gopala discloses a non-transitory computer storage medium comprising instructions for detecting audio deepfakes through acoustic prosodic modeling, the instructions being configured to cause one or more processors to at least perform operations configured to (Gopala, [0005]: “This computer system may include: a computation device (such as a processor); and memory that stores program instructions that are executed by the computation device.”): extract, using a feature extraction technique to enable machine learning via a machine learning model that comprises multiple machine learning layers (Gopala, [0114]: “the audio analysis techniques may use one or more convolutional neural networks. A large convolutional neural network may include, e.g., 60 M parameters and 650,000 neurons. The convolutional neural network may include, e.g., eight learned layers with weights, including, e.g., five convolutional layers and three fully connected layers with a final 1000-way softmax or normalized exponential function that produces a distribution over the 1000 class labels. Some of the convolution layers may be followed by max-pooling layers.”), one or more prosodic features from an audio sample based on a prosodic feature group comprising at least a first feature and a second feature, the one or more prosodic features indicative of one or more prosodic characteristics associated with human speech (Gopala, [0071]: "identification engine 124 may perform analysis and classification of the audio content associated with the video. Notably, identification engine 124 may determine a representation of the audio content by performing a transformation on the audio content."; [0071]: "other transformations and/or representations may be used, such as audio feature-extraction techniques, including: pitch detection, tonality, harmonicity, spectral centroid, pitch contour, prosody analysis (e.g., pauses, disfluences), syntax analysis, lexicography analysis, principal component analysis, or another feature extraction technique that determines a group of basis features, at least a subset of which allow discrimination of fake or real audio content."); and classify the audio sample as a deepfake audio sample or an organic audio sample by applying the machine learning model to the one or more prosodic features, wherein the machine learning model is configured as a classification-based detector for audio deepfakes (Gopala, [0073]: "identification engine 124 may classify, based at least in part on an output of the predetermined neural network, the audio content as being fake or real, where the fake audio content is, at least in part, computer-generated."). However, Gopala fails to expressly recite a prosodic feature group comprising at least a jitter feature and a shimmer feature, the one or more prosodic features indicative of one or more prosodic characteristics associated with human speech, the jitter feature indicative of vocal jitter associated with a frequency variation in the audio sample, and the shimmer feature indicative of vocal shimmer associated with amplitude variation in the audio sample.
Applebaum teaches a prosodic feature group comprising at least a jitter feature and a shimmer feature, the one or more prosodic features indicative of one or more prosodic characteristics associated with human speech, the jitter feature indicative of vocal jitter associated with a frequency variation in the audio sample, and the shimmer feature indicative of vocal shimmer associated with amplitude variation in the audio sample (Applebaum, [0023]: “Yet another set of glottal source parameters that may be extracted in accordance with the present invention, jitter and shimmer, is characteristic of the glottal folds of an individual. The vibration of the glottis is fairly consistent and periodic, however there is a chaotic element as the glottal folds come into physical contact. This causes slight perturbations in the pitch period and the pressure wave amplitude on a period to period basis. These are called, respectively, jitter and shimmer. Given a single extracted glottal pulse waveform, one can measure the period and amplitude. Then for a sequence of pulses, one can compute a variance about a moving average. Alternatively, another measure of jitter and shimmer can be computed as a ratio of autocorrelation coefficients A[n]/A[0], where n corresponds to the fundamental period.”; here, the period variance is seen as a jitter feature indicative of vocal jitter associated with a frequency variation due to the inherent relationship between the period and frequency of a wave.).
Gopala and Applebaum are analogous arts because they both belong to the same field of speaker authentication. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the fake audio content identification method of Gopala to incorporate the teachings of Applebaum to include jitter and shimmer as features in the prosodic feature group. Jitter and shimmer are speech variations caused by physical contact of a person’s glottal folds (Applebaum, [0023]), and as such are difficult for an imposter to fake (Applebaum, [0003]). Detecting jitter and shimmer allows the system to better differentiate between real and fake voices.
Regarding claim 16, the rejection of claim 15 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the operations are further configured to: identify the audio sample as the deepfake audio sample in response to the one or more prosodic features of the audio sample failing to correspond to a predefined organic audio classification measure as determined by the machine learning model (Gopala, [0073]: "in some embodiments ‘classification’ may involve the use of a threshold (such as a value between 0 and 1, e.g., 0.5) that the output of the predetermined neural network is compared to in order to decide whether given audio content associated with a video is real or fake. Note that the audio content may be allegedly associated with the given individual and the threshold may correspond to the given individual"; [0078]: "the classification may be performed using a classifier or a regression model that was trained using a supervised learning technique (such as a support vector machine, a classification and regression tree, logistic regression, LASSO, linear regression and/or another linear or nonlinear supervised-learning technique) and a training dataset with additional (real and/or synthetic) audio content.").
Regarding claim 17, the rejection of claim 15 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the prosodic feature group comprises a pitch feature, an intonation feature, a fundamental frequency feature, a shimmer feature, a rhythm feature, a stress feature, a harmonic-to-noise ratio feature, or one or more metrics features related to the one or more audio samples (Gopala, [0071]: "other transformations and/or representations may be used, such as audio feature-extraction techniques, including: pitch detection, tonality, harmonicity, spectral centroid, pitch contour, prosody analysis (e.g., pauses, disfluences), syntax analysis, lexicography analysis, principal component analysis, or another feature extraction technique that determines a group of basis features, at least a subset of which allow discrimination of fake or real audio content.").
Regarding claim 18, the rejection of claim 15 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. Gopala further discloses wherein the machine learning model is a multilayer perceptron (MLP) model (Gopala, [0072]: "identification engine 124 may analyze the representation using a predetermined neural network (such as a convolutional neural network, a recurrent neural network, one or more multi-layer perceptrons, a combination of the neural networks, or, more generally, a neural network that is trained to discriminate between fake and real audio content).").
Claims 6-7, 13-14, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Gopala, in view of Applebaum, as applied to claims 1-5, 8-12, and 15-18 above, and further in view of Wang et al. (US Pat. Pub. No. 2024/0005947 A1, hereinafter Wang).
Regarding claim 6, the rejection of claim 1 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. However, Gopala, in view of Applebaum, fails to expressly recite scaling the one or more prosodic features for processing by the machine learning model.
Wang teaches scaling the one or more prosodic features for processing by the machine learning model (Wang, [0045]: "a plurality of extracted acoustic features from the speech data 304 is passed through one or more DNNs 306."; [0051]: "the pooling/gradient reversal layers 308 are configured to perform attention pooling that gives each of a plurality of feature vectors a weight and generates an average vector, wherein the weighting determines the corresponding accuracy.").
Gopala, Applebaum, and Wang are analogous arts because they all belong to the same field of synthetic audio detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the fake audio content identification method of Gopala, as modified by the speaker authentication techniques of Applebaum, to incorporate the teachings of Wang to scale the extracted prosodic features. Scaling allows the features to be effectively averaged (Wang, [0051]) and combined without any one feature disproportionately affecting the result.
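For illustration only, and not as part of the record, the attention pooling described in Wang [0051] (giving each of a plurality of feature vectors a weight and generating an average vector) can be sketched as follows; the function name and inputs are hypothetical:

```python
def attention_pool(vectors, weights):
    # Wang [0051]: weight each feature vector and form the normalized
    # weighted-average vector, so no single feature vector can
    # disproportionately dominate the pooled result.
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * vec[d] for vec, w in zip(vectors, weights)) / total
            for d in range(dim)]
```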
Regarding claim 7, the rejection of claim 1 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. However, Gopala, in view of Applebaum, fails to expressly recite applying one or more hidden layers of the machine learning model to the one or more prosodic features.
Wang teaches applying one or more hidden layers of the machine learning model to the one or more prosodic features (Wang, [0034]: "As such, the DNN 204 can include a bottom input layer 222(1) and a top layer 222(L) (integer L>1), as well as multiple hidden layers, such as the multiple layers 222(2)-222(3).").
Gopala, Applebaum, and Wang are analogous arts because they all belong to the same field of synthetic audio detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the fake audio content identification method of Gopala, as modified by the speaker authentication techniques of Applebaum, to incorporate the teachings of Wang to apply one or more hidden layers to the prosodic features. The hidden layers can help improve the results of a deep neural network (Wang, [0019]), improving the performance of the overall system.
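For illustration only, and not as part of the record, applying a hidden layer of a multilayer network to a prosodic feature vector (consistent with the hidden layers 222(2)-222(3) of Wang [0034]) can be sketched as follows; the activation functions and names are assumptions, not taken from the references:

```python
import math

def mlp_score(features, w_hidden, b_hidden, w_out, b_out):
    # One hidden layer (ReLU) applied to the prosodic feature vector,
    # followed by a sigmoid output in [0, 1] suitable for comparison
    # against a real/fake decision threshold.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))
```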
Regarding claim 13, the rejection of claim 8 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. However, Gopala, in view of Applebaum, fails to expressly recite wherein the at least one memory and the program code are configured to, with the at least one processor, further cause the apparatus to at least: scale the one or more prosodic features for processing by the machine learning model.
Wang teaches wherein the at least one memory and the program code are configured to, with the at least one processor, further cause the apparatus to at least: scale the one or more prosodic features for processing by the machine learning model (Wang, [0045]: "a plurality of extracted acoustic features from the speech data 304 is passed through one or more DNNs 306."; [0051]: "the pooling/gradient reversal layers 308 are configured to perform attention pooling that gives each of a plurality of feature vectors a weight and generates an average vector, wherein the weighting determines the corresponding accuracy.").
Gopala, Applebaum, and Wang are analogous arts because they all belong to the same field of synthetic audio detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the fake audio content identification method of Gopala, as modified by the speaker authentication techniques of Applebaum, to incorporate the teachings of Wang to scale the extracted prosodic features. This allows the features to be effectively averaged (Wang, [0051]) and combined without any one feature disproportionately affecting the result.
Regarding claim 14, the rejection of claim 8 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. However, Gopala, in view of Applebaum, fails to expressly recite wherein the at least one memory and the program code are configured to, with the at least one processor, further cause the apparatus to at least: apply one or more hidden layers of the machine learning model to the one or more prosodic features.
Wang teaches wherein the at least one memory and the program code are configured to, with the at least one processor, further cause the apparatus to at least: apply one or more hidden layers of the machine learning model to the one or more prosodic features (Wang, [0034]: "As such, the DNN 204 can include a bottom input layer 222(1) and a top layer 222(L) (integer L>1), as well as multiple hidden layers, such as the multiple layers 222(2)-222(3).").
Gopala, Applebaum, and Wang are analogous arts because they all belong to the same field of synthetic audio detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the fake audio content identification method of Gopala, as modified by the speaker authentication techniques of Applebaum, to incorporate the teachings of Wang to apply one or more hidden layers to the prosodic features. The hidden layers can help improve the results of a deep neural network (Wang, [0019]). This improves the performance of the overall system.
Regarding claim 19, the rejection of claim 15 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. However, Gopala, in view of Applebaum, fails to expressly recite wherein the operations are further configured to: scale the one or more prosodic features for processing by the machine learning model.
Wang teaches wherein the operations are further configured to: scale the one or more prosodic features for processing by the machine learning model (Wang, [0045]: "a plurality of extracted acoustic features from the speech data 304 is passed through one or more DNNs 306."; [0051]: "the pooling/gradient reversal layers 308 are configured to perform attention pooling that gives each of a plurality of feature vectors a weight and generates an average vector, wherein the weighting determines the corresponding accuracy.").
Gopala, Applebaum, and Wang are analogous arts because they all belong to the same field of synthetic audio detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the fake audio content identification method of Gopala, as modified by the speaker authentication techniques of Applebaum, to incorporate the teachings of Wang to scale the extracted prosodic features. This allows the features to be effectively averaged (Wang, [0051]) and combined without any one feature disproportionately affecting the result.
Regarding claim 20, the rejection of claim 15 is incorporated. Gopala, in view of Applebaum, discloses all of the elements of the current invention as stated above. However, Gopala, in view of Applebaum, fails to expressly recite wherein the operations are further configured to: apply one or more hidden layers of the machine learning model to the one or more prosodic features.
Wang teaches wherein the operations are further configured to: apply one or more hidden layers of the machine learning model to the one or more prosodic features (Wang, [0034]: "As such, the DNN 204 can include a bottom input layer 222(1) and a top layer 222(L) (integer L>1), as well as multiple hidden layers, such as the multiple layers 222(2)-222(3).").
Gopala, Applebaum, and Wang are analogous arts because they all belong to the same field of synthetic audio detection. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the fake audio content identification method of Gopala, as modified by the speaker authentication techniques of Applebaum, to incorporate the teachings of Wang to apply one or more hidden layers to the prosodic features. The hidden layers can help improve the results of a deep neural network (Wang, [0019]). This improves the performance of the overall system.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TYLER J BECKER whose telephone number is (703)756-1271. The examiner can normally be reached M-Th, 7:15am-5:45pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Daniel Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TYLER BECKER/ Examiner, Art Unit 2657
/DANIEL C WASHBURN/ Supervisory Patent Examiner, Art Unit 2657