DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101.
Claims 1, 10 and 19: these claims are rejected under 35 U.S.C. § 101 because they are directed to an abstract idea. The claims recite data collection and analysis for classification, i.e., receiving an audio sample, converting it to text, aligning the text with phonemes, extracting/deriving vectors (frequency response vectors, classification space vectors), and then using those derived values to classify phonemes (synthetic vs. organic) and the overall audio sample. These steps amount to mathematical concepts and mental processes (e.g., vector extraction, transformation, and normalization; and deciding/classifying based on computed values), which fall within the mathematical concepts and mental processes groupings of abstract ideas.
The abstract idea is not integrated into a practical application because the recited apparatus elements (processor, memory, computer program code) function as generic computing components used as a tool to execute the abstract classification workflow, and the remaining limitations largely amount to insignificant pre- and post-solution activity (e.g., receiving speech/audio, converting it to text, and outputting a classification). The claims do not recite a specific improvement to computer functionality or to a particular technical field in a way that meaningfully limits the exception; instead, they broadly cover analyzing speech-related data and outputting a label. This is analogous to the line of cases holding that collecting information, analyzing it, and reporting/classifying the results is an abstract idea.
The claims do not recite “significantly more” (an inventive concept) because the additional elements beyond the abstract idea amount to generic computer implementation of the classification logic and routine data-processing operations described at a high level of generality. Merely requiring performance of the abstract idea on a generic processor/memory arrangement does not transform the claim into patent-eligible subject matter, and the claim does not add any unconventional technical implementation (e.g., a specific nonconventional architecture, specialized hardware, or a particular asserted improvement to the operation of the computer itself) that would supply an inventive concept.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the claims recite (i) mere instructions to implement the idea on a computer, and/or (ii) generic computer structure that serves to perform generic computer functions that are well-understood, routine, and conventional activities previously known to the pertinent industry. Viewed as a whole, these additional claim elements do not provide meaningful limitations that transform the abstract idea into a patent-eligible application such that the claims amount to significantly more than the abstract idea itself. Therefore, the claims are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter. Further, there is no improvement to the functioning of the computing device itself.
Dependent claims 2-9, 11-18 and 20 further recite an abstract idea performable by a human and do not amount to significantly more than the abstract idea, as they do not provide steps beyond what is conventionally known in audio processing (mathematical concepts and mental processes).
Claims 2, 11 and 20: these claims recite only generic computer components and do not integrate the abstract idea into a practical application or add significantly more.
Claims 3 and 12: these claims add only conventional mathematical signal processing performed in a generic processor/memory environment and therefore do not include additional elements that amount to significantly more than the abstract idea.
Claims 4 and 13: this additional limitation does not meaningfully limit the claims to a specific technical implementation beyond generic computation.
Claims 5 and 14: these claims merely apply mathematical processing to data using generic computing components without reciting a technological improvement.
Claims 6 and 15: these claims recite abstract mathematical operations performed on information.
Claims 7 and 16: these claims recite a well-known mathematical transform used to convert a time-domain signal into a frequency-domain representation and obtain a frequency response vector.
Claims 8 and 17: these claims recite the abstract classification concept of comparing computed values to a threshold and labeling outcomes as synthetic or organic, which is a mental process/evaluation rule that can be expressed as mathematics.
Claims 9 and 18: these claims recite a mathematical decision rule applied to classification results.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 7, 9-11, 16 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wei et al. (CN 111816203) in view of Khodabakhsh et al. ("Investigation of Synthetic Speech Detection Using Frame- and Segment-Specific Importance Weighting", Oct. 10, 2016).
Claims 1, 10 and 19,
Wei teaches an apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and computer program code configured to, with the processor ([Content of the Invention] processor; memory), cause the apparatus to at least:
receive an audio sample comprising speech ([step one, data preparation] 25380 audio);
convert the speech into text ([step one, data preparation] use a set of voice recognition system, identifying the content in the 25380 audio);
align the text with phonemes identified within the audio sample ([step one, data preparation] obtaining each phoneme in the audio and their starting time and so on information; extracting phoneme information from the audio after recognition (extracting the phoneme information in the audio through the tool of voice marking));
obtain, from the audio sample, a frequency response vector for each of the predetermined phonemes ([step one, data preparation] obtain the data of each frequency band on each frame of different phonemes);
transform the frequency response vector for each of the predetermined phonemes to a classification space vector for each of the predetermined phonemes having a magnitude ([step three, data analysis] [the technical solution…] through discrete cosine transform DCT to obtain new characteristic of inhibiting phoneme influence; Wei further ties the transformed characteristic to classification modeling (training the GMM of real voice and fraud voice by using the characteristic));
normalize the classification space vector for each of the predetermined phonemes ([step two, data analysis specifically comprises the following steps] normalizing the result; normalization at the phoneme/frequency band analysis stage (performing normalization processing to the obtained PF value));
The difference between the prior art and the claimed invention is that Wei does not explicitly teach filter the audio sample to only contain predetermined phonemes; identify each of the predetermined phonemes as one of synthetic or organic based on the classification space vector for each of the predetermined phonemes; and identify the audio sample as synthetic or organic based on identification of each of the predetermined phonemes as one of synthetic or organic.
Khodabakhsh teaches filter the audio sample to only contain predetermined phonemes ([III. Feature Grouping Methods] feature vectors that occur within a particular phoneme type in the utterance are grouped together);
identify each of the predetermined phonemes as one of synthetic or organic based on the classification space vector for each of the predetermined phonemes ([II. Synthetic Speech Detectors] [III. Feature Grouping Methods] in the phoneme-based approach, each phoneme constitutes a group (classification); log likelihood ratio (LLR) detection is done for each group of feature vectors; feature vectors that belong to the same phoneme or sound class constitute a group); and
identify the audio sample as synthetic or organic based on identification of each of the predetermined phonemes as one of synthetic or organic ([II. Synthetic Speech Detectors] utterance decision on the set of per-group (phoneme) scores (see equation 2); a final decision threshold (a hard threshold is used to compute the final decision; synthetic vs. natural speech)).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Wei with the teachings of Khodabakhsh by modifying the synthesized speech detection method based on phoneme-level analysis for inhibiting the influence of phonemes, as taught by Wei, to include filtering the audio sample to only contain predetermined phonemes; identifying each of the predetermined phonemes as one of synthetic or organic based on the classification space vector for each of the predetermined phonemes; and identifying the audio sample as synthetic or organic based on identification of each of the predetermined phonemes as one of synthetic or organic, as taught by Khodabakhsh, for the benefit of capturing distortions that were caused by the unknown systems (Khodabakhsh [Abstract]).
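For clarity of the record, the following minimal sketch illustrates the kind of phoneme-level workflow described by the combined teachings (filtering to predetermined phonemes, per-phoneme vector classification, and an utterance-level decision). It is an illustration only: the phoneme set, feature choices, scoring function, and thresholds are assumptions of this illustration, not the references' code or the claimed implementation.

```python
# Hypothetical sketch of a phoneme-level synthetic-speech detection workflow;
# all names, feature choices, and thresholds are illustrative assumptions.
import numpy as np
from scipy.fft import dct

PREDETERMINED_PHONEMES = {"s", "f", "t", "k", "n", "m"}  # assumed phoneme set

def phoneme_vector(frames: np.ndarray) -> np.ndarray:
    """Frequency response per frame, a DCT (loosely mirroring the cited Wei
    steps), then normalization; averaged into one vector per phoneme."""
    spectra = np.abs(np.fft.rfft(frames, axis=-1))          # frequency response
    feats = dct(np.log(spectra + 1e-10), axis=-1, norm="ortho")
    feats /= np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-10
    return feats.mean(axis=0)

def classify_sample(aligned, score_fn, phoneme_thr=0.0, pct_thr=0.05):
    """aligned: (phoneme_label, frames) pairs from forced alignment.
    score_fn: assumed per-phoneme scorer (e.g., an LLR from models trained
    on real vs. synthetic speech; positive taken here to mean organic)."""
    labels = []
    for phone, frames in aligned:
        if phone not in PREDETERMINED_PHONEMES:             # filtering step
            continue
        vec = phoneme_vector(frames)
        labels.append("organic" if score_fn(vec) > phoneme_thr else "synthetic")
    synth_frac = labels.count("synthetic") / max(len(labels), 1)
    return "synthetic" if synth_frac > pct_thr else "organic"
```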
Claims 2, 11 and 20,
Khodabakhsh further teaches the apparatus of claim 1, wherein the predetermined phonemes include fricative phonemes, plosive phonemes, and nasal phonemes ([Fig. 3] [III. Feature Grouping Methods] five sound classes are used: vowels, nasals, glides, stops, and rest, where the rest class contains all phonemes that do not belong to the other four classes; feature vectors that occur within a particular phoneme type in the utterance are grouped together; stop and fricative sounds).
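By way of illustration only, grouping feature vectors by sound class in the spirit of the cited passage might look as follows; the class inventory shown is an assumed, abbreviated example rather than the reference's full mapping.

```python
# Hypothetical grouping of per-frame feature vectors by sound class;
# the SOUND_CLASS map is an assumed, abbreviated example.
SOUND_CLASS = {
    "aa": "vowel", "iy": "vowel",
    "n": "nasal", "m": "nasal",
    "w": "glide", "y": "glide",
    "p": "stop", "t": "stop", "k": "stop",
}

def group_by_class(aligned_frames):
    """aligned_frames: iterable of (phoneme, feature_vector) pairs.
    Vectors whose phoneme maps to no listed class fall into 'rest'."""
    groups = {}
    for phone, vec in aligned_frames:
        groups.setdefault(SOUND_CLASS.get(phone, "rest"), []).append(vec)
    return groups
```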
Claims 7 and 16,
Wei further teaches the apparatus of claim 1, wherein causing the apparatus to obtain, from the audio sample, the frequency response vector for each of the predetermined phonemes comprises causing the apparatus to: apply a Discrete Fourier Transform to the audio sample to convert the audio sample from a time domain signal to a complex frequency domain ([step three, extracting characteristic] applying “short-time Fourier transform” to the framed/windowed speech signal (after framing, windowing and short-time Fourier transform)); and
obtain the frequency response vector for each of the predetermined phonemes in the complex frequency domain ([Step one, data preparation] Wei marks phonemes and obtains per-phoneme frequency band data (a frequency-domain vector) on frames corresponding to different phonemes (obtaining each phoneme and obtain the data of each frequency band on each frame of different phonemes)).
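A minimal sketch of the cited framing/windowing/transform sequence follows; the frame size, hop, and averaging across frames are assumed parameters, not values taken from the references.

```python
# Minimal sketch: framing, windowing, and the DFT yield a per-phoneme
# frequency response vector; frame_len and hop are assumed parameters.
import numpy as np

def frequency_response_vector(segment, frame_len=512, hop=256):
    n_frames = max(1 + (len(segment) - frame_len) // hop, 1)
    window = np.hanning(frame_len)
    frames = [segment[i * hop: i * hop + frame_len] for i in range(n_frames)]
    frames = [np.pad(f, (0, frame_len - len(f))) for f in frames]
    # complex frequency-domain representation of each frame
    spectra = np.fft.rfft(np.stack(frames) * window, axis=-1)
    # averaging the magnitudes gives one vector for the phoneme segment
    return np.abs(spectra).mean(axis=0)
```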
Claims 9 and 18,
Wei further teaches the apparatus of claim 1, wherein causing the apparatus to identify the audio sample as synthetic or organic based on identification of each of the predetermined phonemes as one of synthetic or organic comprises causing the apparatus to: identify the audio sample as synthetic in response to more than five percent of the predetermined phonemes being identified as synthetic ([Contents of the Invention] utterance-level classification via the "maximum likelihood ratio classification method to obtain the final result"; the five percent value is a user-defined parameter which can be changed/altered).
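The five-percent decision rule admits a compact illustration; the threshold is exposed as a parameter here to reflect that it is user-defined.

```python
# Illustrative decision rule for the claimed five-percent limitation;
# the threshold is a user-definable parameter.
def utterance_is_synthetic(phoneme_labels, threshold=0.05):
    """phoneme_labels: list of 'synthetic'/'organic' per-phoneme decisions."""
    if not phoneme_labels:
        return False
    frac = phoneme_labels.count("synthetic") / len(phoneme_labels)
    return frac > threshold  # more than 5% synthetic => sample is synthetic

# e.g., utterance_is_synthetic(["organic"] * 18 + ["synthetic"] * 2) -> True
```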
Claims 3-5 and 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Wei et al. (CN 111816203) in view of Khodabakhsh et al. ("Investigation of Synthetic Speech Detection Using Frame- and Segment-Specific Importance Weighting", Oct. 10, 2016) and further in view of Vaseghi ("Advanced Digital Signal Processing and Noise Reduction" pgs. 178-204; Dec. 2008).
Claims 3 and 12,
Wei in view of Khodabakhsh teaches all the limitations of claim 1. The difference between the prior art and the claimed invention is that neither Wei nor Khodabakhsh explicitly teaches wherein causing the apparatus to transform the frequency response vector for each of the predetermined phonemes to the classification space vector for each of the predetermined phonemes comprises fitting a Wiener filter to the frequency response vector for each of the predetermined phonemes.
Vaseghi teaches wherein causing the apparatus to transform the frequency response vector for each of the predetermined phonemes to the classification space vector for each of the predetermined phonemes comprises fitting a Wiener filter to the frequency response vector for each of the predetermined phonemes ([pgs. 191 & 194] [eq. 6.38 & 6.50] Wiener filtering as a frequency-domain transformation using the Wiener filter frequency response, X̂(f) = W(f)Y(f); obtaining the frequency-domain Wiener filter (eq. 6.50)).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Wei and Khodabakhsh with the teachings of Vaseghi by modifying the synthesized speech detection method based on phoneme-level analysis, as taught by Wei, to include wherein causing the apparatus to transform the frequency response vector for each of the predetermined phonemes to the classification space vector for each of the predetermined phonemes comprises fitting a Wiener filter to the frequency response vector for each of the predetermined phonemes, as taught by Vaseghi, for the benefit of forming the foundation of data-dependent linear least square error filters (Vaseghi [pg. 178]).
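For illustration, frequency-domain Wiener filtering of the form cited from Vaseghi, X̂(f) = W(f)Y(f), can be sketched as follows; the power-spectrum estimates are assumed inputs rather than Vaseghi's specific estimators.

```python
# Sketch of frequency-domain Wiener filtering, X_hat(f) = W(f) * Y(f),
# using the classic gain W(f) = P_xx(f) / (P_xx(f) + P_nn(f));
# the power-spectrum estimates are assumed inputs.
import numpy as np

def wiener_filter_freq(Y, P_signal, P_noise):
    """Y: complex spectrum of the observation; P_signal/P_noise: estimated
    signal and noise power spectra (same length as Y)."""
    W = P_signal / (P_signal + P_noise + 1e-12)  # Wiener gain, 0 <= W <= 1
    return W * Y                                 # X_hat(f) = W(f) Y(f)
```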
Claims 4 and 13,
Vaseghi further teaches the apparatus of claim 3, wherein the Wiener filter computes a statistical estimation of the frequency response vector for each of the predetermined phonemes as an unknown signal using a related known signal ([6.1 Wiener Filters: Least Square Error Estimation] the filter takes as input a signal y(m) and produces an output signal x̂(m), where x̂(m) is the least mean square error estimate of a desired or target signal x(m)).
Claims 5 and 14,
Vaseghi further teaches the apparatus of claim 4, wherein the Wiener filter attempts to find an ideal linear transformation mapping the unknown signal to the related known signal ([pgs. 178 & 193] [eq. 6.48] Wiener filter coefficients are selected to optimize a linear mapping (minimize the average squared distance); the resulting Wiener solution is an "optimal linear filter").
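A brief sketch of this least-square-error view, in which the filter coefficients solve the Wiener-Hopf normal equations mapping the observed signal to the target, follows; the sample correlation estimates used here are standard assumptions, not Vaseghi's exact formulation.

```python
# Sketch: Wiener filter coefficients w solve R w = p, where R is the
# autocorrelation (Toeplitz) matrix of the input and p is the input-target
# cross-correlation, yielding the optimal linear mapping from y(m) to x(m).
import numpy as np
from scipy.linalg import solve_toeplitz

def wiener_coefficients(y, x, order=8):
    """Least-squares linear filter mapping observed y(m) to target x(m)."""
    n = len(y)
    r = np.array([np.dot(y[: n - k], y[k:]) for k in range(order)]) / n
    p = np.array([np.dot(x[k:], y[: n - k]) for k in range(order)]) / n
    return solve_toeplitz(r, p)  # solves the Wiener-Hopf normal equations
```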
Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Wei et al. (CN 111816203) in view of Khodabakhsh et al. ("Investigation of Synthetic Speech Detection Using Frame- and Segment-Specific Importance Weighting", Oct. 10, 2016) and further in view of De Leon et al. (US 9,865,253).
Claims 8 and 17,
Wei in view of Khodabakhsh teaches all the limitations of claim 1. The difference between the prior art and the claimed invention is that neither Wei nor Khodabakhsh explicitly teaches compare the classification space vector for each of the predetermined phonemes to a threshold; and one of: determine that one of the predetermined phonemes is synthetic in response to the classification space vector for the one of the predetermined phonemes failing to satisfy a threshold; or determine that the one of the predetermined phonemes is organic in response to the classification space vector for the one of the predetermined phonemes satisfying the threshold.
De Leon teaches compare the classification space vector for each of the predetermined phonemes to a threshold ([col. 8 lines 1-31] a threshold classifier using feature vectors extracted at the phoneme level; statistics are compared against stored minima and segmented along phoneme boundaries; a feature vector is computed for each phoneme and compared to the minimums from the training); and one of:
determine that one of the predetermined phonemes is synthetic in response to the classification space vector for the one of the predetermined phonemes failing to satisfy a threshold; or determine that the one of the predetermined phonemes is organic in response to the classification space vector for the one of the predetermined phonemes satisfying the threshold ([col. 8 lines 15-31] if the test speaker's mean IQRs are greater than the training minimums, the test speaker is declared human; otherwise, synthetic).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Wei and Khodabakhsh with the teachings of De Leon by modifying the synthesized speech detection method based on phoneme-level analysis, as taught by Wei, to include comparing the classification space vector for each of the predetermined phonemes to a threshold; and one of: determining that one of the predetermined phonemes is synthetic in response to the classification space vector for the one of the predetermined phonemes failing to satisfy a threshold; or determining that the one of the predetermined phonemes is organic in response to the classification space vector for the one of the predetermined phonemes satisfying the threshold, as taught by De Leon, for the benefit of classifying the speech signal as human or synthetic based on the extracted features (De Leon [Abstract]).
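By way of illustration, the cited threshold test can be sketched as follows; the per-phoneme statistic and the stored training minimum are assumed inputs, not De Leon's specific features.

```python
# Illustrative threshold test in the style of the cited De Leon passage:
# a per-phoneme statistic is compared to a stored training minimum, and the
# phoneme is labeled organic only when it satisfies the threshold.
def label_phoneme(statistic, training_minimum):
    """Returns 'organic' if the statistic clears the stored minimum,
    'synthetic' otherwise (failing the threshold)."""
    return "organic" if statistic > training_minimum else "synthetic"

# e.g., label_phoneme(0.42, 0.30) -> 'organic'
#       label_phoneme(0.12, 0.30) -> 'synthetic'
```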
Allowable Subject Matter
Claim 6 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and if the rejection under 35 U.S.C. 101 (abstract idea) set forth above is overcome.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Dhamyal et al. (“Fake Audio Detection in Resource-constrained Settings using Microfeatures”; Sept. 2021) – Fake audio generation has undergone remarkable improvement with the advancement in deep neural network models. This has made it increasingly important to develop lightweight yet robust mechanisms for detecting fake audios, especially for resource-constrained settings such as on edge devices and embedded controllers as well as with low-resource languages. In this paper, we analyze two microfeatures: Voicing Onset Time (VOT) and coarticulation, to classify bonafide and synthesized audios. Using the ASVSpoof2019 LA dataset, we find that on average, VOT is higher in synthesized speech compared to bonafide speech and exhibits higher variance for multiple occurrences of the same stop consonants. Further, we observe that vowels in CVC form in bonafide speech have greater F1/F2 movement compared to similarly constrained vowels in synthesized speech. We also analyze the predictive power of VOT and coarticulation for detecting bonafide and synthesized speech and achieve equal error rates of 25.2% using VOT, 39.3% using coarticulation, and 23.5% using a fusion of both models. This is the first study analyzing VOT and coarticulation as features for fake audio detection. We suggest these microfeatures as standalone features for speaker-dependent forensics, voice biometrics, and for rapid pre-screening of suspicious audios, and as additional features in bigger feature sets for computationally intensive classifiers.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
SHREYANS A. PATEL
Primary Examiner
Art Unit 2653
/SHREYANS A PATEL/ Examiner, Art Unit 2659