DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Response to Amendments and Arguments
Regarding the rejection under 35 U.S.C. §101, applicant amended claims 18-20 to exclude transitory-type media. The rejection under §101 has been withdrawn.
Regarding the rejections of the independent claims under 35 U.S.C. §102, applicant amended the independent claims by adding new limitations. Applicant argued (Remarks, page 9) that the previously cited references fail to teach the limitations newly added to independent claims 1, 10 and 18. Applicant further argued that the references cited for rejecting the dependent claims also fail to disclose the new limitations (Remarks, page 10).
Upon reviewing the references cited for rejecting the dependent claims, the examiner notes that the features of the added limitations are disclosed by some of those references.
For example, Zhang (“The PartialSpoof Database and Countermeasures for the Detection of Short Generated Audio Segments Embedded in a Speech Utterance”, a reference submitted by the applicant in an IDS, published on April 11, 2022) discloses detecting partially spoofed speech. In particular, Zhang discloses detecting partially spoofed speech in which a synthesized word “not” has been inserted using text-to-speech (TTS) (Zhang, section 1; see the illustration in Fig. 1).
Another previously cited reference (Rahman, “Detecting Synthetic Speech Manipulation in Real Audio Recordings”, published in September 2022, a reference submitted by the applicant in an IDS) also discloses the limitations newly added to the independent claims. For example, Rahman discloses detecting spoof attacks in audio clips that contain audio partially manipulated using synthesizers (Rahman, Abstract; section I, Introduction; section III, Experiments).
After performing an updated search, the examiner discovered several references that disclose the features defined by the added limitations. For example, Zhang (“Localizing Fake Segments in Speech”, published 2022) discloses detecting a fake segment injected using a speech synthesizer (Zhang, Fig. 3, replicated below).
[Zhang, Fig. 3, replicated (greyscale image)]
The examiner includes the newly discovered references on an attached PTO-892 form. The examiner rejects the amended independent claims by combining previously cited references, including a reference that was used for rejecting certain dependent claims. The arguments regarding the anticipation rejections under 35 U.S.C. §102 have been considered but are moot because they do not apply to the new ground of rejection necessitated by the amendments.
Claim Rejections - 35 USC § 103
Claims 1-4, 9-12, and 17-21 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (US PG Pub. 2021/0233541, referred to as Chen) in view of Zhang (“The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance”, published in April 2022, a reference submitted by the applicant in an IDS, referred to as Zhang).
Chen discloses a neural network-based spoofing detection technique for detecting deepfake speech generated using speech synthesis techniques (Chen, [0009-0011], [0036-0039], Fig. 6). Chen discloses generating feature embeddings (Chen, [0009], [0026], Fig. 6, #606, Fig. 7). Chen further discloses detecting whether an inbound call is from a genuine speaker or is spoofed using a speech synthesis technique (Chen, [0045], [0059-0061], [0070], Fig. 7, #718).
Zhang discloses detecting partially spoofed speech generated by inserting synthesized speech into genuine speech (Zhang, sections I and III, inserting a synthesized word “NOT” to change the meaning of the original utterance).
Regarding claims 1, 10 and 18, Chen discloses a method, a system, and a computer-readable storage medium for detecting synthetic speech in frames of an audio clip (Chen, [0025], [0107], Fig. 6, a computer-implemented method/system for detecting deepfake synthesized audio using a neural network), comprising:
processing, by a machine learning system, an audio clip to generate a plurality of speech artifact embeddings based on a plurality of synthetic speech artifact features (Chen, [0026-0028], [0037], [0053], Fig. 6, #606; generating spoofing embeddings from artifacts of audio/speech frames);
computing, by the machine learning system, one or more scores based on the plurality of speech artifact embeddings (Chen, [0010], [0045], Fig. 7, #718; calculating spoofing scores using neural network models);
determining, by the machine learning system, based on the one or more scores, whether one or more frames of the audio clip include synthetic speech (Chen, [0045], [0070], determining whether audio signals are genuine or spoofed from synthesis based on spoofing or similarity scores); and
outputting an indication of whether the one or more frames of the audio clip include synthetic speech (Chen, [0009], [0087], indicating whether audio is from genuine speaker or spoofed from synthesized speech).
Chen discloses detecting spoofed speech generated by speech synthesizers based on detected spoof characteristics (Chen, Summary of the Invention, Fig. 4). Chen does not explicitly disclose detecting partially spoofed speech. Therefore, Chen does not explicitly disclose the newly added limitations: “wherein the plurality of synthetic speech artifact features have been created by one or more synthetic speech generators injecting synthetic speech into the audio clip”.
Zhang discloses detecting partially spoofed speech generated by inserting words produced by a speech synthesizer to change the meaning of the original utterance (Zhang, section 1, Introduction; section III; and Fig. 1).
Because both Chen and Zhang deal with detecting spoofed audio using a neural network, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Chen’s teaching with Zhang’s teaching to detect partially spoofed speech and to obtain a training database by labelling which sections/frames are real speech and which sections/frames are spoofed speech from speech synthesis. One having ordinary skill in the art would have been motivated to make such a modification to improve the accuracy of spoof detection when used with speaker verification (Zhang, Abstract, Section V). In addition, all the claimed elements were known in the prior art, and one skilled in the art could have combined the elements as claimed by known methods; in the combination, each element merely would have performed the same function as it did separately. “A combination of familiar elements according to known methods is likely to be obvious when it does no more than yield predictable results.” KSR, 550 U.S. ___, 82 USPQ2d at 1395 (2007). One of ordinary skill in the art would have recognized that the results of the combination were predictable.
Regarding claims 2, 11 and 19, Chen in view of Zhang further discloses: extracting the plurality of synthetic speech artifact features from frames of the audio clip (Chen, [0009-0011], [0026], [0038], processing audio frames to extract artifacts and training neural network based spoofing audio detection), wherein the synthetic speech artifact features include at least ONE of artifacts, distortions, or degradations that are associated with one or more synthetic speech generators and that are included in the audio clip (Chen, [0026], [0037-0038], spoofing artifact features such as degradation, distortion in synthesized spoofing speech).
Regarding claims 3-4 and 12, the limitations recited in these dependent claims relate to preparing training data for training a neural network-based spoofed speech detection model. The claimed “determining one or more boundaries in the mapping … based on label” relates to labelling training data.
Chen discloses training a neural network for detecting synthesized/spoofed speech (Chen, [0038-0039], Fig. 2 and Fig. 3). Chen further discloses preparing training audio with labels (Chen, [0047], [0070], labelling speech as genuine or spoofed audio data). Although Chen implicitly discloses all features of dependent claims 3-4 and 12, the examiner further cites Zhang, which discloses more details about labelling training audio data to indicate which segments contain real speech and which contain spoofed synthesized speech (Zhang, section III, creating the partial spoof database, Fig. 1).
Regarding claims 9, 17 and 20, Chen in view of Zhang further discloses: wherein outputting the indication comprises: responsive to determining a score of the one or more scores satisfies a threshold, outputting an indication that a frame of the one or more frames that corresponds to the score includes synthetic speech (Chen, [0045], [0059-0061], [0087], comparing scores with spoofing threshold to determine whether audio is from genuine speaker or from spoofed synthesized speech).
Regarding claim 21, Chen in view of Zhang further discloses wherein the audio clip originally includes authentic speech, and wherein the one or more frames of the audio clip include the injected synthetic speech (Zhang, section I, Introduction, inserting a synthesized “NOT” by a speech synthesizer to change meaning of original utterance; See Fig. 1).
Claims 5-6 and 13-14 are rejected under 35 U.S.C. §103 as being unpatentable over Chen in view of Zhang, and further in view of Castan (“Speaker-targeted Synthetic Speech Detection”, published in June 2022, a reference submitted by the applicant in an IDS, referred to as Castan).
Regarding claims 5 and 13, Chen in view of Zhang discloses training a neural network model to detect spoofed audio (Chen, [0023], [0038-0039], Fig. 2 and Fig. 3). Chen further discloses labelling data to indicate which portions contain speech (Chen, [0047], [0069], [0092]). Chen does not explicitly disclose removing non-speech information from the training data.
Castan discloses a neural network structure for detecting spoofing speech in multimedia data (Castan, Abstract, xResNet-PLDA system). Castan further discloses discarding silence frames in audio data (Castan, Section 4.1).
Because Chen in view of Zhang and Castan all deal with detecting spoofed/synthesized audio using a neural network, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Chen’s teaching with Castan’s teaching to discard silence frames from speech data. One having ordinary skill in the art would have been motivated to make such a modification to reduce error and improve performance (Castan, section 5, using a network structure with xResNet and PLDA outperforms the baseline system for detecting fake speech). In addition, all the claimed elements were known in the prior art, and one skilled in the art could have combined the elements as claimed by known methods; in the combination, each element merely would have performed the same function as it did separately. “A combination of familiar elements according to known methods is likely to be obvious when it does no more than yield predictable results.” KSR, 550 U.S. ___, 82 USPQ2d at 1395 (2007).
Regarding claims 6 and 14, Chen in view of Zhang and Castan further discloses wherein computing the one or more scores comprises computing one or more log-likelihood ratios by at least comparing the plurality of speech artifact embeddings to a plurality of enrollment embeddings (Castan, section 2.2.2, “we applied a common and simple solution using a discriminatively trained affine transformation from scores to log-likelihood ratios (LLRs)”; see Fig. 2, comparing log-likelihood ratios), wherein each of the plurality of enrollment embeddings is associated with authentic speech (Chen, [0036], [0038-0039], Fig. 4; Castan, Section 3.3).
Claims 7-8 and 15-16 are rejected under 35 U.S.C. §103 as being unpatentable over Chen in view of Zhang, and further in view of Rahman (“Detecting Synthetic Speech Manipulation in Real Audio Recordings”, published in September 2022, a reference submitted by the applicant in an IDS, referred to as Rahman).
Regarding claims 7-8 and 15-16, Chen in view of Zhang discloses processing frames / segments to obtain scores (Chen, [0037], [0053], [0065]). Chen further discloses using likelihood as scores (Chen, [0009-0010]). Chen does not explicitly mention “segment scores”.
Rahman discloses a neural network structure for detecting synthetic speech and obtaining segment level scores and utterance scores (Rahman, Section II (D), utterance level scores are calculated from segment level scores).
Because both Chen and Rahman deal with detecting spoofed/synthesized audio using a neural network, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Chen’s teaching with Rahman’s teaching to calculate segment-level scores and utterance-level scores. One having ordinary skill in the art would have been motivated to make such a modification to reduce error and improve performance (Rahman, Section III, C, Results). In addition, all the claimed elements were known in the prior art, and one skilled in the art could have combined the elements as claimed by known methods; in the combination, each element merely would have performed the same function as it did separately. “A combination of familiar elements according to known methods is likely to be obvious when it does no more than yield predictable results.” KSR, 550 U.S. ___, 82 USPQ2d at 1395 (2007).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jialong He, whose telephone number is (571) 270-5359. The examiner can normally be reached on Monday – Friday, 8:00AM – 4:30PM, EST.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre Desir can be reached on (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JIALONG HE/Primary Examiner, Art Unit 2659