DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on January 9, 2026 has been entered.
Response to Arguments
Applicants argue that the cited prior art fails to teach generating, based on the one or more style embeddings and the one or more linguistic embeddings, one or more dependency embeddings representing dependencies between the one or more style embeddings and the one or more linguistic embeddings. Applicants' arguments are persuasive with respect to the previously cited art, but are moot in view of the new grounds of rejection set forth below.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5 and 13-20 are rejected under 35 U.S.C. 103 as being unpatentable over Altaf et al. (PGPUB 2024/0363099), hereinafter referenced as Altaf, in view of Shekhar et al. (PGPUB 2022/0228367), hereinafter referenced as Shekhar.
Regarding claims 1 and 19-20, Altaf discloses a method, a system, and a medium, hereinafter referenced collectively as a method, for classifying audio data, the method comprising:
inputting the audio data into a trained machine-learning model, wherein the trained machine-learning model (trained machine learning; p. 0041, 0138-0139) is configured to:
generate, using a style encoder of the machine-learning model, one or more style embeddings (style and tone) representing nonverbal characteristics of the audio data (text; p. 0178, 0196);
generate, using a linguistic encoder of the machine-learning model (encoder), one or more linguistic embeddings (semantic) representing textual content (textual content) of the audio data (p. 0165, 0196-0197);
generate one or more dependency embeddings representing dependencies between the one or more style embeddings and the one or more linguistic embeddings (embeddings and classification; p. 0179-0180, 0237-0242);
inputting the one or more dependency embeddings into a classification head of the machine-learning model (trained to classify the audio data as real or fake; p. 0103-0107, 0172-0181, 0209-0220); and
obtaining, from the trained machine-learning model, a classification result of whether the audio data is real or fake (trained to classify the audio data as real or fake; p. 0103-0107, 0172-0181, 0209-0220). Altaf, however, does not specifically teach generating, based on the one or more style embeddings and the one or more linguistic embeddings, one or more dependency embeddings representing dependencies between nonverbal speech characteristics and verbal speech characteristics.
Shekhar discloses a method comprising generating, based on the one or more style embeddings (style features such as pitch, emotion, etc.; p. 0019) and the one or more linguistic embeddings, one or more dependency embeddings representing dependencies between nonverbal speech characteristics and verbal speech characteristics (conveyed by the audio generated by text; p. 0019, 0103-0111), to more accurately capture the expressiveness of an input text.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of Altaf to generate the dependency embeddings from the style and linguistic embeddings as taught by Shekhar, to increase efficiency and more accurately capture the expressiveness of an input text.
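As an illustration only, and not as a characterization of Altaf or Shekhar or of the claims as filed, the following minimal sketch shows a pipeline of the kind recited in claim 1: separate style and linguistic encoders whose outputs are combined into dependency embeddings and passed to a real/fake classification head. The framework (PyTorch), module choices, names, and dimensions are all hypothetical assumptions.

    # Illustrative sketch only; all module names and dimensions are hypothetical.
    import torch
    import torch.nn as nn

    class DependencyClassifier(nn.Module):
        """Hypothetical pipeline: style encoder + linguistic encoder ->
        dependency embeddings -> classification head (real vs. fake)."""

        def __init__(self, feat_dim=80, embed_dim=128):
            super().__init__()
            # Style encoder: maps acoustic frames to nonverbal-style embeddings.
            self.style_encoder = nn.GRU(feat_dim, embed_dim, batch_first=True)
            # Linguistic encoder: maps the same frames to content embeddings.
            self.linguistic_encoder = nn.GRU(feat_dim, embed_dim, batch_first=True)
            # Dependency module: models interactions between the two embedding sets.
            self.dependency = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
            # Classification head: outputs logits for {real, fake}.
            self.head = nn.Linear(embed_dim, 2)

        def forward(self, audio_feats):
            style, _ = self.style_encoder(audio_feats)             # (B, T, D)
            linguistic, _ = self.linguistic_encoder(audio_feats)   # (B, T, D)
            # Dependency embeddings: style attends to linguistic content.
            dep, _ = self.dependency(style, linguistic, linguistic)
            return self.head(dep.mean(dim=1))                      # (B, 2) logits

    # Example: classify a batch of 3 utterances of 200 frames of 80-dim features.
    logits = DependencyClassifier()(torch.randn(3, 200, 80))
    print(logits.shape)  # torch.Size([3, 2])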
Regarding claim 2, Altaf discloses a method wherein the audio data comprises real human speech, synthetic human speech, or both real human speech and synthetic human speech (machine or human; p. 0020, 0130, 0179-0185).
Regarding claim 3, Altaf discloses a method wherein the one or more machine learning models have been trained using bona fide audio data to learn dependencies between nonverbal characteristics and textual content in real human speech (bona fide enrollment audio signals; p. 0121, 0172-0181, 0209-0220).
Regarding claim 4, Altaf discloses a method wherein determining a first subset of the one or more dependency embeddings comprises:
inputting the one or more style embeddings into a style compressor (style embedding; p. 0178-0179); and
compressing the one or more style embeddings to create one or more style dependency embeddings (compress; p. 0104-0105).
Regarding claim 5, Altaf discloses a method wherein determining a second subset of the one or more dependency embeddings comprises:
inputting the one or more linguistic embeddings into a linguistic compressor (grammar; p. 0178-0179); and
compressing the one or more linguistic embeddings to create one or more linguistic dependency embeddings (compress; p. 0104-0105).
Regarding claim 13, Altaf discloses a method wherein the one or more style embeddings represent one or more attributes selected from the group comprising: speaker identity (individual identity; p. 0104), gender, emotion (p. 0174-0176), accent (p. 0174-0176), tone (p. 0178), speech rate (speech rate; p. 0284), health state, age, vocal pitch (p. 0174-0176), vocal intensity (p. 0174, 0179, 0199), and cognitive state (p. 0174-0176).
Regarding claim 14, Altaf discloses a method wherein the classification head has been trained to classify audio as real or fake via supervised learning using labeled audio data (supervised learning; p. 0140).
Regarding claim 15, Altaf discloses a method wherein the style compressor and the linguistic compressor are trained in a first training phase using only bona fide audio data, and wherein the classification head is trained during a second training phase (second authentication) using labeled bona fide audio data and labeled fake audio data (label indicating human or machine speech; p. 0066-0071).
Regarding claim 16, Altaf discloses a method comprising:
permitting access to a computing resource or protected endpoint based on the classification result (p. 0116), wherein the classification result indicates that the audio is real (p. 0103-0107, 0172-0181, 0209-0220).
Regarding claim 17, Altaf discloses a method comprising:
restricting access to a computing resource or protected endpoint based on the classification result, wherein the classification result indicates that the audio is fake (authenticate/restrict; p. 0319-0320).
Regarding claim 18, Altaf discloses a method comprising:
displaying an alert via a user interface based on the classification result, wherein the classification result indicates that the audio is fake (display genuine or fake; p. 0173).
Claims 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Altaf in view of Shekhar, and further in view of Chen et al. (PGPUB 2024/0005905), hereinafter referenced as Chen.
Regarding claim 9, Altaf in view of Shekhar discloses a method as described above, but does not specifically teach a method comprising:
generating one or more supplementary style embeddings based on the one or more style embeddings, wherein the one or more supplementary style embeddings include information-rich portions of the input audio data.
Chen discloses a method comprising:
generating one or more supplementary style embeddings based on the one or more style embeddings, wherein the one or more supplementary style embeddings include information-rich portions of the input audio data (p. 0227), to make training simple and efficient.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method as described above with the supplementary style embeddings of Chen, to improve naturalness and emotion richness.
Regarding claim 10, it is interpreted and rejected for similar reasons as set forth above. In addition, Chen discloses a method comprising:
generating one or more supplementary linguistic embeddings based on the one or more linguistic embeddings, wherein the one or more supplementary linguistic embeddings include information-rich portions of the input audio data (p. 0180, 0227).
Claims 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Altaf in view of Shekhar and Chen, and further in view of Yan et al. (PGPUB 2025/0094718), hereinafter referenced as Yan.
Regarding claim 11, Altaf in view of Shekhar and Chen discloses a method as described above, but does not specifically disclose a method comprising concatenating the one or more supplementary style embeddings, one or more supplementary linguistic embeddings, one or more style dependency embeddings, and one or more linguistic dependency embeddings to one another and inputting the concatenated embeddings into the classifier module to condense embeddings.
Yan discloses a method of concatenating the one or more supplementary style embeddings, one or more supplementary linguistic embeddings, one or more style dependency embeddings, and one or more linguistic dependency embeddings to one another (p. 0080-0081); and
inputting the concatenated embeddings into the classifier module (p. 0050-0055), to condense embeddings.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method as described above with the embedding concatenation of Yan, to condense the embeddings and assist with improving the classification task.
Regarding claim 12, it is interpreted and rejected for similar reasons as set forth above. In addition, Yan discloses a method wherein the one or more supplementary style embeddings and one or more supplementary linguistic embeddings are generated using an attentive statistics pooling module (p. 0080) and a multi-layer perceptron module (p. 0066-0067).
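As an illustration only, and not as a characterization of Yan or of the claims as filed, the following minimal sketch shows one generic way an attentive statistics pooling module and a multi-layer perceptron can produce supplementary embeddings that are then concatenated with dependency embeddings and fed to a classifier module, as recited in claims 11-12. All names, dimensions, and the use of PyTorch are hypothetical assumptions; the dependency embeddings below are random stand-ins.

    # Illustrative sketch only; a generic reconstruction, not a specific cited module.
    import torch
    import torch.nn as nn

    class AttentiveStatsPooling(nn.Module):
        """Weighted mean and standard deviation over time, using learned attention."""
        def __init__(self, dim):
            super().__init__()
            self.attn = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

        def forward(self, x):                       # x: (B, T, D)
            w = torch.softmax(self.attn(x), dim=1)  # (B, T, 1) attention weights
            mean = (w * x).sum(dim=1)
            var = (w * (x - mean.unsqueeze(1)) ** 2).sum(dim=1)
            return torch.cat([mean, var.clamp(min=1e-8).sqrt()], dim=-1)  # (B, 2D)

    dim = 64
    pool = AttentiveStatsPooling(dim)
    mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    style_seq, ling_seq = torch.randn(2, 50, dim), torch.randn(2, 50, dim)
    supp_style, supp_ling = mlp(pool(style_seq)), mlp(pool(ling_seq))  # supplementary embeddings
    style_dep, ling_dep = torch.randn(2, dim), torch.randn(2, dim)     # stand-in dependency embeddings

    # Concatenate all four embedding sets and feed the result to a classifier module.
    classifier = nn.Linear(4 * dim, 2)
    logits = classifier(torch.cat([supp_style, supp_ling, style_dep, ling_dep], dim=-1))
    print(logits.shape)  # torch.Size([2, 2])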
Allowable Subject Matter
Claims 6-8 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAKIEDA R JACKSON whose telephone number is (571) 272-7619. The examiner can normally be reached Monday-Friday, 6:30 a.m. to 2:30 p.m.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Daniel Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JAKIEDA R JACKSON/Primary Examiner, Art Unit 2657