Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Claim Rejections - 35 USC § 103
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 103 that form the basis for the rejections under this section made in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Baeuml et al. (US 2023/0074406 A1) in view of Zheng et al. (CN 111611208 A, see attached IP.com translation).
Regarding Claims 1 and 14, Baeuml discloses a non-transitory computer-readable storage medium storing instructions that, when executed by a computer processor (¶112, Fig. 7, processor 714 executing software program instructions in RAM 730), cause the computer processor to perform operations comprising:
initializing a virtual experience for a user, the virtual experience including a virtual agent (¶35 and Fig. 6, automated assistant 115 can identify a computing device associated with a user of client device 110 to satisfy a spoken utterance by rendering responsive content at the client device);
receiving speech text from a constrained machine-learned language model configured to provide adaptive speech relating to the virtual experience (¶42, generating a set of assistant outputs that are predicted to be responsive to an assistant query included in a spoken utterance captured in a stream of audio data processed by ASR engine; ¶43, LLM engine uses one or more LLMs stored in model database 115A to process the set of assistant outputs to generate a set of modified assistant outputs; ¶81, one or more first layers of the one or more LLMs can be utilized in generating personality replies);
parsing the speech text into a plurality of speech units, wherein a speech unit is an atomic unit representative of natural breaks in human speech, to generate a first audio byte (¶44, TTS engine uses TTS model stored in machine learning model database 115A and textual data formulated by automated assistant 115 (i.e., modified assistant outputs) to generate synthesized speech audio data that includes computer generated synthesized speech; i.e., parsing textual data comprising modified assistant outputs to generate synthesized speech units constituting a first audio byte); and
generating a voiceover track for the virtual agent with the first audio byte (¶44, TTS engine generates any synthesized speech audio data in causing audio data to be provided for audible presentation to the user).
Baeuml does not disclose applying a hashing function to each speech unit to determine a corresponding hash.
Zheng discloses a device for acquiring a text comprising speech units to be subjected to voice synthesis (Abstract, Fig. 1, p. 9, step 101, acquiring a text to be subjected to voice synthesis):
apply a hashing function to each speech unit to determine a corresponding hash (p. 9, step 101, “performing HASH value calculation on the text to obtain a HASH value of the text”);
for each hash, querying a cache database to identify a cached hash that matches the hash queried against the cache database (p. 9, step 102, “taking the calculated HASH value as a file name, and inquiring whether a voice file with the same file name exists in a set voice file directory”);
responsive to identifying a matching hash to a first hash queried against the cache database, retrieving a first audio byte stored with the matching hash in the cache database (p. 9, step 103, “taking the voice files with the same file name as the voice files of the text”); and
generating a voiceover track for a virtual agent with the first audio byte (p. 9, “when a speech file with the same name as the HASH value of the text is found, the speech file is directly called and used as a speech synthesized file of the text, that is, the intelligent broadcast system directly outputs the speech file as a speech file to be subjected to speech synthesis”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply a hashing function to each speech unit to determine a corresponding hash, to identify a cached hash in a cache database matching the hash, and to retrieve a first audio byte stored with the matching hash in the cache database, in order to directly output the first audio byte / speech file where the speech synthesis operation does not need to be performed on the text (Zheng, p. 9, “…and the speech synthesis operation does not need to be performed on the text”).
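For illustration only, the hash-based audio caching taught by Zheng (steps 101-104), as applied to Baeuml's synthesized speech units, could be sketched as follows. All names (`synthesize`, `get_or_synthesize`), the choice of SHA-256, and the in-memory dictionary cache are hypothetical stand-ins, not disclosed by either reference:

```python
import hashlib

def synthesize(text: str) -> bytes:
    # Stand-in for a TTS engine (e.g., Baeuml's TTS engine of ¶44);
    # returns dummy audio bytes for demonstration.
    return b"AUDIO:" + text.encode("utf-8")

def get_or_synthesize(speech_unit: str, cache: dict) -> bytes:
    # Step 101 (Zheng): compute a hash of the text to serve as the
    # cache key / voice-file name.
    key = hashlib.sha256(speech_unit.encode("utf-8")).hexdigest()
    # Steps 102-103: query the cache; on a hit, reuse the stored audio
    # directly, so no synthesis needs to be performed on the text.
    if key in cache:
        return cache[key]
    # Step 104: on a miss, perform voice synthesis and store the result
    # under the calculated hash for future reuse.
    audio = synthesize(speech_unit)
    cache[key] = audio
    return audio
```

On a second request for the same speech unit, the audio is served from the cache rather than re-synthesized, which is the efficiency rationale Zheng states at p. 9.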
Regarding Claims 2 and 15, Baeuml discloses the operations further comprising:
generating a prompt for the constrained machine-learned language model specifying speech of the user and a request to infer a reply to the speech of the user (¶42, generate a set of assistant outputs that are predicted to be responsive to the assistant query); and
providing the prompt to the constrained machine-learned language model (¶43, the LLM engine processes the set of assistant outputs that are predicted to be responsive to the assistant query to generate a set of modified assistant outputs).
Regarding Claim 3, Baeuml discloses wherein generating the prompt comprises: generating the prompt further specifying prior conversations between the user and the virtual agent in a prior session of the virtual experience (¶43, generate the LLM outputs based on using one or more LLMs stored in the model database 115A, the set of assistant outputs, a context of dialog session in which the spoken utterance is received based on one or more contextual signals stored in context database 110A; per ¶50, contextual signals stored in context database 110A comprise historical dialog history of one or more prior dialog sessions between the user and the automated assistant 115).
Regarding Claims 4 and 16, Baeuml discloses the operations further comprising: receiving an audio signal captured by a microphone, the audio signal representing the speech of the user and applying a speech recognition algorithm to the audio signal to determine the speech of the user (¶42, ASR engine 130 uses ASR models stored in machine learning models database 115A to process a stream of audio data that captures spoken utterances and is generated by microphones of client device 110 to generate a stream of ASR output).
Regarding Claims 5 and 17, Baeuml discloses wherein the constrained machine-learned language model is trained to output adaptive speech in a constrained language space relevant to the virtual experience (¶81, one or more first layers of the one or more LLMs can be utilized in generating personality replies; per ¶63, a single LLM can be utilized to generate first LLM output using a first set of parameters that reflect a first personality (e.g., chef personality) and second LLM output using a second set of parameters reflecting a second personality (e.g., butler personality)).
Regarding Claims 6-7 and 18, Baeuml discloses determining a tonality for the speech text from the constrained machine-learned language model (¶14, when given modified assistant outputs (i.e., from the LLMs), the modified assistant outputs can reflect contextual personalities via prosodic properties of different personalities such as intonation),
As modified by Zheng, wherein applying the hashing function comprises applying the hashing function to each speech unit and the tonality to determine the corresponding hash (Zheng, p. 9, “step 101, acquiring a text to be subjected to voice synthesis, and performing HASH value calculation on the text to obtain a HASH value of the text”; see further p. 9, ¶2, as an implementation manner, at least one of the sound quality of the text may also be obtained and spliced with the text to form a spliced text and calculate the HASH value of the spliced text).
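For illustration only, Zheng's alternative implementation (p. 9, ¶2), in which an attribute such as tonality / sound quality is spliced with the text before the HASH value is calculated, could be sketched as below. The function name, delimiter, and hash algorithm are assumptions, not taught by Zheng:

```python
import hashlib

def hash_with_tonality(speech_unit: str, tonality: str) -> str:
    # Splice the tonality attribute with the text (Zheng's "spliced text"),
    # then calculate the HASH value of the spliced text, so the same words
    # spoken with different tonalities map to different cached voice files.
    spliced = f"{tonality}|{speech_unit}"  # delimiter is an assumption
    return hashlib.sha256(spliced.encode("utf-8")).hexdigest()
```

Under this sketch, identical text with distinct tonalities yields distinct hashes, so the cache can store a separate voice file per tonality.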
Regarding Claim 8, Baeuml as modified by Zheng discloses tagging the hashes corresponding to the plurality of speech units with a tag indicating the tonality (Zheng, p. 9, ¶14, “And according to the HASH HASH value of the text, performing multilevel hierarchical storage on attribute information of at least one of the information of the voice file corresponding to the text, the sound quality…”).
Regarding Claims 9-10 and 19, Baeuml discloses wherein parsing the speech text into the plurality of speech units comprises grouping one or more words from the speech text into a speech unit (¶43, using LLM outputs to generate the set of modified assistant outputs, each of the one or more LLM outputs can include a probability distribution over a sequence of one or more words and/or phrases across one or more vocabularies; ¶44, TTS engine uses TTS models to process textual data comprising modified assistant outputs to generate synthesized speech audio data; i.e., parsing textual data comprising modified assistant outputs to generate synthesized speech units), wherein the plurality of speech units are phrases, sentence clauses, or sentences (Fig. 6, ¶96, see automated assistant modified assistant outputs / synthesized speech units 654, 658, 662; compare Zheng, p. 9, ¶10, “speech synthesis processing is to convert characters in a text into a played speech file, convert Chinese characters, letters, numerals, English characters or sentences in the text into corresponding pronunciation files”).
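For illustration only, grouping words of the generated speech text into speech units at natural breaks (phrases, sentence clauses, or sentences) could be sketched with a simple punctuation-based split. The regex and function name are hypothetical stand-ins; neither reference discloses this particular implementation:

```python
import re

def parse_speech_units(text: str) -> list:
    # Split the speech text at natural breaks in human speech: after
    # sentence-ending or clause-ending punctuation followed by whitespace.
    units = re.split(r'(?<=[.!?;,])\s+', text.strip())
    # Drop any empty fragments produced by the split.
    return [u for u in units if u]
```

Each resulting unit (a phrase, clause, or sentence) could then be hashed and looked up individually in the cache database.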
Regarding Claims 11-12 and 20, Baeuml as modified by Zheng discloses responsive to identifying no matching hash in the cache database to a second hash queried against the cache database, generating a second audio byte for the speech unit corresponding to the second hash with a voice synthesizer and storing the second audio byte with the second hash in the cache database (Zheng, p. 9, ¶9, “And 104, when the voice file with the same file name is not inquired, performing voice synthesis based on the text, taking the calculated HASH value as the file name of the synthesized voice file, and storing the synthesized voice file in the set voice file directory”),
wherein generating the voiceover track for the virtual agent comprises combining the first audio byte and the second audio byte (Zheng, p. 9, ¶7, “And 103, inquiring voice files with the same file name, calling the voice files with the same file name, and taking the voice files with the same file name as the voice files of the text”, Zheng, p. 9, ¶9, “And 104, when the voice file with the same file name is not inquired, performing voice synthesis based on the text”, and Zheng, p. 9, ¶10 “speech synthesis processing is to convert characters in a text into a played speech file, convert Chinese characters, letters, numerals, English characters or sentences in the text into corresponding pronunciation files”; i.e., outputting comprises (1) voice files of the text with inquired file name / HASH value extracted from the set voice file directory and (2) the speech files after speech synthesis).
Regarding Claim 13, Baeuml discloses transmitting the voiceover track for the virtual agent for presentation in the virtual experience to the user (¶51 in view of Fig. 1, TTS engine 160A2 at Natural Conversation System 120 generates synthesized speech audio data for rendering engine 112 at client device 110 to cause the synthesized speech audio data to be audibly rendered by speakers of client device 110 for audible presentation to the user of the client device 110; i.e., transmit synthesized speech audio data from Natural Conversation System 120 through network 199 to client device 110).
Conclusion
Prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
US 2020/0380995 A1 discloses a user interface system comprising interface graphical items and a plurality of audio files containing spoken representations of announcement text of the items (Abstract), where the user interface is accessed with a request to hear spoken versions of the interface items based on an announcement identifier that is a hashed version of the announcement text itself (¶4); i.e., by issuing requests to a cache for audio files for each spoken version of the interface using a hashed version of the text (¶9).
US 9691376 B2 discloses a speech synthesis process for synthesizing speech from a text by identifying an acoustic unit sequential pair in the speech, searching for a concatenation cost (which is a pre-computed measure of mismatch between pairs of acoustic units stored in a database) for the acoustic unit sequential pair in a database using a hash table for the database, and when the concatenation cost is not found in the database, assigning a default value as the concatenation cost for the acoustic unit sequential pair (Abstract).
US 7921296 B2 discloses generating robust hashes / signatures of data files for audio clips, where an audio clip is represented by a concatenation of binary hash words.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to examiner Richard Z. Zhu, whose telephone number is 571-270-1587, or to the examiner’s supervisor, Hai Phan, whose telephone number is 571-272-6338. Examiner Richard Zhu can normally be reached M-Th, 0730-1700.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RICHARD Z ZHU/Primary Examiner, Art Unit 2654
03/03/2026