DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Khemka et al. (US 2023/0409615 A1) ("Khemka") in view of Evermann et al. (US 2019/0179890 A1) ("Evermann").
Regarding claims 1, 9 and 17, Khemka discloses a device, a non-transitory medium and method, “identifying, using an automated speech recognition (ASR) system, at least one named entity hypothesis from at least one audio input;” by receiving audio and performing ASR to generate one-best and n-best hypotheses, followed by NLU extracting slots/entities (Khemka Fig. 2; ¶¶ [0050]–[0056]; ¶¶ [0110]–[0116]); “providing, using the ASR system, the identified at least one named entity hypothesis to a large language model (LLM);” by passing ASR/NLU outputs to transformer-based models (e.g., T5) for dialogue state tracking and slot/value prediction using example-guided inputs (Khemka ¶¶ [0158]–[0176]; Fig. 15); “generating a prompt using an automated prompt generator;” by programmatically constructing question templates Q(s→q) and concatenating retrieved in-context examples (TransferQA formatting) that serve as prompts for the model (Khemka ¶¶ [0174]–[0176]; Fig. 15); “processing, using the LLM, the identified at least one named entity hypothesis and the prompt to generate updated named entity recognition data;” by feeding the template prompt plus examples and dialogue history into T5 to output slot values/updated entity recognition (Khemka ¶¶ [0169]–[0176]; Fig. 15).
Khemka does not specifically disclose “providing the updated named entity recognition data back to the ASR system.”
However, Evermann discloses "the natural language processor… re-processes the text string… to search for word matches… [and] can determine that the text actually refers to… [correct entity]… and feed corrected tokens back into the recognition pipeline" (Evermann ¶0014; ¶0135; Fig. 5).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Evermann's feedback of corrected named-entity tokens into Khemka's ASR/NLU pipeline to improve recognition accuracy. Both references address the same problem — errors in speech-to-text transcription — and Evermann's technique predictably improves ASR performance by reintegrating corrected entity data.
Regarding claims 2, 10 and 18, Khemka discloses the device, non-transitory medium and method of claims 1, 9 and 17, further comprising updating the ASR system with the updated named entity recognition data to enhance named entity recognition accuracy (i.e., model training, on-device/federated updates, and adapting models with data (Khemka ¶0061; ¶0114–¶0115; ¶0179–¶0181)).
It should also be noted that Evermann discloses a system that updates its speech recognition models with corrected named entities to improve future recognition accuracy (Evermann ¶0015; ¶0136). To the extent this feature is not fully taught by Khemka alone, it would have been obvious to integrate Evermann's update mechanism into Khemka's pipeline for the same predictable accuracy gains.
Regarding claims 3, 11 and 19, Khemka discloses the device, non-transitory medium and method of claims 2, 10 and 18, further comprising: accessing a set of audio samples of named entities collected from users of a voice assistance system, wherein each audio sample is annotated with a text transcript of the named entity and a corresponding category that the named entity belongs to of a plurality of categories (i.e., Khemka discloses user-collected named-entity audio samples, annotated with transcripts and categories (Khemka ¶0110–¶0116)); for each audio sample in the set of audio samples: generating the prompt including the named entity based on the corresponding category (i.e., Khemka's DST/TransferQA prompt generation with slot/category context (Khemka ¶0174–¶0176)); and providing the prompt as input to the LLM (Khemka ¶0174–¶0176), wherein the updated named entity recognition data includes a plurality of possible commands including the named entity based on the corresponding category (i.e., Khemka's LLM outputs slot values/dialogue state (Khemka ¶0176)); and training, based on the plurality of possible commands generated by the LLM, at least one of a language model or a text-to-speech (TTS) model of the ASR system (i.e., Khemka's training/fine-tuning of language/TTS models with generated examples (Khemka ¶0179–¶0181; ¶0102–¶0106)).
Regarding claims 4 and 12, Khemka discloses the device and method of claims 3 and 11, wherein the plurality of categories includes at least one of: an application name; a name of a person; a name of a television program; a name of a movie; a name of an electronic device; a name of a place; a name of a radio station; a name of a podcast; a name of a genre; a name of a business; a name of a sports team; or a name of a song (i.e., Khemka discloses many domains/slots and lists of entity types (music, movies, contacts, places, etc.), and teaches domain/slot vocabularies that encompass these categories (Khemka ¶0046; ¶0116; Fig. 3C)).
Regarding claims 5 and 13, Khemka discloses the device and method of claims 3 and 11, further comprising: creating a base model trained using the set of audio samples of named entities collected from the users of the voice assistance system; and periodically updating the ASR system and/or the LLM based on the base model (i.e., building base models from user data, on-device and federated training, and periodically updating models including ASR/NLU components (Khemka ¶0061; ¶0114–¶0115; ¶0179–¶0181)).
Regarding claims 6, 14 and 20, Khemka discloses the device, medium and method of claims 1, 9 and 17, further comprising: providing the prompt generated using the automated prompt generator to a text-to-speech (TTS) model and synthesizing, using the TTS model, an audio sample based on the prompt (i.e., Khemka explicitly describes a TTS component that generates synthesized speech from text and discusses generated examples for model fine-tuning (Khemka ¶0102–¶0106; ¶0179–¶0181)).
Khemka discloses general model training but does not explicitly teach training the ASR model using TTS-synthesized audio derived from LLM prompts.
Evermann teaches generating augmented training data (including TTS-synthesized audio) for improving recognition of named entities and using such synthesized samples for ASR training/augmentation (Evermann ¶0015; ¶0136).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Evermann's data-augmentation technique using TTS samples to Khemka's prompt/TTS flows in order to improve ASR training coverage for named entities recognized or produced by the LLM, and to improve overall ASR performance.
Regarding claims 7 and 15, Khemka discloses the device and method of claims 1 and 9, wherein the ASR system and the LLM are executed on a same electronic device (i.e., Khemka discloses on-device ASR and portions of NLU, and acknowledges that some processing (ASR, parts of NLU, and the LLM) may execute on the device (Khemka ¶0047–¶0053; Fig. 2)).
Regarding claims 8 and 16, Khemka discloses the device and method of claims 7 and 15, further comprising: providing user information stored on the same electronic device to the LLM; and processing, using the LLM, the identified at least one named entity hypothesis to generate the updated named entity recognition data using the user information (i.e., Khemka discloses using local user data and personalized lexicons/gazetteers, and applying such context in NLU/LLM processing (Khemka ¶0046; ¶0061; ¶0110–¶0116)).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PIERRE LOUIS DESIR whose telephone number is (571)272-7799. The examiner can normally be reached Monday-Friday 9AM-5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PIERRE LOUIS DESIR/ Supervisory Patent Examiner, Art Unit 2659