Prosecution Insights
Last updated: April 17, 2026
Application No. 19/314,555

SYSTEM AND METHOD FOR MULTI-MODAL AI CONVERSATIONAL INTERFACE IMPROVING WEBSITE NAVIGATION AND USER INTERACTION

Status: Final Rejection (§103)
Filed: Aug 29, 2025
Examiner: NAZAR, AHAMED I
Art Unit: 2178
Tech Center: 2100 — Computer Architecture & Software
Assignee: unknown
OA Round: 2 (Final)

Grant Probability: 53% (Moderate)
Predicted OA Rounds: 3-4
Predicted Time to Grant: 3y 11m
Grant Probability With Interview: 88%

Examiner Intelligence

Career Allow Rate: 53% (202 granted / 378 resolved; -1.6% vs TC avg)
Interview Lift: +35.1% (allowance rate of resolved cases with an interview vs without; a strong lift)
Typical Timeline: 3y 11m average prosecution; 29 applications currently pending
Career History: 407 total applications across all art units
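For readers who want to reproduce the headline numbers, the following is a minimal sketch in Python, not the analytics vendor's actual methodology. The 202/378 counts are taken from this page; the with/without-interview split in the example is hypothetical and is used only to show how a lift in percentage points is computed (the page reports an actual lift of +35.1).

# Sketch: deriving the examiner metrics shown above from resolved-case counts.
# The granted/resolved totals (202/378) come from this page; the interview
# split below is a hypothetical example, not examiner data.

granted, resolved = 202, 378
career_allow_rate = 100 * granted / resolved  # ~53.4%, shown above as 53%

def interview_lift(granted_iv, resolved_iv, granted_no_iv, resolved_no_iv):
    """Allowance rate of resolved cases with an interview minus the rate
    without one, in percentage points."""
    rate_iv = 100 * granted_iv / resolved_iv
    rate_no_iv = 100 * granted_no_iv / resolved_no_iv
    return rate_iv - rate_no_iv

print(f"Career allow rate: {career_allow_rate:.1f}%")
# Hypothetical split of the 378 resolved cases (80+122 granted, 100+278 resolved):
print(f"Interview lift: {interview_lift(80, 100, 122, 278):+.1f} points")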

Statute-Specific Performance

Statute    Allowance Rate    vs TC Avg Estimate
§101       9.2%              -30.8%
§103       59.7%             +19.7%
§102       15.3%             -24.7%
§112       9.6%              -30.4%

Tech Center averages are estimates. Based on career data from 378 resolved cases.
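As a quick check on the table, the Tech Center average behind each delta can be recovered as rate minus delta; all four statutes resolve to roughly 40%, consistent with a single reference line on the original chart. A minimal sketch using only the values above:

# Sketch: recovering the Tech Center average estimate from the table above.
# examiner_rate - delta_vs_tc = TC average; every statute resolves to ~40%.

rates = {"§101": 9.2, "§103": 59.7, "§102": 15.3, "§112": 9.6}
deltas = {"§101": -30.8, "§103": 19.7, "§102": -24.7, "§112": -30.4}

for statute, rate in rates.items():
    tc_avg = rate - deltas[statute]
    print(f"{statute}: examiner {rate:.1f}% vs TC average ~{tc_avg:.1f}%")
# -> the TC average estimate is 40.0% for each statute.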

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

This communication is responsive to the amendment filed 12/12/2025. Claims 1-2, 11, 15, and 20 have been amended, claim 3 has been canceled, and no claims have been added. Claims 1-2 and 4-20 are pending, with claims 1, 11, and 20 as independent claims.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2 and 4-20 are rejected under 35 U.S.C. 103 as being unpatentable over Baeuml et al. (US 2023/0343324, published 10/26/2023, hereinafter Baeuml) in view of Foster et al. (US 2019/0354594, published 11/21/2019, hereinafter Foster), and further in view of Gaskill et al. (US 2018/0052885, published 2/22/2018, hereinafter Gaskill).

Claim 1. A system for artificial intelligence (AI)-enabled interactive website transformation into multi-modal conversational platforms, comprising: a computing device having a processor and a memory configured to store one or more instructions executable by the processor, Baeuml discloses in [0117-0121] “Computing device 810 typically includes at least one processor 814 which communicates with a number of peripheral devices via bus subsystem 812.
These peripheral devices may include a storage subsystem 824, including, for example, a memory subsystem 825… Memory 825 used in the storage subsystem 824 can include a number of memories including a main random access memory (RAM) 830 for storage of instructions and data during program execution and a read only memory (ROM) 832 in which fixed instructions are stored.” (emphasis added), wherein the processor is configured to receive user input data and generate adaptive multimodal responses, wherein the computing device is in communication with a server and a database via a network, Baeuml discloses in [0125] “a method implemented by one or more processors is provided, and includes receiving a stream of audio data that captures a spoken utterance of a user of a client device, the stream of audio data being generated by one or more microphones of the client device, and the spoken utterance being directed to an instance of an automated assistant executing at least in part at the client device; and generating, based on processing the stream of audio data, a given assistant output that is responsive to the spoken utterance.” (emphasis added) examiner note: the multimodal response may indicate that the response includes speech/audio output synchronized with animated visual data corresponding to the spoken audio, wherein the processor is configured to: receive one or more user queries by an input module via a user interface, wherein the input module is configured to accept at least one of text data and speech data, wherein the input module is configured to communicate with a speech-to-text module to pre-process the speech data by performing normalization, segmentation, and transcribing it into text to improve recognition accuracy; Baeuml discloses in [0050-0055 and 0125] “the system can process the stream of audio data, using ASR model(s), to generate a stream of ASR output, such as one or more recognized terms corresponding to the spoken utterance captured in the stream of audio data. Further, the system can process the stream of ASR output, using NLU model(s), to generate a stream of NLU output, such as one or more intents and corresponding slot value for one or more parameters associated with one or more of the intents.” (emphasis added) examiner note: the captured spoken utterance may be user queries as input to automated speech recognition (ASR), which processes the spoken utterance into text, analyze the received text-based query by a natural language processing (NLP) module to interpret intent, classify user context, and [identify relevant website information], wherein the NLP module is configured to semantically retrieve and [combine information from multiple webpages within the website] to construct a contextually grounded response; Baeuml discloses in [0041] “the NLU engine 140A1 and/or 140A2 may include a coreference resolver (not depicted) configured to group, or “cluster,” references to the same entity based on one or more contextual cues.
For example, the coreference resolver may be utilized to resolve the term “them” to “buy theatre tickets” in the natural language input “buy them”, based on “theatre tickets” being mentioned in a client device notification rendered immediately prior to receiving input “buy them”.” And in [0075] “the system can process the stream of audio data, using ASR model(s), to generate a stream of ASR output, such as one or more recognized terms corresponding to the spoken utterance captured in the stream of audio data. Moreover, the system can process the stream of ASR output, using NLU model(s), to generate a stream of NLU output, such as one or more intents and corresponding slot value for one or more parameters associated with one or more of the intents… the system can process the stream of ASR output, the stream of NLU output, and/or a context of a dialog session in which the spoken utterance was provided by the user (if any), using the LLM, to generate the given assistant output.” (emphasis added) examiner note: the system analyzes the stream of audio data (text-based spoken utterance), as output by the ASR, using the natural language understanding (NLU) model, wherein the NLU output may be processed by the LLM to generate a given assistant output as a contextually grounded response based on grouping or clustering of recognized terms using semantics. For example, the phrase “buy them” may be resolved semantically to “buy theatre tickets”, adapt, by a persona adaptation module, conversational interaction based on the classified contextually grounded response by dynamically modifying vocabulary, tone, speech style, and avatar representation; Baeuml discloses in [0063] “the system causes synthesized speech audio data capturing synthesized speech corresponding to the stream of textual content or the stream of modified textual content to be audibly rendered for presentation to the user. For example, the system can process the stream of textual content, using TTS model(s), to generate the synthesized speech audio data capturing the synthesized speech corresponding to the stream of textual content (e.g., in instances where the stream of textual content is not modified) or the stream of modified textual content (e.g., in instances where the stream of textual content is modified). Notably, the system can utilize the given set of prosodic properties assigned to the given persona in generating the synthesized speech audio data to reflect a given tone, rhythm, pitch, intonation, and/or other prosodic properties associated with the given persona.” And in [0074] “the given persona can include… a given vocabulary that is specific to the given persona and that is utilized in modifying the stream of textual content to generate the stream of modified textual content, a given set of prosodic properties that is specific to the given persona and that is utilized in subsequently synthesizing the stream of modified textual content for audible presentation to the user, and/or a given set of visual cues that is specific to the given persona and that is utilized in modifying the stream of visual cues to generate the stream of modified visual cues.
Notably, each of the other disparate personas can also be associated with a corresponding vocabulary, a corresponding set of prosodic properties, and/or a corresponding set of visual cues.” And in [0076] “even though the LLM is general to one or more of the plurality of personas, the given assistant output generated at block 354 may be specific to the given persona assigned to the instance of the automated assistant, and without requiring any additional post-processing of the given assistant output to tailor it to the given persona assigned to the instance of the automated assistant, since the processing using the LLM generating the given assistant output is adapted to the given persona through utilization of the given persona data.” and in [0079] “method 400 of training large language models for utilization in dynamically adapting a given assistant output based on a given persona, from among a plurality of disparate personas, assigned to an automated assistant, is depicted.” And in [0086-0092] “assume the stream of textual content is represented by a data structure of <text_stream = “Hey there, how are you doing?”. In this example, the developer input can associate one or more visual cues with one or more portions of the stream of textual content, such as providing developer input that includes visual cues represented by a data structure of < = hand lift_“Hey there”/body gesture> to cause a visualized representation of an instance of an automated assistant to wave while the “Hey there” portion of the stream of textual content is being audibly presented and < = smile “how are you doing?”/face gesture> to cause the visualized representation of the instance of an automated assistant to smile while the “how are you doing” portion of the stream of textual content is being audibly presented… the system can also process the stream of vision data, using one or more movement tracking machine learning models (e.g., machine learning model(s) trained to track eye gaze, mouth movement, lip movement, body movement, body posture, etc.), to generate output indicative of how the person or character captured in the video content visually expresses themselves as they speak… the output generated using the one or more movement tracking machine learning models can be further processed (e.g., using one or more classification machine learning models) to determine the higher level animated physical motion gestures and/or screen animations.” (emphasis added) examiner note: a given persona may be adapted by modifying vocabulary, tone, rhythm, pitch, intonation, body gesture, lip movement, etc., generate a structured natural language output by a response generator module, wherein the response generator module is configured to produce structured outputs as text data, and transmits the text data to a text-to-speech synthesis module to generate spoken audio output while exporting phoneme alignment data; Baeuml discloses in [0058, 0062-0065] “the system modifies, based on a given assistant persona assigned to the instance of the automated assistant by the user and from among a plurality of disparate personas, the given assistant output to generate a modified given assistant output, the modified given assistant output including: (1) a stream of modified textual content that differs from the stream of textual content; and/or (2) a stream of modified visual cues that differs from the stream of visual cues.
In generating the modified given assistant output, the system may also consider a context of a dialog session (if any) in which the spoken utterance is provided by the user… the modified given assistant output may better resonate with the user engaged in the dialog session with the instance of the automated assistant… the system can process the stream of textual content, using TTS model(s), to generate the synthesized speech audio data capturing the synthesized speech corresponding to the stream of textual content (e.g., in instances where the stream of textual content is not modified) or the stream of modified textual content (e.g., in instances where the stream of textual content is modified)… the system can cause the synthesized speech audio data to be audibly rendered for presentation to the user via one or more speakers of the client device.” and in [0092] “the system processes the stream of audio data to generate a stream of textual content and the stream of vision data to generate a stream of visual cues… the system can also process the stream of vision data, using one or more movement tracking machine learning models (e.g., machine learning model(s) trained to track eye gaze, mouth movement, lip movement, body movement, body posture, etc.), to generate output indicative of how the person or character captured in the video content visually expresses themselves as they speak… this synchronization may be performed by a dedicated synchronization engine and/or other component of the system that is capable of synchronizing the stream of visual cues or the stream of modified textual context and the utilization of the stream of visual cues or the modified stream of visual cues.” (emphasis added) examiner note: the modified given assistant output may be structured outputs that comprise modified textual content and modified visual cues, wherein the modified textual content may be transmitted to a text-to-speech (TTS) model to generate spoken audio output. The phrase “exporting phoneme alignment data” may refer to synchronization data that synchronizes the modified textual content and/or the modified visual cues, and generate synchronized video output using an avatar generation module based on the phoneme alignment data, wherein the avatar generation module produces a lifelike video of an animated persona delivering the generated spoken audio output in synchrony with lip movements and gestures, and render generated outputs through an output rendering module integrated into the user interface, Baeuml discloses in [0005] “The given persona can be embodied by, for example, a given vocabulary that is specific to the given persona and that is utilized in generating the corresponding streams of textual content, a given set of prosodic properties that is specific to the given persona and that is utilized in synthesizing the corresponding streams of textual content for audible presentation to the user, and/or a given set of visual cues that includes some visual cues that are specific to the given persona (e.g., animated physical motion gestures that are commonly associated with the visualized representation of the instance of the automated assistant) and that includes some visual cues that are common amongst multiple personas of the plurality of disparate personas (e.g., waving, certain facial expressions, etc.).
The visualized representation of the automated assistant can be, for example, an animated avatar or entity that represents the instance of the automated assistant, and can be based on, for example, a real human, a fictional character, animated object(s) and/or animal(s), and/or other visualized representations.” And in [0064] “the stream of visual cues or the modified stream of visual cues are utilized in controlling the visualized representation of the instance of the automated assistant. The visualized representation of the instance of the automated assistant can be, for example, an avatar corresponding to an animated person (e.g., real or fictional), character (e.g., a butler, a pirate, a chef), object (e.g., animated assistant dots), animal, and/or any other visualized representation.” (emphasis added) examiner note: the rendering of the animated avatar may be a synchronized video presenting speech and motion/movement to mimic a lifelike video as a response to the user's spoken utterance input, wherein the output rendering module is configured to display responses in multi-modal formats, which include a text transcript, an audio playback, and an embedded avatar video within the user's browser environment, Baeuml discloses in [0064-0065] “the system causes the stream of visual cues or the stream of modified visual cues to be utilized in controlling a display of the client device and/or in controlling a visualized representation of the instance of the automated assistant. In some implementations, the stream of visual cues or the stream of modified visual cues are utilized in controlling the display of the client device. In these implementations, the stream of visual cues or the stream of modified visual cues can cause the display of the client device to visually render one or more screen animations for presentation to the user (e.g., as described with respect to FIGS. 6A and 6B)… the stream of visual cues or the stream of modified visual cues can cause the visualized representation of the automated assistant to perform one or more animated physical gesture motions. These physical gesture motions can include, for example, general physical gesture motions that are general across all visualized representations (e.g., waving, entering the display of the client device, exiting the display of the client device, etc.), persona specific physical gesture motions that are specific to the given persona assigned to the automated assistant (e.g., a signature move of a fictional character), emotions portrayed by the visualized representation (e.g., happy, sad, angry, etc.) that may optionally be coupled with the above general or persona specific physical gesture motions, facial expressions (e.g., smiling, frowning, etc.)
that may optionally be coupled with the above general or persona specific physical gesture motions, and/or other physical gesture motions… the system can synchronize, for presentation to the user, the audible rendering of the synthesized speech corresponding to the stream of visual cues or the stream of modified textual context and the utilization of the stream of visual cues or the modified stream of visual cues in controlling the display of the client device and/or for controlling the visualized representation of the instance of the automated assistant.” (emphasis added) examiner note: the rendered animated video may comprise physical gesture motions synchronized, for presentation, with audible rendering of synthesized speech corresponding to the visual cues, and wherein the system transforms a static website into an adaptive conversational platform by enabling natural language queries to directly trigger retrieval, navigation, and presentation of relevant website information, thereby overcoming limitations of conventional static menus and scripted chatbot systems. Baeuml discloses in [0053] “the system receives a stream of audio data that captures a spoken utterance of a user of a client device, the spoken utterance being directed to an instance of an automated assistant executing at least in part at the client device. The stream of audio data can be generated, for example, via one or more microphones of the client device… the system may receive the stream of audio data in response to the instance of the automated assistant being explicitly invoked, such as based on detection of a particular word or phrase at the client device (e.g., “Assistant”, “Hey Assistant”, etc.), based on actuation of a button at the client device (e.g., a hardware button of the client device, a software button of a display of the client device), based on detection of a particular gesture at the client device, and/or using other invocation techniques.” And in [0056] “the system can generate, based on at least the stream of NLU output, one or more structured requests to be transmitted to one or more 1P systems and/or one or more 3P systems. In response to transmitting one or more of the structured requests to the one or more 1P systems and/or one or more 3P systems, the system can receive responsive content from the one or more 1P systems and/or one or more 3P systems. The response content can be utilized in generating the stream of textual content and the stream of visual cues included in the given assistant output. For example, if the spoken utterance corresponds to “Assistant, what’s the weather”, the stream of audio data capturing the spoken utterance can be processed to obtain a stream of textual content (e.g., “The weather today is rainy and 45”) to be synthesized for audible presentation to the user and a stream of visual cues (e.g., an information card associated with the weather).” And in [0058] “the system modifies, based on a given assistant persona assigned to the instance of the automated assistant by the user and from among a plurality of disparate personas, the given assistant output to generate a modified given assistant output, the modified given assistant output including: (1) a stream of modified textual content that differs from the stream of textual content; and/or (2) a stream of modified visual cues that differs from the stream of visual cues.
In generating the modified given assistant output, the system may also consider a context of a dialog session (if any) in which the spoken utterance is provided by the user. The given persona can include, for example, a given vocabulary that is specific to the given persona and that is utilized in modifying the stream of textual content to generate the stream of modified textual content, a given set of prosodic properties that is specific to the given persona and that is utilized in subsequently synthesizing the stream of modified textual content for audible presentation to the user, and/or a given set of visual cues that is specific to the given persona and that is utilized in modifying the stream of visual cues to generate the stream of modified visual cues.” (emphasis added) examiner note: the website may be transformed from static (e.g., a textual content response) to a dynamic website that includes visual cues, including animation mimicking a particular persona, wherein the processor is configured to replace a static homepage of the website with an interactive conversation window integrated into the user interface, and to enable conversational queries to directly trigger navigation by displaying or linking to relevant webpages within the website; Baeuml discloses in [0096] “the display 190 may include a first portion 190A that includes an indication a user account that is active at the client device 110 (e.g., as indicated by a user account symbol in the right-hand side of the first portion 190A of the display 190), and an indication of when various components are active at the client device 110, such as one or more microphones or speech processing components of the client device 110 (e.g., as indicated by the ellipses 190A1 in the first portion 190A of the display 190)… the display 190 may include a second portion 190B that includes a transcription of a dialog between a user of the client device 110 and an instance of the automated assistant that is implemented at least in part at the client device 110. Further, the display may include a third portion 190C that includes a space for visual content to be provided for visual presentation to the user (e.g., a home screen).” And in [0097] “the disparate portions of the display may overlay one another (e.g., the second portion 190B of the display 190 overlaying the third portion 190C of the display 190) and/or be omitted in certain circumstances (e.g., the second portion 190B of the display 190 may be omitted when the user is not engaged in a dialog session with the instance of the automated assistant).” And in [0100] “the visual content provided for visual presentation to the user according to the stream of visual cues may be static visual content. For instance, the waves, the fish, and the birds may not move once displayed unless otherwise indicated by the stream of visual cues. In other implementations, the visual content provided for visual presentation to the user according to the stream of visual cues may be dynamic visual content.” (emphasis added) examiner note: the portion 190A may be a static homepage, as indicated by active components such as the ellipses 190A1 shown in Fig. 7A. Portion 190B displays a transcription of the conversation between a user and the automated assistant (conversational queries). Display 190B, as a multi-media content page, may be a dynamic homepage that replaces portion 190A by way of overlay, for example.
The dynamic homepage (portion 190B) may be generated (triggered) based on the conversation between the user and the automated assistant, and Baeuml does not explicitly disclose automatically generate and transmit, by an escalation module, a follow-up email to the user containing additional information and clarifications, and initiate further interaction by either connecting the user with designated management personnel or providing corresponding contact details through the conversation window when a query cannot be resolved. However, Gaskill, in an analogous art, discloses in [0086] “the NLU component 214 may provide as its output the dominant object, user intent, and the knowledge graph 808 that is formulated along dimensions likely to be relevant to the user query. This information may help the dialog manager 216 if there is missing information needed to fully resolve a user query to an item recommendation, and thus whether (and how) to then to prompt the user to further refine the user's requirements via additional input.” And in [0128] “the dialog manager 216 may select a prompt that comprises a validating statement, such as “I understand you want to find red Nike shoes” or “OK I can help you find red Nike shoes now” to conversationally lead the user to provide further confirmatory and revelatory discussion of the dominant object of user interest. This prompt type allows a user to resolve ambiguities that the intelligent personal assistant system 106 may not have been able to resolve automatically without asking question type prompts that may cause confusion.” (emphasis added) examiner note: the prompt may be a follow-up email message generated automatically by the dialog manager 216 and transmitted to the user requesting additional information for resolving a query. Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Baeuml with the teaching of Gaskill because “ambiguity may occur for example if there are many unusual spelling errors in user textual input, or if the user's speech was received in a noisy environment, so that normalizing has not worked well…This prompt type allows a user to resolve ambiguities that the intelligent personal assistant system 106 may not have been able to resolve automatically without asking question type prompts that may cause confusion.” Gaskill [0128]. Baeuml does not explicitly disclose identify relevant website information, wherein the NLP module is configured to semantically retrieve and combine information from multiple webpages within the website. However, Foster, in an analogous art, discloses in [0030] “By way of example consider a message “What's the weather today? I hope there's no rain.” The response can be templatic and include slots for retrieval and insertion of pertinent weather information. Thus, the response can be “Sorry, there is a high chance of rain, but not too cold at 50 degrees. Don't forget your umbrella!” In this example, the slots that are filled correspond to “rain” and “50 degrees.” This response is dynamically generated based on inputs received from elsewhere (e.g., application programming interface calls) based on context. As another example, consider the message or query “How far is Seattle from Hyderabad anyway?” The small talk or chatty response can be “Pretty far.
It's 7,770 miles away.”” Here, the slot that is being filled by external function call is the distance “7,770.”” And in [0033] “In one instance the persona-based language generation models can be built and deployed by way of an internet-based service. For example, a web portal can be provided that allows a user to select a predefined or build a custom persona-based language generation model that can be exposed as a service or made available for download. Consider, for example, a movie theater company that desires to sell movie tickets through a conversational agent. A web portal can be utilized to upload data and/or specific configuration settings. A persona-based language generation model can be generated based on the data and/or configuration settings and made available for use by the movie theater.” And in [0038] “FIG. 5 is a flow chart diagram that depicts a method 500 of message processing. At reference numeral 510, a message is received, wherein the message is a query. At 520, a decision is made as to whether or not the query is task oriented or not. A task-oriented query seeks to accomplish a task such as book a vacation, order a product, or receive instructions, among other things… A query that is not task oriented is one that does not seek to accomplish a task but rather can be small talk or chitchat about things that are trivial or uncontroversial, for example “How are you?” or “Where do you live?” At 540, a language generation model is invoked.” And in [0039] “a message is received, wherein the message is a query. For example, the query can be “Is it going to be sunny today?” At numeral 620, data that is required to satisfy the query is requested and received. For example, weather data can be requested and received by interaction with a weather application programming interface… functionality can be invoked involving slot filling to parameterize language generation. Continuing with the ongoing example, the generated response can be “Sorry, it will be cloudy today with a high chance of showers. Don't forget your umbrella!” In this case weather data, namely cloudy and chance of showers are integrated into a response that satisfies the query and is casual and friendly in nature. In addition to receiving data from an external source, it should be appreciated that a response can trigger, or include, an action directed toward completion of a message that is task-based. For example, a message that requests that lights be turned on can be responded to by turning on the lights. At reference numeral 640, the generated response to the query is returned.” And in [0040] “a persona-based language generation model is automatically generated based on the identified interlocutors. Additionally, the persona-based language generation model can be created to include action paths based on intent of a message, such as whether or not to respond to a specific intent. At numeral 740, the generated model is returned, for example for use by a conversational agent such as a chatbot, game character or conversation-enabled physical robot. The model enables responses to be generated that mimic the style and tone of the one or more interlocutors representative of the persona configuration.” (emphasis added) examiner note: the website may be a web portal that may request, based on a query, weather information from an external source and may retrieve information responsive to a query about a map application, such as the distance between Seattle and Hyderabad.
The different information sources may be based on the web portal pages or a combination of internal and external applications. Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Baeuml with the teaching of Foster because “The integration of the conversation generation system 100 with conversational agent development can enable predefined or custom language generation models to be highly compatible and easy to use… the persona-based language generation models can be built and deployed by way of an internet-based service… a web portal can be provided that allows a user to select a predefined or build a custom persona-based language generation model that can be exposed as a service or made available for download.” Foster [0032-0033]. Claim 2. The rejection of the system of claim 1 is incorporated, wherein the persona adaptation module is operable to switch among multiple roles which include at least one of a sales assistant, a recruiter, an educator, a healthcare professional, a financial advisor persona, a customer support persona, a legal advisor persona, a technical support persona, a government service agent persona, an ecommerce shopping guide persona, an entertainment/media host persona, and a compliance trainer persona, in response to the identified user context, Baeuml discloses in [0058-0060] “the system modifies, based on a given assistant persona assigned to the instance of the automated assistant by the user and from among a plurality of disparate personas, the given assistant output to generate a modified given assistant output, the modified given assistant output including: (1) a stream of modified textual content that differs from the stream of textual content; and/or (2) a stream of modified visual cues that differs from the stream of visual cues… the user may provide a spoken utterance of “talk to Walter” (e.g., where “Walter” is a reference to the given persona (e.g., a butler persona)) or “Assistant, pretend you are Blackbeard” (e.g., where “Blackbeard” is a reference to the given persona (e.g., a pirate persona))… the system may process audio data capturing the spoken utterance to identify the voice command to assign the given persona to the instance of the automated assistant using various components described herein (e.g., ASR, NLU, fulfillment, etc.)… the identity of the user (e.g., determined as described above with respect to the operations of block 252) can be utilized to determine which persona is assigned to the instance of the automated assistant for the multiple different users.” (emphasis added) examiner note: based on the identity of the user, a given persona may be assigned to the automated assistant, and wherein the relevant webpages within the website includes at least one of case studies, product descriptions, application forms, service catalogs, pricing pages, user manuals, knowledgebase articles, policy documents, FAQs, training modules, multimedia content pages, customer testimonials, blog posts, career pages, and contact and support pages. Baeuml discloses in [0096] “the display 190 may include a second portion 190B that includes a transcription of a dialog between a user of the client device 110 and an instance of the automated assistant that is implemented at least in part at the client device 110.
Further, the display may include a third portion 190C that includes a space for visual content to be provided for visual presentation to the user (e.g., a home screen).” And in [0097] “the disparate portions of the display may overlay one another (e.g., the second portion 190B of the display 190 overlaying the third portion 190C of the display 190) and/or be omitted in certain circumstances (e.g., the second portion 190B of the display 190 may be omitted when the user is not engaged in a dialog session with the instance of the automated assistant).” And in [0100] “the visual content provided for visual presentation to the user according to the stream of visual cues may be static visual content. For instance, the waves, the fish, and the birds may not move once displayed unless otherwise indicated by the stream of visual cues. In other implementations, the visual content provided for visual presentation to the user according to the stream of visual cues may be dynamic visual content.” (emphasis added) examiner note: the portion 190B (static) may be a homepage that displays a transcription of the conversation between a user and the automated assistant (conversational queries). Portion 190C (which may be a multi-media content page) may be a dynamic homepage that replaces portion 190B by way of overlay, for example. The dynamic homepage (portion 190C) may be generated (triggered) based on the conversation between the user and the automated assistant. Claim 4. The rejection of the system of claim 1 is incorporated, wherein the processor is further configured to: perform content cleaning operations, which include personal identifiable information (PII) scrubbing, track text changes, and metadata enrichment prior to indexing; Baeuml discloses in [0124] “certain data may be treated in one or more ways before it is stored or used, so that personal identifiable information is removed. For example, a user’s identity may be treated so that no personal identifiable information can be determined for the user, or a user’s geographic location may be generalized where geographic location information is obtained (such as to a city, ZIP code, or state level), so that a particular geographic location of a user cannot be determined. Thus, the user may have control over how information is collected about the user and/or used.” (emphasis added), segment normalized content into metadata-tagged chunks, wherein the metadata-tagged chunks comprise at least one of a source, tags, or persona visibility; Baeuml discloses in [0040] “the NLU engine 140A1 and/or 140A2 may include an entity tagger (not depicted) configured to annotate entity references in one or more segments of the recognized text, such as references to people (including, for instance, literary characters, celebrities, public figures, etc.), organizations, locations (real and imaginary), and so forth.” (emphasis added), generate vector embeddings of the segmented content using a semantic embedding model, and index the embeddings for retrieval-augmented generation (RAG); Baeuml discloses in [0047] “the given LLM may additionally process given persona data (e.g., stored in persona data database 170C) for a given persona that is assigned to the automated assistant 115.
In these implementations, the given persona data may correspond to, for example, a given persona token, a given persona embedding, a given persona vector, and/or other data that may be utilized to embody a given persona that is specific to the instance of the automated assistant 115.” And in [0058] “The given persona can include, for example, a given vocabulary that is specific to the given persona and that is utilized in modifying the stream of textual content to generate the stream of modified textual content, a given set of prosodic properties that is specific to the given persona and that is utilized in subsequently synthesizing the stream of modified textual content for audible presentation to the user, and/or a given set of visual cues that is specific to the given persona and that is utilized in modifying the stream of visual cues to generate the stream of modified visual cues. The context of the dialog session can be determined based on one or more contextual signals that include, for example, a time of day, a day of week, a location of the client device, ambient noise detected in an environment of the client device, user profile data, software application data, environmental data about a known environment of the user of the client device, dialog history of the dialog session between the user and the automated assistant, and/or other contextual signals, and can be represented in various manners (e.g., a vector representation, a semantic token representation, and/or other representations). Notably, each of the other disparate personas can also be associated with a corresponding vocabulary, a corresponding set of prosodic properties, and/or a corresponding set of visual cues.” (emphasis added). enforce compliance policies by recording user consent, masking sensitive data in logs, and automatically purging stored content after a configurable retention period; Baeuml discloses in [0124] “the users may be provided with an opportunity to control whether programs or features collect user information (e.g., information about a user’s social network, social actions or activities, profession, a user’s preferences, or a user’s current geographic location), or to control whether and/or how to receive content from the content server that may be more relevant to the user… certain data may be treated in one or more ways before it is stored or used, so that personal identifiable information is removed. For example, a user’s identity may be treated so that no personal identifiable information can be determined for the user, or a user’s geographic location may be generalized where geographic location information is obtained (such as to a city, ZIP code, or state level), so that a particular geographic location of a user cannot be determined. Thus, the user may have control over how information is collected about the user and/or used.” (emphasis added) examiner note: allowing the user to control the collection and use of his/her own personal identifiable information may be recording user consent. Not determining the user’s identifiable information or limiting particular geolocation information to a city, zip code, or state level may be masking sensitive data in logs. Removing the user’s identifiable information before it is stored may indicate that the user’s identifiable information may be retained while it is used, and capture runtime telemetry, which include chat transcripts, audio or voice metrics, and frontend user events for performance and usage analytics.
Baeuml discloses in [0027] “the client device 110 may be equipped with one or more microphones that generate streams of audio data, such as streams of audio data that capture spoken utterances of the user and/or other sounds in an environment of the client device 110. Additionally, or alternatively, the client device 110 may be equipped with one or more vision components that are configured to generate streams of vision data that capture images, videos, and/or certain movements (e.g., gestures) detected in a field of view of one or more of the vision components.” And in [0032] “the client device 110 may perform speaker identification (SID) to recognize a user from their voice and/or facial identification (FID) to recognize a user from vision data that captures a face of the user.” And in [0039] “the corresponding streams of ASR output can include, for example, streams of ASR hypotheses (e.g., term hypotheses and/or transcription hypotheses) that are predicted to correspond to spoken utterance(s) of a user that are captured in the corresponding streams of audio data, one or more corresponding predicted measures (e.g., probabilities, log likelihoods, and/or other values) for each of the ASR hypotheses included in the streams of ASR hypotheses, a plurality of phonemes that are predicted to correspond to spoken utterance(s) of a user that are captured in the corresponding streams of audio data, and/or other ASR output.” And in [0081] “the given persona training instance may be generated based on analyzing video content from an online multimedia repository”. (emphasis added) examiner note: captured audio data and visual data may be analyzed for a chat transcript of the spoken utterance and/or face recognition. Baeuml does not explicitly disclose wherein the processor is configured to normalize heterogeneous content formats, comprising hypertext markup language (HTML), portable document format (PDF), and Markdown, into structured text for uniform processing. However, Foster discloses in [0028 and 0033] “a movie theater company that desires to sell movie tickets through a conversational agent. A web portal can be utilized to upload data and/or specific configuration settings. A persona-based language generation model can be generated based on the data and/or configuration settings and made available for use by the movie theater. The result is a persona-based conversational agent is provided by the movie theater that enables tickets to be purchased by way of a friendly or casual conversation, for example.” (emphasis added) examiner note: the web portal may be based on a markup language such as HTML such that a conversational agent may be utilized or configured to receive messages (utterance/query) and output responses to the received messages, wherein the query may be converted into structured text.
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Baeuml with the teaching of Foster because “The integration of the conversation generation system 100 with conversational agent development can enable predefined or custom language generation models to be highly compatible and easy to use… the persona-based language generation models can be built and deployed by way of an internet-based service… a web portal can be provided that allows a user to select a predefined or build a custom persona-based language generation model that can be exposed as a service or made available for download.” Foster [0032-0033]. Claim 5. The rejection of the system of claim 1 is incorporated, wherein the processor is configured to transmit telemetry data into an analytics pipeline, which include log aggregation, application insights, and data transformation modules for funnel and cohort analysis. Baeuml discloses in [0056] “the system can generate, based on at least the stream of NLU output, one or more structured requests to be transmitted to one or more 1P systems and/or one or more 3P systems. In response to transmitting one or more of the structured requests to the one or more 1P systems and/or one or more 3P systems, the system can receive responsive content from the one or more 1P systems and/or one or more 3P systems. The response content can be utilized in generating the stream of textual content and the stream of visual cues included in the given assistant output. For example, if the spoken utterance corresponds to “Assistant, what’s the weather”, the stream of audio data capturing the spoken utterance can be processed to obtain a stream of textual content (e.g., “The weather today is rainy and 45”) to be synthesized for audible presentation to the user and a stream of visual cues (e.g., an information card associated with the weather).” (emphasis added) examiner note: the received responsive content (telemetry data), from the 1P and/or 3P system, may be transmitted to the system for processing to generate the stream of textual content and the stream of visual cues. Claim 6. The rejection of the system of claim 1 is incorporated, wherein the processor is configured to perform quality review operations, which include transcript analysis, user feedback scoring, and automated benchmarking of response accuracy to update prompting strategies.
Baeuml discloses in [0059] “in setting up the client device, the user may have been prompted to select the given persona, from among the plurality of disparate personas, to be assigned to the instance of the automated assistant… the given persona may be assigned to the instance of the automated assistant via settings of an automated assistant application that is associated with the instance of the automated assistant… the user may navigate to the settings of the automated assistant application that is associated with the automated instance of the automated assistant and be able to select the given persona from among the plurality of different personas… the given persona may be assigned to the instance of the automated assistant based on a voice command included in a spoken utterance that is directed to the instance of the automated assistant.” And in [0094] “the methods 500A and 500B may be utilized to generate given persona training instance for initially bootstrapping the instance of the given LLM, but the method 500C may be utilized to further train and refine the instance of the given LLM to reduce time and costs associated with generating the given persona training instances. Alternatively, the method 500C may be utilized for initially bootstrapping the instance of the given LLM to reduce time and costs associated with generating the given persona training instances, but the methods 500A and 500B may be utilized to further train and refine the instance of the given LLM to ensure higher accuracy and precision in subsequent use of the instance of the given LLM.” (emphasis added) examiner note: transcript analysis may indicate selecting a given persona based on recognizing a voice command included in the spoken utterance. The user feedback scoring may indicate the user may have been prompted to select a given persona from among a plurality of personas to accurately update assigning a persona based on recognizing the user's voice command. Claim 7. The rejection of the system of claim 1 is incorporated, wherein the processor is configured to integrate with enterprise platforms, which include customer relationship management (CRM), electronic health record (EHR/HL7), IT service management (ITSM), and geolocation services. Baeuml discloses in [0040] “the corresponding streams of NLU output can include, for example, streams of annotated recognized text that includes one or more annotations of the recognized text for one or more (e.g., all) of the terms of the recognized text, one or more corresponding predicted measures (e.g., probabilities, log likelihoods, and/or other values) for NLU output included in the streams of NLU output, and/or other NLU output… the NLU engine 140A1 and/or 140A2 may include a part of speech tagger (not depicted) configured to annotate terms with their grammatical roles… the NLU engine 140A1 and/or 140A2 may include an entity tagger (not depicted) configured to annotate entity references in one or more segments of the recognized text, such as references to people (including, for instance, literary characters, celebrities, public figures, etc.), organizations, locations (real and imaginary), and so forth.” (emphasis added) examiner note: from the recognized text terms, the system may identify entities such as customer relationship management, electronic health record, IT service management, and geolocation services as organizations and/or locations. Claim 8.
The rejection of the system of claim 1 is incorporated, wherein the processor is configured to dynamically adapt compliance rules, persona selection, and conversational tone based on detected user domain, which include healthcare, recruitment, customer service, education, financial services, ecommerce, legal advisory, government services, technical support, industrial operations, and entertainment. Baeuml discloses in [0063, 0110] “in generating the corresponding synthesized speech 754A and 758A based on the corresponding streams of textual content, a set of prosodic properties associated with the butler persona can utilized to ensure that a tone, rhythm, pitch, cadence, and/or other prosodic properties reflect that of the butler persona. Accordingly, the instance of the automated assistant can verbally reflect not only terminology that a real butler may utilize in terms of the corresponding streams of textual content, but also a speaking style that the real butler may utilize in terms of synthesizing the corresponding streams of textual content, thereby reflecting the formal and conservative personality of the butler persona.” (emphasis added) examiner note: the automated assistant may be selected to play the role of the “butler persona” based on identifying terms in the text content using synthesized speech of a real butler. Claim 9. The rejection of the system of claim 1 is incorporated, wherein the processor is configured to orchestrate secure operations using identity management, secrets vault, and feature flagging for enabling or disabling selected conversational capabilities. Baeuml discloses in [0042] “multiple users co-located in a household may each utilize a given client device, and each of the multiple users may have a separate automated assistant account that is personal to each of the multiple users… the automated assistant 115 can utilize one or more user identification techniques described herein to identify a given user, from among the multiple users, and tailor any dialog session accordingly by utilizing a given persona that the given user assigned to the automated assistant 115 (e.g., which may differ from another persona assigned to the automated assistant 115 by another one of the multiple users), utilizing information that is specific to the given user (e.g., calendar information for the given user, mobile device information from a mobile device of the given user (e.g., incoming electronic communications directed to the given user), etc.), and/or other information to tailor the dialog session to the given user.” And in [0048] “The user identification engine 181 can determine the identity of the user that provided the spoken utterance captured in the stream of audio data based on processing the stream of audio data (e.g., using speaker identification (SID) model(s)), processing a stream of vision data that captures the user that provided the spoken utterance (e.g., using face identification (FID) model(s)), based on an automated assistant account associated with the automated assistant that is active at the client device 110, and/or other by using other techniques. Identifying the user that provided the spoken utterance captured in the stream of audio data is described in more detail herein (e.g., with respect to FIGS. 2 and 3).” (emphasis added) examiner note: the secure operations using identity management may be identifying a specific given user from among multiple users. The secrets vault may be utilizing the identified given user's information, such as calendar information.
Feature flagging for enabling or disabling selected conversational capabilities may be assigning a given persona based on the identified user. Claim 10. The rejection of the system of claim 1 is incorporated, wherein the processor is configured to maintain an event bus for orchestrating asynchronous communication between the speech-to-text module, natural language understanding module, text-to-speech module, and avatar video generator. Baeuml discloses in [0056] “the system can generate, based on at least the stream of NLU output, one or more structured requests to be transmitted to one or more 1P systems and/or one or more 3P systems. In response to transmitting one or more of the structured requests to the one or more 1P systems and/or one or more 3P systems, the system can receive responsive content from the one or more 1P systems and/or one or more 3P systems. The response content can be utilized in generating the stream of textual content and the stream of visual cues included in the given assistant output. For example, if the spoken utterance corresponds to “Assistant, what’s the weather”, the stream of audio data capturing the spoken utterance can be processed to obtain a stream of textual content (e.g., “The weather today is rainy and 45”) to be synthesized for audible presentation to the user and a stream of visual cues (e.g., an information card associated with the weather).” (emphasis added) examiner note: orchestrating asynchronous communication among the speech-to-text, NLU, and text-to-speech modules may indicate transmitting the spoken utterance, as processed by the automated speech recognition (ASR) engine, to speech-to-text, wherein the text content may be processed by the NLU, and the received responsive content (telemetry data) from the 1P and/or 3P system may be transmitted to the system for processing to generate the stream of textual content and the stream of visual cues. Claim 11. A method for transforming a static website into an artificial intelligence (AI)-enabled multimodal conversational agent using a system, comprising: receiving, by an input module of a conversation window executing on a user device, one or more user queries as at least one of text inputs or audio inputs, thereby pre-processing the audio input by a speech-to-text module by normalizing, segmenting, and transcribing into text to improve recognition accuracy; Baeuml discloses in [0050-0055 and 0125] “the system can process the stream of audio data, using ASR model(s), to generate a stream of ASR output, such as one or more recognized terms corresponding to the spoken utterance captured in the stream of audio data.
Claim 11. A method for transforming a static website into an artificial intelligence (AI)-enabled multimodal conversational agent using a system, comprising: receiving, by an input module of a conversation window executing on a user device, one or more user queries as at least one of text inputs or audio inputs, thereby pre-processing the audio input by a speech-to-text module by normalizing, segmenting, and transcribing into text to improve recognition accuracy; Baeuml discloses in [0050-0055 and 0125] “the system can process the stream of audio data, using ASR model(s), to generate a stream of ASR output, such as one or more recognized terms corresponding to the spoken utterance captured in the stream of audio data. Further, the system can process the stream of ASR output, using NLU model(s), to generate a stream of NLU output, such as one or more intents and corresponding slot value for one or more parameters associated with one or more of the intents.” (emphasis added) examiner note: the captured spoken utterance may be user queries as input to automated speech recognition (ASR), which processes the spoken utterance into text, analyzing, by a natural language processing (NLP) module, the typed or transcribed text to determine user intent, classify user context, and [identify relevant website information], thereby semantically retrieving and [combining grounded passages from multiple webpages of the website] to construct a contextually grounded response; Baeuml discloses in [0041] “the NLU engine 140A1 and/or 140A2 may include a coreference resolver (not depicted) configured to group, or “cluster,” references to the same entity based on one or more contextual cues. For example, the coreference resolver may be utilized to resolve the term “them” to “buy theatre tickets” in the natural language input “buy them”, based on “theatre tickets” being mentioned in a client device notification rendered immediately prior to receiving input “buy them”.” And in [0075] “the system can process the stream of audio data, using ASR model(s), to generate a stream of ASR output, such as one or more recognized terms corresponding to the spoken utterance captured in the stream of audio data… the system can process the stream of audio data, using ASR model(s), to generate a stream of ASR output, such as one or more recognized terms corresponding to the spoken utterance captured in the stream of audio data. Moreover, the system can process the stream of ASR output, using NLU model(s), to generate a stream of NLU output, such as one or more intents and corresponding slot value for one or more parameters associated with one or more of the intents… the system can process the stream of ASR output, the stream of NLU output, and/or a context of a dialog session in which the spoken utterance was provided by the user (if any), using the LLM, to generate the given assistant output.” (emphasis added) examiner note: the system analyzes the stream of ASR output (the text of the spoken utterance) using the natural language understanding (NLU) model, wherein the NLU output may be processed by the LLM to generate the given assistant output as a contextually grounded response based on grouping or clustering of recognized terms using semantics. For example, the phrase “buy them” may be resolved semantically to “buy theatre tickets”, adapting, by a persona adaptation module, conversational interaction based on the classified context by dynamically modifying at least one of vocabulary, tone, speech style, and avatar representation, thereby switching among multiple roles; Baeuml discloses in [0063] “the system causes synthesized speech audio data capturing synthesized speech corresponding to the stream of textual content or the stream of modified textual content to be audibly rendered for presentation to the user. For example, the system can process the stream of textual content, using TTS model(s), to generate the synthesized speech audio data capturing the synthesized speech corresponding to the stream of textual content (e.g., in instances where the stream of textual content is not modified) or the stream of modified textual content (e.g., in instances where the stream of textual content is modified).
Notably, the system can utilize the given set of prosodic properties assigned to the given persona in generating the synthesized speech audio data to reflect a given tone, rhythm, pitch, intonation, and/or other prosodic properties associated with the given persona.” And in [0074] “the given persona can include… a given vocabulary that is specific to the given persona and that is utilized in modifying the stream of textual content to generate the stream of modified textual content, a given set of prosodic properties that is specific to the given persona and that is utilized in subsequently synthesizing the stream of modified textual content for audible presentation to the user, and/or a given set of visual cues that is specific to the given persona and that is utilized in modifying the stream of visual cues to generate the stream of modified visual cues. Notably, each of the other disparate personas can also be associated with a corresponding vocabulary, a corresponding set of prosodic properties, and/or a corresponding set of visual cues.” And in [0076] “even though the LLM is general to one or more of the plurality of personas, the given assistant output generated at block 354 may be specific to the given persona assigned to the instance of the automated assistant, and without requiring any additional post-processing of the given assistant output to tailor it to the given persona assigned to the instance of the automated assistant, since the processing using the LLM generating the given assistant output is adapted to the given persona through utilization of the given persona data.” and in [0079] “method 400 of training large language models for utilization in dynamically adapting a given assistant output based on a given persona, from among a plurality of disparate personas, assigned to an automated assistant, assigned to the automated assistant is depicted.” And in [0086-0092] “assume the stream of textual content is represented by a data structure of <text_stream = “Hey there, how are you doing?”. 
In this example, the developer input can associate associated one or more visual cues with one or more portions of the stream of textual content, such as providing developer input that includes visual cues represented by data structured of < = hand lift_“Hey there”/body gesture> to cause a visualized representation of an instance of an automated assistant to wave while the “Hey there” portion of the stream of textual content is being audibly presented and < = smile “how are you doing?”/face gesture> to cause the visualized representation of the instance of an automated assistant to smile while the “how are you doing” portion of the stream of textual content is being audibly presented… the system can also process the stream of vision data, using one or more movement tracking machine learning models (e.g., machine learning model(s) trained to track eye gaze, mouth movement, lip movement, body movement, body posture, etc.), to generate output indicative of how the person or character captured in the video content visually expresses themselves as they speak… the output generated using the one or more movement tracking machine learning models can be further processed (e.g., using one or more classification machine learning models) to determine the higher level animated physical motion gestures and/or screen animations.” (emphasis added) examiner note: a given persona may be adapted by modifying vocabulary, tone, rhythm, pitch, intonation, body gesture, lip movement, etc., generating, by a response generator module, a structured natural-language output and transmitting the structured natural-language output to a text-to-speech (TTS) synthesis module to generate spoken audio while exporting phoneme alignment data; Baeuml discloses in [0058, 0062-0065] “based on a given assistant persona assigned to the instance of the automated assistant by the user and from among a plurality of disparate personas, the given assistant output to generate a modified given assistant output, the modified given assistant output including: (1) a stream of modified textual content that differs from the stream of textual content; and/or (2) a stream of modified visual cues that differs from the stream of visual cues. 
In generating the modified given assistant output, the system may also consider a context of a dialog session (if any) in which the spoken utterance is provided by the user… the modified given assistant output may better resonate with the user engaged in the dialog session with the instance of the automated assistant… the system can process the stream of textual content, using TTS model(s), to generate the synthesized speech audio data capturing the synthesized speech corresponding to the stream of textual content (e.g., in instances where the stream of textual content is not modified) or the stream of modified textual content (e.g., in instances where the stream of textual content is modified)… the system can cause the synthesized speech audio data to be audibly rendered for presentation to the user via one or more speakers of the client device.” and in [0092] “the system processes the stream of audio data to generate a stream of textual content and the stream of vision data to generate a stream of visual cues… the system can also process the stream of vision data, using one or more movement tracking machine learning models (e.g., machine learning model(s) trained to track eye gaze, mouth movement, lip movement, body movement, body posture, etc.), to generate output indicative of how the person or character captured in the video content visually expresses themselves as they speak… this synchronization may be performed by a dedicated synchronization engine and/or other component of the system that is capable of synchronizing the stream of visual cues or the stream of modified textual context and the utilization of the stream of visual cues or the modified stream of visual cues.” (emphasis added) examiner note: the modified given assistant output may be a structured output that comprises modified textual content and modified visual cues, wherein the modified textual content may be transmitted to the text-to-speech (TTS) model to generate the spoken audio output. The phrase “exporting phoneme alignment data” may correspond to synchronization data that synchronizes the modified textual content and/or the modified visual cues, rendering, by an avatar generation module, a synchronized lifelike video of an animated persona delivering the spoken audio output in synchrony with lip movements and gestures based on the phoneme alignment data; displaying, by an output rendering module, the generated response in multi-modal formats, which includes a text transcript, audio playback, and an embedded avatar video within the user's browser environment; Baeuml discloses in [0005] “The given persona can be embodied by, for example, a given vocabulary that is specific to the given persona and that is utilized in generating the corresponding streams of textual content, a given set of prosodic properties that is specific to the given persona and that is utilized in synthesizing the corresponding streams of textual content for audible presentation to the user, and/or a given set of visual cues that includes some visual cues that are specific to the given persona (e.g., animated physical motion gestures that are commonly associated with the visualized representation of the instance of the automated assistant) and that includes some visual cues that are common amongst multiple personas of the plurality of disparate personas (e.g., waving, certain facial expressions, etc.).
The visualized representation of the automated assistant can be, for example, an animated avatar or entity that represents the instance of the automated assistant, and can be based on, for example, a real human, a fictional character, animated object(s) and/or animal(s), and/or other visualized representations.” And in [0064] “the stream of visual cues or the modified stream of visual cues are utilized in controlling the visualized representation of the instance of the automated assistant. The visualized representation of the instance of the automated assistant can be, for example, an avatar corresponding to an animated person (e.g., real or fictional), character (e.g., a butler, a pirate, a chef), object (e.g., animated assistant dots), animal, and/or any other visualized representation.” (emphasis added) examiner note: the rendering of the animated avatar may be a synchronized video for presenting speech and some motion, movement to mimic lifelike video as response to the user spoken utterance input, replacing, by the processor, a static homepage with the conversation window and enabling conversational queries to directly trigger navigation by displaying or linking to relevant webpages of the website; Baeuml discloses in [0096] “the display 190 may include a first portion 190A that includes an indication a user account that is active at the client device 110 (e.g., as indicated by a user account symbol in the right-hand side of the first portion 190A of the display 190), and an indication of when various components are active at the client device 110, such as one or more microphones or speech processing components of the client device 110 (e.g., as indicated by the ellipses 190A1 in the first portion 190A of the display 190)… the display 190 may include a second portion 190B that includes a transcription of a dialog between a user of the client device 110 and an instance of the automated assistant that is implemented at least in part at the client device 110. Further, the display may include a third portion 190C that includes a space for visual content to be provided for visual presentation to the user (e.g., a home screen).” And in [0097] “the disparate portions of the display may overlay one another (e.g., the second portion 190B of the display 190 overlaying the third portion 190C of the display 190) and/or be omitted in certain circumstances (e.g., the second portion 190B of the display 190 may be omitted when the user is not engaged in a dialog session with the instance of the automated assistant).” And in [0100] “the visual content provided for visual presentation to the user according to the stream of visual cues may be static visual content. For instance, the waves, the fish, and the birds may not move once displayed unless otherwise indicated by the stream of visual cues. In other implementations, the visual content provided for visual presentation to the user according to the stream of visual cues may be dynamic visual content.” (emphasis added) examiner note: the portion 190A may be a static homepage indicated by active components such as the ellipses 190A1, as shown in fig. 7A. Portion 190B displays a transcription of the conversation between a user and the automated assistant (conversational queries). Portion 190B, as a multi-media content page, may be a dynamic homepage that replaces portion 190A by way of overlay, for example. The dynamic homepage (portion 190B) may be generated (triggered) based on the conversation between the user and the automated assistant.
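The multi-modal display limitation quoted above — a text transcript, audio playback, and an embedded avatar video delivered together — can be pictured as a single response payload sent to the browser. The following is a purely illustrative sketch; the field names and URLs are hypothetical and nothing in the record dictates this shape:

# Hypothetical sketch of the claimed multi-modal rendering payload: one
# response carries a transcript, an audio URL, and an avatar video URL
# for display in the user's browser.
import json

def render_multimodal_response(answer_text: str, audio_url: str, video_url: str) -> str:
    payload = {
        "transcript": answer_text,                 # shown as text in the window
        "audio": {"url": audio_url, "autoplay": True},
        "avatar_video": {"url": video_url, "lip_synced": True},
        "navigation": [],                          # optional deep links (cf. claim 15)
    }
    return json.dumps(payload)

print(render_multimodal_response(
    "Our pricing starts at $20/month.",
    "/media/answer-123.mp3",
    "/media/avatar-123.mp4",
))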
Baeuml does not explicitly disclose identify relevant website information, wherein the NLP module is configured to semantically retrieves and combines information from multiple webpages within the website. However, Foster, in an analogous art, discloses in [0030] “By way of example consider a message “What's the weather today? I hope there's no rain.” The response can be templatic and include slots for retrieval and insertion of pertinent weather information. Thus, the response can be “Sorry, there is a high chance of rain, but not too cold at 50 degrees. Don't forget your umbrella!” In this example, the slots that are filled correspond to “rain” and “50 degrees.” This response is dynamically generated based on inputs received from elsewhere (e.g., application programming interface calls) based on context. As another example, consider the message or query “How far is Seattle from Hyderabad anyway?” The small talk or chatty response can be “Pretty far. It's 7,770 miles away.”” Here, the slot that is being filled by external function call is the distance “7,770.”” And in [0033] “In one instance the persona-based language generation models can be built and deployed by way of an internet-based service. For example, a web portal can be provided that allows a user to select a predefined or build a custom persona-based language generation model that can be exposed as a service or made available for download. Consider, for example, a movie theater company that desires to sell movie tickets through a conversational agent. A web portal can be utilized to upload data and/or specific configuration settings. A persona-based language generation model can be generated based on the data and/or configuration settings and made available for use by the movie theater.” And in [0038] “FIG. 5 is a flow chart diagram that depicts a method 500 of message processing. At reference numeral 510, a message is received, wherein the message is a query. At 520, a decision is made as to whether or not the query is task oriented or not. A task-oriented query seeks to accomplish a task such as book a vacation, order a product, or receive instructions, among other things… A query that is not task oriented is one that does not seek to accomplish a task but rather can be small talk or chitchat about things that are trivial or uncontroversial, for example “How are you?” or “Where do you live?” At 540, a language generation model is invoked.” And in [0039] “a message is received, wherein the message is a query. For example, the query can be “Is it going to be sunny today?” At numeral 620, data that is required to satisfy the query is requested and received. For example, weather data can be requested and received by interaction with a weather application programming interface… functionality can be invoked involving slot filling to parameterize language generation. Continuing with the ongoing example, the generated response can be “Sorry, it will be cloudy today with a high chance of showers. Don't forget your umbrella!” In this case weather data, namely cloudy and chance of showers are integrated into a response that satisfies the query and is casual and friendly in nature. In addition to receiving data from an external source, it should be appreciated that a response can trigger, or include, an action directed toward completion of a message that is task-based.
For example, a message that requests that lights be turned on can be responded to by turning on the lights. At reference numeral 640, the generated response to the query is returned.” And in [0040] “a persona-based language generation model is automatically generated based on the identified interlocutors. Additionally, the persona-based language generation model can be created to include action paths based on intent of a message, such as whether or not to respond to a specific intent. At numeral 740, the generated model is returned, for example for use by a conversational agent such as a chatbot, game character or conversation-enabled physical robot. The model enables responses to be generated that mimic the style and tone of the one or more interlocutors representative of the persona configuration.” (emphasis added) examiner note: the website may be a web portal that may request, based on a query, weather information from external source and may retrieve information responsive to a query about map application such as the distance between Seattle from Hyderabad. The different information sources may be based on the web portal pages or combination of internal and external applications. Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Baeuml with the teaching of Foster because “The integration of the conversation generation system 100 with conversational agent development can enable predefined or custom language generation models to be highly compatible and easy to use… the persona-based language generation models can be built and deployed by way of an internet-based service… a web portal can be provided that allows a user to select a predefined or build a custom persona-based language generation model that can be exposed as a service or made available for download.” Foster [0032-0033]. Baeuml does not explicitly disclose executing, by an escalation module, a follow-up procedure when the query cannot be resolved, thereby generating a follow-up email to the user and selectively connecting the user with designated management personnel or providing corresponding contact details. However, Gaskill, in an analogous art, discloses in [0086] “the NLU component 214 may provide as its output the dominant object, user intent, and the knowledge graph 808 that is formulated along dimensions likely to be relevant to the user query. This information may help the dialog manager 216 if there is missing information needed to fully resolve a user query to an item recommendation, and thus whether (and how) to then to prompt the user to further refine the user's requirements via additional input.” And in [0128] “the dialog manager 216 may select a prompt that comprises a validating statement, such as “I understand you want to find red Nike shoes” or “OK I can help you find red Nike shoes now” to conversationally lead the user to provide further confirmatory and revelatory discussion of the dominant object of user interest. This prompt type allows a user to resolve ambiguities that the intelligent personal assistant system 106 may not have been able to resolve automatically without asking question type prompts that may cause confusion.” (emphasis added) examiner note: the prompt may be a message transmitted to the user requesting additional information for resolving a query language. 
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Baeuml with the teaching of Gaskill because “ambiguity may occur for example if there are many unusual spelling errors in user textual input, or if the user's speech was received in a noisy environment, so that normalizing has not worked well…This prompt type allows a user to resolve ambiguities that the intelligent personal assistant system 106 may not have been able to resolve automatically without asking question type prompts that may cause confusion.” Gaskill [0128]. Claim 12. The rejection of the method of claim 11 is incorporated, wherein the speech-to-text module is configured to transcribe audio input into text and generate phoneme alignment data for synchronizing avatar lip, facial, and gesture movements during video rendering. Baeuml discloses in [0064-0065] “the system causes the stream of visual cues or the stream of modified visual cues to be utilized in controlling a display of the client device and/or in controlling a visualized representation of the instance of the automated assistant. In some implementations, the stream of visual cues or the stream of modified visual cues are utilized in controlling the display of the client device. In these implementations, the stream of visual cues or the stream of modified visual cues can cause the display of the client device to visually render one or more screen animations for presentation to the user (e.g., as described with respect to FIGS. 6A and 6B)… the stream of visual cues or the stream of modified visual cues can cause the visualized representation of the automated assistant to perform one or more animated physical gesture motions. These physical gesture motions can include, for example, general physical gesture motions that are general across all visualized representations (e.g., waving, entering the display of the client device, exiting the display of the client device, etc.), persona specific physical gesture motions that are specific to the given persona assigned to the automated assistant (e.g., a signature move of a fictional character), emotions portrayed by the visualized representation (e.g., happy, sad, angry, etc.) that may optionally be coupled with the above general or persona specific physical gesture motions, facial expressions (e.g., smiling, frowning, etc.) that may optionally be coupled with the above general or persona specific physical gesture motions, and/or other physical gesture motions… the system can synchronize, for presentation to the user, the audible rendering of the synthesized speech corresponding to the stream of visual cues or the stream of modified textual context and the utilization of the stream of visual cues or the modified stream of visual cues in controlling the display of the client device and/or for controlling the visualized representation of the instance of the automated assistant.” (emphasis added) examiner note: the rendered animated video may comprise physical gesture motions synchronized, for presentation, with audible rendering of synthesized speech corresponding to the visual cues.
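The phoneme-alignment synchronization addressed in claim 12 (and the phoneme-to-viseme mapping of claim 16, treated below) can be pictured with a small hypothetical sketch. The mapping table and timings are invented for illustration; real viseme tables are language- and engine-specific:

# Hypothetical phoneme-to-viseme sketch: timed phonemes from the forced
# alignment drive mouth shapes (visemes) so avatar lips track the audio.
PHONEME_TO_VISEME = {  # toy mapping; production tables are far larger
    "AA": "open", "IY": "smile", "UW": "round", "M": "closed", "F": "teeth-lip",
}

def viseme_track(alignment):
    """alignment: list of (phoneme, start_sec, end_sec) tuples from
    the TTS engine's exported phoneme alignment data."""
    return [
        {"viseme": PHONEME_TO_VISEME.get(ph, "neutral"), "start": s, "end": e}
        for ph, s, e in alignment
    ]

# Example: "my" is roughly M + AY; AY is absent from the toy table,
# so it falls back to the neutral mouth shape.
print(viseme_track([("M", 0.00, 0.08), ("AY", 0.08, 0.25)]))

The resulting timed track is what an avatar renderer would consume to keep lip movements in synchrony with playback, which is the substance of the limitation the rejection reads onto Baeuml's synchronization engine.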
Claim 13. The rejection of the method of claim 11 is incorporated, wherein the persona adaptation module is configured to selects the conversational persona from plurality of roles comprises at least one of a sales assistant persona, a recruiter persona, a teacher persona, a healthcare assistant persona, a financial advisor persona, a customer support persona, a legal advisor persona, a technical support persona, a government service agent persona, an e-commerce shopping guide persona, an entertainment/media host persona, and a compliance trainer persona. Baeuml discloses in [0110] “For example, in FIG. 7A where the butler persona is assigned to the instance of the automated assistant, the instance of the automated assistant can generate corresponding streams of textual content and corresponding streams of visual cues as corresponding given assistant outputs that are specific to the butler persona… In determining the corresponding streams of textual content, corresponding outputs generating using an LLM or previously generated output generated using the LLM (e.g., previously generated based on spoken utterances that are the same or similar to those provided by the user) can include a corresponding probability distribution over a sequence of one or more words and/or phrases across one or more vocabularies, such as a butler persona vocabulary that is specific to the butler persona or a general vocabulary that may be biased towards words and/or phrases associated with the butler persona (e.g., “Salutations ...” and “How are you keeping on this fine morning?” in the synthesized speech 754A, “sire” and “I do caution” in the synthesized speech 758A). Further, in generating the corresponding synthesized speech 754A and 758A based on the corresponding streams of textual content, a set of prosodic properties associated with the butler persona can utilized to ensure that a tone, rhythm, pitch, cadence, and/or other prosodic properties reflect that of the butler persona. Accordingly, the instance of the automated assistant can verbally reflect not only terminology that a real butler may utilize in terms of the corresponding streams of textual content, but also a speaking style that the real butler may utilize in terms of synthesizing the corresponding streams of textual content, thereby reflecting the formal and conservative personality of the butler persona.” (emphasis added) examiner note: based on the identity of the user, a given persona may be assigned to the automated assistant. Claim 14. The rejection of the method of claim 11 is incorporated, Baeuml does not explicitly disclose wherein the grounded passages are retrieved by performing semantic similarity search across multiple webpages of the website and merged into a unified response dataset. However, Foster, in an analogous art, discloses in [0030] “By way of example consider a message “What's the weather today? I hope there's no rain.” The response can be templatic and include slots for retrieval and insertion of pertinent weather information. Thus, the response can be “Sorry, there is a high chance of rain, but not too cold at 50 degrees. Don't forget your umbrella!” In this example, the slots that are filled correspond to “rain” and “50 degrees.” This response is dynamically generated based on inputs received from elsewhere (e.g., application programming interface calls) based on context. As another example, consider the message or query “How far is Seattle from Hyderabad anyway?” The small talk or chatty response can be “Pretty far.
It's 7,770 miles away.”” Here, the slot that is being filled by external function call is the distance “7,770.”” And in [0033] “In one instance the persona-based language generation models can be built and deployed by way of an internet-based service. For example, a web portal can be provided that allows a user to select a predefined or build a custom persona-based language generation model that can be exposed as a service or made available for download. Consider, for example, a movie theater company that desires to sell movie tickets through a conversational agent. A web portal can be utilized to upload data and/or specific configuration settings. A persona-based language generation model can be generated based on the data and/or configuration settings and made available for use by the movie theater.” And in [0038] “FIG. 5 is a flow chart diagram that depicts a method 500 of message processing. At reference numeral 510, a message is received, wherein the message is a query. At 520, a decision is made as to whether or not the query is task oriented or not. A task-oriented query seeks to accomplish a task such as book a vacation, order a product, or receive instructions, among other things… A query that is not task oriented is one that does not seek to accomplish a task but rather can be small talk or chitchat about things that are trivial or uncontroversial, for example “How are you?” or “Where do you live?” At 540, a language generation model is invoked.” And in [0039] “a message is received, wherein the message is a query. For example, the query can be “Is it going to be sunny today?” At numeral 620, data that is required to satisfy the query is requested and received. For example, weather data can be requested and received by interaction with a weather application programming interface… functionality can be invoked involving slot filling to parameterize language generation. Continuing with the ongoing example, the generated response can be “Sorry, it will be cloudy today with a high chance of showers. Don't forget your umbrella!” In this case weather data, namely cloudy and chance of showers are integrated into a response that satisfies the query and is casual and friendly in nature. In addition to receiving data from an external source, it should be appreciated that a response can trigger, or include, an action directed toward completion of a message that is task-based. For example, a message that requests that lights be turned on can be responded to by turning on the lights. At reference numeral 640, the generated response to the query is returned.” And in [0040] “a persona-based language generation model is automatically generated based on the identified interlocutors. Additionally, the persona-based language generation model can be created to include action paths based on intent of a message, such as whether or not to respond to a specific intent. At numeral 740, the generated model is returned, for example for use by a conversational agent such as a chatbot, game character or conversation-enabled physical robot. The model enables responses to be generated that mimic the style and tone of the one or more interlocutors representative of the persona configuration.” (emphasis added) examiner note: the website may be a web portal that may request, based on a query, weather information from external source and may retrieve information responsive to a query about map application such as the distance between Seattle from Hyderabad. The different information sources may be based on the web portal pages or combination of internal and external applications. Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Baeuml with the teaching of Foster because “The integration of the conversation generation system 100 with conversational agent development can enable predefined or custom language generation models to be highly compatible and easy to use… the persona-based language generation models can be built and deployed by way of an internet-based service… a web portal can be provided that allows a user to select a predefined or build a custom persona-based language generation model that can be exposed as a service or made available for download.” Foster [0032-0033].
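Claim 14's semantic-similarity retrieval can be sketched with a toy scorer. A production system would use learned embeddings, but bag-of-words cosine similarity shows the shape of the limitation: score passages from multiple webpages against the query, then merge the top hits into one grounded context. Everything below is hypothetical:

# Hypothetical retrieval sketch for claim 14: rank page passages by cosine
# similarity to the query and merge the best ones into a unified dataset.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_grounded_passages(query: str, pages: dict[str, str], k: int = 2):
    q = Counter(query.lower().split())
    scored = sorted(
        ((cosine(q, Counter(text.lower().split())), url, text)
         for url, text in pages.items()),
        reverse=True,
    )
    top = scored[:k]
    unified = " ".join(text for _, _, text in top)  # merged response dataset
    return unified, [url for _, url, _ in top]

pages = {
    "/pricing": "plans start at 20 dollars per month",
    "/about": "we were founded in 2015",
    "/faq": "monthly plans can be cancelled any time",
}
print(retrieve_grounded_passages("how much do monthly plans cost", pages))

Note the contrast with Foster's slot-filling: Foster pulls structured facts from external APIs, whereas the claim recites similarity search over the site's own pages; the sketch illustrates the latter.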
Claim 15. The rejection of the method of claim 11 is incorporated, Baeuml does not explicitly disclose wherein the response generator module is configured to produce navigation actions that link the user directly to specific website sections, which include case studies, product descriptions, application forms, service catalogs, pricing pages, user manuals, knowledge-base articles, policy documents, FAQs, training modules, multimedia content pages, customer testimonials, blog posts, career pages, and contact and support pages, or job postings. However, Gaskill, in an analogous art, discloses in [0032] “a third party application 114, executing on a third party server 112, is shown as having programmatic access to the networked system 116 via the programmatic interface provided by the Application Program Interface (API) server 118. For example, the third party application 114, using information retrieved from the networked system 116, may support one or more features or functions on a website hosted by the third party.” And in [0075] “FIG. 7 shows an overview of the intelligent personal assistant system 106 processing natural language user inputs to generate an item recommendation in an electronic marketplace.” And in [0078] “Extracting user intent is very helpful for the AI bot in determining what further action is needed. In one ecommerce-related example, at the very highest level, user intent could be shopping, chit-chat, jokes, weather, etc. the user intent is shopping, it could relate to the pursuit of a specific shopping mission, gifting an item for a target recipient other than the user, or just to browse an inventory of items available for purchase. Once the high level intent is identified, the artificial intelligence framework 128 is tasked with determining what the user is looking for; that is, is the need broad (e.g., shoes, dresses) or more specific (e.g., two pairs of new black Nike™ size 10 sneakers) or somewhere in between (e.g., black sneakers)?” (emphasis added) examiner note: the user may navigate third party website sections using natural language queries and an intelligent personal assistant processes the queries to search and retrieve product items as recommendation, wherein the recommended product items may be from different sections, e.g. Men’s Athletic Shoes, Women’s Athletic Shoes, etc., and the recommended product item descriptions may be shoe type such as Nike and size such as 10 as shown in figs. 10-11. Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Baeuml with the teaching of Gaskill because “ambiguity may occur for example if there are many unusual spelling errors in user textual input, or if the user's speech was received in a noisy environment, so that normalizing has not worked well…This prompt type allows a user to resolve ambiguities that the intelligent personal assistant system 106 may not have been able to resolve automatically without asking question type prompts that may cause confusion.” Gaskill [0128].
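The claim 15 navigation actions — a resolved intent linking the user directly to a website section — reduce to a routing table. As a purely illustrative sketch, with section names and URLs invented:

# Hypothetical sketch for claim 15: map resolved intents to deep links into
# site sections so a conversational answer can also navigate the user.
SECTION_LINKS = {
    "pricing": "/pricing",
    "careers": "/careers/openings",
    "support": "/contact-support",
    "faq": "/knowledge-base/faq",
}

def navigation_action(intent: str) -> dict:
    url = SECTION_LINKS.get(intent)
    if url is None:
        return {"type": "none"}
    return {"type": "navigate", "url": url, "label": f"Open {intent} page"}

print(navigation_action("pricing"))   # {'type': 'navigate', 'url': '/pricing', ...}
print(navigation_action("weather"))   # {'type': 'none'}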
Claim 16. The rejection of the method of claim 11 is incorporated, wherein the avatar generation module is configured to apply phoneme-to-viseme mapping to animate lip, facial, and gesture movements of the avatar synchronously with the generated audio. Baeuml discloses in [0113] “in generating the corresponding synthesized speech 754 and 758B based on the corresponding streams of textual content, a set of prosodic properties associated with the pirate persona can utilized to ensure that a tone, rhythm, pitch, cadence, and/or other prosodic properties reflect that of the pirate persona. Accordingly, the instance of the automated assistant can verbally reflect not only terminology that a real pirate may utilize in terms of the corresponding streams of textual content, but also a speaking style that the real pirate may utilize in terms of synthesizing the corresponding streams of textual content, thereby reflecting the adventurous and edgy personality of the pirate persona.” (emphasis added). Claim 17. The rejection of the method of claim 11 is incorporated, wherein the output rendering module is configured to simultaneously displays the natural-language answer as text and plays back the synchronized avatar video in a split-screen conversation window. Baeuml discloses in [0110-0111] “in generating the corresponding synthesized speech 754A and 758A based on the corresponding streams of textual content, a set of prosodic properties associated with the butler persona can utilized to ensure that a tone, rhythm, pitch, cadence, and/or other prosodic properties reflect that of the butler persona. Accordingly, the instance of the automated assistant can verbally reflect not only terminology that a real butler may utilize in terms of the corresponding streams of textual content, but also a speaking style that the real butler may utilize in terms of synthesizing the corresponding streams of textual content, thereby reflecting the formal and conservative personality of the butler persona… in determining the stream of visual cues, corresponding outputs generating using an LLM or previously generated output generated using the LLM (e.g., previously generated based on spoken utterances that are the same or similar to those provided by the user) can include a corresponding probability distribution over a sequence of tokens representing one or more animated physical motion gestures that may be performed by the visualized butler (e.g., as shown in the third portion 190C of the display 190 in FIG. 7A) and with respect to the corresponding streams of textual content and/or other animations to be performed with respect to the corresponding streams of textual content (e.g., instructions for controlling the display 190 as described above with respect to FIGS.
6A and 6B), such as the visualized butler nodding or lifting his monocle and/or waving when saying “Salutations”, or holding and opening an umbrella when saying “I do caution to take your umbrella”. These animated physical gesture motions and/or other animations can be synchronized with the corresponding streams of textual content in any manner described herein and/or other manners.” (emphasis added) examiner note: the display of the natural language answer as text may be text content 754A and 758A in window 190B and playback of the synthesized speech 754A and 758A may be performed by the animated butler persona (avatar) in window 190C as shown in fig. 7A. Claim 18. The rejection of the method of claim 11 is incorporated, wherein the output rendering module is configured to replace the homepage dynamically without reloading the entire website and preserves access to the classic website via a persistent hyperlink. Baeuml discloses in [0096-0097] “the display may include a third portion 190C that includes a space for visual content to be provided for visual presentation to the user (e.g., a home screen)… Although the display 190 shown in FIGS. 6A and 6B includes various disparate portions, it should be understood that is for the sake of illustration and is not meant to be limiting. For instance, the disparate portions of the display may overlay one another (e.g., the second portion 190B of the display 190 overlaying the third portion 190C of the display 190) and/or be omitted in certain circumstances (e.g., the second portion 190B of the display 190 may be omitted when the user is not engaged in a dialog session with the instance of the automated assistant).” (emphasis added) examiner note: since the disparate windows 190B and 190C may overlay one another, the dynamic window 190C may replace the static window 190B by way of overlay without reloading the entire website because both (static 190B and dynamic 190C) windows are simultaneously rendered as shown in fig. 7A. Claim 19. The rejection of the method of claim 11 is incorporated, Baeuml does not explicitly disclose wherein the escalation module is configured to transmit follow-up emails through an automated mail server and logs unresolved queries in a management dashboard for review by designated personnel, and wherein the escalation module provides real-time connection to the designated personnel through at least one of chat forwarding, voice call initiation, or calendar-based appointment scheduling. However, Gaskill, in an analogous art, discloses in [0086] “the NLU component 214 may provide as its output the dominant object, user intent, and the knowledge graph 808 that is formulated along dimensions likely to be relevant to the user query. This information may help the dialog manager 216 if there is missing information needed to fully resolve a user query to an item recommendation, and thus whether (and how) to then to prompt the user to further refine the user's requirements via additional input.” And in [0128] “the dialog manager 216 may select a prompt that comprises a validating statement, such as “I understand you want to find red Nike shoes” or “OK I can help you find red Nike shoes now” to conversationally lead the user to provide further confirmatory and revelatory discussion of the dominant object of user interest.
This prompt type allows a user to resolve ambiguities that the intelligent personal assistant system 106 may not have been able to resolve automatically without asking question type prompts that may cause confusion.” (emphasis added) examiner note: the prompt may be a message transmitted to the user requesting additional information for resolving a query language. Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Baeuml with the teaching of Gaskill because “ambiguity may occur for example if there are many unusual spelling errors in user textual input, or if the user's speech was received in a noisy environment, so that normalizing has not worked well…This prompt type allows a user to resolve ambiguities that the intelligent personal assistant system 106 may not have been able to resolve automatically without asking question type prompts that may cause confusion.” Gaskill [0128]. Claim 20. A non-transitory computer-readable medium storing instructions that, when executed by a processor, cause the processor to perform a method for transforming a static website into an artificial intelligence (AI)-enabled multi-modal conversational agent, the method comprising: receiving, by the processor through a conversation window executing on a user device, one or more user queries as at least one of text inputs or audio inputs, thereby pre-processing the audio input by performing normalizing, segmenting, and transcribing into text to improve recognition accuracy; Baeuml discloses in [0050-0055 and 0125] “the system can process the stream of audio data, using ASR model(s), to generate a stream of ASR output, such as one or more recognized terms corresponding to the spoken utterance captured in the stream of audio data. Further, the system can process the stream of ASR output, using NLU model(s), to generate a stream of NLU output, such as one or more intents and corresponding slot value for one or more parameters associated with one or more of the intents.” (emphasis added) examiner note: the captured spoken utterance may be user queries as input to automated speech recognition (ASR), which processes the spoken utterance into text, analyzing, by the processor, the typed or transcribed text to determine user intent, classify user context, and [identify relevant website information, thereby semantically retrieving and combining grounded passages from multiple webpages of the website] to construct a contextually grounded response; Baeuml discloses in [0041] “the NLU engine 140A1 and/or 140A2 may include a coreference resolver (not depicted) configured to group, or “cluster,” references to the same entity based on one or more contextual cues. For example, the coreference resolver may be utilized to resolve the term “them” to “buy theatre tickets” in the natural language input “buy them”, based on “theatre tickets” being mentioned in a client device notification rendered immediately prior to receiving input “buy them”.” And in [0075] “the system can process the stream of audio data, using ASR model(s), to generate a stream of ASR output, such as one or more recognized terms corresponding to the spoken utterance captured in the stream of audio data… the system can process the stream of audio data, using ASR model(s), to generate a stream of ASR output, such as one or more recognized terms corresponding to the spoken utterance captured in the stream of audio data.
Moreover, the system can process the stream of ASR output, using NLU model(s), to generate a stream of NLU output, such as one or more intents and corresponding slot value for one or more parameters associated with one or more of the intents… the system can process the stream of ASR output, the stream of NLU output, and/or a context of a dialog session in which the spoken utterance was provided by the user (if any), using the LLM, to generate the given assistant output.” (emphasis added) examiner note: the system analyzes the stream of ASR output (the text of the spoken utterance) using the natural language understanding (NLU) model, wherein the NLU output may be processed by the LLM to generate the given assistant output as a contextually grounded response based on grouping or clustering of recognized terms using semantics. For example, the phrase “buy them” may be resolved semantically to “buy theatre tickets”, adapting, by the processor, conversational interaction based on the classified context by dynamically modifying at least one of vocabulary, tone, speech style, and avatar representation, thereby switching among multiple roles; Baeuml discloses in [0063] “the system causes synthesized speech audio data capturing synthesized speech corresponding to the stream of textual content or the stream of modified textual content to be audibly rendered for presentation to the user. For example, the system can process the stream of textual content, using TTS model(s), to generate the synthesized speech audio data capturing the synthesized speech corresponding to the stream of textual content (e.g., in instances where the stream of textual content is not modified) or the stream of modified textual content (e.g., in instances where the stream of textual content is modified). Notably, the system can utilize the given set of prosodic properties assigned to the given persona in generating the synthesized speech audio data to reflect a given tone, rhythm, pitch, intonation, and/or other prosodic properties associated with the given persona.” And in [0074] “the given persona can include… a given vocabulary that is specific to the given persona and that is utilized in modifying the stream of textual content to generate the stream of modified textual content, a given set of prosodic properties that is specific to the given persona and that is utilized in subsequently synthesizing the stream of modified textual content for audible presentation to the user, and/or a given set of visual cues that is specific to the given persona and that is utilized in modifying the stream of visual cues to generate the stream of modified visual cues.
Notably, each of the other disparate personas can also be associated with a corresponding vocabulary, a corresponding set of prosodic properties, and/or a corresponding set of visual cues.” And in [0076] “even though the LLM is general to one or more of the plurality of personas, the given assistant output generated at block 354 may be specific to the given persona assigned to the instance of the automated assistant, and without requiring any additional post-processing of the given assistant output to tailor it to the given persona assigned to the instance of the automated assistant, since the processing using the LLM generating the given assistant output is adapted to the given persona through utilization of the given persona data.” and in [0079] “method 400 of training large language models for utilization in dynamically adapting a given assistant output based on a given persona, from among a plurality of disparate personas, assigned to an automated assistant, assigned to the automated assistant is depicted.” And in [0086-0092] “assume the stream of textual content is represented by a data structure of <text_stream = “Hey there, how are you doing?”. In this example, the developer input can associate associated one or more visual cues with one or more portions of the stream of textual content, such as providing developer input that includes visual cues represented by data structured of < = hand lift_“Hey there”/body gesture> to cause a visualized representation of an instance of an automated assistant to wave while the “Hey there” portion of the stream of textual content is being audibly presented and < = smile “how are you doing?”/face gesture> to cause the visualized representation of the instance of an automated assistant to smile while the “how are you doing” portion of the stream of textual content is being audibly presented… the system can also process the stream of vision data, using one or more movement tracking machine learning models (e.g., machine learning model(s) trained to track eye gaze, mouth movement, lip movement, body movement, body posture, etc.), to generate output indicative of how the person or character captured in the video content visually expresses themselves as they speak… the output generated using the one or more movement tracking machine learning models can be further processed (e.g., using one or more classification machine learning models) to determine the higher level animated physical motion gestures and/or screen animations.” (emphasis added) examiner note: a given persona may be adapted by modifying vocabulary, tone, rhythm, pitch, intonation, body gesture, lip movement, etc., generating, by the processor, a structured natural-language output and transmitting the structured natural-language output to generate spoken audio while exporting phoneme alignment data; Baeuml discloses in [0058, 0062-0065] “based on a given assistant persona assigned to the instance of the automated assistant by the user and from among a plurality of disparate personas, the given assistant output to generate a modified given assistant output, the modified given assistant output including: (1) a stream of modified textual content that differs from the stream of textual content; and/or (2) a stream of modified visual cues that differs from the stream of visual cues. 
In generating the modified given assistant output, the system may also consider a context of a dialog session (if any) in which the spoken utterance is provided by the user… the modified given assistant output may better resonate with the user engaged in the dialog session with the instance of the automated assistant… the system can process the stream of textual content, using TTS model(s), to generate the synthesized speech audio data capturing the synthesized speech corresponding to the stream of textual content (e.g., in instances where the stream of textual content is not modified) or the stream of modified textual content (e.g., in instances where the stream of textual content is modified)… the system can cause the synthesized speech audio data to be audibly rendered for presentation to the user via one or more speakers of the client device.” and in [0092] “the system processes the stream of audio data to generate a stream of textual content and the stream of vision data to generate a stream of visual cues… the system can also process the stream of vision data, using one or more movement tracking machine learning models (e.g., machine learning model(s) trained to track eye gaze, mouth movement, lip movement, body movement, body posture, etc.), to generate output indicative of how the person or character captured in the video content visually expresses themselves as they speak… this synchronization may be performed by a dedicated synchronization engine and/or other component of the system that is capable of synchronizing the stream of visual cues or the stream of modified textual context and the utilization of the stream of visual cues or the modified stream of visual cues.” (emphasis added) examiner note: the modified given assistant output may be a structured output that comprises modified textual content and modified visual cues, wherein the modified textual content may be transmitted to the text-to-speech (TTS) model to generate the spoken audio output. The phrase “exporting phoneme alignment data” may correspond to synchronization data that synchronizes the modified textual content and/or the modified visual cues, rendering, by the processor, a synchronized lifelike video of an animated persona delivering the spoken audio output in synchrony with lip movements and gestures based on the phoneme alignment data; displaying, by the processor, the generated response in multi-modal formats, which includes a text transcript, audio playback, and an embedded avatar video within the user's browser environment; Baeuml discloses in [0005] “The given persona can be embodied by, for example, a given vocabulary that is specific to the given persona and that is utilized in generating the corresponding streams of textual content, a given set of prosodic properties that is specific to the given persona and that is utilized in synthesizing the corresponding streams of textual content for audible presentation to the user, and/or a given set of visual cues that includes some visual cues that are specific to the given persona (e.g., animated physical motion gestures that are commonly associated with the visualized representation of the instance of the automated assistant) and that includes some visual cues that are common amongst multiple personas of the plurality of disparate personas (e.g., waving, certain facial expressions, etc.).
The visualized representation of the automated assistant can be, for example, an animated avatar or entity that represents the instance of the automated assistant, and can be based on, for example, a real human, a fictional character, animated object(s) and/or animal(s), and/or other visualized representations.” And in [0064] “the stream of visual cues or the modified stream of visual cues are utilized in controlling the visualized representation of the instance of the automated assistant. The visualized representation of the instance of the automated assistant can be, for example, an avatar corresponding to an animated person (e.g., real or fictional), character (e.g., a butler, a pirate, a chef), object (e.g., animated assistant dots), animal, and/or any other visualized representation.” (emphasis added) examiner note: the rendering of the animated avatar may be a synchronized video for presenting speech and some motion, movement to mimic lifelike video as response to the user spoken utterance input, replacing, by the processor, a static homepage with the conversation window and enabling conversational queries to directly trigger navigation by displaying or linking to relevant webpages of the website; Baeuml discloses in [0096] “the display 190 may include a first portion 190A that includes an indication a user account that is active at the client device 110 (e.g., as indicated by a user account symbol in the right-hand side of the first portion 190A of the display 190), and an indication of when various components are active at the client device 110, such as one or more microphones or speech processing components of the client device 110 (e.g., as indicated by the ellipses 190A1 in the first portion 190A of the display 190)… the display 190 may include a second portion 190B that includes a transcription of a dialog between a user of the client device 110 and an instance of the automated assistant that is implemented at least in part at the client device 110. Further, the display may include a third portion 190C that includes a space for visual content to be provided for visual presentation to the user (e.g., a home screen).” And in [0097] “the disparate portions of the display may overlay one another (e.g., the second portion 190B of the display 190 overlaying the third portion 190C of the display 190) and/or be omitted in certain circumstances (e.g., the second portion 190B of the display 190 may be omitted when the user is not engaged in a dialog session with the instance of the automated assistant).” And in [0100] “the visual content provided for visual presentation to the user according to the stream of visual cues may be static visual content. For instance, the waves, the fish, and the birds may not move once displayed unless otherwise indicated by the stream of visual cues. In other implementations, the visual content provided for visual presentation to the user according to the stream of visual cues may be dynamic visual content.” (emphasis added) examiner note: the portion 190A may be a static homepage indicated by active components such as the ellipses 190A1, as shown in fig. 7A. Portion 190B displays a transcription of the conversation between a user and the automated assistant (conversational queries). Portion 190B, as a multi-media content page, may be a dynamic homepage that replaces portion 190A by way of overlay, for example.
Baeuml does not explicitly disclose identifying relevant website information, thereby semantically retrieving and combining grounded passages from multiple webpages of the website. However, Foster, in an analogous art, discloses in [0030] “By way of example consider a message “What's the weather today? I hope there's no rain.” The response can be templatic and include slots for retrieval and insertion of pertinent weather information. Thus, the response can be “Sorry, there is a high chance of rain, but not too cold at 50 degrees. Don't forget your umbrella!” In this example, the slots that are filled correspond to “rain” and “50 degrees.” This response is dynamically generated based on inputs received from elsewhere (e.g., application programming interface calls) based on context. As another example, consider the message or query “How far is Seattle from Hyderabad anyway?” The small talk or chatty response can be “Pretty far. It's 7,770 miles away.” Here, the slot that is being filled by external function call is the distance “7,770.”” And in [0033] “In one instance the persona-based language generation models can be built and deployed by way of an internet-based service. For example, a web portal can be provided that allows a user to select a predefined or build a custom persona-based language generation model that can be exposed as a service or made available for download. Consider, for example, a movie theater company that desires to sell movie tickets through a conversational agent. A web portal can be utilized to upload data and/or specific configuration settings. A persona-based language generation model can be generated based on the data and/or configuration settings and made available for use by the movie theater.” And in [0038] “FIG. 5 is a flow chart diagram that depicts a method 500 of message processing. At reference numeral 510, a message is received, wherein the message is a query. At 520, a decision is made as to whether or not the query is task oriented or not. A task-oriented query seeks to accomplish a task such as book a vacation, order a product, or receive instructions, among other things… A query that is not task oriented is one that does not seek to accomplish a task but rather can be small talk or chitchat about things that are trivial or uncontroversial, for example “How are you?” or “Where do you live?” At 540, a language generation model is invoked.” And in [0039] “a message is received, wherein the message is a query. For example, the query can be “Is it going to be sunny today?” At numeral 620, data that is required to satisfy the query is requested and received. For example, weather data can be requested and received by interaction with a weather application programming interface… functionality can be invoked involving slot filling to parameterize language generation. Continuing with the ongoing example, the generated response can be “Sorry, it will be cloudy today with a high chance of showers. Don't forget your umbrella!” In this case weather data, namely cloudy and chance of showers are integrated into a response that satisfies the query and is casual and friendly in nature. In addition to receiving data from an external source, it should be appreciated that a response can trigger, or include, an action directed toward completion of a message that is task-based. For example, a message that requests that lights be turned on can be responded to by turning on the lights. At reference numeral 640, the generated response to the query is returned.” And in [0040] “a persona-based language generation model is automatically generated based on the identified interlocutors. Additionally, the persona-based language generation model can be created to include action paths based on intent of a message, such as whether or not to respond to a specific intent. At numeral 740, the generated model is returned, for example for use by a conversational agent such as a chatbot, game character or conversation-enabled physical robot. The model enables responses to be generated that mimic the style and tone of the one or more interlocutors representative of the persona configuration.” (emphasis added)

Examiner note: the website may be a web portal that may request, based on a query, weather information from an external source and may retrieve information responsive to a query to a map application, such as the distance between Seattle and Hyderabad. The different information sources may be based on the web portal pages or a combination of internal and external applications.

Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Baeuml with the teaching of Foster because “The integration of the conversation generation system 100 with conversational agent development can enable predefined or custom language generation models to be highly compatible and easy to use… the persona-based language generation models can be built and deployed by way of an internet-based service… a web portal can be provided that allows a user to select a predefined or build a custom persona-based language generation model that can be exposed as a service or made available for download.” Foster [0032-0033].
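Foster's slot-filling pattern, as the examiner characterizes it, is templated text whose slots are populated from external data sources. A minimal sketch follows, with get_weather standing in as a hypothetical external API call; Foster does not disclose this code.

```python
# Hypothetical sketch of slot filling: a templated reply whose slots are
# populated from an external data source at response time.
def get_weather(city: str) -> dict:
    """Stand-in for a real weather API call; returns stubbed data."""
    return {"condition": "rain", "temp_f": 50}

def answer_weather_query(city: str) -> str:
    data = get_weather(city)
    template = ("Sorry, there is a high chance of {condition}, "
                "but not too cold at {temp_f} degrees.")
    return template.format(**data)

print(answer_weather_query("Seattle"))
# Sorry, there is a high chance of rain, but not too cold at 50 degrees.
```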
Baeuml does not explicitly disclose executing, by the processor, a follow-up procedure when the query cannot be resolved, thereby generating a follow-up email to the user and selectively connecting the user with designated management personnel or providing corresponding contact details. However, Gaskill, in an analogous art, discloses in [0086] “the NLU component 214 may provide as its output the dominant object, user intent, and the knowledge graph 808 that is formulated along dimensions likely to be relevant to the user query. This information may help the dialog manager 216 if there is missing information needed to fully resolve a user query to an item recommendation, and thus whether (and how) to then to prompt the user to further refine the user's requirements via additional input.” And in [0128] “the dialog manager 216 may select a prompt that comprises a validating statement, such as “I understand you want to find red Nike shoes” or “OK I can help you find red Nike shoes now” to conversationally lead the user to provide further confirmatory and revelatory discussion of the dominant object of user interest. This prompt type allows a user to resolve ambiguities that the intelligent personal assistant system 106 may not have been able to resolve automatically without asking question type prompts that may cause confusion.” (emphasis added)

Examiner note: the prompt may be a message transmitted to the user requesting additional information for resolving the query.
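The claimed follow-up procedure is an escalation path (email plus human handoff) rather than Gaskill's clarifying prompt. A minimal sketch of what that escalation could look like; the function, the contact table, and all addresses are hypothetical illustrations, not the applicant's implementation.

```python
# Hypothetical sketch: when a query cannot be resolved, draft a follow-up
# email and surface a designated contact instead of guessing at an answer.
DESIGNATED_CONTACTS = {"billing": "billing-lead@example.com",
                       "default": "support@example.com"}

def handle_unresolved(query: str, user_email: str, topic: str = "default") -> dict:
    """Build a follow-up email draft and select the designated contact."""
    contact = DESIGNATED_CONTACTS.get(topic, DESIGNATED_CONTACTS["default"])
    email_body = (f"We couldn't fully resolve your question: '{query}'. "
                  f"A team member at {contact} will follow up shortly.")
    return {"follow_up_email": {"to": user_email, "body": email_body},
            "connect_user_to": contact}

print(handle_unresolved("Can I get a refund for order 1182?",
                        "user@example.com", topic="billing"))
```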
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Baeuml with the teaching of Gaskill because “ambiguity may occur for example if there are many unusual spelling errors in user textual input, or if the user's speech was received in a noisy environment, so that normalizing has not worked well… This prompt type allows a user to resolve ambiguities that the intelligent personal assistant system 106 may not have been able to resolve automatically without asking question type prompts that may cause confusion.” Gaskill [0128].

Response to Arguments

Applicant's arguments filed 12/12/2025 have been fully considered but they are not persuasive.

Argument: “Applicant respectfully submits that the Examiner's reliance on Baeuml's disclosures in [0053], [0056], and [0058] is misplaced. The cited passages describe a client-device automated assistant that receives spoken input through microphones on the device, is invoked via wake-word or button activation, communicates with first-party and third-party backend systems, and generates persona-styled textual and visual cue outputs for display within the assistant's own interface on the device. None of this functionality relates to or suggests modifying, replacing, or restructuring the homepage of a website… Taken as a whole, Baeuml addresses a device-level virtual assistant architecture, which relies on device microphones, device invocation frameworks, and persona-styled multimodal output rendered within the assistant's native interface. Transforming a website homepage into a conversational navigation interface is a fundamentally different problem, belongs to a different field of endeavor, and is not suggested in any portion of Baeuml.” And in [0064] “The visualized representation of the instance of the automated assistant can be, for example, an avatar corresponding to an animated person (e.g., real or fictional), character (e.g., a butler, a pirate, a chef), object (e.g., animated assistant dots), animal, and/or any other visualized representation. In these implementations, the stream of visual cues or the stream of modified visual cues can cause the visualized representation of the automated assistant to perform one or more animated physical gesture motions.”

Response: Baeuml discloses in [0096-0097] “the display 190 may include a first portion 190A that includes an indication a user account that is active at the client device 110 (e.g., as indicated by a user account symbol in the right-hand side of the first portion 190A of the display 190), and an indication of when various components are active at the client device 110, such as one or more microphones or speech processing components of the client device 110 (e.g., as indicated by the ellipses 190A1 in the first portion 190A of the display 190). Further, the display 190 may include a second portion 190B that includes a transcription of a dialog between a user of the client device 110 and an instance of the automated assistant that is implemented at least in part at the client device 110.
Further, the display may include a third portion 190C that includes a space for visual content to be provided for visual presentation to the user (e.g., a home screen)… the disparate portions of the display may overlay one another (e.g., the second portion 190B of the display 190 overlaying the third portion 190C of the display 190) and/or be omitted in certain circumstances (e.g., the second portion 190B of the display 190 may be omitted when the user is not engaged in a dialog session with the instance of the automated assistant).” (emphasis added)

The display 190 may be a website homepage such that the website may be invoked to respond to queries provided by the user of the website, as shown in Figs. 7A-7B. The user's queries may be static such that the user asks the assistant and the assistant responds to the user. However, the conversation between the user and the assistant may be overlaid by the dynamic dialog window 190B, as shown in Fig. 7, wherein the assistant may be represented as an animated avatar.

Argument: Applicant argues “Foster provides no mechanism for treating a website as a searchable content repository, and nothing in the cited passages even hints at such functionality.”

Response: Foster teaches in [0030 and 0037] “consider the message or query “How far is Seattle from Hyderabad anyway?” The small talk or chatty response can be “Pretty far. It's 7,770 miles away.”… At reference 420, a selection of a persona is received. For example, selection of a friendly or professional persona can be received, retrieved, or otherwise obtained or acquired. At reference numeral 420, a pre-built language generation model is returned with the selected persona. The model can correspond to the persona-based language generation model 110, which can be made accessible by way of an internet-based service (e.g., web service) or downloaded.” (emphasis added) A response to a query may involve search and retrieval.
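Treating a website as a searchable content repository, the behavior the applicant argues is missing from Foster, is in essence a retrieval problem. A minimal sketch using bag-of-words cosine similarity over per-page passages; the pages, scoring scheme, and names are illustrative assumptions, not the applicant's or Foster's method.

```python
# Hypothetical sketch: score passages from several pages against a query
# and combine the best matches into a grounded response context.
from collections import Counter
import math

PAGES = {  # hypothetical site content keyed by URL
    "/shipping": "Shipping takes two business days worldwide.",
    "/returns":  "Returns are free within thirty days of delivery.",
    "/about":    "Founded in 2012, we build conversational interfaces.",
}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list:
    """Return the k best-matching page passages for grounding a reply."""
    q = Counter(query.lower().split())
    scored = sorted(PAGES.items(),
                    key=lambda kv: cosine(q, Counter(kv[1].lower().split())),
                    reverse=True)
    return [f"{url}: {text}" for url, text in scored[:k]]

print(retrieve("how long does shipping take"))
# '/shipping' scores highest and is returned first
```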
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892.

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AHAMED I NAZAR whose telephone number is (571)270-3174. The examiner can normally be reached 10 am to 7 pm Mon-Fri. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Stephen Hong, can be reached at 571-272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AHAMED I NAZAR/
Examiner, Art Unit 2178
2/2/2026

/STEPHEN S HONG/
Supervisory Patent Examiner, Art Unit 2178

Prosecution Timeline

Aug 29, 2025
Application Filed
Oct 31, 2025
Non-Final Rejection — §103
Dec 12, 2025
Response Filed
Feb 02, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12564342
METHODS, SYSTEMS, AND DEVICES FOR THE DIAGNOSIS OF BEHAVIORAL DISORDERS, DEVELOPMENTAL DELAYS, AND NEUROLOGIC IMPAIRMENTS
2y 5m to grant Granted Mar 03, 2026
Patent 12548333
DYNAMIC NETWORK QUANTIZATION FOR EFFICIENT VIDEO INFERENCE
2y 5m to grant Granted Feb 10, 2026
Patent 12549503
INFORMATION INTERACTION METHOD AND APPARATUS, AND ELECTRONIC DEVICE
2y 5m to grant Granted Feb 10, 2026
Patent 12539042
Multi-Modal Imaging System and Method Therefor
2y 5m to grant Granted Feb 03, 2026
Patent 12541546
LOSSLESS SUMMARIZATION
2y 5m to grant Granted Feb 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
53%
Grant Probability
88%
With Interview (+35.1%)
3y 11m
Median Time to Grant
Moderate
PTA Risk
Based on 378 resolved cases by this examiner. Grant probability is derived from the career allow rate; the 88% with-interview figure reflects the +35.1-point interview lift (53% + 35.1 ≈ 88%).
