Prosecution Insights
Last updated: April 19, 2026
Application No. 18/660,011

PERSONALIZED DIGITAL MEETING AGENT

Final Rejection (§101, §103)

Filed: May 09, 2024
Examiner: JONES, CARISSA ANNE
Art Unit: 2691
Tech Center: 2600 — Communications
Assignee: Microsoft Technology Licensing, LLC
OA Round: 2 (Final)

Grant Probability: 83% — Favorable
OA Rounds: 3-4
To Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 83% (20 granted / 24 resolved), +21.3% vs TC avg — grants above average
Interview Lift: +25.0% (resolved cases with vs. without interview) — strong
Typical Timeline: 2y 10m average prosecution, 30 currently pending
Career History: 54 total applications across all art units
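
The allowance figures above are simple ratios over resolved cases. The sketch below (Python) is illustrative only: the with/without-interview split is a hypothetical assumption chosen so the totals reconcile with the 20 granted / 24 resolved and +25.0% figures shown; only those totals appear in the report.

```python
# Illustrative only: how the allow-rate and interview-lift figures above could be
# derived from resolved-case counts. The with/without-interview split is assumed;
# only the 20 granted / 24 resolved totals appear in the report.

def allow_rate(granted: int, resolved: int) -> float:
    """Share of resolved applications that ended in a grant."""
    return granted / resolved

career = allow_rate(20, 24)                    # ~0.833 -> the 83% career allow rate
lift = allow_rate(8, 8) - allow_rate(12, 16)   # hypothetical split: 8/8 with interview, 12/16 without
print(f"Career allow rate: {career:.0%}")      # Career allow rate: 83%
print(f"Interview lift: {lift:+.1%}")          # Interview lift: +25.0%
```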

Statute-Specific Performance

§101: 3.1% (-36.9% vs TC avg)
§103: 76.0% (+36.0% vs TC avg)
§102: 11.6% (-28.4% vs TC avg)
§112: 4.9% (-35.1% vs TC avg)
Tech Center average is an estimate • Based on career data from 24 resolved cases
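
Reading each row as the examiner's figure plus its signed offset from the Tech Center average, the implied TC baselines follow by simple subtraction. The sketch below only restates the numbers above under that assumed interpretation.

```python
# Illustrative only: treat each row as (examiner figure, offset vs Tech Center average)
# and recover the implied TC baseline as figure - offset.
examiner = {"§101": 3.1, "§103": 76.0, "§102": 11.6, "§112": 4.9}       # percent
offset   = {"§101": -36.9, "§103": 36.0, "§102": -28.4, "§112": -35.1}  # points vs TC avg

for statute, figure in examiner.items():
    tc_avg = figure - offset[statute]
    print(f"{statute}: {figure:.1f}% vs TC avg {tc_avg:.1f}% ({offset[statute]:+.1f} pts)")
```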

Office Action

§101 §103
DETAILED ACTION

This action is in response to the remarks filed 02/23/2026. Claims 1-20 are pending and have been examined.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Response to Amendment

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 10-20 are rejected under 35 U.S.C. 101 because they are directed towards an abstract mental process under the BRI. The abstract idea is not integrated into a practical application and does not involve an improvement in technology, and as such is directed towards patent-ineligible subject matter. The identified claim limitation(s) that recite(s) an abstract idea do/does not fall within the enumerated groupings of abstract ideas in Section I of the 2018 Revised Patent Subject Matter Eligibility Guidance published in the Federal Register (84 FR 50) on January 7, 2019. Nonetheless, the claim limitation(s) is/are being treated as reciting an abstract idea because the abstract idea is not integrated into a practical application and does not involve an improvement in technology, and as such is directed towards patent-ineligible subject matter.

For example, in regards to Claim 10, a person participating in a virtual conference can monitor the meeting, receive a meeting input such as a question directed at that individual, process an output, or answer, to the question, determine the action or response that is appropriate, and perform the action or response, wherein the person participates autonomously as a proxy or substitute for another individual. Claims 11-14 do not resolve the 101 issue of Claim 10 and therefore are rejected under 101 for being dependent upon Claim 10. Additionally, Claim 15 is rejected similarly to Claim 10, and Claims 16-20 do not resolve the 101 issue of Claim 15. Therefore, Claims 16-20 are rejected under 101 for being dependent on Claim 15.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.
Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7, 10-12 and 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lebaredian et al. (U.S. Pub. No. 2021/0358188, hereinafter “Lebaredian”) in view of Lindmark (U.S. Pub. No. 2025/0330555) and Srivastava et al. (U.S. Pub. No. 2017/0006069, hereinafter “Srivastava”).

Regarding Claim 1, Lebaredian teaches A system (see Lebaredian Paragraph [0005], system) comprising: at least one processor (see Lebaredian Paragraph [0070], CPU(s) 406 may include any type of processor); and memory storing instructions that, when executed by the at least one processor, cause the system to perform a set of operations (see Lebaredian Paragraph [0068], the memory 404 may store computer-readable instructions (e.g., that represent a program(s) and/or a program element(s), such as an operating system), comprising: instantiating a digital agent with a persona associated with a user (see Lebaredian Paragraph [0039], Personalized models may be generated for different users over time, such that the AI engine 112 may learn what a particular user looks like when they are happy, sad, etc., and/or to learn a particular users speech patterns, figures of speech, and/or other user-specific information that may be used to tailor the AI engine 112 to the particular user. This information may be stored in a user profile of the AI agent device(s) 102. Similarly, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments such that the AI agents may communicate and appear more human-like); receiving an indication to dispatch the digital agent to a virtual meeting on behalf of the user (see Lebaredian Paragraph [0045], referring to FIG. 1B, the video conferencing system 100B may be used to host a video conference session including the AI agent via the AI agent device(s) 102 and one or more users via the user device(s) 104. In such an example, the client application 116A and 116B may correspond to end-user application versions of the video conferencing applications and the host device(s) 106 may include the host application 126 hosting the video conference session. The connection between the client applications 116A and 116B and the host application 126 may be via an email, a meeting invite, dial in, a URL, and/or another invitation means. For example, an AI agent—or the AI agent device(s) 102—may have a corresponding email handle, calendaring application connection, etc., that may allow the AI agent device(s) 102 to connect the AI agent to the conference.
As such, similar to how a user 130 may join the video conference session via the user device(s) 104—e.g., using a link from an email or meeting invite, going to a URL, entering a meeting code, etc.—the AI agent device(s) 102 may connect the AI agent to the video conference session using any means of access, and Paragraph [0039], tailor the AI engine 112 to the particular user, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments); monitoring the virtual meeting (see Lebaredian Paragraph [0029], In some embodiments, such as where privacy concerns are not an issue or a user has opted in to constant recording of audio and/or video, no trigger activation may be used—although the audio, text, and/or video may still be monitored to determine when a user is addressing the AI agent, Paragraph [0030], in the circumstance that privacy concerns may not allow for constant recording of audio or speech in public spaces, once an activation trigger is satisfied, the microphones, cameras, and/or other I/O component(s) 120 may be opened up (e.g., activated to listen, monitor, or observe for user input beyond triggering events), and the data may be processed by the AI engine 112 to determine a response and/or other communication, and Paragraph [0050], referring to FIG. 2A, visualization 200A may correspond to a screenshot of a video conference corresponding to a video conferencing application. Any number of users 202 (e.g., users 202A-202D) may participate in the video conference, and the video conference may include an AI agent 204A. The AI agent 204A may be represented to be within a virtual environment 206 corresponding to a virtual representation of Paris, France, including the Eiffel Tower to provide context for the location of the AI agent 204A to the other users 202. In such an embodiment, the AI agent 204 may be represented—in addition to the virtual environment 206—by graphical data, and the renderer 114 may generate video data from the graphical data, which may be transmitted as a video stream—via the host device(s) 106—to the user devices 104 associated with the users 202. In addition to the video stream, an audio stream and/or a textual stream may be transmitted such that the AI agent 204A may appear to, and interact with, the users 202 as any other user would. The virtual environment 206 may be used by the AI agent 204A to provide context to the conversation within the video conference. For example, when giving weather information, traffic information, time information, stock information, and/or any other information as a response or communication within the video conference, the AI agent 204A may leverage this virtual environment 206); receiving first meeting input (see Lebaredian Paragraph [0047], The AI engine 112 may access and/or receive the video, audio, and/or textual streams from the client application 116A and may process the data to determine a response or communication for the AI agent and/or the renderer 114 may generate any update(s) to the corresponding virtual environment and Paragraph [0057], the method 300, at block B304, includes receiving first data representative of one or more of an audio stream, a text stream, or a video stream associated with a user device(s) communicatively coupled with the instance of the application. 
For example, an audio, video, and/or textual stream generated using a user device(s) 104 may be received—e.g., by the AI agent device(s) 102); processing, by a first model, the first meeting input to generate first output (see Lebaredian Paragraph [0047], the AI engine 112 may access and/or receive the video, audio, and/or textual streams from the client application 116A and may process the data to determine a response or communication for the AI agent, Paragraph [0031], The incoming data—e.g., visual, textual, audible, etc.—may be analyzed by the AI engine 112 to determine a textual, visual, and/or audible response or communication—represented using three-dimensional (3D) graphics—for the AI agent. For example, the AI engine 112 may generate output text for text-to-speech processing—e.g., using one or more machine learning or deep learning models—to generate audio data. This audio data may be transmitted to the user device(s) 104—via the host device(s) 106, in embodiments—for output by a speaker or another I/O component(s) 120 of the user device(s) 104 and Figure 3, analyze the first data using natural language processing step B306); based on the first output, determining a first action (see Lebaredian Figure 3, step B308, generate second data representative of a textual output responsive to the first data and corresponding to the virtual agent and Paragraph [0059], the method 300, at block B308, includes generating second data representative of a textual output responsive to the first data and corresponding to the virtual agent. For example, the AI engine 112 may generate text that corresponds to a verbal response of the AI agent); determining a first rendering for the first action (see Lebaredian Paragraph [0047], the AI engine 112 may access and/or receive the video, audio, and/or textual streams from the client application 116A and may process the data to determine a response or communication for the AI agent and/or the renderer 114 may generate any update(s) to the corresponding virtual environment. In some embodiments, notes, question and answer dialogue box information, and/or other information associated with the video conference may be received and processed by the AI engine 112. As such, once the textual, visual, and/or audible response or communication of the AI agent is determined, the AI agent and the virtual environment may be updated according thereto, and display data and/or image data generated from the graphical data—e.g., from a virtual field of view or one or more virtual sensors, such as cameras, microphones, etc.—may be rendered using the renderer 114. A stream manager 128 may receive the rendered data and generate a video stream, an audio stream, a textual stream, and/or encoded representations thereof, and provide this information to the client application 116A, Paragraph [0060], the method 300, at block B310, includes applying the second data to a text-to-speech algorithm to generate audio data. For example, the textual data corresponding to the response or communication of the AI agent may be applied to a text-to-speech algorithm to generate audio data, and Paragraph [0061], the method 300, at block B312, includes generating graphical data representative of a virtual field of view of a virtual environment from a perspective of a virtual camera, the virtual field of view including a graphical representation of the virtual agent within the virtual environment. 
For example, the renderer 114 may generate the graphical data representative of a virtual field of view of the virtual environment from a perspective of a virtual camera, and the virtual field of view may include a graphical representation of the AI agent. For example, the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.—and the virtual environment may be generated to provide context to the response); and causing the digital agent to perform the first action on behalf of the user according to the persona and the determined first rendering (see Lebaredian Paragraph [0047], a stream manager 128 may receive the rendered data and generate a video stream, an audio stream, a textual stream, and/or encoded representations thereof, and provide this information to the client application 116A, Paragraph [0062], the method 300, at block B314, includes causing presentation of a rendering of the graphical data and an audio output corresponding to the audio data as a communication exchanged using the instance of the application. For example, the renderer 114 may generate display data or image data corresponding to the graphical data, and audio data, and/or textual data may also be rendered or generated. This display or image data, audio data, and/or textual data may then be transmitted to the user device(s) 104—via the host device(s) 106, in embodiments—as an audio stream, a video stream, and/or a textual stream for output by the user device(s) 104, and Paragraph [0039], Personalized models may be generated for different users over time, such that the AI engine 112 may learn what a particular user looks like when they are happy, sad, etc., and/or to learn a particular users speech patterns, figures of speech, and/or other user-specific information that may be used to tailor the AI engine 112 to the particular user. This information may be stored in a user profile of the AI agent device(s) 102. Similarly, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments such that the AI agents may communicate and appear more human-like). Lebaredian does not expressively teach a foundation model wherein the digital agent autonomously participates in the virtual meeting as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user independently of receiving the first meeting input. However, Lindmark teaches a foundational model (see Lindmark Paragraph [0068], an AI model 330A-M is an AI model that has been trained on a corpus of data. For example, the AI model 330A-M can be an AI model that is first pre-trained on a corpus of data to create a foundational model, and afterwards fine-tuned on more data pertaining to a particular set of tasks to create a more task-specific, or targeted, model. The foundational model can first be pre-trained using a corpus of data that can include data in the public domain, licensed content, and/or proprietary content. Such a pre-training can be used by the AI model 330A-M to learn broad elements including, image or speech recognition, general sentence structure, common phrases, vocabulary, natural language structure, and other elements. 
In some implementations, this first foundational model is trained using self-supervision, or unsupervised training on such datasets) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of an artificial intelligence digital agent for video conferences (as taught in Lebaredian), with implementing a foundational model (as taught in Lindmark), the motivation being to prevent training artificial intelligence in real time and instead implement a pre-trained model that is fine-tuned, task-specific, and targeted, which in return saves time and increases efficiency (see Lindmark Paragraph [0068]). Lebaredian in view of Lindmark does not expressively teach wherein the digital agent autonomously participates in the virtual meeting as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user independently of receiving the first meeting input. However, Srivastava teaches wherein the digital agent autonomously participates in the virtual meeting as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user independently of receiving the first meeting input (see Srivastava Abstract, method for enabling a user to create a proxy persona to attend a meeting on behalf of the user, whereby the meeting is associated with a communication system for requesting a person to be available for a scheduled event is provided, and Paragraph [0027], a connected persona that may act as a proxy for them, attending meetings and phone calls (thus recording them), joining chats (thus saving the conversation), and even alerting the user in real-time to important information that might otherwise be missed). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of an artificial intelligence digital agent for video conferences (as taught in Lebaredian), with implementing a foundational model (as taught in Lindmark), the motivation being to prevent training artificial intelligence in real time and instead implement a pre-trained model that is fine-tuned, task-specific, and targeted, which in return saves time and increases efficiency (see Lindmark Paragraph [0068]). It would have been further obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of an artificial intelligence digital agent for video conferences (as taught in Lebaredian in view of Lindmark), with a digital agent that autonomously participates in a video conference as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user (as taught in Srivastava), the motivation being to prevent a lack of communication that may lead to a loss of productivity and business for companies because of the absence of a user, by providing a proxy person to step in for the user and participate and keep record of meetings, conversations, etc. (see Srivastava Paragraphs [0014], [0016] and [0017]). 
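
For orientation, the claim 1 limitations mapped above read as a simple agent pipeline: instantiate a persona-bearing agent, dispatch it to a meeting, monitor the meeting, process meeting input with a model, determine an action and a rendering, and perform the action on the user's behalf. The sketch below is a minimal illustration of that claim wording only; it is not the applicant's implementation or code from any cited reference, and every class, function, and parameter name is hypothetical.

```python
# Hypothetical sketch of the claim 1 flow; all names and interfaces are invented
# for illustration and do not come from the application or the cited references.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Persona:
    user_id: str
    voice_profile: str   # speech simulation associated with the user
    avatar: str          # visual representation associated with the user


@dataclass
class Rendering:
    audio: str
    video: str


class DigitalAgent:
    """Proxy participant that attends a virtual meeting on the user's behalf."""

    def __init__(self, persona: Persona, model: Callable[[str], str]):
        self.persona = persona   # instantiating the digital agent with a persona
        self.model = model       # stand-in for a foundation-model call

    def attend(self, meeting_inputs: Iterable[str],
               perform: Callable[[Rendering], None]) -> None:
        # Monitoring the virtual meeting: consume meeting input as it arrives.
        for meeting_input in meeting_inputs:
            output = self.model(meeting_input)             # process input to generate output
            action = self.determine_action(output)         # determine an action from the output
            rendering = self.determine_rendering(action)   # determine a rendering for the action
            perform(rendering)                             # perform the action on behalf of the user

    def determine_action(self, output: str) -> str:
        return output if output.strip() else "(remain silent)"

    def determine_rendering(self, action: str) -> Rendering:
        # The rendering is styled by the persona (cf. dependent claims 4 and 7).
        return Rendering(audio=f"[{self.persona.voice_profile}] {action}",
                         video=self.persona.avatar)


if __name__ == "__main__":
    # Toy usage with a canned "model" and a print-based meeting sink.
    agent = DigitalAgent(Persona("user-1", "user-1-voice", "user-1-avatar"),
                         model=lambda q: f"Answering on the user's behalf: {q}")
    agent.attend(["Can your team ship the build by Friday?"], perform=print)
```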
Regarding Claim 2, Lebaredian in view of Lindmark and Srivastava teaches The system of claim 1, wherein the persona is associated with a realistic representation of the user, a fanciful representation of the user, or any of a range of representations therebetween (see Lebaredian Paragraph [0039], Personalized models may be generated for different users over time, such that the AI engine 112 may learn what a particular user looks like when they are happy, sad, etc., and/or to learn a particular users speech patterns, figures of speech, and/or other user-specific information that may be used to tailor the AI engine 112 to the particular user. This information may be stored in a user profile of the AI agent device(s) 102. Similarly, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments such that the AI agents may communicate and appear more human-like, therefore may be realistic representation of user). Regarding Claim 3, Lebaredian in view of Lindmark and Srivastava teaches The system of claim 1, wherein the monitoring occurs continuously, at a regular time interval, or over a dynamic window (see Lebaredian Paragraph [0049], this process may continue throughout the video conference during times when the AI agent is to be displayed or presented—e.g., the entire time, only after activation criteria are satisfied and until a given interaction is complete, the remainder of the time after activation criteria are satisfied, until the AI agent is asked to leave or removed from the conference, etc.). Regarding Claim 4, Lebaredian in view of Lindmark and Srivastava teaches The system of claim 1, wherein the determined first rendering of the first action is an audio rendering, a visual rendering, or an audio-visual rendering (see Lebaredian Paragraph [0032], the AI agent's lips may be controlled with the virtual environment to correspond to the audio data—or at least the portions of the audio representing speech. In addition to the speech, there may be additional audio data corresponding to background noises or sounds, music, tones, ambient noises, other AI agents, virtual bots, and/or other sources. Ultimately, the audio data including the speech of the AI agent and other audio sources may be transmitted—e.g., as an audio stream—to the user device(s) 104 (e.g., via the host device(s) 106, in embodiments) and Paragraph [0033], In addition to audio, a response or communication by an AI agent may include simulated physical movements, gestures, postures, poses, and/or the like that may be represented in the virtual world. The appearance, gestures, movements, posture, and/or other information corresponding to the AI agent—in addition to the virtual environment in which the AI agent is located—may be represented by graphical data. This graphical data may be rendered by the renderer 114 to generate display data or image data that may be streamed to the user device(s) 104 for presentation on a display 122). 
Regarding Claim 5, Lebaredian in view of Lindmark and Srivastava teaches The system of claim 1, the set of operations further comprising: receiving second meeting input (see Lebaredian Paragraph [0057], the method 300, at block B304, includes receiving first data representative of one or more of an audio stream, a text stream, or a video stream associated with a user device(s) communicatively coupled with the instance of the application. For example, an audio, video, and/or textual stream generated using a user device(s) 104 may be received—e.g., by the AI agent device(s) 102, and Paragraph [0049], this process may continue throughout the video conference during times when the AI agent is to be displayed or presented—e.g., the entire time, only after activation criteria are satisfied and until a given interaction is complete, the remainder of the time after activation criteria are satisfied, until the AI agent is asked to leave or removed from the conference, etc., as applied to repeating this process a second time); processing, by a second foundation model, the second meeting input to generate second output (see Lebaredian Paragraph [0058], The method 300, at block B306, includes analyzing the first data using natural language processing. For example, the received data may be analyzed by the AI engine 112 (executed by, for example and without limitation, one or more parallel processing units), which may include applying natural language processing to the data, as applied to second data); based on the second output, determining a second action (see Lebaredian Paragraph [0059], The method 300, at block B308, includes generating second data representative of a textual output responsive to the first data and corresponding to the virtual agent. For example, the AI engine 112 may generate text that corresponds to a verbal response of the AI agent); determining a second rendering for the second action (see Lebaredian Paragraph [0060], The method 300, at block B310, includes applying the second data to a text-to-speech algorithm to generate audio data. For example, the textual data corresponding to the response or communication of the AI agent may be applied to a text-to-speech algorithm to generate audio data); and causing the digital agent to perform the second action on behalf of the user according to the persona and the determined second rendering (see Lebaredian Paragraph [0061], The method 300, at block B312, includes generating graphical data representative of a virtual field of view of a virtual environment from a perspective of a virtual camera, the virtual field of view including a graphical representation of the virtual agent within the virtual environment. For example, the renderer 114 may generate the graphical data representative of a virtual field of view of the virtual environment from a perspective of a virtual camera, and the virtual field of view may include a graphical representation of the AI agent. For example, the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.—and the virtual environment may be generated to provide context to the response, and Paragraph [0062], The method 300, at block B314, includes causing presentation of a rendering of the graphical data and an audio output corresponding to the audio data as a communication exchanged using the instance of the application. 
For example, the renderer 114 may generate display data or image data corresponding to the graphical data, and audio data, and/or textual data may also be rendered or generated. This display or image data, audio data, and/or textual data may then be transmitted to the user device(s) 104—via the host device(s) 106, in embodiments—as an audio stream, a video stream, and/or a textual stream for output by the user device(s) 104). Regarding Claim 6, Lebaredian in view of Lindmark and Srivastava teaches The system of claim 5, wherein the first foundation model is one of the same or different from the second foundation model (see Lebaredian Paragraph [0087], the data center 500 may include tools, services, software or other resources to train one or more machine learning models or predict or infer information using one or more machine learning models according to one or more embodiments described herein). Regarding Claim 7, Lebaredian in view of Lindmark and Srivastava teaches The system of claim 1, wherein the determined first rendering of the first action is based at least in part on the persona (see Lebaredian Paragraph [0032], the AI agent's lips may be controlled with the virtual environment to correspond to the audio data—or at least the portions of the audio representing speech. In addition to the speech, there may be additional audio data corresponding to background noises or sounds, music, tones, ambient noises, other AI agents, virtual bots, and/or other sources. Ultimately, the audio data including the speech of the AI agent and other audio sources may be transmitted—e.g., as an audio stream—to the user device(s) 104 (e.g., via the host device(s) 106, in embodiments), Paragraph [0033], in addition to audio, a response or communication by an AI agent may include simulated physical movements, gestures, postures, poses, and/or the like that may be represented in the virtual world. The appearance, gestures, movements, posture, and/or other information corresponding to the AI agent—in addition to the virtual environment in which the AI agent is located—may be represented by graphical data. This graphical data may be rendered by the renderer 114 to generate display data or image data that may be streamed to the user device(s) 104 for presentation on a display 122, and Paragraph [0034], the AI engine 112 may determine the simulated physical characteristics of the AI agent based on an analysis of the incoming data, the general type or personality of the AI agent, and/or the determined textual, audible, and/or visual response or communication by the AI agent. For example, where the AI engine 112 determines that a current speaker is angry or sad, this information may be leveraged to simulate the AI agent to respond appropriately (e.g., using a gentle, uplifting, or consoling tone or phrasing). Where the AI engine 112 determines that a certain gesture or posture is fitting to the spoken response of the AI agent, the AI agent may be controlled as such within the virtual environment. As such, a body and/or face of the AI agent may be animated such that the AI agent may emote (express its own set of emotions) for the virtual camera). 
Regarding Claim 10, Lebaredian teaches A method of using one or more models to instantiate a digital agent (see Lebaredian Paragraph [0005], the systems and methods of the present disclosure provide a platform and pipeline for hosting or integrating a conversational AI assistant within any application that includes an audio, video, and/or textual output device, Paragraph [0015] the AI agent(s) described herein may be implemented for video conferencing applications (e.g., to participate in conversation for answering questions, displaying information, etc.), and Paragraph [0027], the AI engine 112 may include any number of features for speech tasks such as intent and entity classification, sentiment analysis, dialog modeling, domain and fulfillment mapping, etc. In some embodiments, the AI engine 112 may use natural language processing (NLP) techniques or one or more neural network model to ingest, decipher, perceive, and/or make sense of incoming audio data. For vision, the AI engine 112 may include any number of features for person, face, and/or body (gesture) detection and tracking, detection of key body or facial landmarks and body pose, gestures, lip activity, gaze, and/or other features. The AI engine 112 may further include fused sensory perception, tasks, or algorithms that analyze both audio and images together to make determinations. In embodiments, some or all of the speech, vision, and/or fused tasks may leverage machine learning and/or deep learning models (e.g., NVIDIA's Jarvis and Natural Language Processing Models), that may be trained on custom data to achieve high accuracy for the particular use case or embodiment. The AI agent as managed by the AI engine 112 may be deployed within a cloud-based environment, in a data center, and/or at the edge), comprising: instantiating the digital agent with a persona associated with a user, wherein the persona is associated with at least a speech simulation and visual representation resembling the user (see Lebaredian Paragraph [0039], Personalized models may be generated for different users over time, such that the AI engine 112 may learn what a particular user looks like when they are happy, sad, etc., and/or to learn a particular users speech patterns, figures of speech, and/or other user-specific information that may be used to tailor the AI engine 112 to the particular user. This information may be stored in a user profile of the AI agent device(s) 102. Similarly, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments such that the AI agents may communicate and appear more human-like); receiving an indication to dispatch the digital agent to a virtual meeting on behalf of the user (see Lebaredian Paragraph [0045], referring to FIG. 1B, the video conferencing system 100B may be used to host a video conference session including the AI agent via the AI agent device(s) 102 and one or more users via the user device(s) 104. In such an example, the client application 116A and 116B may correspond to end-user application versions of the video conferencing applications and the host device(s) 106 may include the host application 126 hosting the video conference session. 
The connection between the client applications 116A and 116B and the host application 126 may be via an email, a meeting invite, dial in, a URL, and/or another invitation means. For example, an AI agent—or the AI agent device(s) 102—may have a corresponding email handle, calendaring application connection, etc., that may allow the AI agent device(s) 102 to connect the AI agent to the conference. As such, similar to how a user 130 may join the video conference session via the user device(s) 104—e.g., using a link from an email or meeting invite, going to a URL, entering a meeting code, etc.—the AI agent device(s) 102 may connect the AI agent to the video conference session using any means of access, and Paragraph [0039], tailor the AI engine 112 to the particular user, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments); monitoring the virtual meeting (see Lebaredian Paragraph [0029], In some embodiments, such as where privacy concerns are not an issue or a user has opted in to constant recording of audio and/or video, no trigger activation may be used—although the audio, text, and/or video may still be monitored to determine when a user is addressing the AI agent, Paragraph [0030], in the circumstance that privacy concerns may not allow for constant recording of audio or speech in public spaces, once an activation trigger is satisfied, the microphones, cameras, and/or other I/O component(s) 120 may be opened up (e.g., activated to listen, monitor, or observe for user input beyond triggering events), and the data may be processed by the AI engine 112 to determine a response and/or other communication, and Paragraph [0050], referring to FIG. 2A, visualization 200A may correspond to a screenshot of a video conference corresponding to a video conferencing application. Any number of users 202 (e.g., users 202A-202D) may participate in the video conference, and the video conference may include an AI agent 204A. The AI agent 204A may be represented to be within a virtual environment 206 corresponding to a virtual representation of Paris, France, including the Eiffel Tower to provide context for the location of the AI agent 204A to the other users 202. In such an embodiment, the AI agent 204 may be represented—in addition to the virtual environment 206—by graphical data, and the renderer 114 may generate video data from the graphical data, which may be transmitted as a video stream—via the host device(s) 106—to the user devices 104 associated with the users 202. In addition to the video stream, an audio stream and/or a textual stream may be transmitted such that the AI agent 204A may appear to, and interact with, the users 202 as any other user would. The virtual environment 206 may be used by the AI agent 204A to provide context to the conversation within the video conference. 
For example, when giving weather information, traffic information, time information, stock information, and/or any other information as a response or communication within the video conference, the AI agent 204A may leverage this virtual environment 206); receiving meeting input (see Lebaredian Paragraph [0047], The AI engine 112 may access and/or receive the video, audio, and/or textual streams from the client application 116A and may process the data to determine a response or communication for the AI agent and/or the renderer 114 may generate any update(s) to the corresponding virtual environment and Paragraph [0057], the method 300, at block B304, includes receiving first data representative of one or more of an audio stream, a text stream, or a video stream associated with a user device(s) communicatively coupled with the instance of the application. For example, an audio, video, and/or textual stream generated using a user device(s) 104 may be received—e.g., by the AI agent device(s) 102); processing, by a model, the meeting input to generate model output (see Lebaredian Paragraph [0047], the AI engine 112 may access and/or receive the video, audio, and/or textual streams from the client application 116A and may process the data to determine a response or communication for the AI agent, Paragraph [0031], The incoming data—e.g., visual, textual, audible, etc.—may be analyzed by the AI engine 112 to determine a textual, visual, and/or audible response or communication—represented using three-dimensional (3D) graphics—for the AI agent. For example, the AI engine 112 may generate output text for text-to-speech processing—e.g., using one or more machine learning or deep learning models—to generate audio data. This audio data may be transmitted to the user device(s) 104—via the host device(s) 106, in embodiments—for output by a speaker or another I/O component(s) 120 of the user device(s) 104 and Figure 3, analyze the first data using natural language processing step B306); based on the model output, determining an action (see Lebaredian Figure 3, step B308, generate second data representative of a textual output responsive to the first data and corresponding to the virtual agent and Paragraph [0059], the method 300, at block B308, includes generating second data representative of a textual output responsive to the first data and corresponding to the virtual agent. For example, the AI engine 112 may generate text that corresponds to a verbal response of the AI agent); determining a rendering of the action (see Lebaredian Paragraph [0047], the AI engine 112 may access and/or receive the video, audio, and/or textual streams from the client application 116A and may process the data to determine a response or communication for the AI agent and/or the renderer 114 may generate any update(s) to the corresponding virtual environment. In some embodiments, notes, question and answer dialogue box information, and/or other information associated with the video conference may be received and processed by the AI engine 112. As such, once the textual, visual, and/or audible response or communication of the AI agent is determined, the AI agent and the virtual environment may be updated according thereto, and display data and/or image data generated from the graphical data—e.g., from a virtual field of view or one or more virtual sensors, such as cameras, microphones, etc.—may be rendered using the renderer 114. 
A stream manager 128 may receive the rendered data and generate a video stream, an audio stream, a textual stream, and/or encoded representations thereof, and provide this information to the client application 116A, Paragraph [0060], the method 300, at block B310, includes applying the second data to a text-to-speech algorithm to generate audio data. For example, the textual data corresponding to the response or communication of the AI agent may be applied to a text-to-speech algorithm to generate audio data, and Paragraph [0061], the method 300, at block B312, includes generating graphical data representative of a virtual field of view of a virtual environment from a perspective of a virtual camera, the virtual field of view including a graphical representation of the virtual agent within the virtual environment. For example, the renderer 114 may generate the graphical data representative of a virtual field of view of the virtual environment from a perspective of a virtual camera, and the virtual field of view may include a graphical representation of the AI agent. For example, the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.—and the virtual environment may be generated to provide context to the response); and causing the digital agent to perform the action on behalf of the user according to the persona and the determined rendering (see Lebaredian Paragraph [0047], a stream manager 128 may receive the rendered data and generate a video stream, an audio stream, a textual stream, and/or encoded representations thereof, and provide this information to the client application 116A, Paragraph [0062], the method 300, at block B314, includes causing presentation of a rendering of the graphical data and an audio output corresponding to the audio data as a communication exchanged using the instance of the application. For example, the renderer 114 may generate display data or image data corresponding to the graphical data, and audio data, and/or textual data may also be rendered or generated. This display or image data, audio data, and/or textual data may then be transmitted to the user device(s) 104—via the host device(s) 106, in embodiments—as an audio stream, a video stream, and/or a textual stream for output by the user device(s) 104, and Paragraph [0039], Personalized models may be generated for different users over time, such that the AI engine 112 may learn what a particular user looks like when they are happy, sad, etc., and/or to learn a particular users speech patterns, figures of speech, and/or other user-specific information that may be used to tailor the AI engine 112 to the particular user. This information may be stored in a user profile of the AI agent device(s) 102. Similarly, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments such that the AI agents may communicate and appear more human-like). Lebaredian does not expressively teach a foundation model wherein the digital agent autonomously participates in the virtual meeting as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user independently of receiving the first meeting input. 
However, Lindmark teaches a foundational model (see Lindmark Paragraph [0068], an AI model 330A-M is an AI model that has been trained on a corpus of data. For example, the AI model 330A-M can be an AI model that is first pre-trained on a corpus of data to create a foundational model, and afterwards fine-tuned on more data pertaining to a particular set of tasks to create a more task-specific, or targeted, model. The foundational model can first be pre-trained using a corpus of data that can include data in the public domain, licensed content, and/or proprietary content. Such a pre-training can be used by the AI model 330A-M to learn broad elements including, image or speech recognition, general sentence structure, common phrases, vocabulary, natural language structure, and other elements. In some implementations, this first foundational model is trained using self-supervision, or unsupervised training on such datasets) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of an artificial intelligence digital agent for video conferences (as taught in Lebaredian), with implementing a foundational model (as taught in Lindmark), the motivation being to prevent training artificial intelligence in real time and instead implement a pre-trained model that is fine-tuned, task-specific, and targeted, which in return saves time and increases efficiency (see Lindmark Paragraph [0068]). Lebaredian in view of Lindmark does not expressively teach wherein the digital agent autonomously participates in the virtual meeting as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user independently of receiving the first meeting input. However, Srivastava teaches wherein the digital agent autonomously participates in the virtual meeting as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user independently of receiving the first meeting input (see Srivastava Abstract, method for enabling a user to create a proxy persona to attend a meeting on behalf of the user, whereby the meeting is associated with a communication system for requesting a person to be available for a scheduled event is provided, and Paragraph [0027], a connected persona that may act as a proxy for them, attending meetings and phone calls (thus recording them), joining chats (thus saving the conversation), and even alerting the user in real-time to important information that might otherwise be missed). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of an artificial intelligence digital agent for video conferences (as taught in Lebaredian), with implementing a foundational model (as taught in Lindmark), the motivation being to prevent training artificial intelligence in real time and instead implement a pre-trained model that is fine-tuned, task-specific, and targeted, which in return saves time and increases efficiency (see Lindmark Paragraph [0068]). 
It would have been further obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of an artificial intelligence digital agent for video conferences (as taught in Lebaredian in view of Lindmark), with a digital agent that autonomously participates in a video conference as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user (as taught in Srivastava), the motivation being to prevent a lack of communication that may lead to a loss of productivity and business for companies because of the absence of a user, by providing a proxy person to step in for the user and participate and keep record of meetings, conversations, etc. (see Srivastava Paragraphs [0014], [0016] and [0017]). Regarding Claim 11, Lebaredian in view of Lindmark and Srivastava teaches The method of claim 10, further comprising: monitoring the virtual meeting using a state machine, wherein the state machine implements a thought loop (see Lebaredian Paragraph [0029], In some embodiments, such as where privacy concerns are not an issue or a user has opted in to constant recording of audio and/or video, no trigger activation may be used—although the audio, text, and/or video may still be monitored to determine when a user is addressing the AI agent, Paragraph [0030], in the circumstance that privacy concerns may not allow for constant recording of audio or speech in public spaces, once an activation trigger is satisfied, the microphones, cameras, and/or other I/O component(s) 120 may be opened up (e.g., activated to listen, monitor, or observe for user input beyond triggering events), and the data may be processed by the AI engine 112 to determine a response and/or other communication, and Paragraph [0050], referring to FIG. 2A, visualization 200A may correspond to a screenshot of a video conference corresponding to a video conferencing application. Any number of users 202 (e.g., users 202A-202D) may participate in the video conference, and the video conference may include an AI agent 204A. The AI agent 204A may be represented to be within a virtual environment 206 corresponding to a virtual representation of Paris, France, including the Eiffel Tower to provide context for the location of the AI agent 204A to the other users 202. In such an embodiment, the AI agent 204 may be represented—in addition to the virtual environment 206—by graphical data, and the renderer 114 may generate video data from the graphical data, which may be transmitted as a video stream—via the host device(s) 106—to the user devices 104 associated with the users 202. In addition to the video stream, an audio stream and/or a textual stream may be transmitted such that the AI agent 204A may appear to, and interact with, the users 202 as any other user would. The virtual environment 206 may be used by the AI agent 204A to provide context to the conversation within the video conference. For example, when giving weather information, traffic information, time information, stock information, and/or any other information as a response or communication within the video conference, the AI agent 204A may leverage this virtual environment 206, Paragraph [0070], The CPU(s) 406 may be configured to execute at least some of the computer-readable instructions to control one or more components of the computing device 400 to perform one or more of the methods and/or processes described herein. 
The CPU(s) 406 may each include one or more cores (e.g., one, two, four, eight, twenty-eight, seventy-two, etc.) that are capable of handling a multitude of software threads simultaneously. The CPU(s) 406 may include any type of processor, and may include different types of processors depending on the type of computing device 400 implemented (e.g., processors with fewer cores for mobile devices and processors with more cores for servers), and Paragraph [0049], this process may continue throughout the video conference during times when the AI agent is to be displayed or presented—e.g., the entire time, only after activation criteria are satisfied and until a given interaction is complete, the remainder of the time after activation criteria are satisfied, until the AI agent is asked to leave or removed from the conference, etc., therefore the process is a loop in which is reoccurring through the conference). Regarding Claim 12, Lebaredian in view of Lindmark and Srivastava teaches The method of claim 11, wherein the state machine returns to the monitoring after each iteration of the thought loop (see Lebaredian Paragraph [0029], In some embodiments, such as where privacy concerns are not an issue or a user has opted in to constant recording of audio and/or video, no trigger activation may be used—although the audio, text, and/or video may still be monitored to determine when a user is addressing the AI agent, Paragraph [0030], in the circumstance that privacy concerns may not allow for constant recording of audio or speech in public spaces, once an activation trigger is satisfied, the microphones, cameras, and/or other I/O component(s) 120 may be opened up (e.g., activated to listen, monitor, or observe for user input beyond triggering events), and the data may be processed by the AI engine 112 to determine a response and/or other communication, and Paragraph [0050], referring to FIG. 2A, visualization 200A may correspond to a screenshot of a video conference corresponding to a video conferencing application. Any number of users 202 (e.g., users 202A-202D) may participate in the video conference, and the video conference may include an AI agent 204A. The AI agent 204A may be represented to be within a virtual environment 206 corresponding to a virtual representation of Paris, France, including the Eiffel Tower to provide context for the location of the AI agent 204A to the other users 202. In such an embodiment, the AI agent 204 may be represented—in addition to the virtual environment 206—by graphical data, and the renderer 114 may generate video data from the graphical data, which may be transmitted as a video stream—via the host device(s) 106—to the user devices 104 associated with the users 202. In addition to the video stream, an audio stream and/or a textual stream may be transmitted such that the AI agent 204A may appear to, and interact with, the users 202 as any other user would. The virtual environment 206 may be used by the AI agent 204A to provide context to the conversation within the video conference. 
For example, when giving weather information, traffic information, time information, stock information, and/or any other information as a response or communication within the video conference, the AI agent 204A may leverage this virtual environment 206, Paragraph [0070], The CPU(s) 406 may be configured to execute at least some of the computer-readable instructions to control one or more components of the computing device 400 to perform one or more of the methods and/or processes described herein. The CPU(s) 406 may each include one or more cores (e.g., one, two, four, eight, twenty-eight, seventy-two, etc.) that are capable of handling a multitude of software threads simultaneously. The CPU(s) 406 may include any type of processor, and may include different types of processors depending on the type of computing device 400 implemented (e.g., processors with fewer cores for mobile devices and processors with more cores for servers), and Paragraph [0049], this process may continue throughout the video conference during times when the AI agent is to be displayed or presented—e.g., the entire time, only after activation criteria are satisfied and until a given interaction is complete, the remainder of the time after activation criteria are satisfied, until the AI agent is asked to leave or removed from the conference, etc., therefore the process is a loop in which is reoccurring through the conference and the audio, text, and/or video is monitored once a response is given by the AI Agent, in order for it to determine a response or action). Regarding Claim 14, it is rejected similarly as Claim 7. The method can be found in Lebaredian (Paragraph [0005], method). Regarding Claim 15, Lebaredian teaches A method of using one or more models to instantiate a digital agent (see Lebaredian Paragraph [0005], the systems and methods of the present disclosure provide a platform and pipeline for hosting or integrating a conversational AI assistant within any application that includes an audio, video, and/or textual output device, Paragraph [0015] the AI agent(s) described herein may be implemented for video conferencing applications (e.g., to participate in conversation for answering questions, displaying information, etc.), and Paragraph [0027], the AI engine 112 may include any number of features for speech tasks such as intent and entity classification, sentiment analysis, dialog modeling, domain and fulfillment mapping, etc. In some embodiments, the AI engine 112 may use natural language processing (NLP) techniques or one or more neural network model to ingest, decipher, perceive, and/or make sense of incoming audio data. For vision, the AI engine 112 may include any number of features for person, face, and/or body (gesture) detection and tracking, detection of key body or facial landmarks and body pose, gestures, lip activity, gaze, and/or other features. The AI engine 112 may further include fused sensory perception, tasks, or algorithms that analyze both audio and images together to make determinations. In embodiments, some or all of the speech, vision, and/or fused tasks may leverage machine learning and/or deep learning models (e.g., NVIDIA's Jarvis and Natural Language Processing Models), that may be trained on custom data to achieve high accuracy for the particular use case or embodiment. 
The AI agent as managed by the AI engine 112 may be deployed within a cloud-based environment, in a data center, and/or at the edge), comprising: instantiating the digital agent with a persona associated with a user (see Lebaredian Paragraph [0039], Personalized models may be generated for different users over time, such that the AI engine 112 may learn what a particular user looks like when they are happy, sad, etc., and/or to learn a particular users speech patterns, figures of speech, and/or other user-specific information that may be used to tailor the AI engine 112 to the particular user. This information may be stored in a user profile of the AI agent device(s) 102. Similarly, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments such that the AI agents may communicate and appear more human-like); receiving an indication to dispatch the digital agent to a virtual meeting on behalf of the user (see Lebaredian Paragraph [0045], referring to FIG. 1B, the video conferencing system 100B may be used to host a video conference session including the AI agent via the AI agent device(s) 102 and one or more users via the user device(s) 104. In such an example, the client application 116A and 116B may correspond to end-user application versions of the video conferencing applications and the host device(s) 106 may include the host application 126 hosting the video conference session. The connection between the client applications 116A and 116B and the host application 126 may be via an email, a meeting invite, dial in, a URL, and/or another invitation means. For example, an AI agent—or the AI agent device(s) 102—may have a corresponding email handle, calendaring application connection, etc., that may allow the AI agent device(s) 102 to connect the AI agent to the conference. As such, similar to how a user 130 may join the video conference session via the user device(s) 104—e.g., using a link from an email or meeting invite, going to a URL, entering a meeting code, etc.—the AI agent device(s) 102 may connect the AI agent to the video conference session using any means of access, and Paragraph [0039], tailor the AI engine 112 to the particular user, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments); monitoring, by a state machine, the virtual meeting (see Lebaredian Paragraph [0029], In some embodiments, such as where privacy concerns are not an issue or a user has opted in to constant recording of audio and/or video, no trigger activation may be used—although the audio, text, and/or video may still be monitored to determine when a user is addressing the AI agent, Paragraph [0030], in the circumstance that privacy concerns may not allow for constant recording of audio or speech in public spaces, once an activation trigger is satisfied, the microphones, cameras, and/or other I/O component(s) 120 may be opened up (e.g., activated to listen, monitor, or observe for user input beyond triggering events), and the data may be processed by the AI engine 112 to determine a response and/or other communication, and Paragraph [0050], referring to FIG. 
2A, visualization 200A may correspond to a screenshot of a video conference corresponding to a video conferencing application. Any number of users 202 (e.g., users 202A-202D) may participate in the video conference, and the video conference may include an AI agent 204A. The AI agent 204A may be represented to be within a virtual environment 206 corresponding to a virtual representation of Paris, France, including the Eiffel Tower to provide context for the location of the AI agent 204A to the other users 202. In such an embodiment, the AI agent 204 may be represented—in addition to the virtual environment 206—by graphical data, and the renderer 114 may generate video data from the graphical data, which may be transmitted as a video stream—via the host device(s) 106—to the user devices 104 associated with the users 202. In addition to the video stream, an audio stream and/or a textual stream may be transmitted such that the AI agent 204A may appear to, and interact with, the users 202 as any other user would. The virtual environment 206 may be used by the AI agent 204A to provide context to the conversation within the video conference. For example, when giving weather information, traffic information, time information, stock information, and/or any other information as a response or communication within the video conference, the AI agent 204A may leverage this virtual environment 206, and Paragraph [0070], The CPU(s) 406 may be configured to execute at least some of the computer-readable instructions to control one or more components of the computing device 400 to perform one or more of the methods and/or processes described herein. The CPU(s) 406 may each include one or more cores (e.g., one, two, four, eight, twenty-eight, seventy-two, etc.) that are capable of handling a multitude of software threads simultaneously. The CPU(s) 406 may include any type of processor, and may include different types of processors depending on the type of computing device 400 implemented (e.g., processors with fewer cores for mobile devices and processors with more cores for servers)); receiving, by the state machine, meeting input (see Lebaredian Paragraph [0047], The AI engine 112 may access and/or receive the video, audio, and/or textual streams from the client application 116A and may process the data to determine a response or communication for the AI agent and/or the renderer 114 may generate any update(s) to the corresponding virtual environment and Paragraph [0057], the method 300, at block B304, includes receiving first data representative of one or more of an audio stream, a text stream, or a video stream associated with a user device(s) communicatively coupled with the instance of the application. For example, an audio, video, and/or textual stream generated using a user device(s) 104 may be received—e.g., by the AI agent device(s) 102); receiving first output from the first model; based on the first output, determining an action (see Lebaredian Paragraph [0047], the AI engine 112 may access and/or receive the video, audio, and/or textual streams from the client application 116A and may process the data to determine a response or communication for the AI agent, Paragraph [0031], The incoming data—e.g., visual, textual, audible, etc.—may be analyzed by the AI engine 112 to determine a textual, visual, and/or audible response or communication—represented using three-dimensional (3D) graphics—for the AI agent. 
For example, the AI engine 112 may generate output text for text-to-speech processing—e.g., using one or more machine learning or deep learning models—to generate audio data. This audio data may be transmitted to the user device(s) 104—via the host device(s) 106, in embodiments—for output by a speaker or another I/O component(s) 120 of the user device(s) 104, Figure 3, analyze the first data using natural language processing step B306, and Figure 3, step B308, generate second data representative of a textual output responsive to the first data and corresponding to the virtual agent and Paragraph [0059], the method 300, at block B308, includes generating second data representative of a textual output responsive to the first data and corresponding to the virtual agent. For example, the AI engine 112 may generate text that corresponds to a verbal response of the AI agent); and causing the digital agent to perform the action on behalf of the user according to the persona (see Lebaredian Paragraph [0047], a stream manager 128 may receive the rendered data and generate a video stream, an audio stream, a textual stream, and/or encoded representations thereof, and provide this information to the client application 116A, Paragraph [0062], the method 300, at block B314, includes causing presentation of a rendering of the graphical data and an audio output corresponding to the audio data as a communication exchanged using the instance of the application. For example, the renderer 114 may generate display data or image data corresponding to the graphical data, and audio data, and/or textual data may also be rendered or generated. This display or image data, audio data, and/or textual data may then be transmitted to the user device(s) 104—via the host device(s) 106, in embodiments—as an audio stream, a video stream, and/or a textual stream for output by the user device(s) 104, and Paragraph [0039], Personalized models may be generated for different users over time, such that the AI engine 112 may learn what a particular user looks like when they are happy, sad, etc., and/or to learn a particular users speech patterns, figures of speech, and/or other user-specific information that may be used to tailor the AI engine 112 to the particular user. This information may be stored in a user profile of the AI agent device(s) 102. Similarly, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments such that the AI agents may communicate and appear more human-like). Lebaredian does not expressly teach a foundation model generating a prompt comprising at least the meeting input; based on the prompt, querying a first foundation model; wherein the digital agent autonomously participates in the virtual meeting as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user independently of receiving the first meeting input. However, Lindmark teaches a foundational model (see Lindmark Paragraph [0068], an AI model 330A-M is an AI model that has been trained on a corpus of data. 
For example, the AI model 330A-M can be an AI model that is first pre-trained on a corpus of data to create a foundational model, and afterwards fine-tuned on more data pertaining to a particular set of tasks to create a more task-specific, or targeted, model. The foundational model can first be pre-trained using a corpus of data that can include data in the public domain, licensed content, and/or proprietary content. Such a pre-training can be used by the AI model 330A-M to learn broad elements including, image or speech recognition, general sentence structure, common phrases, vocabulary, natural language structure, and other elements. In some implementations, this first foundational model is trained using self-supervision, or unsupervised training on such datasets) generating a prompt comprising at least the meeting input (see Lindmark Paragraph [0005], the method includes automatically generating a prompt using information associated with the virtual meeting, and Paragraph [0023], Information associated with the virtual meeting can be used to prompt the AI model, which can be trained to generate information about a given participant of the virtual meeting. For example, the prompt may include information specific to the virtual meeting (e.g., the meeting title, shared meeting notes, etc.) and information specific to the participant (e.g., the name of the participant, the email address of the participant, etc.)); based on the prompt, querying a first foundation model (see Lindmark Paragraph [0005], the method includes automatically generating a prompt using information associated with the virtual meeting, providing the prompt and context associated with the engagement of the first participant with the first visual item corresponding to the video stream generated by the second client device as input to a generative Artificial Intelligence (AI) model, obtaining one or more outputs from the generative AI model, and generating the one or more information items using the one or more outputs); It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of an artificial intelligence digital agent for video conferences (as taught in Lebaredian), with implementing a foundational model that is queried based on a generated prompt (as taught in Lindmark), the motivation being to prevent training artificial intelligence in real time and instead implement a pre-trained model that is fine-tuned, task-specific, and targeted, which in return saves time and increases efficiency (see Lindmark Paragraph [0068]). Lebaredian in view of Lindmark does not expressly teach wherein the digital agent autonomously participates in the virtual meeting as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user independently of receiving the first meeting input. 
However, Srivastava teaches wherein the digital agent autonomously participates in the virtual meeting as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user independently of receiving the first meeting input (see Srivastava Abstract, method for enabling a user to create a proxy persona to attend a meeting on behalf of the user, whereby the meeting is associated with a communication system for requesting a person to be available for a scheduled event is provided, and Paragraph [0027], a connected persona that may act as a proxy for them, attending meetings and phone calls (thus recording them), joining chats (thus saving the conversation), and even alerting the user in real-time to important information that might otherwise be missed). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of an artificial intelligence digital agent for video conferences (as taught in Lebaredian), with implementing a foundational model that is queried based on a generated prompt (as taught in Lindmark), the motivation being to prevent training artificial intelligence in real time and instead implement a pre-trained model that is fine-tuned, task-specific, and targeted, which in return saves time and increases efficiency (see Lindmark Paragraph [0068]). It would have been further obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of an artificial intelligence digital agent for video conferences (as taught in Lebaredian in view of Lindmark), with a digital agent that autonomously participates in a video conference as a proxy participant representing the user, including performing the determined first action within the virtual meeting on behalf of the user (as taught in Srivastava), the motivation being to prevent a lack of communication that may lead to a loss of productivity and business for companies because of the absence of a user, by providing a proxy person to step in for the user and participate and keep record of meetings, conversations, etc. (see Srivastava Paragraphs [0014], [0016] and [0017]). Regarding Claim 16, Lebaredian in view of Lindmark and Srivastava teaches The method of claim 15, further comprising: updating the prompt with the action (see Lindmark Paragraph [0005], the method includes automatically generating a prompt using information associated with the virtual meeting, Paragraph [0023], Information associated with the virtual meeting can be used to prompt the AI model, which can be trained to generate information about a given participant of the virtual meeting. For example, the prompt may include information specific to the virtual meeting (e.g., the meeting title, shared meeting notes, etc.) 
and information specific to the participant (e.g., the name of the participant, the email address of the participant, etc.), Paragraph [0054], the prompt subsystem can be configured to perform automated identification of, and facilitate retrieval of, relevant and timely contextual information for efficient and accurate processing of prompts); and querying a second foundation model based on the updated prompt (see Lindmark Paragraph [0005], the method includes automatically generating a prompt using information associated with the virtual meeting, Paragraph [0023], Information associated with the virtual meeting can be used to prompt the AI model, which can be trained to generate information about a given participant of the virtual meeting. For example, the prompt may include information specific to the virtual meeting (e.g., the meeting title, shared meeting notes, etc.) and information specific to the participant (e.g., the name of the participant, the email address of the participant, etc.), Paragraph [0054], the prompt subsystem can be configured to perform automated identification of, and facilitate retrieval of, relevant and timely contextual information for efficient and accurate processing of prompts, and Paragraph [0070], the outputs of the pre-trained model may be input into a second AI model). Regarding Claim 17, it is rejected similarly to Claim 6. The method can be found in Lebaredian (Paragraph [0005], method). Regarding Claim 18, it is rejected similarly to Claim 11. Regarding Claim 19, Lebaredian in view of Lindmark and Srivastava teaches The method of claim 18, wherein the thought loop comprises receiving state information from a previous iteration, receiving meeting input, requesting model output, receiving model output, handling model output, and sending state information to a next iteration (see Lebaredian Paragraph [0049], this process may continue throughout the video conference during times when the AI agent is to be displayed or presented—e.g., the entire time, only after activation criteria are satisfied and until a given interaction is complete, the remainder of the time after activation criteria are satisfied, until the AI agent is asked to leave or removed from the conference, etc., Paragraph [0046], For each user device 104, a user(s) 130 may provide inputs to one or more I/O components 120 and/or the I/O components 120 may generate data. For example, a camera—e.g., a web cam—may capture a video stream of its field of view (which may include the user), a microphone may capture an audio stream, and/or a keyboard, mouse, or other input devices may capture a textual stream or other input streams, Paragraph [0047], These streams of audio, video, and/or textual data may be received by the client application 116B and transmitted—e.g., after encoding—to the host device(s) 106, and the host device(s) 106 may analyze, process, transmit, and/or forward the data to the client application 116A of the AI agent device(s) 102. The AI engine 112 may access and/or receive the video, audio, and/or textual streams from the client application 116A and may process the data to determine a response or communication for the AI agent and/or the renderer 114 may generate any update(s) to the corresponding virtual environment. In some embodiments, notes, question and answer dialogue box information, and/or other information associated with the video conference may be received and processed by the AI engine 112. 
As such, once the textual, visual, and/or audible response or communication of the AI agent is determined, the AI agent and the virtual environment may be updated according thereto, and display data and/or image data generated from the graphical data—e.g., from a virtual field of view or one or more virtual sensors, such as cameras, microphones, etc.—may be rendered using the renderer 114, and Paragraph [0087], The data center 500 may include tools, services, software or other resources to train one or more machine learning models or predict or infer information using one or more machine learning models according to one or more embodiments described herein. For example, a machine learning model(s) may be trained by calculating weight parameters according to a neural network architecture using software and/or computing resources described above with respect to the data center 500. In at least one embodiment, trained or deployed machine learning models corresponding to one or more neural networks may be used to infer or predict information using resources described above with respect to the data center 500 by using weight parameters calculated through one or more training techniques, such as but not limited to those described herein). Regarding Claim 20, Lebaredian in view of Lindmark and Srivastava teaches The method of claim 15, further comprising: determining a rendering for the action (see Lebaredian Paragraph [0047], the AI engine 112 may access and/or receive the video, audio, and/or textual streams from the client application 116A and may process the data to determine a response or communication for the AI agent and/or the renderer 114 may generate any update(s) to the corresponding virtual environment. In some embodiments, notes, question and answer dialogue box information, and/or other information associated with the video conference may be received and processed by the AI engine 112. As such, once the textual, visual, and/or audible response or communication of the AI agent is determined, the AI agent and the virtual environment may be updated according thereto, and display data and/or image data generated from the graphical data—e.g., from a virtual field of view or one or more virtual sensors, such as cameras, microphones, etc.—may be rendered using the renderer 114. A stream manager 128 may receive the rendered data and generate a video stream, an audio stream, a textual stream, and/or encoded representations thereof, and provide this information to the client application 116A, Paragraph [0060], the method 300, at block B310, includes applying the second data to a text-to-speech algorithm to generate audio data. For example, the textual data corresponding to the response or communication of the AI agent may be applied to a text-to-speech algorithm to generate audio data, and Paragraph [0061], the method 300, at block B312, includes generating graphical data representative of a virtual field of view of a virtual environment from a perspective of a virtual camera, the virtual field of view including a graphical representation of the virtual agent within the virtual environment. For example, the renderer 114 may generate the graphical data representative of a virtual field of view of the virtual environment from a perspective of a virtual camera, and the virtual field of view may include a graphical representation of the AI agent. 
For example, the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.—and the virtual environment may be generated to provide context to the response); and causing the digital agent to perform the action according to the rendering (see Lebaredian Paragraph [0047], a stream manager 128 may receive the rendered data and generate a video stream, an audio stream, a textual stream, and/or encoded representations thereof, and provide this information to the client application 116A, Paragraph [0062], the method 300, at block B314, includes causing presentation of a rendering of the graphical data and an audio output corresponding to the audio data as a communication exchanged using the instance of the application. For example, the renderer 114 may generate display data or image data corresponding to the graphical data, and audio data, and/or textual data may also be rendered or generated. This display or image data, audio data, and/or textual data may then be transmitted to the user device(s) 104—via the host device(s) 106, in embodiments—as an audio stream, a video stream, and/or a textual stream for output by the user device(s) 104, and Paragraph [0039], Personalized models may be generated for different users over time, such that the AI engine 112 may learn what a particular user looks like when they are happy, sad, etc., and/or to learn a particular users speech patterns, figures of speech, and/or other user-specific information that may be used to tailor the AI engine 112 to the particular user. This information may be stored in a user profile of the AI agent device(s) 102. Similarly, by studying any number of users, the AI engine 112 and the renderer 114—and/or the underlying machine learning or deep learning models associated therewith—may learn how to effectively emote and/or animate a 3D graphical rendering of the AI agents in the virtual environments such that the AI agents may communicate and appear more human-like). Claims 8, 9 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Lebaredian et al. (U.S. Pub. No. 2021/0358188, hereinafter “Lebaredian”) in view of Lindmark (U.S. Pub. No. 2025/0330555), Srivastava et al. (U.S. Pub. No. 2017/0006069, hereinafter “Srivastava”) and Kasaba (U.S. Pub. No. 2022/0385700). Regarding Claim 8, Lebaredian in view of Lindmark and Srivastava teaches all the limitations of claim 1, but does not expressly teach The system of claim 1, wherein determining the first action comprises one of asking the user, querying the same or different foundation model, or selecting a precached thought. However, Kasaba teaches The system of claim 1, wherein determining the first action comprises one of asking the user, querying the same or different foundation model, or selecting a precached thought (see Kasaba Paragraph [0044], system 100 uses an artificial intelligence (AI) engine 102 to process and analyze a plurality of data associated with one or more subject persons and uses the data to render and generate an interactive avatar of the subject person that is configured to mimic or emulate the speech, mannerisms, and inflections of the subject person, and Paragraph [0141], an interactive avatar of the subject person may participate in the web meeting and provide responses to the other participants based on the accumulated knowledge of the subject person contained in the subject person's data collection stored in avatar database 116). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of a pre-trained, artificial intelligence digital agent for video conferences that participates autonomously as a proxy (as taught in Lebaredian in view of Lindmark and Srivastava), with determining an action of artificial intelligence based on asking a user, querying the same or different foundation model, or selecting a precached thought (as taught in Kasaba), the motivation being to quickly provide information to viewers/listeners that has been collected and stored, therefore increasing efficiency of a call and accessibility of data (see Kasaba Paragraph [0141]). Regarding Claim 9, Lebaredian in view of Lindmark, Srivastava and Kasaba teach The system of claim 1, the set of operations further comprising: in response to the digital agent performing the first action, detecting a proprioception of the digital agent (see Kasaba Paragraph [0064], interactive digital avatar 402 may include a representation of a subject person from the waist up and include hands and arms so that interactive digital avatar 402 may mimic or emulate hand movements or other body language of the subject person. In still other embodiments, interactive digital avatar 402 may include a full body representation of a subject person that mimics or emulates entire body movements or motions of the subject person, including, for example, walking gaits, dance moves, exercise routines, etc., and Paragraph [0129] the topics, responses, and other information provided during interactive session 1400 between first user 1206 and interactive avatar 1302 of subject person 1202 may be stored in the user file for first user 1206 (e.g., as interaction data 314) and may also be provided back to subject person 1202. For example, subject person 1202 may use the information about one or more interactions between plurality of users 1204 and interactive avatars to identify users that need further assistance with certain topics or to identify areas of the lecture or lesson that are difficult for many users of plurality of users 1204 to understand. That is, by monitoring or analyzing the interactions between plurality of users 1204 and interactive avatars, subject person 1202 may use this feedback to modify or improve her lecture or lesson); and storing the proprioception (see Kasaba Paragraph [0064], interactive digital avatar 402 may include a representation of a subject person from the waist up and include hands and arms so that interactive digital avatar 402 may mimic or emulate hand movements or other body language of the subject person. In still other embodiments, interactive digital avatar 402 may include a full body representation of a subject person that mimics or emulates entire body movements or motions of the subject person, including, for example, walking gaits, dance moves, exercise routines, etc., and Paragraph [0129] the topics, responses, and other information provided during interactive session 1400 between first user 1206 and interactive avatar 1302 of subject person 1202 may be stored in the user file for first user 1206 (e.g., as interaction data 314) and may also be provided back to subject person 1202. 
For example, subject person 1202 may use the information about one or more interactions between plurality of users 1204 and interactive avatars to identify users that need further assistance with certain topics or to identify areas of the lecture or lesson that are difficult for many users of plurality of users 1204 to understand. That is, by monitoring or analyzing the interactions between plurality of users 1204 and interactive avatars, subject person 1202 may use this feedback to modify or improve her lecture or lesson). Regarding Claim 13, it is rejected similarly to Claim 9. The method can be found in Lebaredian (Paragraph [0005], method). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Refer to PTO-892, Notice of References Cited for a listing of analogous art. Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to CARISSA A JONES whose telephone number is (703)756-1677. The examiner can normally be reached via telework, M-F 6:30 AM - 4:00 PM CT. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Duc Nguyen, can be reached at 571-272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /CARISSA A JONES/ Examiner, Art Unit 2691 /DUC NGUYEN/ Supervisory Patent Examiner, Art Unit 2691
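For readers less familiar with the claimed architecture, the rejected independent claims and their dependents recite an agent that a state machine keeps in a monitoring state, returning there after each response (Claims 11 and 12), a "thought loop" that carries state information from one iteration to the next (Claim 19), and prompt construction from meeting input followed by querying one or more foundation models whose output is then performed as an action in the meeting (Claims 15, 16, and 20). The sketch below is a rough, illustrative rendering of that general pattern only; every name in it is hypothetical, and query_model stands in for whatever foundation-model API an implementation would actually call. Nothing in it is taken from the application or from Lebaredian, Lindmark, Srivastava, or Kasaba.

```python
# Illustrative sketch only. All names are hypothetical and are not taken from the
# application or from any cited reference (Lebaredian, Lindmark, Srivastava, Kasaba).
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopState:
    """State information carried from one iteration of the thought loop to the next (cf. Claim 19)."""
    transcript: list = field(default_factory=list)
    last_action: Optional[str] = None


def build_prompt(persona: str, meeting_input: str, state: LoopState) -> str:
    """Assemble a prompt from the persona, recent meeting context, and the new input (cf. Claim 15)."""
    history = "\n".join(state.transcript[-10:])  # keep only recent context
    return (
        f"You are attending a meeting as a proxy for {persona}.\n"
        f"Recent discussion:\n{history}\n"
        f"New input: {meeting_input}\n"
        "Decide how to respond on the user's behalf."
    )


def thought_loop(
    persona: str,
    meeting_active: Callable[[], bool],
    get_meeting_input: Callable[[], Optional[str]],
    query_model: Callable[[str], str],     # hypothetical foundation-model call
    perform_action: Callable[[str], None],
) -> None:
    """Monitor the meeting; on input, query a model, perform the action, then resume monitoring (cf. Claims 11-12)."""
    state = LoopState()
    while meeting_active():
        meeting_input = get_meeting_input()  # None means nothing new to handle yet
        if meeting_input is None:
            continue  # stay in the monitoring state
        prompt = build_prompt(persona, meeting_input, state)
        action = query_model(prompt)            # request and receive model output
        perform_action(action)                  # e.g., speak or post text in the meeting
        state.transcript.append(meeting_input)  # send state information to the next iteration
        state.last_action = action
```

Updating the prompt with the prior action before a second query, as recited in Claim 16, would amount to also feeding state.last_action into build_prompt on the next pass.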

Prosecution Timeline

May 09, 2024
Application Filed
Nov 19, 2025
Non-Final Rejection — §101, §103
Feb 23, 2026
Response Filed
Mar 25, 2026
Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12598267
IMAGE CAPTURE APPARATUS AND CONTROL METHOD
2y 5m to grant Granted Apr 07, 2026
Patent 12598354
INFORMATION PROCESSING SERVER, RECORD CREATION SYSTEM, DISPLAY CONTROL METHOD, AND NON-TRANSITORY RECORDING MEDIUM
2y 5m to grant Granted Apr 07, 2026
Patent 12593004
DISPLAY METHOD, DISPLAY SYSTEM, AND NON-TRANSITORY COMPUTER-READABLE STORAGE MEDIUM STORING PROGRAM
2y 5m to grant Granted Mar 31, 2026
Patent 12556468
QUALITY TESTING OF COMMUNICATIONS FOR CONFERENCE CALL ENDPOINTS
2y 5m to grant Granted Feb 17, 2026
Patent 12556655
Efficient Detection of Co-Located Participant Devices in Teleconferencing Sessions
2y 5m to grant Granted Feb 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
83%
Grant Probability
99%
With Interview (+25.0%)
2y 10m
Median Time to Grant
Moderate
PTA Risk
Based on 24 resolved cases by this examiner. Grant probability derived from career allow rate.
