DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 4-5, 9-10, 13-14, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lebaredian et al. (US 2021/0358188) in view of Seligmann et al. (US 2014/0067936).
Regarding claim 1, Lebaredian teaches/suggests: One or more processors comprising processing circuitry (Lebaredian Fig. 4: CPU 406 and/or GPU 408) to:
receive, by one or more action servers that handle animation of gestures of an interactive agent (Lebaredian [0018] “The AI agent device(s) 102 may include a server”), one or more first interaction modeling events instructing one or more target states of one or more agent gestures (Lebaredian [0025] “The AI engine 112 of the AI agent device(s) 102 may process incoming textual, audio, and/or image data (e.g., multimodal data) to determine what is being communicated textually, audibly, and/or visually, and to determine whether a response or output is necessary by the AI agent, what response should be output where an output is determined, and/or how to output the response (e.g., to determine a tone, emotion, gesture, animation, etc. of the AI agent)” [The incoming multimodal data meet the interaction modeling events.]);
trigger, by the one or more action servers, presentation of a rendering of one or more animation states of the interactive agent corresponding to the one or more target states of the one or more agent gestures instructed by the one or more first interaction modeling events (Lebaredian [0061] “the virtual field of view including a graphical representation of the virtual agent within the virtual environment ... the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.”).
Lebaredian does not teach/suggest represented using an interaction categorization schema. Seligmann, however, teaches/suggests an interaction categorization schema (Seligmann [0037] “the system can generate a schema to define what interaction events should trigger a cue, how to describe the interaction events with the cues, what type of media to use for the cues, and/or how to deliver the cues”). Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to modify the AI engine of Lebaredian to include the schema of Seligmann in order to define which interaction events trigger which agent outputs and how those outputs are described.
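For illustration only — not part of either cited reference, the claims as filed, or the record — the combined teaching may be pictured as a small interaction categorization schema that maps incoming interaction events to categorized agent gestures. The following minimal Python sketch uses entirely hypothetical names and values:

```python
# Hypothetical sketch of an "interaction categorization schema" in the spirit of
# Seligmann [0037]: it defines which interaction events trigger which agent
# outputs and how those outputs are described. All names are illustrative only.
from dataclasses import dataclass


@dataclass
class AgentAction:
    action_type: str    # category of agent action, e.g. "gesture"
    target_state: str   # e.g. "start" or "stop"
    description: str    # natural language description of the gesture


# Schema mapping event categories to the categorized agent response
# (per the combination of Lebaredian's AI engine and Seligmann's schema).
SCHEMA = {
    "user_greeting": AgentAction("gesture", "start", "wave with right hand"),
    "user_question": AgentAction("gesture", "start", "thoughtful nod"),
}


def handle(event_name: str) -> AgentAction | None:
    """Return the categorized agent action triggered by an incoming event."""
    return SCHEMA.get(event_name)


if __name__ == "__main__":
    print(handle("user_greeting"))
```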
Regarding claim 4, Lebaredian as modified by Seligmann teaches/suggests: The one or more processors of claim 1, wherein the one or more first interaction modeling events comprise one or more fields that identify a supported action type categorizing the one or more agent gestures, the one or more target states of the one or more agent gestures, and a representation of the one or more agent gestures (Lebaredian [0025] “The AI engine 112 of the AI agent device(s) 102 may process incoming textual, audio, and/or image data (e.g., multimodal data) to determine what is being communicated textually, audibly, and/or visually, and to determine whether a response or output is necessary by the AI agent, what response should be output where an output is determined, and/or how to output the response (e.g., to determine a tone, emotion, gesture, animation, etc. of the AI agent)” Seligmann [0037] “the system can generate a schema to define what interaction events should trigger a cue, how to describe the interaction events with the cues, what type of media to use for the cues, and/or how to deliver the cues”). The same rationale to combine as set forth in the rejection of claim 1 above is incorporated herein.
Regarding claim 5, Lebaredian as modified by Seligmann teaches/suggests: The one or more processors of claim 1, wherein the one or more first interaction modeling events comprises a natural language description of the agent gesture (Lebaredian [0027] “the AI engine 112 may use natural language processing (NLP) techniques or one or more neural network model to ingest, decipher, perceive, and/or make sense of incoming audio data”). Lebaredian and Seligmann are silent regarding the description being generated using one or more large language models. However, the concept and advantages of a large language model are well known and expected in the art (Official Notice). It would have been obvious for the NLP techniques of Lebaredian as modified by Seligmann to include such a model to process the audio data and generate the natural language description.
Regarding claim 9, Lebaredian as modified by Seligmann teaches/suggests: The one or more processors of claim 1, wherein the one or more processors are comprised in at least one of:
a control system for an autonomous or semi-autonomous machine (Lebaredian [0015] “the AI agent(s) described herein may be implemented for … vehicle (e.g., autonomous, semi-autonomous, non-autonomous, etc.) applications (e.g., for in-vehicle controls, interactions, information, etc.)”);
a perception system for an autonomous or semi-autonomous machine (Lebaredian [0015] “the AI agent(s) described herein may be implemented for … vehicle (e.g., autonomous, semi-autonomous, non-autonomous, etc.) applications (e.g., for in-vehicle controls, interactions, information, etc.)”);
a system for performing simulation operations;
a system for performing digital twin operations;
a system for performing light transport simulation;
a system for performing collaborative content creation for 3D assets;
a system for performing deep learning operations;
a system for performing remote operations (Lebaredian [0015] “the AI agent(s) described herein may be implemented for video conferencing applications (e.g., to participate in conversation for answering questions, displaying information, etc.)”);
a system for performing real-time streaming (Lebaredian [0015] “the AI agent(s) described herein may be implemented for video conferencing applications (e.g., to participate in conversation for answering questions, displaying information, etc.)”);
a system for generating or presenting one or more of augmented reality content, virtual reality content, or mixed reality content (Lebaredian [0075] “used by the computing device 400 to render immersive augmented reality or virtual reality”);
a system implemented using an edge device (Lebaredian [0027] “The AI agent as managed by the AI engine 112 may be deployed within a cloud-based environment, in a data center, and/or at the edge”);
a system implemented using a robot;
a system for performing conversational AI operations (Lebaredian [0015] “the AI agent(s) described herein may be implemented for video conferencing applications (e.g., to participate in conversation for answering questions, displaying information, etc.)”);
a system implementing one or more language models (Lebaredian [0027] “the AI engine 112 may use natural language processing (NLP) techniques or one or more neural network model to ingest, decipher, perceive, and/or make sense of incoming audio data”);
a system implementing one or more large language models (LLMs);
a system implementing one or more vision language models (VLMs);
a system implementing one or more multimodal language models;
a system for generating synthetic data;
a system for generating synthetic data using AI;
a system incorporating one or more virtual machines (VMs);
a system implemented at least partially in a data center (Lebaredian [0027] “The AI agent as managed by the AI engine 112 may be deployed within a cloud-based environment, in a data center, and/or at the edge”); or
a system implemented at least partially using cloud computing resources (Lebaredian [0027] “The AI agent as managed by the AI engine 112 may be deployed within a cloud-based environment, in a data center, and/or at the edge”).
Regarding claim 10, Lebaredian as modified by Seligmann teaches/suggests: A system comprising one or more processors (Lebaredian Fig. 4: CPU 406 and/or GPU 408) to trigger, by one or more action servers (Lebaredian [0018] “The AI agent device(s) 102 may include a server”), based at least on one or more first interaction modeling events that instruct one or more target states of one or more agent movements represented using an interaction categorization schema (Lebaredian [0025] “The AI engine 112 of the AI agent device(s) 102 may process incoming textual, audio, and/or image data (e.g., multimodal data) to determine what is being communicated textually, audibly, and/or visually, and to determine whether a response or output is necessary by the AI agent, what response should be output where an output is determined, and/or how to output the response (e.g., to determine a tone, emotion, gesture, animation, etc. of the AI agent)” [The incoming multimodal data meet the interaction modeling events.] Seligmann [0037] “the system can generate a schema to define what interaction events should trigger a cue, how to describe the interaction events with the cues, what type of media to use for the cues, and/or how to deliver the cues”), one or more animation states corresponding to the one or more target states of the one or more agent movements (Lebaredian [0061] “the virtual field of view including a graphical representation of the virtual agent within the virtual environment ... the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.”). The same rationale to combine as set forth in the rejection of claim 1 above is incorporated herein.
Claims 13-14 and 18 recite limitation(s) similar in scope to those of claims 4-5 and 9, respectively, and are rejected for the same reason(s).
Claims 19 and 20 recite limitation(s) similar in scope to those of claims 10 and 18, respectively, and are rejected for the same reason(s).
Claims 2 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Lebaredian et al. (US 2021/0358188) in view of Seligmann et al. (US 2014/0067936) as applied to claims 1 and 10 above, and further in view of Peltier et al. (US 2024/0278116).
Regarding claim 2, Lebaredian as modified by Seligmann does not teach/suggest: The one or more processors of claim 1, wherein the processing circuitry is further to apply, by the one or more action servers, a modality policy that overrides one or more active gestures of the interactive agent with the one or more agent gestures instructed by the one or more first interaction modeling events, and resumes the one or more active gestures upon completion of the one or more agent gestures. Peltier, however, teaches/suggests resumes the one or more active gestures upon completion (Peltier [0178] “Having fulfilled the action of defining the identity associated with the footsteps, Frank returns his attention to the magazine and resumes his idle animation”). Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to modify the AI engine of Lebaredian as modified by Seligmann to include the animation of Peltier to represent the idle state of the AI agent.
As such, Lebaredian as modified by Seligmann and Peltier teaches/suggests apply, by the one or more action servers, a modality policy that overrides one or more active gestures of the interactive agent with the one or more agent gestures instructed by the one or more first interaction modeling events, and resumes the one or more active gestures upon completion of the one or more agent gestures (Lebaredian [0061] “the virtual field of view including a graphical representation of the virtual agent within the virtual environment ... the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.” Peltier [0178] “Having fulfilled the action of defining the identity associated with the footsteps, Frank returns his attention to the magazine and resumes his idle animation”). In the combination, the agent’s responsive gesture overriding the idle animation, followed by the resumption of that idle animation upon completion of the gesture, meets the claimed applying of the modality policy.
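For illustration only — not asserted to be the implementation of Lebaredian, Peltier, or the claims — one way such a modality policy could be realized in software is sketched below; the class and method names are hypothetical assumptions:

```python
# Hypothetical sketch of a "modality policy" that overrides an active gesture
# with a newly instructed gesture and resumes the prior gesture on completion,
# in the spirit of the claim 2 limitation read onto Peltier [0178] (idle
# animation interrupted, then resumed). Names are illustrative assumptions.
class GestureModalityPolicy:
    def __init__(self, idle_gesture: str) -> None:
        self.active = idle_gesture        # currently playing gesture/animation
        self.prior: str | None = None     # gesture paused by an override

    def override(self, new_gesture: str) -> None:
        """Pause the active gesture and play the instructed gesture instead."""
        self.prior, self.active = self.active, new_gesture

    def complete(self) -> None:
        """On completion of the overriding gesture, resume the prior gesture."""
        if self.prior is not None:
            self.active, self.prior = self.prior, None


if __name__ == "__main__":
    policy = GestureModalityPolicy("idle: reading magazine")
    policy.override("look toward footsteps")   # event instructs a new gesture
    policy.complete()                          # resumes the idle animation
    print(policy.active)                       # -> "idle: reading magazine"
```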
Claim 11 recites limitation(s) similar in scope to those of claim 2, and is rejected for the same reason(s).
Claims 3 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Lebaredian et al. (US 2021/0358188) in view of Seligmann et al. (US 2014/0067936) as applied to claims 1 and 10 above, and further in view of Ellis et al. (US 2021/0365174).
Regarding claim 3, Lebaredian as modified by Seligmann teaches/suggests: The one or more processors of claim 1, wherein the processing circuitry is further to manage, by the one or more action servers based at least on determining that the one or more target states instruct initiation of the one or more agent gestures, the one or more agent gestures instructed by corresponding interaction modeling events (Lebaredian [0025] “The AI engine 112 of the AI agent device(s) 102 may process incoming textual, audio, and/or image data (e.g., multimodal data) to determine what is being communicated textually, audibly, and/or visually, and to determine whether a response or output is necessary by the AI agent, what response should be output where an output is determined, and/or how to output the response (e.g., to determine a tone, emotion, gesture, animation, etc. of the AI agent)” [0061] “the virtual field of view including a graphical representation of the virtual agent within the virtual environment ... the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.”).
Lebaredian as modified by Seligmann does not teach/suggest in a stack of active agent gestures. Ellis, however, teaches/suggests a stack (Ellis [0281] “responsive to receiving user input 818, device 800 displays, over response affordance 816, second response affordance 819 including information about Member #1 to form a stack of response affordances”). Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to modify the AI engine of Lebaredian as modified by Seligmann to maintain the active gesture animations in a stack as taught/suggested by Ellis so that they can be readily retrieved and managed.
As such, Lebaredian as modified by Seligmann and Ellis teaches/suggests in a stack of active agent gestures (Lebaredian [0061] “the virtual field of view including a graphical representation of the virtual agent within the virtual environment ... the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.” Ellis [0281] “responsive to receiving user input 818, device 800 displays, over response affordance 816, second response affordance 819 including information about Member #1 to form a stack of response affordances”).
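For illustration only — not part of the cited references or the claims — the following minimal sketch shows one way an action server could manage active agent gestures in a stack; all identifiers are hypothetical assumptions:

```python
# Hypothetical sketch of an action server maintaining a stack of active agent
# gestures, in the spirit of the claim 3 limitation read in view of Ellis
# [0281] (items layered to form a stack). Names are illustrative assumptions.
class GestureStack:
    def __init__(self) -> None:
        self._stack: list[str] = []

    def on_event(self, target_state: str, gesture: str) -> None:
        """Push a gesture when an event instructs initiation; remove it on stop."""
        if target_state == "start":
            self._stack.append(gesture)
        elif target_state == "stop" and gesture in self._stack:
            self._stack.remove(gesture)

    def current(self) -> str | None:
        """The most recently initiated gesture sits on top of the stack."""
        return self._stack[-1] if self._stack else None


if __name__ == "__main__":
    server = GestureStack()
    server.on_event("start", "idle breathing")
    server.on_event("start", "wave with right hand")
    print(server.current())                     # -> "wave with right hand"
    server.on_event("stop", "wave with right hand")
    print(server.current())                     # -> "idle breathing"
```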
Claim 12 recites limitation(s) similar in scope to those of claim 3, and is rejected for the same reason(s).
Claims 6 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Lebaredian et al. (US 2021/0358188) in view of Seligmann et al. (US 2014/0067936) as applied to claims 1 and 10 above, and further in view of Jiao et al. (US 2021/0049158).
Regarding claim 6, Lebaredian as modified by Seligmann does not teach/suggest: The one or more processors of claim 1, wherein the one or more first interaction modeling events represents the one or more agent gestures using a natural language description of the one or more agent gestures as an argument of a type of agent action defined by the interaction categorization schema. Jiao, however, teaches/suggests using a natural language description as an argument (Jiao [0029] “the query converter 213 might use the description of a column in table to identify a natural language description of the data represented by the column ... These attributes and constraints could be inserted as arguments”). Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to modify the AI engine of Lebaredian as modified by Seligmann to include the natural language description as an argument as taught/suggested by Jiao to retrieve the gesture animations.
As such, Lebaredian as modified by Seligmann and Jiao teaches/suggests the one or more first interaction modeling events represents the one or more agent gestures using a natural language description of the one or more agent gestures as an argument of a type of agent action defined by the interaction categorization schema (Lebaredian [0025] “The AI engine 112 of the AI agent device(s) 102 may process incoming textual, audio, and/or image data (e.g., multimodal data) to determine what is being communicated textually, audibly, and/or visually, and to determine whether a response or output is necessary by the AI agent, what response should be output where an output is determined, and/or how to output the response (e.g., to determine a tone, emotion, gesture, animation, etc. of the AI agent)” [0027] “the AI engine 112 may use natural language processing (NLP) techniques or one or more neural network model to ingest, decipher, perceive, and/or make sense of incoming audio data” Seligmann [0037] “the system can generate a schema to define what interaction events should trigger a cue, how to describe the interaction events with the cues, what type of media to use for the cues, and/or how to deliver the cues” Jiao [0029] “the query converter 213 might use the description of a column in table to identify a natural language description of the data represented by the column ... These attributes and constraints could be inserted as arguments”).
Lebaredian, Seligmann, and Jiao are silent regarding a standardized type of agent action. However, the concept and advantages of a standardized type are well known and expected in the art (Official Notice). It would have been obvious for the schema of Lebaredian as modified by Seligmann and Jiao to define a standardized type of agent action so that agent actions are expressed in a consistent, predictable format.
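For illustration only — not drawn from any cited reference or from the claims — an interaction modeling event carrying a natural language gesture description as an argument of a standardized action type might look like the following sketch; the action name and field names are hypothetical assumptions:

```python
# Hypothetical sketch of an interaction modeling event that carries a natural
# language description of a gesture as an argument of a standardized action
# type, as the claim 6 limitation is read in view of Jiao [0029] (natural
# language descriptions inserted as arguments). Field names are assumptions.
import json

event = {
    "action": "StartGestureAction",   # standardized action type (assumed name)
    "target_state": "start",
    "arguments": {
        # natural language description of the agent gesture
        "gesture": "lean forward and nod attentively",
    },
}

# Serialized form as such an event might be exchanged between components.
print(json.dumps(event, indent=2))
```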
Claim 15 recites limitation(s) similar in scope to those of claim 6, and is rejected for the same reason(s).
Claims 7-8 and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Lebaredian et al. (US 2021/0358188) in view of Seligmann et al. (US 2014/0067936) as applied to claims 1 and 10 above, and further in view of Motohashi (US 2023/0177538).
Regarding claim 7, Lebaredian as modified by Seligmann does not teach/suggest: The one or more processors of claim 1, wherein the processing circuitry is further to identify one or more supported animation clips based at least on one or more natural language descriptions of the one or more agent gestures specified by the one or more first interaction modeling events. Motohashi, however, teaches/suggests identify one or more supported animation clips (Motohashi [0050] “the video search unit 120 searches for the video corresponding to the obtained search query” [0068] “even when the search query is inputted as the natural language, it is possible to properly calculate the similarity degree between the video and the search query”). Before the effective filing date of the claimed invention, it would have been obvious for one of ordinary skill in the art to modify the AI engine of Lebaredian as modified by Seligmann to include the search query of Motohashi to retrieve the gesture animations.
As such, Lebaredian as modified by Seligmann and Motohashi teaches/suggests identify one or more supported animation clips based at least on one or more natural language descriptions of the one or more agent gestures specified by the one or more first interaction modeling events (Lebaredian [0027] “the AI engine 112 may use natural language processing (NLP) techniques or one or more neural network model to ingest, decipher, perceive, and/or make sense of incoming audio data” [0061] “the virtual field of view including a graphical representation of the virtual agent within the virtual environment ... the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.” Motohashi [0050] “the video search unit 120 searches for the video corresponding to the obtained search query” [0068] “even when the search query is inputted as the natural language, it is possible to properly calculate the similarity degree between the video and the search query”).
Regarding claim 8, Lebaredian as modified by Seligmann does not teach/suggest: The one or more processors of claim 1, wherein the processing circuitry is further to identify one or more supported animation clips based at least on a measure of similarity between one or more natural language descriptions of the one or more agent gestures and the one or more supported animation clips. Motohashi, however, teaches/suggests identify one or more supported animation clips (Motohashi [0050] “the video search unit 120 searches for the video corresponding to the obtained search query” [0068] “even when the search query is inputted as the natural language, it is possible to properly calculate the similarity degree between the video and the search query”). The same rationale to combine as set forth in the rejection of claim 7 above is incorporated herein.
As such, Lebaredian as modified by Seligmann and Motohashi teaches/suggests identify one or more supported animation clips based at least on a measure of similarity between one or more natural language descriptions of the one or more agent gestures and the one or more supported animation clips (Lebaredian [0027] “the AI engine 112 may use natural language processing (NLP) techniques or one or more neural network model to ingest, decipher, perceive, and/or make sense of incoming audio data” [0061] “the virtual field of view including a graphical representation of the virtual agent within the virtual environment ... the AI agent may be represented as responding verbally and/or physically—e.g., via simulated gestures, postures, movements, actions, etc.” Motohashi [0050] “the video search unit 120 searches for the video corresponding to the obtained search query” [0068] “even when the search query is inputted as the natural language, it is possible to properly calculate the similarity degree between the video and the search query”).
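For illustration only — not asserted to be the method of Motohashi or of the claims — selecting a supported animation clip by a measure of similarity to a natural language gesture description could be sketched as follows; the bag-of-words similarity and all names are illustrative assumptions (a real system might instead use learned text or video embeddings):

```python
# Hypothetical sketch of selecting a supported animation clip by measuring the
# similarity between a natural language gesture description and the clips'
# descriptions, in the spirit of claims 7-8 read in view of Motohashi [0050],
# [0068]. The Jaccard word-overlap measure below is an illustrative assumption.
def similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (illustrative only)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


# Hypothetical catalog of supported animation clips and their descriptions.
SUPPORTED_CLIPS = {
    "clip_wave_right.anim": "wave with right hand",
    "clip_nod.anim": "nod head in agreement",
    "clip_shrug.anim": "shrug shoulders",
}


def find_clip(description: str) -> str:
    """Return the supported clip whose description is most similar."""
    return max(SUPPORTED_CLIPS, key=lambda c: similarity(description, SUPPORTED_CLIPS[c]))


if __name__ == "__main__":
    print(find_clip("wave warmly with the right hand"))  # -> clip_wave_right.anim
```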
Claims 16 and 17 recite limitation(s) similar in scope to those of claims 7 and 8, respectively, and are rejected for the same reason(s).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
US 2023/0145369 – multi-modal interaction
US 2024/0069859 – multi-modal functionality
US 2024/0205174 – context for interactive chatbot
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANH-TUAN V NGUYEN whose telephone number is 571-270-7513. The examiner can normally be reached on M-F 9AM-5PM ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, JASON CHAN can be reached on 571-272-3022. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ANH-TUAN V NGUYEN/
Primary Examiner, Art Unit 2619