DETAILED ACTION
Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
2. The information disclosure statement (IDS) submitted on 08/27/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 101
3. 35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
4. Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.
Claim 1 recites
“1. A method comprising:
generating, based at least on one or more first models processing user data, a first output representative of one or more attributes associated with one or more emotional states;
generating, based at least on one or more second models processing the first output and a prompt, a second output representative of a response to the prompt and one or more tags corresponding to a voice that is related to the one or more attributes;
generating, based at least on processing the second output, audio data representative of speech corresponding to the response and expressed using the voice; and
causing an output of the speech represented by the audio data.”
The limitations recited in Claim 1, as drafted, cover a mental process. More specifically, the underlying abstract idea revolves around what happens once a human looks at a user’s face to determine the user’s emotion, combines the content of the user’s request with the determined emotion to determine the content of a response and a tag (e.g., a tone) representative of the emotion in the response, and utters the response with the determined tone. The claim recites “one or more first models” and “one or more second models”. The model(s) are generic in nature and merely stand in for the human mind in an otherwise mental process.
Claim 6 recites
“6. A system comprising: one or more processors to:
generate, based at least on one or more language models processing first data representative of one or more emotional states and first text, second data representative of second text and information associated with a voice related to the one or more emotional states;
generate, based at least on the second data, audio data representative of speech corresponding to the second text and expressed using the voice; and
cause an output of the speech represented by the audio data.”
The limitations recited in Claim 6, as drafted, cover a mental process. More specifically, the underlying abstract idea revolves around what happens once a human looks at a user’s face to determine the user’s emotion, combines the content of the user’s request with the determined emotion to determine the content of a response and the tone representative of the emotion in the response, and utters the response with the determined tone. The claim recites “one or more language models”. The model(s) are generic in nature and merely stand in for the human mind in an otherwise mental process. The language model is used to generally apply the abstract idea without placing any limits on how the language model functions. Rather, these limitations only recite the outcome of “second data representative of second text and information associated with a voice related to the one or more emotional states” and do not include any details about how the second data is generated. See MPEP 2106.05(f).
Claim 18 recites
“18. One or more processors comprising:
processing circuitry to generate audio data representative of speech in a voice related to one or more emotional states, wherein the audio data is generated based at least on one or more language models processing first data associated with a user that provides a prompt associated with a response and second data representative of one or more attributes associated with a character that is to output the speech.”
The limitations recited in Claim 18, as drafted, cover a mental process. More specifically, the underlying abstract idea revolves around what happens once a human looks at a user’s face to determine the user’s emotion, combines the content of the user’s request with the determined emotion to determine the content of a response and the tone representative of the emotion in the response, and utters the response with the determined tone. The claim recites “one or more language models”. The model(s) are generic in nature and merely stand in for the human mind in an otherwise mental process. The language model is used to generally apply the abstract idea without placing any limits on how the language model functions. Rather, these limitations only recite the outcome of “audio data representative of speech in a voice related to one or more emotional states” and do not include any details about how the audio data is generated. See MPEP 2106.05(f).
The judicial exception is not integrated into a practical application. In particular, the claims recite the additional limitation of “one or more processors”. The additional element(s) or combination of elements in the claim(s), such as the processor, other than the abstract idea per se, amount(s) to no more than (i) mere instructions to implement the idea on a computer, and/or (ii) recitation of generic computer structure that serves to perform generic computer functions that are well-understood, routine, and conventional activities previously known to the pertinent industry. Viewed as a whole, these additional claim element(s) do not provide meaningful limitation(s) to transform the abstract idea into a patent-eligible application of the abstract idea such that the claim(s) amount to significantly more than the abstract idea itself. Therefore, the claim(s) are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter. There is further no improvement to the computing device other than uttering a response to the user based on the content of the user’s request and the user’s emotion. The mere recitation of a processor and/or the like is akin to adding the words “apply it” and/or “use it” with a computer in conjunction with the abstract idea.
Paragraph [0083] of the specification discloses “The CPU(s) 1006 may be configured to execute at least some of the computer-readable instructions to control one or more components of the computing device 1000 to perform one or more of the methods and/or processes described herein. The CPU(s) 1006 may each include one or more cores (e.g., one, two, four, eight, twenty-eight, seventy-two, etc.) that are capable of handling a multitude of software threads simultaneously. The CPU(s) 1006 may include any type of processor, and may include different types of processors depending on the type of computing device 1000 implemented (e.g., processors with fewer cores for mobile devices and processors with more cores for servers). For example, depending on the type of computing device 1000, the processor may be an Advanced RISC Machines (ARM) processor implemented using Reduced Instruction Set Computing (RISC) or an x86 processor implemented using Complex Instruction Set Computing (CISC). The computing device 1000 may include one or more CPUs 1006 in addition to one or more microprocessors or supplementary co-processors, such as math co-processors.”
As described in the specification, the computer is a general-purpose computer that is merely used as a tool to apply the abstract idea. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer amounts to no more than a generic computer. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claims are not patent eligible.
The dependent claims do not remedy the issues noted above. More specifically, Claim 2 recites generating an attribute associated with a character and generating the second output based on the attribute; this reads on a human determining an emotional attribute of the user based on the past conversation in order to determine a response to the user. Claim 2 recites first models and second models; the models are generic in nature and merely stand in for the human mind in an otherwise mental process. Claim 3 recites a mental process of obtaining character data (e.g., past conversation, current interaction with other people). Claim 4 recites a mental process of obtaining the previous question/request or the previous response. Claim 5 recites text data, audio data, video data, and image data; a human could determine an emotional attribute of the user by using the text data, audio data, video data, and/or image data. Claim 7 describes the first value and the second value of the emotional state; there are no additional limitations presented. Claim 8 recites describing a first emotional state and a second emotional state of the user; there are no additional limitations presented. Claim 9 recites using the one or more language models to output a speech response. The claim recites “one or more language models”; the models are generic in nature and merely stand in for the human mind in an otherwise mental process. The language model is used to generally apply the abstract idea without placing any limits on how the language model functions. Rather, these limitations only recite the outcome of “second data” and do not include any details about how the second data is generated. See MPEP 2106.05(f). Claim 10 recites a mental process of using labels to describe attributes and the intensity value. Claim 11 recites indicating that one or more emotional states are associated with the voice; there are no additional limitations presented. Claim 12 recites a mental process of using third data (e.g., an image or video) to determine the emotion of the user. Claim 13 defines the third data; there are no additional limitations presented. Claim 14 recites a mental process of outputting a response based on the content of the user’s input and the user’s emotion. Claim 15 recites using the emotion and the content of the previous turn in the conversation to determine a response; there are no additional limitations presented. Claim 16 recites determining an audio response based on the content of the user’s input and the emotion determined from the voice; there are no additional limitations presented. Claim 17 recites a list of systems to perform the process; the process uses a computer as a tool to implement a mental process. Claim 19 recites a mental process of determining the emotional state of the user; there are no additional limitations presented. Claim 20 recites a list of systems for performing the process; the process uses a computer as a tool to implement a mental process.
For at least the reasons provided above, claims 1-20 are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter.
Claim Rejections - 35 USC § 102
5. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
6. Claims 1-14 and 16-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Park et al. (US 2025/0200855 A1.)
With respect to Claim 1, Park et al. disclose
A method comprising:
generating, based at least on one or more first models processing user data, a first output representative of one or more attributes associated with one or more emotional states (Park et al. [0073] describes a model trained based on a machine-learning (ML)-based modeling method for detecting the facial area from a facial image; [0012] describes recognizing an emotion of the user from the facial image of the user);
generating, based at least on one or more second models processing the first output and a prompt, a second output representative of a response to the prompt and one or more tags corresponding to a voice that is related to the one or more attributes (Park et al. [0022 and 0094] describe receiving a user’s question (e.g., a prompt) and processing the received user input and the emotion of the user from the facial image of the user to generate a response having content in compliance with an emotion of the user by reflecting the multimodal recognition result with respect to the emotion of the user; [0080] describes determining one or more intensity values associated with the one or more labels (e.g., the valence of an emotion is a positive force of 7). See paragraphs [0091-0092 and 0096].);
generating, based at least on processing the second output, audio data representative of speech corresponding to the response and expressed using the voice (Park et al. [0094] a response with respect to the word of the user “hello” as a prompt sentence of the LLM may be generated, wherein an emotion of the response should be the arousal of A and the valence of B (A and B are the arousal and valence states of the user, estimated in the multimodal emotion recognition in B. of operation S305 above). The content of the word generated by the LLM may be converted into the voice through the TTS technique. Here, the tone and the manner of the voice of the virtual human may have to have the same arousal and valence as the content of the word generated by the LLM. When a gap occurs between the content and the arousal and the valence, a feeling of distance may occur between the content of the word and the voice expression, and thus, a hearer may feel it unnatural. For example, when the virtual human says “hello!” the tone and the manner must not be too dark or slow. Tagging the recognized emotion of the user into the response text to synthesize the audio response); and
causing an output of the speech represented by the audio data (Park et al. [0094] the converted voice data is delivered to the user through a sound output device (for example, a speaker)).
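For context only, the pipeline described in the mapping above (emotional attributes derived from user data, a language-model response conditioned on those attributes, and text-to-speech rendering in a matching voice) can be sketched as follows; every function, model, and parameter name is a hypothetical placeholder and is not taken from the application or from Park et al.

    # Hypothetical sketch of the claimed/cited pipeline; all names below are
    # placeholders and do not correspond to any code of record.
    def respond_with_emotion(user_image, prompt, emotion_model, language_model,
                             tts_model, speaker):
        # Step 1: one or more "first models" process user data into emotional
        # attributes (e.g., arousal/valence labels with intensity values).
        attributes = emotion_model.predict(user_image)   # e.g., {"valence": 7, "arousal": 3}

        # Step 2: one or more "second models" process the attributes and the prompt
        # into a response and one or more tags for a voice matching the attributes.
        llm_input = (f"User emotion: {attributes}. User says: {prompt}. "
                     f"Reply with content that matches the user's emotion.")
        response_text = language_model.generate(llm_input)
        voice_tags = {"tone": "bright" if attributes.get("valence", 0) > 5 else "calm"}

        # Step 3: TTS renders the response text as speech expressed using that voice.
        audio = tts_model.synthesize(response_text, voice=voice_tags)

        # Step 4: cause an output of the speech represented by the audio data.
        speaker.play(audio)
        return audio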
With respect to Claim 2, Park et al. disclose
further comprising:
generating, based at least on the one or more first models processing character data, a third output representative of one or more second attributes associated with a character (Park et al. [0016] recognize an emotion inherent in a voice of the user by analyzing text from the voice-text conversation unit),
wherein the generating of the second output is further based at least on the one or more second models processing the third output (Park et al. [0094] using LLM to generate a response based on the content of the user’s input and the recognized emotion.)
With respect to Claim 3, Park et al. disclose
further comprising:
obtaining character data representative of one or more second attributes associated with a character that is to output the speech (Park et al. [0016] recognize an emotion inherent in a voice of the user by analyzing text from the voice-text conversation unit),
wherein the generating of the second output is further based at least on the one or more second models processing the character data (Park et al. [0094] using LLM to generate a response based on the content of the user’s input and the recognized emotion.)
With respect to Claim 4, Park et al. disclose
further comprising:
obtaining data representative of at least one of one or more previous prompts (Park et al. [0047] describes the database storing personal information and previous input from the user) or one or more previous responses associated with the one or more previous prompts,
wherein the generating the first output is further based at least on the one or more first models processing the data representative of the at least one of the one or more previous prompts or the one or more previous responses (Park et al. [0047] describes conversation based on the previous input from the user.)
With respect to Claim 5, Park et al. disclose
wherein the user data comprises at least one of:
text data representative of text describing one or more second emotional states associated with a user;
audio data representative of user speech corresponding to the user (Park et al. [0077] receives the user’s utterance via a microphone);
video data representative of one or more videos corresponding to the user; or
image data representative of one or more images corresponding to the user.
With respect to Claim 6, Park et al. disclose
A system comprising: one or more processors to:
generate, based at least on one or more language models processing first data representative of one or more emotional states and first text (Park et al. [0042] a large language model (LLM) recognizes an emotion of the user through comprehensive analysis (multimodal emotion recognition) of a facial expression (an image) of the user and conversation content (text) spoken by the user during the conversation), second data representative of second text and information associated with a voice related to the one or more emotional states (Park et al. [0042] based on a recognition result value, shows an expression of an emotional state similar to the emotion of the user and performs an emotional conversation in compliance with the emotion state of the user, thereby empathizing with the user, [0080] using LLM to estimate which emotional state is indicated by the content of the word of the user, [0094] Response text of the virtual human generated based on the LLM may be converted to a voice of a human being by using the TTS technique, and the converted voice data may be delivered to the user through a sound output device (for example, a speaker). When the response is generated through the LLM, the LLM may be instructed to generate the response having the content in compliance with an emotion of the user by reflecting the multimodal recognition result with respect to the emotion of the user);
generate, based at least on the second data, audio data representative of speech corresponding to the second text and expressed using the voice (Park et al. [0094] using the TTS technique to convert the response generated by the LLM to a voice of a human); and
cause an output of the speech represented by the audio data (Park et al. [0094] the converted voice data is delivered to the user through a sound output device (for example, a speaker)).
With respect to Claim 7, Park et al. disclose
wherein:
the one or more emotional states includes at least a first emotional state associated with outputting a response corresponding to the second text and a second emotional state associated with outputting the response corresponding to the second text (Park et al. [0082] Based on the analyzed final emotion result value, an expression of the virtual human may be manipulated and the sentiment of the word delivered by the virtual human to the user may be manipulated to express an emotion similar to an emotional state of the user. Here, the manipulation of the expression of the virtual human may be gradually performed, and a response of the virtual human may be delivered at a certain point in time at which the expression manipulation is performed); and
the first data is further representative of a first value associated with the first emotional state and a second value associated with the second emotional state (Park et al. [0082] When the user utterance is ended in operation S304, an emotion recognition result (value) analyzed based on text data and an emotion recognition result analyzed based on a facial image collected during the utterance may be combined to analyze a final emotion result of the user.)
With respect to Claim 8, Park et al. disclose
wherein: the one or more emotional states includes at least a first emotional state associated with a user and a second emotional state associated with the user (Park et al. [0023] describes recognize an emotion of the user by analyzing the image data and the voice data); and
the first data is further representative of a first value associated with the first emotional state and a second value associated with the second emotional state (Park et al. [0051] describes presenting the values of the emotion-analysis result.)
With respect to Claim 9, Park et al. disclose
wherein the second data is further generated based at least on the one or more language models processing third data representative of one or more attributes associated with a character that is to output the speech (Park et al. [0042] using the language model to generate a response based on the conversation content (text) spoken by the user during the conversation.)
With respect to Claim 10, Park et al. disclose
wherein the second data is representative of at least one of:
one or more labels describing the one or more attributes associated with the character (Park et al. [0080] describes labels); or
one or more intensity values associated with the one or more labels (Park et al. [0080] using one or more intensity values associated with the one or more labels (e.g., the valence of an emotion is a positive force of 7). See paragraphs [0091-0092 and 0096].)
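As a purely illustrative aside (hypothetical names, not from the record), data representative of one or more labels with associated intensity values, of the kind cited above, could be modeled as a simple mapping from emotion labels to intensities:

    # Hypothetical label/intensity representation; the second label and the
    # derivation of a voice tag are assumptions for illustration only.
    emotion_attributes = {
        "valence": 7,   # e.g., a positive force of 7, as in the cited example
        "arousal": 3,   # an assumed second label with its own intensity value
    }
    voice_tag = "bright" if emotion_attributes["valence"] > 5 else "neutral"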
With respect to Claim 11, Park et al. disclose
wherein the information associated with the voice related to the one or more emotional states comprises at least one of:
one or more second emotional states associated with the voice (Park et al. [0016] describes a voice-text conversation unit and an emotion recognition unit configured to recognize an emotion inherent in a voice of the user by analyzing text from the voice-text conversation);
one or more first values associated with the one or more second emotional states;
one or more voice characteristics associated with the voice; or
one or more second values associated with the one or more voice characteristics.
With respect to Claim 12, Park et al. disclose
wherein the one or more processors are further to:
obtain third data associated with a user (Park et al. [0080] describes obtaining the content of the word of the user in the conversation); and
generate, based at least on the one or more language models processing the third data, the first data representative of the one or more emotional states (Park et al. [0080] using LLM to estimate which emotional state is indicated by the content of the word of the user.)
With respect to Claim 13, Park et al. disclose
wherein the third data comprises at least one of:
text data representative of text describing one or more second emotional states associated with the user (Park et al. [0016] describes the emotion obtained from the text);
audio data representative of user speech corresponding to the user (Park et al. [0016] describes recognize an emotion inherent in a voice of the user in the conversation); or
image data representative of one or more images corresponding to the user (Park et al. [0018] describes analyze the facial image from the terminal to recognize an emotion of the user in the image).
With respect to Claim 14, Park et al. disclose
wherein the one or more processors are further to:
generate, based at least on the one or more language models processing third data representative of one or more second emotional states and third text, fourth data representative of fourth text and second information associated with a second voice related to the one or more second emotional states (Park et al. Fig. 3 shows that the process is a loop and is repeated; the system processes the next user input to obtain the next content of the user input (e.g., third text) and the emotion of the user to generate a response in compliance with the emotional state of the user);
generate, based at least on the fourth data, second audio data representative of second speech corresponding to the fourth text and expressed using the second voice (Park et al. [0094] using TTS to synthesize the audio response); and
cause a second output of the second speech represented by the second audio data (Park et al. [0094] describes output the synthesized response via a speaker.)
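Illustratively (placeholder names only, not code of record), the looped operation cited from Fig. 3 corresponds to repeating the same generate-and-speak steps on each conversational turn, which is how the further (third/fourth) data recited by the claim would arise on later turns:

    # Hypothetical turn-by-turn loop; respond_with_emotion is the placeholder
    # pipeline sketched earlier in this action, not code of record.
    def run_conversation(turns, emotion_model, language_model, tts_model, speaker):
        for user_image, prompt in turns:
            respond_with_emotion(user_image, prompt, emotion_model,
                                 language_model, tts_model, speaker)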
With respect to Claim 16, Park et al. disclose
wherein:
one or more first language models of the one or more language models generate the first data representative of the one or more emotional states (Park et al. [0042] a large language model (LLM) recognizes an emotion of the user through comprehensive analysis (multimodal emotion recognition) of a facial expression (an image) of the user and conversation content (text) spoken by the user during the conversation);
one or more second language models of the one or more language models generate the second data representative of the second text and the information associated with the voice related to the one or more emotional states (Park et al. [0042] based on a recognition result value, shows an expression of an emotional state similar to the emotion of the user and performs an emotional conversation in compliance with the emotion state of the user, thereby empathizing with the user, [0080] using LLM to estimate which emotional state is indicated by the content of the word of the user, [0094] Response text of the virtual human generated based on the LLM); and
one or more third language models of the one or more language models generate the audio data representative of the speech (Park et al. [0094] Response text of the virtual human generated based on the LLM may be converted to a voice of a human being by using the TTS technique, and the converted voice data may be delivered to the user through a sound output device (for example, a speaker). When the response is generated through the LLM, the LLM may be instructed to generate the response having the content in compliance with an emotion of the user by reflecting the multimodal recognition result with respect to the emotion of the user.)
With respect to Claim 17, Park et al. disclose
wherein the system is comprised in at least one of:
a control system for an autonomous or semi-autonomous machine;
a perception system for an autonomous or semi-autonomous machine;
a system for performing one or more simulation operations;
a system for performing one or more digital twin operations;
a system for performing light transport simulation;
a system for performing collaborative content creation for 3D assets;
a system for performing one or more deep learning operations;
a system implemented using an edge device;
a system implemented using a robot;
a system for performing one or more generative AI operations;
a system for performing operations using one or more large language models (LLMs) (Park et al. [0042] describes a large language model);
a system for performing operations using one or more vision language models (VLMs);
a system for performing one or more conversational AI operations;
a system for generating synthetic data;
a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content;
a system incorporating one or more virtual machines (VMs);
a system implemented at least partially in a data center; or
a system implemented at least partially using cloud computing resources.
With respect to Claim 18, Park et al. disclose
One or more processors comprising:
processing circuitry to generate audio data representative of speech in a voice related to one or more emotional states, wherein the audio data is generated based at least on one or more language models processing first data associated with a user that provides a prompt associated with a response and second data representative of one or more attributes associated with a character that is to output the speech (Park et al. [0042] based on a recognition result value, shows an expression of an emotional state similar to the emotion of the user and performs an emotional conversation in compliance with the emotion state of the user, thereby empathizing with the user, [0080] using LLM to estimate which emotional state is indicated by the content of the word of the user in the voice conversation, [0094] Response text of the virtual human generated based on the LLM may be converted to a voice of a human being by using the TTS technique, and the converted voice data may be delivered to the user through a sound output device (for example, a speaker). When the response is generated through the LLM, the LLM may be instructed to generate the response having the content in compliance with an emotion of the user by reflecting the multimodal recognition result with respect to the emotion of the user.)
With respect to Claim 19, Park et al. disclose
wherein the processing circuitry is further to:
generate, based at least on the one or more language models processing at least one of the first data or the second data, third data representative of the one or more emotional states or one or more second emotional states associated with the user, wherein the audio data is generated based at least on the one or more language models further processing the second data and the third data (Park et al. [0042] based on a recognition result value, shows an expression of an emotional state similar to the emotion of the user and performs an emotional conversation in compliance with the emotion state of the user, thereby empathizing with the user, [0080] using LLM to estimate which emotional state is indicated by the content of the word of the user, [0094] Response text of the virtual human generated based on the LLM may be converted to a voice of a human being by using the TTS technique, and the converted voice data may be delivered to the user through a sound output device (for example, a speaker). When the response is generated through the LLM, the LLM may be instructed to generate the response having the content in compliance with an emotion of the user by reflecting the multimodal recognition result with respect to the emotion of the user.)
With respect to Claim 20, Park et al. disclose
wherein the one or more processors are comprised in at least one of:
a control system for an autonomous or semi-autonomous machine;
a perception system for an autonomous or semi-autonomous machine;
a system for performing one or more simulation operations;
a system for performing one or more digital twin operations;
a system for performing light transport simulation;
a system for performing collaborative content creation for 3D assets;
a system for performing one or more deep learning operations;
a system implemented using an edge device;
a system implemented using a robot;
a system for performing one or more generative AI operations;
a system for performing operations using one or more large language models (LLMs) (Park et al. [0042] describes a large language model);
a system for performing operations using one or more vision language models (VLMs);
a system for performing one or more conversational AI operations;
a system for generating synthetic data;
a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content;
a system incorporating one or more virtual machines (VMs);
a system implemented at least partially in a data center; or
a system implemented at least partially using cloud computing resources.
Claim Rejections - 35 USC § 103
7. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
8. Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Park et al. (US 2025/0200855 A1) and Strubbe et al. (US 6,795,808 B1.)
With respect to Claim 15, Park et al. disclose all the limitations of Claim 14 upon which Claim 15 depends. Park et al. fail to explicitly teach
wherein the one or more processors are further to generate, based at least on the one or more language models processing fifth data associated with a user, the first text, and the second text, the third data representative of the one or more second emotional states.
However, Strubbe et al. teach
wherein the one or more processors are further to generate, based at least on the one or more language models processing fifth data associated with a user, the first text, and the second text, the third data representative of the one or more second emotional states (Strubbe et al. Fig. 7 element 290 Mood/personality classifier, col. 30 lines 14-34, col. 26 lines 64-67, and col. 27 lines 1-12 detecting a user’s mood based on content of the previous conversation, and signal from the video image classifier 240.)
Park et al. and Strubbe et al. are analogous art because they are from a similar field of endeavor in speech processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of generating a response to the user’s input based on the content of the user’s input and the user’s emotion, as taught by Park et al., using the teaching of the content of the previous conversation and the video image, as taught by Strubbe et al., for the benefit of classifying the user’s mood (Strubbe et al. Fig. 7 element 290 Mood/personality classifier, col. 30 lines 14-34, col. 26 lines 64-67, and col. 27 lines 1-12, detecting a user’s mood based on the content of the previous conversation and a signal from the video image classifier 240.)
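For illustration only (hypothetical names and weighting, not drawn from Strubbe et al. or Park et al.), the combination relied upon above amounts to fusing a text-derived mood estimate from the previous conversation with a video-derived mood estimate before generating the response:

    # Hypothetical fusion of a text-based and a video-based mood score; the
    # default equal weighting is an assumption, not taken from either reference.
    def fuse_mood(text_mood_score: float, video_mood_score: float,
                  text_weight: float = 0.5) -> float:
        return text_weight * text_mood_score + (1.0 - text_weight) * video_mood_score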
Conclusion
9. The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. See PTO-892.
a. Bonar et al. (US 2024/0169974 A1.) In this reference, Bonar et al. disclose a large language model that selects an appropriate sentiment to attach to the text response based on the sentiment determined from the user input.
b. Jayaraman et al. (US 2024/0163232 A1.) In this reference, Jayaraman et al. disclose using the recorded user sentiment to determine future chat bot responses.
c. Wang (US 2024/0096329 A1.) In this reference, Wang discloses adjusting the emotion of the response to be displayed using the cloned character voice model based on the detected emotion of the user.
10. Any inquiry concerning this communication or earlier communications from the examiner should be directed to THUYKHANH LE whose telephone number is (571)272-6429. The examiner can normally be reached Mon-Fri: 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C. Flanders can be reached on 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/THUYKHANH LE/Primary Examiner, Art Unit 2655