Prosecution Insights
Last updated: April 19, 2026
Application No. 18/521,310

EXPRESSING EMOTION IN SPEECH FOR CONVERSATIONAL AI SYSTEMS AND APPLICATIONS

Current Status: Non-Final Office Action (§103)
Filed: Nov 28, 2023
Examiner: HSU, JONI
Art Unit: 2611
Tech Center: 2600 — Communications
Assignee: Nvidia Corporation
OA Round: 3 (Non-Final)
Grant Probability: 87% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 9m
Grant Probability with Interview: 95%

Examiner Intelligence

Career Allow Rate: 87%, above average (741 granted / 848 resolved; +25.4% vs TC avg)
Interview Lift: +7.2% (moderate), measured across resolved cases with an interview
Typical Timeline: 2y 9m average prosecution
Career History: 882 total applications across all art units; 34 currently pending

Statute-Specific Performance

§101: 8.4% (-31.6% vs TC avg)
§103: 59.7% (+19.7% vs TC avg)
§102: 11.4% (-28.6% vs TC avg)
§112: 3.1% (-36.9% vs TC avg)
Tech Center averages are estimates. Based on career data from 848 resolved cases.
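For readers who want to sanity-check the cards above, the displayed figures reduce to simple ratios. The following is a minimal Python sketch (an editor's illustration; the variable names and the back-computed Tech Center averages are assumptions, not fields from the analytics provider):

```python
# Editor's sketch, not the analytics provider's code: reproduces the simple ratios
# behind the cards above from the figures they display.  Names are assumptions.

granted, resolved = 741, 848
career_allow_rate = granted / resolved                 # 0.874 -> shown as 87%

# "vs TC avg" deltas as displayed; the Tech Center average estimate is simply
# back-computed as the examiner figure minus the delta (e.g. 59.7% - 19.7% = 40.0%).
statute_rate = {"§101": 0.084, "§103": 0.597, "§102": 0.114, "§112": 0.031}
delta_vs_tc  = {"§101": -0.316, "§103": 0.197, "§102": -0.286, "§112": -0.369}
tc_average   = {s: statute_rate[s] - delta_vs_tc[s] for s in statute_rate}

print(f"career allow rate: {career_allow_rate:.1%}")
for s in statute_rate:
    print(f"{s}: examiner {statute_rate[s]:.1%} vs TC avg estimate {tc_average[s]:.1%}")
```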

Office Action (§103)

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on January 26, 2026 has been entered.

Response to Arguments

Applicant's arguments filed January 26, 2026 with respect to Claims 1-13 and 17-22 have been fully considered but they are not persuasive.

As per Claim 1, Applicant argues that Lembersky (US 20190095775A1) describes using a facial recognition API to determine a mood of a user. However, Lembersky does not teach applying data representing the user's converted text and the answer to a machine learning model. Lembersky does not even teach applying data representing the user's converted text and the answer to the AI engine. Therefore, Lembersky does not teach that a machine learning model processes such data to determine an emotional state associated with the answer. Lembersky does not even teach that the AI engine processes the user's converted text and the answer to determine the specific emotion. Rather, in Lembersky, the specific emotion is based on the mood determined using facial recognition API (p. 14-15).

In reply, the Examiner points out that Lembersky describes "determine an emotion of the user based on the aforementioned audio user input (e.g., the words themselves and/or how they are spoken) or visual user input (e.g., accomplished via facial recognition API)" [0101]. Lembersky describes "determine one or more avatar characteristics based on the one or both of the audio user input and the visual user input of the user. For instance, in various embodiments, the device can be configured to modify features of an avatar that is to be presented to the user based on the user(s) themselves, such as how to respond, with what emotion to display, with what words to say, with what tone to speak" [0102]. Thus, Lembersky teaches that in one embodiment, the emotion could be determined by using facial recognition API. However, that is only one of the embodiments. Lembersky teaches in another embodiment, the emotion is determined based only on the words themselves. Lembersky teaches the user's converted text (speech) is then passed to an AI engine 112 to determine a proper response 114 to the user (e.g., an answer to a question and specific emotion) [0022]. Thus, the first text (user's converted text) is applied to a machine learning model (AI engine 112). Since the AI engine 112 determines the second text (answer to a question) and specific emotion [0022], this means that the AI engine 112 processes the second text (answer to a question) in order to generate specific emotion (emotional state associated with the second text).
Thus, Lembersky teaches applying, to a machine learning model (AI engine 112), first data representative of at least the first text (user's converted text) and the second text (answer to a question); generating, based at least on the machine learning model (AI engine 112) processing the first data (user's converted text and answer to a question), second data representative of an emotional state associated with the second text (specific emotion) [0022].

As per Claim 10, Applicant argues that the combination of Lembersky and Chen (US 20240005905A1) does not teach one or more prompts that cause one or more language models to generate one or more values corresponding to one or more speech characteristics. Therefore, the combination of Lembersky and Chen does not teach one or more language models that process such one or more prompts and/or generate data representing the one or more values from the one or more prompts. Consequently, the combination of Lembersky and Chen does not teach obtaining first input data representative of one or more prompts that cause one or more language modules to generate one or more values corresponding to one or more speech characteristics; generating, based at least on the one or more language models processing first input data and the second input data, output data representative of an emotional state associated with the second text and the one or more values corresponding to the one or more speech characteristics associated with the second text (p. 17, last paragraph-p. 18, 1st paragraph).

In reply, the Examiner points out that Lembersky teaches obtaining first input data representative of one or more prompts (user's converted text (speech)); generating, based at least on the one or more language models (AI engine 112) processing first input data (user's converted text (speech)) and the second input data (answer to a question), output data representative of an emotional state associated with the second text (specific emotion) [0022], as discussed above. Chen teaches obtaining first input data representative of one or more prompts that cause one or more language modules to generate one or more values corresponding to one or more speech characteristics; generating, based at least on one or more language models processing first input data, output data representative of an emotional state associated with the text and the one or more values corresponding to the one or more speech characteristics associated with the text (acoustic model, an optimal emotion intensity of the sample text input may be considered, and a mel spectrum closest to the optimal emotion intensity may be selected, which makes the emotion intensity of generated speech more reasonable and more in line with an actual need, [0085], emotion intensity extraction model 490 may be a machine learning model for determining the sample emotion intensity, [0198], sample may include the sample text input corresponding to multiple languages, so that the trained acoustic model may be capable of processing text input in multiple languages, [0068], generating a prediction speech corresponding to the text input based on the prediction mel spectrum, [0014]).
Since Lembersky teaches generating, based at least on the one or more language models (AI engine 112) processing first input data (user's converted text (speech)) and the second input data (answer to a question), output data representative of an emotional state associated with the second text (specific emotion) [0022], this teaching of one or more values corresponding to the one or more speech characteristics from Chen can be implemented on the first text and the second text of Lembersky so that it obtains first input data representative of one or more prompts that cause one or more language modules to generate one or more values corresponding to one or more speech characteristics; generating, based at least on one or more language models processing first input data and the second input data, output data representative of an emotional state associated with the second text and the one or more values corresponding to the one or more speech characteristics associated with the second text.

As to Claim 23, new grounds of rejection are made in view of Wu (US 20180189857A1).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim(s) 1, 3, 6, 9-12, 17-20, and 22 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lembersky (US 20190095775A1) and Chen (US 20240005905A1).
As per Claim 1, Lembersky teaches a method comprising: receiving input data representative of first text corresponding to a query (receive user input indicative of a user’s speech through an audio processor (speech-to-text), user’s converted text (speech), [0022]); determining, based at least on the first text corresponding to the query, second text (114) corresponding to a response to the query for output by a character (user’s converted text (speech) may then be passed to an AI engine 112 to determine a proper response 114 to the user (an answer to a question), which results in the proper text and emotional response being sent to a processor, which then translates the responsive text back to synthesized speech 118, and also triggers visual display blend shapes to morph a face of the AI character into a proper facial expression to convey the appropriate emotional response and mouth movement for the response, [0022]). Lembersky describes “determine an emotion of the user based on the aforementioned audio user input (e.g., the words themselves and/or how they are spoken) or visual user input (e.g., accomplished via facial recognition API)” [0101]. Lembersky describes “determine one or more avatar characteristics based on the one or both of the audio user input and the visual user input of the user. For instance, in various embodiments, the device can be configured to modify features of an avatar that is to be presented to the user based on the user(s) themselves, such as how to respond, with what emotion to display, with what words to say, with what tone to speak” [0102]. Thus, Lembersky teaches that in one embodiment, emotion could be determined by using facial recognition API. However, that is only one of the embodiments. Lembersky teaches in another embodiment, emotion is determined based only on words themselves. Lembersky teaches user’s converted text (speech) is then passed to AI engine 112 to determine a proper response 114 to the user (e.g., an answer to a question and specific emotion) [0022]. Thus, the first text (user’s converted text) is applied to machine learning model (AI engine 112). Since AI engine 112 determines the second text (answer to a question) and specific emotion [0022], this means that AI engine 112 processes the second text (answer to a question) in order to generate specific emotion (emotional state associated with the second text). Thus, Lembersky teaches applying, to a machine learning model (AI engine 112), first data representative of at least the first text (user’s converted text) and the second text (answer to a question); generating, based at least on the machine learning model (AI engine 112) processing the first data (user’s converted text and answer to a question), second data representative of an emotional state associated with the second text (specific emotion) [0022]; generating, using one or more text-to-speech models and based at least on the second data, audio data (118) representative of the speech that is associated with the second text and based at least on the emotional state (which results in the proper text and emotional response being sent to a processor, which then translates the responsive text back to synthesized speech 118, [0022]); and causing the character to be animated using at least the speech (triggers visual display blend shapes to morph a face of the AI character or avatar into a proper facial expression to convey the appropriate emotional response and mouth movement (lip synching) for the response, [0022]). 
However, Lembersky does not teach generating, based at least on the machine learning model processing the first data, one or more variables associated with at least one of the emotional state or speech associated with the second text; the speech is based on the one or more variables. However, Chen teaches generating, based at least on the machine learning model processing data representative of the text, second data representative of an emotional state associated with text and one or more variables associated with at least one of the emotional state or speech associated with the text; generating, using one or more text-to-speech models and based at least on the second data, audio data representative of the speech that is associated with the text and based at least on the emotional state and the one or more variables (acoustic model, an optimal emotion intensity of the sample text input may be considered, and a mel spectrum closest to the optimal emotion intensity may be selected, which makes the emotion intensity of generated speech more reasonable and more in line with an actual need, [0085], emotion intensity extraction model 490 may be a machine learning model for determining the sample emotion intensity, [0198], sample may include the sample text input corresponding to multiple languages, so that the trained acoustic model may be capable of processing text input in multiple languages, [0068], generating a prediction speech corresponding to the text input based on the prediction mel spectrum, [0014]). Since Lembersky teaches generating, based at least on the machine learning model processing the first data, second data representative of an emotional state associated with the second text [0022, 0005]; generating, using one or more text-to-speech models and based at least on the second data, audio data (118) representative of the speech that is associated with the second text and based at least on the emotional state [0022], this teaching of one or more variables associated with at least one of the emotional state or speech from Chen can be implemented on the first text and the second text of Lembersky so that it generates, based at least on the machine learning model processing the first data, one or more variables associated with at least one of the emotional state or speech associated with the second text; the speech is based on the one or more variables. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lembersky to include generating, based at least on the machine learning model processing the first data, one or more variables associated with at least one of the emotional state or speech associated with the second text; the speech is based on the one or more variables because Chen suggests that this improves the naturalness and emotion richness of a synthesized speech [0003]. As per Claim 3, Lembersky does not teach wherein: the one or more variables include at least an intensity associated with the emotional state; and the second data further represents a value associated with the intensity. However, Chen teaches wherein: the one or more variables include at least an intensity associated with the emotional state; and the second data further represents a value associated with the intensity ([0085], emotion intensity may be quantified as a scale of 0-10, and a greater scale indicates a stronger corresponding emotion intensity, [0072]). 
This would be obvious for the reasons given in the rejection for Claim 1. As per Claim 6, Lembersky teaches wherein the generating the second data is further based at least on the machine learning model processing: third data associated with the character, the third data representative of at least one of one or more characteristics associated with the character, one or more situations associated with the character, or one or more interactions associated with the character, or one or more past communications associated with the character ([0005], virtual assistant can learn more about the user and make appropriate responses based on their past experiences, collect a historical activity database, the sentiment from the user using facial recognition, and stores this in their emotional history in the database of emotions and responses for a particular user, machine learning tools and techniques may then be used to improve the virtual assistant’s responses based on the user’s past experiences, the user will then be able to receive personalized greetings and suggestions, [0031], based on the user’s profile the virtual assistant can be a targeted personal advertisement directed at the user from the stores, for example, the virtual assistant could suggest salad place to eat based on John’s information and given an excited look and encouraged tone to stay on the diet, [0032]). As per Claim 9, Lembersky teaches wherein: the second text includes one or more words; and the speech includes the one or more words spoken using the emotional state [0022]. However, Lembersky does not teach the speech includes the one or more words spoken based at least on the one or more variables. However, Chen teaches the speech includes the one or more words spoken based at least on the one or more variables [0085, 0198, 0068]. This would be obvious for the reasons given in the rejection for Claim 1. 
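To make the mapping above easier to follow, the pipeline the Office Action attributes to Lembersky [0022] (speech-to-text producing the first text, an AI engine 112 that returns both an answer and a specific emotion, emotion-conditioned text-to-speech 118, and a trigger for the character's facial animation) can be pictured with a minimal sketch. This is an illustrative sketch only; every name below is a hypothetical placeholder and none of it is taken from Lembersky, Chen, or the application.

```python
# Illustrative sketch only: hypothetical placeholders for the pipeline the Office Action
# describes from Lembersky [0022]; not code from any cited reference or the application.
from dataclasses import dataclass

@dataclass
class EngineResponse:
    answer_text: str     # "second text": the response to the user's query
    emotion: str         # emotional state associated with the response
    intensity: float     # an additional variable tied to the emotion (the Chen-style value)

def speech_to_text(audio: bytes) -> str:
    """Converts the user's speech to the 'first text'."""
    raise NotImplementedError  # placeholder

def ai_engine(first_text: str) -> EngineResponse:
    """Stands in for AI engine 112: returns the answer together with an emotion."""
    raise NotImplementedError  # placeholder

def text_to_speech(text: str, emotion: str, intensity: float) -> bytes:
    """Emotion-conditioned TTS, standing in for synthesized speech 118."""
    raise NotImplementedError  # placeholder

def animate_character(speech_audio: bytes, emotion: str) -> None:
    """Triggers blend shapes / lip sync for the character's facial expression."""
    raise NotImplementedError  # placeholder

def handle_turn(user_audio: bytes) -> None:
    first_text = speech_to_text(user_audio)
    response = ai_engine(first_text)          # answer text + emotion (+ intensity)
    audio = text_to_speech(response.answer_text, response.emotion, response.intensity)
    animate_character(audio, response.emotion)
```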
As per Claim 10, Lembersky teaches a system comprising: one or more processors to: obtain first input data representative of one or more prompts (user’s converted text (speech)); generate, based at least on first text (user’s converted text (speech)) associated with a query, second input data representative of second text (answer to a question) corresponding to one or more words to be output by a character; generate, based at least on the one or more language models (AI engine 112) processing first input data (user’s converted text (speech)) and the second input data (answer to a question), output data representative of an emotional state associated with the second text (specific emotion); and generate, based on at least a portion of the second input data (answer to a question) and the output data (specific emotion), audio data (118) representative of speech that is based at least on the emotional state (specific emotion) (AI character system 100, receive user input data indicative of a user’s speech 102 through an audio processor 104 (e.g., speech-to-text), the user’s converted text (speech) may then be passed to an AI engine 112 to determine a proper response 114 to the user (e.g., an answer to a question and specific emotion), which results in the proper text and emotional response being sent to a processor 116, which then translates the responsive text back to synthesized speech 118, and also triggers visual display blend shapes to morph a face of the AI character into a proper facial expression to convey the appropriate emotional response and mouth movement for the response, if the user is upset, the character might respond in a calm voice, [0022], natural language processing gives machines the ability to read and understand human language, such as for question answering, [0005]), as discussed in the rejection for Claim 1. However, Lembersky does not teach obtaining the first input data representative of the one or more prompts that cause one or more language models to generate one or more values corresponding to one or more speech characteristics; generate, based at least on the one or more language models processing the first input data and the second input data, the output data representative of the one or more values corresponding to the one or more speech characteristics associated with the second text; and generate, based on at least the portion of the second input data and the output data, the audio data representative of the speech that is based at least on the one or more values corresponding to the one or more speech characteristics. 
However, Chen teaches obtaining the first input data representative of the one or more prompts that cause one or more language models to generate one or more values corresponding to one or more speech characteristics; generate, based at least on the one or more language models processing the first input data, the output data representative of the one or more values corresponding to the one or more speech characteristics associated with the text; and generate, based on at least the portion of the input data and the output data, the audio data representative of the speech that is based at least on the one or more values corresponding to the one or more speech characteristics (acoustic model, an optimal emotion intensity of the sample text input may be considered, and a mel spectrum closest to the optimal emotion intensity may be selected, which makes the emotion intensity of generated speech more reasonable and more in line with an actual need, [0085], emotion intensity extraction model 490 may be a machine learning model for determining the sample emotion intensity, [0198], sample may include the sample text input corresponding to multiple languages, so that the trained acoustic model may be capable of processing text input in multiple languages, [0068], generating a prediction speech corresponding to the text input based on the prediction mel spectrum, [0014]). Since Lembersky teaches generating, based at least on the one or more language models processing first input data and the second input data, output data representative of an emotional state associated with the second text; and generate, based on at least a portion of the second input data and the output data, audio data representative of speech that is based at least on the emotional state, as discussed above, this teaching of one or more values corresponding to one or more speech characteristics from Chen can be implemented on the first text and the second text of Lembersky so that it obtains the first input data representative of the one or more prompts that cause one or more language models to generate one or more values corresponding to one or more speech characteristics; generate, based at least on the one or more language models processing the first input data and the second input data, the output data representative of the one or more values corresponding to the one or more speech characteristics associated with the second text; and generate, based on at least the portion of the second input data and the output data, the audio data representative of the speech that is based at least on the one or more values corresponding to the one or more speech characteristics. This would be obvious for the reasons given in the rejection for Claim 1. As per Claim 11, Lembersky teaches wherein at least one of: the generation of the second input data uses the one or more language models; or the generation of the second input data uses one or more second language models [0022, 0005]. As per Claim 12, Claim 12 is similar in scope to Claim 3, and therefore is rejected under the same rationale. As per Claim 17, Claim 17 is similar in scope to Claim 1, and therefore is rejected under the same rationale. 
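Claim 10 turns on prompts that cause a language model to emit values for speech characteristics along with the emotional state. Purely as an illustration of what that claim language describes (the prompt wording and the JSON schema below are hypothetical and are not taken from Chen, Lembersky, or the application), such an exchange might be structured as follows:

```python
# Hypothetical illustration of "prompts that cause one or more language models to
# generate one or more values corresponding to one or more speech characteristics".
# Nothing here is taken from the cited references or the application.
import json

PROMPT_TEMPLATE = (
    "Given the user's message and the assistant's reply, return JSON with keys "
    "'emotion', 'intensity' (0-10), 'rate', 'pitch', and 'volume' for speaking the reply.\n"
    "User: {first_text}\nAssistant reply: {second_text}"
)

def build_prompt(first_text: str, second_text: str) -> str:
    """First input data: a prompt built from the query text and the response text."""
    return PROMPT_TEMPLATE.format(first_text=first_text, second_text=second_text)

def parse_speech_values(llm_output: str) -> dict:
    """Output data: the emotional state plus values for the speech characteristics."""
    values = json.loads(llm_output)
    assert {"emotion", "intensity", "rate", "pitch", "volume"} <= values.keys()
    return values

# Example of the kind of model output the prompt is meant to elicit:
example_output = '{"emotion": "reassuring", "intensity": 4, "rate": 0.9, "pitch": 1.0, "volume": 0.8}'
print(build_prompt("Where is gate B12?", "Right this way, I will walk you there."))
print(parse_speech_values(example_output))
```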
As per Claim 18, Lembersky teaches wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing one or more deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing one or more generative AI operations; a system for performing operations using one or more large language models (LLMs); a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources [0022]. As per Claims 19-20, these claims are similar in scope to Claims 3 and 18 respectively, and therefore are rejected under the same rationale. As per Claim 22, Lembersky teaches wherein the generation of the second audio data uses one or more speech-to-text models that process the second data [0022]. Claim(s) 2 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lembersky (US 20190095775A1) and Chen (US 20240005905A1) in view of Hussen Abdelaziz (US 20210248804A1). Lembersky and Chen are relied upon for the teachings as discussed above relative to Claim 1. Lembersky teaches wherein the determining the second text corresponding to the response to the query comprises generating, using one or more language models and based at least on the first text corresponding to the query, second data representative of the second text corresponding to the response to the query, as discussed in the rejection for Claim 1. However, Lembersky and Chen do not teach generating, using the one or more language models, third data representative of the second text. 
However, Hussen Abdelaziz teaches generating, using one or more language models and based at least on the text, third data representative of the text (emotional state determined based on punctuation) (neural network 810 determines the emotional state based on received text 802, neural network 810 determines whether received text 802 or a representation of received text 802 includes one or more words that indicate a particular emotional state, when text 802 includes “Happy Birthday!” neural network 810 may determine that text 802 includes the word “happy” and thus, that text 802 indicates an emotional state of happy, as another example, when text 802 includes “I fell and hurt myself” neural network 810 determines that text 802 includes the word “hurt” and thus, indicates an emotional state of sad or worried, [0272], neural network 810 determines whether text 802 or a representation of received text 802 include punctuation that indicates a particular emotional state, when text 802 includes “Happy Birthday!”, neural network 810 determines that text 802 indicates an emotional state of happy based on the use of an exclamation point, neural network 810 determines an emotional state based on words and punctuation included in text 802, neural network 810 may determine that the presence of an exclamation point in “Happy Birthday!” could indicate an emotional state of happy or angry, by determining that text 802 also includes the word “happy” neural network 810 may determine that text 802 indicates an emotional state of happy and not angry, [0273]). Since Lembersky teaches wherein the determining the second text corresponding to the response to the query comprises generating, using one or more language models and based at least on the first text corresponding to the query, second data representative of the second text corresponding to the response to the query, as discussed in the rejection for Claim 1, this teaching of third data from Hussen Abdelaziz can be implemented on the first text corresponding to the query and the second text corresponding to the response to the query from Lembersky so that it generates, using one or more language models, third data representative of the second text. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lembersky and Chen to include generating, using one or more language models, third data representative of the second text because Hussen Abdelaziz suggests that it can further determine the emotional state based on punctuation, in order to more accurately determine the emotional state [0272-0273]. Claim(s) 4, 5, and 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lembersky (US 20190095775A1) and Chen (US 20240005905A1) in view of van der Meulen (US010360716B1). As per Claim 4, Lembersky and Chen are relied upon for the teachings as discussed above relative to Claim 1. However, Lembersky and Chen do not teach wherein: the one or more variables include one or more characteristics associated with the speech, the one or more characteristics including at least one of a volume, a rate, a pitch, or an emphasis associated with the speech; and the second data further represents one or more values associated with the one or more characteristics. 
However, van der Meulen teaches wherein: the one or more variables include one or more characteristics associated with the speech, the one or more characteristics including at least one of a volume, a rate, a pitch, or an emphasis associated with the speech; and the second data further represents one or more values associated with the one or more characteristics (message analyzer 208 may receive messages or other text and determine characteristics, through analysis, which may be used to enhance animation of an avatar that speaks the text, col. 5, lines 42-45; message analyzer 208 may markup text to indicate special audio features and/or special visual features, message analyzer 208 may identify information in the text, as well as a context of the message, message analyzer 208 may determine a speed, a pitch, a volume, and other attributes of speech, which may be included in the audio features and/or visual features, col. 8, lines 26-29, 42-45). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lembersky and Chen so the one or more variables include one or more characteristics associated with the speech, the one or more characteristics including at least one of volume, rate, a pitch, or an emphasis associated with the speech; and the second data further represents one or more values associated with the one or more characteristics because van der Meulen suggests that for example if the message analyzer determines that the text has the emotion of surprise, then the speech rate is increased, since a person usually has an increased speech rate when they are surprised, and thus the speech that is output sounds more like when a real person is speaking when they are surprised (col. 9, line 61-col. 10, line 7). As per Claim 5, Lembersky does not teach wherein: the one or more variables include at least an intensity level associated with the emotional state; the second data further represents a first value associated with the intensity level; and the generating the audio data representative of the speech comprises generating, using the one or more text-to-speech models and based at least one the emotional state and the first value, the audio data such that the speech expresses the emotional state using the intensity level. However, Chen teaches wherein: the one or more variables include at least an intensity level associated with the emotional state; the second data further represents a first value associated with the intensity level; and the generating the audio data representative of the speech comprises generating, using the one or more text-to-speech models and based at least one the emotional state and the first value, the audio data such that the speech expresses the emotional state using the intensity level [0072, 0085, 0198, 0014]. This would be obvious for the reasons given in the rejection for Claim 1. However, Lembersky and Chen do not teach wherein: the one or more variables include one or more characteristics associated with the speech; the second data further represents one or more second values associated with one or more levels of the one or more characteristics; and the generating the audio data representative of the speech comprises generating, based on the one or more second values, the audio data such that the speech expresses the emotional state using the one or more levels of the one or more characteristics. 
However, van der Meulen teaches wherein: the one or more variables include one or more characteristics associated with the speech; the second data further represents one or more second values associated with one or more levels of the one or more characteristics (col. 5, lines 42-45, col. 8, lines 26-29, 42-45); and the generating the audio data representative of the speech comprises generating, based on the one or more second values, the audio data such that the speech expresses the emotional state using the one or more levels of the one or more characteristics (ASML module 212 may create an audio indicator of <lower pitch for “step 2”>, which may be inserted into an output directed to the audio processor, col. 9, lines 20-23; ASML module 212 may create an audio indicator of “<speech rate +2>did you hear the news”, which may be inserted into an output directed to the audio processor, col. 10, lines 4-7; information that indicates an emotion of an originator of the message may be identified from text, punctuation, formatting, emoticons, word choice, and other information in the message or about the message and may be associated with one or more audio features, col. 11, lines 43-48). This would be obvious for reasons given in rejection for Claim 4. As per Claim 13, Claim 13 is similar in scope to Claim 4, and therefore is rejected under the same rationale. Claim(s) 7-8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lembersky (US 20190095775A1) and Chen (US 20240005905A1) in view of Yang 1 (US 20200035215A1). As per Claim 7, Lembersky and Chen are relied upon for the teachings as discussed above relative to Claim 1. However, Lembersky does not teach the first data further represents one or more first values associated with the one or more variables. However, Chen teaches wherein: the first data further represents one or more first values associated with the one or more variables [0085, 0198]. This would be obvious for the reasons given in the rejection for Claim 1. However, Lembersky and Chen do not teach the method further comprises generating, using the machine learning model and based at least on third data representative of one or more inputs, fourth data representative of a second emotional state associated with the second text and one or more second values associated with the one or more variables. 
However, Yang 1 teaches the method further comprises generating, using the machine learning model and based at least on third data representative of one or more inputs (previous sentence), fourth data representative of a second emotional state (second emotion vector) associated with the second text (second text is the current sentence, first text is the previous sentence) and one or more second values associated with the one or more variables (entire weight applied to “love”) (plurality of sentences in received text data and different second emotion vectors may be set for the current sentence “Where are you?”, although weight “1” may be applied to the emotion item “neutral” when determination is carried out only using the current sentence “Where are you?”, a larger weight can be applied to the emotion item “love” or “happy” for the current sentence “Where are you” when the previous sentence “I miss you” is considered through context analysis, the entire weight is applied to “love”, [0288], emotion vector can be generated through DNN learning for situation explanation, deep learning model is used for learning with respect to emotion expression on the basis of situation explanation, [0250]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lembersky and Chen to include generating, using the machine learning model and based at least on third data representative of one or more inputs, fourth data representative of a second emotional state associated with the second text and one or more second values associated with the one or more variables because Yang 1 suggests that this way, the context of the sentence can be determined from the previous sentence, and the context is used to more accurately determine the emotion of the sentence [0288]. As per Claim 8, Lembersky teaches the second data is associated with a first portion of the second text; and causing the character to be animated using the first speech [0022]. It would be obvious that after this first speech, there is second speech, and the character is animated using at least the second speech [0022]. Thus, Lembersky teaches causing the character to be animated using at least the second speech [0022]. However, Lembersky does not teach the second data further represents one or more first values for the one or more variables. However, Chen teaches wherein: the second data further represents one or more first values for the one or more variables [0085, 0198]. This would be obvious for the reasons given in the rejection for Claim 1. However, Lembersky and Chen do not teach the method further comprises: generating, using the machine learning model and based at least on the first data, third data associated with a second portion of the second text, the third data representative of a second emotional state and one or more second values associated with the one or more variables; generating, based at least on the third data, second audio data representative of second speech associated with the second portion of the second text, the second speech being based at least on the second emotional state and the one or more second values associated with the one or more variables; and causing the character to be animated using at least the second speech. However, Yang 1 teaches a plurality of sentences in received text data and different second emotion vectors are set for the current sentence “Where are you?”. 
Although weight “1” is applied to the emotion item “neutral” when determination is carried out only using the current sentence “Where are you?”, a larger weight is applied to the emotion item “love” or “happy” for the current sentence “Where are you” when the previous sentence “I miss you” is considered through context analysis. The entire weight is applied to “love” [0288]. It would be obvious that the context analysis was performed on the previous sentence “I miss you” before it was performed on the current sentence “Where are you?” Thus, the context analysis was performed on the previous sentence “I miss you”, and it generated a weight for the emotion item for it. Thus, Yang 1 teaches wherein: the second data is associated with a first portion of the text (previous sentence “I miss you”) and further represents one or more first values for the one or more variables (weight for the emotion item); the method further comprises: generating, using the machine learning model and based at least on the first data, third data associated with a second portion of the text (current sentence “Where are you?”), the third data representative of a second emotional state (second emotion vector) and one or more second values associated with the one or more variables (entire weight applied to “love”) [0288, 0250]; generating, based at least on the third data, second audio data representative of second speech associated with the second portion of the text, the second speech being based at least on the second emotional state and the one or more second values associated with the one or more variables ([0288], speech synthesis apparatus may generate second metadata corresponding to the second emotion information corresponding to the sum of the first emotion vector and the second emotion vector, [0279], the generated second metadata may be transmitted to the speech synthesis engine, and the speech synthesis engine may add the second metadata to speech synthesis target text of the received data to perform speech synthesis, [0280], speech synthesis for outputting lively speech, [0006]). Since Lembersky teaches the second text [0022], this teaching from Yang 1 can be implemented on the second text of Lembersky so that it generates, using the machine learning model and based at least on the first data, third data associated with a second portion of the second text, the third data representative of a second emotional state and one or more second values associated with the one or more variables; generating, based at least on the third data, second audio data representative of second speech associated with the second portion of the second text, the second speech being based at least on the second emotional state and the one or more second values associated with the one or more variables; and causing the character to be animated using at least the second speech. This would be obvious for the reasons given in the rejection for Claim 7. Claim(s) 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lembersky (US 20190095775A1) and Chen (US 20240005905A1) in view of Beaver (US011822888B2). Lembersky and Chen are relied upon for teachings relative to Claim 1. Lembersky teaches second data representative of an emotional state associated with the second text [0022]. However, Lembersky does not teach the one or more variables. However, Chen teaches the one or more variables [0085, 0198, 0068], as discussed in the rejection for Claim 1. 
However, Lembersky and Chen do not expressly teach wherein the second data represents at least one or more tokens associated with the second text, one or more first tags that indicate the emotional state associated with the second text, and one or more second tags associated with the one or more variables. However, Beaver teaches wherein the second data represents at least one or more tokens associated with the text (receives the query in the form of a string of text, preprocesses the string by identifying tokens within the string, tokenizing the string of text, mapping tokens from the original string of text to vocab items, col. 16, lines 18-35), one or more first tags that indicate the emotional state associated with the text (tag Express Emotion is used for emotional statement, col. 9, lines 22-23). Since Lembersky teaches second data representative of an emotional state associated with the second text [0022], this teaching from Beaver of the tokens and the tags can be implemented on the second text of Lembersky so that the second data represents at least one or more tokens associated with the second text, one or more first tags that indicate the emotional state associated with the second text. Since Chen teaches the one or more variables [0085, 0198, 0068], this teaching of using tags to indicate data from Beaver can be implemented on the one or more variables of Chen so that one or more second tags are associated with one or more variables. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lembersky and Chen so that the second data represents at least one or more tokens associated with the second text, one or more first tags that indicate the emotional state associated with the second text, and one or more second tags associated with the one or more variables as suggested by Beaver. It is well-known in the art that tokens are the basic units that text is broken down into for processing by computers, especially in AI and natural language processing. It is well-known in the art to use tags so that the processor can efficiently determine the correct instruction to be processed. Claim(s) 23 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lembersky (US 20190095775A1) and Chen (US 20240005905A1) in view of Wu (US 20180189857A1). Lembersky and Chen are relied upon for the teachings as discussed above relative to Claim 1. 
Lembersky teaches determining one or more characteristics associated with the character, the one or more characteristics including personal habits of the character, wherein the first data further represents the one or more characteristics (analyze sentiment of the user, and may correspondingly adjust the response, tone of the AI character, for instance, a child lost in a mall, and based on detecting a worried child, may appear as a calming and concerned cartoon character, if an adult user approaches the system, and the user is correlated to a user that frequents the athletic store in the mall, [0029], machine learning tools 122 may then be used to improve the virtual assistant's responses based on the user's past experiences such as shopping and dining habits, the user will then be able to receive personalized greetings and suggestions, [0031], the virtual assistant may then suggest lighter food if the user John set in his preferences to help him to watch after his diet, the virtual assistant could suggest salad place to eat based on John's information and give encouraged tone to stay on the diet, [0032]).

However, Lembersky and Chen do not expressly teach the one or more characteristics including at least one of a profession of the character, a relationship associated with the character, or one or more personal traits of the character. However, Wu teaches the one or more characteristics including at least one of a profession of the character, a relationship associated with the character, or one or more personal traits of the character (user query and profile modeling for providing tailored product recommendations that take into account user traits, examples of user traits include: gender, age, country affiliation, among others, [0018], conversational AI environment 100 for providing product and service recommendations, conversational AI environment 100 includes user profile 108, conversational AI systems may provide conversational responses to user input through chat bots which may be associated with businesses and the like, recommendations may be provided to a user, [0032]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lembersky and Chen so that the one or more characteristics include at least one of a profession of the character, a relationship associated with the character, or one or more personal traits of the character because Wu suggests that this way, the conversational AI system can provide conversation responses that are even more personalized and tailored for the user [0018, 0032].

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONI HSU whose telephone number is (571)272-7785. The examiner can normally be reached M-F 10am-6:30pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kee Tung can be reached at (571)272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

JH
/JONI HSU/
Primary Examiner, Art Unit 2611

Prosecution Timeline

Nov 28, 2023: Application Filed
Jun 13, 2025: Non-Final Rejection (§103)
Aug 26, 2025: Response Filed
Nov 20, 2025: Final Rejection (§103)
Jan 26, 2026: Request for Continued Examination
Jan 30, 2026: Response after Non-Final Action
Feb 10, 2026: Non-Final Rejection (§103), current office action

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592028: METHODS AND DEVICES FOR IMMERSING A USER IN AN IMMERSIVE SCENE AND FOR PROCESSING 3D OBJECTS (2y 5m to grant; granted Mar 31, 2026)
Patent 12586306: METHOD, ELECTRONIC DEVICE, AND COMPUTER PROGRAM PRODUCT FOR MODELING OBJECT (2y 5m to grant; granted Mar 24, 2026)
Patent 12586260: CREATING IMAGE ENHANCEMENT TRAINING DATA PAIRS (2y 5m to grant; granted Mar 24, 2026)
Patent 12581168: A METHOD FOR A MEDIA FILE GENERATING AND A METHOD FOR A MEDIA FILE PROCESSING (2y 5m to grant; granted Mar 17, 2026)
Patent 12561850: IMAGE GENERATION WITH LEGIBLE SCENE TEXT (2y 5m to grant; granted Feb 24, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 87%
With Interview: 95% (+7.2%)
Median Time to Grant: 2y 9m
PTA Risk: High
Based on 848 resolved cases by this examiner. Grant probability derived from career allow rate.
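The projection card appears to combine the career allow rate with the interview lift. A minimal sketch of that arithmetic, assuming the published figures are the only inputs (the provider's actual model is not disclosed):

```python
# Editor's sketch of the projection arithmetic; the provider's actual model is not disclosed.
base_grant_probability = 0.87   # career allow rate, used as the baseline grant probability
interview_lift = 0.072          # the "+7.2% interview lift" figure

# The page does not say whether the lift is additive (percentage points) or relative;
# both readings come close to, but do not exactly reproduce, the displayed 95%.
print(f"additive: {base_grant_probability + interview_lift:.1%}")        # 94.2%
print(f"relative: {base_grant_probability * (1 + interview_lift):.1%}")  # 93.3%
```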
