Prosecution Insights
Last updated: April 19, 2026
Application No. 18/736,331

VOICE ASSISTANT SYSTEM BASED ON PARALINGUISTIC ELEMENT OF INPUT SPEECH

Non-Final OA: §101, §103
Filed: Jun 06, 2024
Examiner: SIRJANI, FARIBA
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)

Grant Probability: 76% (Favorable)
OA Rounds: 1-2
To Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 76% (414 granted / 547 resolved); +13.7% vs TC avg; above average
Interview Lift: +31.0% (strong), comparing allowance for resolved cases with and without an examiner interview
Typical Timeline: 2y 10m average prosecution; 31 applications currently pending
Career History: 578 total applications across all art units
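As a quick consistency check, the headline figures above reconcile with one another. The short sketch below redoes the arithmetic; the tool's exact rounding rules are an assumption and the variable names are illustrative.

# Reconstructing the examiner statistics shown above from the raw counts.
granted = 414                                  # granted applications
resolved = 547                                 # resolved applications (granted plus abandoned)
pending = 31                                   # currently pending applications

career_allow_rate = granted / resolved         # 0.7568..., displayed as 76%
total_applications = resolved + pending        # 578, matching "Total Applications"

print(f"Career allow rate: {career_allow_rate:.1%}")   # -> 75.7%
print(f"Total applications: {total_applications}")     # -> 578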

Statute-Specific Performance

§101: 14.1% (-25.9% vs TC avg)
§103: 49.1% (+9.1% vs TC avg)
§102: 14.7% (-25.3% vs TC avg)
§112: 10.7% (-29.3% vs TC avg)
Deltas are measured against a Tech Center average estimate; based on career data from 547 resolved cases.
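Assuming each "vs TC avg" delta is a simple percentage-point difference (the page does not state the formula), the Tech Center baseline can be recovered from the figures above; notably, all four statutes imply the same estimate of roughly 40%.

# Assumed reconstruction of the Tech Center average estimate from the stated deltas.
statute_rates = {"101": (14.1, -25.9), "103": (49.1, +9.1),
                 "102": (14.7, -25.3), "112": (10.7, -29.3)}

for statute, (examiner_pct, delta_pp) in statute_rates.items():
    tc_avg_estimate = examiner_pct - delta_pp          # percentage points
    print(f"§{statute}: examiner {examiner_pct:.1f}%, TC avg estimate {tc_avg_estimate:.1f}%")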

Office Action

Rejections: §101, §103
Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . DETAILED ACTION Claims 1-20 are pending. Claims 1, 16, and 20 are independent. This Application was published as U.S. 20250378830. Apparent priority: 6 June 2024. The PCT examiner at the EPO found Claim 14 novel and inventive. This Claim goes outside the trend of the other Claims and applies a mixing function /gated function / mixture of experts (MoEs) to the emotion detection of the instant Application. 35 U.S.C. 112(f) Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph: An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: “prompt generator,” “expression generator,” “avatar generator,” and “expression controller” in Claims 1-15. These limitations are generic in the context of the art and don’t refer to any specific structure and only serve as placeholders for the structure that performs the associated function(s) without providing any information about what that structure is. MPEP 2181 I A says: For a term to be considered a substitute for "means," and lack sufficient structure for performing the function, it must serve as a generic placeholder and thus not limit the scope of the claim to any specific manner or structure for performing the claimed function. It is important to remember that there are no absolutes in the determination of terms used as a substitute for "means" that serve as generic placeholders. The examiner must carefully consider the term in light of the specification and the commonly accepted meaning in the technological art. Every application will turn on its own facts. Based on the ordinary skill in the art and description of functions of these components in the Specification, they refer to processors or a combination of processor and memory and possibly transducers such as microphones and displays or to a combination of software and hardware. PLEASE NOTE: This is NOT a rejection. Please don’t address it as a rejection. If the Applicant does not agree with the INTERPRETATION, the Applicant may argue or amend to replace the terms interpreted under 112(f) with structural terms such as “microphone” or “processor” as appropriately supported by the Specification. 
In the alternative, the Applicant may let the interpretation stand if the intent was to include a means plus function limitation in the Claim. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 101 35 U.S.C. 
101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. Claims 1, 16, and 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more. The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception. Step 1: The independent Claims are directed to statutory categories: Claim 1 is a system claim and directed to the machine or manufacture category of patentable subject matter. Claim 16 is a computer-readable-storage device claim and is directed to the machine or manufacture category of patentable subject matter. Claim 20 is a method claim and directed to the process category of patentable subject matter. Step 2A, Prong One: Does the Claim recite a Judicially Recognized Exception? Abstract Idea? Are these Claims nevertheless considered Abstract as a Mathematical Concept (mathematical relationships, mathematical formulas or equations, mathematical calculations), Mental Process (concepts performed in the human mind (including an observation, evaluation, judgment, opinion), or Certain Methods of Organizing Human Activity (1-fundamental economic principles or practices (including hedging, insurance, mitigating risk), 2-commercial or legal interactions (including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; business relations), 3- managing personal behavior or relationships or interactions between people (including social activities, teaching, and following rules or instructions) and fall under the judicial exception to patentable subject matter?) The rejected Claims recite Mental Processes or Methods of Organizing Human Activity. Step 2A, Prong Two: Additional Elements that Integrate the Judicial Exception into a Practical Application? Identifying whether there are any additional elements recited in the claim beyond the judicial exception(s), and evaluating those additional elements to determine whether they integrate the exception into a practical application of the exception. “Integration into a practical application” requires an additional element(s) or a combination of additional elements in the claim to apply, rely on, or use the judicial exception in a manner that imposes a meaningful limit on the judicial exception, such that the claim is more than a drafting effort designed to monopolize the exception. Uses the considerations laid out by the Supreme Court and the Federal Circuit to evaluate whether the judicial exception is integrated into a practical application. The rejected Claims do not include additional limitations that point to integration of the abstract idea into a practical application and are therefore directed to the abstract idea. Claim 1 is a generic automation of a mental process of listening to another person and forming a question based on what they say and then inputting it to an LLM like Claude or CHATGPT and receiving the response. 1. A voice assistant system, comprising: one or more processing devices configured to: [Additional elements of 2B but generic and wrc elements.] receive input audio data that corresponds to an input speech; and [Jack listens to Jill speak.] 
process the input audio data to obtain an input text of the input speech and an input paralinguistic element of the input speech; and [Jack takes notes that include mentions to the paralinguistic elements.] a large language processing subsystem configured to: [Additional elements of 2B but generic and wrc elements.] generate a response based on the input text and the input paralinguistic element of the input speech, [Jack feed a question to an LLM including the comments he took down from what Jill said.] wherein the one or more processing devices are further configured to: convert the response into output audio data that corresponds to an output speech. [Jack receives the response from the LLM in the form of audio.] Step 2B: Search for Inventive Concept: Additional Elements Do not amount to Significantly More: The limitations of processing devices and large language processing subsystems are well-understood, routine, and conventional machine components that are being used for their well-understood, routine, conventional and rather generic functions. Additionally, these limitations are expressed parenthetically and lack nexus to the Claim language and as such are a separable and divisible mention to a machine. Accordingly, they are not sufficient to cause the Claim as a whole to amount to significantly more than the underlying abstract idea. Independent Claims 16 and 20 are counterparts of Claim 1 and do not add any additional elements that may be helpful and are also abstract. The dependent claims are not abstract. Dependent Claims 2 and 7 and 12 are not abstract and remaining of the dependents of Claim 1 refer back to one of these claims. Of the dependents of Claim 16, Claims 17 and 19 are not abstract and Claim 18 is borderline and will not be considered abstract. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 1-9, 11, 15-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Brueckner (U.S. 20230395078) in view of Datta (U.S. 20250201233). Regarding Claim 1, Brueckner teaches: 1. A voice assistant system, comprising: one or more processing devices configured to: [Brueckner, Figure 1, “processor 106.”] receive input audio data that corresponds to an input speech; and [Brueckner, Figure 1, “audio sensor 120.” Figure 3, “receive signals from audio sensor 302.”] process the input audio data to obtain an input text of the input speech [Brueckner, Figure 3, “detect wake-up word? 304” which conducts an ASR which is usually done by first converting the audio signals into text. “[0026] … The wake-up word detector 202 may perform speech recognition on their received acoustic signals to identify whether a wake-up word was spoken. For instance, the wake-up word detector 202 may include an ASR model 204 configured to analyze the sound signals to determine whether that sound includes utterance of a wake-up word. 
…” “[0019]… The signal from the audio sensor 120 may facilitate speech recognition from audio received via the audio sensor 120 according to grammar associated with available commands and voice prompt generation.”] and an input paralinguistic element of the input speech; and [Brueckner, Figure 3, “306: detect emotion” based on the audio signals. “[0039] At operation 306, the emotion-aware processing system 130 detects emotion in the audio signals. In an example, the emotion-aware processing system 130 may utilize the emotion detector 206 to identify the emotion of the same audio signals detected as including the WuW….” WuW=wakeup word.] a large language processing subsystem configured to: generate a response based on the input text and the input paralinguistic element of the input speech, [Brueckner, Figure 3, “provide voice control session … according to detected emotion 312.” “[0048] At operation 312, the emotion-aware processing system 130 performs a voice control session configured according to the detected emotion….” The Paralinguistic element is emotion in this reference and the “Answer Prompt” of the reference is the “response” of the Claim. “[0013] The voice assistant system 100 may include one or more processors 106 configured to perform certain instructions, commands and other routines as described herein….” “[0014] The processors 106 may execute instructions for certain applications, including navigation, infotainment, climate control, etc….”] wherein the one or more processing devices are further configured to: convert the response into output audio data that corresponds to an output speech. [Brueckner, Figure 3, this step would occur after 312 and in response to the command provided by “provide voice control session … according to detected emotion 312” and is done by TTS / audio. “[0045] As another possibility, the verbosity of the prompts of the emotion-aware processing system 130 may be changed based on detected emotion….In another example, the TTS style of the answer prompt may change using an emotional TTS specific to the detected emotion (e.g., a first TTS for happy users, a second, different TTS for unhappy users, etc.). In yet another example, responsive to the detected emotion indicating anger, the prompt may be played in an apologetic TTS style. In a further example, wording and intonation of responses may be adjusted based on detected emotion, e.g., submissive responsive to detected of an aggressive user input or aggressive upon aggressive input.”] Brueckner teaches the use of an ASR/NLU system (Figure 2, “ASR/NLU Models 204”) for recognition of the input audio which would generally produce text but not always. So the “input text” is implied but not inherent in the teachings of the reference. Brueckner does not use an LLM to generate the response. It does include LMs for its ASR/NLU operations: “[0011] … For instance, the performance of the ASR/NLU system may be adjusted according to the emotional state through acoustic models, language model weight sets, specific language models, etc. that correspond to the detected emotional state. …” Datta: PNG media_image1.png 656 904 media_image1.png Greyscale PNG media_image2.png 570 826 media_image2.png Greyscale Datta teaches: 1. A voice assistant system, comprising: one or more processing devices configured to: [Datta, Figure 6 shows the hardware. 
“[0001] This disclosure relates to emotive text-to-speech (TTS) with auto detection of emotions.”] receive input audio data that corresponds to an input speech; and [Datta, Figure 1, user 102 is saying utterance 104 to the user device 10.] process the input audio data to obtain an input text of the input speech and an input paralinguistic element of the input speech; and [Datta, Figure 1, “ASR system 112” generates text 108 from the input audio. “[0027] … The user 102 may speak the utterance of the query 106 in natural language and the ASR system 112 may perform speech recognition on the audio data characterizing the utterance 104 of the query 106 to generate a textual representation 108 of the query 106 spoken by the user 102. The textual representation 108 of the query 106 may be simply referred to as a textual query 108.” Figure 2, emotion 214 is also first detected from the input text and then added to the input text 202. This is done by a “conversational assistant LLM 220: “[0028] Referring to FIG. 2… . Here, the emotion detection task prompt 214 specifies a task for the assistant LLM 220 to detect an emotional state 232P of the input text 202 from a set of possible emotional states 232.” Note that “input text 202” is the output reply from the LLM 220 and is called input because it is input to the TTS 300.] (Datta is not detecting the emotion from the input utterance 104 of the user.) a large language processing subsystem configured to: [Datta, Figures 1-2, “Assistant Large Language Model 220.”] generate a response based on the input text and the input paralinguistic element of the input speech, [Datta, Figures 1-2, “Assistant Large Language Model 220” outputs the “input text 202” and “emotional embedding 242” where the “input text 202” is the response to the user question see Figures 1 and 2, “synthesized speech representation 352” which is being output from TTS 300 and is also output from the device 10: “Don’t worry, it will come right out … 352.” Emotion is an input to the TTS as shown in Figure 2, 214 and 232. This emotion is not from the input utterance of the user, though.] wherein the one or more processing devices are further configured to: convert the response into output audio data that corresponds to an output speech. [Datta, Figures 1-2, “TTS Model 300” generating an emotive response 352.] Brueckner and Datta pertain to emotive TTS and it would have been obvious to modify the older Brueckner with the more modern aspects of Datta to arrive at the claimed system that uses LLMs instead of simpler digital assistant. This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. Regarding Claim 2, Brueckner teaches: 2. The voice assistant system of claim 1, wherein the large language processing subsystem comprises: an auxiliary encoder configured to process at least the input audio data to obtain one or more input paralinguistic parameters representing the input paralinguistic element of the input speech; and [Brueckner, Figure 3, “306: detect emotion” based on the audio signals. The Paralinguistic element is emotion in this reference.] 
“[0021] The voice assistant system 100 may further include an audio system providing audio playback functionality through speakers 124 or headphones located within the environment….”] a prompt generator configured to generate a prompt based on the input text and the one or more input paralinguistic parameters, [Brueckner, “[0021] The voice assistant system 100 may further include an audio system providing audio playback functionality through speakers 124 or headphones located within the environment….” “[0019] … The signal from the audio sensor 120 may facilitate speech recognition from audio received via the audio sensor 120 according to grammar associated with available commands and voice prompt generation.” The “prompt” of the Claim is the question/command that is generated based on the input audio and emotion and is taught in Figure 3 by the input to the box 312.] wherein: the response is generated based on applying a large language model of the large language processing subsystem on at least the prompt, and [Brueckner, Figure 3, 312 and Figure 1 output of a response by the “emotion-aware processing system 130” in the form of audio from speaker 124 to the input of the command by the user.] the response includes at least an output text of the output speech. [Brueckner teaches that the output/response may be visual/text shown on the “display 126” depending on the circumstance/emotion. “[0034] In yet another example, the emotion customization engine 208 may utilize the detected emotional state to modify the way and style of the UI or VA responses such as its verbosity or its empathetic reaction (e.g., submissive behavior upon aggressive user input)….” “[0041] As another possibility, visual user interfaces may provide alternative of complementary visual feedback instead of or in addition to the indication provided via the audio user interface….” “[0046] In yet a further example, modalities may be initiated to support the answer prompt based on the detected emotion. As a possibility, responsive to the emotion detected in the WuW, the emotion-aware processing system 130 may adjust to using one or more of visual feedback, haptic or tactile feedback, olfactory feedback, gustatory feedback, etc.”] Brueckner does not teach the user of an encoder or an LLM. Datta teaches: an auxiliary encoder configured to process at least the input audio data to obtain one or more input paralinguistic parameters representing the input paralinguistic element of the input speech; and [Datta, Figure 3, the “context model 360” / “auxiliary encoder” detects the emotion that should be attributed to the output response based on “context features 362.” “[0048] In some implementations, a context model 360 in communication with the assistant LLM 220 is configured to receive and process one or more context features 362 to generate a context embedding 364 associated with the input text 202. For example, the context features 362 may include the conversation history between the user 102 and the conversational assistant application 200 as context to the assistant LLM 220….” The “context” teaches the paralinguistic parameters of the Claim.] 
… wherein: the response is generated based on applying a large language model of the large language processing subsystem on at least the prompt, and [Datta, Figures 1, 2, 3, the assistant device that generates the response 352 is an “Assistant LLM 220” operating on “prompt 212” generated by a “prompt structurer 210.”] PNG media_image3.png 592 884 media_image3.png Greyscale Brueckner and Datta pertain to emotive TTS and it would have been obvious to modify the older Brueckner with the more modern aspects of Datta to arrive at the claimed system that uses LLMs instead of simpler digital assistant and also the context of Datta and the emotional content of Brueckner are substitutable. In this case Brueckner gets the emotion from the input voice but Datta tries to get it from past conversation context. This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. Regarding Claim 3, Brueckner teaches: 3. The voice assistant system of claim 2, wherein the prompt generator includes a learning logic or a machine learning model. [Brueckner teaches that its question/command that is generated based on the audio input and detected emotion (step 306 performed by module 206) is generated by a system that includes machine learning: “[0027] The emotion detector 206 may also receive, from the audio sensors 120, the signals indicative of the sounds captured in the environment. In an example, the emotion detector 206 may analyze the manner in which an utterance is spoken. This may include extracting a set of features (e.g., prosodic, spectral, cepstral, voice quality, etc.), and learning a mapping between those features and respective target emotions. … In another example, the emotion detector 206 may learn the features inherently in a mapping directly from the raw waveform data of the signals received from the audio sensors 120. …. “[0028] … In an example, raw data or processed feature data of the user detected from the image sensor 122 may be applied to a machine learning model to aid in predicting a probable emotional state of the user….”] (Prompt in the Claim is question/command generated by the system for input to LLM. Prompt in the reference is the answer/response from system 130.] Regarding Claim 4, Brueckner teaches: 4. The voice assistant system of claim 2, further comprising a user interface configured to display the prompt generated by the prompt generator. [Brueckner, Figure 3, “provide initial confirmation 310” is when the system checks with the user if the question/command /”prompt” of the Claim that it has generated is ok. “[0047] … In an example, the emotion-aware processing system 130 may provide a confirmation that the WuW was detected and that a voice control session has been initiated. This confirmation may include one or more of a visual confirmation (e.g., via the displays 126), an Earcon (e.g., provided to the speakers 124), and/or a speech confirming the session (e.g., provided to the speakers 124).”] Regarding Claim 5, Brueckner teaches: 5. The voice assistant system of claim 2, further comprising a user interface configured to obtain a user profile, [Brueckner, Figure 1, “[0012] FIG. 
1 illustrates a block diagram for a voice assistant system 100 for providing an interface customized to the detected emotional state of the user, in accordance with an embodiment of the disclosure….” “[0011] … In yet another example, the emotional state may be used to modify the way and style of the user interface (UI) or virtual assistant (VA) responses, such as its verbosity or its empathetic reaction (e.g., submissive behavior upon aggressive user input). …”] wherein the prompt generator is configured to generate the prompt further based on the user profile. [Brueckner: the use of machine learning for the generating of the emotion (part of the prompt/question/command) implies the use of user profile: “[0049] … Moreover, the emotion detector 206 may be relatively easier to train as the emotion detector 206 may be trained to detect emotional state on the more limited corpus of the WuW, as opposed to on any general speech input. The emotion customization engine 208 may be configured to utilize the detected emotional state to customize further interaction of the user with the emotion-aware processing system 130. Accordingly, the detection of the emotion of a WuW spoken by a user may be used to enable and build an empathetic user interface.”] Although the use of machine learning for the generating of the emotion (part of the prompt/question/command) implies the use of user profile, this is not express. Datta teaches: further comprising a user interface configured to obtain a user profile, [Datta, Figure 3, “context features 362” include history of past conversations and teach the user profile of the Claim. “[0048] … For example, the context features 362 may include the conversation history between the user 102 and the conversational assistant application 200 as context to the assistant LLM 220….”] wherein the prompt generator is configured to generate the prompt further based on the user profile. [Datta, Figure 3, the context/user profile is included as part of the prompt to the LLM 220. “[0048] … By receiving historical context (e.g., via the context embedding 364), the assistant LLM 220 may be more efficiently perform the task of predicting the emotional state 232 of the input text 202. For example, the historical emotional states 232 (e.g., the previously predicted emotional states 232P from previous conversation turns) may better inform the assistant LLM 220 on the tone and/or emotion of the conversation between the user 102 and the assistant LLM 220.”] Rationale as provided for Claim 5. Regarding Claim 6, Brueckner teaches: 6. The voice assistant system of claim 2, further comprising one or more sensors configured to obtain one or more sensory inputs, wherein the prompt generator is configured to generate the prompt further based on the one or more sensory inputs. [Brueckner, Figure 1, “physiological sensors 128” collecting “paralinguistic elements” of speech which are used together with the audio input in Figure 3, “detect emotion 306” to generate the command/question/Prompt that is performed at 312. “[0023] The voice assistant system 100 may also include various physiological sensors 128. The physiological sensors 128 may include, in an example, one or more pressure sensors, e.g., to measure the occupant's firmness of grasp of the steering wheel. 
In another example, the physiological sensors 128 may include one or more skin conductance sensors….” “[0029] In some examples, the emotion detector 206 may be configured to additionally or alternately detect emotion of the user based on information received from the physiological sensors 128….”] Regarding Claim 7, Brueckner teaches: 7. The voice assistant system of claim 1, wherein the large language processing subsystem comprises: an auxiliary encoder configured to obtain one or more input paralinguistic parameters representing the input paralinguistic element of the input speech, [Brueckner, Figure 2, 206, Figure 3, 306: “[0027] The emotion detector 206 may also receive, from the audio sensors 120, the signals indicative of the sounds captured in the environment. In an example, the emotion detector 206 may analyze the manner in which an utterance is spoken….”] wherein: the response is generated based on applying a large language model of the large language processing subsystem on at least the input text and the one or more input paralinguistic parameters, and [Brueckner, Figure 3, 308 and 312: the response depends on the input audio command and the emotion/paralinguistic parameters detected in the audio input. “Text” is implied from the ASR proves that is taught and generally produces text. “[0026] … For instance, the wake-up word detector 202 may include an ASR model 204 configured to analyze the sound signals to determine whether that sound includes utterance of a wake-up word….”] the response includes at least an output text of the output speech. [Brueckner, Figure 3, 312, [0045] the use of TTS to generate the output indicates that the output is first in the form of text. See also: “[0034] … Further, a TTS model utilized by the emotion-aware processing system 130 may be configured as well. For instance, if the user is in a good mood, then a first TTS model may be used that provides for lively responses, but if the user is in a poor mood, a sad TTS model may instead be used that provides simple responses.”] Brueckner does not teach an auxiliary encoder for the paralinguistic aspects nor an LLM. Datta teaches: an auxiliary encoder configured to obtain one or more input paralinguistic parameters representing the input paralinguistic element of the input speech, [Datta, Figures 2-3, “Assistant LLM 220” is the auxiliary encoder of the Claim when it is used the second round. Datta uses the LLM twice: first to generate a response and second time to detect the emotion to be associated with the response. 
In some embodiments TWO SEPARATE LLMS can be used: “[0029] … While examples herein depict the same assistant LLM 220 generating the input text 202 characterizing the natural language response to the user's query 106 input to the assistant LLM 220 and detecting the emotional state 232P of the input text 202, other configurations where a two LLMs are utilized: a first LLM that processes the user's query 106 to generate the input text 202 characterizing the natural language response; and a second LLM that processes the input text 202 to predict the emotional state 232P of the input text.”] wherein: the response is generated based on applying a large language model of the large language processing subsystem on at least the input text and the one or more input paralinguistic parameters, and [Datta, Figures 1-3, “Assistant LLM 220” generates the response 352 based on the “textual query 108” / “input text” and the context 362 of Figure 3 or emotional state 230 of Figure 2 as paralinguistic parameters.] Rationale as provided for Claim 5. Regarding Claim 8, Brueckner teaches: 8. The voice assistant system of claim 7, wherein the response is generated based on applying the large language model of the large language processing subsystem further on: a user setting, a user profile, one or more sensory inputs, or any combination thereof. [Brueckner, see rejection of Claim 6. This reference does not use LLMs to generate its response but uses the inputs of various sensors to generate the question/prompt that leads to the response. Figure 1, “physiological sensors 128” collecting “paralinguistic elements” of speech which are used together with the audio input in Figure 3, “detect emotion 306” to generate the command/question/Prompt that is performed at 312. “[0029] In some examples, the emotion detector 206 may be configured to additionally or alternately detect emotion of the user based on information received from the physiological sensors 128….”] Brueckner does not teach an auxiliary encoder for the paralinguistic aspects nor an LLM. Datta, as applied to Claim 1, uses the Assistant LLM 220 instead of the digital assistant of Brueckner. Rationale for combination as provided for Claim 1. Regarding Claim 9, Brueckner teaches: 9. The voice assistant system of claim 7, wherein the response further includes one or more output paralinguistic parameters representing an output paralinguistic element of the output speech. [Brueckner, the output/response to the user is adjusted according to detected emotion visually, verbally, and audibly by paralinguistic/emotional parameters. Note that in the reference “prompt” means response of the machine to the input command of the user: “[0045] As another possibility, the verbosity of the prompts of the emotion-aware processing system 130 may be changed based on detected emotion. In an example the verbosity may change based on the detected emotion to select from no answer prompts to user input, small answer prompts to user input, medium answer prompts to user input, or long answer prompts to user input. In another example, the TTS style of the answer prompt may change using an emotional TTS specific to the detected emotion (e.g., a first TTS for happy users, a second, different TTS for unhappy users, etc.). In yet another example, responsive to the detected emotion indicating anger, the prompt may be played in an apologetic TTS style. 
In a further example, wording and intonation of responses may be adjusted based on detected emotion, e.g., submissive responsive to detected of an aggressive user input or aggressive upon aggressive input.”] Regarding Claim 11, Brueckner teaches: 11. The voice assistant system of claim 9, wherein the one or more processing devices comprise: an expression generator configured to generate one or more expression embeddings based on the one or more output paralinguistic parameters; and [Brueckner, Figure 2, “emotion detector 206” and Figure 3, “detect emotion 306.” Emotion is the paralinguistic parameter.] an expressive text-to-speech subsystem configured to convert the response into the output audio data of the output speech based on the output text and the one or more expression embeddings. [Brueckner, Figure 2, “the emotion-aware processing system 130” / “expressive tts subsystem.” The TTS taught by reference indicates that some answers/responses are in text and converted to speech by the TTS of Brueckner. “[0045] As another possibility, the verbosity of the prompts of the emotion-aware processing system 130 may be changed based on detected emotion. In an example the verbosity may change based on the detected emotion to select from no answer prompts to user input, small answer prompts to user input, medium answer prompts to user input, or long answer prompts to user input. In another example, the TTS style of the answer prompt may change using an emotional TTS specific to the detected emotion (e.g., a first TTS for happy users, a second, different TTS for unhappy users, etc.)….”] Brueckner does not teach the use of an encoder that would generate an embedding. Datta teaches: an expression generator configured to generate one or more expression embeddings based on the one or more output paralinguistic parameters; and [Datta, Figure 3, “Emotional Embedding (predicted) 242.” “[0029] During a second round trip, the assistant LLM 220 performs the task of predicting the emotional state 232P of the input text 202 and then, based on the predicted emotional state 232P of input text 202 characterizing the natural language response, the conversational assistant application 200 determines an emotional embedding 242 specifying the emotional state of the input text 202 characterizing the natural language response for synthesizing the input text 202 into expressive speech, and instructs the TTS model 300 to process the input text 202 and the emotional embedding 242 to generate a synthesized speech representation 352 of the natural language response….”] an expressive text-to-speech subsystem configured to convert the response into the output audio data of the output speech based on the output text and the one or more expression embeddings. [Datta, Figure 3, “synthesized speech 352” out of “TTS model 300” which receives the EE 244 as input. See [0029] above.] Rationale for combination similar to that provided for Claim 1. Datta uses encoders/decoders which require embeddings as input. Regarding Claim 15, Brueckner teaches: 15. The voice assistant system of claim 1, further comprising one or more microphones configured to capture the input audio data that corresponds to the input speech. [Brueckner, Figure 1, “audio sensor 120.” “[0019] … For example, the cabin of the vehicle 104 may include at least one audio sensor 120. 
In one example, the audio sensor 120 may include an array of microphone elements arranged throughout the cabin…”] Claim 16 is a computer program product system claim with limitations corresponding to the limitations of method Claim 1 and is rejected under similar rationale. Claim 17 is a computer program product system claim with limitations corresponding to the limitations of method Claim 2 and is rejected under similar rationale. Claim 18 is a computer program product system claim with limitations corresponding to the limitations of method Claim 7 and is rejected under similar rationale. Claim 20 is a method claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale. Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Brueckner and Datta further in view of Bennett (U.S. 20100036660). Regarding Claim 10, Brueckner teaches: 10. The voice assistant system of claim 9, wherein the one or more processing devices comprise: an expressive text-to-speech subsystem configured to convert the response into the output audio data of the output speech based on the output text and the one or more output paralinguistic parameters. [Brueckner, see the rejection of Claim 9.] The difference here is that this Claim asks that the output is generated and then the emotive/expressive nature of the output is deduced from the text of the output. Brueckner does not teach this expressly although it is quite well-known in the art. Datta gets the paralinguistic/emotion from the response of the assistant. Bennett teaches: an expressive text-to-speech subsystem configured to convert the response into the output audio data of the output speech based on the output text and the one or more output paralinguistic parameters. [Bennett, Figure 1 shows that emotional content of speech can be detected from the “syntactic cues, lexical cues, or prosodic cues” and both syntactic and lexical cues are obtained from the text of the speech that is output from the “speech recognition 115.” See Figure 3 that shows “Recognized Speech -Text.” “[0047] At server 180, the partially processed speech signal data is handled by a server-side SRE 182, which then outputs recognized speech text corresponding to the user's question….” “[0098] A speech recognition agent 402 handles the process of recognizing a speech utterance and outputting a stream of recognized text. …” Bennett also generates output response by an agent based on the detected emotion of the input speech. See Figure 5 the DM agent 312 and the “synthesis TTS 320.”] PNG media_image4.png 602 788 media_image4.png Greyscale PNG media_image5.png 584 844 media_image5.png Greyscale Brueckner/Datta and Bennett pertain to emotive speech and it would have been obvious to modify the system of combination with the extraction of emotion from text of Bennett as an additional method of extracting emotion from speech (prosody and content). This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. Claims 12-13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Brueckner and Datta further in view of Ramachandra (U.S. 20210326372). Regarding Claim 12, Brueckner teaches: 12. 
The voice assistant system of claim 1, wherein the one or more processing devices comprise: an auxiliary encoder configured to obtain one or more input paralinguistic parameters representing the input paralinguistic element of the input speech; [Brueckner, Figure 2, “emotion detector 206”.] an expression generator configured to generate one or more expression embeddings based on the one or more input paralinguistic parameters, one or more output paralinguistic parameters derived based on the one or more input paralinguistic parameters, or both; and [Brueckner, Figure 2, “emotion customization engine 208.”] an audio-driven avatar generator configured to generate output display data of an avatar of a virtual agent based on the output audio data of the output speech and the one or more expression embeddings, the output display data being configured for display in coordination with playback of the output audio data. [Brueckner, Figure 2, “voice control session recognizer 210.” “[0046] In yet a further example, modalities may be initiated to support the answer prompt based on the detected emotion. As a possibility, responsive to the emotion detected in the WuW, the emotion-aware processing system 130 may adjust to using one or more of visual feedback, haptic or tactile feedback, olfactory feedback, gustatory feedback, etc.” These are in addition to the audio output generated by the TTS of [0045].] Auxiliary encoder and embeddings are taught by Datta as applied to Claims 1 and 2 above. Brueckner/Datta do not teach the avatar generation of this claim. Ramachandra teaches: an audio-driven avatar generator configured to generate output display data of an avatar of a virtual agent based on the output audio data of the output speech and the one or more expression embeddings, the output display data being configured for display in coordination with playback of the output audio data. [Ramachandra, Figures 1-4. A 3D hologram of a “digital persona 104” /avatar is generated based on the sentiment of the voice 134 and the text 132 of speech and expression of the face 136 of the speaker and used to output the audiovisual response. “In some examples, human centered computing based digital persona generation may include generating, for a digital persona that is to be generated for a target person, synthetic video files and synthetic audio files that are combined to generate synthetic media files. The digital persona may be generated based on a synthetic media file. An inquiry may be received from a user of the generated digital persona. Another synthetic media file may be used by the digital persona to respond to the inquiry. A real-time emotion of the user may be analyzed based on a text sentiment associated with the inquiry, and a voice sentiment and a facial expression associated with the user. Based on the real-time emotion of the user, a further synthetic media file may be utilized by the digital persona to continue or modify a conversation between the generated digital persona and the user.” Abstract. “[0043] A digital persona generator 118 that is executed by the at least one hardware processor (e.g., the hardware processor 1102 of FIG. 11, and/or the hardware processor 1304 of FIG. 
13) may generate, based on a synthetic media file of the plurality of synthetic media files 116, the digital persona 104 displayed as a holographic projection 120 of the target person 106.” See Figure 11, 1108 for generation of the “synthetic audio files 1108” that is combined with the video files 1106 to generate the digital persona 1110. “[0046] A conversation adapter 138 that is executed by the at least one hardware processor (e.g., the hardware processor 1102 of FIG. 11, and/or the hardware processor 1304 of FIG. 13) may utilize, based on the real-time emotion 130 of the user 124 and by the generated digital persona 104, a further synthetic media file of the plurality of synthetic media files 116 to continue or modify a conversation 140 between the generated digital persona 104 and the user 124.” “[0062] … Low-dimensional speaker embeddings may be used to model the variability among the thousands of different speakers in a dataset. The sequence-to-sequence model may include an encoder, which maps input to embeddings containing relevant information to the output, and a decoder which generates output from these embedding.”] Brueckner/Datta and Ramachandra pertain to an interactive response systems and it would have been obvious to modify the system of combination with the digital persona hologram generation of Ramachandra to provide a more advanced interface. This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. PNG media_image6.png 668 940 media_image6.png Greyscale PNG media_image7.png 652 902 media_image7.png Greyscale PNG media_image8.png 586 986 media_image8.png Greyscale Regarding Claim 13, Brueckner teaches: 13. The voice assistant system of claim 12, wherein the one or more processing devices comprise: an expression controller configured to adjust the one or more expression embeddings to become one or more adjusted expression embeddings, [Brueckner, Figure 2, “emotion customization engine 208.” Customization is adjustment of the emotion. “[0049] … The emotion customization engine 208 may be configured to utilize the detected emotional state to customize further interaction of the user with the emotion-aware processing system 130. Accordingly, the detection of the emotion of a WuW spoken by a user may be used to enable and build an empathetic user interface.”] wherein: the response includes at least an output text of the output speech, [Brueckner, Figure 2, “voice control session recognizer 210.” “[0046] In yet a further example, modalities may be initiated to support the answer prompt based on the detected emotion. As a possibility, responsive to the emotion detected in the WuW, the emotion-aware processing system 130 may adjust to using one or more of visual feedback, haptic or tactile feedback, olfactory feedback, gustatory feedback, etc.” These are in addition to the audio output generated by the TTS of [0045].] the output display data is generated by the audio-driven avatar generator based on the one or more adjusted expression embeddings, and [Brueckner, Figure 3, “configure system according to detected emotion 308.”] the output audio data is generated by an expressive text-to-speech subsystem of the voice assistant system based on the output text and the one or more adjusted expression embeddings. 
[Brueckner, Figure 3, 312: “[0048] At operation 312, the emotion-aware processing system 130 performs a voice control session configured according to the detected emotion. In an example, the voice control session recognizer 210 may perform a voice control session triggered by the wake-up word detector 202 as detected at operation 306, configured by the emotion customization engine 208 at operation 308….”] Auxiliary encoder and embeddings are taught by Datta as applied to Claims 1 and 2 above. Brueckner/Datta do not teach the avatar generation of this claim. Ramachandra teaches: the output display data is generated by the audio-driven avatar generator based on the one or more adjusted expression embeddings, [Ramachandra, Figures 1-4 and 11 and mapping provided for Claim 12.] Rationale as provided for Claim 12. Claim 19 is a computer program product system claim with limitations corresponding to the limitations of method Claim 12 and is rejected under similar rationale. Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Brueckner and Datta further in view of Mustafa (U.S. 20250217641). Regarding Claim 14, Brueckner teaches: 14. The voice assistant system of claim 13, wherein the expression controller configured to adjust the one or more expression embeddings is further configured to: apply a relative adjustment on the one or more expression embeddings based on a user input to obtain one or more derived expression embeddings; [Brueckner, Figure 3, “configure system according to detected emotion 308” is performed based on the same initial audio input by the user at 302.] generate one or more baseline expression embeddings based on the user input and one or more template embeddings; and generate the one or more adjusted expression embeddings based on applying a mixing function on the one or more derived expression embeddings and the one or more baseline expression embeddings. Brueckner does not go into the details of the emotion adjustment the way the Claim does. Datta teaches the “expression embeddings” but not the mixing function.. Mustafa teaches: generate one or more baseline expression embeddings based on the user input and one or more template embeddings; and [Mustafa, Figures 1 and 2. “[0045] The neural network system 100 includes an attention-based, multi-modal mixture of experts (MoE) neural network, or an “attention neural network” for short, that has been configured through training to process an input tuple 102 and to perform any of a variety of multi-modal machine learning tasks on the input 102 to generate an output 172. The input tuple 102 can include multiple network inputs from different modalities.” The modalities may be text, image, and audio as set forth in [0045]. Figure 2, the first embedded sequence 204 and the second embedded sequence 206 teach the baseline embeddings of the Claim and the “input tuple” of 202 teaches the user input and step 204 uses a “predefined mapping” / “template embeddings” of the Claim: “[0076] … The vocabulary-based text embedding engine represents a sequence of text included in the first network input as a sequence of vocabulary tokens from a predefined set of vocabulary tokens, and then maps each vocabulary token to a corresponding numerical value in accordance with a predefined mapping.”] generate the one or more adjusted expression embeddings based on applying a mixing function on the one or more derived expression embeddings and the one or more baseline expression embeddings. 
[Mustafa, Figure 2, 208 updating/modifying/adjusting the embeddings generated at 204 and 206 and mixing/combing the results at 210 and 212. “[0078] The system processes the first embedded sequence and the second embedded sequence using an attention neural network having multiple attention layers to generate an updated first embedded sequence and an updated second embedded sequence (step 208)….” “[0079] The system processes the updated first embedded sequence and the updated second embedded sequence to generate a final representation for the first network input in the first modality and a final representation for the second network input in the second modality (step 210). The updated first (or second) embedded sequence includes a respective updated token at each of the one or more positions in the first (or second) embedded sequence. Thus, the system can generate the final representation for the first (or second) network input by applying a pooling function to the updated first (or second) embedded sequence to combine the respective updated tokens within the updated first (or second) embedded sequence. The pooling function can be modality-specific, and the system can apply different pooling functions to the updated first and second embedded sequences.”] Brueckner/Datta and Mustafa pertain to multimodal detection and subsequent classification of characteristics by neural networks and it would have been obvious to apply the MoE/mixing model of Mustafa to the embeddings generated by the system of combination for the standard benefits of the use of MoEs. This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached 9 to 5, M-F. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /Fariba Sirjani/ Primary Examiner, Art Unit 2659
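For readers skimming the rejection, the Claim 1 data flow it maps onto Brueckner and Datta can be summarized as a short pipeline: the input audio yields both a transcript and a paralinguistic element (emotion, in the cited art), both feed a prompt to a large language model, and the model's response is converted back to speech. The sketch below is purely illustrative; every function and class name is a hypothetical stand-in for the functionally recited components, not any party's actual implementation.

from dataclasses import dataclass

@dataclass
class AssistantTurn:
    input_text: str
    paralinguistic: str
    response_text: str
    output_audio: bytes

def transcribe(audio: bytes) -> str:
    # Stub standing in for an ASR model (the role Brueckner's ASR model 204 plays).
    return "turn up the heat"

def detect_emotion(audio: bytes) -> str:
    # Stub standing in for paralinguistic/emotion detection (Brueckner's operation 306 role).
    return "frustrated"

def query_llm(prompt: str) -> str:
    # Stub standing in for the large language processing subsystem (Datta's assistant LLM 220 role).
    return "Sure, warming the cabin now."

def synthesize_speech(text: str, style: str) -> bytes:
    # Stub standing in for expressive TTS conditioned on the paralinguistic element.
    return text.encode()

def handle_utterance(input_audio: bytes) -> AssistantTurn:
    # Claim 1 as characterized in the rejection: one input yields both a
    # transcript and a paralinguistic element, and both condition the response.
    input_text = transcribe(input_audio)
    emotion = detect_emotion(input_audio)
    prompt = f"User said (sounding {emotion}): {input_text}"
    response_text = query_llm(prompt)
    output_audio = synthesize_speech(response_text, style=emotion)
    return AssistantTurn(input_text, emotion, response_text, output_audio)

if __name__ == "__main__":
    print(handle_utterance(b"\x00\x01"))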
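Claim 14, which the EPO examiner in the PCT phase found novel and inventive, stands out because it recites a mixing (gated) function over expression embeddings. A minimal sketch of one such gated combination follows; the scalar gate, the vector sizes, and the names are assumptions for illustration, and a learned gate or mixture-of-experts router could fill the same slot, as the rejection's reliance on Mustafa suggests.

import numpy as np

def adjusted_expression_embedding(derived: np.ndarray,
                                  baseline: np.ndarray,
                                  gate: float) -> np.ndarray:
    # A simple "mixing function": a gated (convex) combination of the derived
    # embedding (a relative adjustment driven by user input) and the baseline
    # embedding (built from template embeddings). The scalar gate is an
    # illustrative assumption, not the claimed implementation.
    return gate * derived + (1.0 - gate) * baseline

derived_embedding = np.array([0.9, -0.2, 0.4])    # illustrative values only
baseline_embedding = np.array([0.1, 0.3, 0.0])    # illustrative values only
print(adjusted_expression_embedding(derived_embedding, baseline_embedding, gate=0.7))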

Prosecution Timeline

Jun 06, 2024
Application Filed
Jan 12, 2026
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603099: SELF-ADJUSTING ASSISTANT LLMS ENABLING ROBUST INTERACTION WITH BUSINESS LLMS (granted Apr 14, 2026; 2y 5m to grant)
Patent 12579482: Schema-Guided Response Generation (granted Mar 17, 2026; 2y 5m to grant)
Patent 12572737: GENERATIVE THOUGHT STARTERS (granted Mar 10, 2026; 2y 5m to grant)
Patent 12537013: AUDIO-VISUAL SPEECH RECOGNITION CONTROL FOR WEARABLE DEVICES (granted Jan 27, 2026; 2y 5m to grant)
Patent 12492008: Cockpit Voice Recorder Decoder (granted Dec 09, 2025; 2y 5m to grant)
Study what changed in these applications to get past this examiner; based on the 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview: 99% (+31.0%)
Median Time to Grant: 2y 10m
PTA Risk: Low
Based on 547 resolved cases by this examiner. Grant probability derived from career allow rate.
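The footnote says the grant probability is derived from the career allow rate, and the with-interview figure is consistent with scaling that rate by the +31.0% interview lift. The reconstruction below is an assumption; the page does not publish its actual model.

base_grant_probability = 0.76    # career allow rate from the examiner statistics above
interview_lift = 0.31            # relative lift observed for resolved cases with an interview

with_interview = base_grant_probability * (1 + interview_lift)   # 0.9956
# The page displays 99%; capping just below 100% is an assumption about its rounding.
print(f"With interview: {min(with_interview, 0.99):.0%}")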
