Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
DETAILED ACTION
Claims 34-52 are pending. Claims 1-33 were canceled by a preliminary amendment and Claims 34-52 were added. Claims 34, 40, 42, 45, and 49 are independent, and all are method claims of various scopes.
Claims 45-48 were withdrawn from consideration in response to an orally conveyed restriction requirement without traverse. Claims 34-44 and 49-52 remain under examination, of which claims 34, 40, 42, and 49 are independent.
This Application was published as U.S. 20250008290.
Apparent priority: 27 June 2023.
Claim Objections
Claim 41 is objected to because of the following informalities:
41. The method of claim 40, further comprising
synchronizing, via at least one processor, the classified facial sentiment attribute with the timing of the transcribed text-based sentiment,
whereby both emotional indicators are presented in a temporally consistent audiovisual output on an electronic display device.
The phrase “the classified facial sentiment attribute” has no antecedent basis. An antecedent should be established in Claim 40, where the classifying occurs.
The phrase “both emotional indicators,” while raising no antecedent-basis issue because this is its first appearance, is not clear: no definition has been provided for “emotional indicators,” nor is it established that there are two of them.
Appropriate correction is required.
35 U.S.C. 112(f) Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.
Such claim limitation(s) is/are: the various “modules” in Claims 40-42, 45, and 48-50 and the various “engines” in Claims 42-43, 45, and 47. These limitations are generic in the context of the art; they do not refer to any specific structure and only serve as placeholders for the structure that performs the associated function(s) without providing any information about what that structure is. MPEP 2181 I A states:
For a term to be considered a substitute for "means," and lack sufficient structure for performing the function, it must serve as a generic placeholder and thus not limit the scope of the claim to any specific manner or structure for performing the claimed function. It is important to remember that there are no absolutes in the determination of terms used as a substitute for "means" that serve as generic placeholders. The examiner must carefully consider the term in light of the specification and the commonly accepted meaning in the technological art. Every application will turn on its own facts.
Based on ordinary skill in the art and the description of the functions of these components in the Specification, they refer to processors, or to a combination of processor and memory, possibly with transducers such as microphones and displays, or to a combination of software and hardware.
Support may be found at:
[0012] An embodiment of the invention currently claimed pertains to a system for generating spatially explicit auditory cues designed to enhance situational awareness for a recipient. This system comprises a processor configured to receive real-time location data of at least one real-world, non-virtual, physical entity relative to the spatial position and visual sight orientation of the recipient. The location data includes directional bearing, distance, and visual focus trajectory of the entity, which are resolved by directional or positional sensors or instruments conveying the spatial data on the entities. An audio generation module processes this location and focus data to create audio signals that simulate the perceived direction, distance, and focus trajectory.
[0102] After identifying the emotional content, the system proceeds with the translation of the transcribed text into the target language, in this case, Spanish. This translation process utilizes sophisticated machine translation models such as Google Translate API, DeepL, or Microsoft Translator. These models are trained on vast multilingual datasets, enabling them to handle complex linguistic structures and idiomatic expressions with high fidelity. The translation process ensures that the semantic meaning of the original speech is accurately conveyed in the target language.
[0104] To synchronize the translated speech with the avatar's expressions, the system employs text-to-speech (TTS) technology. TTS engines like Google Cloud Text-to-Speech or Amazon Polly are used to generate synthetic speech in the target language. These engines can be customized to adjust the tone, pitch, and speed of the synthesized voice to match the emotional content identified earlier. For example, a joyful sentence would be spoken in a lively and upbeat tone, whereas a somber sentence would be delivered in a slower and more subdued manner.
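For illustration only, and not as a characterization of the Applicant's disclosure or of any particular TTS product, the kind of tone/pitch/speed adjustment described in [0104] can be expressed through standard SSML prosody markup; the emotion-to-prosody table in the following sketch is hypothetical:

```python
# Illustrative only: build an SSML <prosody> wrapper whose rate, pitch, and
# volume vary with a detected emotion label. The emotion-to-prosody table is a
# hypothetical example; it is not taken from the application or the cited art.
EMOTION_PROSODY = {
    "joyful": {"rate": "fast", "pitch": "+15%", "volume": "loud"},
    "somber": {"rate": "slow", "pitch": "-10%", "volume": "soft"},
    "neutral": {"rate": "medium", "pitch": "+0%", "volume": "medium"},
}

def to_ssml(text: str, emotion: str) -> str:
    """Wrap translated text in SSML prosody settings matching the emotion."""
    p = EMOTION_PROSODY.get(emotion, EMOTION_PROSODY["neutral"])
    return (
        f'<speak><prosody rate="{p["rate"]}" pitch="{p["pitch"]}" '
        f'volume="{p["volume"]}">{text}</prosody></speak>'
    )

print(to_ssml("¡Qué buenas noticias!", "joyful"))
```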
PLEASE NOTE: This is NOT a rejection. Please don’t address it as a rejection. If the Applicant does not agree with the INTERPRETATION, he may argue or amend to replace the terms interpreted under 112(f) with structural terms such as “microphone,” “speaker,” or “processor” as appropriately supported by the Specification. In the alternative, he may let the interpretation stand if the intent was to include a means plus function limitation in the Claim.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 34-38 are rejected under 35 U.S.C. 103 as being unpatentable over Lee (U.S. 20210090592) in view of Samant (U.S. 20240185880).
(Note: the record includes two references with similar names, one by Samant from 2024 and another by Sumant from 2020.)
Regarding Claim 34, Lee teaches:
34. A computer-implemented method of generating an audiovisual output that conveys emotional sentiment of a speaker, [Lee, Figure 4 shows an “enhanced transcript” which is visual but not audio.]
the method comprising:
a. receiving, by at least one processor, an audio input from a speaker, the audio input including spoken words; [Lee, Figure 3, 300: “[0050] … Beginning at block 300, the device may receive or otherwise access microphone data indicating audio of one or more users speaking.….”]
b. transcribing, by the at least one processor executing speech recognition instructions stored in memory, the spoken words into an alphanumeric text representation; [Lee, Figure 3, 302: “[0052] … At block 302 the device may use voice recognition software and/or speech to text software to generate a transcript of the user speaking as indicated via the data accessed at block 300.….”]
c. performing sentiment analysis, by the at least one processor, on at least a portion of the alphanumeric text representation to classify the emotional sentiment into one of a predefined set of sentiment categories stored in a database repository; [Lee, Figure 3, 314: “[0059] From block 312 the logic may proceed to block 314 where the device may associate emotions identified as occurring at certain times as the user speaks with various parts of the speech itself and/or with the corresponding parts of the transcript of the user's speech. ….”]
d. retrieving, from an electronic memory storing a sentiment-mapped visual data structure, at least one anthropomorphic visual representation corresponding to the classified sentiment, [Lee, Figure 3, 316: “[0060] At block 316 the device may present a transcript of the user's speech on a display as part of a graphical user interface (GUI), such as the example GUI 400 of FIG. 4 that will be described later. The transcript may be presented with certain text from the transcript being enhanced with indications of corresponding emotion(s) of the user determined to have occurred at the time the user spoke the audio corresponding to that text, thus differentiating emotion-related text from other text of the transcript that might also be presented but for which no emotions were determined.” Figure 4 shows emoticons / “anthropomorphic visual representations” associated with parts of the transcript.]
wherein the sentiment-mapped visual data structure comprises a plurality of predefined anthropomorphic visual representations, each indexed to at least one sentiment category; and [Lee, “[0069] Also note that emojis and emotion labels may also be provided in some examples for words spoken with extreme emphasis and associated with an extreme emotion….” Emojis and Emoticons are “anthropomorphic visual representations” and each represents/ “is indexed to” an emotion/ sentiment category.]
e. generating, via the at least one processor, an audiovisual output encoded as a digital signal and transmitting the audiovisual output to an electronic display device, [Lee, Figure 3, 316 and Figure 4. “[0065] Now in reference to FIG. 4, it shows an example graphical user interface (GUI) 400 that may be presented on an electronic display. The GUI 400 may show at least a portion of a transcript 402 of a user's speech….”]
wherein the audiovisual output comprises:
i. the alphanumeric text representation of the spoken words, and [Lee, Figure 4 showing the transcript of the speech.]
ii. the anthropomorphic visual representation conveying the identified sentiment. [Lee, Figure 4, “[0069] Also note that emojis and emotion labels may also be provided in some examples for words spoken with extreme emphasis and associated with an extreme emotion. So, for example, as shown in the fifth line of the transcript 402, the word “concerned” 416 may not just be bolded and underlined but also an emoji 418 for stress may be indicated adjacent to it along with the label “stressed” 420.”]
Lee does not teach an audiovisual output: Figure 4, which shows the GUI output with the emotion-enhanced transcript, does not appear to include audio. Note, however, that the Claim defines the “audiovisual output” as comprising only text and visuals: no audio.
Samant teaches:
34. A computer-implemented method of generating an audiovisual output that conveys emotional sentiment of a speaker, [Samant, Figure 2, the output is “media features,” which teaches the “audiovisual output” of the Claim. See Figure 7, where the input is a Media Stream including Audio and/or Video, Live or Prerecorded, and the output is Annotated Media.]
the method comprising:
a. receiving, by at least one processor, an audio input from a speaker, the audio input including spoken words; [Samant, Figure 2, Audio (speech, music), Speech audio/text in media. “[0045] … The language in the media can be separately analyzed by a “word sentiment analyzer” AI module 220, which outputs language sentiment features (E.g., assurance, contemplation, etc.)….”]
b. transcribing, by the at least one processor executing speech recognition instructions stored in memory, the spoken words into an alphanumeric text representation; [Samant, Figure 2, “0042] FIG. 2 shows in more detail the exemplary Media Feature Extractor Module 110. As illustrated in FIG. 2, the media feature extractor module can be implemented so that different modalities (audio, images, text, speech-to-text) in the input media will be analyzed separately….”]
c. performing sentiment analysis, by the at least one processor, on at least a portion of the alphanumeric text representation to classify the emotional sentiment into one of a predefined set of sentiment categories stored in a database repository; [Samant, Figure 2, “word sentiment analyzer 220” is operating on the text obtained from speech-to-text and generates the “language sentiment features.” “[0045] … The language in the media can be separately analyzed by a “word sentiment analyzer” AI module 220, which outputs language sentiment features (E.g., assurance, contemplation, etc.)….” The language sentiment is placed in categories of “assurance” and “contemplation” and others like: “[0097] … The AI approach of FIG. 7 is quick and provides crucial emotional data like engagement, frustration, assurance, among other recommendations, to the call center agent.” Figure 7, 700. “[0050] The “perceiver module” analyzes a combination of tone and/or word usage and/or body language and assigns an emotion label as per the audience (Input #2). The combination of the user's tone, word choice, and facial expression are analyzed to assign an emotional state like (and not limited to): excited, cheerful, pleased, passionate, composed, poised, relaxed, mellow, stressed, tense, nervous, uneasy, thinking, surprised, annoyed, weary, tired, disengaged, fatigued, reasoning, doubtful, or confident.” ‘[0100] The system may identify a speaker's expressed/perceived emotional state by analyzing a combination of tone and/or word usage and/or body language and assigning an emotion label….”]
d. retrieving, from an electronic memory storing a sentiment-mapped visual data structure, at least one anthropomorphic visual representation corresponding to the classified sentiment, [Samant, Figure 6, “emotion association database 130” where a set of features is associated with an Emotion Label. Figure 7, “Emotion detection …100” which feeds the “… annotates attendees with emoticons … 700.” Emoticons are anthropomorphic visual representations.]
wherein the sentiment-mapped visual data structure comprises a plurality of predefined anthropomorphic visual representations, each indexed to at least one sentiment category; and [Samant, Figure 7, 700, each attendee and user himself is annotated with an emoticon / anthropomorphic visual representation which represents / is indexed to his detected or predicted emotion. See Figure 10, Emotion label = passionate is shown with an emoticon. Figure 11 each attendee is shown with his emotion label.]
e. generating, via the at least one processor, an audiovisual output encoded as a digital signal and transmitting the audiovisual output to an electronic display device for presentation to a recipient, [Samant, Figure 7, the output is “Annotated Media” which is generated by “Media Annotator 600” which annotates the media stream with emojis, emoticons, etc. “[0098] In one implementation as shown in FIG. 7, a system is provided for commentary/retrospect analysis by annotating a media stream to assign a speaker's emotion and energy level by analyzing voice tone and/or word choice and/or body language. The media stream may be audio, video, or an audio and video stream. The media stream may be pre-recorded and uploaded for analysis. Alternatively, the media stream may be analyzed live through online meetings, media-broadcast, or presentations in real-time. The emotions, energy, and coaching feedback will be generated and stored for later queries, reports, and analytics. Such analytics, feedback, and emotions can be annotated on the input media clip to provide embedded frame-by-frame analysis.” “[0100] The system may identify a speaker's expressed/perceived emotional state by analyzing a combination of tone and/or word usage and/or body language and assigning an emotion label. The combination of the user's tone, word choice, and facial expression are analyzed to assign an emotional state like (and not limited to): excited, cheerful, pleased, passionate, composed, poised, relaxed, mellow, stressed, tense, nervous, uneasy, thinking, surprised, annoyed, weary, tired, disengaged, fatigued, reasoning, doubtful, or confident.”]
wherein the audiovisual output comprises: [Samant’s output may be annotated text if the input media is text. Note that the Claim defines the “audiovisual output” with visual elements only.]
i. the alphanumeric text representation of the spoken words, and [Samant, “[0044] Speech can be converted into text and the text features can be extracted. The system can apply text feature extraction methods such as fusion, mapping and clustering. The filtering method is particularly useful for large-scale text feature extraction. …” Samant generates the transcript but there is no express teaching that it is part of the output. The “Report” of Figure 7 may include the transcript of the speech but this is not express.]
ii. the anthropomorphic visual representation conveying the identified sentiment. [Samant, Figure 7, the Report includes annotations and may include the transcript. Also, the “annotated media” output may be just text if the input media is text: “[0017] annotate emotional analysis on the media clip/stream/snippet.” “[0098] … Such analytics, feedback, and emotions can be annotated on the input media clip to provide embedded frame-by-frame analysis.”]
Lee and Samant both pertain to analysis of input media for emotion and both generate an output annotated with visual emotional indicators. It would have been obvious to combine the audiovisual output of Samant with the system of Lee to arrive at an annotated audiovisual output, although, as noted above, the “audiovisual output” as defined by the Claim includes no audio. This combination falls under combining prior art elements according to known methods to yield predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
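For illustration only (and not as the method of Lee or Samant), steps (c) and (d) as mapped above — classifying text into a predefined sentiment category and retrieving an anthropomorphic visual representation indexed to that category — can be sketched as follows; the keyword lexicon and emoji table are hypothetical:

```python
# Illustrative sketch: classify a text span into one of a predefined set of
# sentiment categories and retrieve a visual representation indexed to that
# category. The lexicon and emoji table are hypothetical examples only.
SENTIMENT_LEXICON = {
    "thanks": "positive", "great": "positive",
    "concerned": "negative", "thrown away": "negative",
}
SENTIMENT_TO_EMOJI = {"positive": "😊", "negative": "😟", "neutral": "😐"}

def classify_sentiment(text):
    """Return the first matching sentiment category, defaulting to neutral."""
    lowered = text.lower()
    for keyword, category in SENTIMENT_LEXICON.items():
        if keyword in lowered:
            return category
    return "neutral"

def visual_for(text):
    """Retrieve the emoji indexed to the classified sentiment category."""
    return SENTIMENT_TO_EMOJI[classify_sentiment(text)]

print(visual_for("I'm concerned the samples were thrown away"))  # prints 😟
```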
Regarding Claim 35, Lee teaches:
35. The method of claim 34, further comprising
inserting an annotation into the alphanumeric text representation, [Lee, Figure 3, 314 and Figure 4, 418.]
wherein the annotation explicitly labels the classified sentiment category, and the modified text is presented on an electronic display device. [Lee, Figure 4, 418 showing an emoji for stressed on the GUI 400. “[0069] Also note that emojis and emotion labels may also be provided in some examples for words spoken with extreme emphasis and associated with an extreme emotion. So, for example, as shown in the fifth line of the transcript 402, the word “concerned” 416 may not just be bolded and underlined but also an emoji 418 for stress may be indicated adjacent to it along with the label “stressed” 420.”]
Regarding Claim 36, Lee teaches:
36. The method of claim 34,
wherein the predefined sentiment categories are selected from a taxonomy stored in a database and [Lee, Figure 3, 318: “[0061] From block 316 the logic may then proceed to block 318. At block 318 the device may receive or access user input that indicates that one or more emotions that were associated with part of the user's speech (as indicated in the transcript) have been correctly or incorrectly labeled. This user input may then be used by the device at block 318 to improve or change a taxonomy dictionary that classifies certain keywords and/or key phrases as being associated with respective particular emotions so that the dictionary may be improved for future use by the device consistent with present principles. For example, should a user indicate that a given word in the transcript has been incorrectly labeled with a certain emotion, the device may remove the entry from the taxonomy dictionary associating that word with the labeled emotion.”]
derived from machine learning models trained using speech and text datasets annotated for sentiment, [Lee: “[0062] The user input regarding correctly or incorrectly labeled emotions may also be used at block 318 to improve artificial intelligence models that might be used to determine emotions as described above. For example, machine learning in the form of supervised learning may be performed to optimize or re-train a neural network used for inferring emotions based on correctly and incorrectly determined emotions as indicated by the user (and hence labeled by the user for training). For example, optimization/training may occur using one or more classification algorithms and/or regression algorithms along with the user indications themselves.”]
including but not limited to positive, neutral, and negative classifications with subcategories. [Lee: “[0066] … Also note that although not shown in the black and white drawing of FIG. 4, the word “thanks” 406 may be presented in green color owing to it being associated with a positive emotion (e.g., gratitude and/or happiness) by a device undertaking present principles while words immediately before and after that have not been associated with an emotion (“Hi, everyone.” and “for meeting.”) may be presented in black text and not bolded. As another example, the phrase “thrown away” 408 may be presented in red color without bolding owing to it being associated with a slightly negative emotion (e.g., annoyed) while words immediately before and after that have not been associated with an emotion may be presented in black text.” “[0007] Still further, in some implementations the instructions may be executable by the least one processor to use facial recognition data to determine whether certain portions of the user's speech are associated with a positive emotion and to indicate, on the GUI and based on the determination, particular words from the user's speech as being associated with one or more emotions of the user by differentiating text determined to be associated with a positive emotion from text associated with a negative emotion and from text not associated with one or more emotions of the user.”]
Regarding Claim 37, Lee teaches:
37. The method of claim 34,
wherein the sentiment-mapped visual data structure is stored in a database and comprises a plurality of predefined anthropomorphic avatar images or animations, each indexed to a corresponding sentiment category, [Lee, emoticons and emojis are “anthropomorphic avatar images” which are each indexed to a particular emotion/sentiment. Figure 4, 418 showing a frowning emoji.]
wherein the sentiment classifications are derived from at least one of: [Lee teaches the use of all three: text, voice, face.]
a. text-based sentiment analysis; [Lee, “[0058] Still further, in some embodiments at block 312 the device may determine emotions based on the audio data itself, e.g., using voice recognition. The device may also determine emotions from the audio data using an artificial intelligence model having at least one neural network (e.g., recurrent or deep neural network) configured for emotion inference based on speech audio and/or transcribed words for the device to identify emotions from keywords and/or word sequences of the user's speech. Other software for determining particular emotions based on speech and/or transcribed words may also be used, such as software that identifies correlations of keywords/phrases (or volume levels) with emotions via a taxonomy dictionary. Further, in some examples the indicators referenced in the paragraph above may also be used to determine emotions from audio/transcribed words for periods of time corresponding to the determined high and low volume levels, but not for other periods of time.”]
b. prosodic feature extraction; or [Lee, Figure 3, 306, volume and 308, highs and lows. “[0057] In some examples, the determined high and low volume levels may be used by the device as indicators of emotions to identify, ….”]
c. facial expression recognition. [Lee, Figure 3, 310 access data from a camera, 312, use facial recognition to determine emotions of the user based on data from the camera. “[0056] From block 308 the logic may then proceed to block 310. At block 310 the device may access camera data showing the user speaking the speech indicated via the microphone data that was accessed at block 300, such as still images or video of the user. The camera data may be streamed from the camera as the user speaks, or accessed at a storage location as stored based on images collected in the past. The logic may then proceed to block 312 where the device may analyze the images/video using facial and gesture recognition software, emotion detection software, etc. The images/video may also be analyzed using an artificial intelligence model having at least one neural network (e.g., recurrent or deep neural network) configured for inferring emotion based on images of the user's face and/or the user's hand/arm/face gestures to determine one or more emotions of the user that occur at different times during the user's speech.”]
Regarding Claim 38, Lee teaches
38. The method of claim 34, further comprising
modifying a visual presentation of at least one word or phrase in the alphanumeric text representation by changing a font stylization, [Lee, Figure 7, 706. Figure 4, 404, 406, 408. “[0066] As may be appreciated from the GUI 400, in this example the word “thanks” on the top line of the transcript has been bolded and the graphical representation 404 beneath the word “thanks” indicates an increase in the volume of the user's voice when this word was spoken. Also note that although not shown in the black and white drawing of FIG. 4, the word “thanks” 406 may be presented in green color owing to it being associated with a positive emotion (e.g., gratitude and/or happiness) by a device undertaking present principles while words immediately before and after that have not been associated with an emotion (“Hi, everyone.” and “for meeting.”) may be presented in black text and not bolded. As another example, the phrase “thrown away” 408 may be presented in red color without bolding owing to it being associated with a slightly negative emotion (e.g., annoyed) while words immediately before and after that have not been associated with an emotion may be presented in black text.”]
wherein the font stylization is selected to emphasize the classified sentiment category and is presented on an electronic display device. [Lee, Figure 4, GUI 400 displaying the transcript of the speech. “[0064] … These patterns may then be presented to the user via an electronic display at block 322 through a representation of the profile, either automatically when the transcript itself is presented and/or based on user command.”]
[Image: media_image1.png, greyscale, 426 × 480]
Claim 39 is rejected under 35 U.S.C. 103 as being unpatentable over Lee and Samant in view of Zhang (U.S. 20240402989).
Regarding Claim 39, Lee does not generate audio; rather, it shows the emotions with the visual aid of emoticons and emojis. Samant, Figures 2 and 3, shows that media features extracted from the input (including audio/video/image/text) are output and serve as indicators of “cognitive states”/emotions as detected by the “perceiver module 120” of Figure 1, which receives the “media features” as input and generates an “emotional response” as output, but Samant does not teach transformation of voice.
Zhang teaches
39. The method of claim 34, further comprising
generating a synthetic audio output corresponding to the alphanumeric text representation and applying an audio processing transformation, wherein the transformation modifies at least one of pitch, volume, or duration to reflect the classified sentiment category. [Zhang, Figure 10, S1002, S1003, S1004. “[0206] S1003: Adjust an acoustic parameter with reference to an emotion status of the user and/or a vehicle status. “[0207] In one embodiment, the acoustic parameter may be adjusted by using an emotion result of the user and the vehicle status based on vehicle presetting and user presetting.” Figure 6, “[0281] The interaction assistant appearance may include, for example, a character image or an animation image. The interaction assistant audio may be an acoustic parameter of a voice assistant, including one or more of an interaction speaking speed, an intonation, a pitch, a timbre, a volume, and a rhythm. The interaction style may include, for example, sad, happy, and normal.”]
[Image: media_image2.png, greyscale, 310 × 496]
Lee/Samant and Zhang pertain to detection of emotion from speech and presentation of the detected emotions, and it would have been obvious to include the emotive speech synthesis of Zhang, which modifies prosody to reflect emotion, in the combined system in order to accompany the text with audio. This combination falls under simple substitution of one known element for another to obtain predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
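For illustration only, the general kind of audio transformation recited in Claim 39 — modifying volume and duration to reflect a classified sentiment — can be sketched as follows; the parameter values are hypothetical and are not taken from Zhang or the other cited references:

```python
import numpy as np

# Minimal sketch, assuming a mono waveform as a float array: scale the
# amplitude (volume) and stretch/compress the time axis (duration) according
# to a sentiment category. The parameter table is a hypothetical example.
SENTIMENT_AUDIO_PARAMS = {
    "joyful": {"gain": 1.3, "speed": 1.1},   # louder, slightly faster
    "somber": {"gain": 0.8, "speed": 0.9},   # softer, slightly slower
}

def apply_sentiment_transform(wave, sentiment):
    p = SENTIMENT_AUDIO_PARAMS.get(sentiment, {"gain": 1.0, "speed": 1.0})
    # Duration change via linear resampling of the sample index axis.
    n_out = int(len(wave) / p["speed"])
    resampled = np.interp(
        np.linspace(0, len(wave) - 1, n_out), np.arange(len(wave)), wave
    )
    return p["gain"] * resampled

tone = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))  # 1 s test tone
print(apply_sentiment_transform(tone, "somber").shape)      # longer than input
```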
Claim 40 is rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Samant.
Regarding Claim 40, Lee teaches
40. A computer-implemented method for conveying emotion from a speaker's facial expressions in combination with spoken text, [Lee, Figure 3 and the mappings provided for Claim 37 above.]
the method comprising:
a. capturing, via at least one image sensor, a video input that includes a plurality of frames depicting the speaker's face while concurrently capturing an audio input of the speaker's speech; [Lee, Figure 3, 300, 308 and 310. “[0051] In any case, the data of block 300 may be streamed from the microphone as the user speaks, or accessed at a storage location as stored based on past speech detected by a microphone…..” “[0056] … At block 310 the device may access camera data showing the user speaking the speech indicated via the microphone data that was accessed at block 300, such as still images or video of the user. The camera data may be streamed from the camera as the user speaks, or accessed at a storage location as stored based on images collected in the past….”]
b. transcribing, via an automated speech recognition module, the audio input into an alphanumeric text representation; [Lee, Figure 3, 302, generates the transcript shown in Figure 4.]
c. extracting, via a facial-expression recognition module executed by at least one processor, facial landmark data from at least one video frame to classify a facial sentiment attribute corresponding to an emotion category; [Lee, “[0056] … The logic may then proceed to block 312 where the device may analyze the images/video using facial and gesture recognition software, emotion detection software, etc. The images/video may also be analyzed using an artificial intelligence model having at least one neural network (e.g., recurrent or deep neural network) configured for inferring emotion based on images of the user's face and/or the user's hand/arm/face gestures to determine one or more emotions of the user that occur at different times during the user's speech.”]
d. generating, via at least one processor, an audiovisual output and transmitting it to an electronic display device, [Lee's output is visual, not audiovisual. However, the Claim defines “audiovisual” in terms that are entirely taught by Lee because they do not include an “audio” component.]
wherein the audiovisual output presents:
i. the classified facial sentiment attribute as an anthropomorphic visual representation retrieved from a sentiment-mapped visual data structure, and [Lee, Figure 4, emoji 418 superimposed on the transcript. “[0060] At block 316 the device may present a transcript of the user's speech on a display as part of a graphical user interface (GUI), such as the example GUI 400 of FIG. 4 that will be described later. The transcript may be presented with certain text from the transcript being enhanced with indications of corresponding emotion(s) of the user determined to have occurred at the time the user spoke the audio corresponding to that text, thus differentiating emotion-related text from other text of the transcript that might also be presented but for which no emotions were determined.” “[0069] Also note that emojis and emotion labels may also be provided in some examples for words spoken with extreme emphasis and associated with an extreme emotion. So, for example, as shown in the fifth line of the transcript 402, the word “concerned” 416 may not just be bolded and underlined but also an emoji 418 for stress may be indicated adjacent to it along with the label “stressed” 420.” The emojis and emoticons come from a “sentiment-mapped visual data structure.”]
ii. the alphanumeric text representation of the transcribed spoken words, thereby ensuring a recipient perceives both spoken context and non-verbal emotional cues. [Lee, Figure 4 “[0019] FIG. 4 shows an example graphical user interface (GUI) that includes an enhanced transcript of a person's audible speech consistent with present principles;”]
Lee teaches the use of facial recognition but does not teach the details of lip or mouth tracing as facial landmark data.
Lee does not teach an audiovisual output and its output is entirely visual. However, as with Claim 34, this Claim too defines its “audiovisual output” without any audio.
Samant teaches:
c. extracting, via a facial-expression recognition module executed by at least one processor, facial landmark data from at least one video frame to classify a facial sentiment attribute corresponding to an emotion category; [ Samant teaches that facial expressions are detected from facial muscles/landmarks: “[0002] … Facial expressions can be created by the physical movements and positions of facial muscles….” “[0037] In one embodiment, facial expressions are formed from the movements and positions of facial muscles….” “[0045] … The video and images in the media can be independently analyzed by a “body language analyzer” AI module 230, which outputs body language and facial expression features. (E.g., frown, smile, animated, etc.). The media feature extractor thus outputs a comprehensive set of features from the input media. …”]
d. generating, via at least one processor, an audiovisual output and transmitting it to an electronic display device, [Samant, Figure 7, “annotated media” and rejection of Claim 34.]
Rationale for combination as provided for Claim 34.
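For illustration only, a coarse sketch of step (c) of Claim 40 — deriving a facial sentiment attribute from facial landmark data — follows; the landmark names and the pixel threshold are hypothetical assumptions and do not represent the method of Lee or Samant:

```python
# Hypothetical sketch: derive a coarse facial sentiment attribute from mouth
# landmark coordinates. Landmark names and the threshold are illustrative only.
def classify_facial_sentiment(landmarks):
    """landmarks maps names such as 'mouth_left', 'mouth_right', and
    'mouth_center' to (x, y) pixel coordinates, with y increasing downward."""
    corner_y = (landmarks["mouth_left"][1] + landmarks["mouth_right"][1]) / 2
    center_y = landmarks["mouth_center"][1]
    if corner_y < center_y - 2:   # corners raised above the mouth center: smile
        return "happy"
    if corner_y > center_y + 2:   # corners drawn below the mouth center: frown
        return "sad"
    return "neutral"

print(classify_facial_sentiment({
    "mouth_left": (40.0, 90.0),
    "mouth_right": (80.0, 90.0),
    "mouth_center": (60.0, 96.0),
}))  # prints "happy"
```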
Claim 41 is rejected under 35 U.S.C. 103 as being unpatentable over Lee and Samant in view of Zhang (U.S. 20240402989).
Regarding Claim 41, Lee places the emoticons next to the pertinent part of the transcript. As provided in the Objection above, “both emotional indicators” is undefined; while the phrase does not create an antecedent-basis issue, absent a particular definition it is interpreted as “emotional indicators,” i.e., more than one, which includes two. Lee in [0057] teaches that the changes in the volume of the voice are tracked in time with timestamps and used as an indication of emotion. Lee also teaches the use of facial recognition to determine emotions of the user, Figure 3, block 312.
Lee does not teach that the facial sentiments are shown back to the user.
Samant, Figures 2 and 3 show that media features extracted from the input including audio/video/image/text are extracted and output and are indicators of “cognitive states” /emotions as detected by the “perceiver module 120” of Figure 1 which receives the “media features” as input and generates “emotional response” as output: “[0036] FIG. 1 shows an exemplary emotion detection and coaching system 100. The system includes a media feature extractor module 110 that extracts features from a plurality of media modalities whose output is provided to a perceiver module 120 that detects and predicts an emotional response for a given audience. Input data can include text, a video feed from a camera or a video source, and audio data from a recording or audio microphone, among others. The Video Data can be used to analyze facial expressions, facial recognition, identity confirmation or to identify cognitive states that are related to the individual. Voice data can also be used to analyze voice data for biometric confirmation or cognitive states. Cognitive states include frustration, ennui and confusion, cognitive overloads, skepticism or delight, satisfaction, calmness, or stress. Human voices and facial expressions are captured to analyze the levels of human communication.”
Samant does not teach output of text or synchronizing emotion with text.
Zhang teaches:
41. The method of claim 40, further comprising
synchronizing, via at least one processor, the classified facial sentiment attribute with the timing of the transcribed text-based sentiment, whereby both emotional indicators are presented in a temporally consistent audiovisual output on an electronic display device. [Zhang, “In various examples, techniques are described for animating characters by decoupling portions of a face from other portions of the face. Systems and methods are disclosed that use one or more neural networks to generate high-fidelity facial animation using inputted audio data….” Abstract. “[0246] It should be understood that the text sequence and the emotion sequence that are aligned in a time sequence may mean that there is a correspondence between the text sequence and the emotion sequence at a same moment. After the text sequence and the emotion sequence are aligned in a time sequence, model processing can be facilitated, and it can also be ensured that a more accurate animation sequence is generated” “[0254] .. Finally, a body language of the voice assistant is synchronized with the speech, the vehicle condition, and the environment.”]
[Image: media_image3.png, greyscale, 808 × 1090]
[Image: media_image4.png, greyscale, 574 × 950]
[Image: media_image5.png, greyscale, 696 × 936]
Lee/Samant and Zhang pertain to detection of emotion from speech and presentation of the detected emotions, and it would have been obvious to include the animation of Zhang in the combined system so that the recognized text is presented together with an animated avatar or virtual-reality graphics, which are more sophisticated than the emoticons of the combination. This combination falls under simple substitution of one known element for another to obtain predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
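For illustration only, a minimal sketch of the synchronization recited in Claim 41 — aligning timestamped facial sentiment labels with timestamped transcript words so that both emotional indicators are presented in a temporally consistent manner — follows; all timing and label values are hypothetical:

```python
# Minimal sketch: align timestamped facial sentiment labels with timestamped
# transcript words so both emotional indicators are presented together.
words = [  # (start_time_s, end_time_s, word, text_sentiment)
    (0.0, 0.4, "Hi", None),
    (0.4, 1.1, "everyone", None),
    (1.1, 1.8, "concerned", "negative"),
]
face_events = [(0.2, "neutral"), (1.2, "stressed")]  # (timestamp_s, label)

def facial_label_at(t):
    """Return the most recent facial sentiment label at or before time t."""
    label = "neutral"
    for ts, lab in face_events:
        if ts <= t:
            label = lab
    return label

for start, end, word, text_sent in words:
    face = facial_label_at((start + end) / 2)  # label at the word's midpoint
    print(f"{start:4.1f}s  {word:10s}  text: {text_sent}  face: {face}")
# The word "concerned" is presented with both the text-based sentiment and the
# temporally matching facial sentiment label.
```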
Claims 42-43 are rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Myron (U.S. 20250190686).
Regarding Claim 42, Lee teaches
42. A computer-implemented method of annotating text to reflect emotional emphasis detected from speech waveforms,
the method comprising:
a. receiving, by at least one processor, a digital audio waveform of a speaker's utterance, [Lee, Figure 3, 300, 306, 308. Lee examines the volume level of the speech as the prosodic feature that is analyzed and used for detecting emotion.]
wherein the audio waveform includes prosodic features indicative of speech intonation, pitch, and rhythm; [Lee, any audio waveform would inherently include the above prosodic features but Lee does not extract or use these for emotion detection.]
b. automatically transcribing, via an automated speech recognition module, the waveform into an alphanumeric text output; [Lee, Figure 3, 302, transcribe and 304 store the transcript.]
c. detecting, via a speech-emotion recognition engine, an emotional classification for at least one portion of the audio waveform, the classification being derived from deep learning models trained on prosodic speech features; [Lee, Figure 3, 314, 316, the prosodic feature used in Lee is the volume. “[0058] Still further, in some embodiments at block 312 the device may determine emotions based on the audio data itself, e.g., using voice recognition. The device may also determine emotions from the audio data using an artificial intelligence model having at least one neural network (e.g., recurrent or deep neural network) configured for emotion inference based on speech audio and/or transcribed words for the device to identify emotions from keywords and/or word sequences of the user's speech….”]
d. encoding, in memory, a timestamped sentiment annotation for at least one portion of the audio waveform and mapping it to a corresponding substring of the alphanumeric text output; and [Lee associates emotion with parts of the transcript by using time: “[0059] From block 312 the logic may proceed to block 314 where the device may associate emotions identified as occurring at certain times as the user speaks with various parts of the speech itself and/or with the corresponding parts of the transcript of the user's speech. For example, for each high or low volume of a portion of the user's speech, a corresponding emotion identified as occurring at the time the user provided that part of the speech may be associated with the portion of the transcript indicating that part of the speech….” “[0060] … The transcript may be presented with certain text from the transcript being enhanced with indications of corresponding emotion(s) of the user determined to have occurred at the time the user spoke the audio corresponding to that text, thus differentiating emotion-related text from other text of the transcript that might also be presented but for which no emotions were determined.” “[0057] In some examples, the determined high and low volume levels may be used by the device as indicators of emotions to identify, and thus at block 312 the device may only analyze images/video for periods of time corresponding to the determined high and low volume levels (and/or may otherwise only determine/output emotions for those periods of time). This may be done by matching timestamps for the images with timestamps for the relevant speech data/audio itself. Thus, the foregoing may help the device consume less processing resources and thus improve the functionality of the device itself. Also note that repeated words as determined at block 302 may be similarly used as indicators of emotions to identify at block 312 using facial/gesture recognition.”]
e. generating, via at least one processor, a modified alphanumeric text output with encoded emotional context, [Lee, Figure 4, 400 GUI showing the transcript with emotions encoded in the stylized text and with use of emojis and emoticons.]
wherein the modified text is transmitted to an electronic display device. [Lee, Figure 4, 400 GUI showing the transcript.]
The prosodic feature used to determine emotion in Lee is volume and Lee does not mention the features enumerated in the Claim.
Myron teaches:
a. receiving, by at least one processor, a digital audio waveform of a speaker's utterance, wherein the audio waveform includes prosodic features indicative of speech intonation, pitch, and rhythm; [Myron, see Figures 2 and 3 where the emotion is detected from speech and encoded into the recognized text of the speech. Figure 2, “speech analytics 220.” “[0002] When people communicate via speech, a significant portion of information is conveyed via nonverbal communication, such as based on the speaker's pitch, timbre, volume, tone of voice, inflection, speaking rate, and/or changes in any of the foregoing characteristics….” “[0035] The speech analytics module 220 analyzes speech in received audio data of spoken utterances …. For example, the speech analytics module 220 can generate text transcriptions of spoken utterances. Additionally, the speech analytics module 220 extracts and/or identifies nonverbal characteristics of spoken utterances. For example, the speech analytics module 220 can determine speaking rate (e.g., syllables per minute, words per minute), pitch (e.g., average pitch, highest/lowest pitch), timbre, volume (e.g., in decibels), rhythm or patterns (e.g., syncopation), and/or changes in any of the foregoing characteristics….” “[0058] … For example, a speech analytics model can use sentiment analysis and/or audio processing techniques to determine audio characteristics of the audio data (e.g., changes in pitch, inflection, timbre, tone of voice, speaking rate) that indicate one or more nonverbal characteristics (e.g., sentiment).”]
[Image: media_image6.png, greyscale, 230 × 602]
Lee and Myron pertain to detection of emotion from voice and it would have been obvious to expand the prosodic features extracted by Lee to include the more comprehensive list of Myron. This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
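For illustration only, crude estimators for two of the prosodic features listed by Myron (volume and pitch) can be sketched as follows; these are textbook approximations, not the analytics of the cited references:

```python
import numpy as np

# Illustrative estimators: RMS volume and an autocorrelation-based pitch guess.
def prosodic_features(wave, sample_rate):
    rms_volume = float(np.sqrt(np.mean(wave ** 2)))
    # Pitch estimate: lag of the autocorrelation peak, ignoring lags that would
    # correspond to pitches above roughly 400 Hz.
    ac = np.correlate(wave, wave, mode="full")[len(wave) - 1:]  # ac[k] = lag k
    min_lag = sample_rate // 400
    peak_lag = min_lag + int(np.argmax(ac[min_lag:]))
    return {"rms_volume": rms_volume, "pitch_hz": sample_rate / peak_lag}

sr = 16000
t = np.linspace(0, 0.25, sr // 4, endpoint=False)
print(prosodic_features(0.5 * np.sin(2 * np.pi * 220 * t), sr))
# Expect pitch_hz near 220 and rms_volume near 0.35.
```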
Regarding Claim 43, Lee teaches
43. The method of claim 42,
wherein the speech-emotion recognition engine utilizes machine learning classifiers, including at least one of Support Vector Machines (SVM), Naive Bayes, or deep recurrent neural networks trained on speech sentiment datasets. [Lee, “[0056] … The images/video may also be analyzed using an artificial intelligence model having at least one neural network (e.g., recurrent or deep neural network) configured for inferring emotion based on images of the user's face and/or the user's hand/arm/face gestures to determine one or more emotions of the user that occur at different times during the user's speech.”“[0058] Still further, in some embodiments at block 312 the device may determine emotions based on the audio data itself, e.g., using voice recognition. The device may also determine emotions from the audio data using an artificial intelligence model having at least one neural network (e.g., recurrent or deep neural network) configured for emotion inference based on speech audio and/or transcribed words for the device to identify emotions from keywords and/or word sequences of the user's speech….”]
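For illustration only, a toy sketch of the kind of classifier named in Claim 43 — a Support Vector Machine trained on prosodic feature vectors labeled with sentiment — follows; the feature vectors and labels are fabricated toy data, not a dataset of record:

```python
# Toy example: an SVM trained on hypothetical prosodic feature vectors.
from sklearn.svm import SVC

# Each row: [mean_pitch_hz, rms_volume, speaking_rate_words_per_second]
X_train = [
    [210.0, 0.40, 3.5],  # animated, loud, fast
    [230.0, 0.45, 3.8],
    [120.0, 0.10, 1.6],  # flat, quiet, slow
    [110.0, 0.12, 1.4],
]
y_train = ["excited", "excited", "subdued", "subdued"]

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print(clf.predict([[215.0, 0.42, 3.6]]))  # ['excited']
```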
Claim 44 is rejected under 35 U.S.C. 103 as being unpatentable over Lee and Myron in view of Biswas (U.S. 20240037824) and further in view of Wang (U.S. 20230267916).
Regarding Claim 44, Lee does not output audio. Neither does Myron.
Biswas teaches
44. The method of claim 42, further comprising
applying an audio inflection transformation to a playback signal of the speaker's voice, wherein the transformation modifies at least one of pitch, volume, or duration in synchrony with the detected emotional classification. [Biswas, Figure 1, takes in the “audiovisual input 104,” converts it to text 108, and converts the text back to audio 114; the output “Digital Content 124” is generated based on the emotion 112 and the phonemic representation 116 of the text 108. The output is thus a playback signal of the user's voice, although the goal of Biswas is showing the emotion in the image 120. “[0024] … Digital content generator 122 can use input to generate an output frame (of digital content 124) comprising a video component generated using the image 120 input and an audio component generating using the audio component from the respective input frame.”]
Lee/Myron and Biswas pertain to emotional speech, and it would have been obvious to combine the playback of the received audio from Biswas with the combined system to provide for generation of an avatar and the addition of an image to the audio as done in Biswas.
Biswas takes in the input speech (audiovisual) and regenerates it at the output. However, while the detected emotion is expressly used to modify the video image, Biswas does not expressly teach that the detected emotion is used to modify the audio.
Wang teaches:
applying an audio inflection transformation to a playback signal of the speaker's voice, wherein the transformation modifies at least one of pitch, volume, or duration in synchrony with the detected emotional classification. [Wang, Figure 1, detects emotion from text and reflects the emotion in the synchronized audio that is output with the text. “[0052] For example, during speech synthesis, for each word obtained by the word segmenting in step S102, the corresponding speech may be determined based on the text, emotional feature, and rhyme boundary of the each word. All words of the text information are arranged in the temporal sequence and go through the speech synthesis model, so as to obtain the corresponding speech sequence with emotion, and the speech sequence with emotion is also in the temporal sequence, and the speech sequence is synchronized with the text sequence.” “[0184] In the solution, the corresponding linguistic feature sequence in the original audio or text is extracted and used as the input information of the preset temporal sequence mapping model. Since the linguistic feature is related to only semantic content of the audio and not related to timbre, pitch, fundamental frequency FO features, and other features that vary by an articulator. Therefore, the solution of this embodiment is not limited to a specific articulator, and the original audio with different audio features can be applied to the preset temporal sequence mapping model described in this embodiment. That is to say, since in the solution of this embodiment, the audio feature in the audio information is not analyzed, but the linguistic feature of the pronunciation unit is analyzed after the audio information is converted into the pronunciation unit, making it possible to generate the virtual object animation without relying on the drive of a neural network model by the specific audio feature. Therefore, the end-to-end virtual object animation generation method provided by the solution of this embodiment can be applied to the end-to-end virtual object animation generation of any voice actor and any text, solving the dependence on the specific voice actor in the existing end-to-end automated audio synthesis virtual object animation technology and truly achieving the “versatility” of the technology.”]
[Image: media_image7.png, greyscale, 552 × 690]
Lee/Myron/Biswas and Wang pertain to emotional speech, and it would have been obvious to generate emotive speech based on the emotions detected from the text and play it back synchronized with the display of the text in case the audio of the original speaker is unavailable. This combination falls under combining prior art elements according to known methods to yield predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Claims 49-50 and 52 are rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Craner (U.S. 2004/0013252).
Regarding Claim 49, Lee teaches
49. A computer-implemented method of enhancing video conference comprehension by highlighting participant emotions, [Lee, “[0023] The present application discloses devices and methods for determining and presenting indications of emotions along with the transcript of one or more people that might be speaking during a meeting to highlight important words, phrases, sentences, and even topics. This may be done by analyzing the decibel levels used to speak the words via start and end times.” See also [0002] for statement of the problem. Lee also teaches the use of a video camera: “[0045] … The camera 193 may be a thermal imaging camera, an infrared (IR) camera, a digital camera such as a webcam, a three-dimensional (3D) camera, and/or a camera otherwise integrated into the system 100 and controllable by the processor 122 to gather pictures/images and/or video.” The two teachings together suggest that a video conference is possible.]
the method comprising:
a. capturing, via a plurality of input devices, audiovisual streams from multiple participants in a video conference session; [Lee teaches the use of a video camera: “[0045] … The camera 193 may be a thermal imaging camera, an infrared (IR) camera, a digital camera such as a webcam, a three-dimensional (3D) camera, and/or a camera otherwise integrated into the system 100 and controllable by the processor 122 to gather pictures/images and/or video.” “[0051] In any case, the data of block 300 may be streamed from the microphone as the user speaks, or accessed at a storage location as stored based on past speech detected by a microphone. Also note that the speech itself may be speech provided by a person while engaged in a meeting with other participants ….”]
b. analyzing, by at least one processor, each participant's facial expressions, audio waveform, and transcribed speech, to classify sentiment data, [Lee, Figure 3, 312. “[0056] … The images/video may also be analyzed using an artificial intelligence model having at least one neural network (e.g., recurrent or deep neural network) configured for inferring emotion based on images of the user's face and/or the user's hand/arm/face gestures to determine one or more emotions of the user that occur at different times during the user's speech.”]
wherein the sentiment data is derived from: [Lee teaches the use of all three: text, voice, face. See rejection of Claim 37.]
i. facial-expression recognition executed on video input; [Lee, Figure 3, 310, 312.]
ii. speech waveform analysis detecting prosodic variations; and [Lee, Figure 3, 306, 308: “[0058] Still further, in some embodiments at block 312 the device may determine emotions based on the audio data itself, e.g., using voice recognition. The device may also determine emotions from the audio data using an artificial intelligence model having at least one neural network (e.g., recurrent or deep neural network) configured for emotion inference based on speech audio and/or transcribed words for the device to identify emotions from keywords and/or word sequences of the user's speech….”]
iii. text-based sentiment analysis on transcribed speech; [Lee, Figure 3, 302, 304: “[0059] From block 312 the logic may proceed to block 314 where the device may associate emotions identified as occurring at certain times as the user speaks with various parts of the speech itself and/or with the corresponding parts of the transcript of the user's speech….”]
c. generating, for each participant, at least one of: [Lee teaches the use of color and annotation of the text, but because the output is a transcript it does not teach pitch adjustment, which is not required by the “or” language of the Claim. See rejection of Claim 38.]
i. a visual cue comprising a color-coded emotion overlay; [Lee: “[0066] … Also note that although not shown in the black and white drawing of FIG. 4, the word “thanks” 406 may be presented in green color owing to it being associated with a positive emotion (e.g., gratitude and/or happiness) by a device undertaking present principles while words immediately before and after that have not been associated with an emotion (“Hi, everyone.” and “for meeting.”) may be presented in black text and not bolded. As another example, the phrase “thrown away” 408 may be presented in red color without bolding owing to it being associated with a slightly negative emotion (e.g., annoyed) while words immediately before and after that have not been associated with an emotion may be presented in black text.”]
ii. a textual annotation describing the detected sentiment; or [Lee, Figure 4 the emoticons and emojis 418 are annotations. “[0069] Also note that emojis and emotion labels may also be provided in some examples for words spoken with extreme emphasis and associated with an extreme emotion. So, for example, as shown in the fifth line of the transcript 402, the word “concerned” 416 may not just be bolded and underlined but also an emoji 418 for stress may be indicated adjacent to it along with the label “stressed” 420.”]
iii. an audio pitch adjustment modifying the vocal output to reflect the identified sentiment;
d. encoding, via a digital communication module, the encoded emotion-enhanced audiovisual stream and transmitting it to an electronic display device, [Lee, Figure 4 shows the GUI with annotated and stylized text that can convey the emotion of the meeting participant to the others. Figure 3, 322. “[0064] … These patterns may then be presented to the user via an electronic display at block 322 through a representation of the profile, either automatically when the transcript itself is presented and/or based on user command.” “[0065] Now in reference to FIG. 4, it shows an example graphical user interface (GUI) 400 that may be presented on an electronic display. The GUI 400 may show at least a portion of a transcript 402 of a user's speech. As shown in FIG. 4, below each line of the transcript 402 may be a graphical representation 404 of a waveform demonstrating the volume levels of the user's speech over time when corresponding words represented immediately above respective parts of the representation 404 were spoken….”] thereby assisting visually or hearing-impaired recipients in perceiving emotional nuances among the participants. [This portion is a statement of intended result which will happen for a deaf person who cannot hear the voices of other participants but can tell if they were speaking loudly or calmly and were showing anger or not.]
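For illustration only, the following minimal Python sketch, not Lee's implementation, shows one way the multimodal classifications of limitation (b) above could be fused and mapped to the color-coded visual cue and textual annotation of limitation (c); the labels, colors, and tie-breaking rule are hypothetical.

```python
# Minimal illustrative sketch (hypothetical labels/colors, not Lee's method):
# fuse per-modality sentiment labels into one color-coded cue plus annotation.
from collections import Counter

COLOR_FOR_SENTIMENT = {"positive": "green", "negative": "red", "neutral": "black"}

def fuse_sentiment(face: str, voice: str, text: str) -> dict:
    """Majority vote across the three modalities; a three-way tie falls back to text."""
    label, count = Counter([face, voice, text]).most_common(1)[0]
    if count == 1:
        label = text
    return {"sentiment": label,
            "color": COLOR_FOR_SENTIMENT.get(label, "black"),
            "annotation": f"[{label}]"}

# Face and prosody read negative while the words alone look neutral:
print(fuse_sentiment("negative", "negative", "neutral"))
# -> {'sentiment': 'negative', 'color': 'red', 'annotation': '[negative]'}
```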
Lee teaches a meeting where the participants are being video-taped with the system receiving the audio and video streams of the participants.
But Lee does not expressly include “video conferencing” and the output of Lee is visual without audio.
Craner teaches:
49. A computer-implemented method of enhancing video conference comprehension by highlighting participant emotions, [Craner, pertains to the early versions of video-conferencing. See Figure 15 showing two “Talkers” and a “voice capture device 14.” “[0037] The voice capture device 14 (or a plurality of voice capture devices) is configured to capture voices from one or more talkers, and transmit the voices as voice information over the network to the voice-projecting device 23. The voice capture device and the associated voice projecting device 23 may be as simple as a telephone or stereo phone or, alternatively, may be as complex as a teleconferencing system with video, audio, and data communications. …”]
…
d. encoding, via a digital communication module, the encoded emotion-enhanced audiovisual stream and transmitting it to an electronic display device, thereby assisting visually or hearing-impaired recipients in perceiving emotional nuances among the participants. [Craner, See Figures 11-15 showing the detection of emotion from the voice and display of the emotion of the talkers by caricatures or images of the speaking talkers. Figure 15, “emotion image morphing 1522.” “[0095] In certain embodiments, figures that display different ones of the talkers speaking can be displayed as computer generated figures. The computer generated figures of the talkers can include an image of a part of the talker, such as a photograph of the face of the talkers projected over an animated body. In addition, emotions of the talkers can be illustrated using computer-generated techniques for each of the talkers. Computer generated figures (as illustrated in FIGS. 12-14), represent each of the talker in a manner that is configured to illustrate emotions of each respective talker is referred to in this disclosure as an "emotigraphic." Since the image of the three talkers is not being transmitted in these systems as is the case in teleconferencing systems, the bandwidth required to transmit the displayed image is reduced considerably or to zero in the case of a receive side only implementation.”]
Lee and Craner pertain to the detection of emotion in participants in a meeting, and both take video images of the speakers/participants. It would have been obvious to modify Lee to expressly include the video conferencing that is expressly and prominently taught in Craner. This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Regarding Claim 50, Lee pertains to transcribing the speech of participants in a meeting and presenting the transcript, based on the audio and video streams of data. “[0056] … The images/video may also be analyzed using an artificial intelligence model having at least one neural network (e.g., recurrent or deep neural network) configured for inferring emotion based on images of the user's face and/or the user's hand/arm/face gestures to determine one or more emotions of the user that occur at different times during the user's speech.” It would make sense for the transcripts to be presented in real time during the meeting, but this is not expressly taught in Lee.
Craner teaches
50. The method of claim 49,
wherein a real-time emotion tracking module executes continuously on incoming audiovisual streams, dynamically updating the detected emotional classifications and rendering them on an electronic display device in real- time. [Craner pertains to audio/video teleconferencing that is real-time and teaches the update of the locations of the speakers and indicating the talking speaker from amongst the participants and because the emotions are also indicated, the emotions would be updated in real-time as the conference is proceeding. “[0057] Certain embodiments of the talker indicator 32 indicate to the listener the identity of the talker who is most likely speaking at any given time. The identity of the talker is displayed on a display in one group of embodiments. …” “[0066] For example, take the case of a listener who is listening to a voice-projecting device 23 that is projecting the voices of two distinct talkers. The voices are transmitted from the voice capture device 14. The voice projecting device 23 projects each voice segment in combination with a short introductory announcement, briefly announcing the identity of that talker. The amount of time necessary to make the brief audio announcement is deducted from the duration of the respective voice segment to provide a continuous dialog of talkers. In one embodiment, the voice segment is temporally compressed while maintaining the original frequency characteristics of the speech for that voice segment….” “[0116] … The talker emotion determiner portion 1108, if installed, transmits the mental state of the talker to the final detect talker portion 1110 over conductor 1130.”]
The rationale is as provided for Claim 49; in addition, the real-time nature of teleconferencing requires real-time updating of the other associated information.
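For illustration only, the following minimal Python sketch, not Craner's implementation, shows a continuously running loop of the kind recited in Claim 50 that re-classifies each participant as audiovisual chunks arrive and redraws the display only when a classification changes; the classify and render callables are hypothetical stand-ins for the model inference and display-update steps.

```python
# Minimal illustrative sketch (hypothetical interfaces): continuous per-participant
# emotion tracking that updates the rendered state as new audiovisual chunks arrive.
from typing import Callable, Dict, Iterable, Tuple

def track_emotions(frames: Iterable[Tuple[str, bytes]],
                   classify: Callable[[bytes], str],
                   render: Callable[[Dict[str, str]], None]) -> None:
    """frames yields (participant_id, av_chunk) in arrival order."""
    state: Dict[str, str] = {}
    for participant, chunk in frames:
        label = classify(chunk)
        if state.get(participant) != label:   # redraw only on a change
            state[participant] = label
            render(dict(state))

if __name__ == "__main__":
    fake_frames = [("alice", b"a1"), ("bob", b"b1"), ("alice", b"a2")]
    track_emotions(fake_frames,
                   classify=lambda chunk: "calm" if chunk.endswith(b"1") else "stressed",
                   render=print)
```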
Regarding Claim 52, Lee teaches:
52. The method of claim 49,
wherein the conference visualization interface assigns a dynamic emotion-based speaker indicator, visually emphasizing participants exhibiting strong sentiment shifts over time, [Lee teaches that the emotions are timestamped to the words of the transcript. [0057]. Lee also tracks the shift in volume indicating a shift in emotions. “[0057] In some examples, the determined high and low volume levels may be used by the device as indicators of emotions to identify, and thus at block 312 the device may only analyze images/video for periods of time corresponding to the determined high and low volume levels (and/or may otherwise only determine/output emotions for those periods of time). This may be done by matching timestamps for the images with timestamps for the relevant speech data/audio itself….”]
wherein the encoded visualization is generated as an overlay on an electronic display device.
Lee does not teach a dynamic audiovisual output.
Craner teaches:
52. The method of claim 49,
wherein the conference visualization interface assigns a dynamic emotion-based speaker indicator, visually emphasizing participants exhibiting strong sentiment shifts over time, [Craner, Figure 15: “pre-stored caricature and/or images of speaking talkers 1520” and “emotion image morphing engine 1522.” “[0143] T … Another embodiment involves the use of initially or pre-stored, caricatures and/or images of talkers 1520 in conjunction with the emotion image morphing engine 1522. …. Based upon the emotion detected within the controller, the emotion image morphing engine 1522 operates to morph the imagines and/or caricatures of the talkers and provide an emotigraphic image indicating whether the talker is happy, sad, silent, etc. An increased level of emotion may be provided by the emotion image morphing engine 1522 the talker stamping their feet, sleeping, or performing some other such activity whether emotigraphics are icon based, photographic image based or a combination thereof.”]
wherein the encoded visualization is generated as an overlay on an electronic display device. [Craner, Figures 3- 4, “Active talker indicator 406” on the “liquid crystal display 402.” Figures 12-14 showing human-like figures indicating who is talking. “[0054] There are multiple embodiments of talker indicators that display the identity of the talker to the listener. In the embodiment shown in FIG. 3, a distinct light (e.g., an indicator portion on a light emitting diode or a liquid crystal display) illuminates the identity of the current speaker. In the embodiment shown in FIG. 4, a portion of a display indicates the name and other information of the talker. …. In yet another embodiment, distinct human-like figures (e.g., stick figures or more detailed human-like figures) as shown in FIGS. 12-14 may be displayed for each of the talkers, and the human-like figure corresponding to the current talker may be highlighted, accentuated, or animated. Different embodiments of talker identifier systems 16, such as shown in FIGS. 10, 11, and 15, can display one or more of such a human-like likeness. ...”]
[media_image8.png (greyscale figure reproduced from the cited reference)]
[media_image9.png (greyscale figure reproduced from the cited reference)]
The rationale for the combination is as provided for Claim 49. The emotions change during the speech, and it would be rational to show the changing emotions with the morphing animated emotigraphic images of Craner.
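For illustration only, the following minimal Python sketch, not taught by Lee or Craner, shows one way participants exhibiting strong sentiment shifts over time could be identified for visual emphasis; the numeric scores and threshold are hypothetical.

```python
# Minimal illustrative sketch (hypothetical scores/threshold): flag participants
# whose sentiment score shifts sharply between consecutive time windows.
SCORE = {"negative": -1.0, "neutral": 0.0, "positive": 1.0}
SHIFT_THRESHOLD = 1.5

def strong_shifts(history: dict) -> list:
    """history maps participant -> chronological list of sentiment labels."""
    flagged = []
    for participant, labels in history.items():
        scores = [SCORE[label] for label in labels]
        if any(abs(b - a) >= SHIFT_THRESHOLD for a, b in zip(scores, scores[1:])):
            flagged.append(participant)
    return flagged

print(strong_shifts({"alice": ["neutral", "positive", "negative"],   # swing of 2.0
                     "bob":   ["neutral", "neutral", "positive"]}))  # swing of 1.0
# -> ['alice']
```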
Claim 51 is rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Craner and further in view of Sumant (U.S. 20200206631).
Regarding Claim 51, Lee teaches:
51. The method of claim 49,
wherein a machine learning-based ensemble modeling system is used to reconcile sentiment classifications by aggregating results from: [Lee teaches the use of all three Text (keywords), Audio (volume) and Facial expressions to arrive at a detected emotion. See rejection of Claim 37 above.]
a. text-based sentiment analysis, [Lee, Figure 3, 302, “[0058] … emotion inference based on speech audio and/or transcribed words for the device to identify emotions from keywords and/or word sequences of the user's speech….”]
b. audio waveform analysis, and [Lee, Figure 3, 306, 308, “[0058] … emotion inference based on speech audio ….” “[0057] In some examples, the determined high and low volume levels may be used by the device as indicators of emotions to identify,…”]
c. facial expression recognition, [Lee, Figure 3, 310, 312, “[0007] Still further, in some implementations the instructions may be executable by the least one processor to use facial recognition data to determine whether certain portions of the user's speech are associated with a positive emotion ….” [0056]-[0057].]
wherein ensemble modeling optimizes classification accuracy through weighted scoring and confidence estimation.
Lee uses a trained neural network model that takes into account all three types of information to come up with an emotion. [0056] and [0058].
Lee does not teach ensemble modeling. Neither does Craner.
Sumant teaches:
wherein ensemble modeling optimizes classification accuracy through weighted scoring and confidence estimation. [Sumant includes emotion detection and prediction for participants in a video game and includes ensemble modeling as one of the types of models used for emotion detection and prediction. See Figure 1D for the use of a number of models 160A … 160N which can be ensembled. “Weights 164” in Figure 1C and the use of a threshold confidence in [0095]. “[0054] Some non-limiting examples of machine learning algorithms that can be used to generate and update the parameter functions or prediction models can include supervised and non-supervised machine learning algorithms, including … ensemble algorithms (such as, for example, Stacked Generalization), and/or other machine learning algorithms.”]
Lee/Craner and Sumant pertain to the detection of emotion from various types of data, and it would have been obvious to replace the neural network model of the combination with the ensemble modeling of Sumant, which Sumant lists as an equivalent to the neural network machine learning of the combination. This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
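For illustration only, the following minimal Python sketch, not Sumant's implementation, shows weighted aggregation of text-, audio-, and face-based classifier outputs with a simple confidence estimate, of the general kind recited in Claim 51; the weights and probabilities are hypothetical.

```python
# Minimal illustrative sketch (hypothetical weights): weighted ensemble of three
# per-modality classifiers with a normalized confidence for the winning label.
def ensemble(text_probs, audio_probs, face_probs, weights=(0.3, 0.3, 0.4)):
    """Each *_probs maps emotion label -> probability for one modality."""
    combined = {}
    for w, probs in zip(weights, (text_probs, audio_probs, face_probs)):
        for label, p in probs.items():
            combined[label] = combined.get(label, 0.0) + w * p
    label = max(combined, key=combined.get)
    confidence = combined[label] / sum(combined.values())
    return label, confidence

label, conf = ensemble({"angry": 0.7, "neutral": 0.3},
                       {"angry": 0.6, "neutral": 0.4},
                       {"angry": 0.2, "neutral": 0.8})
print(label, round(conf, 2))   # -> neutral 0.53 with these illustrative weights
```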
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Kaushik (U.S. 20220068001) “[0040] The text that the avatar is to speak from block 500 is sent to an emotion and sentiment detection block 508 to extract sentiment and emotion from the text in accordance with the training from FIG. 4. The emotion and sentiment extracted from the text at block 508 are input to the model 510 along with the speech features from block 504 and aligned text/speech signals from block 506. If desired, a target emotion may be input to the model as indicated at block 512. The target emotion may be part of annotated text input by the user or derived from the input text using machine learning.” “[0007] In example embodiments, the method can include detecting emotion and sentiment from the first text, aligning the first text with speech representing the first text to render aligned text/speech, and inputting the emotion, sentiment, and aligned text/speech to the ML model. The method also may include inputting a target emotion to the ML module. Example implementations of the method may include receiving first probabilities from the ML modality representing facial action, receiving second probabilities from the ML model representing emotion, and using the first and second probabilities to establish facial action units (FAU). The method can then include animating the image of the first face in accordance with the FAU.” “[0008] In another aspect, an assembly includes at least one display configured to present an animated computer avatar and at least one processor configured with instructions to execute a machine learning (ML) model. The instructions are executable to receive text indicating speech to be spoken by the avatar, process the text using the ML model to generate facial action units (FAU), and animate the computer avatar in accordance with the FAU.”
[media_image10.png (greyscale figure reproduced from the cited reference)]
Nguyen(U.S. 10367931): “A communication apparatus, a method, a computer readable medium, and a system providing communication with cognitive and visual assistance. The cognitive assistance and visual assistance is provided during a communication between a first communication apparatus with at least one second communication apparatus via a network. The first communication apparatus captures communication data comprising visual and audio information obtained from the communication and captures synchronized cognitive and emotional data generated from the user during the communication with the second communication apparatus. The communication data and the synchronized cognitive and emotional data is stored and converted into a visual form comprising at least one of synchronized text, symbols, sketches, images, and animation. The visual form is displayed on a display of the first communication apparatus.” Abstract. “The processor 800 further includes a speech to text converter 804. The speech to text converter 804 may obtain only the audio portion of the primary contents via the communicator 801 (using an extractor, not shown) and convert the audio into text scripts. This is provided by way of an example and not by way of a limitation. According to an exemplary embodiment, the speech to text converter 804 may split the audio data (voice) into respective portions or chunks and convert to text, which is then displayed as scripts 91 (shown in FIG. 9, explained below). That is, the primary contents (video/audio data) are split into segments based on continuity of the voice signal via the speech to text converter….” 20:64-21:10. “According to another exemplary embodiment, as shown in FIG. 8, the converted speech is provided to the secondary contents generator 805. The secondary contents are explained in greater detail below. The scripts, along with other secondary contents such as notes and comments may be provided to the data block generator to be added as part of the data block by the synchronizer 822. That is, the primary contents (audio/video data) is synchronized or linked with a corresponding determined cognitive or emotional state by the determiner 803 and (optionally) the segmented audio data which is converted into text by the speech to text converter 804 for a display on as scripts (shown in FIG. 9). The scripts may be generated in real-time as the conversation takes place. Each script may be displayed in a distinct manner (e.g., color coded) so as to show the emotional state of the user for each portion of the primary data.” 21:20-35.
[media_image11.png (greyscale figure reproduced from the cited reference)]
[media_image12.png (greyscale figure reproduced from the cited reference)]
Wang (U.S. 20230267916):
Figure 1, S104. “[0015] In the solution, the virtual object animation, especially a 3D animation, with emotional speech can be quickly and efficiently generated from a text, which has high versatility and does not need the drive of a specific voice actor. Specifically, the emotional speech is synthesized by analyzing the emotional feature and the rhyme boundary of the text. Further, the corresponding virtual object animation is generated based on the text and the emotional speech. Further, data of the generated virtual object animation in the temporal sequence is synchronized in time with the audio information, making it possible to generate the virtual object animation directly from the text, and the generated virtual object animation can be synchronized with the emotional speech when acting in the temporal sequence.” “[0163] Based on the text-based virtual object animation generation system in this embodiment, the user provides the input information at the collection module to obtain the corresponding virtual object animation and the emotional speech synchronized with the virtual object animation at the text-based virtual object animation generation apparatus 5.” Animation is for display.
Hui (U.S. 20250384605): Hui takes in an audiovisual input and generates a transcript of the audio and later has to re-align and re-synchronize a rendered video having the proper emotional features with the text. Figure 5A: “[0084] At S4005, the subtitle text, the emotion index, and the topic feature data are rendered with the video stream according to a configuration of the rendering set. “[0025] FIG. 5B is a schematic diagram of a mapping relationship between a preset sensitive statement and a rendering set according to an embodiment of the present disclosure;” “[0091] … dynamic changes of facial expressions are extracted, and different expression feature data is generated for different expressions …” “[0040] … For example, the text feature data obtained from the audio stream, the topic feature data (i.e., the secondarily processed content) obtained through mapping or multiple levels of mapping of the text feature data, and the emotion index obtained from the video stream are respectively rendered with the multimedia data, to obtain a video stream, which not only includes the picture content of the conference, but also includes multi-dimensional auxiliary information. …” “[0085] It can be understood that the subtitle text is used to obtain a corresponding sound time series which is since starting of the feature audio stream, where the subtitle text has a timestamp when being extracted, so time corresponding to the text can be determined. The micro-expression combination is used to obtain a corresponding mouth shape time series which is since starting of a mouth shape expression. Micro-expressions are extracted from the video stream according to a time series sequence, so the micro-expression combination also has a timestamp. In addition, the video stream and the audio stream are from the same multimedia source and have the same timing. However, because the audio stream and the video stream are separately processed and secondary data processing is performed, data timing of the audio stream and data timing of the video stream may become asynchronous after data processing. Therefore, consistency of the sound time series and the mouth shape time series is determined, and the subtitle text, the emotion index, and the coordinates of the facial region are aligned in timing. The subtitle text corresponds to the sound time series, the emotion index corresponds to the mouth shape time series, and the coordinates of the facial region correspond to the mouth shape time series. When it is determined that the sound time series is consistent with the mouth shape time series, it indicates that the subtitle text, the emotion index, and the coordinates of the facial region are also aligned in time series.”
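For illustration only, the following minimal Python sketch, not Hui's algorithm, shows one way timestamped subtitle segments and timestamped emotion samples could be re-aligned after separate processing by pairing each subtitle with the nearest emotion sample in time; the data values are hypothetical.

```python
# Minimal illustrative sketch (hypothetical data, not Hui's method): pair each
# timestamped subtitle segment with the nearest timestamped emotion sample.
from bisect import bisect_left

def align(subtitles, emotions):
    """subtitles and emotions are (timestamp_seconds, value) lists sorted by time."""
    times = [t for t, _ in emotions]
    pairs = []
    for t, text in subtitles:
        i = bisect_left(times, t)
        # choose whichever neighbouring emotion sample is closer in time
        if i > 0 and (i == len(times) or abs(times[i - 1] - t) <= abs(times[i] - t)):
            i -= 1
        pairs.append((text, emotions[i][1]))
    return pairs

print(align([(0.0, "Hi, everyone."), (2.4, "I'm concerned.")],
            [(0.1, "neutral"), (2.5, "stressed")]))
# -> [('Hi, everyone.', 'neutral'), ("I'm concerned.", 'stressed')]
```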
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659