DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments, see p. 13, 1st paragraph and p. 16, 3rd paragraph, filed December 15, 2025, with respect to Claims 9-17 and 22 have been fully considered and are persuasive. The 35 U.S.C. 102 rejections of Claims 9, 10, 14, 15, and 17 and the 35 U.S.C. 103 rejections of Claims 11-13 and 16 have been withdrawn.
Applicant’s arguments with respect to Claims 1-5, 7, 8, 18, 20, and 21 have been considered but are moot because new grounds of rejection are made in view of Wang (US 20240212705A1).
Applicant argues that Belanich (US 20250061917A1) does not suggest that first emotions data and second emotions data are input into, or processed by, the SER model, and therefore does not suggest applying, to the one or more machine learning models, first emotions data representative of a first sequence of emotional states and second emotions data representative of a second sequence of emotional states (p. 15).
In reply, the Examiner points out that new grounds of rejection are made in view of Wang to teach this limitation.
Applicant argues that Belanich does not suggest that the SER model outputs two different predicted emotions for the same snippet of speech. Therefore, Belanich does not suggest that the SER model determines both one or more first probabilities associated with a first predicted emotion representing the snippet of speech and one or more second probabilities associated with a second predicted emotion representing the snippet of speech (p. 15).
In reply, the Examiner points out that Belanich describes “A snippet of speech 122…is input to SER model 124, which generates a predicted emotion 126. This is compared against the weak label 112. When making the comparison, the system may use a sigmoid cross-entropy loss. In this case, for each emotion in a taxonomy the model would output a probability between 0 to 1 that the emotion is present in the speech. This set of probabilities is then compared to the weak label, and is penalized according to a cross-entropy loss function” [0024]. Belanich describes “fine tuning or inference, in which obtained speech 142 is applied to SER model 144, which generates predicted emotion 146. Here, the predicted emotion 146 is compared against one or more ground truth labels 148” [0025]. Belanich describes “The ground truth labels 148 may come from human raters who listen to a speech utterance and make a judgement on the emotions they perceive in the speech. Multiple raters can be employed on each utterance, which enables building a probability distribution over possible responses. The nature of these ground truth labels may dictate how the model’s output is compared against the ground truth. In the case of a single response, just like the weak labels, a cross-entropy loss is appropriate. Probability distributions over emotions can also be handled with a cross-entropy loss” [0027]. Thus, Belanich teaches that the SER model determines one or more first probabilities associated with a first sequence of emotional states (predicted emotion 126) representing the second speech (122) [0024] and one or more second probabilities associated with the second sequence of emotional states (ground truth) representing the second speech (122) [0027].
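For clarity only, the per-emotion sigmoid cross-entropy comparison described in Belanich [0024] may be sketched as follows. This sketch is illustrative and is not a reproduction of Belanich’s implementation; the taxonomy, model scores, and weak label are hypothetical values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cross_entropy(probs, weak_label, eps=1e-12):
    # Per-emotion binary cross-entropy between the predicted probabilities
    # and the multi-hot weak label, summed over the taxonomy.
    probs = np.clip(probs, eps, 1.0 - eps)
    return -np.sum(weak_label * np.log(probs) + (1.0 - weak_label) * np.log(1.0 - probs))

# Hypothetical four-emotion taxonomy and raw model outputs for one snippet of speech.
taxonomy = ["anger", "joy", "sadness", "surprise"]
logits = np.array([1.2, -0.4, 0.3, -2.0])
probs = sigmoid(logits)                      # a probability between 0 and 1 per emotion
weak_label = np.array([1.0, 0.0, 0.0, 0.0])  # weak label indicating "anger" is present

loss = sigmoid_cross_entropy(probs, weak_label)
print(dict(zip(taxonomy, probs.round(3))), "loss:", round(float(loss), 3))

In this sketch, the set of per-emotion probabilities is compared against the weak label and penalized by a cross-entropy loss, consistent with the passage quoted above.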
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 7, 8, 18, 20, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Belanich (US 20250061917A1) and Wang (US 20240212705A1).
As per Claim 1, Belanich teaches a method comprising: determining, using one or more machine learning models and based at least on first audio data representative of first speech, one or more emotional states that represent the first speech (input data, such as speech segments or other audio input, may be operated on by trained LLM models to perform speech emotion recognition, the models may be used to convey the tone of the speech, and thus can be widely applied to situations in which speech emotion recognition can provide helpful context, [0080], input speech 102 (a snippet of an audio file) is applied to an automatic speech recognition (ASR) module 104, which is configured to convert the speech to text, ASR module 104 outputs a transcript that corresponds to the speech, the transcript is applied to a pre-trained LLM, the result of this stage is to extract emotion labels and generate a predicted emotion, [0022]), wherein the one or more machine learning models (124) are trained, at least, by: applying, to the one or more machine learning models, second audio data representative of second speech (122);
determining, using the one or more machine learning models and based at least on the second audio data (122), one or more first probabilities associated with the first sequence of emotional states representing the second speech and one or more second probabilities associated with the second sequence of emotional states representing the second speech; determining one or more losses based at least on the one or more first probabilities, the one or more second probabilities, and an indication of whether the first sequence of emotional states better represents the second speech as compared to the second sequence of emotional states; and updating one or more parameters of the one or more machine learning models based at least on the one or more losses ([0022], snippet of speech 122 is input to SER model 124, which generates a predicted emotion 126, this is compared against the weak label 112, when making the comparison, the system may use a sigmoid cross-entropy loss, for each emotion in a taxonomy the model would output a probability between 0 and 1 that the emotion is present in the speech, this set of probabilities is then compared to the weak label, and is penalized according to a cross-entropy loss function, [0024], fine tuning or inference, in which obtained speech 142 is applied to SER model 144, which generates predicted emotion 146, the predicted emotion 146 is compared against ground truth labels 148, [0025], ground truth labels 148 may come from human raters who listen to a speech utterance and make a judgement on the emotions they perceive in the speech, multiple raters can be employed on each utterance, which enables building a probability distribution over possible responses, the nature of these ground truth labels may dictate how the model’s output is compared against the ground truth, probability distributions over emotions can be handled with a cross-entropy loss, [0027]). These passages of Belanich are quoted in full in the Response to Arguments above.
Thus, Belanich teaches that the SER model determines one or more first probabilities associated with a first sequence of emotional states (predicted emotion 126) representing the second speech (122) [0024] and one or more second probabilities associated with the second sequence of emotional states (ground truth) representing the second speech (122) [0027].
However, Belanich does not teach applying, to the one or more machine learning models, first emotions data representative of a first sequence of emotional states, and second emotions data representative of a second sequence of emotional states; determining, using the one or more machine learning models, the first emotions data, and the second emotions data, the one or more first probabilities and the one or more second probabilities. Wang, however, teaches applying, to the one or more machine learning models, second audio data representative of second speech, first emotions data representative of a first sequence of emotional states (first candidate emotion), and second emotions data representative of a second sequence of emotional states (second candidate emotion) (the information for recognition may be a to-be-recognized audio, emotion prediction is a process of determining an emotion expressed by the information for recognition, which may be implemented based on physical information (audio waveform and a timbre), the emotion processing device may preset or predefine a plurality of candidate emotions, [0078], the emotion prediction models described herein are models corresponding to the artificial neural network, [0046]); determining, using the one or more machine learning models and based at least on the second audio data, the first emotions data, and the second emotions data, one or more first probabilities associated with the first sequence of emotional states (first candidate emotion) representing the second speech and one or more second probabilities associated with a second sequence of emotional states (second candidate emotion) representing the second speech (performing emotion prediction on information for recognition for a current round to obtain a current emotion direction and a first probability of each of a plurality of candidate emotions, the current emotion direction being a direction of an emotion of the information for recognition, the candidate emotion being a preset emotion type, and a type of the direction of the emotion including positive, negative, and directionless, [0007], [0078], [0046]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Belanich to include applying, to the one or more machine learning models, first emotions data representative of a first sequence of emotional states, and second emotions data representative of a second sequence of emotional states; determining, using the one or more machine learning models, the first emotions data, and the second emotions data, the one or more first probabilities and the one or more second probabilities because Wang suggests that this improves accuracy of emotion recognition [0005].
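For clarity only, the training step recited in Claim 1, as mapped to the combination of Belanich and Wang, may be sketched as follows. The scoring function, feature encodings, and preference indication below are hypothetical and illustrate only the general technique of scoring two candidate emotion sequences against the same speech and computing a loss from the indication of which sequence better represents that speech; they are not taken from either reference.

import numpy as np

rng = np.random.default_rng(0)

def score(w, audio_feat, emotion_seq_feat):
    # Probability that the given emotion sequence represents the speech.
    x = np.concatenate([audio_feat, emotion_seq_feat])
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def preference_loss(p_first, p_second, first_is_better, eps=1e-12):
    # Cross-entropy on the preference: the sequence indicated as better
    # representing the speech should receive the higher probability.
    p_pref = p_first / (p_first + p_second + eps)
    target = 1.0 if first_is_better else 0.0
    return -(target * np.log(p_pref + eps) + (1.0 - target) * np.log(1.0 - p_pref + eps))

audio_feat = rng.normal(size=8)   # second audio data (hypothetical feature vector)
first_seq = rng.normal(size=4)    # first emotions data (hypothetical encoding of the first sequence)
second_seq = rng.normal(size=4)   # second emotions data (hypothetical encoding of the second sequence)
w = rng.normal(size=12)           # model parameters to be updated based on the loss

p1 = score(w, audio_feat, first_seq)    # one or more first probabilities
p2 = score(w, audio_feat, second_seq)   # one or more second probabilities
loss = preference_loss(p1, p2, first_is_better=True)  # indication: the first sequence is better
print(round(p1, 3), round(p2, 3), round(float(loss), 3))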
As per Claim 2, Belanich teaches wherein the one or more machine learning models are further trained, at least, by: determining that the one or more second probabilities are less than the one or more first probabilities, wherein the determining the one or more losses is based at least on the determining the one or more second probabilities are less than the one or more first probabilities and the indication [0024, 0025, 0027].
As per Claim 7, it would have been obvious to one of ordinary skill in the art that in Belanich, new speech can be input, and the same process is performed on the new speech. Thus, Belanich teaches wherein the one or more machine learning models are further trained, at least, by: determining, using the one or more machine learning models and based at least on third audio data representative of third speech, a first distribution of values associated with second emotional states; determining one or more second losses based at least on the first distribution of values and a second distribution of values associated with the second emotional states, the second distribution of values being associated with ground truth data; and updating, based at least on the one or more second losses, one or more initial parameters of the one or more machine learning models to include the one or more parameters [0022, 0024, 0025, 0027].
As per Claim 8, Belanich teaches wherein the second distribution of values includes at least a first value associated with a first emotional state of the second emotional states and a second value associated with a second emotional state of the second emotional states, the first value being greater than the second value based at least on the ground truth data indicating that the first emotional state represents the third speech [0022, 0024, 0025, 0027].
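For clarity only, the comparison recited in Claims 7 and 8 between a predicted distribution of values and a ground-truth distribution built from multiple raters (Belanich [0027]) may be sketched as follows. The emotional states, model outputs, and rater counts are hypothetical.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(pred_dist, target_dist, eps=1e-12):
    # Loss between the model's distribution over emotional states and a
    # target distribution derived from ground truth data.
    return -np.sum(target_dist * np.log(pred_dist + eps))

states = ["anger", "joy", "sadness"]
logits = np.array([0.5, 1.5, -0.2])   # model output for the third speech (hypothetical)
first_dist = softmax(logits)          # first distribution of values

# Second distribution of values: three of four hypothetical raters heard "joy" and one
# heard "anger", so the value for "joy" is greater than the value for "anger".
second_dist = np.array([0.25, 0.75, 0.0])

loss = cross_entropy(first_dist, second_dist)
print(dict(zip(states, first_dist.round(3))), "loss:", round(float(loss), 3))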
As per Claim 18, Claim 18 is similar in scope to Claim 1, and therefore is rejected under the same rationale.
As per Claim 20, Belanich teaches wherein the one or more processors is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing one or more deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing one or more generative AI operations; a system for performing operations using one or more large language models (LLMs); a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources (computing device 902 may be implemented as a cloud-based server system, [0077]).
As per Claim 21, Belanich teaches a snippet of speech 122 is input to SER model 124, which generates a predicted emotion 126. This is compared against the weak label 112. When making the comparison, the system uses a sigmoid cross-entropy loss. For each emotion the model would output a probability between 0 and 1 that the emotion is present in the speech. This set of probabilities is then compared to the weak label, and is penalized according to a cross-entropy loss function [0024]. Obtained speech 142 is applied to SER model 144, which generates predicted emotion 146. The predicted emotion 146 is compared against one or more ground truth labels 148 [0025]. The nature of these ground truth labels dictates how the model’s output is compared against the ground truth. A cross-entropy loss is appropriate. Probability distributions over emotions can also be handled with a cross-entropy loss [0027]. Thus, the first sequence of emotional states (predicted emotion) includes a first emotional state (126) associated with a first portion (122) of the second speech and a second emotional state (predicted emotion) associated with a second portion (the snippet that comes after the snippet of speech 122) of the second speech; and the second sequence of emotional states (ground truth) includes at least a third emotional state (ground truth) associated with the first portion (122) of the second speech and a fourth emotional state (ground truth) associated with the second portion (the snippet that comes after the snippet of speech 122) of the second speech [0024, 0025, 0027].
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Belanich (US 20250061917A1) and Wang (US 20240212705A1) in view of Liu (US 20220374637A1) and Isobe (US 20100003969A1).
Belanich and Wang are relied upon for the teachings as discussed above relative to Claim 1.
However, Belanich and Wang do not teach wherein the one or more machine learning models are further trained, at least, by: generating, based at least on the second audio data and the first sequence of emotional states, first video data representative of a first video depicting a first animation associated with the first sequence of emotional states. Liu, however, teaches wherein the one or more machine learning models are further trained, at least, by: generating, based at least on the second audio data and the first sequence of emotional states, first video data representative of a first video depicting a first animation associated with the first sequence of emotional states (generating video showing this speaker’s mouth moving to appear as if this speaker can be seen uttering this speech, this can include related movements or expressions determined based upon aspects of this speech, such as emotion, [0061], a GAN can be trained to generate video based upon key points identified for people’s faces when talking, changes in key point locations may be impacted by factors such as emotion detected in speech, [0064]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Belanich and Wang so that the one or more machine learning models are further trained, at least, by: generating, based at least on the second audio data and the first sequence of emotional states, first video data representative of a first video depicting a first animation associated with the first sequence of emotional states because Liu suggests that this way, video data does not need to be transmitted, since the machine learning model can generate the video, and thus this prevents presentation and quality issues for networks that have limited bandwidth [0047].
However, Belanich, Wang, and Liu do not teach generating, based at least on the second audio data and the second sequence of emotional states, second video data representative of a second video depicting a second animation associated with the second sequence of emotional states; receiving input data representative of a selection that the first video better represents the second speech as compared to the second video; and generating the indication based at least on the selection. Isobe, however, teaches a voice message service for portable phones, in which a caller terminal leaves a message by video [0005]. The range of emotional expression by tone of voice is inevitably narrower when a human talks to a machine without the corresponding party present; as a result, the emotion intended by the caller sometimes is not fully reflected in the message left in a voice message. To solve this problem in the prior art, a chat server detects emotion parameters based on speech data to generate emotion data, generates synthesized facial expression data by synthesizing the emotion data with facial expression data extracted from image data, and generates a character synthesis image by synthesizing character data with the synthesized facial expression data [0010]. However, merely reading the emotion of a message automatically by emotion recognition software might give an incorrect impression to a receiver user, because the recognition result sometimes does not correspond to the emotion intended by the caller user [0012]. To solve this problem, the scenes that best represent the emotion of each message are thumbnailed, and each thumbnailed scene corresponds to image data extracted from the message, which is a video message [0066]. Thus, the message has a plurality of scenes with a plurality of emotions, and the system selects the scenes (first video) that better represent the emotion of the message as compared to the other scenes (second video). This teaching from Isobe can be implemented into the device of Liu to include generating, based at least on the second audio data and the second sequence of emotional states, second video data representative of a second video depicting a second animation associated with the second sequence of emotional states; receiving input data representative of a selection that the first video better represents the second speech as compared to the second video; and generating the indication based at least on the selection, because the device of Liu might give an incorrect impression when the recognition result does not correspond to the emotion intended by the user who is talking, and implementing this teaching from Isobe into the device of Liu outputs a video that better represents the emotion intended by that user.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Belanich, Wang, and Liu to include generating, based at least on the second audio data and the second sequence of emotional states, second video data representative of a second video depicting a second animation associated with the second sequence of emotional states; receiving input data representative of a selection that the first video better represents the second speech as compared to the second video; and generating the indication based at least on the selection, because Isobe suggests that merely reading the emotion of a message automatically by emotion recognition software might give an incorrect impression to a receiver user when the recognition result does not correspond to the emotion intended by the caller user [0012], and the teaching from Isobe solves this problem by selecting a video that better represents the emotion intended by the caller user [0066].
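For clarity only, generating the claimed indication from a selection between the two candidate videos, as mapped to Isobe, may be sketched as follows; the identifiers are hypothetical.

def indication_from_selection(selected_video_id, first_video_id, second_video_id):
    # Convert a reviewer's selection of the better-matching animation into the
    # indication used by the training loss: True when the first sequence of
    # emotional states better represents the speech, False otherwise.
    if selected_video_id == first_video_id:
        return True
    if selected_video_id == second_video_id:
        return False
    raise ValueError("selection must identify one of the two candidate videos")

# Hypothetical identifiers for the two candidate animations.
indication = indication_from_selection("video_A", "video_A", "video_B")
print(indication)  # True: the first animation was selected as better representing the speech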
Claims 4 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Belanich (US 20250061917A1) and Wang (US 20240212705A1) in view of Zhang (US 20220036916A1).
As per Claim 4, Belanich and Wang are relied upon for the teachings as discussed above relative to Claim 1.
However, Belanich and Wang do not teach wherein the one or more machine learning models are further trained, at least, by: determining, using one or more second machine learning models and based at least on the second audio data, one or more third probabilities associated with the first sequence of emotional states representing the second speech, wherein the determining the one or more losses is further based at least on the one or more third probabilities. Zhang, however, teaches wherein the one or more machine learning models are further trained, at least, by: determining, using the one or more machine learning models and based at least on the audio data, one or more sequences of emotional states representing the speech (for each speech frame, the first neural network model may output an emotional state information prediction result, [0067]); determining, using one or more second machine learning models and based at least on the second audio data, one or more third probabilities associated with the first sequence of emotional states representing the second speech (the statistical result corresponding to the current utterance and the n-1 statistical results may be input to the second neural network model, and an output of the second neural network model is the emotional state information corresponding to the current utterance, [0129]), wherein the determining the one or more losses is further based at least on the one or more third probabilities (for the second neural network, the loss function is MSE, [0068]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Belanich and Wang so that the one or more machine learning models are further trained, at least, by: determining, using one or more second machine learning models and based at least on the second audio data, one or more third probabilities associated with the one or more first sequence of emotional states representing the second speech, wherein the determining the one or more losses is further based at least on the one or more third probabilities because Zhang suggests that this achieves a more accurate speech emotion recognition effect [0006].
As per Claim 5, Belanich and Wang do not teach wherein the one or more machine learning models are further trained, at least, by: determining, using the one or more second machine learning models and based at least on the second audio data, one or more fourth probabilities associated with the second sequence of emotional states representing the second speech, wherein the determining the one or more losses is further based at least on the one or more fourth probabilities. However, Zhang teaches wherein the one or more machine learning models are further trained, at least, by: determining, using the one or more second machine learning models and based at least on the second audio data, one or more fourth probabilities associated with the second sequence of emotional states representing the second speech [0129], wherein the determining the one or more losses is further based at least on the one or more fourth probabilities [0068]. This would be obvious for the reasons given in the rejection for Claim 4.
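For clarity only, the two-model arrangement described in Zhang ([0067], [0068], [0129]) may be sketched as follows. The feature sizes, pooling, and weights are hypothetical stand-ins for Zhang's first (frame-level) and second (utterance-level) neural network models.

import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 3))   # hypothetical weights of the first (frame-level) model
W2 = rng.normal(size=(3, 3))   # hypothetical weights of the second (utterance-level) model

def first_model(frame_features):
    # Per-frame emotional state information prediction results ([0067]).
    logits = frame_features @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def second_model(utterance_stats):
    # Utterance-level prediction from pooled per-frame statistics ([0129]).
    return utterance_stats @ W2

frames = rng.normal(size=(50, 8))    # hypothetical frame features for one utterance
frame_preds = first_model(frames)    # per-frame emotion predictions
stats = frame_preds.mean(axis=0)     # statistical result for the current utterance

pred = second_model(stats)                        # third/fourth probabilities for the utterance
target = np.array([0.2, 0.7, 0.1])                # hypothetical ground-truth values
mse_loss = float(np.mean((pred - target) ** 2))   # MSE loss for the second network ([0068])
print(stats.round(3), pred.round(3), round(mse_loss, 3))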
Allowable Subject Matter
Claims 9-17 are allowed.
Claim 22 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
The prior art, taken singly or in combination, does not teach or suggest the combination of all the limitations of Claim 9, and in particular, does not teach determining, using one or more machine learning models and based at least on audio data representative of speech, a first distribution of values associated with emotional states; determining one or more losses based at least on the first distribution of values and a second distribution of values associated with the emotional states for the speech, the second distribution of values comprising a target distribution derived from ground truth data, the ground truth data indicating a preference for a first sequence of emotional states over a second sequence of emotional states as representing the speech; and updating, based at least on the one or more losses, one or more parameters of the one or more machine learning models. Claims 10-17 depend from Claim 9 and therefore also contain allowable subject matter.
The prior art, taken singly or in combination, does not teach or suggest the combination of all the limitations of Claim 22 and base Claim 1, and in particular, does not teach wherein the one or more machine learning models are further trained, at least, by receiving training data, the training data including at least: the first emotions data representing the first sequence of emotional states for the second speech; the second emotions data representing the second sequence of emotional states for the second speech; and ground truth data representing the indication whether the first sequence of emotional states better represents the second speech as compared to the second sequence of emotional states.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONI HSU whose telephone number is (571)272-7785. The examiner can normally be reached M-F 10am-6:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung, can be reached at (571)272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
JH
/JONI HSU/Primary Examiner, Art Unit 2611