Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 11-15, 17, and 19-20 are pending. Claim 11 is independent.
This Application was published as US 20240021193.
Apparent priority is 4 December 2020.
The instant Application is directed to a method of training a neural network to produce conversational responses with different emotions, which can be selected by a user.
Response to Arguments
Applicant’s arguments with respect to claim(s) 11 have been considered but are not persuasive. Applicant argues that none of the references discloses ranking the responses based on an emotional category. However, Wu discloses that a relevancy score is assigned to each response and that the score is based on the emotion labels of the sentences:
“[0153] …FIG. 4D illustrates an example of a method for performing operation 410. Operation 410 may include operations 444 and 446 as illustrated in FIG. 4D. At operation 444, a relevancy score is assigned to each response in the response database based on the current query and the labeled sentences. Responses in the response database with high semantic similarity to the current user query and/or to the labeled context sentences are assigned higher scores than responses in the response database that have low semantic similarity to the current user query and/or to the labeled context sentences. As such, the emotion labels of the labeled sentences are compared to the stored emotion labels for the stored responses on the response database at operation 410 to determine the stored responses with the most similarity to the current user query and labeled sentences.”
Assigning a relevancy score (ranking) to the responses based on the current query (conversational input) and emotion labels (emotional category) reads on the amended claims. Therefore, the rejection is maintained.
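For illustration only, the reading applied above (a score that depends on both the query and the emotion labels, followed by ordering on that score) can be sketched as follows. The sketch is not taken from Wu: it assumes a simple token-overlap similarity in place of Wu's learned semantic similarity, and the names `StoredResponse`, `relevancy_score`, `rank_responses`, and the `emotion_weight` parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StoredResponse:
    text: str
    emotion: str  # stored emotion label, e.g. "positive" (hypothetical schema)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a learned similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def relevancy_score(query: str, context_emotion: str,
                    response: StoredResponse,
                    emotion_weight: float = 0.5) -> float:
    """Blend query similarity with agreement between emotion labels."""
    semantic = jaccard(query, response.text)
    emotion_match = 1.0 if response.emotion == context_emotion else 0.0
    return (1.0 - emotion_weight) * semantic + emotion_weight * emotion_match


def rank_responses(query: str, context_emotion: str,
                   responses: List[StoredResponse]) -> List[StoredResponse]:
    """Order stored responses from highest to lowest relevancy score."""
    return sorted(
        responses,
        key=lambda r: relevancy_score(query, context_emotion, r),
        reverse=True,
    )
```

Under these assumptions, a stored response whose emotion label matches the labeled context is ranked above an otherwise equally similar response whose label does not match.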
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 11-15, 17, and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wu (US 20180174020 A1) in view of Fong et al. ("An Event Driven Neural Network System for Evaluating Public Moods from Online Users’ Comments"), Rastrow et al. (US 10339925 B1), and Lan et al. (“SPEAKER DIARIZATION WITH UNSUPERVISED TRAINING FRAMEWORK”).
Regarding claim 11, Wu discloses: 11. A method of training a neural network ("[0059] FIG. 9 is a schematic diagram illustrating an example of a neural network structure…" )
to generate conversational replies, (Fig. 3B shows example conversational replies. )
the method comprising: providing a first dataset of stored phrases linked to form a plurality of conversational sequences; ("[0132] A recurrent neural network (RNN) with gated recurrent units (GRUs) to learn the similarity among a query and good/bad responses as illustrated in FIG. 11A. In FIG. 11A, one training sample includes three elements: query; good response; and bad response. For example, a query of, “I love you”, a good response of “that makes me feel so happy”, and a bad response of, “deep learning is interesting,” is listed in FIG. 11A. ..." )
training the neural network to generate responses to input phrases using the first dataset; ("[0133] With large-margin training, the embedding matrices from words to vectors, and the transform matrices from embedding vectors to hidden layer lower-dimension vectors can be obtained. When these matrices are obtained, the testing process can be then performed. Given a query and a corresponding chat bot response, the training can go through the network to compute the similarity of the query and the response to obtain a similarity score. ..." )
using the trained neural network to generate a list of conversational replies in response to conversational inputs, ("[0128]... In some aspects, the response prediction system 116 select a predetermined number of responses from the response database 120 based on the highest relevancy scores and then randomly selects one or more result responses from the predetermined number of responses. In other aspects, the response prediction system 116 select one or more result responses 132 from the response database 120 based on the highest relevancy scores." )
training a first AI to classify the first dataset of stored phrases into emotional categories, ("[0118] The sentiment system 114 analyzes the one or more context sentences 128 from the context summary system 112 to determine an emotion for each context sentence 128. … [0126] … A multiple class support vector machine (SVM) model is trained utilizing these features to determine the sentiment of each context sentence …" – SVM is a first AI.)
classifying the first dataset of stored phrases into emotionally categorised datasets using the first AI, ("[0118] ...The sentiment system 114 receives the text input of the context summary system 112 and outputs an emotion label 129 for each context sentence 128 that is representative of the emotion of the user 102 for that sentence…." )
training a plurality of neural networks to generate a plurality of responses to input phrases using the emotionally categorised datasets, (“[0132] A recurrent neural network (RNN) with gated recurrent units (GRUs) to learn the similarity among a query and good/bad responses as illustrated in FIG. 11A. In FIG. 11A, one training sample includes three elements: query; good response; and bad response. For example, a query of, “I love you”, a good response of “that makes me feel so happy”, and a bad response of, “deep learning is interesting,” is listed in FIG. 11A. ...” ; Wu discloses generating a plurality of responses. Plurality of neural networks is not explicitly disclosed.)
selecting a recipient of a conversation; configuring the generation of the plurality of emotionally categorized conversational replies to be personalized to the recipient; (not explicitly disclosed)
using the plurality of trained neural networks to generate a plurality of emotionally categorised conversational replies; (Fig. 2 shows a plurality of responses are predicted)
ranking the list of conversational replies according to a probability score of the response matching the conversational input and the emotional category, generated by each neural network; and (“[0153] …FIG. 4D illustrates an example of a method for performing operation 410. Operation 410 may include operations 444 and 446 as illustrated in FIG. 4D. At operation 444, a relevancy score is assigned to each response in the response database based on the current query and the labeled sentences. Responses in the response database with high semantic similarity to the current user query and/or to the labeled context sentences are assigned higher scores than responses in the response database that have low semantic similarity to the current user query and/or to the labeled context sentences. As such, the emotion labels of the labeled sentences are compared to the stored emotion labels for the stored responses on the response database at operation 410 to determine the stored responses with the most similarity to the current user query and labeled sentences.” )
selecting the replies where the probability score exceeds a threshold as a selection list, ("[0154] Next, at operation 446 one or more result responses are selected based on the relevancy scores. In some aspects, the one or more result responses are the responses in the response database with the highest relevancy scores at operation 446. … In other aspects, the predetermined number of responses or number of selected result responses may be any response that meets a predetermine relevancy score threshold that configured by the creator and/or selected by the user." )
wherein training the neural network further comprises: segmenting the conversational sequences in the first dataset based on a timing pattern between speakers; and identifying speaker transitions by using a speech diarization algorithm without training the neural network on an individual voice profile of each speaker. (not explicitly disclosed)
Wu does not disclose training a plurality of networks to generate a plurality of emotionally categorized responses, a response that is customized to the recipient, or segmenting conversational sequences based on timing between speakers without training on voice profiles.
Fong discloses: training a plurality of neural networks ("Our model features a Mood engine made up of a number of artificial neutral networks, one for each type of mood, and different sets of ANNs for different cultures." Pg. 242, last para)
Wu and Fong are considered analogous art to the claimed invention because they disclose detecting emotions in conversations. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Wu to use a different neural network for each emotion as taught by Fong. Doing so would have been beneficial so that additional emotions could be used. (Wu [0118] teaches only positive, negative, or neutral emotion. Fong (e.g. Pg. 241, para 5) teaches additional emotions such as happiness and surprise. The structure taught by Fong could be expanded to any number of emotions.)
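By way of illustration only, the proposed combination (one network per emotion, with each scored reply ranked and filtered against a threshold) might be organized as in the following sketch. It is not taken from Wu or Fong: the names `Candidate`, `build_selection_list`, and `EXAMPLE_MODELS` are hypothetical, and each "model" is a stub that returns a fixed reply and probability rather than a trained network.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Candidate(NamedTuple):
    emotion: str
    reply: str
    probability: float


# Hypothetical interface: a per-emotion model maps an input phrase to
# a (reply, probability) pair.
EmotionModel = Callable[[str], Tuple[str, float]]

# Stub models standing in for separately trained networks (assumption).
EXAMPLE_MODELS: Dict[str, EmotionModel] = {
    "happy": lambda text: ("That is wonderful news!", 0.8),
    "sad": lambda text: ("I am sorry to hear that.", 0.3),
    "surprised": lambda text: ("Really? I did not expect that!", 0.6),
}


def build_selection_list(models: Dict[str, EmotionModel],
                         conversational_input: str,
                         threshold: float) -> List[Candidate]:
    """Query every per-emotion model, rank by score, keep scores above threshold."""
    candidates = [
        Candidate(emotion, *model(conversational_input))
        for emotion, model in models.items()
    ]
    ranked = sorted(candidates, key=lambda c: c.probability, reverse=True)
    return [c for c in ranked if c.probability > threshold]
```

With the stub values above and a threshold of 0.5, the selection list would contain the "happy" and "surprised" replies, in that order, and omit the "sad" reply.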
Fong does not disclose a response that is customized to the recipient, or segmenting conversational sequences based on timing between speakers without training on voice profiles.
Rastrow discloses: selecting a recipient of a conversation; (“The server 120 may determine (709) a recipient of the text message, for example by analyzing a recipient field of the text message.” Col 18; 36-38)
configuring the generation of the plurality of emotionally categorized conversational replies to be personalized to the recipient; (“The system may employ a first machine learning model to determine (154) the text of an automated response. The first model may consider a variety of input data. The first model may be specific to the recipient of the message or may be used for multiple users. The first model may be trained using a multitude of training examples where each example includes values for the different data of the example as well as a ground truth as to what automated response message text is appropriate for that particular example. The first model may be trained using a large text corpus taken from responses to communications (whether automated or not) that may be converted into training examples (and encoded) to train the model how to respond to incoming messages under different circumstances.” Col 18; 45-58)
Wu, Fong, and Rastrow are considered analogous art to the claimed invention because they are in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Wu in view of Fong to use a model specific to the recipient as taught by Rastrow. Doing so would have been beneficial so that the response is personalized to the recipient.
Rastrow does not disclose segmenting conversational sequences based on timing between speakers without training on voice profiles.
Lan discloses: wherein training the neural network further comprises: segmenting the conversational sequences in the first dataset based on a timing pattern between speakers; (See pg. 5560, Section 2.1. “BIC segmentation and clustering”)
and identifying speaker transitions by using a speech diarization algorithm without training the neural network on an individual voice profile of each speaker. (“In this paper we address the problem of cross-show diarization, using a priori models trained on unlabeled data.” Pg. 5560, col 2 para 2 – Lan discloses a diarization method using clustering rather than data labeled with a speaker.)
Wu, Fong, Rastrow, and Lan are considered analogous art to the claimed invention because they are in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Wu in view of Fong and Rastrow to use the diarization method taught by Lan to segment training data. Doing so would have been beneficial so that labeled training data is not required.
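For illustration only, segmenting a transcript on a timing pattern, without reference to any speaker's voice profile, can be sketched as below. This is a simple pause-based heuristic and should not be read as the BIC segmentation and clustering of Lan Section 2.1; the `Word` record, the `segment_by_pause` function, and the 0.6 second default gap are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def segment_by_pause(words: List[Word], min_gap: float = 0.6) -> List[List[Word]]:
    """Split a time-stamped word sequence wherever the silence between
    consecutive words is at least `min_gap` seconds."""
    segments: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # A long pause before this word closes the current segment.
        if current and word.start - current[-1].end >= min_gap:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments
```

Each resulting segment is a candidate phrase or turn. No per-speaker model is trained at any point; only the timestamps are used.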
Regarding claim 12, Wu discloses: 12. The method of claim 11, wherein the conversational inputs are received from a speech to text device. ("[0081] Sounds need to be recognized and decoded as texts. A speech recognition API may be necessary for the speech-to-text conversion task and is part of the LU system 110. …" )
Regarding claim 13, Wu discloses: 13. The method of claim 11, further comprising providing a second dataset of stored phrases containing phrases previously created by a user, and further training the neural network based on the second dataset. ("[0137] In other aspects, the feedback system 119 collects user answers from the user in reply to a previously provided result responses. The feedback system 119 analyzes the answer to determine user feedback for the result response. The feedback system 119 utilizes the determined user feedback as positive or negative training data for the context summary system 112, the sentiment system 114, and/or the response prediction system 116. ..." )
Regarding claim 14, Wu discloses: 14. The method of claim 11, wherein the neural network forms one or more encoder-decoder networks. ("[0059] FIG. 9 is a schematic diagram illustrating an example of a neural network structure for the neural network language model decoder (NNLM) with additional encoder (enc) elements for section (a) and a network diagram for the attention-based encoder enc3 for section (b), in accordance with aspects of the disclosure." )
Regarding claim 15, Wu discloses: 15. The method of claim 11, wherein providing a first dataset of stored phrases includes: using a speech to text engine configured to receive speech from recordings of conversations, ("[0081] Sounds need to be recognized and decoded as texts. A speech recognition API may be necessary for the speech-to-text conversion task and is part of the LU system 110. …" )
segmenting the conversations based on timing into phrases; (Not explicitly disclosed)
and generating a dataset of stored phrases comprising linking metadata associating phrases with adjacent phrases as conversational sequences. ("[0132] A recurrent neural network (RNN) with gated recurrent units (GRUs) to learn the similarity among a query and good/bad responses as illustrated in FIG. 11A. In FIG. 11A, one training sample includes three elements: query; good response; and bad response. For example, a query of, “I love you”, a good response of “that makes me feel so happy”, and a bad response of, “deep learning is interesting,” is listed in FIG. 11A. ..." )
Wu does not disclose segmenting conversations based on timing into phrases. Neither Fong nor Rastrow does.
Lan discloses: segmenting the conversational sequences in the first dataset based on a timing pattern between speakers; (See pg. 5560, Section 2.1. “BIC segmentation and clustering”)
See claim 11 for motivation statement.
Regarding claim 17, Wu in view of Fong, Rastrow, and Lan further discloses: 17. The method of claim 11, further comprising training a first neural network to generate a plurality of responses to input phrases using the complete dataset of stored phrases, wherein using the neural network to generate a list of conversational replies comprises using the first trained neural network to generate a plurality of conversational replies, and combining the plurality of conversational replies with the plurality of emotionally categorised conversational replies to generate a combined list of conversational replies. (Wu teaches that the complete dataset is used to determine positive, negative, or neutral emotion [0118]. This would be the first neural network. Fong teaches a plurality of emotionally categorized neural networks. See claim 11 for mapping. It would have been obvious to one of ordinary skill in the art to keep the original network trained on all data and to train separate networks for each emotion. It would have been beneficial to combine results from all networks to provide a greater variety of responses.)
See claim 11 for motivation statement.
Regarding claim 19, Wu discloses: 19. The method of claim 18, further comprising presenting the selection list to a user to select a conversational reply and outputting the selected reply. ("[0155] In response to the one or more result responses being selected at operation 410, operation 411 is performed. At operation 411 the one or more result response are provided to the user in reply to the query. … For example, the client computing device may provide the one or more result responses with an artificial voice speaking through speakers on the client computing device." )
Wu does not disclose the user selecting the reply to be output. Neither does Fong.
Rastrow discloses: presenting the selection list to a user to select a conversational reply and outputting the selected reply. (See Fig. 11 below. Rastrow discloses that the user can select which of the possible responses should be used. See also col 21, para 1, which discloses N-best list and TTS response.)
[Image: Rastrow, Fig. 11 (media_image1.png)]
Wu, Fong, Rastrow, and Lan are considered analogous art to the claimed invention because they are in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination to allow the user to select a response to another user as taught by Rastrow. Doing so would have been beneficial so that a user could quickly reply with the most correct response. (Rastrow col 6;24-26.)
Claim(s) 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wu in view of Fong, Rastrow, and Lan, in further view of Deng et al. (US 20210287657 A1).
Regarding claim 20, Wu and Fong disclose: 20. The method of claim 19, further comprising training a first AI to classify the first dataset of stored phrases into emotional categories, classifying the first dataset of stored phrases into emotionally categorised datasets using the first AI, training a plurality of neural networks to generate a plurality of responses to input phrases using the emotionally categorised datasets, and using the plurality of trained neural networks to generate a plurality of emotionally categorised conversational replies, (See claim 16 rejection)
wherein outputting the reply includes using a text to speech engine to convert the reply into speech, and modifying the voice profile of the text to speech engine based on the emotional category of the selected conversational reply. ("[0081] Sounds need to be recognized and decoded as texts. A speech recognition API may be necessary for the speech-to-text conversion task and is part of the LU system 110. Furthermore, the LU system 110 may need to convert a generated response 132 from text to voice to provide a voice response to the user 102." )
See claim 16 for motivation statement for Fong.
Wu and Fong do not disclose modifying the voice profile of the TTS engine.
Rastrow discloses: wherein outputting the reply includes using a text to speech engine to convert the reply into speech, and modifying the voice profile of the text to speech engine based on the emotional category of the selected conversational reply. ("For example a user may prefer a speech output voice to be a specific gender, have a specific accent, speak at a specific speed, have a distinct emotive quality (e.g., a happy voice), or other customizable characteristic(s) (such as speaking an interjection in an enthusiastic manner) as explained in other sections herein." Col 17 para 1)
Wu and Rastrow are considered analogous art to the claimed invention because they disclose methods of generating conversational responses. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination to modify the output voice as taught by Rastrow. Doing so would have been beneficial to allow for user preferences. (Rastrow col 17;13-15.)
Rastrow does not disclose that the emotional category is based on the selected reply. Neither does Lan.
Deng discloses: wherein outputting the reply includes using a text to speech engine to convert the reply into speech, and modifying the voice profile of the text to speech engine based on the emotional category of the selected conversational reply. ("[0005] Embodiments of this application provide a speech synthesis method and a speech synthesis apparatus, to synthesize emotional speeches corresponding to emotional types of different emotional intensities, making the emotional speeches more realistic and rich in emotional expressions." )
Wu and Deng are considered analogous art to the claimed invention because they disclose methods for replying with emotion. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with different voice emotion depending on the response. Doing so would have been beneficial to make the speech more realistic and richer. (Deng [0005])
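For illustration only, selecting a voice profile from the emotional category of the chosen reply could be sketched as a lookup with a neutral fallback. Nothing here is taken from Rastrow or Deng: the `VoiceProfile` fields, the specific rate and pitch values, and the `profile_for` function are hypothetical, and no real text to speech engine is invoked.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class VoiceProfile:
    rate: float             # speaking-rate multiplier (1.0 = unchanged)
    pitch_semitones: float  # pitch shift relative to the base voice


# Assumed example values; a real system would tune or learn these.
NEUTRAL = VoiceProfile(rate=1.0, pitch_semitones=0.0)
PROFILES: Dict[str, VoiceProfile] = {
    "happy": VoiceProfile(rate=1.1, pitch_semitones=2.0),
    "sad": VoiceProfile(rate=0.9, pitch_semitones=-2.0),
}


def profile_for(emotion: str) -> VoiceProfile:
    """Return the voice profile for the selected reply's emotional
    category, falling back to the neutral profile when none is defined."""
    return PROFILES.get(emotion, NEUTRAL)
```

The returned profile would then be passed to whatever speech synthesis step outputs the selected reply.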
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JON C MEIS whose telephone number is (703)756-1566. The examiner can normally be reached Monday - Thursday, 8:30 am - 5:30 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan, can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JON CHRISTOPHER MEIS/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654