DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.
Claims 1-20 are pending, of which claims 1, 8, and 15 are independent.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-3, 5-10, 12-16, and 18-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Independent claims 1, 8 and 15 recite “receiving an utterance from an audio input device; determining a context associated with the utterance; providing the utterance as an input to a joint model for automatic speech recognition (ASR) and spoken language understanding (SLU), wherein the joint model operates in a single mode to perform both ASR and SLU or a dual mode to perform one of ASR or SLU depending on the context; and using an output of the joint model to perform an action requested in the utterance.” As drafted, these limitations cover an abstract idea of data analysis/retrieval and mental steps, because they require only data analysis/retrieval and a mental process. For instance, one can mentally receive (listen to) an utterance and determine a context associated with the utterance; this is a mental step. Automatically recognizing speech can likewise be a mental step, since one can mentally recognize another person’s speech, and spoken language understanding is also something one can do mentally. Thus, the joint operation of automatic speech recognition and spoken language understanding can be performed mentally. The claimed invention is therefore directed to an abstract idea and a mental process without significantly more, and claims 1, 8 and 15 are rejected under 35 U.S.C. 101.
Similarly, dependent claims 2-3, 5-7, 9-10, 12-14, 16, and 18-20 recite claim language similar to that of claims 1, 8 and 15. Claims 2, 9 and 16 recite “the joint model comprises a speech encoder, a shared encoder, and a shared decoder,” which requires only the mental steps of encoding (which can be considered, for instance, understanding the speech) and decoding (which might be done mentally to learn the context and make use of the context information for speech recognition). Thus, claims 2, 9 and 16 are directed to an abstract idea.
Claims 3 and 10 recite “the joint model further comprises a layer normalization between the speech encoder and the shared encoder,” which is just a mathematical step. For instance, one can apply a known mathematical formula to perform the normalization. Thus, claims 3 and 10 are directed to an abstract idea.
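For illustration only, layer normalization of this kind is the well-known operation y = ((x − μ)/√(σ² + ε))·γ + β applied over a feature dimension. A minimal sketch in Python follows; the array shapes and names are hypothetical and are not drawn from the claims or from the prior art:

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        # Normalize each feature vector to zero mean and unit variance,
        # then apply a learned scale (gamma) and shift (beta).
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + eps) * gamma + beta

    # Hypothetical placement between a speech encoder and a shared encoder:
    # speech_features has shape (time_steps, hidden_dim).
    hidden_dim = 256
    speech_features = np.random.randn(100, hidden_dim)
    normalized = layer_norm(speech_features,
                            gamma=np.ones(hidden_dim),
                            beta=np.zeros(hidden_dim))

This is the kind of fixed mathematical formula referenced in the analysis above.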
Claims 5, 12 and 18 recite “selecting to use the single mode or the dual mode depending on the context,” which requires only a mental step of selecting whether to both recognize and understand the spoken language or to perform just one of those tasks, a selection that can be made mentally. Thus, claims 5, 12 and 18 are directed to an abstract idea.
Claims 6, 13 and 19 recite “in the single mode, the output of the joint model is a tokenized transcript of the utterance concatenated with intent and slot keys and values,” which requires only a simple procedure that can easily be performed on paper, in a table, or in a spreadsheet. For instance, the utterance heard can be: “I want to travel from New York to Chicago on the 12th of December.” A person can mentally generate the entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport> 12th December <travel-date>”. Thus, claims 6, 13 and 19 are directed to an abstract idea.
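Purely as an illustration of this output format, the single-mode output (a tokenized transcript concatenated with intent and slot keys and values) can be mimicked in a few lines of Python; the transcript, intent label, and slot tags below are hypothetical examples patterned on the travel-reservation example above:

    transcript = "i want to travel from new york to chicago on the 12th of december"
    intent = "<travel-reservation>"
    slots = {
        "<origin-airport>": "new york",
        "<destination-airport>": "chicago",
        "<travel-date>": "12th december",
    }

    # Single mode: the tokenized transcript is concatenated with the
    # intent and the slot keys and values.
    single_mode_output = " ".join(
        [transcript, intent] + [f"{value} {key}" for key, value in slots.items()]
    )
    print(single_mode_output)
    # i want to travel from new york to chicago on the 12th of december
    # <travel-reservation> new york <origin-airport> chicago
    # <destination-airport> 12th december <travel-date>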
Claims 7, 14 and 20 recite “in the dual mode, the input includes an indicator token identifying whether to perform ASR or SLU, and wherein the output of the joint model is a tokenized transcript of the utterance for ASR and intent slot keys and values for SLU,” which requires only a mental step of choosing that can easily be performed mentally, or can be performed using a conventional/generic (general-purpose) computer (Spec. para 0015) or a simple calculator. Thus, claims 7, 14 and 20 are directed to an abstract idea.
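Again for illustration only, the dual-mode behavior reduces to a dispatch on an indicator token supplied with the input. In the following Python sketch every function name and token is hypothetical and is not taken from the claims or the prior art:

    def transcribe(utterance):
        # Placeholder for ASR: a real system would decode the utterance
        # into a tokenized transcript.
        return "i want to travel from new york to chicago on the 12th"

    def extract_intent_slots(utterance):
        # Placeholder for SLU: intent and slot keys and values.
        return {"intent": "<travel-reservation>",
                "<origin-airport>": "new york",
                "<destination-airport>": "chicago"}

    def run_joint_model(indicator_token, utterance):
        # Dual mode: the indicator token in the input selects whether
        # the joint model performs ASR or SLU on this pass.
        if indicator_token == "<asr>":
            return transcribe(utterance)
        if indicator_token == "<slu>":
            return extract_intent_slots(utterance)
        raise ValueError(f"unknown indicator token: {indicator_token}")

    print(run_joint_model("<asr>", "raw-audio-features"))
    print(run_joint_model("<slu>", "raw-audio-features"))

The selection itself is the simple choice identified in the analysis above.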
Thus, claims 1-3, 5-10, 12-16, and 18-20 as drafted cover a mental process and an abstract idea of data gathering/retrieval and analysis/processing. They are mental processes directed to an abstract idea of implementing mathematical formulae for data processing and data analysis using a conventional/generic (general-purpose) computer, and all of the claims are therefore directed to an abstract idea.
This judicial exception is not integrated into a practical application. In particular, independent claims 1, 8 and 15 recite the additional elements of a “processor” and a “memory.” These additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea (Spec. para 0010). The claims are directed to an abstract idea.
Thus, taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Viewing the limitations as an ordered combination adds nothing that is not already present when the elements are considered individually. There is no indication that the combination of elements improves the functioning of a computer or any other technology; collectively, the elements merely provide a conventional, general-purpose computer implementation. Claims 1-3, 5-10, 12-16, and 18-20 are therefore not drawn to patent-eligible subject matter, as they are directed to an abstract idea without significantly more, and are rejected under 35 U.S.C. 101.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a computer amounts to no more than a generic computer. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept (Spec. para 0015). Further, the additional limitations in the claims noted above are directed to insignificant extra-solution activity. The claims are not patent eligible.
Dependent claims 2-3, 5-7, 9-10, 12-14, 16, and 18-20 are likewise directed to an abstract idea and do not include additional elements sufficient to amount to significantly more than the judicial exception, because the additional elements, considered both individually and as an ordered combination, do not amount to significantly more than the abstract idea. Therefore, claims 1-3, 5-10, 12-16, and 18-20 are not directed to patent-eligible subject matter.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-2, 5-9, 12-16 and 18-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Thomas et al., U.S. Patent Application Publication No. US 2023/0056680 A1 (“Thomas”).
Regarding Claim 1, Thomas discloses a method (Thomas, para 0056, the system and/or method described herein) comprising:
receiving an utterance from an audio input device (Thomas, para 0005, receive audio signals representing a current utterance in a conversation and a dialog history including at least information associated with past utterances corresponding to the current utterance in the conversation);
determining a context associated with the utterance (Thomas, 0002 - 0017, spoken task-oriented conversations are often context dependent as users and agents converse in multiturn conversations to achieve the various user goals. Dialog history hence contains useful information that can be effectively used to improve the processing of each conversational turn and resolve such ambiguities in SLU systems… The summary of the disclosure is given to aid understanding of a computer system and method of integrating dialog history into a spoken language understanding system…The dialog history can include audio signals, and at least one processor can be configured to encode the dialog history into the embedding directly from the audio signals…At least some of the dialog history can include machine inferred information associated with the past utterances…);
providing the utterance as an input to a joint model for automatic speech recognition (ASR) and spoken language understanding (SLU) (Thomas, para 0054, The jointly trained ASR+SLU can be run with different kinds of dialog history embeddings, for example, described above. For example, these 128 dimensional BERT embeddings can be used as input features by appending them to the 240 dimensional acoustic features used to train a baseline system), wherein the joint model operates in a single mode to perform both ASR and SLU or a dual mode to perform one of ASR or SLU depending on the context (Thomas, para 0054-0057, The jointly trained ASR+SLU can be run with different kinds of dialog history embeddings, for example, described above… The trained model can be run with such additional information as input features, for example, different kinds of embeddings. Experiments demonstrate the benefit of integrating dialog history for the task of dialog act prediction. ; OR, Thomas, para 0032-0035, In an embodiment, the system and/or method disclosed herein allows for the integration of entire dialog history, not just a previous system prompt. Experiments indicate that performance improves with longer history context. It can handle both dialog human-human conversations and computer-human interactions, given the flexibility of the length of dialog history. In an embodiment, an existing SLU model can be modified to accommodate dialog history via a customization step. In an embodiment, the embedding extractor 104 for dialog history can be a BERT model that has been trained on large amounts of data. The BERT model can also be adapted on the current data and task. The approach (e.g., a system and method) disclosed herein improves the performance of speech-based SLU models, for example, in performing tasks such as dialog action prediction and intent recognition. In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. Another SLU model, such as attention mechanism neural network can be implemented. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data… In an embodiment, the ASR in ASR+SLU model 106 can transform speech signals or audio signals (e.g., 102) to word for word transcript, e.g., linguistic text. The SLU in ASR+SLU model 106 can assign meaning to the transcript, e.g., dialog tag and/or intent. The output of the ASR+SLU model 106 can be one or more of dialog act or tag, dialog intent, and text transcript of speech. Such output can be saved as part of dialog history, in an embodiment, for use in the next utterance turn. The ASR+ SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed. For example, in a travel reservation SLU embodiment, for a speech utterance corresponding to a user prompt, “I want to travel from New York to Chicago on the 12th.”, the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label <travel-reservation>. The transcripts can be further processed to extract the origin and destination airports. In another embodiment, the ASR+SLU model or system 106 can be trained to produce SLU labels in the output itself. 
In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”; Thomas, para 0002-0003, Conventional spoken language understanding (SLU) systems can be built by integrating a text-based natural language understanding system with an automatic speech recognition (ASR) system….these traditional systems have been replaced by end-to-end (E2E) systems that directly process speech to produce spoken language understanding (SLU) entity or intent label targets without any intermediate ASR processing. When processing human-human or human-computer interactions, these E2E SLU systems process each turn of a conversation independently. However, spoken task-oriented conversations are often context dependent as users and agents converse in multiturn conversations to achieve the various user goals. These turns are also related, as the user or agent might refer to information introduced in previous turns. Without proper context these pieces of information introduce ambiguity. For example, “one” could refer to a scheduled appointment date or a part of a phone number or zip code depending on the context. Dialog history hence contains useful information that can be effectively used to improve the processing of each conversational turn and resolve such ambiguities in SLU systems; [i.e., According to Thomas, Figure 1, Element (System) 106 (i.e., the joint ASR+SLU Model) operates in a single mode to perform both ASR and SLU to produce an output of ASR (e.g., speech to text), Dialog Act, and Dialog intent, based on Current utterance (Speech) (with Speech Features Model) and Dialog History (that helps to determine CONTEXT with an Encoder); The ASR+SLU Model has different embodiments including a dual mode to perform one of ASR or SLU: In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data]); and
using an output of the joint model to perform an action requested in the utterance (Thomas, para 0035, the ASR+SLU model or system 106 can be trained to produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”).
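To make the mapped limitation concrete: the slot-tagged output quoted from Thomas (para 0035) can be parsed into fields that drive a downstream action. The following Python sketch assumes the slot-tag format shown in Thomas; the parsing code and the book_trip function are hypothetical illustrations and are not disclosed by Thomas:

    import re

    def parse_slots(tagged_output):
        # Each slot value is followed by its tag,
        # e.g. "New York <origin-airport>".
        pattern = re.compile(r"(.+?)\s*<([\w-]+)>")
        return {tag: value.strip() for value, tag in pattern.findall(tagged_output)}

    def book_trip(origin, destination, date):
        # Hypothetical action performed in response to the utterance.
        print(f"Booking trip from {origin} to {destination} on {date}")

    slots = parse_slots(
        "New York <origin-airport> Chicago <destination-airport> 12th <travel-date>"
    )
    book_trip(slots["origin-airport"], slots["destination-airport"], slots["travel-date"])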
Regarding Claim 2, Thomas discloses the method of claim 1, wherein the joint model comprises a speech encoder, a shared encoder, and a shared decoder (Thomas, para 0030-0036, Figure 1, Units 102, 104, 106, The model shown in FIG. 1 can effectively encode full dialog history into a speech based E2E SLU system. A series of utterances in a conversation is shown with a current utterance 110, e.g., a current user response, being input as speech features 104. An encoder 104 encodes a dialog history (preceding utterances or turns in the conversation) 108 into an embedding. In an embodiment, a system and method disclosed herein can use Bidirectional Encoder Representations from Transformers (BERT) model embeddings to encode various elements of dialog history: e.g., the textual content of previous turns, speaker role (whether agent or user) for each turn and previous SLU tags for each utterance in the dialog history 108. Another encoder can be used for generating such embeddings. These embeddings can then be used as features that contain side information on dialog history for an SLU system 106, for example, but not limited to, a recurrent neural network (RNN) Transducer based E2E SLU system. For instance, information associated with the dialog history 108 can be encapsulated as embeddings or vector embeddings, e.g., consolidated in a single vector. In an embodiment, the dialog history 108 need not be text, e.g., the system in an embodiment can directly extract the dialog history embedding from speech without converting it first into text. Briefly BERT (Bidirectional Encoder Representations from Transformers) is a machine learning language model, which can be used for natural language processing (NLP)… In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. Another SLU model, such as attention mechanism neural network can be implemented. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data. A pre-trained model can then be modified to include semantic labels specific to the SLU task by resizing the output layer and the embedding layer of the prediction network to include additional output nodes. In an embodiment, the ASR in ASR+SLU model 106 can transform speech signals or audio signals (e.g., 102) to word for word transcript, e.g., linguistic text. The SLU in ASR+SLU model 106 can assign meaning to the transcript, e.g., dialog tag and/or intent. The output of the ASR+SLU model 106 can be one or more of dialog act or tag, dialog intent, and text transcript of speech. Such output can be saved as part of dialog history, in an embodiment, for use in the next utterance turn. The ASR+SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output… the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label… RNN-T based ASR models are a class of single end-to-end trained, streamable, all-neural models that are adopted for speech recognition… The joint network combines the two embedding outputs to produce a posterior distribution over the output symbols. This architecture can replace a conventional ASR system composed of separate acoustic model, language model, pronunciation lexicon, and decoder components. 
RNN-T models can handle more abstract output symbols such as ones marking speaker turns, and these models can be extended for SLU tasks; [Thomas, Figure 1, consists of a speech features extractor and speech encoder (element 102), the dialog encoder (element 104) as “a shared encoder”, and element 106 (the ASR+SLU model), which can consist of RNN-T based ASR models with decoder components as a “shared decoder”]).
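For orientation only, the examiner’s reading of Thomas, Figure 1, onto the claimed structure can be sketched as a simple data flow; the class and the lambda stand-ins in the following Python sketch are hypothetical and do not appear in Thomas:

    class JointModel:
        # Claimed structure as read onto Thomas, Figure 1: a speech
        # encoder (element 102), a shared encoder (the dialog encoder,
        # element 104), and a shared decoder (the RNN-T decoder
        # components of element 106).
        def __init__(self, speech_encoder, shared_encoder, shared_decoder):
            self.speech_encoder = speech_encoder
            self.shared_encoder = shared_encoder
            self.shared_decoder = shared_decoder

        def forward(self, utterance, dialog_history):
            speech_embedding = self.speech_encoder(utterance)
            context_embedding = self.shared_encoder(dialog_history)
            # The shared decoder consumes both embeddings to produce the
            # transcript and/or the SLU labels.
            return self.shared_decoder(speech_embedding, context_embedding)

    model = JointModel(
        speech_encoder=lambda u: f"speech({u})",
        shared_encoder=lambda h: f"context({h})",
        shared_decoder=lambda s, c: f"decode({s}, {c})",
    )
    print(model.forward("current utterance", ["past turn 1", "past turn 2"]))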
Regarding Claim 5, Thomas discloses the method of claim 1, further comprising selecting to use the single mode or the dual mode depending on the context (Thomas, para 0054-0064, The jointly trained ASR+SLU can be run with different kinds of dialog history embeddings, for example, described above… The trained model can be run with such additional information as input features, for example, different kinds of embeddings. Experiments demonstrate the benefit of integrating dialog history for the task of dialog act prediction. ; OR, Thomas, para 0032-0035, In an embodiment, the system and/or method disclosed herein allows for the integration of entire dialog history, not just a previous system prompt. Experiments indicate that performance improves with longer history context. It can handle both dialog human-human conversations and computer-human interactions, given the flexibility of the length of dialog history. In an embodiment, an existing SLU model can be modified to accommodate dialog history via a customization step. In an embodiment, the embedding extractor 104 for dialog history can be a BERT model that has been trained on large amounts of data. The BERT model can also be adapted on the current data and task. The approach (e.g., a system and method) disclosed herein improves the performance of speech-based SLU models, for example, in performing tasks such as dialog action prediction and intent recognition. In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. Another SLU model, such as attention mechanism neural network can be implemented. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data… In an embodiment, the ASR in ASR+SLU model 106 can transform speech signals or audio signals (e.g., 102) to word for word transcript, e.g., linguistic text. The SLU in ASR+SLU model 106 can assign meaning to the transcript, e.g., dialog tag and/or intent. The output of the ASR+SLU model 106 can be one or more of dialog act or tag, dialog intent, and text transcript of speech. Such output can be saved as part of dialog history, in an embodiment, for use in the next utterance turn. The ASR+ SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed. For example, in a travel reservation SLU embodiment, for a speech utterance corresponding to a user prompt, “I want to travel from New York to Chicago on the 12th.”, the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label <travel-reservation>. The transcripts can be further processed to extract the origin and destination airports. In another embodiment, the ASR+SLU model or system 106 can be trained to produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. 
The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”; Thomas, para 0002-0003, Conventional spoken language understanding (SLU) systems can be built by integrating a text-based natural language understanding system with an automatic speech recognition (ASR) system….these traditional systems have been replaced by end-to-end (E2E) systems that directly process speech to produce spoken language understanding (SLU) entity or intent label targets without any intermediate ASR processing. When processing human-human or human-computer interactions, these E2E SLU systems process each turn of a conversation independently. However, spoken task-oriented conversations are often context dependent as users and agents converse in multiturn conversations to achieve the various user goals. These turns are also related, as the user or agent might refer to information introduced in previous turns. Without proper context these pieces of information introduce ambiguity. For example, “one” could refer to a scheduled appointment date or a part of a phone number or zip code depending on the context. Dialog history hence contains useful information that can be effectively used to improve the processing of each conversational turn and resolve such ambiguities in SLU systems; [i.e., According to Thomas, Figure 1, Element (System) 106 (i.e., the joint ASR+SLU Model) operates in a single mode to perform both ASR and SLU to produce an output of ASR (e.g., speech to text), Dialog Act, and Dialog intent, based on Current utterance (Speech) (with Speech Features Model) and Dialog History (that helps to determine CONTEXT with an Encoder); The ASR+SLU Model has different embodiments including a dual mode to perform one of ASR or SLU: In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data] … Figure 2, Steps 202, 204 and 206, At Step 206, the method … perform a spoken language understanding task based on input features, which include speech features associated with the received audio signals and the embedding; [i.e., the SLU task is performed based on input speech features from the current utterance and the embedding (i.e., context), since the embedding from the Encoder (which can be BERT) of the dialog history produces the Context]).
Regarding Claim 6, Thomas discloses the method of claim 1, wherein, in the single mode, the output of the joint model is a tokenized transcript of the utterance concatenated with intent and slot keys and values (Thomas, para 0034-0035, The ASR+SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed. For example, in a travel reservation SLU embodiment, for a speech utterance corresponding to a user prompt, “I want to travel from New York to Chicago on the 12th.”, the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label <travel-reservation>. The transcripts can be further processed to extract the origin and destination airports. In another embodiment, the ASR+SLU model or system 106 can … produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”).
Regarding Claim 7, Thomas discloses the method of claim 1, wherein, in the dual mode, the input includes an indicator token identifying whether to perform ASR or SLU, and wherein the output of the joint model is a tokenized transcript of the utterance for ASR and intent slot keys and values for SLU (Thomas, para 0028 - 0035, FIG. 1 is a diagram illustrating E2E SLU model or system architecture with dialog history in an embodiment…Coupled memory devices may be configured to selectively store instructions executable by one or more hardware processors… The model shown in FIG. 1 can effectively encode full dialog history into a speech based E2E SLU system. A series of utterances in a conversation is shown with a current utterance 110, e.g., a current user response, being input as speech features 104. An encoder 104 encodes a dialog history (preceding utterances or turns in the conversation) 108 into an embedding. In an embodiment, a system and method disclosed herein can use Bidirectional Encoder Representations from Transformers (BERT) model embeddings to encode various elements of dialog history. The ASR+SLU model or system 106 can be trained in many ways. In one embodiment it can … produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed…In another embodiment, the ASR+SLU model or system 106 … produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”; [i.e., Figure 1 has an E2E architecture with various embodiments and various elements/models including the ASR+SLU. The coupled memory devices for this architecture can be configured to “selectively store instructions” (as, for instance, “an indicator token identifying whether to perform ASR or SLU”) executable by one or more hardware processors of the system. The ASR+SLU model or system 106 can produce ASR transcripts or SLU labels, and it can produce tokenized transcript of the utterance for ASR and intent slot keys and values for SLU]).
Regarding Claim 8, Thomas discloses an electronic device (Thomas, para 0105, computing/processing devices) comprising:
at least one processing device (Thomas, para 0105, computing/processing devices) configured to:
receive an utterance from an audio input device (Thomas, para 0005-0017, receive (receiving) audio signals representing a current utterance in a conversation and a dialog history…);
determine a context associated with the utterance (Thomas, para 0037, Given a dialog dataset D, an example is denoted as a triplet <c, u_t, l>, where c = {u_1, u_2, . . . , u_t−1} represents the dialog context with t−1 utterances);
provide the utterance as an input to a joint model for automatic speech recognition (ASR) and spoken language understanding (SLU) (Thomas, para 0054, The jointly trained ASR+SLU can be run with different kinds of dialog history embeddings, for example, described above. For example, these 128 dimensional BERT embeddings can be used as input features by appending them to the 240 dimensional acoustic features used to train a baseline system), wherein the joint model operates in a single mode to perform both ASR and SLU or a dual mode to perform one of ASR or SLU depending on the context (Thomas, para 0054-0057, The jointly trained ASR+SLU can be run with different kinds of dialog history embeddings, for example, described above… The trained model can be run with such additional information as input features, for example, different kinds of embeddings. Experiments demonstrate the benefit of integrating dialog history for the task of dialog act prediction. ; OR, Thomas, para 0032-0035, In an embodiment, the system and/or method disclosed herein allows for the integration of entire dialog history, not just a previous system prompt. Experiments indicate that performance improves with longer history context. It can handle both dialog human-human conversations and computer-human interactions, given the flexibility of the length of dialog history. In an embodiment, an existing SLU model can be modified to accommodate dialog history via a customization step. In an embodiment, the embedding extractor 104 for dialog history can be a BERT model that has been trained on large amounts of data. The BERT model can also be adapted on the current data and task. The approach (e.g., a system and method) disclosed herein improves the performance of speech-based SLU models, for example, in performing tasks such as dialog action prediction and intent recognition. In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. Another SLU model, such as attention mechanism neural network can be implemented. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data… In an embodiment, the ASR in ASR+SLU model 106 can transform speech signals or audio signals (e.g., 102) to word for word transcript, e.g., linguistic text. The SLU in ASR+SLU model 106 can assign meaning to the transcript, e.g., dialog tag and/or intent. The output of the ASR+SLU model 106 can be one or more of dialog act or tag, dialog intent, and text transcript of speech. Such output can be saved as part of dialog history, in an embodiment, for use in the next utterance turn. The ASR+ SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed. For example, in a travel reservation SLU embodiment, for a speech utterance corresponding to a user prompt, “I want to travel from New York to Chicago on the 12th.”, the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label <travel-reservation>. The transcripts can be further processed to extract the origin and destination airports. In another embodiment, the ASR+SLU model or system 106 can be trained to produce SLU labels in the output itself. 
In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”; Thomas, para 0002-0003, Conventional spoken language understanding (SLU) systems can be built by integrating a text-based natural language understanding system with an automatic speech recognition (ASR) system….these traditional systems have been replaced by end-to-end (E2E) systems that directly process speech to produce spoken language understanding (SLU) entity or intent label targets without any intermediate ASR processing. When processing human-human or human-computer interactions, these E2E SLU systems process each turn of a conversation independently. However, spoken task-oriented conversations are often context dependent as users and agents converse in multiturn conversations to achieve the various user goals. These turns are also related, as the user or agent might refer to information introduced in previous turns. Without proper context these pieces of information introduce ambiguity. For example, “one” could refer to a scheduled appointment date or a part of a phone number or zip code depending on the context. Dialog history hence contains useful information that can be effectively used to improve the processing of each conversational turn and resolve such ambiguities in SLU systems; [i.e., According to Thomas, Figure 1, Element (System) 106 (i.e., the joint ASR+SLU Model) operates in a single mode to perform both ASR and SLU to produce an output of ASR (e.g., speech to text), Dialog Act, and Dialog intent, based on Current utterance (Speech) (with Speech Features Model) and Dialog History (that helps to determine CONTEXT with an Encoder); The ASR+SLU Model has different embodiments including a dual mode to perform one of ASR or SLU: In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data]); and
use an output of the joint model to perform an action requested in the utterance (Thomas, para 0035, the ASR+SLU model or system 106 can be trained to produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”).
Regarding Claim 9, Thomas discloses the electronic device of claim 8, wherein the joint model comprises a speech encoder, a shared encoder, and a shared decoder (Thomas, para 0030-0036, Figure 1, Units 102, 104, 106, The model shown in FIG. 1 can effectively encode full dialog history into a speech based E2E SLU system. A series of utterances in a conversation is shown with a current utterance 110, e.g., a current user response, being input as speech features 104. An encoder 104 encodes a dialog history (preceding utterances or turns in the conversation) 108 into an embedding. In an embodiment, a system and method disclosed herein can use Bidirectional Encoder Representations from Transformers (BERT) model embeddings to encode various elements of dialog history: e.g., the textual content of previous turns, speaker role (whether agent or user) for each turn and previous SLU tags for each utterance in the dialog history 108. Another encoder can be used for generating such embeddings. These embeddings can then be used as features that contain side information on dialog history for an SLU system 106, for example, but not limited to, a recurrent neural network (RNN) Transducer based E2E SLU system. For instance, information associated with the dialog history 108 can be encapsulated as embeddings or vector embeddings, e.g., consolidated in a single vector. In an embodiment, the dialog history 108 need not be text, e.g., the system in an embodiment can directly extract the dialog history embedding from speech without converting it first into text. Briefly BERT (Bidirectional Encoder Representations from Transformers) is a machine learning language model, which can be used for natural language processing (NLP)… In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. Another SLU model, such as attention mechanism neural network can be implemented. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data. A pre-trained model can then be modified to include semantic labels specific to the SLU task by resizing the output layer and the embedding layer of the prediction network to include additional output nodes. In an embodiment, the ASR in ASR+SLU model 106 can transform speech signals or audio signals (e.g., 102) to word for word transcript, e.g., linguistic text. The SLU in ASR+SLU model 106 can assign meaning to the transcript, e.g., dialog tag and/or intent. The output of the ASR+SLU model 106 can be one or more of dialog act or tag, dialog intent, and text transcript of speech. Such output can be saved as part of dialog history, in an embodiment, for use in the next utterance turn. The ASR+SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output… the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label… RNN-T based ASR models are a class of single end-to-end trained, streamable, all-neural models that are adopted for speech recognition… The joint network combines the two embedding outputs to produce a posterior distribution over the output symbols. This architecture can replace a conventional ASR system composed of separate acoustic model, language model, pronunciation lexicon, and decoder components. 
RNN-T models can handle more abstract output symbols such as ones marking speaker turns, and these models can be extended for SLU tasks; [Thomas, Figure 1, consists of a speech features extractor and speech encoder (element 102), the dialog encoder (element 104) as “a shared encoder”, and element 106 (the ASR+SLU model), which can consist of RNN-T based ASR models with decoder components as a “shared decoder”]).
Regarding Claim 12, Thomas discloses the electronic device of claim 8, wherein the at least one processing device is configured to select to use the single mode or the dual mode depending on the context (Thomas, para 0054-0064, The jointly trained ASR+SLU can be run with different kinds of dialog history embeddings, for example, described above… The trained model can be run with such additional information as input features, for example, different kinds of embeddings. Experiments demonstrate the benefit of integrating dialog history for the task of dialog act prediction. ; OR, Thomas, para 0032-0035, In an embodiment, the system and/or method disclosed herein allows for the integration of entire dialog history, not just a previous system prompt. Experiments indicate that performance improves with longer history context. It can handle both dialog human-human conversations and computer-human interactions, given the flexibility of the length of dialog history. In an embodiment, an existing SLU model can be modified to accommodate dialog history via a customization step. In an embodiment, the embedding extractor 104 for dialog history can be a BERT model that has been trained on large amounts of data. The BERT model can also be adapted on the current data and task. The approach (e.g., a system and method) disclosed herein improves the performance of speech-based SLU models, for example, in performing tasks such as dialog action prediction and intent recognition. In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. Another SLU model, such as attention mechanism neural network can be implemented. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data… In an embodiment, the ASR in ASR+SLU model 106 can transform speech signals or audio signals (e.g., 102) to word for word transcript, e.g., linguistic text. The SLU in ASR+SLU model 106 can assign meaning to the transcript, e.g., dialog tag and/or intent. The output of the ASR+SLU model 106 can be one or more of dialog act or tag, dialog intent, and text transcript of speech. Such output can be saved as part of dialog history, in an embodiment, for use in the next utterance turn. The ASR+ SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed. For example, in a travel reservation SLU embodiment, for a speech utterance corresponding to a user prompt, “I want to travel from New York to Chicago on the 12th.”, the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label <travel-reservation>. The transcripts can be further processed to extract the origin and destination airports. In another embodiment, the ASR+SLU model or system 106 can be trained to produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. 
The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”; Thomas, para 0002-0003, Conventional spoken language understanding (SLU) systems can be built by integrating a text-based natural language understanding system with an automatic speech recognition (ASR) system….these traditional systems have been replaced by end-to-end (E2E) systems that directly process speech to produce spoken language understanding (SLU) entity or intent label targets without any intermediate ASR processing. When processing human-human or human-computer interactions, these E2E SLU systems process each turn of a conversation independently. However, spoken task-oriented conversations are often context dependent as users and agents converse in multiturn conversations to achieve the various user goals. These turns are also related, as the user or agent might refer to information introduced in previous turns. Without proper context these pieces of information introduce ambiguity. For example, “one” could refer to a scheduled appointment date or a part of a phone number or zip code depending on the context. Dialog history hence contains useful information that can be effectively used to improve the processing of each conversational turn and resolve such ambiguities in SLU systems; [i.e., According to Thomas, Figure 1, Element (System) 106 (i.e., the joint ASR+SLU Model) operates in a single mode to perform both ASR and SLU to produce an output of ASR (e.g., speech to text), Dialog Act, and Dialog intent, based on Current utterance (Speech) (with Speech Features Model) and Dialog History (that helps to determine CONTEXT with an Encoder); The ASR+SLU Model has different embodiments including a dual mode to perform one of ASR or SLU: In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data] … Figure 2, Steps 202, 204 and 206, At Step 206, the method … perform a spoken language understanding task based on input features, which include speech features associated with the received audio signals and the embedding; [i.e., the SLU task is performed based on input speech features from the current utterance and the embedding (i.e., context), since the embedding from the Encoder (which can be BERT) of the dialog history produces the Context]).
Regarding Claim 13, Thomas discloses the electronic device of claim 8, wherein, in the single mode, the output of the joint model is a tokenized transcript of the utterance concatenated with intent and slot keys and values (Thomas, para 0034-0035, The ASR+ SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed. For example, in a travel reservation SLU embodiment, for a speech utterance corresponding to a user prompt, “I want to travel from New York to Chicago on the 12th.”, the ASR+ SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label <travel-reservation>. The transcripts can be further processed to extract the origin and destination airports. In another embodiment, the ASR+ SLU model or system 106 can be trained to produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+ SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”).
Regarding Claim 14, Thomas discloses the electronic device of claim 8, wherein, in the dual mode, the input includes an indicator token identifying whether to perform ASR or SLU, and wherein the output of the joint model is a tokenized transcript of the utterance for ASR and intent slot keys and values for SLU (Thomas, para 0028 - 0035, FIG. 1 is a diagram illustrating E2E SLU model or system architecture with dialog history in an embodiment…Coupled memory devices may be configured to selectively store instructions executable by one or more hardware processors… The model shown in FIG. 1 can effectively encode full dialog history into a speech based E2E SLU system. A series of utterances in a conversation is shown with a current utterance 110, e.g., a current user response, being input as speech features 104. An encoder 104 encodes a dialog history (preceding utterances or turns in the conversation) 108 into an embedding. In an embodiment, a system and method disclosed herein can use Bidirectional Encoder Representations from Transformers (BERT) model embeddings to encode various elements of dialog history. The ASR+SLU model or system 106 can be trained in many ways. In one embodiment it can … produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed…In another embodiment, the ASR+SLU model or system 106 … produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”; [i.e., Figure 1 has an E2E architecture with various embodiments and various elements/models including the ASR+SLU. The coupled memory devices for this architecture can be configured to “selectively store instructions” (as, for instance, “an indicator token identifying whether to perform ASR or SLU”) executable by one or more hardware processors of the system. The ASR+SLU model or system 106 can produce ASR transcripts or SLU labels, and it can produce tokenized transcript of the utterance for ASR and intent slot keys and values for SLU]).
Regarding Claim 15, Thomas discloses a non-transitory machine readable medium containing instructions that when executed cause at least one processor of an electronic device (Thomas, para 0074-0077, one or more processors or processing units 12, a system memory 16, … The processor 12 may include a module 30 that performs the methods described herein. The module 30 may be programmed into the integrated circuits of the processor 12, or loaded from memory 16, storage device 18, or network 24 or combinations thereof… System memory 16 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory or others. Computer system may further include other removable/non-removable, volatile/non-volatile computer system storage media…) to:
receive an utterance from an audio input device (Thomas, para 0005, receive audio signals representing a current utterance in a conversation and a dialog history including at least information associated with past utterances corresponding to the current utterance in the conversation);
determine a context associated with the utterance (Thomas, 0002 - 0017, spoken task-oriented conversations are often context dependent as users and agents converse in multiturn conversations to achieve the various user goals. Dialog history hence contains useful information that can be effectively used to improve the processing of each conversational turn and resolve such ambiguities in SLU systems… The summary of the disclosure is given to aid understanding of a computer system and method of integrating dialog history into a spoken language understanding system…The dialog history can include audio signals, and at least one processor can be configured to encode the dialog history into the embedding directly from the audio signals…At least some of the dialog history can include machine inferred information associated with the past utterances…);
provide the utterance as an input to a joint model for automatic speech recognition (ASR) and spoken language understanding (SLU) (Thomas, para 0054, The jointly trained ASR+SLU can be run with different kinds of dialog history embeddings, for example, described above. For example, these 128 dimensional BERT embeddings can be used as input features by appending them to the 240 dimensional acoustic features used to train a baseline system), wherein the joint model operates in a single mode to perform both ASR and SLU or a dual mode to perform one of ASR or SLU depending on the context (Thomas, para 0054-0057, The jointly trained ASR+SLU can be run with different kinds of dialog history embeddings, for example, described above… The trained model can be run with such additional information as input features, for example, different kinds of embeddings. Experiments demonstrate the benefit of integrating dialog history for the task of dialog act prediction. ; OR, Thomas, para 0032-0035, In an embodiment, the system and/or method disclosed herein allows for the integration of entire dialog history, not just a previous system prompt. Experiments indicate that performance improves with longer history context. It can handle both dialog human-human conversations and computer-human interactions, given the flexibility of the length of dialog history. In an embodiment, an existing SLU model can be modified to accommodate dialog history via a customization step. In an embodiment, the embedding extractor 104 for dialog history can be a BERT model that has been trained on large amounts of data. The BERT model can also be adapted on the current data and task. The approach (e.g., a system and method) disclosed herein improves the performance of speech-based SLU models, for example, in performing tasks such as dialog action prediction and intent recognition. In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. Another SLU model, such as attention mechanism neural network can be implemented. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data… In an embodiment, the ASR in ASR+SLU model 106 can transform speech signals or audio signals (e.g., 102) to word for word transcript, e.g., linguistic text. The SLU in ASR+SLU model 106 can assign meaning to the transcript, e.g., dialog tag and/or intent. The output of the ASR+SLU model 106 can be one or more of dialog act or tag, dialog intent, and text transcript of speech. Such output can be saved as part of dialog history, in an embodiment, for use in the next utterance turn. The ASR+ SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed. For example, in a travel reservation SLU embodiment, for a speech utterance corresponding to a user prompt, “I want to travel from New York to Chicago on the 12th.”, the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label <travel-reservation>. The transcripts can be further processed to extract the origin and destination airports. In another embodiment, the ASR+SLU model or system 106 can be trained to produce SLU labels in the output itself. 
In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport>12th <travel-date>”; Thomas, para 0002-0003, Conventional spoken language understanding (SLU) systems can be built by integrating a text-based natural language understanding system with an automatic speech recognition (ASR) system….these traditional systems have been replaced by end-to-end (E2E) systems that directly process speech to produce spoken language understanding (SLU) entity or intent label targets without any intermediate ASR processing. When processing human-human or human-computer interactions, these E2E SLU systems process each turn of a conversation independently. However, spoken task-oriented conversations are often context dependent as users and agents converse in multiturn conversations to achieve the various user goals. These turns are also related, as the user or agent might refer to information introduced in previous turns. Without proper context these pieces of information introduce ambiguity. For example, “one” could refer to a scheduled appointment date or a part of a phone number or zip code depending on the context. Dialog history hence contains useful information that can be effectively used to improve the processing of each conversational turn and resolve such ambiguities in SLU systems; [i.e., According to Thomas, Figure 1, Element (System) 106 (i.e., the joint ASR+SLU Model) operates in a single mode to perform both ASR and SLU to produce an output of ASR (e.g., speech to text), Dialog Act, and Dialog intent, based on Current utterance (Speech) (with Speech Features Model) and Dialog History (that helps to determine CONTEXT with an Encoder); The ASR+SLU Model has different embodiments including a dual mode to perform one of ASR or SLU: In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data]); and
use an output of the joint model to perform an action requested in the utterance (Thomas, para 0035, the ASR+SLU model or system 106 can be trained to produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport> 12th <travel-date>”).
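By way of illustration only, the feature-appending step quoted above (Thomas, para 0054) can be sketched as follows; the 240- and 128-dimensional sizes come from the quoted passage, while the function name and use of PyTorch are illustrative assumptions, not Thomas's implementation:

    import torch

    ACOUSTIC_DIM = 240  # per-frame acoustic features (Thomas, para 0054)
    HISTORY_DIM = 128   # BERT dialog-history embedding (Thomas, para 0054)

    def build_model_input(acoustic_frames: torch.Tensor,
                          history_embedding: torch.Tensor) -> torch.Tensor:
        # Broadcast the single utterance-level history vector across all
        # frames, then append it to each frame's acoustic features.
        num_frames = acoustic_frames.size(0)
        tiled = history_embedding.unsqueeze(0).expand(num_frames, -1)
        return torch.cat([acoustic_frames, tiled], dim=-1)

    # Example: 300 frames of acoustic features plus one history embedding.
    features = build_model_input(torch.randn(300, ACOUSTIC_DIM),
                                 torch.randn(HISTORY_DIM))
    assert features.shape == (300, ACOUSTIC_DIM + HISTORY_DIM)  # (300, 368)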
Regarding Claim 16, Thomas discloses the non-transitory machine readable medium of claim 15, wherein the joint model comprises a speech encoder, a shared encoder, and a shared decoder (Thomas, para 0030-0036, Figure 1, Units 102, 104, 106, The model shown in FIG. 1 can effectively encode full dialog history into a speech based E2E SLU system. A series of utterances in a conversation is shown with a current utterance 110, e.g., a current user response, being input as speech features 104. An encoder 104 encodes a dialog history (preceding utterances or turns in the conversation) 108 into an embedding. In an embodiment, a system and method disclosed herein can use Bidirectional Encoder Representations from Transformers (BERT) model embeddings to encode various elements of dialog history: e.g., the textual content of previous turns, speaker role (whether agent or user) for each turn and previous SLU tags for each utterance in the dialog history 108. Another encoder can be used for generating such embeddings. These embeddings can then be used as features that contain side information on dialog history for an SLU system 106, for example, but not limited to, a recurrent neural network (RNN) Transducer based E2E SLU system. For instance, information associated with the dialog history 108 can be encapsulated as embeddings or vector embeddings, e.g., consolidated in a single vector. In an embodiment, the dialog history 108 need not be text, e.g., the system in an embodiment can directly extract the dialog history embedding from speech without converting it first into text. Briefly BERT (Bidirectional Encoder Representations from Transformers) is a machine learning language model, which can be used for natural language processing (NLP)… In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. Another SLU model, such as attention mechanism neural network can be implemented. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data. A pre-trained model can then be modified to include semantic labels specific to the SLU task by resizing the output layer and the embedding layer of the prediction network to include additional output nodes. In an embodiment, the ASR in ASR+SLU model 106 can transform speech signals or audio signals (e.g., 102) to word for word transcript, e.g., linguistic text. The SLU in ASR+SLU model 106 can assign meaning to the transcript, e.g., dialog tag and/or intent. The output of the ASR+SLU model 106 can be one or more of dialog act or tag, dialog intent, and text transcript of speech. Such output can be saved as part of dialog history, in an embodiment, for use in the next utterance turn. The ASR+SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output… the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label… RNN-T based ASR models are a class of single end-to-end trained, streamable, all-neural models that are adopted for speech recognition… The joint network combines the two embedding outputs to produce a posterior distribution over the output symbols. This architecture can replace a conventional ASR system composed of separate acoustic model, language model, pronunciation lexicon, and decoder components. 
RNN-T models can handle more abstract output symbols such as ones marking speaker turns, and these models can be extended for SLU tasks; [i.e., Thomas, Figure 1, consists of a speech features extractor and speech encoder (element 102) and a dialog encoder (element 104) as “a shared encoder,” and element 106 (the ASR+SLU model) can consist of RNN-T based ASR models with decoder components as a “shared decoder”]).
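By way of illustration only, a minimal sketch of the speech encoder / shared encoder / shared decoder arrangement mapped above; the LSTM modules, layer sizes, and vocabulary size are illustrative assumptions, and the full RNN-T joint network described by Thomas is omitted for brevity:

    import torch
    import torch.nn as nn

    class JointASRSLUModel(nn.Module):
        """Speech encoder -> shared encoder -> shared decoder, per claim 16."""
        def __init__(self, feat_dim=368, hidden=256, vocab_size=1000):
            super().__init__()
            # Speech encoder (cf. Thomas, Figure 1, element 102).
            self.speech_encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
            # Shared encoder: a common representation for both ASR and SLU.
            self.shared_encoder = nn.LSTM(hidden, hidden, batch_first=True)
            # Shared decoder: the output vocabulary may mix word tokens and
            # SLU labels such as <origin-airport> (cf. Thomas, para 0035).
            self.shared_decoder = nn.Linear(hidden, vocab_size)

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            x, _ = self.speech_encoder(features)  # (batch, frames, hidden)
            x, _ = self.shared_encoder(x)
            return self.shared_decoder(x)         # per-frame token logits

    logits = JointASRSLUModel()(torch.randn(1, 300, 368))
    assert logits.shape == (1, 300, 1000)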
Regarding Claim 18, Thomas discloses the non-transitory machine readable medium of claim 15, further containing instructions that when executed cause the at least one processor of the electronic device to select to use the single mode or the dual mode depending on the context (Thomas, para 0054-0064, The jointly trained ASR+SLU can be run with different kinds of dialog history embeddings, for example, described above… The trained model can be run with such additional information as input features, for example, different kinds of embeddings. Experiments demonstrate the benefit of integrating dialog history for the task of dialog act prediction; OR, Thomas, para 0032-0035, In an embodiment, the system and/or method disclosed herein allows for the integration of entire dialog history, not just a previous system prompt. Experiments indicate that performance improves with longer history context. It can handle both human-human dialog conversations and computer-human interactions, given the flexibility of the length of dialog history. In an embodiment, an existing SLU model can be modified to accommodate dialog history via a customization step. In an embodiment, the embedding extractor 104 for dialog history can be a BERT model that has been trained on large amounts of data. The BERT model can also be adapted on the current data and task. The approach (e.g., a system and method) disclosed herein improves the performance of speech-based SLU models, for example, in performing tasks such as dialog action prediction and intent recognition. In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. Another SLU model, such as an attention mechanism neural network, can be implemented. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data… In an embodiment, the ASR in ASR+SLU model 106 can transform speech signals or audio signals (e.g., 102) to word for word transcript, e.g., linguistic text. The SLU in ASR+SLU model 106 can assign meaning to the transcript, e.g., dialog tag and/or intent. The output of the ASR+SLU model 106 can be one or more of dialog act or tag, dialog intent, and text transcript of speech. Such output can be saved as part of dialog history, in an embodiment, for use in the next utterance turn. The ASR+SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed. For example, in a travel reservation SLU embodiment, for a speech utterance corresponding to a user prompt, “I want to travel from New York to Chicago on the 12th.”, the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label <travel-reservation>. The transcripts can be further processed to extract the origin and destination airports. In another embodiment, the ASR+SLU model or system 106 can be trained to produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”.
The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport> 12th <travel-date>”; Thomas, para 0002-0003, Conventional spoken language understanding (SLU) systems can be built by integrating a text-based natural language understanding system with an automatic speech recognition (ASR) system….these traditional systems have been replaced by end-to-end (E2E) systems that directly process speech to produce spoken language understanding (SLU) entity or intent label targets without any intermediate ASR processing. When processing human-human or human-computer interactions, these E2E SLU systems process each turn of a conversation independently. However, spoken task-oriented conversations are often context dependent as users and agents converse in multiturn conversations to achieve the various user goals. These turns are also related, as the user or agent might refer to information introduced in previous turns. Without proper context these pieces of information introduce ambiguity. For example, “one” could refer to a scheduled appointment date or a part of a phone number or zip code depending on the context. Dialog history hence contains useful information that can be effectively used to improve the processing of each conversational turn and resolve such ambiguities in SLU systems; [i.e., According to Thomas, Figure 1, Element (System) 106 (i.e., the joint ASR+SLU Model) operates in a single mode to perform both ASR and SLU to produce an output of ASR (e.g., speech to text), Dialog Act, and Dialog intent, based on the Current utterance (Speech) (with the Speech Features Model) and the Dialog History (which helps to determine CONTEXT with an Encoder); the ASR+SLU Model has different embodiments, including a dual mode to perform one of ASR or SLU: In an embodiment, an end-to-end speech based SLU system includes an RNN transducer (RNN-T) based SLU model 106. In an embodiment, one or more RNN-T models are developed by pre-training the models on task independent ASR data] … Figure 2, Steps 202, 204 and 206, At Step 206, the method … perform a spoken language understanding task based on input features, which include speech features associated with the received audio signals and the embedding; [i.e., the SLU task is performed based on input speech features from the current utterance and the embedding (i.e., context), since the embedding from the Encoder (which can be BERT) of the dialog history produces the Context]).
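By way of illustration only, a toy sketch of selecting the single mode or the dual mode depending on the context, tracking the claim 18 language; the selection policy and its inputs are illustrative assumptions rather than a teaching of Thomas:

    from enum import Enum

    class Mode(Enum):
        SINGLE = "asr+slu"  # one pass performs both ASR and SLU
        ASR_ONLY = "asr"    # dual mode, transcript only
        SLU_ONLY = "slu"    # dual mode, intent/slots only

    def select_mode(has_dialog_history: bool, transcript_needed: bool) -> Mode:
        # Toy policy: with dialog history (context) available, run the joint
        # single-mode pass; otherwise run only the task the caller needs.
        if has_dialog_history:
            return Mode.SINGLE
        return Mode.ASR_ONLY if transcript_needed else Mode.SLU_ONLY

    assert select_mode(True, False) is Mode.SINGLE
    assert select_mode(False, True) is Mode.ASR_ONLY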
Regarding Claim 19, Thomas discloses the non-transitory machine readable medium of claim 15, wherein, in the single mode, the output of the joint model is a tokenized transcript of the utterance concatenated with intent and slot keys and values (Thomas, para 0034-0035, The ASR+SLU model or system 106 can be trained in many ways. In one embodiment it can be trained to produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed. For example, in a travel reservation SLU embodiment, for a speech utterance corresponding to a user prompt, “I want to travel from New York to Chicago on the 12th.”, the ASR+SLU model or system 106 can produce the full verbatim transcript along with an SLU intent label <travel-reservation>. The transcripts can be further processed to extract the origin and destination airports. In another embodiment, the ASR+SLU model or system 106 can be trained to produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport> 12th <travel-date>”).
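By way of illustration only, a short sketch of consuming the single-mode output format quoted above (Thomas, para 0035), recovering a plain transcript and slot key/value pairs from the tagged strings; the regular expressions and return types are illustrative assumptions:

    import re

    # Tagged single-mode outputs in the style of Thomas, para 0035.
    TAGGED = ("I want to travel from New York <origin-airport> "
              "to Chicago <destination-airport> on the 12th <travel-date>")
    ENTITIES = ("New York <origin-airport> Chicago <destination-airport> "
                "12th <travel-date>")

    def strip_tags(text: str) -> str:
        # Remove <...> labels, leaving the plain transcript.
        return re.sub(r"\s*<[\w-]+>", "", text)

    def parse_slots(text: str) -> dict:
        # Split on <tag>; the text segment preceding each tag is its value.
        parts = re.split(r"<([\w-]+)>", text)
        return {parts[i + 1]: parts[i].strip()
                for i in range(0, len(parts) - 1, 2)}

    assert strip_tags(TAGGED) == ("I want to travel from New York "
                                  "to Chicago on the 12th")
    assert parse_slots(ENTITIES) == {"origin-airport": "New York",
                                     "destination-airport": "Chicago",
                                     "travel-date": "12th"}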
Regarding Claim 20, Thomas discloses the non-transitory machine readable medium of claim 15, wherein, in the dual mode, the input includes an indicator token identifying whether to perform ASR or SLU, and wherein the output of the joint model is a tokenized transcript of the utterance for ASR and intent slot keys and values for SLU (Thomas, para 0028-0035, FIG. 1 is a diagram illustrating E2E SLU model or system architecture with dialog history in an embodiment…Coupled memory devices may be configured to selectively store instructions executable by one or more hardware processors… The model shown in FIG. 1 can effectively encode full dialog history into a speech based E2E SLU system. A series of utterances in a conversation is shown with a current utterance 110, e.g., a current user response, being input as speech features 104. An encoder 104 encodes a dialog history (preceding utterances or turns in the conversation) 108 into an embedding. In an embodiment, a system and method disclosed herein can use Bidirectional Encoder Representations from Transformers (BERT) model embeddings to encode various elements of dialog history. The ASR+SLU model or system 106 can be trained in many ways. In one embodiment it can … produce full verbatim transcripts along with SLU labels at the output. SLU entities can then be further extracted from the ASR transcripts as needed…In another embodiment, the ASR+SLU model or system 106 … produce SLU labels in the output itself. In that case, e.g., the output can look like: “I want to travel from New York <origin-airport> to Chicago <destination-airport> on the 12th <travel-date>”. The ASR+SLU model or system 106 can also be trained to generate just the SLU entities and their slot values, e.g.: “New York <origin-airport> Chicago <destination-airport> 12th <travel-date>”; [i.e., Figure 1 has an E2E architecture with various embodiments and various elements/models including the ASR+SLU. The coupled memory devices for this architecture can be configured to “selectively store instructions” (as, for instance, “an indicator token identifying whether to perform ASR or SLU”) executable by one or more hardware processors of the system. The ASR+SLU model or system 106 can produce ASR transcripts or SLU labels, and it can produce a tokenized transcript of the utterance for ASR and intent slot keys and values for SLU]).
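By way of illustration only, a toy sketch of the indicator-token mechanism recited in claim 20; Thomas is cited above for selectively stored instructions, not for this exact token scheme, so the token names and helper below are hypothetical:

    ASR_TOKEN = "<asr>"  # request a tokenized transcript
    SLU_TOKEN = "<slu>"  # request intent and slot keys/values

    def build_dual_mode_input(task: str, utterance_tokens: list) -> list:
        # Prepend the indicator token so a single shared model can route
        # the same utterance to either task.
        if task not in ("asr", "slu"):
            raise ValueError("task must be 'asr' or 'slu'")
        return [ASR_TOKEN if task == "asr" else SLU_TOKEN] + utterance_tokens

    assert build_dual_mode_input("slu", ["frame0", "frame1"])[0] == "<slu>"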
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 3 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Thomas in view of Fu et al., US Patent Application Publication No. 2020/0219486 A1 (Fu).
Regarding Claim 3, Thomas discloses the method of claim 2.
Thomas does not specifically disclose wherein the joint model further comprises a layer normalization between the speech encoder and the shared encoder.
However, Fu, in the same field of endeavor, discloses wherein the joint model further comprises a layer normalization between the speech encoder and the shared encoder (Fu, para 0045, As illustrated in FIG. 5, the shared encoder 520 includes one convolution layer (Conv), N LSTMs, and N batch normalization (BN) layers, where N may be a positive integer (e.g., 5, etc.). LSTM may be unidirectional. For a given input speech signal, the shared encoder 520 firstly encodes the speech signal to obtain a corresponding sequence 530 of implicit features. In some embodiments, the speech signal 510 has been subjected to feature extraction to obtain a model input x before being input to the shared encoder 520. It should be understood that although the internal hierarchical structure of the shared encoder 520 is illustrated in FIG. 5, encoders with other structures may be used in conjunction with the embodiments of the present disclosure).
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of Fu into the method of Thomas, because doing so would support intelligent customer service assistants and improve the accuracy of speech recognition, producing better user experiences with speech-related products (Fu, para 0003).
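By way of illustration only, a minimal sketch of a normalization layer between a speech encoder and a shared encoder, per the claim 3 limitation; Fu (para 0045) shows batch normalization within a Conv+LSTM shared encoder, whereas the claim recites layer normalization, so nn.LayerNorm is used here with assumed module sizes:

    import torch
    import torch.nn as nn

    class NormalizedEncoderStack(nn.Module):
        def __init__(self, feat_dim=240, hidden=256):
            super().__init__()
            self.speech_encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.layer_norm = nn.LayerNorm(hidden)  # between the two encoders
            self.shared_encoder = nn.LSTM(hidden, hidden, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x, _ = self.speech_encoder(x)
            x = self.layer_norm(x)  # normalize before the shared encoder
            x, _ = self.shared_encoder(x)
            return x

    out = NormalizedEncoderStack()(torch.randn(1, 300, 240))
    assert out.shape == (1, 300, 256)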
Regarding Claim 10, Thomas discloses the electronic device of claim 9.
Thomas does not specifically disclose wherein the joint model further comprises a layer normalization between the speech encoder and the shared encoder.
However, Fu, in the same field of endeavor, discloses wherein the joint model further comprises a layer normalization between the speech encoder and the shared encoder (Fu, para 0045, As illustrated in FIG. 5, the shared encoder 520 includes one convolution layer (Conv), N LSTMs, and N batch normalization (BN) layers, where N may be a positive integer (e.g., 5, etc.). LSTM may be unidirectional. For a given input speech signal, the shared encoder 520 firstly encodes the speech signal to obtain a corresponding sequence 530 of implicit features. In some embodiments, the speech signal 510 has been subjected to feature extraction to obtain a model input x before being input to the shared encoder 520. It should be understood that although the internal hierarchical structure of the shared encoder 520 is illustrated in FIG. 5, encoders with other structures may be used in conjunction with the embodiments of the present disclosure).
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the method of Fu into the method of Thomas, because doing so would support intelligent customer service assistants and improve the accuracy of speech recognition, producing better user experiences with speech-related products (Fu, para 0003).
Allowable Subject Matter
10. Claims 4, 11 and 17 are objected to as being dependent upon rejected base claims, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The reasons for allowance are that the prior art of record does not specifically teach the limitations as recited in the mentioned claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MULUGETA T. DUGDA whose telephone number is (703)756-1106. The examiner can normally be reached Mon - Fri, 4:30am - 7:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Paras D. Shah can be reached at 571-270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MULUGETA TUJI DUGDA/Examiner, Art Unit 2653
/Paras D Shah/Supervisory Patent Examiner, Art Unit 2653
03/21/2026