Prosecution Insights
Last updated: April 19, 2026
Application No. 17/571,425

METHODS AND SYSTEMS FOR STREAMABLE MULTIMODAL LANGUAGE UNDERSTANDING

Status: Non-Final Office Action (§103)
Filed: Jan 07, 2022
Examiner: COLUCCI, MICHAEL C
Art Unit: 2655
Tech Center: 2600 — Communications
Assignee: Huawei Technologies Co., Ltd.
OA Round: 6 (Non-Final)

Grant Probability: 76% (Favorable)
Expected OA Rounds: 6-7
Expected Time to Grant: 3y 1m
Grant Probability With Interview: 91%

Examiner Intelligence

Career Allow Rate: 76% (749 granted / 990 resolved); +13.7% vs Tech Center average (above average)
Interview Lift: +15.3% higher allow rate among resolved cases with an examiner interview (a strong lift)
Typical Timeline: 3y 1m average prosecution; 41 applications currently pending
Career History: 1,031 total applications across all art units
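
The headline numbers above reconcile directly. A quick sketch, assuming the dashboard's apparent definitions (allow rate = grants / resolved cases, and the interview lift adds percentage points on top of the baseline):

```python
# Allow rate and interview lift, recomputed from the figures shown above.
granted, resolved = 749, 990
allow_rate = granted / resolved          # 0.757 -> displayed as 76%
with_interview = allow_rate + 0.153      # +15.3 pp lift -> ~0.91 (the 91% figure)
print(f"baseline {allow_rate:.1%}, with interview {with_interview:.1%}")
```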

Statute-Specific Performance

Statute   Examiner Rate   vs TC Avg
§101      14.2%           -25.8%
§103      59.2%           +19.2%
§102      8.5%            -31.5%
§112      6.0%            -34.0%

Tech Center averages are estimates. Based on career data from 990 resolved cases.
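
The "vs TC avg" deltas in this table are consistent with a single baseline. A sketch, assuming each delta is simply the examiner's rate minus the Tech Center average in percentage points:

```python
# Back out the implied Tech Center average from each row of the table.
rows = {"§101": (14.2, -25.8), "§103": (59.2, +19.2),
        "§102": (8.5, -31.5), "§112": (6.0, -34.0)}
for statute, (rate, delta) in rows.items():
    print(f"{statute}: implied TC average = {rate - delta:.1f}%")
# Every row backs out 40.0%, so the chart appears to use one TC-average
# estimate across statutes.
```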

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

NOTE: The previous office action dated 02/02/2026 is hereby withdrawn; this is a supplemental non-final action. The amendments filed 01/15/2026, following the advisory action of 12/29/2025, did not properly appear in the application viewer contents within the USPTO system. The attorney promptly notified the Examiner on 02/27/2026 that supplemental amendments had in fact been filed, and this office action addresses those supplemental amendments as part of the record. The Examiner thanks the attorney for the prompt notice.

Response to Arguments

Applicant's arguments with respect to claims 1, 3-8, 10-15, 22-24, and 26-29 have been considered but are moot in view of the new ground(s) of rejection. Reference Zheng has been incorporated into the rejection of independent claims 1, 8, and 15 to address the claim amendments. The claims are now directed to speech and text/transcript alignment using a cross-modal attention model, recited along with monotonic alignment and streaming. Such concepts are taught in Zheng, which expressly aligns speech with text using a cross-modal attention model operating on embeddings representing speech and text. Furthermore, a cross-modal attention model inherently uses weights, known as attention weights or scores, and Zheng defines the FAT-MLM as cross-modal attention based and calculates probabilities thereof, as in 0071-0075. Although the weights of a fundamental cross-modal attention model thus appear inherent, for clarity, Moritz addresses attention mechanisms and how they are expressly tied to weights.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-8, 10-15, 22-24, and 26-29 are rejected under 35 U.S.C. 103 as being unpatentable over US 11562735 B1 (Gupta, Arshit et al., hereinafter Gupta) in view of US 20210183373 A1 (Moritz, Niko et al., hereinafter Moritz), and further in view of US 20230169281 A1 (Zheng, Renjie et al., hereinafter Zheng).

Re claim 8, Gupta teaches:

"A computing system comprising: one or more processors; a memory storing machine-executable instructions, which, when executed by the one or more processors, cause the computing system to:" (as in fig. 6, the encoder and decoders require processors or are processors themselves)

"receive, for a speaker's speech, a sequence of speech chunks and corresponding text transcripts, and each text transcript represents a portion of the utterance, the text transcripts being received from a streamable automatic speech recognition (ASR) module, ..." (streamable as in fig. 1 with col 4: a stream of speech enters an ASR, which outputs text, as is the purpose of an ASR and as expressly illustrated in fig. 1 with col 4 lines 6-69; the speech input, e.g., "how is the weather in Dallas", is one or more words serving as a command/request as well as a transcript, see fig. 6 and col 9 line 14 to col 10 line 42, with commands analogous to requests or intents using a training model with datasets such that the system expects certain commands/requests, see col 12 line 60 to col 13 line 29)

"for each speech chunk and the corresponding text transcript for the speech chunk:" (a "chunk" itself is simply data or audio which is then encoded into a representation of that data or audio, and such a representation can include emotion found in the data/audio; this interpretation falls in line with Gupta col 2 line 60 to col 3 line 14 and the previously cited col 9 line 14 to col 10 line 42 with fig. 6)

"encode the speech chunk to generate a speech embedding;" (Under BRI, the specification of the present invention provides multiple embodiments as to the nature of speech chunks. As evidenced in the present specification at 0007-0008 and 0042, the chunk itself is simply data or audio which is then encoded into a representation of that data or audio, and such a representation can include emotion found in the data/audio; this interpretation falls in line with Gupta col 2 line 60 to col 3 line 14 and col 9 line 14 to col 10 line 42 with fig. 6. Audio/data is encoded into a representation and, in parallel, audio/data is converted to a transcript and encoded into a representation inclusive of emotion (as a single, non-limiting example). Further, both the present specification (0042) and Gupta (above-cited sections) describe the transcript as a fixed-length vector. Synchronization follows elements 640 and 650, thereafter concatenation or merger at 655, and finally prediction at 670, as evidenced in the above-cited paragraphs and element 605, using the speech input, e.g., "how is the weather in Dallas", as one or more words as a command/request as well as a transcript, see fig. 6 and col 9 line 14 to col 10 line 42, and col 12 line 60 to col 13 line 29)

"encode the text transcript to generate a text embedding;" (a text embedding and an audio embedding, analogous to a speech vector aligned with a text vector and to a collection of embedded data thereof, are aligned as input into element 660 and further concatenated as input into element 655, as in col 3 lines 30-55 and the cited col 9 line 14 to col 10 line 42 with fig. 6; the same BRI analysis of speech chunks and fixed-length transcript vectors set out for the preceding limitation applies here)

"synchronize ... a collection of the speech embeddings and a collection of the text embeddings, based on a temporal alignment of the collection of speech embeddings and the collection of text embeddings, to generate a collection of aligned speech embeddings by:" (as in col 9 line 13 to col 10 line 42 with fig. 6, text and speech are aligned and synchronized under the premise of embedding vectors, i.e., embedded data representing text from speech; the embeddings are aligned as input into element 660, where audio and transcript are synchronized, and further concatenated as input into element 655, as in col 3 lines 30-55)

Regarding weights: (a fully connected layer has a set of neurons, each of which may be connected to all representations produced in the previous layer (e.g., encoder 410) through weighted connections, as seen in regular feedforward artificial neural networks; thus, encoder 410 together with classifier 415 may be considered analogous to an ANN model)

"concatenate the collection of aligned speech embeddings and the collection of text embeddings to generate [[an]] a collection of audio-textual embeddings; and" (from element 660, all encoded data is combined and concatenated at 655, as in col 3 lines 30-55 and col 9 line 14 to col 10 line 42 with fig. 6)

"generate a semantic prediction based on the collection of audio-textual embeddings, wherein the semantic prediction is generated before the speech signal representative of the speaker's speech comprises the entire utterance; and" (detection of one word at a time to establish intent, as in fig. 3 with col 5 line 64 to col 7 line 10, where a hidden state is present and intent can still be determined; at module 620, language understanding produces intent predictions from the data combined and concatenated at 655)

"transform one or more of the semantic predictions into a command action based on a predefined set of commands." (the output intent is the command, from module 620's language understanding of intent predictions; see fig. 6 and col 9 line 14 to col 10 line 42, with commands analogous to requests or intents using a training model with datasets such that the system expects certain commands/requests, see col 12 line 60 to col 13 line 29)

However, while Gupta teaches a nearly identical mechanism to produce predicted intent, it fails to expressly recite known uses of alignment, attention weights, synchronization of text and speech according to LSTM-style concepts, and vectorized embeddings, thus failing to teach:

"... wherein each speech chunk represents a segment of a speech signal corresponding to part of a word within an utterance of words;" (Moritz: characters such as word or sentence pieces/portions, 0007; synchronization, 0015 with fig. 1b; and time alignment such as CTC for the characters of words, one letter at a time as spoken in time steps, 0035, incrementally provided into a vector sequence of characters embedded, encoded, and aligned, 0040-0043 with 0067 and fig. 1d)

"based on a temporal alignment ...;" (Moritz: synchronization, 0015 with fig. 1b, and time alignment such as CTC, per the citations above)

"attention weights" (Moritz: synching encoded transcriptions with audio, see abstract with 0015, with attention weights expressly, see 0096, 0108, and fig. 3a)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta to incorporate the above claim limitations as taught by Moritz, applying the known technique of label-synchronous coding, specifically CTC time-step synchronization and encoding-based alignment with attention weighting, to improve similar devices in the same way: improving speech recognition for streaming or "real-time" applications, improving recognition accuracy, reducing computational load, and improving the accuracy of transcription outputs coded during synchronization. The training models suggested in Gupta are expressly improved with otherwise inherent, known, or necessary concepts, such as the use of weights in neural-network-style models, now expressly combined under Moritz, which at the very least improves accuracy and error rates.
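
To make the claimed data flow concrete, the following is a minimal sketch of the claim 8 pipeline as characterized above: each incoming speech chunk is encoded to a speech embedding, its partial transcript to a text embedding, the two are paired under a trivially monotonic alignment, concatenated into an audio-textual embedding, and a semantic prediction is emitted before the utterance is complete. Every encoder, dimension, and intent label here is a hypothetical stand-in; nothing is drawn from Gupta, Moritz, or Zheng.

```python
import numpy as np

rng = np.random.default_rng(0)
D, CHUNK = 16, 160                       # embedding width and samples per chunk (assumptions)
W_speech = rng.standard_normal((CHUNK, D)) * 0.05
W_cls = rng.standard_normal((2 * D, 3))  # toy classifier over 3 intents
INTENTS = ["weather", "music", "timer"]

def encode_speech(chunk):
    """Stand-in speech encoder: linear projection to a D-dim embedding."""
    return chunk @ W_speech

def encode_text(transcript):
    """Stand-in text encoder: character hashing into a D-dim embedding."""
    v = np.zeros(D)
    for i, ch in enumerate(transcript):
        v[(i + ord(ch)) % D] += 1.0
    return v

# A toy stream: speech chunks paired with the partial transcripts an ASR might emit.
stream = [(rng.standard_normal(CHUNK), t)
          for t in ["how is", "the weather", "in Dallas"]]

speech_embs, text_embs = [], []
for chunk, transcript in stream:         # chunks arrive incrementally
    speech_embs.append(encode_speech(chunk))
    text_embs.append(encode_text(transcript))
    # Alignment is trivially monotonic here: chunk i pairs with transcript i.
    audio_textual = np.concatenate([speech_embs[-1], text_embs[-1]])
    # A semantic prediction is available before the utterance is complete.
    intent = INTENTS[int(np.argmax(audio_textual @ W_cls))]
    print(f"{transcript!r} -> {intent}")
```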
However, while Gupta in view of Moritz teaches monotonic (Moritz) alignment in time for attention schemes, as well as streamed input into an ASR producing aligned and synchronized output embeddings of speech with text, with Moritz teaching the weights thereof, the combination fails to teach:

"... the streamable ASR providing a monotonic alignment between the sequence of speech chunks and corresponding text transcripts" (Zheng: as in fig. 8b, speech is aligned with text, with monotonic alignment in attention for transcription and text, 0069-0070)

"using a cross-modal attention network," (Zheng: cross-modal attention is precisely taught; 0070 expressly aligns speech with text using a cross-modal attention model operating on embeddings, 0057-0058, representing speech and text. Furthermore, a cross-modal attention model inherently uses weights, known as attention weights or scores, and Zheng defines the FAT-MLM, under the concept of embedding, as cross-modal attention based, calculating probabilities thereof as in 0071-0076)

"computing attention weights between the collection of speech embeddings and the collection of text embeddings based on the cross-modal attention network; and" (Moritz teaches attention weights; moreover, a cross-modal attention model inherently uses attention weights or scores, and Zheng defines the FAT-MLM as cross-modal attention based, calculating probabilities thereof as in 0071-0076, with 0070 expressly aligning speech with text using a cross-modal attention model on embeddings, 0057-0058, representing speech and text)

"aligning the collection of speech embeddings with the collection of text embeddings, based on the attention weights;" (Zheng: the same analysis applies; a cross-modal attention model inherently uses attention weights or scores, the FAT-MLM is cross-modal attention based with probabilities as in 0071-0076, and 0070 expressly aligns speech with text using a cross-modal attention model on embeddings, 0057-0058)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta in view of Moritz to incorporate the above claim limitations as taught by Zheng, applying the known technique of cross-modal attention networks to improve similar devices in the same way: improving speech recognition with text alignment and concatenation, and providing an improved analogous model, such as an FAT-MLM with cross-modal attention, that handles not only text/speech but also a context shift in the input, analogous to the alignment and text-to-speech mapping in both Gupta and Moritz.

Re claim 1, this claim is rejected as a broader representation of claim 8, omitting the hardware, for instance, but otherwise of virtually identical scope.

Re claim 15, this claim is rejected as a broader representation of claim 8, omitting the hardware, for instance, but otherwise of virtually identical scope. Fig. 6 of Gupta demonstrates a memory.
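
The cross-modal attention relied on above reduces to a small amount of linear algebra. Below is a minimal sketch, assuming ordinary scaled dot-product attention between speech-chunk embeddings and text-token embeddings; it is not Zheng's FAT-MLM, only an illustration of how attention weights are computed and then used to align the two collections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_align(speech, text):
    """Score every (speech chunk, text token) pair, then re-express each
    speech embedding as an attention-weighted mixture of text embeddings."""
    d = speech.shape[-1]
    weights = softmax(speech @ text.T / np.sqrt(d))  # attention weights/scores
    aligned = weights @ text                         # aligned speech embeddings
    return aligned, weights

rng = np.random.default_rng(1)
speech = rng.standard_normal((4, 16))  # 4 speech-chunk embeddings
text = rng.standard_normal((6, 16))    # 6 text-token embeddings
aligned, weights = cross_modal_align(speech, text)
print(weights.shape, weights.sum(axis=1))  # (4, 6); each row sums to 1
```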
Re claims 3, 10, and 17, Gupta teaches claim 10:

"The computing system of claim 8, wherein the machine-executable instructions, when executed by the one or more processors, further cause the system to generate a semantic prediction based on the collection of audio-textual embeddings by performing sequence classification on the collection of audio-textual embeddings ..." (the text embedding and audio embedding, analogous to a speech vector aligned with a text vector and to a collection of embedded data thereof, are aligned as input into element 660 and further concatenated as input into element 655, as in col 3 lines 30-55 and the cited col 9 line 14 to col 10 line 42 with fig. 6; labeling is analogous to classification, with commands analogous to requests or intents using a training model with datasets such that the system expects certain commands/requests, see col 12 line 60 to col 13 line 29, for speech input, e.g., "how is the weather in Dallas", as one or more words as a command/request as well as a transcript, see fig. 6 and col 9 line 14 to col 10 line 42)

Re claims 4, 11, and 18, Gupta teaches claim 11:

"The computing system of claim 8, wherein the machine-executable instructions, when executed by the one or more processors, further cause the system to generate a semantic prediction based on the collection of audio-textual embeddings by performing sequence classification ..." (same analysis and citations as for claim 10 above: col 3 lines 30-55 and col 9 line 14 to col 10 line 42 with fig. 6, with col 12 line 60 to col 13 line 29)

However, while Gupta teaches a nearly identical mechanism to produce predicted intent, using labeling as a form of classification, it fails to teach:

"... and localization on the audio-textual representation" (Moritz: using CTC, for instance, with locating frames/features as in figs. 1a and 2a, and synching encoded transcriptions with audio, see abstract with 0015, with attention weights expressly, see 0096, 0108, and fig. 3a)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta to incorporate the above claim limitations as taught by Moritz, combining the prior art elements of prediction with labels, as in Gupta, with the CTC concepts of Moritz for improved real-time ASR operation. The training models suggested in Gupta are expressly improved with otherwise inherent, known, or necessary concepts, such as the use of CTC in neural networks to handle variability in cadence or general anomalies, applicable to the neural network models in Gupta and now expressly combined under Moritz, which at the very least improves accuracy, error rates, and the handling of exceptions.

Re claims 5 and 12, while Gupta teaches the nearly identical mechanism to produce predicted intent, it fails to teach claim 12:

"The computing system of claim 8, wherein each speech chunk in the sequence of speech chunks corresponds to a time step in a series of time steps." (Moritz: CTC temporal concepts in figs. 2a and 1a, 0046 with 0127-0128, and synching encoded transcriptions with audio, see abstract with 0015, with attention weights expressly, see 0096, 0108, and fig. 3a)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta to incorporate the above claim limitations as taught by Moritz, combining the time-related operations of Gupta with the express time steps of Moritz's CTC concepts for improved real-time ASR operation, with the same inherent-improvement rationale set out above: improved accuracy, error rates, and handling of exceptions.

Re claims 6, 13, and 19, Gupta teaches claim 13:

"The computing system of claim 8, wherein the machine-executable instructions, when executed by the one or more processors, further cause the computing system to: prior to receiving the sequence of speech chunks and corresponding text transcripts:" (the speech input precedes the transcript and their merger; e.g., "how is the weather in Dallas" as one or more words as a command/request as well as a transcript, see fig. 6 and col 9 line 14 to col 10 line 42, with commands analogous to requests or intents using a training model with datasets such that the system expects certain commands/requests, see col 12 line 60 to col 13 line 29)

"receive a speech signal corresponding to the speaker's speech;" (the speech input, e.g., "how is the weather in Dallas", see fig. 6 and col 9 line 14 to col 10 line 42)

"generate a sequence of speech chunks based on the speech signal;" (chunks as in a word, words, or a command in the speech input, see fig. 6 and col 9 line 14 to col 10 line 42)

"encode one or more encoded text features from each speech chunk;" (features are extracted from a word, words, or command speech input and encoded, see fig. 6 and col 9 line 14 to col 10 line 42)

"generate, based on the CTC-based predictions, a text prediction corresponding to each speech chunk." (intent prediction; features extracted from the word, words, or command speech input and encoded, see fig. 6 and col 9 line 14 to col 10 line 42)

However, while Gupta teaches inherent temporal concepts and neural networks which require scaling/weighting, it fails to teach:

"process the one or more encoded text features using an attention mechanism to generate an attention-based text prediction corresponding to each speech chunk;" (Moritz: using CTC, for instance, with locating frames/features as in figs. 1a and 2a, and synching encoded transcriptions with audio, see abstract with 0015, with attention weights expressly, see 0096, 0108, and fig. 3a)

"process the one or more encoded text features using connectionist temporal classification (CTC) to generate a CTC-based text prediction corresponding to each speech chunk; and" (Moritz: same citations, figs. 1a and 2a with the abstract, 0015, 0096, 0108, and fig. 3a)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta to incorporate the above claim limitations as taught by Moritz, combining the prior art elements of prediction with labels, as in Gupta, with the CTC concepts of Moritz for improved real-time ASR operation. The training models suggested in Gupta are expressly improved with otherwise inherent, known, or necessary concepts, such as the use of CTC in neural networks to handle variability in cadence or general anomalies, applicable to the neural network models in Gupta and now expressly combined under Moritz, which at the very least improves accuracy, error rates, and the handling of exceptions.
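
The CTC-based text prediction referenced for claims 6, 13, and 19 is commonly obtained by best-path (greedy) decoding: take the argmax label at each time step, merge consecutive repeats, and drop blanks. A minimal sketch follows; the label inventory and the blank index are assumptions for illustration, not anything from the cited references.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank label

def ctc_greedy_decode(frame_scores):
    """Best-path CTC decode: argmax per time step, merge repeats, drop blanks."""
    path = frame_scores.argmax(axis=-1)
    out, prev = [], BLANK
    for label in path:
        if label != BLANK and label != prev:
            out.append(int(label))
        prev = label
    return out

# Frame-level scores favoring the path  d d <blank> o o g,  which collapses to
# "dog" (labels: 0 = blank, 1 = 'd', 2 = 'o', 3 = 'g').
best_path = [1, 1, 0, 2, 2, 3]
frame_scores = np.eye(4)[best_path]
print(ctc_greedy_decode(frame_scores))  # [1, 2, 3] -> d, o, g
```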
Re claims 7, 14, and 20, Gupta teaches claim 14:

"The system of claim 8, wherein the machine-executable instructions, when executed by the one or more processors, further cause the computing system to: update the semantic prediction for each speech chunk in the sequence of speech chunks, as each speech chunk and corresponding text transcript are received, wherein the semantic prediction is iteratively updated for each subsequent speech chunk before the speech signal representative of the speaker's speech comprises the entire utterance." (detection of one word at a time to establish intent, as in fig. 3 with col 5 line 64 to col 7 line 10, where a hidden state is present and intent can still be determined; the text embedding and audio embedding are aligned as input into element 660 and further concatenated as input into element 655, as in col 3 lines 30-55 and col 9 line 14 to col 10 line 42 with fig. 6; the system learns with a training model where commands are analogous to requests or intents, using datasets such that the system expects certain commands/requests, see col 12 line 60 to col 13 line 29, with speech input, e.g., "how is the weather in Dallas", see fig. 6 and col 9 line 14 to col 10 line 42)

Re claim 22, while Gupta teaches the nearly identical mechanism to produce predicted intent, it fails to expressly recite known uses of weights when comparing candidates, thus failing to teach:

"22. (New) The method of claim 1, wherein the collection of aligned speech embeddings [[as]] represents an aligned speech vector corresponding to a first word in the sequence of words." (Moritz: an aligned vector with an embedded sequence is expressly taught as the letters are received, e.g., D-O-G as the first and only word; synchronization, 0015 with fig. 1b, and time alignment such as CTC for the characters of words, one letter at a time as spoken in time steps, 0035, incrementally provided into a vector sequence of characters embedded, encoded, and aligned, 0040-0043 with 0067 and fig. 1d, and the mathematical versions described thereof, 0093-0096)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta to incorporate the above claim limitations as taught by Moritz, applying the known technique of label-synchronous coding, specifically CTC time-step synchronization and encoding-based alignment with attention weighting, to improve similar devices in the same way: improving speech recognition for streaming or "real-time" applications, improving recognition accuracy, reducing computational load, and improving the accuracy of transcription outputs coded during synchronization.

Re claim 23, while Gupta teaches the nearly identical mechanism to produce predicted intent, it fails to expressly recite known uses of weights when comparing candidates, thus failing to teach:

"23. (New) The method of claim 2, wherein aligning the collection of speech embeddings ... into text feature space, based on the attention weights." (Moritz: the output is the transcription as a feature space in fig. 1a, element 125, a vector, post-alignment and post-attention-weight use; synchronization, 0015 with fig. 1b, and time alignment such as CTC for the characters of words, one letter at a time as spoken in time steps, 0035, incrementally provided into a vector sequence of characters embedded, encoded, and aligned, 0040-0043 with 0067 and fig. 1d, and the mathematical versions described thereof, 0093-0096)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta as taught by Moritz, for the same label-synchronous coding rationale set out for claim 22.

Re claim 24, Gupta teaches "as each speech chunk is received and before the speech signal representative of the speaker's speech comprises the entire utterance" (detection of one word at a time to establish intent, as in fig. 3 with col 5 line 64 to col 7 line 10, where a hidden state is present and intent can still be determined). However, while Gupta teaches the nearly identical mechanism to produce predicted intent, it fails to expressly recite known uses of weights when comparing candidates, thus failing to teach:

"24. (New) The method of claim 5, wherein the semantic prediction is output as a sequence of predicted semantic events, the semantic prediction incrementally capturing an intent of the speaker for each time step." (Moritz: a time step for each character produces the intent, or intention, to say the word DOG versus DUG; synchronization, 0015 with fig. 1b, and time alignment such as CTC for the characters of words, one letter at a time as spoken in time steps, 0035, incrementally provided into a vector sequence of characters embedded, encoded, and aligned, 0040-0043 with 0067 and fig. 1d, and the mathematical versions described thereof, 0093-0096)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta as taught by Moritz, for the same label-synchronous coding rationale set out for claim 22.

Re claim 26, while Gupta teaches the nearly identical mechanism to produce predicted intent, it fails to expressly recite known uses of alignment, attention weights, synchronization of text and speech according to LSTM-style concepts, and vectorized embeddings, thus failing to teach:

"26. (New) The method of claim 1, wherein the monotonic alignment between the sequence of speech chunks and corresponding text transcripts is based on a local attention performed on each speech chunk in the sequence of speech chunks." (Moritz: local, or time-based, speech in a region; synchronization, 0015 with fig. 1b, and time alignment such as CTC for the characters of words, one letter at a time as spoken in time steps, 0035, incrementally provided into a vector sequence of characters embedded, encoded, and aligned, 0040-0043 with 0067 and fig. 1d)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta to incorporate the above claim limitations as taught by Moritz, for the label-synchronous coding rationale set out for claim 22; additionally, the training models suggested in Gupta are expressly improved with otherwise inherent, known, or necessary concepts, such as the use of weights in neural-network-style models, now expressly combined under Moritz, which at the very least improves accuracy and error rates.

Re claim 27, while Gupta teaches the nearly identical mechanism to produce predicted intent, it fails to expressly recite known uses of alignment, attention weights, synchronization of text and speech according to LSTM-style concepts, and vectorized embeddings, thus failing to teach:

"27. (New) The method of claim 26, wherein the local attention is enforced within a moving forward window having a fixed size associated with a width of a speech chunk of the sequence of speech chunks." (Moritz: a shifting look-ahead location for frames, with fixed partitions moving, 0075-0076; synchronization, 0015 with fig. 1b, and time alignment such as CTC, 0035, 0040-0043 with 0067 and fig. 1d)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta as taught by Moritz, for the same rationale set out for claim 26.

Re claim 28, while Gupta in view of Moritz teaches monotonic (Moritz) alignment in time for attention schemes, as well as streamed input into an ASR producing aligned and synchronized output embeddings of speech with text, with Moritz teaching the weights thereof, the combination fails to teach:

"28. (New) The method of claim 1, wherein the temporal alignment of the collection of speech embeddings and the collection of text embeddings is a monotonic temporal alignment." (Zheng: as in fig. 8b, speech is aligned with text, with monotonic alignment in attention for transcription and text, 0069-0070)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta in view of Moritz as taught by Zheng, for the same cross-modal attention rationale set out for claim 8.

Re claim 29, while Gupta teaches the nearly identical mechanism to produce predicted intent, it fails to expressly recite known uses of alignment, attention weights, synchronization of text and speech according to LSTM-style concepts, and vectorized embeddings, thus failing to teach:

"29. (New) The method of claim 1, wherein the attention weights for a current speech chunk are restricted to a local window comprising the current speech chunk and one or more preceding speech chunks, the method further comprising:" (Moritz: a shifting look-ahead location for frames, with fixed partitions moving, 0075-0076; synchronization, 0015 with fig. 1b, and time alignment such as CTC, 0035, 0040-0043 with 0067 and fig. 1d)

"updating the aligned collection of speech embeddings with the collection of text embeddings at each time step without re-aligning any speech embeddings in the collection of speech embeddings that are associated with preceding speech chunks." (NOTE: There is no support in the present specification for such concepts; however, under BRI and in lieu of a 112 rejection, the limitation is construed as processing and aligning in real time, i.e., not altering or re-processing past frames. Moritz: a shifting look-ahead location for frames, with fixed partitions moving, 0075-0076; synchronization, 0015 with fig. 1b, and time alignment such as CTC, 0035, 0040-0043 with 0067 and fig. 1d)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Gupta as taught by Moritz, for the label-synchronous coding and weighting rationale set out for claims 22 and 26.

Conclusion

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/15/2026 has been entered.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: US 20210225357 A1 (Zhao, Pei et al.), directed to emotion extraction.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL COLUCCI, whose telephone number is (571) 270-1847. The examiner can normally be reached M-F, 9 AM - 7 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR; status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. For questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). For assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MICHAEL COLUCCI/
Primary Examiner, Art Unit 2655
(571) 270-1847
Examiner FAX: (571) 270-2847
Michael.Colucci@uspto.gov
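
Claims 26, 27, and 29 turn on local attention within a fixed, forward-moving window, with earlier alignments left untouched. The sketch below builds such a mask, assuming a window of the current chunk plus one predecessor; the window size and the way the mask is applied to attention scores are illustrative assumptions, not details from the record.

```python
import numpy as np

def local_window_mask(n_chunks, window=2):
    """Boolean mask: chunk i may attend only to itself and up to
    window-1 immediately preceding chunks (a moving forward window)."""
    i = np.arange(n_chunks)[:, None]   # query chunk index
    j = np.arange(n_chunks)[None, :]   # key chunk index
    return (j <= i) & (i - j < window)

mask = local_window_mask(5, window=2)
print(mask.astype(int))
# Applying the mask before the softmax (scores[~mask] = -inf) restricts the
# attention weights to the window; embeddings for chunks that have left the
# window keep their previous alignment and are never re-scored.
```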

Prosecution Timeline

Jan 07, 2022
Application Filed
Jul 09, 2024
Non-Final Rejection — §103
Oct 11, 2024
Response Filed
Nov 25, 2024
Final Rejection — §103
Mar 28, 2025
Request for Continued Examination
Mar 31, 2025
Response after Non-Final Action
May 06, 2025
Non-Final Rejection — §103
Aug 08, 2025
Response Filed
Oct 09, 2025
Final Rejection — §103
Dec 15, 2025
Response after Non-Final Action
Jan 15, 2026
Request for Continued Examination
Jan 26, 2026
Response after Non-Final Action
Jan 29, 2026
Non-Final Rejection — §103
Mar 04, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592240
ENCODING AND DECODING OF ACOUSTIC ENVIRONMENT
2y 5m to grant Granted Mar 31, 2026
Patent 12586570
CHUNK-WISE ATTENTION FOR LONGFORM ASR
2y 5m to grant Granted Mar 24, 2026
Patent 12573405
WORD CORRECTION USING AUTOMATIC SPEECH RECOGNITION (ASR) INCREMENTAL RESPONSE
2y 5m to grant Granted Mar 10, 2026
Patent 12573380
MANAGING AMBIGUOUS DATE MENTIONS IN TRANSFORMING NATURAL LANGUAGE TO A LOGICAL FORM
2y 5m to grant Granted Mar 10, 2026
Patent 12567414
SYSTEM AND METHOD FOR DETECTING A WAKEUP COMMAND FOR A VOICE ASSISTANT
2y 5m to grant Granted Mar 03, 2026
Study what changed in these cases to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 6-7
Grant Probability: 76%
With Interview: 91% (+15.3%)
Median Time to Grant: 3y 1m
PTA Risk: High

Based on 990 resolved cases by this examiner. Grant probability is derived from the career allow rate.
