DETAILED ACTION
This action is in response to the initial filing of application no. 18/762,552 on 07/02/2024.
Claims 1-20 are pending in this application, with claims 1, 12, and 18 being independent.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
Claims 1, 6, 7, and 15-18 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 1 and 18 recite the limitation “in response to receiving the audio data”, wherein this limitation comprises the initial recitation of “audio data” in independent claims 1 and 18. There is insufficient antecedent basis for this limitation in the claims.
Claims 6 and 7 recite the limitation, “wherein processing the audio data to extract the acoustic features of the audio data comprises”. There is insufficient antecedent basis for this limitation in the claim.
Claim 15 recites the limitation “using a multi-head attention mechanism of the pre-trained generative model”, wherein this limitation comprises the initial recitation of “pre-trained generative model” in dependent claim 15 (including all of the limitations of parent claims 12 and 13). There is insufficient antecedent basis for this limitation in the claim.
Claims 16 and 17 recite the limitation “and/or”. This limitation is indefinite since it is unclear whether the limitation is to be interpreted as conjunctive (and, i.e., both) or disjunctive (or, i.e., one or the other). To further prosecution, this limitation is interpreted as conjunctive. However, further correction or advisement is necessary.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1 and 18 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Xing et al. (US 2023/0223018) (“Xing”).
As to claim 1, Xing teaches a method implemented using one or more processors (Fig.1, 102; [0050 - 0051] [0054]), the method comprising: receiving audio data capturing user speech (Fig.2, 210 and Fig.3, 210; [0058] [0059] [0063]); and in response to receiving the audio data capturing the user speech: processing the audio data to determine a speech recognition (text transcript) of the user speech (An online attention CTC neural network receives speech and processes it to generate a text transcript, Fig.2, 220, 240, Fig.3, 303, 306, 312, 310, 316; [0061] [0062] [0064 – 0068] [0070]), processing the audio data (speech chunks/segments of an input speech signal, [0039] [0040]) to generate one or more audio embeddings (encoded speech embeddings, Fig.4, 414; [0041 - 0043]) that represent acoustic features of the audio data (Fig.4, 402; [0039] [0040] [0073] [0074]), and processing, using a machine learning (ML) model (neural network including a cross modal attention subnetwork, concatenator subnetwork and sequence classifier, Fig.4, 418, 422 and 426; [0006] [0009] [0030] [0074]), both (i) the one or more audio embeddings that represent the acoustic features of the audio data (encoded speech embeddings, Fig.4, 414) and (ii) a text embedding that represents the speech recognition (encoded word embeddings, Fig.4, 416; [0075]), to generate a model output (semantic prediction, Fig.2, 260 and Fig.4, 260; [0045] [0046]) ([0076 – 0083]); determining a response (command action, Fig.2, 280) to the user speech based on the model output ([0047] [0084]); and causing the response to be rendered in response to the user speech (The semantic predictions 260 may be transformed by an interpreter 270 into a command action 280 based on a predefined set of commands. A computing system or computer application running on a computing system that is capable of executing the predefined command action 280 may then be able to execute the command action 280… The streamable MLU system 200 may process the speech signal 210 to output a semantic prediction 260 that captures the speaker's intent to “turn on” “the lights”. The smart speaker may then be able to map the semantic prediction to a command action 280 from a predefined set of command actions that the user wishes to turn on the lights, and may execute the command action 280, [0060]).
As to claim 18, Xing teaches a system (Abstract) comprising one or more processors (Fig.1, 102; [0050 - 0051] [0054]) and memory (Fig.1, 116) storing instructions that, when executed by one or more of the processors, cause one or more of the processors ([0054]) to: in response to receiving the audio data capturing the user speech (Fig.2, 210 and Fig.3, 210; [0058] [0059] [0063]): process the audio data to determine a speech recognition (text transcript) of the user speech (An online attention CTC neural network receives speech and processes it to generate a text transcript, Fig.2, 220, 240, Fig.3, 303, 306, 312, 310, 316; [0061] [0062] [0064 – 0068] [0070]), process the audio data (speech chunks/segments of an input speech signal, [0039] [0040]) to generate one or more audio embeddings (encoded speech embeddings, Fig.4, 414; [0041 - 0043]) that represent acoustic features of the audio data (Fig.4, 402; [0039] [0040] [0073] [0074]), and process, using a machine learning (ML) model (neural network including a cross modal attention subnetwork, concatenator subnetwork and sequence classifier, Fig.4, 418, 422 and 426; [0006] [0009] [0030] [0074]), both (i) the one or more audio embeddings that represent the acoustic features of the audio data (encoded speech embeddings, Fig.4, 414) and (ii) a text embedding that represents the speech recognition (encoded word embeddings, Fig.4, 416; [0075]), to generate a model output (semantic prediction, Fig.2, 260 and Fig.4, 260; [0045] [0046]) ([0076 – 0083]); determine a response (command action, Fig.2, 280) to the user speech based on the model output ([0047] [0084]); and cause the response to be rendered in response to the user speech (The semantic predictions 260 may be transformed by an interpreter 270 into a command action 280 based on a predefined set of commands. A computing system or computer application running on a computing system that is capable of executing the predefined command action 280 may then be able to execute the command action 280… The streamable MLU system 200 may process the speech signal 210 to output a semantic prediction 260 that captures the speaker's intent to “turn on” “the lights”. The smart speaker may then be able to map the semantic prediction to a command action 280 from a predefined set of command actions that the user wishes to turn on the lights, and may execute the command action 280, [0060]).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 2-5 are rejected under 35 U.S.C. 103 as being unpatentable over Xing et al. (US 2023/0223018) (“Xing”) in view of Ko (US 2020/0143807) (“Ko”).
For claim 2, Xing fails to teach the following: determining whether the audio data capturing the user speech is noisy.
However, Ko discloses a method for providing a response to a user's speech or utterance (Abstract), comprising the following: determining that audio data capturing user speech is noisy (Fig.4, S410, S420; [0089 – 0101]); and further performing automatic speech recognition and natural language understanding at a remote device (server-based ASR and NLU) in response to determining that the audio data capturing the user speech is noisy (Fig.4, S430, S440, S470 and S480; [0102 – 0106]).
Additionally, Xing discloses that the automatic speech recognition and natural language understanding are performed at a remote device (The streamable MLU system is provided as a service to other electronic devices, wherein a speech signal is generated by a microphone of another electronic device and communicated to the device comprising the streamable MLU system, [0048] [0059]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve Xing's invention in the same way that Ko's invention has been improved to achieve the following predictable results for the purpose of accurately and reliably processing natural language input to better capture a user's intent to provide a desirable response (Xing, [0002] [0003]) (Ko, [0009]): further determining whether the audio data capturing the user speech is noisy.
For claim 3, Xing and Ko further disclose, wherein processing the audio data to generate the one or more audio embeddings is performed in response to determining that the audio data capturing the user speech is noisy (Xing, ASR and NLU are performed at a remote device, [0048] [0058 - 0060] [0073]) (Ko, The audio signal is transmitted to a remote device/server comprising ASR and NLU to process the audio signal, [0089 - 0106]).
For claim 4, Ko further discloses, wherein determining that the audio data capturing the user speech is noisy comprises: determining that a distance between a user providing the user speech and a client device that captures the audio data is greater than a distance threshold (Ko, [0078]).
For claim 5, Ko further discloses, wherein determining that the audio data capturing the user speech is noisy comprises: determining that a signal-to-noise ratio (SNR) for the audio data does not satisfy a SNR threshold (Ko, [0078] [0093 – 0095] [0103]).
Claims 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Xing et al. (US 2023/0223018) (“Xing”) in view of Graciarena et al. (US 2025/0046333) (“Graciarena”).
For claim 6, Xing fails to teach the following: generating a spectrogram from the audio data capturing the user speech, and processing the spectrogram to extract spectrogram features corresponding to the audio data as the acoustic features.
However, Graciarena discloses a system and method to automatically identify and classify audio input (Abstract), comprising the following: generating a spectrogram from audio data capturing user speech (An input audio waveform is converted to a spectrogram, [0004] [0019 – 0021]); and processing the spectrogram to extract spectrogram features corresponding to the audio data as acoustic features (The audio spectrogram is applied to a log Mel filter bank to generate acoustic features, [0041 – 0046]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve Xing's invention in the same way that Graciarena's invention has been improved to achieve the following predictable results for the purpose of accurately and reliably processing natural language input to better capture a user's intent to provide a desirable response (Xing, [0002] [0003]): further generating a spectrogram from the audio data capturing the user speech, and processing the spectrogram to extract spectrogram features corresponding to the audio data as the acoustic features (Xing, The speech features are extracted using an 80-dimensional log Mel filter bank, [0040]).
For claim 7, Xing and Graciarena further disclose, wherein processing the audio data to generate one or more audio embeddings that represent acoustic features of the audio data comprises: processing the spectrogram features, using an audio encoder, to generate the one or more audio embeddings (Xing, [0040] [0058 – 0061] [0073] [0074]) (Graciarena, [0041 – 0043] [0047] [0048]).
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Xing et al. (US 2023/0223018) (“Xing”) in view of Shabat et al. (US 2024/0203404) (“Shabat”).
For claim 8, Xing fails to teach that the ML model (neural network, [0006] [0009]) is a transformer-based large language model (LLM).
However, Shabat discloses a system and method for the purpose of enabling large language model-based spoken language understanding systems to leverage both audio and textual data (Abstract), wherein a machine learning model (Fine-Tuned LLM/NLU Module), which receives audio (e.g., speech) input and text input generated by ASR and outputs natural language understanding data (e.g., intent), is a transformer-based large language model (Fig.1B, 160 and Fig 3A, 300; [0003] [0032 – 0040] [0054 – 0056]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve Xing's invention in the same way that Shabat's invention has been improved to achieve the predictable results of the ML model, which receives and processes audio and text input to generate a spoken language understanding output, further comprising a transformer-based LLM, for the purpose of leveraging both audio and textual data to predict semantic information contained in received speech to generate a desirable response using LLMs, wherein LLMs enable transfer learning of general-purpose knowledge into specific NLP tasks (Xing, [0002] [0003] [0008]) (Shabat, [0001 – 0003] [0006]).
Claims 9, 10, 11, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Xing et al. (US 2023/0223018) (“Xing”) in view of Shabat et al. (US 2024/0203404) (“Shabat”), further in view of Liu et al. (US 2023/0368796) (“Liu”), and further in view of Jaber et al. (US 2024/0054342) (“Jaber”).
For claims 9 and 19, Xing fails to teach, wherein processing both (i) the one or more audio embeddings that represent the acoustic features of the audio data and (ii) the text embedding that represents the speech recognition comprises: processing the text embedding, using a multi-head attention mechanism, to generate intermediate attention features, and providing the intermediate attention features and the one or more audio embeddings to an additional multi-head attention mechanism.
However, Shabat discloses a system and method for the purpose of enabling large language model-based spoken language understanding systems to leverage both audio and textual data (Abstract), wherein a machine learning model (Fine-Tuned LLM/NLU Module), which receives audio (e.g., speech) input and text input generated from an ASR and outputs natural language understanding data (e.g., intent), is a large language model (Fig.1B, 160 and Fig 3A, 300 and Fig.3B; [0003] [0032 – 0040] [0054 – 0057] [0061]). Additionally, Shabat discloses providing one or more text embeddings and one or more audio embeddings to a multi-head attention mechanism (Fig.3C, 341; [0057] [0061]).
Moreover, Liu discloses a system and method for performing spoken language understanding (Abstract), comprising the following: a text encoder that further comprises a transformer encoder and a text embedder (Fig.5, 420, 522 and 524; [0106] [0107]); and text embeddings generated by the text embedder are further processed by the transformer encoder of the text encoder to generate intermediate features which are forwarded to a decoding process ([0107]).
Furthermore, Jaber discloses a system and method for processing input using a machine learning model (Abstract), wherein input (e.g., text) is processed by a transformer encoder (Fig.2B, 250) comprising a multi-headed attention section ([0052]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve Xing's invention in the same way that Shabat's invention has been improved to achieve the predictable results of the ML model, which receives and processes audio and text input to generate a spoken language understanding output, further comprising a transformer-based LLM, wherein one or more text embeddings and one or more audio embeddings are provided to a multi-head attention mechanism in the LLM model, for the purpose of leveraging both audio and textual data to predict semantic information contained in received speech to generate a desirable response using LLMs, wherein LLMs enable transfer learning of general-purpose knowledge into specific NLP tasks (Xing, [0002] [0003] [0008]) (Shabat, [0001 – 0003] [0006]).
Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve the invention disclosed by the combination of Xing and Shabat in the same way that Liu's invention has been improved to achieve the following predictable results for the purpose of leveraging both audio and textual data to predict semantic information contained in received speech to generate a desirable response using LLMs, which enable transfer learning of general-purpose knowledge into specific NLP tasks (Xing, [0002] [0003] [0008]) (Shabat, [0001 – 0003] [0006]): a text encoder of the LLM model (Shabat, Fig.3A, 310) further comprises a transformer encoder and a text embedder; and text embeddings generated by the text embedder are further processed by the transformer encoder of the text encoder to generate intermediate features which are provided to the multi-head attention mechanism.
Moreover, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve the invention disclosed by the combination of Xing, Shabat and Liu in the same way that Jaber's invention has been improved to achieve the following predictable results for the purpose of leveraging both audio and textual data to predict semantic information contained in received speech to generate a desirable response using LLMs, which enable transfer learning of general-purpose knowledge into specific NLP tasks (Xing, [0002] [0003] [0008]) (Shabat, [0001 – 0003] [0006]): the transformer encoder further comprises a multi-head attention mechanism.
For claim 10, Jaber further discloses, wherein the multi-head attention mechanism or the additional multi-head attention mechanism includes multiple attention heads each having a query matrix, a key matrix, and a value matrix (Jaber, [0052]).
For claims 11 and 20, Shabat, Liu and Jaber further disclose, wherein providing the intermediate attention features and the one or more audio embeddings to an additional multi-head attention mechanism causes the intermediate attention features to be multiplied with the query matrix, and the one or more audio embeddings to be multiplied with the key matrix and the value matrix, respectively (Shabat, [0054 – 0057] [0061]) (Liu, [0106] [0107]; Fig.5, 430 and 532; [0108] [0109]) (Jaber, All input features are multiplied with the query matrix, key matrix and value matrix, [0052]).
Claims 12 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US 2023/0368796) (“Liu”) in view of Serdyuk et al. (“Towards End-to-End Spoken Language Understanding”) (“Serdyuk”), and further in view of Du et al. (“LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT”) (“Du”).
For claim 12, Liu discloses a method (Abstract) implemented using one or more processors (Fig.7, 704) ([0112 – 0115]), the method comprising: generating one or more training instances (Fig.4B, 450; [0086] [0088] [0100]), the one or more training instances including a first training instance that includes a first training instance input and a first ground truth response ([0084] [0086] [0088] [0100]), wherein the first training instance input includes audio data capturing a user speech ([0088] [0100]), and wherein the first ground truth response includes content responsive to the user speech and is generated based on content of the user speech ([0084] [0088] [0100]); and processing the first training instance input, using a pre-trained transformer-based language model ([0085] [0100] [0101]), to generate a first training instance output (Fig.4B, 435; [0092] [0093] [0100] [0101] [0104]). Yet, Liu fails to teach the following: the first training instance input comprises noisy audio; the pre-trained transformer-based language model is a large language model; the first training instance output is compared with the first ground truth response to determine a first difference; and the pre-trained transformer-based language model is fine-tuned based on the determined first difference.
However, Serdyuk discloses a system and method for performing spoken language understanding (Abstract), wherein both clean audio and noisy audio and respective intent labels are used to train an end-to-end spoken language understanding (SLU) model (“We train and evaluate our models on an in-house dataset containing VR spoken commands collected for that purpose. The dataset is close in spirit to ATIS corpus [25]. The dataset contains about 320 hours of near field annotated data collected from a diverse set of more than 1000 de-identified speakers … Every utterance has transcription as well as meta information including a domain label and an intent label … We also emulate the real-world situation where the input to the SLU system is noisy. Both training and evaluation datasets were corrupted by convolving with recorded room impulse responses (RIRs) whose T60 times ranges from 200ms to 1 second. Background noise was added as well: for training data, the SNR ranges from 5 to 25dB …”, 3. End-to-End Spoken Language Understanding and 4. Experiments).
Additionally, Du discloses a system and method for performing speech understanding (Abstract and 1. Introduction, pg. 1 and 2), comprising the following: a transformer-based model (LauraGPT) which receives speech and text input and outputs spoken language understanding data (Figure 1; 1. Introduction and 3. Methodology, pg. 1 – 5) is a large language model (LLM) (1. Introduction, pg. 1 – 2); and the LLM is fine-tuned based on a difference between a ground truth response and a response generated by the LLM based on a training input (The model is trained by minimizing a cross-entropy loss between output predicted by the model and target output associated with training data, 3.3. Modified Language Model for Unifying Audio-Text Modeling, 3.5 Multi-Task Finetuning, 4.2 Training Setup and A.2 Training Datasets, pg. 4 – 6, 15 and 16).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve Liu's invention in the same way that Serdyuk's invention has been improved to achieve the following predictable results for the purpose of accurately and reliably processing natural language input using the transformer-based language model, wherein the language model has been trained to handle real-world situations, e.g., noisy input (Liu, [0002]) (Serdyuk, 4. Experiments): further adding noise to the audio data components of the training instances to generate training instances comprising noisy audio data, wherein these training instances are further provided to the pre-trained transformer-based model to fine-tune the model (Liu, SLU training data is used to fine-tune the SLU system, [0019]).
Moreover, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve the invention disclosed by the combination of Liu and Serdyuk in the same way that Du's invention has been improved to achieve the following predictable results for the purpose of accurately and reliably processing a large and varied amount of natural language input to better capture a user's intent to provide a desirable response (Du, 1. Introduction): the transformer-based machine learning model is a large language model; and the large language model is fine-tuned by minimizing a cross-entropy loss function which compares and determines a difference between a first training instance output and a ground truth response (target output).
For claim 16, Serdyuk further discloses, wherein the one or more training instances include a second training instance that includes a second training instance input and a second ground truth response, wherein the second training instance input includes alternative noisy audio data capturing the user speech (Serdyuk, “We also emulate the real-world situation where the input to the SLU system is noisy. Both training and evaluation datasets were corrupted by convolving with recorded room impulse responses (RIRs) whose T60 times ranges from 200ms to 1 second. Background noise was added as well: for training data, the SNR ranges from 5 to 25dB, while for evaluation data, the SNR ranges from 0 to 20dB. Every training utterance is distorted 2 times by using different RIRs, sources of background noise and SNRs. This results in a 600 hours noise-corrupted training set”, 4. Experiments), the noisy audio and the alternative noisy audio including different levels of noise (Serdyuk, 4. Experiments) and/or different sources of noise.
Claims 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US 2023/0368796) (“Liu”) in view of Serdyuk et al. (“Towards End-to-End Spoken Language Understanding”) (“Serdyuk”), further in view of Du et al. (“LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT”) (“Du”), and further in view of Catanzaro et al. (US 2017/0148433) (“Catanzaro”).
For claim 13, the combination of Liu, Serdyuk and Du fails to teach, wherein the first ground truth response is generated based on comparing the content of the user speech and a transcript of the user speech determined using an ASR engine.
However, Catanzaro discloses an end-to-end speech recognition system (Abstract), wherein an initial training set comprising shorter utterances and correct transcriptions is generated from a larger dataset by comparing a content of user speech (ground truth transcriptions stored with the data in the internal English and Mandarin datasets) with a transcript determined using an ASR engine (a transcription generated by a bidirectional RNN model trained with CTC, which aligns the transcription to audio frames) ([0162 – 0166]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve the invention disclosed by the combination of Liu, Serdyuk and Du in the same way that Catanzaro's invention has been improved to achieve the following predictable results for the purpose of generating training data which enables accurate and reliable natural language processing (Liu, [0002]): the training instances, including a first ground truth response (Liu, The training instances comprise ASR data which include ground truth transcriptions, [0086] [0089] [0092]) (Du, 3.3. Modified Language Model for Unifying Audio-Text Modeling, A.1 Basic Tasks, pg. 4, 5 and 15), are further generated from a larger dataset; and the first ground truth response is generated based on comparing the content of the user speech and a transcript of the user speech determined using an ASR engine.
For claim 14, Catanzaro further discloses, wherein the transcript of the user speech determined using the ASR engine is a mistranscription that is different from the content of the user speech (Catanzaro, The word-level edit distance between the transcript determined using the ASR engine and the content of the user speech, i.e., the ground truth transcriptions associated with the larger dataset, exceeds a threshold, [0166]).
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US 2023/0368796) (“Liu”) in view of Serdyuk et al. (“Towards End-to-End Spoken Language Understanding”) (“Serdyuk”), further in view of Du et al. (“LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT”) (“Du”), further in view of Shabat et al. (US 2024/0203404) (“Shabat”), and further in view of Jaber et al. (US 2024/0054342) (“Jaber”).
For claim 15, the combination of Liu, Serdyuk, and Du further discloses, wherein processing the first training instance input, using a pre-trained LLM, to generate the first training instance output comprises: processing the noisy audio data capturing the user speech to determine an audio embedding for the noisy audio data (Liu, [0085] [0092] [0093] [0100] [0101] [0104]) (Serdyuk, 3. End-to-End Spoken Language Understanding and 4. Experiments); processing the transcript of the user speech to determine a text embedding for the transcript (Liu, [0047] [0050] [0106] [0107]); and processing the text embedding, using a transformer encoder of the pre-trained generative model, to generate intermediate attention features (Liu, [0106] [0107]). Yet, the combination of Liu, Serdyuk and Du fails to teach the following: the transformer encoder comprises a multi-head attention mechanism; and the first training instance output is determined based on processing the intermediate attention features and the audio embedding, using a cross-attention mechanism of the pre-trained generative model.
However, Shabat discloses a system and method for processing speech (Abstract), comprising the following: an LLM-based NLU module (Fig.3A, 300; [0055]) generates output by processing the text encoder output and audio embeddings ([0055]) using a cross-attention mechanism of the pre-trained generative model (A fusion module comprises an attention module which projects both the text encodings and the audio encodings into a predefined dimension, [0056] [0062]).
Furthermore, Jaber discloses a system and method for processing input using a machine learning model (Abstract), wherein input (e.g., text) is processed by a transformer encoder (Fig.2B, 250) comprising a multi-headed attention section ([0052]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve the invention disclosed by the combination of Liu, Serdyuk, and Du in the same way that Shabat's invention has been improved to achieve the following predictable results for the purpose of improving intent recognition by leveraging the use of audio data in addition to textual data to generate an intent (Shabat, [0004 – 0007]): the SLU module (Liu, Fig.5, 240/340) further comprises functionality, including a cross-attention mechanism, to provide an output based on multimodal input comprising audio and text (Liu, the SLU model is jointly trained on multiple processing tasks, such as audio-to-text processing and text-to-NLU processing; therefore, the SLU model is capable of receiving both audio and text as inputs, [0017]); and the first training instance output is further determined based on processing the intermediate attention features (Liu, output of the transformer encoder of the text encoder; [0107]) and the audio embedding using the cross-attention mechanism.
Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve the invention disclosed by the combination of Liu, Serdyuk, Du and Shabat in the same way that Jaber's invention has been improved to achieve the following predictable results for the purpose of improving intent recognition by leveraging the use of audio data in addition to textual data to generate an intent, wherein using audio data allows for leveraging paralinguistics or compensating for low-quality ASR (Shabat, [0004 – 0007]): the transformer encoder further comprises a multi-head attention mechanism.
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US 2023/0368796) (“Liu”) in view of Serdyuk et al. (“Towards End-to-End Spoken Language Understanding”) (“Serdyuk”), further in view of Du et al. (“LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT”) (“Du”), and further in view of Carbune et al. (US 2023/0215422) (“Carbune”).
For claim 17, the combination of Liu, Serdyuk and Du fails to teach, wherein the first ground truth response indicates a first source of noise in the noisy audio data, and/or the second ground truth response indicates a second source of noise in the alternative noisy audio data.
However, Carbune discloses a system and method for the purpose of performing multimodal intent understanding (Abstract), wherein training data comprises indications of sources of noise in noisy audio data ([0006] [0038]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant's filing to improve the invention disclosed by the combination of Liu, Serdyuk and Du in the same way that Carbune's invention has been improved to achieve the following predictable results for the purpose of generating training data which enables accurate and reliable natural language processing (Liu, [0002]) (Carbune, [0009]): the first ground truth response indicates a first source of noise in the noisy audio data, and/or the second ground truth response indicates a second source of noise in the alternative noisy audio data.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SONIA L GAY whose telephone number is (571)270-1951. The examiner can normally be reached Monday-Friday 9-5 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SONIA L GAY/Primary Examiner, Art Unit 2657