DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 09/23/2025 have been fully considered but they are not persuasive.
Applicant argues that claim 11 is patentable over Steedman and Tomkins. As to independent claim 11, directed to a computer-implemented method, the Examiner has rejected this claim under 35 U.S.C. 103 over the combination of Steedman and Tomkins. Applicant's independent claim 11 recites:

11. (Original): A method for performing audio understanding tasks, the method comprising: pretraining an audio encoder using a fixed autoregressive language model and a fixed text embedder; receiving a prompt sequence comprising demonstrations of an audio understanding task followed by a new question; converting the prompt sequence into embeddings using the audio encoder and the fixed text embedder; and answering the new question using the embeddings by the fixed autoregressive language model.

The Office Action asserted that Steedman discloses the receiving and converting steps of claim 11; that Tomkins discloses the pretraining and answering steps of claim 11; and that it would have been obvious to one of ordinary skill in the art to combine the Steedman invention with the teachings of Tomkins for the benefit of achieving an improvement in question-answer retrieval precision.

Applicant argues that Steedman and Tomkins fail to disclose or suggest the step of claim 11 as originally filed of "pretraining an audio encoder using a fixed autoregressive language model and a fixed text embedder." Applicant contends that the Office Action alleged that Tomkins discloses the pretraining step but failed to provide any analysis of how Tomkins allegedly uses a fixed autoregressive language model and a fixed text embedder to train the audio encoder, and that the Office Action takes no position as to where Tomkins discloses that, during the pretraining, the autoregressive language model and the text embedder are fixed while the audio encoder is pretrained (i.e., the audio encoder is updated).
Applicant further argues that the discussion of Tomkins on pages 6 and 7 of the Office Action does not address these aspects of the autoregressive language model and the text embedder being fixed while the audio encoder is updated during pretraining, and thus submits that the rejection of claim 11 is prima facie deficient and should be reversed.

The Examiner disagrees. Applicant's claims recite pretraining using a fixed autoregressive model. Applicant's specification states in paragraph [0030] that "...reference to FIG. 1. In step 11, an audio encoder is pretrained on audio demonstration tasks using a fixed, pretrained autoregressive language model and a fixed, pretrained text embedder. Namely, the autoregressive language model and the text embedder are kept fixed, and only updates to the audio encoder are made during the pretraining in step 11. According to an exemplary embodiment, the autoregressive language model is a general-purpose learner containing the text embedder such as generative pre-trained Transformer 2 (GPT-2) which is a neural network machine learning model trained using internet data that translates text, answers questions, summarizes passages, and generates text output. By fixed, it is meant for example that the weights of the autoregressive language model are kept constant, i.e., fixed..."

Steedman teaches in paragraph [0146] that "The use of two headed attention improves the model's ability to focus on different positions compared to single headed attention, whilst still being relatively quick and efficient to train, and using less parameters to obtain similar results than an 8-headed attention for example. The two-headed self-attention layer 613 has a projection dimension of 64 and a concatenated embedding dimension of 2D=1024. Including the two headed self-attention layer 613 increases the ability to incorporate a representation of other units in the sequence into the encoding of the current unit. The use of two headed self-attention improves the model's ability to focus on different positions and to capture the relationships between a subword and another based on its position in the sequence. Two query weight matrices, two key weight matrices and two value weight matrices are used, each being randomly initialized and learned during training." Thus, Steedman teaches fixed weights as defined by Applicant's specification. Additionally, Steedman teaches in paragraph [0135] that various methods of generating the positional encoding vectors may be used; for example, the positional encodings may be learned as parameters of the second model, or fixed (for example, each element may be some function of the position of the unit in the sequence).
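For illustration only (forming no part of the record, and not drawn from either reference), the "fixed" arrangement described in Applicant's paragraph [0030] can be sketched as a training step in which the language-model and text-embedder weights are held constant while only the audio-encoder weights receive gradient updates; all names and values below are hypothetical:

```python
# Illustrative sketch only: a toy gradient step in which the language model
# and text embedder are "fixed" (weights never updated) while the audio
# encoder is trainable. All names and values are hypothetical.

def pretrain_step(audio_encoder_w, lm_w, embedder_w, grad, lr=0.1):
    """Update only the audio encoder; the LM and embedder stay fixed."""
    updated_encoder = [w - lr * g for w, g in zip(audio_encoder_w, grad)]
    # lm_w and embedder_w are returned unchanged: their weights are constant.
    return updated_encoder, lm_w, embedder_w

encoder = [0.5, -0.2]
lm = [1.0, 2.0]          # fixed autoregressive language model weights
embedder = [0.3]         # fixed text embedder weights

encoder, lm_after, emb_after = pretrain_step(encoder, lm, embedder, grad=[0.1, -0.4])
assert lm_after == [1.0, 2.0] and emb_after == [0.3]   # unchanged ("fixed")
assert encoder != [0.5, -0.2]                          # only the encoder moved
```

The dispute above turns on whether the cited references show this asymmetry: one component updated during pretraining while the others are held constant.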
Applicant argues that claim 19 is patentable over Steedman and Tomkins. Independent claim 19 recites a physical component configured to cause the method of claim 11 to be performed; thus, Applicant argues, claim 19 is patentable over the cited references for essentially the same reasons discussed above regarding claim 11. For the reasons stated above, the Examiner maintains that the claims are still taught by Steedman in view of Tomkins.
With regard to dependent claim 12: Claim 12 depends from claim 11 and specifies that the method further includes keeping weights of the fixed autoregressive language model, the fixed text embedder, and the audio encoder constant following the pretraining. Applicant argues that the Office Action rejected claim 12 (see pages 7-8 of the Office Action) by referring to paragraph [0224] of Tomkins, which discloses "text summarization models to use based on a preference weight, where the preference weight may be a binary value, categorical value, or categorical value," and that paragraph [0224] of Tomkins does not disclose that these preference weights are kept constant following a pretraining. Thus, Applicant submits that the rejection of dependent claim 12 is also prima facie deficient.

The Examiner disagrees. Tomkins paragraph [0224] teaches retrieving one or more stored configurations or versions of a text summarization model, where the stored configurations may include different sets of neural network parameters. For example, some embodiments may retrieve two or more text summarization models for generating a sequence of n-grams based on a user being associated with two different domains or domain class values. Some embodiments may then select which of the text summarization models to use based on a preference weight, where the preference weight may be a binary value, categorical value, or categorical value. For example, some embodiments may retrieve a first set of neural network parameters for a text summarization model in response to a determination that a first user is associated with a first domain category value. Additionally, some embodiments may then retrieve a second set of neural network parameters for the text summarization model in response to a determination that a second user is associated with a second domain category value.
Additionally, paragraph [0264] teaches that, for example, some embodiments may train a text generation model that was initialized with a pre-trained model, where some embodiments may then perform a reduced-scope training operation for specific text generation tasks such as text summarization or query generation.
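For illustration only (forming no part of the record), the parameter-set retrieval cited from Tomkins paragraph [0224] can be pictured as a simple lookup keyed on a user's domain category value; the configurations and values below are hypothetical:

```python
# Illustrative sketch of selecting stored neural-network parameter sets for a
# text summarization model based on a user's domain category value, in the
# manner described in Tomkins par. [0224]. All values are hypothetical.

STORED_CONFIGS = {
    "finance": {"params": "param_set_A"},
    "medical": {"params": "param_set_B"},
}

def select_model_params(user_domain, preference_weight=True):
    # The preference weight may be a binary or categorical value that gates
    # which stored configuration is retrieved.
    if not preference_weight:
        return STORED_CONFIGS["finance"]["params"]  # hypothetical default
    return STORED_CONFIGS[user_domain]["params"]

assert select_model_params("finance") == "param_set_A"
assert select_model_params("medical") == "param_set_B"
```

Note that in this sketch the stored parameter sets themselves are not modified by the selection; selection among fixed stored configurations is the behavior the Examiner relies on.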
A new search was made, and art to Chen was found which reads on independent claims 11 and 19.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 2, 11, 12 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Steedman (U.S. PAP 2021/0141798 A1) in view of Tomkins (U.S. PAP 2021/0295822 A1).
Regarding claim 1, Steedman teaches a system for performing audio understanding tasks (training response retrieval systems to provide a response to a query inputted by a user, see par. [0002]), the system comprising:
a fixed text embedder for, on receipt of a prompt sequence comprising demonstrations of an audio understanding task followed by a new question, converting the prompt sequence into text embeddings (receiving a user inputted query; representing the user inputted query as a sequence of embedding vectors using a first model; encoding the sequence of embedding vectors to produce a context vector using a second model; see pars. [0022]-[0024]);
a pretrained audio encoder for converting the prompt sequence into audio embeddings (the second model has been trained using corresponding queries and responses such that an encoding is used that maximizes the similarity between the response vector and context vector, see par. [0029]; if the input is in the form of audio, an automatic speech recognition model may be included to convert the input audio to text, see par. [0059]).
However, Steedman does not teach a fixed autoregressive language model for answering the new question using the text embeddings and the audio embeddings.
In the same field of endeavor, Tomkins teaches search systems for retrieving information (see par. [0036]). Some embodiments may use inputs provided by a user to perform semantic searches. In some embodiments, using one or more of the operations described in this disclosure may provide search results that match or exceed other language models in general language tasks. For example, some embodiments may achieve 85-95% precision when tested using the SQUAD 1.1 dataset or Quora duplicate questions dataset. Some embodiments may surpass other language models when used to perform searches in domain-specific tasks. For example, some embodiments may achieve a 10 to 100% improvement in question-answer retrieval precision in a role-specific domain (see par. [0050]). Some embodiments may determine a set of embedding vectors of a natural language document using a transformer model or other neural network model. One or more models may be used to generate embedding vectors for words or other n-grams, such as BERT, XLNet, GPT, or the like. For example, some embodiments may use XLNet or another autoregressive transformer model to generate word embeddings (see par. [0201]). Some embodiments may use one or more indices to obtain or process information based on a query. In many cases, the query posed by a user may be provided in a form different from that used by a document storing an answer to the query. For example, a query of a user may be written in the natural language form (see par. [0242]).
It would have been obvious to one of ordinary skill in the art to combine the Steedman invention with the teachings of Tomkins for the benefit of achieving an improvement in question-answer retrieval precision (see par. [0050]).
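For illustration only (forming no part of the record), the similarity-maximizing encoding relied on from Steedman par. [0029] amounts to nearest-neighbor retrieval in an embedding space: the stored response whose vector is most similar to the query's context vector is returned. A minimal sketch with hypothetical vectors:

```python
import math

# Illustrative sketch: retrieve the stored response whose vector is most
# similar (by cosine similarity) to the query's context vector, as in a
# response retrieval system trained to maximize context/response similarity.
# All vectors here are hypothetical.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

responses = {
    "response_1": [1.0, 0.0, 0.2],
    "response_2": [0.1, 0.9, 0.3],
}

context_vector = [0.2, 1.0, 0.4]        # produced by encoding the user query
best = max(responses, key=lambda r: cosine(context_vector, responses[r]))
assert best == "response_2"             # the highest-similarity response wins
```

Retrieval precision in such a system depends on how well the trained encoders place matching queries and responses near each other in this space.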
Regarding claim 2, Steedman teaches the system of claim 1, wherein the fixed autoregressive language model answers the new question in a form specified in the demonstrations (the input provided is in the form of text or audio, and the output is provided to the user in the form of text or audio, see par. [0059]).
Regarding claim 11, Steedman teaches a method for performing audio understanding tasks, the method comprising:
receiving a prompt sequence comprising demonstrations of an audio understanding task followed by a new question (receiving a user inputted query; representing the user inputted query as a sequence of embedding vectors using a first model; encoding the sequence of embedding vectors to produce a context vector using a second model; see pars. [0022]-[0024]);
converting the prompt sequence into embeddings using the audio encoder and the fixed text embedder (the second model has been trained using corresponding queries and responses such that an encoding is used that maximizes the similarity between the response vector and context vector, see par. [0029]; if the input is in the form of audio, an automatic speech recognition model may be included to convert the input audio to text, see par. [0059]);
However, Steedman does not teach pretraining an audio encoder using a fixed autoregressive language model and a fixed text embedder, nor answering the new question using the embeddings by the fixed autoregressive language model.
In the same field of endeavor, Tomkins teaches search systems for retrieving information (see par. [0036]). Some embodiments may use inputs provided by a user to perform semantic searches. In some embodiments, using one or more of the operations described in this disclosure may provide search results that match or exceed other language models in general language tasks. For example, some embodiments may achieve 85-95% precision when tested using the SQUAD 1.1 dataset or Quora duplicate questions dataset. Some embodiments may surpass other language models when used to perform searches in domain-specific tasks. For example, some embodiments may achieve a 10 to 100% improvement in question-answer retrieval precision in a role-specific domain (see par. [0050]). Some embodiments may determine a set of embedding vectors of a natural language document using a transformer model or other neural network model. One or more models may be used to generate embedding vectors for words or other n-grams, such as BERT, XLNet, GPT, or the like. For example, some embodiments may use XLNet or another autoregressive transformer model to generate word embeddings (see par. [0201]). Some embodiments may use one or more indices to obtain or process information based on a query. In many cases, the query posed by a user may be provided in a form different from that used by a document storing an answer to the query. For example, a query of a user may be written in the natural language form (see par. [0242]).
It would have been obvious to one of ordinary skill in the art to combine the Steedman invention with the teachings of Tomkins for the benefit of achieving an improvement in question-answer retrieval precision (see par. [0050]).
Regarding claim 12, Tomkins teaches the method of claim 11, further comprising: keeping weights of the fixed autoregressive language model, the fixed text embedder, and the audio encoder constant following the pretraining (text summarization models to use based on a preference weight, where the preference weight may be a binary value, categorical value, or categorical value; see par. [0224]).
Regarding claim 19, Steedman teaches a computer program product for performing audio understanding tasks, the computer program product comprising a computer readable storage medium having program instructions embodied therewith (a computer program product, see par. [0063]), the program instructions executable by a computer to cause the computer to perform:
receiving a prompt sequence comprising demonstrations of an audio understanding task followed by a new question (receiving a user inputted query; representing the user inputted query as a sequence of embedding vectors using a first model; encoding the sequence of embedding vectors to produce a context vector using a second model; see pars. [0022]-[0024]);
converting the prompt sequence into embeddings using the audio encoder and the fixed text embedder (the second model has been trained using corresponding queries and responses such that an encoding is used that maximizes the similarity between the response vector and context vector, see par. [0029]; if the input is in the form of audio, an automatic speech recognition model may be included to convert the input audio to text, see par. [0059]);
However, Steedman does not teach pretraining an audio encoder using a fixed autoregressive language model and a fixed text embedder, nor answering the new question using the embeddings by the fixed autoregressive language model.
In the same field of endeavor, Tomkins teaches search systems for retrieving information (see par. [0036]). Some embodiments may use inputs provided by a user to perform semantic searches. In some embodiments, using one or more of the operations described in this disclosure may provide search results that match or exceed other language models in general language tasks. For example, some embodiments may achieve 85-95% precision when tested using the SQUAD 1.1 dataset or Quora duplicate questions dataset. Some embodiments may surpass other language models when used to perform searches in domain-specific tasks. For example, some embodiments may achieve a 10 to 100% improvement in question-answer retrieval precision in a role-specific domain (see par. [0050]). Some embodiments may determine a set of embedding vectors of a natural language document using a transformer model or other neural network model. One or more models may be used to generate embedding vectors for words or other n-grams, such as BERT, XLNet, GPT, or the like. For example, some embodiments may use XLNet or another autoregressive transformer model to generate word embeddings (see par. [0201]). Some embodiments may use one or more indices to obtain or process information based on a query. In many cases, the query posed by a user may be provided in a form different from that used by a document storing an answer to the query. For example, a query of a user may be written in the natural language form (see par. [0242]).
It would have been obvious to one of ordinary skill in the art to combine the Steedman invention with the teachings of Tomkins for the benefit of achieving an improvement in question-answer retrieval precision (see par. [0050]).
Claim(s) 3-10, 13-18 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Steedman (U.S. PAP 2021/0141798 A1) in view of Tomkins (U.S. PAP 2021/0295822 A1), further in view of Perez (U.S. PAP 2018/0137854 A1).
Regarding claim 3, Steedman in view of Tomkins does not teach the system of claim 2, wherein the demonstrations are in the form of triplets comprising: an audio utterance; a text prompt; and a text answer.
In a similar field of endeavor, Perez teaches a method for training a dialog state tracking system that includes providing a set of triples, each of the triples including a dialog subpart and a ground truth answer to a natural language question, the ground truth answer having been provided by an annotator based on the dialog subpart, each of the questions being related to a respective one of a plurality of slots of a dialog state tracker. A representation generator is provided, in memory, for generating representations of the dialog subparts and questions. Iteratively, with a processor, a representation of the dialog subpart and a representation of the question of at least one of the triples are input to a memory end-to-end neural network model. A predicted answer is output from the model based on the dialog subpart and question. Parameters of the model are updated to reduce a computed error between the predicted answer and the ground truth answer for the at least one of the triples (see par. [0018]).
It would have been obvious to one of ordinary skill in the art to combine Steedman in view of Tomkins with the teachings of Perez for the benefit of reducing a computed error between the predicted answer and the ground truth answer for the at least one of the triples (see par. [0018]).
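For illustration only (forming no part of the record), the Perez training procedure of par. [0018] (predicting an answer from a dialog subpart and question, then updating parameters to reduce the computed error against the ground truth answer) can be sketched as a toy loop; the one-parameter model below is a hypothetical stand-in, not Perez's actual neural network:

```python
# Illustrative sketch of training on triples, in the manner Perez par. [0018]
# describes: predict an answer from (dialog subpart, question), then update
# parameters to reduce the error against the ground-truth answer.
# The one-parameter "model" below is hypothetical.

triples = [
    ("user asks about trains", "what transport?", 1.0),
    ("user asks about hotels", "what lodging?",  3.0),
]

weight = 0.0   # single toy model parameter
lr = 0.25      # learning rate

for _ in range(50):
    for _subpart, _question, truth in triples:
        predicted = weight * 1.0            # toy forward pass
        error = predicted - truth           # computed error vs. ground truth
        weight -= lr * error                # parameter update reduces the error

# After training, predictions settle near the ground-truth answers' mean.
assert abs(weight - 2.0) < 0.5
```

The update step is the "parameters of the model are updated to reduce a computed error" language relied on in the motivation to combine.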
Regarding claim 4, Perez teaches the system of claim 3, wherein the audio utterance comprises speech (utterances; see par. [0062]).
Regarding claim 5, Perez teaches the system of claim 3, wherein the audio utterance comprises non-speech (training is performed with 10% random noise; see par. [0100]).
Regarding claim 6, Perez teaches the system of claim 3, wherein the new question comprises: a new audio utterance (receive next utterance; see par. [0049]); and a new text prompt (dialog segment is processed to generate text sequence; see par. [0050]), and wherein the new question is missing a new text answer (receiving a predicted answer 76 or distribution over answers from the model 60 for each input question (S126); see par. [0051]).
Regarding claim 7, Steedman in view of Tomkins teaches the system of claim 6, wherein the fixed text embedder converts the text prompt and the text answer into text embeddings of the demonstrations, and the new text prompt into a text embedding of the new question, which are provided to the fixed autoregressive language model (receiving a user inputted query; representing the user inputted query as a sequence of embedding vectors using a first model; encoding the sequence of embedding vectors to produce a context vector using a second model; see pars. [0022]-[0024]), and wherein the pretrained audio encoder converts the audio utterance into audio embeddings of the demonstrations (the second model has been trained using corresponding queries and responses such that an encoding is used that maximizes the similarity between the response vector and context vector, see par. [0029]; if the input is in the form of audio, an automatic speech recognition model may be included to convert the input audio to text, see par. [0059]), and the new audio utterance into an audio embedding of the new question, which are provided to the fixed autoregressive language model (determine a set of embedding vectors of a natural language document using a transformer model or other neural network model; one or more models may be used to generate embedding vectors for words or other n-grams, such as BERT, XLNet, GPT, or the like; for example, some embodiments may use XLNet or another autoregressive transformer model to generate word embeddings, see Tomkins par. [0201]).
Regarding claim 8, Perez teaches the system of claim 7, wherein the fixed autoregressive language model fills in a gap at an end of a sentence based on a content of the new audio utterance (the variables correspond to the slots to be filled by the belief update component; see par. [0030]).
Regarding claim 9, Perez teaches the system of claim 1, wherein the prompt sequence comprises 10 or less of the demonstrations (some dialogs include a sequence of more than two utterances, such as three, four, five, or more utterances; see par. [0027]).
Regarding claim 10, Perez teaches the system of claim 1, wherein the prompt sequence comprises from 0 to 10 of the demonstrations (some dialogs include a sequence of more than two utterances, such as three, four, five, or more utterances; see par. [0027]).
Regarding claim 13, Steedman in view of Tomkins does not teach the method of claim 11, wherein the new question is answered using a form specified in the demonstrations, and wherein the demonstrations are in the form of triplets comprising: an audio utterance; a text prompt; and a text answer.
In a similar field of endeavor, Perez teaches a method for training a dialog state tracking system that includes providing a set of triples, each of the triples including a dialog subpart and a ground truth answer to a natural language question, the ground truth answer having been provided by an annotator based on the dialog subpart, each of the questions being related to a respective one of a plurality of slots of a dialog state tracker. A representation generator is provided, in memory, for generating representations of the dialog subparts and questions. Iteratively, with a processor, a representation of the dialog subpart and a representation of the question of at least one of the triples are input to a memory end-to-end neural network model. A predicted answer is output from the model based on the dialog subpart and question. Parameters of the model are updated to reduce a computed error between the predicted answer and the ground truth answer for the at least one of the triples (see par. [0018]).
It would have been obvious to one of ordinary skill in the art to combine Steedman in view of Tomkins with the teachings of Perez for the benefit of reducing a computed error between the predicted answer and the ground truth answer for the at least one of the triples (see par. [0018]).
Regarding claim 14, Perez teaches the method of claim 13, wherein the audio utterance comprises speech (utterances; see par. [0062]).
Regarding claim 15, Perez teaches the method of claim 13, wherein the audio utterance comprises non-speech (training is performed with 10% random noise; see par. [0100]).
Regarding claim 16, Perez teaches the method of claim 13, wherein the new question comprises: a new audio utterance (receive next utterance; see par. [0049]); and a new text prompt (dialog segment is processed to generate text sequence; see par. [0050]), and wherein the new question is missing a new text answer (receiving a predicted answer 76 or distribution over answers from the model 60 for each input question (S126); see par. [0051]).
Regarding claim 17, Steedman in view of Tomkins teaches the method of claim 16, wherein the fixed text embedder converts the text prompt and the text answer into text embeddings of the demonstrations, and the new text prompt into a text embedding of the new question, which are provided to the fixed autoregressive language model (receiving a user inputted query; representing the user inputted query as a sequence of embedding vectors using a first model; encoding the sequence of embedding vectors to produce a context vector using a second model; see pars. [0022]-[0024]), and wherein the pretrained audio encoder converts the audio utterance into audio embeddings of the demonstrations (the second model has been trained using corresponding queries and responses such that an encoding is used that maximizes the similarity between the response vector and context vector, see par. [0029]; if the input is in the form of audio, an automatic speech recognition model may be included to convert the input audio to text, see par. [0059]), and the new audio utterance into an audio embedding of the new question, which are provided to the fixed autoregressive language model (determine a set of embedding vectors of a natural language document using a transformer model or other neural network model; one or more models may be used to generate embedding vectors for words or other n-grams, such as BERT, XLNet, GPT, or the like; for example, some embodiments may use XLNet or another autoregressive transformer model to generate word embeddings, see Tomkins par. [0201]).
Regarding claim 18, Perez teaches the method of claim 11, wherein the prompt sequence comprises from 0 to 10 of the demonstrations (some dialogs include a sequence of more than two utterances, such as three, four, five, or more utterances; see par. [0027]).
Regarding claim 20, Steedman in view of Tomkins teaches wherein the fixed text embedder converts the text prompt and the text answer into text embeddings of the demonstrations, and the new text prompt into a text embedding of the new question, which are provided to the fixed autoregressive language model (receiving a user inputted query; representing the user inputted query as a sequence of embedding vectors using a first model; encoding the sequence of embedding vectors to produce a context vector using a second model; see pars. [0022]-[0024]), and wherein the pretrained audio encoder converts the audio utterance into audio embeddings of the demonstrations (the second model has been trained using corresponding queries and responses such that an encoding is used that maximizes the similarity between the response vector and context vector, see par. [0029]; if the input is in the form of audio, an automatic speech recognition model may be included to convert the input audio to text, see par. [0059]), and the new audio utterance into an audio embedding of the new question, which are provided to the fixed autoregressive language model (determine a set of embedding vectors of a natural language document using a transformer model or other neural network model; one or more models may be used to generate embedding vectors for words or other n-grams, such as BERT, XLNet, GPT, or the like; for example, some embodiments may use XLNet or another autoregressive transformer model to generate word embeddings, see Tomkins par. [0201]).
However, Steedman in view of Tomkins does not teach the computer program product of claim 19, wherein the demonstrations are in a form of triplets comprising an audio utterance, a text prompt, and a text answer, wherein the new question comprises a new audio utterance, and a new text prompt.
In a similar field of endeavor, Perez teaches a method for training a dialog state tracking system that includes providing a set of triples, each of the triples including a dialog subpart and a ground truth answer to a natural language question, the ground truth answer having been provided by an annotator based on the dialog subpart, each of the questions being related to a respective one of a plurality of slots of a dialog state tracker. A representation generator is provided, in memory, for generating representations of the dialog subparts and questions. Iteratively, with a processor, a representation of the dialog subpart and a representation of the question of at least one of the triples are input to a memory end-to-end neural network model. A predicted answer is output from the model based on the dialog subpart and question. Parameters of the model are updated to reduce a computed error between the predicted answer and the ground truth answer for the at least one of the triples (see par. [0018]). Perez further teaches a new audio utterance (receive next utterance; see par. [0049]); and a new text prompt (dialog segment is processed to generate text sequence; see par. [0050]), and wherein the new question is missing a new text answer (receiving a predicted answer 76 or distribution over answers from the model 60 for each input question (S126); see par. [0051]).
It would have been obvious to one of ordinary skill in the art to combine Steedman in view of Tomkins with the teachings of Perez for the benefit of reducing a computed error between the predicted answer and the ground truth answer for the at least one of the triples (see par. [0018]).
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claim(s) 11 and 19 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Chen (U.S. PAP 2021/0144251 A1).
Regarding claim 11, Chen teaches a method for performing audio understanding tasks (methods for smart dialogue communication, see abstract), the method comprising:
pretraining an audio encoder using a fixed autoregressive language model and a fixed text embedder (language models may include a generalized autoregressive pretraining for language understanding (XLNet) model, a statistical language model (e.g., N-gram), a neural network language model, a recurrent neural network language model, a neural probabilistic language model, etc.; see par. [0098]);
receiving a prompt sequence comprising demonstrations of an audio understanding task (the plurality of training samples may include a plurality of sample text messages and sample text features of the sample text messages, see par. [0098]) followed by a new question (the processing device 112 (e.g., the smart dialogue communication module 404) may determine question information associated with the text features by matching the text features in a question knowledge database, see par. [0099]);
converting the prompt sequence into embeddings using the audio encoder and the fixed text embedder (in some embodiments, the plurality of sample text messages may be inputted into the preliminary model to determine actual output(s). The sample text features of the sample text messages may be determined as desired output(s). The processing device 112 or any other processing device (e.g., an external processing device of the smart dialogue communication system 100) may compare the actual output(s) with the desired output(s) to determine loss function value(s). The loss function value(s) may measure difference(s) between the actual output(s) and the desired output(s). In the training process of the preliminary model, the plurality of parameters may be adjusted to minimize the loss function value(s), see par. [0098]); and
answering the new question using the embeddings by the fixed autoregressive language model (In 609, the processing device 112 (e.g., the smart dialogue communication module 404) may obtain answer information corresponding to the question information by matching the question information in an answer knowledge database, see par. [0100]).
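For context on the "fixed" limitation recited in claim 11 and disputed in Applicant's arguments, the claimed arrangement — an audio encoder whose parameters are updated during pretraining while the autoregressive language model and the text embedder are held fixed (not updated) — can be sketched as follows. All names and values below are hypothetical illustrations, drawn neither from Chen nor from Applicant's specification.

```python
class Component:
    """Toy model component: a parameter vector that may be frozen (fixed)."""
    def __init__(self, n, frozen):
        self.params = [0.0] * n
        self.frozen = frozen

    def update(self, grads, lr=0.1):
        if self.frozen:          # fixed components ignore gradient updates
            return
        self.params = [p - lr * g for p, g in zip(self.params, grads)]

audio_encoder = Component(3, frozen=False)   # pretrained (parameters updated)
text_embedder = Component(3, frozen=True)    # fixed
language_model = Component(3, frozen=True)   # fixed

for _ in range(10):                          # dummy pretraining loop
    grads = [1.0, 1.0, 1.0]                  # placeholder gradients
    for comp in (audio_encoder, text_embedder, language_model):
        comp.update(grads)
```

After this loop, only the audio encoder's parameters have changed; the language model and text embedder remain exactly as initialized, which is the distinction the claim language draws.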
Regarding claim 19, Chen teaches a computer program product for performing audio understanding tasks, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform (a non-transitory computer-readable medium, comprising at least one set of instructions compatible for smart dialogue communication, see par. [0017]):
pretraining an audio encoder using a fixed autoregressive language model and a fixed text embedder (language models may include a generalized autoregressive pretraining for language understanding (XLNet) model, a statistical language model (e.g., N-gram), a neural network language model, a recurrent neural network language model, a neural probabilistic language model, etc., see par. [0098]);
receiving a prompt sequence comprising demonstrations of an audio understanding task (the plurality of training samples may include a plurality of sample text messages and sample text features of the sample text messages, see par. [0098]) followed by a new question (the processing device 112 (e.g., the smart dialogue communication module 404) may determine question information associated with the text features by matching the text features in a question knowledge database, see par. [0099]);
converting the prompt sequence into embeddings using the audio encoder and the fixed text embedder (in some embodiments, the plurality of sample text messages may be inputted into the preliminary model to determine actual output(s). The sample text features of the sample text messages may be determined as desired output(s). The processing device 112 or any other processing device (e.g., an external processing device of the smart dialogue communication system 100) may compare the actual output(s) with the desired output(s) to determine loss function value(s). The loss function value(s) may measure difference(s) between the actual output(s) and the desired output(s). In the training process of the preliminary model, the plurality of parameters may be adjusted to minimize the loss function value(s), see par. [0098]); and
answering the new question using the embeddings by the fixed autoregressive language model (In 609, the processing device 112 (e.g., the smart dialogue communication module 404) may obtain answer information corresponding to the question information by matching the question information in an answer knowledge database, see par. [0100]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Pertinent prior art is listed on form PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Ortiz-Sanchez whose telephone number is (571)270-3711. The examiner can normally be reached Monday- Friday 9AM-6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL ORTIZ-SANCHEZ/Primary Examiner, Art Unit 2656