DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3-6, 13, and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US PGPUB #2020/0293826) in view of Wu et al. (US PGPUB #2024/0144007, hereinafter ‘Wu-US’).
Regarding Claim 1, Liu discloses a method for training an audio encoder (Figs. 1-12; ¶0066), the method comprising:
receiving first training data comprising first audio data (Liu ¶0005 discloses receiving a first type of nontext input. ¶0006 discloses the first type of nontext input is audio);
performing a first training task on an audio encoder using the first training data (Liu ¶0005 discloses encoding the first type of nontext input using a first autoencoder having a first convolutional neural network. ¶0066 discloses thus, an audio input that was received from the microphone can be encoded by an autoencoder. ¶0033 discloses the audio module uses audio signals as training inputs; Fig. 1);
receiving second training data comprising first image data and second audio data (Liu ¶0005 discloses receiving a second type of nontext input. ¶0006 discloses the second type of nontext input is an image);
performing a second training task on the audio encoder using the second training data (Liu ¶0005 discloses encoding the second type of nontext input using a second autoencoder having a second convolutional neural network. ¶0066 discloses an image input that was received from the camera can be encoded by another autoencoder. ¶0033 discloses the image module uses images as training inputs; Fig. 1).
Liu may not explicitly disclose receiving third training data comprising first text data and third audio data; performing a third training task on the audio encoder using the third training data; and performing at least one downstream task using the audio encoder.
However, Wu-US (title, abstract, Figs. 1-7) teaches receiving third training data comprising first text data and third audio data (Wu-US ¶0046 discloses the first modality [third] can be any of the following modalities: an image, text, a video, audio, and the second modality may comprise a further one of the modalities, such as audio; Fig. 2: 220);
performing a third training task on the audio encoder using the third training data (Wu-US ¶0047 discloses the encoder for the first modality [i.e., third, text] and the encoder for the second modality can be selected as machine learning models or neural networks suitable for processing data of the corresponding modalities. For example, in cross-modal learning for images and text, Res50 encoder may be selected as an encoder for an image modality, while BERT-Base encoder may be selected as an encoder for a text modality); and
performing at least one downstream task using the audio encoder (Wu-US Fig. 2: 240 and Fig. 4B: 403; ¶0063-¶0071).
Liu and Wu-US are analogous art as they both pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the learning devices of Liu with the pre-training method of Wu-US in order to significantly improve the performance of the encoder without increasing training overhead during downstream fine-tuning and subsequent model application overhead (Wu-US, ¶0065), so that the training time of the encoder for the modality A will not increase (Wu-US, ¶0065).
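Examiner's note: for clarity of the record, the overall flow recited in claim 1 (three training tasks performed on a single audio encoder, followed by at least one downstream task) may be visualized as follows. The sketch is illustrative only and is not taken from Liu or Wu-US; the encoder, the toy data batches, and the per-task losses are hypothetical placeholders.

```python
# Illustrative sketch only; not from Liu or Wu-US. The encoder, the toy
# data batches, and the per-task losses are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 100, 128))
head = nn.Linear(128, 10)  # classification head for the first task
                           # (head weights are left fixed for brevity)

def run_task(batches, loss_fn):
    """One training task performed on the audio encoder."""
    opt = torch.optim.Adam(audio_encoder.parameters(), lr=1e-4)
    for batch in batches:
        loss = loss_fn(batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

spec = lambda: torch.randn(8, 1, 64, 100)            # toy spectrogram batch
first_data = [(spec(), torch.randint(0, 10, (8,)))]  # first audio data + labels
second_data = [(spec(), torch.randn(8, 128))]        # second audio data + image embeddings
third_data = [(spec(), torch.randn(8, 128))]         # third audio data + text embeddings

# First training task: supervised classification on labeled audio.
run_task(first_data, lambda b: F.cross_entropy(head(audio_encoder(b[0])), b[1]))
# Second training task: align audio embeddings with image-encoder targets.
run_task(second_data, lambda b: F.mse_loss(audio_encoder(b[0]), b[1]))
# Third training task: align audio embeddings with text-encoder targets.
run_task(third_data, lambda b: F.mse_loss(audio_encoder(b[0]), b[1]))
# At least one downstream task then uses the trained audio encoder.
embeddings = audio_encoder(spec())
```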
Regarding Claim 3, Liu in view of Wu-US discloses the method of claim 1, but Liu may not explicitly disclose wherein the second training task includes training the audio encoder by transferring knowledge from a pre-trained image encoder onto the audio encoder.
However, Wu-US (title, abstract, Figs. 1-7) teaches wherein the second training task includes training the audio encoder by transferring knowledge from a pre-trained image encoder onto the audio encoder (Wu-US Fig. 4A: cross-modal contrastive learning 401 and fine-tuning of a single modality downstream task 402; Fig. 4B: cross-modal contrastive learning 401 and fine-tuning of a cross-modal downstream task 403. ¶0065 discloses the enhanced encoder in the modality B [e.g., image] can significantly improve feature learning of an encoder in a further modality A [e.g., audio], so that the encoder for the modality A [i.e., by transferring knowledge from image encoder] can obtain a better starting point for fine-tuning in a downstream task. Compared with a conventional scheme, model capacity of the encoder for the modality A may be the same, however parameter values are adjusted to a better extent in the pre-training stage. Therefore, if fine-tuning for the downstream task is required, training time of the encoder for the modality A will not increase. In addition, the encoder for the modality A can provide a better predictive capability even when directly used for the downstream task. Such a pre-training method can significantly improve the performance of the encoder without increasing training overhead during downstream fine-tuning and subsequent model application overhead. ¶0071 discloses the enhanced encoder in the modality B can significantly improve feature learning of an encoder in a further modality A, so that the learning of a further encoder for the modality B in the downstream can be further guided by a better modality A in the training process of the cross-modal downstream task, thus improving efficiency of the whole cross-modal learning and model performance. In addition, since the modality A does not participate in the gradient backpropagation during downstream training, the downstream training will take less time. Overall, such a model training scheme can improve training effectiveness and model performance while reducing downstream training time).
Liu and Wu-US are analogous art as they both pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the learning devices of Liu with the teachings of Wu-US because, since the modality A does not participate in the gradient backpropagation during downstream training, the downstream training will take less time (Wu-US, ¶0071), thus improving the efficiency of the whole cross-modal learning and model performance (Wu-US, ¶0071).
Regarding Claim 4, Liu in view of Wu-US discloses the method of claim 3, but Liu may not explicitly disclose wherein transferring knowledge from the pre-trained image encoder onto the audio encoder includes using contrastive learning with the second training data.
However, Wu-US (title, abstract, Figs. 1-7) teaches wherein transferring knowledge from the pre-trained image encoder onto the audio encoder includes using contrastive learning with the second training data (Wu-US ¶0066 discloses in some embodiments, the downstream task may comprise a cross-modal downstream task for the first modality and the second modality. The execution of the cross-modal downstream task requires encoders for two modalities. In some embodiments, instead of directly using the pre-trained first encoder and the third encoder, the second encoder with a lower model capacity is chosen for the second modality. In other words, suppose that it is known that contrastive learning is to be carried out for the cross-modal downstream task, and it has been determined that the first encoder for the first modality and the second encoder for the second modality are to be chosen for the cross-modal downstream task. Fig. 4B illustrates a schematic diagram of a contrastive learning architecture 425 according to some further embodiments of the present disclosure. As shown in Fig. 4B, after the pre-training stage, a second contrastive learning model used for the cross-modal downstream task comprises the encoder 410 and an encoder 412. Correspondingly, in order to improve performance, the encoder 412 [e.g., audio] may be chosen to be enhanced to the encoder 420 [e.g., image] in the pretraining stage. That is, the encoder 420 will have a larger model capacity than the encoder 412. In the cross-modal contrastive learning 401 stage, pre-training is conducted on the encoder 410 and the encoder 420 using the training dataset. ¶0067 discloses afterwards, the second contrastive learning model may be continuously constructed, which comprises the pre-trained first encoder and the second encoder, that is, the pre-trained encoder 410' and the encoder 412 with a lower model capacity. Then, fine-tuning 403 of the cross-modal downstream task may be performed on the second contrastive learning model. After training, the encoder 410' and the trained encoder 412 may be provided for use in the cross-modal downstream task).
Liu and Wu-US are analogous art as they both pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the learning devices of Liu with the teachings of Wu-US because, since the modality A does not participate in the gradient backpropagation during downstream training, the downstream training will take less time (Wu-US, ¶0071), thus improving the efficiency of the whole cross-modal learning and model performance (Wu-US, ¶0071).
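Examiner's note: the contrastive knowledge transfer that the rejection reads onto claim 4 can be sketched as follows. This is an illustrative sketch, not Wu-US's implementation; the architectures, batch shapes, and the CLIP-style 0.07 temperature are hypothetical choices. Matched image/audio pairs from the second training data act as positives, the rest of the batch as negatives, and gradients reach only the audio (student) encoder.

```python
# Illustrative sketch only; hypothetical teacher/student architectures.
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Sequential(              # frozen pre-trained teacher
    nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 128))
image_encoder.requires_grad_(False)

audio_encoder = nn.Sequential(              # student being trained
    nn.Flatten(), nn.Linear(64 * 100, 128))

images = torch.randn(8, 3, 32, 32)          # first image data
audio = torch.randn(8, 1, 64, 100)          # paired second audio data

with torch.no_grad():                       # teacher provides fixed targets
    v = F.normalize(image_encoder(images), dim=-1)
a = F.normalize(audio_encoder(audio), dim=-1)
logits = a @ v.T / 0.07                     # CLIP-style temperature
target = torch.arange(8)                    # matched pairs sit on the diagonal
loss = (F.cross_entropy(logits, target) +   # symmetric InfoNCE objective
        F.cross_entropy(logits.T, target)) / 2
loss.backward()                             # gradients flow only to the student
```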
Regarding Claim 5, Liu in view of Wu-US discloses the method of claim 1, but Liu may not explicitly disclose wherein the third training task includes fine-tuning the audio encoder by transferring knowledge of a pre-trained text encoder onto the audio encoder.
However, Wu-US (title, abstract, Figs. 1-7) teaches wherein the third training task includes fine-tuning the audio encoder by transferring knowledge of a pre-trained text encoder onto the audio encoder (Wu-US Fig. 4A: cross-modal contrastive learning 401 and fine-tuning of a single modality downstream task 402; Fig. 4B: cross-modal contrastive learning 401 and fine-tuning of a cross-modal downstream task 403. ¶0071 discloses the enhanced encoder in the modality B can significantly improve feature learning of an encoder in a further modality A, so that the learning of a further encoder for the modality B in the downstream can be further guided by a better modality A in the training process of the cross-modal downstream task, thus improving efficiency of the whole cross-modal learning and model performance. In addition, since the modality A does not participate in the gradient backpropagation during downstream training, the downstream training will take less time. Overall, such a model training scheme can improve training effectiveness and model performance while reducing downstream training time).
Liu and Wu-US are analogous art as they both pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the learning devices of Liu with the teachings of Wu-US because, since the modality A does not participate in the gradient backpropagation during downstream training, the downstream training will take less time (Wu-US, ¶0071), thus improving the efficiency of the whole cross-modal learning and model performance (Wu-US, ¶0071).
Regarding Claim 6, Liu in view of Wu-US discloses the method of claim 5, but Liu may not explicitly disclose wherein transferring knowledge of the pre-trained text encoder onto the audio encoder includes using contrastive learning with the third training data.
However, Wu-US (title, abstract, Figs. 1-7) teaches wherein transferring knowledge of the pre-trained text encoder onto the audio encoder includes using contrastive learning with the third training data (Wu-US ¶0066 discloses in some embodiments, the downstream task may comprise a cross-modal downstream task for the first modality and the second modality. The execution of the cross-modal downstream task requires encoders for two modalities. In some embodiments, instead of directly using the pre-trained first encoder and the third encoder, the second encoder with a lower model capacity is chosen for the second modality. In other words, suppose that it is known that contrastive learning is to be carried out for the cross-modal downstream task, and it has been determined that the first encoder for the first modality and the second encoder for the second modality are to be chosen for the cross-modal downstream task. Fig. 4B illustrates a schematic diagram of a contrastive learning architecture 425 according to some further embodiments of the present disclosure. As shown in Fig. 4B, after the pre-training stage, a second contrastive learning model used for the cross-modal downstream task comprises the encoder 410 and an encoder 412. Correspondingly, in order to improve performance, the encoder 412 [e.g., audio] may be chosen to be enhanced to the encoder 420 [e.g., text] in the pretraining stage. That is, the encoder 420 will have a larger model capacity than the encoder 412. In the cross-modal contrastive learning 401 stage, pre-training is conducted on the encoder 410 and the encoder 420 using the training dataset. ¶0067 discloses afterwards, the second contrastive learning model may be continuously constructed, which comprises the pre-trained first encoder and the second encoder, that is, the pre-trained encoder 410' and the encoder 412 with a lower model capacity. Then, fine-tuning 403 of the cross-modal downstream task may be performed on the second contrastive learning model. After training, the encoder 410' and the trained encoder 412 may be provided for use in the cross-modal downstream task).
Liu and Wu-US are analogous art as they both pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the learning devices of Liu with the teachings of Wu-US because, since the modality A does not participate in the gradient backpropagation during downstream training, the downstream training will take less time (Wu-US, ¶0071), thus improving the efficiency of the whole cross-modal learning and model performance (Wu-US, ¶0071).
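Examiner's note: the text-side transfer of claims 5-6 is the same contrastive mechanism with a frozen pre-trained text encoder as the teacher. The following illustrative sketch (hypothetical modules, shapes, and learning rate; not Wu-US's implementation) emphasizes the fine-tuning aspect: only the audio encoder is updated, and the teacher supplies targets without participating in backpropagation, consistent with Wu-US ¶0071.

```python
# Illustrative sketch only; hypothetical modules and hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 100, 128))
text_encoder = nn.EmbeddingBag(1000, 128)   # frozen pre-trained teacher
text_encoder.requires_grad_(False)          # teacher takes no gradient steps

# Small learning rate, as is typical when fine-tuning a pre-trained model.
opt = torch.optim.Adam(audio_encoder.parameters(), lr=1e-5)

audio = torch.randn(8, 1, 64, 100)          # third audio data
tokens = torch.randint(0, 1000, (8, 12))    # paired first text data (token ids)

for _ in range(3):                          # a few fine-tuning steps
    with torch.no_grad():                   # no backprop through the teacher
        t = F.normalize(text_encoder(tokens), dim=-1)
    a = F.normalize(audio_encoder(audio), dim=-1)
    logits = a @ t.T / 0.07                 # contrastive alignment to text
    loss = F.cross_entropy(logits, torch.arange(8))
    opt.zero_grad()
    loss.backward()
    opt.step()
```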
Claims 13 and 15-16 are rejected for the same reasons as set forth in Claims 1 and 3-6.
Claims 2, 7-9, 14, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US PGPUB #2020/0293826) in view of Wu et al. (US PGPUB #2024/0144007, hereinafter ‘Wu-US’), further in view of Kim et al. (US PGPUB #2024/0013564, hereinafter ‘Kim’).
Regarding Claim 2, Liu in view of Wu-US discloses the method of claim 1, but may not explicitly disclose wherein the first training task includes training the audio encoder for supervised classification on an audio dataset with labels.
However, Kim (title, abstract, Figs. 1-10) teaches wherein the first training task includes training the audio encoder for supervised classification on an audio dataset with labels (Kim ¶0021 discloses self-supervised learning [SSL] techniques have yielded visual representations having an associated accuracy that approaches a level of accuracy enabled using fully supervised learning operations on large computer vision downstream tasks, for example. ¶0084 discloses in a particular implementation in which an encoder executed at block 672 [Fig. 6C] and decoder executed at block 674 comprise neural networks, such parameters may comprise neural network weights associated with nodes that are trained, at least in part, in a self-supervised training operation. ¶0085 discloses block 676 may comprise executing one or more first neural networks to provide an output tensor indicating detections of features in the content signal based, at least in part on an input tensor. In a particular implementation, such detections of features by the one or more first neural networks may comprise classifications and localizations of objects in an image).
Liu, Wu-US, and Kim are analogous art as they all pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Liu in view of Wu-US in light of the teachings of Kim because Kim's SSL training operations are directed to applications of structured document images (Kim, ¶0022), and, to develop an SSL approach for structured document images, an information bottleneck framework may be applied to derive a negative-sample-free contrastive learning objective (Kim, ¶0022).
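Examiner's note: an illustrative sketch of the first training task as mapped above, i.e., supervised classification on an audio dataset with labels. The label set, data, and model are hypothetical placeholders, not drawn from the cited references.

```python
# Illustrative sketch only; hypothetical dataset, labels, and model.
import torch
import torch.nn as nn
import torch.nn.functional as F

labels = ["speech", "music", "noise"]               # hypothetical label set
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 100, 128))
classifier = nn.Linear(128, len(labels))
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(classifier.parameters()), lr=1e-4)

dataset = [(torch.randn(8, 1, 64, 100),             # labeled audio batches
            torch.randint(0, len(labels), (8,)))]

for x, y in dataset:                                # one supervised epoch
    logits = classifier(encoder(x))
    loss = F.cross_entropy(logits, y)               # ground-truth labels drive the loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```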
Regarding Claim 7, Liu in view of Wu-US discloses the method of claim 1, but may not explicitly disclose wherein the first training task includes a supervised training task.
However, Kim (title, abstract, Figs. 1-10) teaches wherein the first training task includes a supervised training task (Kim ¶0021 discloses self-supervised learning [SSL] techniques have yielded visual representations having an associated accuracy that approaches a level of accuracy enabled using fully supervised learning operations on large computer vision downstream tasks, for example. ¶0025 discloses method of training a system for detection of objects in a content signal, the method comprising: applying a self-supervised operation to train parameters of an encoder and a decoder based, at least in part, on a first loss function based, at least in part, on a computed loss associated with reconstruction of a view of the content signal; and applying a supervised operation to further train parameters of the encoder and the decoder trained in the self-supervised operation based, at least in part, on a second loss function based, at least in part, on a computed loss associated with detection of objects).
Liu, Wu-US, and Kim are analogous art as they all pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Liu in view of Wu-US in light of the teachings of Kim because Kim's SSL training operations are directed to applications of structured document images (Kim, ¶0022), and, to develop an SSL approach for structured document images, an information bottleneck framework may be applied to derive a negative-sample-free contrastive learning objective (Kim, ¶0022).
Regarding Claim 8, Liu in view of Wu-US discloses the method of claim 1, but may not explicitly disclose wherein the second training task includes a self-supervised training task.
However, Kim (title, abstract, Figs. 1-10) teaches wherein the second training task includes a self-supervised training task (Kim ¶0084 discloses in a particular implementation in which an encoder executed at block 672 [Fig. 6C] and decoder executed at block 674 comprise neural networks, such parameters may comprise neural network weights associated with nodes that are trained, at least in part, in a self-supervised training operation. ¶0103 discloses Fig. 7B is a flow diagram of a process 750 to determine parameters of an encoder and decoder such as, for example, an encoder 610 and 614 shown in Fig. 6A, using a self-supervised pretraining operation applied to training sets of a content signal [e.g., content signal 702]. In a particular implementation, features of process 750 may be executed by system 700 [Fig. 7A]. In one aspect, such self-supervised training operations may comprise computation of a loss function to be applied in updating parameters of an encoder/decoder pair, such as a loss function set forth according to expression (10) or expression (11). Block 752 may comprise application of a self-supervised operation to train parameters of an encoder and a decoder based, at least in part, on a first loss function such as a loss function set forth in expressions (10) and/or (12)).
Liu, Wu-US, and Kim are analogous art as they all pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Liu in view of Wu-US in light of the teachings of Kim because Kim's SSL training operations are directed to applications of structured document images (Kim, ¶0022), and, to develop an SSL approach for structured document images, an information bottleneck framework may be applied to derive a negative-sample-free contrastive learning objective (Kim, ¶0022).
Regarding Claim 9, Liu in view of Wu-US discloses the method of claim 1, but may not explicitly disclose wherein the third training task includes a self-supervised training task.
However, Kim (title, abstract, Figs. 1-10) teaches wherein the third training task includes a self-supervised training task (Kim ¶0084 discloses in a particular implementation in which an encoder executed at block 672 [Fig. 6C] and decoder executed at block 674 comprise neural networks, such parameters may comprise neural network weights associated with nodes that are trained, at least in part, in a self-supervised training operation. ¶0103 discloses Fig. 7B is a flow diagram of a process 750 to determine parameters of an encoder and decoder such as, for example, an encoder 610 and 614 shown in Fig. 6A, using a self-supervised pretraining operation applied to training sets of a content signal [e.g., content signal 702]. In a particular implementation, features of process 750 may be executed by system 700 [Fig. 7A]. In one aspect, such self-supervised training operations may comprise computation of a loss function to be applied in updating parameters of an encoder/decoder pair, such as a loss function set forth according to expression (10) or expression (11). Block 752 may comprise application of a self-supervised operation to train parameters of an encoder and a decoder based, at least in part, on a first loss function such as a loss function set forth in expressions (10) and/or (12)).
Liu, Wu-US, and Kim are analogous art as they all pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Liu in view of Wu-US in light of the teachings of Kim because Kim's SSL training operations are directed to applications of structured document images (Kim, ¶0022), and, to develop an SSL approach for structured document images, an information bottleneck framework may be applied to derive a negative-sample-free contrastive learning objective (Kim, ¶0022).
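Examiner's note: the supervised versus self-supervised distinction underlying claims 2 and 7-9 can be illustrated side by side. The sketch below is a simplified reading of Kim's encoder/decoder description (¶0084, ¶0103) with hypothetical modules; it is not Kim's actual architecture or loss expressions (10)-(12).

```python
# Illustrative sketch only; hypothetical encoder, decoder, and head.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 100, 128))
decoder = nn.Linear(128, 64 * 100)
head = nn.Linear(128, 10)

x = torch.randn(8, 1, 64, 100)          # content-signal batch
y = torch.randint(0, 10, (8,))          # labels (needed only when supervised)

# Supervised task (cf. claims 2 and 7): the loss depends on ground-truth labels.
supervised_loss = F.cross_entropy(head(encoder(x)), y)

# Self-supervised task (cf. claims 8-9): the training signal is the input
# itself; the encoder/decoder pair reconstructs a view of the content signal,
# so no labels are required.
reconstruction = decoder(encoder(x)).view_as(x)
self_supervised_loss = F.mse_loss(reconstruction, x)
```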
Claims 14 and 17-19 are rejected for the same reasons as set forth in Claims 2 and 7-9.
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US PGPUB #2020/0293826) in view of Wu et al. (US PGPUB #2024/0144007, hereinafter ‘Wu-US’), further in view of Deng et al. (CN #115688937 A, hereinafter ‘Deng’).
Regarding Claim 10, Liu in view of Wu-US discloses the method of claim 1, but may not explicitly disclose wherein the at least one downstream task includes audio tagging.
However, Deng (title, abstract, Figs. 1-21) teaches wherein the at least one downstream task (Deng ¶0192 discloses during the fine-tuning phase, the model is initialized using the parameters learned in the pre-training phase. It then performs fewer training steps on downstream tasks such as text classification and sequence labeling, successfully transferring the semantic information obtained from pre-training to downstream tasks) includes audio tagging (Deng ¶0215 discloses for text data, discrete tokens [i.e., tags] in the text data can be encoded; for image data, the image can be divided into patches, and each patch can be modeled as a discrete token for encoding; for speech data, the speech signal can be represented as continuous frequency domain features in frames, and then used for model modeling and downstream task processing. New research, exemplified by Meta’s recent work on Textless NLP, shows that speech frames can also represent discrete tags (i.e., tokens), and these tokens can be used as pre-trained models for modeling and training, just like text).
Liu, Wu-US, and Deng are analogous art as they all pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Liu in view of Wu-US in light of the teachings of Deng to represent speech frames as discrete tokens for model training and downstream task processing (Deng, ¶0215), so that speech can be modeled and trained just like text (Deng, ¶0215).
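Examiner's note: the discrete-token representation of speech frames cited from Deng ¶0215 can be sketched as follows. The codebook, feature dimensions, and nearest-neighbor quantization are hypothetical illustration (e.g., k-means-style centroids), not Deng's implementation.

```python
# Illustrative sketch only; hypothetical codebook and dimensions.
import torch

frames = torch.randn(100, 40)      # continuous frequency-domain features per frame
codebook = torch.randn(512, 40)    # learned codebook (e.g., k-means centroids)

# Quantize: each speech frame becomes the id of its nearest codebook entry,
# yielding a sequence of discrete tokens usable like text tokens downstream.
tokens = torch.cdist(frames, codebook).argmin(dim=1)   # shape (100,)
```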
Claims 10-12 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US PGPUB #2020/0293826) in view of Wu et al. (US PGPUB #2024/0144007, hereinafter ‘Wu-US’), further in view of Wu et al. (WAV2CLIP: Learning Robust Audio Representations from CLIP, hereinafter ‘Wu-WAV2CLIP’).
Regarding Claim 10, Liu in view of Wu-US discloses the method of claim 1, but may not explicitly disclose wherein the at least one downstream task includes audio tagging.
However, Wu-WAV2CLIP (title, abstract, Fig. 1) teaches wherein the at least one downstream task includes audio tagging (Wu-WAV2CLIP Fig. 1: audio and multimodal downstream tasks: classification, tagging, retrieval, and generation).
Liu, Wu-US, and Wu-WAV2CLIP are analogous art as they all pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Liu in view of Wu-US in light of the teachings of Wu-WAV2CLIP to derive Wav2CLIP embeddings from CLIP, which align with text and thereby make zero-shot audio classification, audio captioning, and cross-modal text/audio-to-audio retrieval possible while keeping the Wav2CLIP training recipe very lightweight (Wu-WAV2CLIP, introduction, right column, first paragraph), and because, unlike previous audio-visual correspondence models, it is not necessary to learn a visual model jointly with the auditory model (Wu-WAV2CLIP, introduction, right column, first paragraph).
Regarding Claim 11, Liu in view of Wu-US discloses the method of claim 1, but may not explicitly disclose wherein the at least one downstream task includes audio retrieval.
However, Wu-WAV2CLIP (title, abstract, Fig. 1) teaches wherein the at least one downstream task includes audio retrieval (Wu-WAV2CLIP Fig. 1: audio and multimodal downstream tasks: classification, tagging, retrieval, and generation. Abstract: systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pretrained audio representation algorithms. Introduction, right column, second paragraph: We systematically evaluate Wav2CLIP representations across a wide variety of audio tasks, including classification and retrieval, and compare with other audio representation learning approaches, as well as SOTA results from each task. Table 1).
Liu, Wu-US, and Wu-WAV2CLIP are analogous art as they all pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Liu in view of Wu-US in light of the teachings of Wu-WAV2CLIP to derive Wav2CLIP embeddings from CLIP, which align with text and thereby make zero-shot audio classification, audio captioning, and cross-modal text/audio-to-audio retrieval possible while keeping the Wav2CLIP training recipe very lightweight (Wu-WAV2CLIP, introduction, right column, first paragraph), and because, unlike previous audio-visual correspondence models, it is not necessary to learn a visual model jointly with the auditory model (Wu-WAV2CLIP, introduction, right column, first paragraph).
Regarding Claim 12, Liu in view of Wu-US discloses the method of claim 1, but may not explicitly disclose wherein the at least one downstream task includes zero-shot classification.
However, Wu-WAV2CLIP (title, abstract, Fig. 1) teaches wherein the at least one downstream task includes zero-shot classification (Wu-WAV2CLIP Fig. 1: audio and multimodal downstream tasks: classification, tagging, retrieval, and generation. Abstract: Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification, and cross-modal retrieval. Introduction, right column, second paragraph: We show how to apply Wav2CLIP to solve several multimodal and zero-shot tasks. Table 1).
Liu, Wu-US, and Wu-WAV2CLIP are analogous art as they all pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Liu in view of Wu-US in light of the teachings of Wu-WAV2CLIP to derive Wav2CLIP embeddings from CLIP, which align with text and thereby make zero-shot audio classification, audio captioning, and cross-modal text/audio-to-audio retrieval possible while keeping the Wav2CLIP training recipe very lightweight (Wu-WAV2CLIP, introduction, right column, first paragraph), and because, unlike previous audio-visual correspondence models, it is not necessary to learn a visual model jointly with the auditory model (Wu-WAV2CLIP, introduction, right column, first paragraph).
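Examiner's note: once audio and text are projected into a shared embedding space, as in Wu-WAV2CLIP, the downstream tasks of claims 10-12 all reduce to similarity operations in that space. The following illustrative sketch (hypothetical stand-in encoders, prompts, and tagging threshold; a simplified reading of Wu-WAV2CLIP Fig. 1, not its implementation) shows tagging, retrieval, and zero-shot classification over the same embeddings.

```python
# Illustrative sketch only; hypothetical stand-ins for trained encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

proj_audio = nn.Linear(64 * 100, 128)       # stand-in audio encoder
proj_text = nn.EmbeddingBag(1000, 128)      # stand-in text encoder

audio = torch.randn(4, 64 * 100)            # 4 audio clips
prompts = torch.randint(0, 1000, (3, 5))    # token ids for 3 label prompts

a = F.normalize(proj_audio(audio), dim=-1)  # (4, 128) audio embeddings
t = F.normalize(proj_text(prompts), dim=-1) # (3, 128) label embeddings
sims = a @ t.T                              # cosine similarity matrix (4, 3)

zero_shot = sims.argmax(dim=1)              # claim 12: pick best label, no training
tags = sims > 0.2                           # claim 10: multi-label audio tagging
ranked = sims[:, 0].argsort(descending=True)  # claim 11: rank clips for label 0
```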
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US PGPUB #2020/0293826) in view of Kim et al. (US PGPUB #2024/0013564, hereinafter ‘Kim’), further in view of Wu et al. (US PGPUB #2024/0144007, hereinafter ‘Wu-US’).
Regarding Claim 20, Liu discloses an apparatus for training an audio encoder (Figs. 1-12), the apparatus comprising:
a computing device (Liu Fig. 10) configured to:
receive first training data comprising first audio data (Liu ¶0005 discloses receiving a first type of nontext input. ¶0006 discloses the first type of nontext input is audio);
perform a first training task on an audio encoder using the first training data (Liu ¶0005 discloses encoding the first type of nontext input using a first autoencoder having a first convolutional neural network. ¶0066 discloses thus, an audio input that was received from the microphone can be encoded by an autoencoder. ¶0033 discloses the audio module uses audio signals as training inputs; Fig. 1);
receive second training data comprising first image data and second audio data (Liu ¶0005 discloses receiving a second type of nontext input. ¶0006 discloses the second type of nontext input is an image);
perform a second training task on the audio encoder using the second training data (Liu ¶0005 discloses encoding the second type of nontext input using a second autoencoder having a second convolutional neural network. ¶0066 discloses an image input that was received from the camera can be encoded by another autoencoder. ¶0033 discloses the image module uses images as training inputs; Fig. 1).
Liu may not explicitly disclose wherein the first training task includes a supervised learning task that includes training the audio encoder for supervised classification on an audio dataset with labels; wherein the second training task includes a self-supervised learning task that includes training the audio encoder by transferring knowledge from a pre-trained image encoder onto the audio encoder; wherein the third training task includes a self-supervised learning task that includes fine-tuning the audio encoder by transferring knowledge of a pre-trained text encoder onto the audio encoder; receive third training data comprising first text data and third audio data; perform a third training task on the audio encoder using the third training data; and perform at least one downstream task using the audio encoder.
However, Kim (title, abstract, Figs. 1-10) teaches wherein the first training task includes a supervised learning task that includes training the audio encoder for supervised classification on an audio dataset with labels (Kim ¶0021 discloses self-supervised learning [SSL] techniques have yielded visual representations having an associated accuracy that approaches a level of accuracy enabled using fully supervised learning operations on large computer vision downstream tasks, for example. ¶0084 discloses in a particular implementation in which an encoder executed at block 672 [Fig. 6C] and decoder executed at block 674 comprise neural networks, such parameters may comprise neural network weights associated with nodes that are trained, at least in part, in a self-supervised training operation. ¶0085 discloses block 676 may comprise executing one or more first neural networks to provide an output tensor indicating detections of features in the content signal based, at least in part on an input tensor. In a particular implementation, such detections of features by the one or more first neural networks may comprise classifications and localizations of objects in an image);
wherein the second training task includes a self-supervised learning task (Kim ¶0084 discloses in a particular implementation in which an encoder executed at block 672 [Fig. 6C] and decoder executed at block 674 comprise neural networks, such parameters may comprise neural network weights associated with nodes that are trained, at least in part, in a self-supervised training operation. ¶0103 discloses Fig. 7B is a flow diagram of a process 750 to determine parameters of an encoder and decoder such as, for example, an encoder 610 and 614 shown in Fig. 6A, using a self-supervised pretraining operation applied to training sets of a content signal [e.g., content signal 702]. In a particular implementation, features of process 750 may be executed by system 700 [Fig. 7A]. In one aspect, such self-supervised training operations may comprise computation of a loss function to be applied in updating parameters of an encoder/decoder pair, such as a loss function set forth according to expression (10) or expression (11). Block 752 may comprise application of a self-supervised operation to train parameters of an encoder and a decoder based, at least in part, on a first loss function such as a loss function set forth in expressions (10) and/or (12));
wherein the third training task includes a self-supervised learning task that includes fine-tuning the audio encoder by transferring knowledge of a pre-trained text encoder onto the audio encoder (Kim ¶0084 discloses in a particular implementation in which an encoder executed at block 672 [Fig. 6C] and decoder executed at block 674 comprise neural networks, such parameters may comprise neural network weights associated with nodes that are trained, at least in part, in a self-supervised training operation. ¶0103 discloses Fig. 7B is a flow diagram of a process 750 to determine parameters of an encoder and decoder such as, for example, an encoder 610 and 614 shown in Fig. 6A, using a self-supervised pretraining operation applied to training sets of a content signal [e.g., content signal 702]. In a particular implementation, features of process 750 may be executed by system 700 [Fig. 7A]. In one aspect, such self-supervised training operations may comprise computation of a loss function to be applied in updating parameters of an encoder/decoder pair, such as a loss function set forth according to expression (10) or expression (11). Block 752 may comprise application of a self-supervised operation to train parameters of an encoder and a decoder based, at least in part, on a first loss function such as a loss function set forth in expressions (10) and/or (12)).
Liu and Kim are analogous art as they both pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the learning devices of Liu with the teachings of Kim because Kim's SSL training operations are directed to applications of structured document images (Kim, ¶0022), and, to develop an SSL approach for structured document images, an information bottleneck framework may be applied to derive a negative-sample-free contrastive learning objective (Kim, ¶0022).
And Wu-US (title, abstract, Figs. 1-7) teaches wherein the second training task includes a self-supervised learning task that includes training the audio encoder by transferring knowledge from a pre-trained image encoder onto the audio encoder (Wu-US Fig. 4A: cross-modal contrastive learning 401 and fine-tuning of a single modality downstream task 402; Fig. 4B: cross-modal contrastive learning 401 and fine-tuning of a cross-modal downstream task 403. ¶0065 discloses the enhanced encoder in the modality B [e.g., image] can significantly improve feature learning of an encoder in a further modality A [e.g., audio], so that the encoder for the modality A [i.e., by transferring knowledge from image encoder] can obtain a better starting point for fine-tuning in a downstream task. Compared with a conventional scheme, model capacity of the encoder for the modality A may be the same, however parameter values are adjusted to a better extent in the pre-training stage. Therefore, if fine-tuning for the downstream task is required, training time of the encoder for the modality A will not increase. In addition, the encoder for the modality A can provide a better predictive capability even when directly used for the downstream task. Such a pre-training method can significantly improve the performance of the encoder without increasing training overhead during downstream fine-tuning and subsequent model application overhead. ¶0071 discloses the enhanced encoder in the modality B can significantly improve feature learning of an encoder in a further modality A, so that the learning of a further encoder for the modality B in the downstream can be further guided by a better modality A in the training process of the cross-modal downstream task, thus improving efficiency of the whole cross-modal learning and model performance. In addition, since the modality A does not participate in the gradient backpropagation during downstream training, the downstream training will take less time. Overall, such a model training scheme can improve training effectiveness and model performance while reducing downstream training time);
receive third training data comprising first text data and third audio data (Wu-US ¶0046 discloses the first modality [third] can be any of the following modalities: an image, text, a video, audio, and the second modality may comprise a further one of the modalities, such as audio; Fig. 2: 220);
perform a third training task on the audio encoder using the third training data (Wu-US ¶0047 discloses the encoder for the first modality [i.e., third, text] and the encoder for the second modality can be selected as machine learning models or neural networks suitable for processing data of the corresponding modalities. For example, in cross-modal learning for images and text, Res50 encoder may be selected as an encoder for an image modality, while BERT-Base encoder may be selected as an encoder for a text modality); and
perform at least one downstream task using the audio encoder (Wu-US Fig. 2: 240 and Fig. 4B: 403; ¶0063-¶0071).
Liu, Kim, and Wu-US are analogous art as they all pertain to multi-modality encoders. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Liu in view of Kim in light of the teachings of Wu-US because, since the modality A does not participate in the gradient backpropagation during downstream training, the downstream training will take less time (Wu-US, ¶0071), thus improving the efficiency of the whole cross-modal learning and model performance (Wu-US, ¶0071).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL whose telephone number is (571) 272-3957. The examiner can normally be reached from 7:30 AM to 4:00 PM PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Duc Nguyen can be reached at (571) 272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/YOGESHKUMAR PATEL/
Primary Examiner, Art Unit 2691