Prosecution Insights
Last updated: April 19, 2026
Application No. 18/627,098

CONTRASTIVE LEARNING WITH ADVERSARIAL DATA FOR ROBUST SPEECH TRANSLATION

Non-Final OA (§103)

Filed: Apr 04, 2024
Examiner: PATEL, YOGESHKUMAR G
Art Unit: 2691
Tech Center: 2600 — Communications
Assignee: Zoom Video Communications, Inc.
OA Round: 1 (Non-Final)

Grant Probability: 83% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 4m
With Interview: 86%

Examiner Intelligence

Career Allow Rate: 83% (538 granted / 650 resolved), above average (+20.8% vs TC avg)
Interview Lift: +3.4% (minimal), from resolved cases with interview
Avg Prosecution (typical timeline): 2y 4m; 17 applications currently pending
Career history: 667 total applications across all art units
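The headline figures above are simple ratios and can be sanity-checked. The snippet below is illustrative only; the tool's exact rounding rules are an assumption:

```python
# Sanity-check the dashboard's examiner statistics.
granted, resolved = 538, 650      # career grant counts shown above
interview_lift = 3.4              # percentage points, from interview data

allow_rate = granted / resolved   # career allow rate as a fraction
print(f"allow rate: {allow_rate:.1%}")                          # rounds to the displayed 83%
print(f"with interview: {allow_rate * 100 + interview_lift:.0f}%")
```

Note that 538/650 is 82.8%, which the dashboard rounds to 83%, and adding the 3.4-point interview lift yields the displayed 86%.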

Statute-Specific Performance

§101: 4.7% (-35.3% vs TC avg)
§103: 61.9% (+21.9% vs TC avg)
§102: 14.4% (-25.6% vs TC avg)
§112: 14.2% (-25.8% vs TC avg)

Based on career data from 650 resolved cases; Tech Center averages are estimates.

Office Action (§103)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Thomson et al. (US 2020/0175961) in view of Clinchant et al. (US 2023/084333) further in view of Hu et al. (Contrastive Learning for Robust Neural Machine Translation with ASR Errors, 2022).

Regarding Claim 1, Thomson discloses a method (title, abstract, Figs. 1-84) comprising: inputting a speech signal (Thomson Fig. 1: 110, 112; ¶0106) to an automatic speech recognition model (Thomson Fig. 1: ASR 120a-120c) to obtain a transcript hypothesis (Thomson ¶0256 discloses the decoder 510 determines a series of words, denoted as a hypothesis, for use in generating a transcription; Fig. 5) including a first sequence of tokens (Thomson ¶0334 discloses the hypotheses can be represented as a string of tokens [i.e., sequence of tokens]), wherein the speech signal is associated with a golden transcript (Thomson ¶0236 discloses the reference transcription [i.e., golden transcript] can be based on audio collected from a production service that is transcribed offline. One example of transcribing audio offline can include the steps of configuring a transcription management, transcription, and editing tool to (a) send an audio sample to a first transcriber for transcription, then to a second transcriber to check the results of the first transcriber, (b) send multiple audio samples to a first transcriber and at least some of the audio samples to a second transcriber to check quality, or (c) send an audio sample to two or more transcribers and to use a third transcriber to check results when the first two transcribers differ. Additionally, or alternatively, the accuracy tester 410 can generate a reference transcription in real time and automatically compare the reference to the hypothesis to determine an error rate in real time. ¶0237 discloses a reference transcription can be generated by sending the same audio segment to multiple different revoiced transcription units that each transcribe the audio. Alternatively, or additionally, the same audio segment can be sent to multiple different non-revoiced transcription units that each transcribe the audio. The output of some or all of the non-revoiced and revoiced transcription units can be provided to a fuser that can combine the transcriptions into a reference transcription. ¶0157 discloses confidence models) including a second sequence of tokens (Thomson ¶0344 discloses the first denormalize text process 1404a, the second denormalize text process 1404b, and the third denormalize text process 1404c can be configured to receive the tokens from the first transcription generation process 1402a, the second transcription generation process 1402b, and the third transcription generation process 1402c, respectively); inputting the first sequence of tokens to an encoder of a neural machine translation model to obtain a first sentence representation (Thomson Fig. 8: 8008; ¶0618).

Thomson may not explicitly disclose inputting the first sequence of tokens to an encoder of a neural machine translation model to obtain a first sentence representation; inputting the second sequence of tokens to the encoder of the neural machine translation model to obtain a second sentence representation; determining a contrastive loss function based on the first sentence representation and the second sentence representation; and training the encoder of the neural machine translation model based on the contrastive loss function.

However, Clinchant (title, abstract, Figs. 1-6) teaches inputting the first sequence of tokens to an encoder of a neural machine translation model (Clinchant Fig. 2: 204 NMT Model, encoder 214) to obtain a first sentence representation (Clinchant ¶0047 discloses for the clean sequence pair, the clean source and target sequences are input [fed] as inputs 217, 219 to the encoder 214 and the decoder 216, respectively); inputting the second sequence of tokens to the encoder of the neural machine translation model to obtain a second sentence representation (Clinchant ¶0047 discloses for the noisy sequence pair, the source and target sequences of the noisy sequence pair [at least one being noisy], can be input [fed] as inputs 217, 219 to the encoder 214 and the decoder 216, respectively).
Thomson and Clinchant are analogous art as they pertain to speech translation. Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify language translation (as taught by Thomson) to sample one or more token perturbations [e.g., replacements, substitutions, insertions, deletions], by the adversarial generator, in the clean source sequence, the clean target sequence, or both, to replace the masked tokens in such sequences (as taught by Clinchant, ¶0043) to preserve a meaning of the masked token, while another, competing objective of this sampling can be to maximize loss [e.g., translation loss] in the neural language model (Clinchant, ¶0043).

Further, Hu (title, abstract) teaches determining a contrastive loss function (Hu pages 83-85, last para in each: "contrastive learning" and equations 1-5) based on the first sentence representation (Hu page 84, last para, "expose the [NMT] model to ... incorrect input sentences" [the noisy inputs]) and the second sentence representation (Hu page 84, last para, "expose the [NMT] model to ... valid ... input sentences" [the correct/non-noised inputs]); and training the encoder of the neural machine translation model based on the contrastive loss function (Hu abstract: "to train an NMT model being robust to ASR output, we take contrastive learning framework to close the gap among representations of original input and its perturbed counterpart"; page 82, second para: "we automatically convert source-side sentences in NMT training dataset from non-noised version to noised version by mimicking typical types of ASR errors"; pages 83-85, last para in each: "contrastive learning" and equations 1-5). Thomson, Clinchant, and Hu are analogous art as they pertain to speech translation.

Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Thomson in view of Clinchant in light of the teachings of Hu to "automatically convert source-side sentences in NMT training dataset from non-noised version to noised version by mimicking typical types of ASR errors" (as taught by Hu, page 82, second para) since "as of yet there are no publicly available bilingual parallel corpora with both naturally occurring noisy inputs and their corresponding correct inputs, thus previous studies could not tell to what extent the translation performance would drop when we feed the NMT models with noised sentences, rather than non-noised ones" (Hu, page 82, first para).

Regarding Claim 2, Thomson in view of Clinchant and Hu discloses the method of claim 1. But Thomson may not explicitly disclose: training the neural machine translation model based on source-target text translation pairs. However, Clinchant (title, abstract, Figs. 1-6) teaches training the neural machine translation model based on source-target text translation pairs (Clinchant ¶0033 discloses Fig. 1 illustrates an example method 100 for training a neural model, which in the example described herein is an autoregressive encoder-decoder model, examples of which include a neural language model such as a neural machine translation [NMT] model. Fig. 2 shows an example architecture 200 for carrying out the method 100. An example NMT is a bilingual translation model. ¶0037 discloses referring to Fig. 1, in the example method 100, a plurality of clean sequence pairs is received at 102. For example, the processor 202 can receive the clean sentence pairs from a batch stored locally to the processor or remotely, such as via a network.
The batch can be stored in any suitable storage, such as but not limited to a database 220 in communication with the processor 202. An example batch is provided by a dataset, including [for training machine translation models] available machine translation training datasets known to those of ordinary skill in the art. The batch can include a [clean] parallel corpus having a [clean] source side and [clean] target side. ¶0038 discloses each clean sequence pair includes a clean source sequence [e.g., of tokens such as words or subwords], such as a clean source sentence, and a clean target sequence [e.g., of tokens such as words or subwords], such as a clean target sentence). Thomson and Clinchant are analogous art as they pertain to speech translation. Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify language translation (as taught by Thomson) to sample one or more token perturbations [e.g., replacements, substitutions, insertions, deletions], by the adversarial generator, in the clean source sequence, the clean target sequence, or both, to replace the masked tokens in such sequences (as taught by Clinchant, ¶0043) to preserve a meaning of the masked token, while another, competing objective of this sampling can be to maximize loss [e.g., translation loss] in the neural language model (Clinchant, ¶0043).

Regarding Claim 3, Thomson in view of Clinchant and Hu discloses the method of claim 1, comprising: iteratively switching between batches of training the encoder of the neural machine translation model based on transcript hypotheses from the automatic speech recognition model and corresponding golden transcripts using the contrastive loss function, and batches of training the neural machine translation model based on source-target text translation pairs (Thomson ¶0500 discloses alternatively, or additionally, Figs. 18-30, among others, describe various systems and methods that can switch between the different transcription units providing transcriptions for audio of a communication session during the communication session. In these and other embodiments, a criteria for selecting between transcription units can include the estimated accuracy of each transcription unit. For example, when a non-revoicing transcription unit provides an estimated accuracy that satisfies a threshold, the non-revoicing transcription unit may be selected over a revoicing transcription unit. ¶0570 discloses the scorer 2216 [Fig. 22] can be configured to evaluate similarity between two token strings, such as two transcriptions. In some embodiments, the scorer 2216 can compare hypotheses transcriptions, from transcription units or ASR systems, as illustrated in Figs. 20 and 21. In these and other embodiments, the output of the scorer 2216 can be referred to as an agreement rate. The scorer 2216 can compare a reference transcription [i.e., a transcription assumed to be correct; referred to here as the golden transcript] and a hypothesis transcription. In these and other embodiments, the output of the scorer 2216 can be referred to as an accuracy score with respect to the accuracy of the hypothesis transcription with respect to the reference transcription).

Regarding Claim 4, Thomson in view of Clinchant and Hu discloses the method of claim 1, wherein a special contrastive loss token is appended to the first sequence of tokens when it is input to the encoder of the neural machine translation model to obtain the first sentence representation (Thomson ¶0334 discloses each of the ASR systems 1320 can be configured to generate a transcription based on the audio received by the ASR systems 1320. The transcriptions, referred to sometimes as "hypotheses," can have varying degrees of accuracy depending on the particular configuration of the ASR systems 1320.
The hypotheses can be represented as a string of tokens. The string of tokens can include one or more of sentences, phrases, or words. A token can include a word, subword, character, or symbol).

Regarding Claim 5, Thomson in view of Clinchant and Hu discloses the method of claim 1, wherein determining the contrastive loss function comprises: determining a distance between the first sentence representation and the second sentence representation (Thomson ¶0365 discloses the align text process 1406 can align the tokens, e.g., the words in the above hypotheses, so that as many identical tokens as possible lie in each token group. In some embodiments, the alignment can reduce the edit distance [the minimum number of insertions, deletions, and substitutions to convert one string to the other] or Levenshtein distance between denormalized hypotheses provided to the align text process 1406 after the denormalized hypotheses have been aligned. Additionally, or alternatively, the alignment can reduce the edit or Levenshtein distance between each aligned denormalized hypothesis and the fused transcription. ¶0416 discloses the align text process 1406 and/or voting process 1408 can be configured to utilize a Viterbi search or variation of the Viterbi search adapted to measuring edit distance between tokens to align token sequences. ¶0878 discloses the comparer 4504 can compare the monitored transcription with the reference transcription by determining an edit distance or Levenshtein distance therebetween).

Regarding Claim 6, Thomson in view of Clinchant and Hu discloses the method of claim 1, wherein determining the contrastive loss function comprises: determining a distance between the first sentence representation and a negative example that is constructed from other noisy and clean sentences in a batch of speech signal training data (Thomson ¶0878 discloses the comparer 4504 can be configured to compare the monitored transcription with the reference transcription. Additionally, or alternatively, the comparer 4504 can compare the monitored transcription with the reference transcription by determining an edit distance or Levenshtein distance therebetween. In some embodiments, the comparison process by the comparer 4504 can be implemented as follows: (1) the comparer 4504 may align the monitored transcription and the reference transcription; (2) the comparer 4504 may compare each aligned pair of tokens from the monitored transcription and the reference transcription. The pair of tokens may include a first token from the monitored transcription and a second token from the reference transcription; (3) the comparer 4504 may provide an indication, such as a match or no match with respect to each aligned pair of tokens, to the counter 4506. For example, the comparer 4504 may output a zero when a pair of tokens match and a one if there is no match between a pair of tokens; and (4) the number of differences is counted or averaged by the counter 4506 to determine an average disagreement rate, edit distance, and/or Levenshtein distance).

Regarding Claim 7, Thomson in view of Clinchant and Hu discloses the method of claim 1, wherein determining the contrastive loss function comprises: determining a distance between the second sentence representation and a negative example that is constructed from other noisy and clean sentences in a batch of speech signal training data (Thomson ¶0878 discloses the comparer 4504 can be configured to compare the monitored transcription with the reference transcription. Additionally, or alternatively, the comparer 4504 can compare the monitored transcription with the reference transcription by determining an edit distance or Levenshtein distance therebetween.
In some embodiments, the comparison process by the comparer 4504 can be implemented as follows: (1) the comparer 4504 may align the monitored transcription and the reference transcription; (2) the comparer 4504 may compare each aligned pair of tokens from the monitored transcription and the reference transcription. The pair of tokens may include a first token from the monitored transcription and a second token from the reference transcription; (3) the comparer 4504 may provide an indication, such as a match or no match with respect to each aligned pair of tokens, to the counter 4506. For example, the comparer 4504 may output a zero when a pair of tokens match and a one if there is no match between a pair of tokens; and (4) the number of differences is counted or averaged by the counter 4506 to determine an average disagreement rate, edit distance, and/or Levenshtein distance).

System Claims 8-14 are rejected for the same reasons as set forth in Claims 1-7. Non-transitory computer-readable storage medium Claims 15-20 are rejected for the same reasons as set forth in Claims 1-7 (Thomson ¶0217, ¶0277, ¶0282, ¶0287, ¶0619: the method can be performed based on the execution of instructions stored on one or more non-transitory computer-readable media).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL whose telephone number is (571) 272-3957. The examiner can normally be reached 7:30 AM-4 PM PST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Duc Nguyen, can be reached at (571) 272-7503.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/YOGESHKUMAR PATEL/
Primary Examiner, Art Unit 2691
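The core theory of the rejection pairs Hu's contrastive framework with ASR-noised inputs: pull the encoder representation of a noisy hypothesis toward the representation of its golden transcript while pushing it away from the other sentences in the batch, which serve as in-batch negatives. Below is a minimal NumPy sketch of such an InfoNCE-style objective; the function name, embedding dimensions, and temperature are illustrative assumptions, not taken from the cited references:

```python
import numpy as np

def info_nce_loss(clean, noisy, temperature=0.1):
    """Contrastive loss over a batch of sentence representations: each
    noisy (ASR-hypothesis) vector is pulled toward its own clean
    (golden-transcript) vector, with the rest of the batch acting as
    in-batch negatives."""
    # L2-normalize so dot products become cosine similarities
    clean = clean / np.linalg.norm(clean, axis=1, keepdims=True)
    noisy = noisy / np.linalg.norm(noisy, axis=1, keepdims=True)
    logits = noisy @ clean.T / temperature          # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    # Row-wise log-softmax; the diagonal entries are the positive pairs
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 16))                    # stand-in encoder outputs
noisy = clean + 0.1 * rng.normal(size=(8, 16))      # lightly perturbed "ASR" versions
print(f"contrastive loss: {info_nce_loss(clean, noisy):.3f}")
```

Minimizing this loss closes the gap between clean and noisy representations, which is the behavior the examiner attributes to Hu's equations 1-5.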
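Note that the Thomson passages cited for Claims 5-7 (¶0365, ¶0878) describe token-level edit (Levenshtein) distance between transcription strings rather than a distance between encoder representations, a distinction that may matter in responding. A short sketch of the edit distance those passages describe (the helper name and example strings are hypothetical):

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions to
    turn the reference token sequence into the hypothesis (the edit
    distance Thomson ¶0365/¶0878 describe)."""
    ref, hyp = ref.split(), hyp.split()
    # prev[j] holds the distance between the processed prefix of ref
    # and hyp[:j]; rows are rolled to keep memory at O(len(hyp))
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical golden transcript vs. ASR hypothesis:
golden = "please join the zoom meeting now"
hypothesis = "please join the soon meeting"
d = levenshtein(golden, hypothesis)  # one substitution plus one deletion
print(d, round(d / len(golden.split()), 2))  # distance and word error rate
```

Dividing the distance by the reference length gives the word error rate used when Thomson scores a hypothesis against the reference transcription.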

Prosecution Timeline

Apr 04, 2024
Application Filed
Feb 11, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology:

Patent 12598426: CHANGE OF A MODE FOR CAPTURING IMMERSIVE AUDIO (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596525: METHOD TO DETERMINE INTENDED DIRECTION OF A VOCAL COMMAND AND TARGET FOR VOCAL INTERACTION (granted Apr 07, 2026; 2y 5m to grant)
Patent 12592675: AUDIO DEVICE WITH MICROPHONE AND MEDIA MIXING (granted Mar 31, 2026; 2y 5m to grant)
Patent 12593010: COMMUNICATION ASSEMBLY (granted Mar 31, 2026; 2y 5m to grant)
Patent 12587448: AI-BASED NETWORK TROUBLESHOOTING WITH EXPERT FEEDBACK (granted Mar 24, 2026; 2y 5m to grant)

Study what changed to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
With Interview: 86% (+3.4%)
Median Time to Grant: 2y 4m
PTA Risk: Low

Based on 650 resolved cases by this examiner. Grant probability derived from career allow rate.
