Prosecution Insights
Last updated: April 19, 2026
Application No. 18/627,098

CONTRASTIVE LEARNING WITH ADVERSARIAL DATA FOR ROBUST SPEECH TRANSLATION

Non-Final OA (§103)

Filed: Apr 04, 2024
Examiner: PATEL, YOGESHKUMAR G
Art Unit: 2691
Tech Center: 2600 — Communications
Assignee: Zoom Video Communications, Inc.
OA Round: 1 (Non-Final)

Grant Probability: 83% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 4m
With Interview: 86%

Examiner Intelligence

Career Allow Rate: 83% (538 granted / 650 resolved), above average (+20.8% vs TC avg)
Interview Lift: +3.4% (minimal), from resolved cases with interview
Avg Prosecution (typical timeline): 2y 4m; 17 applications currently pending
Career history: 667 total applications across all art units
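The headline figures above are simple ratios and can be sanity-checked. The snippet below is illustrative only; the tool's exact rounding rules are an assumption:

```python
# Sanity-check the dashboard's examiner statistics.
granted, resolved = 538, 650      # career grant counts shown above
interview_lift = 3.4              # percentage points, from interview data

allow_rate = granted / resolved   # career allow rate as a fraction
print(f"allow rate: {allow_rate:.1%}")                          # rounds to the displayed 83%
print(f"with interview: {allow_rate * 100 + interview_lift:.0f}%")
```

Note that 538/650 is 82.8%, which the dashboard rounds to 83%, and adding the 3.4-point interview lift yields the displayed 86%.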

Statute-Specific Performance

§101: 4.7% (-35.3% vs TC avg)
§103: 61.9% (+21.9% vs TC avg)
§102: 14.4% (-25.6% vs TC avg)
§112: 14.2% (-25.8% vs TC avg)

Based on career data from 650 resolved cases; Tech Center averages are estimates.

Office Action (§103)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Thomson et al. (US 2020/0175961) in view of Clinchant et al. (US 2023/084333) further in view of Hu et al. (Contrastive Learning for Robust Neural Machine Translation with ASR Errors, 2022).

Regarding Claim 1, Thomson discloses a method (title, abstract, Figs. 1-84) comprising: inputting a speech signal (Thomson Fig. 1: 110, 112; ¶0106) to an automatic speech recognition model (Thomson Fig. 1: ASR 120a-120c) to obtain a transcript hypothesis (Thomson ¶0256 discloses the decoder 510 determines a series of words, denoted as a hypothesis, for use in generating a transcription; Fig. 5) including a first sequence of tokens (Thomson ¶0334 discloses the hypotheses can be represented as a string of tokens [i.e., sequence of tokens]), wherein the speech signal is associated with a golden transcript (Thomson ¶0236 discloses the reference transcription [i.e., golden transcript] can be based on audio collected from a production service that is transcribed offline. One example of transcribing audio offline can include the steps of configuring a transcription management, transcription, and editing tool to (a) send an audio sample to a first transcriber for transcription, then to a second transcriber to check the results of the first transcriber, (b) send multiple audio samples to a first transcriber and at least some of the audio samples to a second transcriber to check quality, or (c) send an audio sample to two or more transcribers and to use a third transcriber to check results when the first two transcribers differ. Additionally, or alternatively, the accuracy tester 410 can generate a reference transcription in real time and automatically compare the reference to the hypothesis to determine an error rate in real time. ¶0237 discloses a reference transcription can be generated by sending the same audio segment to multiple different revoiced transcription units that each transcribe the audio. Alternatively, or additionally, the same audio segment can be sent to multiple different non-revoiced transcription units that each transcribe the audio. The output of some or all of the non-revoiced and revoiced transcription units can be provided to a fuser that can combine the transcriptions into a reference transcription. ¶0157 discloses confidence models) including a second sequence of tokens (Thomson ¶0344 discloses the first denormalize text process 1404a, the second denormalize text process 1404b, and the third denormalize text process 1404c can be configured to receive the tokens from the first transcription generation process 1402a, the second transcription generation process 1402b, and the third transcription generation process 1402c, respectively); inputting the first sequence of tokens to an encoder of a neural machine translation model to obtain a first sentence representation (Thomson Fig. 8: 8008; ¶0618).

Thomson may not explicitly disclose inputting the first sequence of tokens to an encoder of a neural machine translation model to obtain a first sentence representation; inputting the second sequence of tokens to the encoder of the neural machine translation model to obtain a second sentence representation; determining a contrastive loss function based on the first sentence representation and the second sentence representation; and training the encoder of the neural machine translation model based on the contrastive loss function.

However, Clinchant (title, abstract, Figs. 1-6) teaches inputting the first sequence of tokens to an encoder of a neural machine translation model (Clinchant Fig. 2: 204 NMT Model, encoder 214) to obtain a first sentence representation (Clinchant ¶0047 discloses for the clean sequence pair, the clean source and target sequences are input [fed] as inputs 217, 219 to the encoder 214 and the decoder 216, respectively); inputting the second sequence of tokens to the encoder of the neural machine translation model to obtain a second sentence representation (Clinchant ¶0047 discloses for the noisy sequence pair, the source and target sequences of the noisy sequence pair [at least one being noisy], can be input [fed] as inputs 217, 219 to the encoder 214 and the decoder 216, respectively).
Thomson and Clinchant are analogous art as they pertain to speech translation. Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify language translation (as taught by Thomson) to sample one or more token perturbations [e.g., replacements, substitutions, insertions, deletions], by the adversarial generator, in the clean source sequence, the clean target sequence, or both, to replace the masked tokens in such sequences (as taught by Clinchant, ¶0043) to preserve a meaning of the masked token, while another, competing objective of this sampling can be to maximize loss [e.g., translation loss] in the neural language model (Clinchant, ¶0043).

Further, Hu (title, abstract) teaches determining a contrastive loss function (Hu pages 83-85, last para in each: "contrastive learning" and equations 1-5) based on the first sentence representation (Hu page 84, last para, "expose the [NMT] model to ... incorrect input sentences" [the noisy inputs]) and the second sentence representation (Hu page 84, last para, "expose the [NMT] model to ... valid ... input sentences" [the correct/non-noised inputs]); and training the encoder of the neural machine translation model based on the contrastive loss function (Hu abstract: "to train an NMT model being robust to ASR output, we take contrastive learning framework to close the gap among representations of original input and its perturbed counterpart"; page 82, second para: "we automatically convert source-side sentences in NMT training dataset from non-noised version to noised version by mimicking typical types of ASR errors"; pages 83-85, last para in each: "contrastive learning" and equations 1-5). Thomson, Clinchant, and Hu are analogous art as they pertain to speech translation.

Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Thomson in view of Clinchant in light of the teachings of Hu to "automatically convert source-side sentences in NMT training dataset from non-noised version to noised version by mimicking typical types of ASR errors" (as taught by Hu, page 82, second para) since "as of yet there are no publicly available bilingual parallel corpora with both naturally occurring noisy inputs and their corresponding correct inputs, thus previous studies could not tell to what extent the translation performance would drop when we feed the NMT models with noised sentences, rather than non-noised ones" (Hu, page 82, first para).

Regarding Claim 2, Thomson in view of Clinchant and Hu discloses the method of claim 1. But Thomson may not explicitly disclose: training the neural machine translation model based on source-target text translation pairs. However, Clinchant (title, abstract, Figs. 1-6) teaches training the neural machine translation model based on source-target text translation pairs (Clinchant ¶0033 discloses Fig. 1 illustrates an example method 100 for training a neural model, which in the example described herein is an autoregressive encoder-decoder model, examples of which include a neural language model such as a neural machine translation [NMT] model. Fig. 2 shows an example architecture 200 for carrying out the method 100. An example NMT is a bilingual translation model. ¶0037 discloses referring to Fig. 1, in the example method 100, a plurality of clean sequence pairs is received at 102. For example, the processor 202 can receive the clean sentence pairs from a batch stored locally to the processor or remotely, such as via a network.
The batch can be stored in any suitable storage, such as but not limited to a database 220 in communication with the processor 202. An example batch is provided by a dataset, including [for training machine translation models] available machine translation training datasets known to those of ordinary skill in the art. The batch can include a [clean] parallel corpus having a [clean] source side and [clean] target side. ¶0038 discloses each clean sequence pair includes a clean source sequence [e.g., of tokens such as words or subwords], such as a clean source sentence, and a clean target sequence [e.g., of tokens such as words or subwords], such as a clean target sentence). Thomson and Clinchant are analogous art as they pertain to speech translation. Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify language translation (as taught by Thomson) to sample one or more token perturbations [e.g., replacements, substitutions, insertions, deletions], by the adversarial generator, in the clean source sequence, the clean target sequence, or both, to replace the masked tokens in such sequences (as taught by Clinchant, ¶0043) to preserve a meaning of the masked token, while another, competing objective of this sampling can be to maximize loss [e.g., translation loss] in the neural language model (Clinchant, ¶0043).

Regarding Claim 3, Thomson in view of Clinchant and Hu discloses the method of claim 1, comprising: iteratively switching between batches of training the encoder of the neural machine translation model based on transcript hypotheses from the automatic speech recognition model and corresponding golden transcripts using the contrastive loss function, and batches of training the neural machine translation model based on source-target text translation pairs (Thomson ¶0500 discloses alternatively, or additionally, Figs. 18-30, among others, describe various systems and methods that can switch between the different transcription units providing transcriptions for audio of a communication session during the communication session. In these and other embodiments, a criteria for selecting between transcription units can include the estimated accuracy of each transcription unit. For example, when a non-revoicing transcription unit provides an estimated accuracy that satisfies a threshold, the non-revoicing transcription unit may be selected over a revoicing transcription unit. ¶0570 discloses the scorer 2216 [Fig. 22] can be configured to evaluate similarity between two token strings, such as two transcriptions. In some embodiments, the scorer 2216 can compare hypotheses transcriptions, from transcription units or ASR systems, as illustrated in Figs. 20 and 21. In these and other embodiments, the output of the scorer 2216 can be referred to as an agreement rate. The scorer 2216 can compare a reference transcription [i.e., a transcription assumed to be correct; referred to here as the golden transcript] and a hypothesis transcription. In these and other embodiments, the output of the scorer 2216 can be referred to as an accuracy score with respect to the accuracy of the hypothesis transcription with respect to the reference transcription).

Regarding Claim 4, Thomson in view of Clinchant and Hu discloses the method of claim 1, wherein a special contrastive loss token is appended to the first sequence of tokens when it is input to the encoder of the neural machine translation model to obtain the first sentence representation (Thomson ¶0334 discloses each of the ASR systems 1320 can be configured to generate a transcription based on the audio received by the ASR systems 1320. The transcriptions, referred to sometimes as "hypotheses," can have varying degrees of accuracy depending on the particular configuration of the ASR systems 1320.
The hypotheses can be represented as a string of tokens. The string of tokens can include one or more of sentences, phrases, or words. A token can include a word, subword, character, or symbol).

Regarding Claim 5, Thomson in view of Clinchant and Hu discloses the method of claim 1, wherein determining the contrastive loss function comprises: determining a distance between the first sentence representation and the second sentence representation (Thomson ¶0365 discloses the align text process 1406 can align the tokens, e.g., the words in the above hypotheses, so that as many identical tokens as possible lie in each token group. In some embodiments, the alignment can reduce the edit distance [the minimum number of insertions, deletions, and substitutions to convert one string to the other] or Levenshtein distance between denormalized hypotheses provided to the align text process 1406 after the denormalized hypotheses have been aligned. Additionally, or alternatively, the alignment can reduce the edit or Levenshtein distance between each aligned denormalized hypothesis and the fused transcription. ¶0416 discloses the align text process 1406 and/or voting process 1408 can be configured to utilize a Viterbi search or variation of the Viterbi search adapted to measuring edit distance between tokens to align token sequences. ¶0878 discloses the comparer 4504 can compare the monitored transcription with the reference transcription by determining an edit distance or Levenshtein distance therebetween).

Regarding Claim 6, Thomson in view of Clinchant and Hu discloses the method of claim 1, wherein determining the contrastive loss function comprises: determining a distance between the first sentence representation and a negative example that is constructed from other noisy and clean sentences in a batch of speech signal training data (Thomson ¶0878 discloses the comparer 4504 can be configured to compare the monitored transcription with the reference transcription. Additionally, or alternatively, the comparer 4504 can compare the monitored transcription with the reference transcription by determining an edit distance or Levenshtein distance therebetween. In some embodiments, the comparison process by the comparer 4504 can be implemented as follows: (1) the comparer 4504 may align the monitored transcription and the reference transcription; (2) the comparer 4504 may compare each aligned pair of tokens from the monitored transcription and the reference transcription. The pair of tokens may include a first token from the monitored transcription and a second token from the reference transcription; (3) the comparer 4504 may provide an indication, such as a match or no match with respect to each aligned pair of tokens, to the counter 4506. For example, the comparer 4504 may output a zero when a pair of tokens match and a one if there is no match between a pair of tokens; and (4) the number of differences is counted or averaged by the counter 4506 to determine an average disagreement rate, edit distance, and/or Levenshtein distance).

Regarding Claim 7, Thomson in view of Clinchant and Hu discloses the method of claim 1, wherein determining the contrastive loss function comprises: determining a distance between the second sentence representation and a negative example that is constructed from other noisy and clean sentences in a batch of speech signal training data (Thomson ¶0878 discloses the comparer 4504 can be configured to compare the monitored transcription with the reference transcription. Additionally, or alternatively, the comparer 4504 can compare the monitored transcription with the reference transcription by determining an edit distance or Levenshtein distance therebetween.
In some embodiments, the comparison process by the comparer 4504 can be implemented as follows: (1) the comparer 4504 may align the monitored transcription and the reference transcription; (2) the comparer 4504 may compare each aligned pair of tokens from the monitored transcription and the reference transcription. The pair of tokens may include a first token from the monitored transcription and a second token from the reference transcription; (3) the comparer 4504 may provide an indication, such as a match or no match with respect to each aligned pair of tokens, to the counter 4506. For example, the comparer 4504 may output a zero when a pair of tokens match and a one if there is no match between a pair of tokens; and (4) the number of differences is counted or averaged by the counter 4506 to determine an average disagreement rate, edit distance, and/or Levenshtein distance).

System Claims 8-14 are rejected for the same reasons as set forth in Claims 1-7. Non-transitory computer-readable storage medium Claims 15-20 are rejected for the same reasons as set forth in Claims 1-7 (Thomson ¶0217, ¶0277, ¶0282, ¶0287, ¶0619: the method can be performed based on the execution of instructions stored on one or more non-transitory computer-readable media).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL whose telephone number is (571) 272-3957. The examiner can normally be reached 7:30 AM-4 PM PST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Duc Nguyen, can be reached at (571) 272-7503.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/YOGESHKUMAR PATEL/
Primary Examiner, Art Unit 2691
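The core theory of the rejection pairs Hu's contrastive framework with ASR-noised inputs: pull the encoder representation of a noisy hypothesis toward the representation of its golden transcript while pushing it away from the other sentences in the batch, which serve as in-batch negatives. Below is a minimal NumPy sketch of such an InfoNCE-style objective; the function name, embedding dimensions, and temperature are illustrative assumptions, not taken from the cited references:

```python
import numpy as np

def info_nce_loss(clean, noisy, temperature=0.1):
    """Contrastive loss over a batch of sentence representations: each
    noisy (ASR-hypothesis) vector is pulled toward its own clean
    (golden-transcript) vector, with the rest of the batch acting as
    in-batch negatives."""
    # L2-normalize so dot products become cosine similarities
    clean = clean / np.linalg.norm(clean, axis=1, keepdims=True)
    noisy = noisy / np.linalg.norm(noisy, axis=1, keepdims=True)
    logits = noisy @ clean.T / temperature          # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    # Row-wise log-softmax; the diagonal entries are the positive pairs
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 16))                    # stand-in encoder outputs
noisy = clean + 0.1 * rng.normal(size=(8, 16))      # lightly perturbed "ASR" versions
print(f"contrastive loss: {info_nce_loss(clean, noisy):.3f}")
```

Minimizing this loss closes the gap between clean and noisy representations, which is the behavior the examiner attributes to Hu's equations 1-5.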
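Note that the Thomson passages cited for Claims 5-7 (¶0365, ¶0878) describe token-level edit (Levenshtein) distance between transcription strings rather than a distance between encoder representations, a distinction that may matter in responding. A short sketch of the edit distance those passages describe (the helper name and example strings are hypothetical):

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions to
    turn the reference token sequence into the hypothesis (the edit
    distance Thomson ¶0365/¶0878 describe)."""
    ref, hyp = ref.split(), hyp.split()
    # prev[j] holds the distance between the processed prefix of ref
    # and hyp[:j]; rows are rolled to keep memory at O(len(hyp))
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical golden transcript vs. ASR hypothesis:
golden = "please join the zoom meeting now"
hypothesis = "please join the soon meeting"
d = levenshtein(golden, hypothesis)  # one substitution plus one deletion
print(d, round(d / len(golden.split()), 2))  # distance and word error rate
```

Dividing the distance by the reference length gives the word error rate used when Thomson scores a hypothesis against the reference transcription.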

Prosecution Timeline

Apr 04, 2024
Application Filed
Feb 11, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology:

Patent 12598426: CHANGE OF A MODE FOR CAPTURING IMMERSIVE AUDIO (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596525: METHOD TO DETERMINE INTENDED DIRECTION OF A VOCAL COMMAND AND TARGET FOR VOCAL INTERACTION (granted Apr 07, 2026; 2y 5m to grant)
Patent 12592675: AUDIO DEVICE WITH MICROPHONE AND MEDIA MIXING (granted Mar 31, 2026; 2y 5m to grant)
Patent 12593010: COMMUNICATION ASSEMBLY (granted Mar 31, 2026; 2y 5m to grant)
Patent 12587448: AI-BASED NETWORK TROUBLESHOOTING WITH EXPERT FEEDBACK (granted Mar 24, 2026; 2y 5m to grant)

Study what changed to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
With Interview: 86% (+3.4%)
Median Time to Grant: 2y 4m
PTA Risk: Low

Based on 650 resolved cases by this examiner. Grant probability derived from career allow rate.
