Prosecution Insights
Last updated: April 19, 2026
Application No. 17/733,956

PROVIDING MULTISTREAM MACHINE TRANSLATION DURING VIRTUAL CONFERENCES

Non-Final OA §103
Filed: Apr 29, 2022
Examiner: ZHU, RICHARD Z
Art Unit: 2654
Tech Center: 2600 — Communications
Assignee: Zoom Video Communications, Inc.
OA Round: 5 (Non-Final)

Grant Probability: 69% (Favorable)
OA Rounds: 5-6
To Grant: 3y 2m
With Interview: 85%

Examiner Intelligence

Career Allow Rate: 69%, above average (498 granted / 718 resolved; +7.4% vs TC avg)
Interview Lift: +15.4% (resolved cases with interview vs. without); strong
Typical Timeline: 3y 2m avg prosecution; 32 currently pending
Career History: 750 total applications across all art units
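The headline figures follow directly from the career counts above. A minimal sketch of the arithmetic, under the assumption (suggested by the page's own rounding, 69% + 15.4 points ≈ 85%) that the interview lift is an additive percentage-point adjustment:

```python
# Sketch of how the dashboard's headline figures follow from the raw counts.
# Assumption (not stated on the page): the interview lift is additive.
granted, resolved = 498, 718        # examiner's career totals shown above
allow_rate = granted / resolved     # 0.6936 -> displayed as "69%"
interview_lift = 0.154              # +15.4 points among interviewed cases
with_interview = allow_rate + interview_lift

print(f"career allow rate: {allow_rate:.1%}")      # 69.4%
print(f"with interview:    {with_interview:.1%}")  # 84.8%, shown as "85%"
```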

Statute-Specific Performance

§101: 16.0% (-24.0% vs TC avg)
§103: 54.5% (+14.5% vs TC avg)
§102: 19.7% (-20.3% vs TC avg)
§112: 4.2% (-35.8% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 718 resolved cases
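Read numerically, all four deltas resolve against the same baseline: adding each delta back to its rate yields 40.0% in every row, so the "black line" appears to be a single 40% Tech Center estimate rather than a per-statute figure. A quick check (editorial arithmetic, not the page's stated methodology):

```python
# Verify that each statute's "vs TC avg" delta implies the same baseline.
rates  = {"§101": 16.0, "§103": 54.5, "§102": 19.7, "§112": 4.2}      # allow %
deltas = {"§101": -24.0, "§103": 14.5, "§102": -20.3, "§112": -35.8}  # vs TC avg
for statute, rate in rates.items():
    print(f"{statute}: implied TC baseline = {rate - deltas[statute]:.1f}%")
# Prints 40.0% for all four statutes.
```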

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/06/2026 has been entered.

Status of the Claims

Claims 1-2, 4-5, 7-10, 12-16, and 18-20 are pending.

Response to Applicant's Arguments

In response to Applicant's argument that "The Office Action asserts that the plain meaning of 'allocate' is to 'set apart or earmark; designate.' Office Action, p. 4. Under this definition of 'allocate,' Liberman does not allocate any STTE to a virtual conference. No STTE is set apart or earmarked for a conference and for the duration of the conference. Instead, the STTEs remain available for use by any conference then in-session. Thus, if the TSM decides to use a particular STTE for a particular audio stream in one conference, it does so, and if another conference needs to use the same STTE, it also does so. Thus, the STTE itself is never allocated to any conference. It remains a generally available resource for any conference to use for translation. Instead the Office Action is interpreting 'allocate' to mean 'use,' which is a broader term that simply involves making use of an STTE. While Liberman describes using STTEs, it does not describe allocating any to a conference for the duration of the conference":

Exemplary claim 1 recites "exclusively allocating, for each determined source language and for the duration of the virtual conference, (1) a transcription process from a set of transcription processes and, for each determined source and target language pair and for the duration of the virtual conference, (2) a translation process from a set of translation processes to the virtual conference…". According to Merriam-Webster, the verb "earmark" is defined as: to designate (something, such as funds) for a specific use or owner.[1] Therefore, interpreting "allocate" as "to set apart or earmark; designate" means that exemplary claim 1 requires exclusively designating a particular transcription process for each determined source language for the duration of the virtual conference.

Liberman teaches an audio module 300 comprising a plurality of session audio modules 305A-N, shown in Fig. 3, with one session audio module 305A-N per session that the audio module 300 handles (¶79). In other words, audio module 300 handles each video communication session (i.e., videoconference per ¶12) using a particular session audio module 305A-N (i.e., exclusively), each session audio module 305A-N having a respective STTE 365A-X and TE 367A-367X. Therefore, when a first session handled by session audio module 305A has its TSM 360 select (i.e., designate or allocate) a STTE 365 from the STTE 365A-X in session audio module 305A, while a second session handled by session audio module 305B has its respective TSM 360 select a corresponding STTE 365 from the STTE 365A-X in session audio module 305B, the two sets of STTE 365A-X of the respective session audio modules are mutually exclusive. In other words, where the first conference corresponding to session audio module 305A uses STTE 365A to perform a first transcription process and the second conference corresponding to session audio module 305B uses a corresponding STTE 365A to perform a different transcription process, the two session audio modules 305A-B are using two different, mutually exclusive STTE 365A engines.

Further, each session audio module 305 comprises a translator selector module ("TSM") 360 determining which audio stream to transfer to which speech-to-text engine ("STTE") 365A-X, and TSM 360 sends command information, including the language of the audio stream and the languages to which the stream should be translated, to the STTE 365A-X together with the audio streams (¶86); each STTE 365 may be used for one or more languages to convert audio streams into a stream of text (¶85). In other words, in the case where each STTE 365 is used for one language (¶85) and each session audio module 305A-N handles the duration of an audio session / virtual conference, Liberman teaches allocating a transcription process corresponding to a STTE 365 of a particular source language for each audio stream with a particular determined source language for the duration of the virtual conference. Here, even if multiple audio streams from multiple endpoints are allocated a particular STTE 365, Liberman still teaches the limitation "exclusively allocating, for each determined source language and for the duration of the virtual conference, a transcription process from a set of transcription processes" because Liberman's command information indicates that audio streams from the endpoints all have the same source language and thus exclusively designates the particular STTE 365 / transcription process for the endpoints with the same determined source language for the duration of the virtual conference session handled by the particular session audio module 305A-N. Therefore, Liberman does teach exclusively allocating, for each determined source language (which does not limit how many audio streams / endpoints can have the same source language) and for the duration of the virtual conference, a transcription process from a set of transcription processes.

However, Liberman does not teach "exclusively allocating, …, for each determined source and target language pair and for the duration of the virtual conference, (2) a translation process from a set of translation processes to the virtual conference…" In view of this combination of limitations set forth in claims 1, 14, and 19, the anticipation rejection over Liberman has been withdrawn. Upon further search and consideration, please see the new citations of references set forth in detail below.

Claim Rejections - 35 USC § 103

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 103 that form the basis for the rejections under this section made in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 4-5, 7-9, 12-15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Liberman et al. (US 2011/0246172 A1) in view of Cordell et al. (US 2017/0046337 A1).

Regarding Claims 1, 14, and 19, Liberman discloses a system (Fig. 1) comprising: a communications interface (¶55; Fig. 2, Network Interface 210); a non-transitory computer-readable medium (¶110, memory storing a software program to be loaded into an appropriate processor); and one or more processors, the one or more processors communicatively coupled to the communications interface and the non-transitory computer-readable medium (¶110), the one or more processors configured to execute processor-executable instructions stored in the non-transitory computer-readable medium to: host, by a conference provider, a virtual conference between a plurality of client devices exchanging audio streams (¶12 and ¶48, a multipoint control unit (MCU) is a conference controlling entity that receives several channels from access ports (i.e., Fig. 1, endpoint devices 130A-130N), processes audiovisual signals, and distributes them to connected channels; see Figs. 4A-4B); during the virtual conference (Fig. 4 and ¶79, audio module 300 handles each session with a session audio module 305), generate audio segments of exchanged audio streams (¶79, each session audio module 305 receives a plurality of audio streams from endpoints 130A-N; ¶80, transfer the audio streams to DTMF module 225 for conversion into digital data); determine a source language for each audio segment based on analyzing the respective audio segment (¶84, TSM 360 receives audio streams from AD 310A-D; ¶86 and ¶91, TSM 360 accesses an iListen or ViaVoice module to identify a language spoken in the audio streams to determine which audio stream to transfer to which STTE 365A-X according to the language of the audio stream; per ¶86, each STTE 365A-X is capable of identifying the language of the audio stream itself and adapting itself to translate the received audio to the needed language) and determine a target language for each client device of the plurality of client devices (¶87, determine the language to which the stream should be translated by identifying the endpoint to which the audio stream should be sent); exclusively allocate, for each determined source language (¶86, command information including the language of the audio stream) and for the duration of the virtual conference (¶79, one session audio module 305A-N per session that the audio module 300 handles), a transcription process from a set of transcription processes (¶85 and Fig. 3, STTE 365A-365X receive audio streams and convert the audio streams into a stream of text, each STTE 365 may be used for one or more languages; in the example where each STTE 365A-X is used for one language, per ¶86, when the command information indicates which audio stream to transfer to which STTE 365A-X according to the language of the audio stream, TSM 360 designates (i.e., exclusively allocates) a particular STTE 365A-X to perform speech-to-text / transcription processing of all audio streams with the particular language by sending command information to the STTE 365A-X together with the audio streams) and, for each determined source and target language pair and for the duration of the virtual conference, a translation process from a set of translation processes (¶92 and Fig. 3, TE 367A-367X, each TE 367 serves a different language; e.g., Fig. 4A showing Japanese subtitles 410 and 412 resulting from transcription of corresponding audio streams / audio segments 410 and 412 by an allocated / selected Japanese STTE 365 and a corresponding Japanese TE 367; Fig. 4B showing English subtitles 414, 416, and 418 resulting from transcription of corresponding audio streams / audio segments 414, 416, and 418 by an allocated / selected English STTE 365 and a corresponding English TE 367) to the virtual conference based on the respective determined source language for the respective audio segment and the respective target language for a respective client device (¶¶85-86, determine which audio stream to transfer to which STTE 365A-X to convert the audio streams into a stream of text based on the language identified in the audio stream by TSM 360 per ¶91 and a selected STTE 365 per ¶86; ¶¶92-93, STTE 365 forwards the converted text to the corresponding TE 367A-X based on the identification of the endpoint on which the stream of text will be displayed, each TE 367 serving a different language), the sets of transcription and translation processes each comprising a respective plurality of transcription (¶85, each STTE 365 may be used for one or more languages) or translation processes (¶92, each TE 367 may serve a different language); generate, for each audio segment by a corresponding transcription process, a transcription in the respective source language (¶85, the STTE 365A-X may receive the audio streams and convert the audio streams into a stream of text); translate, by the respective translation process, the respective transcriptions into the respective target languages to create a corresponding translation (¶94, TE 367 outputs the translated text; see Fig. 4A, ¶101, translation of English and Russian into Japanese for a Japanese conferee / endpoint; Fig. 4B, translation of Russian, Japanese, and Chinese into English); and provide, during the virtual conference (Abstract, the system provides real-time translation of speech by conferees), the translations to the corresponding client devices (¶28, the MCU may display on each endpoint screen subtitles of one or more other conferees simultaneously; see Figs. 4A-4B).

Liberman does not teach exclusively allocating, for each determined source and target language pair, a translation process from a set of translation processes to the virtual conference.
Cordell teaches a language interpretation / translation platform allocating language interpretation / translation resources (Abstract), exclusively allocating, for each determined source and target language pair (¶26 and ¶28, user 102 requests language interpretation / translation services for a video conferencing channel of communication and language resource engine 201 uses API 202 to negotiate rates of capable and available resources) and for the duration of a virtual conference (¶26 and ¶28, when the user requests language interpretation / translation services for a video conferencing channel of communication, language resource engine 201 dynamically allocates and reallocates resources for language interpretation / translation services based on the particular requirements of the user and the availability of requisite resources; per ¶24, if video conferencing resources 105 are being fully utilized, the user may have to wait or not obtain a voice language interpretation / translation resource (i.e., the allocation is exclusive)), a translation process from a set of translation processes (¶23, devices providing voice-based language interpretation / translation services; ¶28, human language interpreters / translators or machine language interpreters / translators) to the virtual conference based on the respective determined source language and respective target language (¶21 in view of ¶8, automatically interact with an application stored on a user computing device to obtain customer identifying information to obtain a particular language interpretation / translation resource; ¶30, API 202 automatically obtains customer identifying data and other types of data to determine the type of language interpreter / translator based on language requirements).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to exclusively allocate, for each determined source and target language pair and for the duration of the virtual conference, a translation process from a set of translation processes to the virtual conference based on the respective determined source language for the respective audio segment and the respective target language for a respective client device (Cordell, ¶26, dynamically allocate and reallocate resources for language interpretation / translation services based upon the particular requirements of the user, by employing the techniques of Liberman, ¶87, determining the language of the audio stream by identifying the endpoint that is the source of the audio stream and determining the language to which the stream should be translated by identifying the endpoint to which the audio stream should be sent) in order to implement a language interpretation / translation resource system providing a marketplace that provides language interpretation / translation services to customers in need of language interpretation and translation by negotiating interpreter fee rates, payment terms, and service details (Cordell, ¶¶19-20).

Further regarding Claim 19, Liberman discloses a non-transitory computer-readable medium comprising processor-executable instructions configured to cause one or more processors to implement the method of claim 1 and the system of claim 14 (¶110, memory storing a software program to be loaded into an appropriate processor).
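To make the disputed "exclusive allocation" limitation concrete, the sketch below is an editorial illustration of the claim language (hypothetical names; not code from Liberman, Cordell, or the application): an engine is set apart for a source language, or for a (source, target) language pair, the first time it is needed, and it is not returned to the shared pool until the conference ends. Under applicant's reading, Liberman's TSM instead follows a "use" model in which engines remain generally available to every in-session conference.

```python
# Editorial sketch of the claimed allocation scheme (hypothetical names;
# an illustration of the claim language, not any reference's implementation).
class ConferenceAllocator:
    def __init__(self, transcription_pool, translation_pool):
        self.transcription_pool = transcription_pool  # shared, free STT engines
        self.translation_pool = translation_pool      # shared, free translators
        self.transcribers = {}  # source language -> engine earmarked for it
        self.translators = {}   # (source, target) pair -> earmarked engine

    def transcriber_for(self, source):
        # "Exclusively allocating, for each determined source language":
        # the first segment in a new source language sets an engine apart;
        # later segments in that language reuse the same earmarked engine.
        if source not in self.transcribers:
            self.transcribers[source] = self.transcription_pool.pop()
        return self.transcribers[source]

    def translator_for(self, source, target):
        # The same earmarking, keyed by (source, target) language pair.
        if (source, target) not in self.translators:
            self.translators[(source, target)] = self.translation_pool.pop()
        return self.translators[(source, target)]

    def end_conference(self):
        # "For the duration of the virtual conference": engines return to
        # general availability only when the conference ends.
        self.transcription_pool.extend(self.transcribers.values())
        self.translation_pool.extend(self.translators.values())
        self.transcribers.clear()
        self.translators.clear()
```

On the examiner's reading, Liberman's one-session-audio-module-per-session architecture already supplies this exclusivity for transcription, and Cordell is relied on for the per-language-pair allocation of translation processes.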
Regarding Claim 4, Liberman discloses that determining the target language for a first client device of the plurality of client devices comprises receiving, by the conference provider (¶12 and ¶48, a multipoint control unit (MCU) or the one or more servers), an indication of the target language for the first client device based on a selection by a first user of the first client device (¶21, implementing a feedback mechanism to inform the conferee of the automatic identification of the conferee's language, allowing the conferee to override the automatic decision using the "click and view" option; e.g., a first endpoint from endpoints EP 130A-130N in Fig. 3; per ¶90, a receiving conferee at the first endpoint may define the language and the endpoint from which the conferee wants to get the subtitles by entering the information using the click and view function); and determining the target language for the second client device of the plurality of client devices comprises receiving, by the conference provider, an indication of the target language for the second client device based on a selection by a second user of the second client device (e.g., a second endpoint from endpoints EP 130A-130N in Fig. 3; ¶90, a receiving conferee at the second endpoint may define the language and the endpoint from which the conferee wants to get the subtitles by entering the information using the click and view function).

Regarding Claims 5 and 18, Liberman discloses wherein: determining the target language for a first client device of the plurality of client devices comprises determining, by the conference provider, the target language for the first client device based on a location of the first client device (per ¶87 and ¶93, determining the language to which the stream should be translated by identifying the endpoint to which the audio stream should be sent; e.g., ¶101, determine that for a Japanese endpoint, the Russian and English segments should be translated into Japanese); and determining the target language for a second client device of the plurality of client devices comprises determining, by the conference provider, the target language for the second client device based on a location of the second client device (per ¶87 and ¶93, determining the language to which the stream should be translated by identifying the endpoint to which the audio stream should be sent; e.g., ¶102, determine that for a U.S. endpoint, the Russian, Japanese, and Chinese segments in Fig. 4B should be translated into English).

Regarding Claim 7, Liberman discloses punctuating the transcription (¶92, after conversion of the audio streams into a text stream, STTE 365 may arrange the text such that it will have periods and commas in appropriate places) and providing the respective transcriptions to the respective client devices (¶17, the MLTV-MCU displays one or more translations of one or more audio streams as subtitles on one or more endpoint screens).

Regarding Claims 8 and 15, Liberman discloses, prior to providing the translations to the respective client devices, attributing the respective translation to a respective speaker (¶¶22-23, distinguish each audio stream with a different indicator comprising the name of the conferee / endpoint; ¶33, generate a conference script in the language selected by the conferee with an indication of which text was said by which conferee).
Regarding Claim 9, Liberman discloses wherein the allocated translation processes receive the respective transcriptions in real time as they are generated (¶26, display converted text in the original language to the speaking conferee to solicit feedback as to the speech-to-text conversion accuracy; in view of ¶22 and the Abstract, configure the MCU to translate and display a plurality of received audio streams as subtitles simultaneously and in real time; i.e., respective real-time translation into the target language requires real-time transcription in the respective source language).

Regarding Claims 12 and 20, Liberman discloses determining which of the plurality of client devices for which to perform translation (¶62, CTM 222 determines which of the received audio streams need to be translated) based on which of the plurality of client devices is most active during the virtual conference (¶23, mark the main speaker in underlined and bold letters based on the conferee whose audio energy level was above the audio energy of the other conferees for a certain percentage of a certain period, with the name of the conferee / endpoint that has been translated at the beginning of the subtitle; i.e., subject audio streams from the conferee / endpoint corresponding to the main speaker, whose audio energy level is above the audio energy of the other conferees, to transcription and translation in order to generate the name of the conferee / endpoint at the beginning of the subtitle).

Regarding Claims 13 and 20, Liberman discloses determining which of the plurality of client devices for which to perform translation based on receiving a selection of particular audio streams from the plurality of client devices to transcribe (¶62, CTM 222 determines which of the received audio streams need to be translated and transfers the identified audio streams that need translation to a speech-to-text engine).

Claims 2 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Liberman et al. (US 2011/0246172 A1) and Cordell et al. (US 2017/0046337 A1) as applied to claims 1 and 14, in view of Seligman et al. (US 9,552,354 B1).

Regarding Claims 2 and 16, Liberman does not disclose wherein: the translation process utilizes a first dictionary in a first language to translate transcriptions in the first language; and the translation process utilizes a second dictionary in a second language to translate the transcriptions in the second language. Seligman discloses a cross-lingual communication system (col. 18, lines 62-63) comprising a speech recognition module to recognize speech signals for translation into letters (col. 7, lines 30-35), i.e., transcription. The system utilizes a first dictionary in a first language to translate transcriptions in a first language (col. 19, lines 6-10, clicking on an English word (i.e., English transcription) accesses an English-Spanish dictionary) and a second dictionary in a second language to translate transcriptions in a second language (col. 19, lines 6-10, clicking on a Spanish word (i.e., Spanish transcription) accesses a Spanish-English dictionary).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to utilize a first dictionary in a first language to translate transcriptions in a first language and a second dictionary in a second language to translate transcriptions in a second language in order to send conference scripts to different endpoints in the languages selected by the respective conferees (Liberman, ¶33) by using lexical sources (i.e., dictionaries; Seligman, col. 8, lines 64-65) to provide accurate and real-time machine translation (Seligman, col. 2, lines 15-18).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Liberman et al. (US 2011/0246172 A1) and Cordell et al. (US 2017/0046337 A1) as applied to claim 9, in view of Ahn et al. (US 10,643,036 B2).

Regarding Claim 10, Liberman does not disclose revising a first translation in real time based on additional words received in a corresponding transcription, the revised translation replacing previously translated words with newly translated words; and providing the revised translation to a respective client device. Ahn discloses a real-time translation system for a first user with a first smartphone having a video chat with a second user with a second smartphone (col. 6, lines 10-17), comprising fragment-by-fragment translation of the first user's transcription (i.e., text transcripts per col. 6, lines 29-30) and display of the translation while the first user is still speaking the current sentence (col. 6, lines 28-33; i.e., a first translation occurs as the user continues to speak, fragment by fragment). Further, the system revises the fragment-by-fragment translation in real time based on additional words received in the first user's transcription, the revised translation replacing the previously translated words with newly translated words (col. 6, lines 33-37 and col. 8, lines 15-21, when the user finishes the current sentence, which includes additional fragments spoken by the user, the system sends the whole sentence to a translation server to generate a translated sentence that differs from the combination of the fragment-by-fragment translations and replaces the previously displayed translation scripts), and providing the revised translation to a respective client device (col. 8, lines 19-21, replace the previously displayed fragment-by-fragment translations with the translation of the whole sentence on a display of the first or second smartphone; see col. 6, lines 38-41, the user's speech during the video chat is translated twice: (1) fragment by fragment and (2) as a whole sentence). It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to revise the first translation in real time based on additional words received in a corresponding transcription, the revised translation replacing previously translated words with newly translated words, in order to provide an improved user experience (Ahn, col. 2, lines 20-21).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to examiner Richard Z. Zhu, whose telephone number is 571-270-1587, or the examiner's supervisor, Hai Phan, whose telephone number is 571-272-6338. Examiner Richard Zhu can normally be reached M-Th, 07:30-17:00. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.
Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RICHARD Z ZHU/
Primary Examiner, Art Unit 2654
02/05/2026

[1] https://www.merriam-webster.com/dictionary/earmark

Prosecution Timeline

Apr 29, 2022
Application Filed
Apr 05, 2024
Non-Final Rejection — §103
Jul 10, 2024
Response Filed
Sep 30, 2024
Final Rejection — §103
Mar 03, 2025
Request for Continued Examination
Mar 06, 2025
Response after Non-Final Action
Mar 18, 2025
Non-Final Rejection — §103
Aug 21, 2025
Response Filed
Oct 02, 2025
Final Rejection — §103
Jan 06, 2026
Request for Continued Examination
Jan 08, 2026
Response after Non-Final Action
Feb 05, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592228
SPEECH INTERACTION METHOD AND APPARATUS, COMPUTER READABLE STORAGE MEDIUM, AND ELECTRONIC DEVICE
2y 5m to grant • Granted Mar 31, 2026
Patent 12592222
APPARATUSES, COMPUTER PROGRAM PRODUCTS, AND COMPUTER-IMPLEMENTED METHODS FOR ADAPTING SPEECH RECOGNITION CONFIDENCE SCORES BASED ON EXPECTED RESPONSE
2y 5m to grant • Granted Mar 31, 2026
Patent 12586574
ELECTRONIC DEVICE FOR PROCESSING UTTERANCE, OPERATING METHOD THEREOF, AND STORAGE MEDIUM
2y 5m to grant • Granted Mar 24, 2026
Patent 12579978
NETWORKED DEVICES, SYSTEMS, & METHODS FOR INTELLIGENTLY DEACTIVATING WAKE-WORD ENGINES
2y 5m to grant • Granted Mar 17, 2026
Patent 12572739
GENERATING MACHINE INTERPRETABLE DECOMPOSABLE MODELS FROM REQUIREMENTS TEXT
2y 5m to grant • Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 69%
With Interview: 85% (+15.4%)
Median Time to Grant: 3y 2m
PTA Risk: High
Based on 718 resolved cases by this examiner. Grant probability derived from career allow rate.
