Last updated: May 29, 2026

Application No. 18/543,795

GENERATING SPEAKER VIDEO AND AUDIO IN MULTIPLE LANGUAGES FOR VIDEOCONFERENCING

Final Rejection §103

Filed: Dec 18, 2023

Examiner: SIDDO, IBRAHIM

Art Unit: 2681

Tech Center: 2600 (Communications)

Assignee: Zoom Video Communications, Inc.

OA Round: 2 (Final)

Interview Optional

Interview lift (+13.1%) is below the 15.0% threshold. A written response is recommended.

Based on 477 resolved cases, 2023–2026
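
The recommendation above amounts to a simple threshold comparison. A minimal sketch, assuming the lift and threshold are both expressed in percentage points; the function name and the handling of a lift exactly equal to the threshold are hypothetical, since the page does not state them:

```python
# Threshold quoted on the page, in percentage points.
INTERVIEW_LIFT_THRESHOLD = 15.0


def recommend_response(interview_lift_pct, threshold_pct=INTERVIEW_LIFT_THRESHOLD):
    """Return the suggested response type for a given interview lift.

    Assumption: a lift at or above the threshold favors an interview;
    the page only states the outcome for a lift below the threshold.
    """
    if interview_lift_pct >= threshold_pct:
        return "interview"
    return "written response"


print(recommend_response(13.1))
```

With the examiner's +13.1% lift, the sketch returns the written-response branch, matching the recommendation shown.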

Examiner Intelligence

SIDDO, IBRAHIM

Career Allowance Rate: 84% (400 granted / 477 resolved), above average, +21.9% vs TC avg

Interview Lift: +13.1% (moderate), comparing resolved cases without and with an interview

Avg Prosecution: 2y 1m (fast prosecutor)

23 currently pending

Career history: 494 total applications across all art units

Statute-Specific Performance

§101: 0.9% (-39.1% vs TC avg)

§103: 86.7% (+46.7% vs TC avg)

§102: 7.2% (-32.8% vs TC avg)

§112: 1.4% (-38.6% vs TC avg)

Based on career data from 477 resolved cases

Office Action

§103

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments with respect to the claim(s), filed on 01/29/26, have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 1-6, 8-13, 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Waibel (WO 2023/219752) provided by Applicant in PE2E in view of Cunnington (US 2010/0198579).
With respect to claim 8 (similarly claims 1 and 15), Waibel teaches a computing device (e.g. a computer system Figs 1-2, 11, 14-16 [0031]-[0032]), comprising:
a non-transitory computer-readable medium (e.g. a memory [0138]); and 
a processor (e.g. processor cores [0138]) communicatively coupled to the non-transitory computer-readable medium (e.g. coupled to the memory [0138]), the processor configured to execute processor-executable instructions stored in the non-transitory computer-readable medium (e.g. the processor executes instructions stored in the memory to [0138]) to: 
access a speaker speech audio signal comprising speech in a first language captured at a client computing device associated with a speaker of a video conference and a video of the speaker associated with the speaker speech (e.g. accessing a speaker audio in English and a video of the speaker associated with the speaker audio, see [0054]); 
access a translated speech audio signal, the translated speech audio signal comprising speech in a second language that is a translation of the speaker speech (e.g. accessing a translated audio signal in German that is a translation of the speaker audio/speech [0054], see also [0067]-[0069]); 
generate a converted translated speech audio signal based on the translated speech audio signal, the converted translated speech audio signal comprising a speech in the second language having voice characteristics in the speaker speech (e.g. generate a converted translated speech audio based on the translated speech audio signal, the converted translated speech audio comprising a speech audio in German having voice characteristics in the speaker speech [0054], Figs 1 and 14-16, voice conversion 20); 
generate a lip-synched speaker video based on the video of the speaker and the converted translated speech audio signal, wherein lip movements in the lip-synched speaker video correspond to the converted translated speech audio signal (e.g. generate new video frames of the speaker’s face with the lips of the speaker adapted to the given German audio [0054]); and 
transmit the converted translated speech audio signal and the lip-synched speaker video to a video conference provider configured to host the video conference (e.g. Finally, a video generation system 25 combines the video frames from the lip generation module 24 and the German speech from the voice conversion module 20 to create the final output video 26 [0054], see also [0086]).
However, Waibel fails to teach the speech in the second language spoken by a second speaker different than the speaker.
Cunnington teaches a system 100 that facilitates enabling translation between two disparate languages within a telepresence session (Fig 1, [0022]), where the speech in the second language is spoken by a second speaker different than the speaker (e.g. wherein the first user is an English speaking person and the second person is a Spanish speaking person. The first user can identify English as his or her language of choice in which the subject innovation can provide automatic translations to the first user in his or her selected language of choice. Thus, any communications from the second user (directed in Spanish or any other language other than English) can be translated to English for the comprehension of the first user [0025]).
Waibel and Cunnington are analogous art because both pertain to translating from one language to another. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Waibel with the teachings of Cunnington to include the speech in the second language spoken by a second speaker different than the speaker, as suggested in [0025] of Cunnington. The benefit of the modification would be to provide the system with multilingual translation capabilities, thus improving user satisfaction.
 
With respect to claim 9 (similarly claims 2 and 16), Waibel teaches the computing device of claim 8, wherein generating the converted translated speech audio signal comprises:
resampling the translated speech audio signal (e.g. see the VQMIVC of [0045]-[0046]); 
encoding the translated speech audio signal into voice characteristics (e.g. encoding the translated speech audio to extract embeddings [0045]-[0046]); 
applying a voice changer model onto the voice characteristics to generate the converted translated speech audio signal, wherein the voice changer model is associated with the speaker and is configured to change voice characteristics of an input speech audio signal to voice characteristics of the speaker (e.g. the VQMIVC uses a straightforward autoencoder architecture to solve the voice conversion issue [0045], see the four modules of the framework [0045]-[0046] which suggest applying a voice changer model onto the voice characteristics to generate the converted translated speech audio signal, wherein the voice changer model is associated with the speaker and is configured to change voice characteristics of an input speech audio signal to voice characteristics of the speaker); and 
outputting the converted translated speech audio signal (e.g. see the final audio of [0054]).
With respect to claim 10 (similarly claims 3 and 17), Waibel teaches the computing device of claim 9, wherein generating the converted translated speech audio signal further comprises:
generating a fundamental frequency of the translated speech audio signal; and prior to outputting the converted translated speech audio signal, shifting a fundamental frequency of the converted translated speech audio signal based on a difference between a fundamental frequency of the converted translated speech audio signal and the fundamental frequency of the translated speech audio signal (e.g. quantization, discretizing, encoding, decoding the converted speech as suggested in [0045]-[0046] suggest generating a fundamental frequency of the translated speech audio signal; and prior to outputting the converted translated speech audio signal, shifting a fundamental frequency of the converted translated speech audio signal based on a difference between a fundamental frequency of the converted translated speech audio signal and the fundamental frequency of the translated speech audio signal).
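
For readers weighing how specific the claim 10 limitation is, the following is a rough sketch of what estimating two fundamental frequencies and deriving a shift from their difference could look like. It is an illustration only: neither the claim nor the cited Waibel paragraphs name an estimation method, so the autocorrelation estimator, the semitone measure, the direction of the shift, the synthetic 200 Hz and 160 Hz test tones, and every function name here are assumptions.

```python
import math

SAMPLE_RATE = 16000  # Hz, assumed for the synthetic signals below


def sine(freq_hz, n_samples=2048, sample_rate=SAMPLE_RATE):
    """Synthetic stand-in for a voiced speech frame (assumption)."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def estimate_f0(samples, sample_rate=SAMPLE_RATE, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency with a plain autocorrelation search."""
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    window = len(samples) - lag_max  # same window for every lag
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(window))
        # Require a clear improvement so multiples of the true period
        # cannot win on floating-point rounding noise alone.
        if score > best_score + 1e-6:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def semitone_shift(f0_from, f0_to):
    """Pitch offset, in semitones, that moves f0_from onto f0_to."""
    return 12.0 * math.log2(f0_to / f0_from)


translated_f0 = estimate_f0(sine(200.0))  # translated speech audio signal
converted_f0 = estimate_f0(sine(160.0))   # converted translated speech audio signal
shift = semitone_shift(converted_f0, translated_f0)
print(translated_f0, converted_f0, round(shift, 2))
```

The resulting offset could then be handed to whatever pitch-shifting step a real system uses before output; that step is outside this sketch.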
With respect to claim 11 (similarly claims 4 and 18), Waibel teaches the computing device of claim 9, wherein the voice changer model is trained by a training process (e.g. pre-trained VQMIVC [0046]) comprising:
accessing training speaker speech audio signals; determining a quality score of the training speaker speech audio signals; processing the training speaker speech audio signals to increase the quality score based on determining that the quality score is lower than a predetermined threshold value; determining a fundamental frequency of the training speaker speech audio signals; and training the voice changer model by adjusting parameters of the voice changer model according to voice characteristics of the training speaker speech audio signal (e.g. the mutual information (MI) loss measures the dependencies between all representations and can be effectively integrated into the training process to achieve speech representation disentanglement [0046], fine-tuning the VQMIVC model [0046] to reach convergence i.e. quality/error score lower than a certain threshold, which suggest accessing training speaker speech audio signals; determining a quality score of the training speaker speech audio signals; processing the training speaker speech audio signals to increase the quality score based on determining that the quality score is lower than a predetermined threshold value; determining a fundamental frequency of the training speaker speech audio signals; and training the voice changer model by adjusting parameters of the voice changer model according to voice characteristics of the training speaker speech audio signal).
With respect to claim 12 (similarly claims 5 and 19), Waibel teaches the computing device of claim 8, wherein generating the lip-synched speaker video based on the video of the speaker and the converted translated speech audio signal comprises:
extracting speech features from the converted translated speech audio signal (e.g. extracting prosodic characteristics of the input speaker [0052]); performing face detection on the video of the speaker to identify a mouth region of the video (e.g. Figs 1-2 [0047]-[0052] perform face detection on the video of the speaker to identify a mouth region of the video); and modifying, via a machine learning model, the mouth region of the video according to the speech features to generate the lip-synched speaker video (e.g. training the lip generation module 24 to preserve facial expressions such as frowns, smiles, coughs, sneezes, twitching, blinking, etc. [0052] suggest modifying, via a machine learning model, the mouth region of the video according to the speech features to generate the lip-synched speaker video).
With respect to claim 13 (similarly claims 6 and 20), Waibel teaches the computing device of claim 12, wherein generating the lip-synched speaker video based on the video of the speaker and the converted translated speech audio signal further comprises:
resizing the mouth region before modifying, via the machine learning model, the mouth region by one or more of resampling the mouth region or expanding the mouth region by including pixels surrounding the identified mouth region (e.g. adapted lips [0054] suggest resizing the mouth region before modifying, via the machine learning model, the mouth region by one or more of resampling the mouth region or expanding the mouth region by including pixels surrounding the identified mouth region); and 
wherein modifying, via the machine learning model, the mouth region of the video according to the speech features to generate the lip-synched speaker video comprises: applying the machine learning model to the resized mouth region to generate a lip-synched mouth region (e.g. training the lip generation module 24 to preserve facial expressions such as frowns, smiles, coughs, sneezes, twitching, blinking, etc. [0052] comprises applying the machine learning model to the resized mouth region to generate a lip-synched mouth region, as suggested in [0052]-[0054]); and constructing the lip-synched speaker video by at least resizing the lip-synched mouth region and inserting the lip-synched mouth region into the video of the speaker (e.g. see the generation of the final output in [0054]).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IBRAHIM SIDDO whose telephone number is (571)272-4508. The examiner can normally be reached 9:00-5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Akwasi Sarpong, can be reached at 571-270-3438. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/IBRAHIM SIDDO/Primary Examiner, Art Unit 2681
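
The reply periods described in the Conclusion can be turned into calendar dates. A minimal sketch, assuming the Mar 06, 2026 mailing date listed in the Prosecution Timeline; the `add_months` helper is hypothetical, and the sketch does not account for weekend or holiday carry-over, extension fees, or the advisory-action exception, so any real deadline should be confirmed against the record:

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` later, clamping the day to the month's end."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Mailing date of the final rejection, taken from the Prosecution Timeline.
mailing_date = date(2026, 3, 6)

two_month_mark = add_months(mailing_date, 2)        # first-reply window
shortened_period_end = add_months(mailing_date, 3)  # shortened statutory period
statutory_limit = add_months(mailing_date, 6)       # absolute outer limit

print("Two-month mark:      ", two_month_mark.isoformat())
print("Three-month deadline:", shortened_period_end.isoformat())
print("Six-month limit:     ", statutory_limit.isoformat())
```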

Prosecution Timeline

Dec 18, 2023: Application Filed

Oct 01, 2025: Non-Final Rejection mailed (§103)

Jan 29, 2026: Response Filed

Mar 06, 2026: Final Rejection mailed (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/368,333, Patent 12640151: VOICE CONTROL WITH CONTEXTUAL KEYWORDS (2y 8m to grant, granted May 26, 2026)

18/417,104, Patent 12634401: INSPECTION SYSTEM AND METHOD OF CONTROLLING THE SAME, AND STORAGE MEDIUM (2y 4m to grant, granted May 19, 2026)

18/345,339, Patent 12622505: SYSTEMS, DEVICES, AND METHODS FOR SEGMENT-BASED GUIDANCE OF PRODUCT APPLICATION (2y 10m to grant, granted May 12, 2026)

18/538,632, Patent 12614550: ELECTRONIC DEVICE, METHOD, AND NON-TRANSITORY COMPUTER READABLE STORAGE MEDIUM CONTROLLING EXECUTABLE OBJECT BASED ON VOICE SIGNAL (2y 4m to grant, granted Apr 28, 2026)

18/423,276, Patent 12608166: Automated Data Handling (2y 2m to grant, granted Apr 21, 2026)

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4

Grant Probability: 84%

With Interview (+13.1%): 97%

Median Time to Grant: 2y 1m (~0m remaining)

PTA Risk: Moderate

Based on 477 resolved cases by this examiner. Grant probability derived from career allowance rate.
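
The two probability figures above appear to follow from simple arithmetic on the examiner's career numbers. A minimal sketch, assuming the page rounds the allowance rate to a whole percent and then adds the interview lift in percentage points; the function names and the additive model are assumptions, not a documented method:

```python
def allowance_rate_pct(granted, resolved):
    """Career allowance rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved


def with_interview_pct(base_pct, lift_pct):
    """Add the interview lift in percentage points, capped at 100."""
    return min(100.0, base_pct + lift_pct)


base = round(allowance_rate_pct(400, 477))          # 400 granted of 477 resolved
boosted = round(with_interview_pct(base, 13.1))     # +13.1 point interview lift

print(f"Grant probability: {base}%")
print(f"With interview:    {boosted}%")
```

Under these assumptions the sketch reproduces the 84% and 97% figures shown on the page.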