Prosecution Insights
Last updated: May 29, 2026
Application No. 17/382,027

SYNTHESIZING VIDEO FROM AUDIO USING ONE OR MORE NEURAL NETWORKS

Final Rejection §103
Filed: Jul 21, 2021
Examiner: RICHER, AARON M
Art Unit: 2617
Tech Center: 2600 (Communications)
Assignee: Nvidia Corporation
OA Round: 7 (Final)
Grant Probability: 51% (Moderate)
OA Rounds: 8-9
Est. Remaining: 0m
With Interview: 73%

Examiner Intelligence

Grants 51% of resolved cases
Career Allowance Rate: 51% (241 granted / 470 resolved), -10.7% vs TC avg
Interview Lift: +21.4% for resolved cases with interview (rated Strong)
Avg Prosecution: 3y 9m typical timeline, 20 currently pending
Total Applications: 497 career total across all art units
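
The headline figures in this section can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes the allowance rate is granted cases divided by resolved cases and that the interview lift is added in percentage points. The page does not state its formula, and the constant and function names are hypothetical.

```python
# Figures copied from the dashboard; the formulas are assumptions, not the site's documented method.
GRANTED = 241
RESOLVED = 470
INTERVIEW_LIFT_PTS = 21.4  # percentage points, as displayed


def allowance_rate(granted: int, resolved: int) -> float:
    """Return the share of resolved cases that were granted, in percent."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return 100.0 * granted / resolved


base_rate = allowance_rate(GRANTED, RESOLVED)
# Assumption: the lift is additive in percentage points.
with_interview = base_rate + INTERVIEW_LIFT_PTS

print(f"Career allowance rate: {base_rate:.1f}%")
print(f"With interview (assumed additive): {with_interview:.1f}%")
```

Rounded to whole percentages, these values agree with the 51% and 73% shown on the page, which is consistent with (but does not prove) the additive reading.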

Statute-Specific Performance

§101: 1.8% (-38.2% vs TC avg)
§103: 87.9% (+47.9% vs TC avg)
§102: 2.2% (-37.8% vs TC avg)
§112: 7.3% (-32.7% vs TC avg)
Based on career data from 470 resolved cases. The TC avg figure is a Tech Center average estimate.
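
As a sanity check on the "vs TC avg" figures, the comparison baseline can be backed out from each rate and its delta. This is a minimal sketch that assumes each delta is a percentage-point difference (rate minus average). The page does not confirm that definition, and the names used are hypothetical.

```python
# Rates and deltas as displayed on the dashboard (percent, percentage points).
STATUTE_STATS = {
    "§101": (1.8, -38.2),
    "§103": (87.9, 47.9),
    "§102": (2.2, -37.8),
    "§112": (7.3, -32.7),
}


def implied_baseline(rate: float, delta_vs_avg: float) -> float:
    """Back out the comparison average, assuming delta = rate - average."""
    return round(rate - delta_vs_avg, 1)


baselines = {s: implied_baseline(r, d) for s, (r, d) in STATUTE_STATS.items()}
for statute, baseline in baselines.items():
    print(f"{statute}: implied TC avg {baseline}%")
```

Under that assumption every statute yields the same implied baseline, which suggests the Tech Center average here is a single flat estimate rather than a per-statute measurement.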

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 9 April 2026 have been fully considered but they are not persuasive.

Applicant argues that Song outputs expression and key point information not a plurality of video frames of a person uttering speech. Examiner notes that video generation is the entire purpose of Song (see title, abstract, background, etc.). Nevertheless, examiner notes that the previously cited portions of the reference do not explicitly recite video generation, only the elements necessary for it, and thus, examiner has cited additional sections to show that video generation with a person speaking takes place.

Applicant argues that Park is not combinable with Song because replacing a sample face requirement with a reference that does not require one renders Song’s system non-functional. Examiner notes that removal or substitution of an element does not necessarily render a reference non-functional. One skilled in the art knowing the input requirements of Song’s invention would not necessarily be bound by such requirements in looking for elements to combine- in fact, one would consider an invention with no corresponding video frame requirement, such as Park, a potential improvement over requiring corresponding frames.

As to the amended limitations, applicant’s arguments have been considered but are moot because the new ground of rejection relies upon the Liao reference to fill in the gaps in the Song and Park references.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-30 are rejected under 35 U.S.C. 103 as being unpatentable over Song (WO 2021052224 A1, herein represented by U.S. Publication 2021/0357625) in view of Liao (U.S. Publication 2021/0390748) and Park (U.S. Publication 2021/0357088).

As to claim 1, Song discloses a processor (fig. 8, element 82; p. 10, sections 0174-0178; a processor is configured to execute instructions from a medium, such as a memory, which also stores data for performing a programmed method), comprising: one or more circuits to use one or more neural networks to generate a first plurality of video frames of a person uttering speech based on speech information corresponding to the speech input to the one or more neural networks and based on an image corresponding to the user (fig. 5; p. 6, sections 0099-0112; p. 7, section 0120; a first neural network is trained using an audio clip with voice/speech information, to generate face key point information corresponding to input video data of a person’s face and speech for video displaying such).

Song does not disclose, but Liao discloses modifying at least a portion of the generated first plurality of video frames including the person uttering the speech to produce a second plurality of video frames of a representation of a user uttering the speech (fig. 8; p. 6-7, section 0070; poses from generated video frames are replaced based on a user being more likely to have spoken particular words). The motivation for this is to produce output video with synchronized, realistic, and expressive body dynamics at low cost (p. 1, sections 0003-0004). It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify Song to modify at least a portion of the generated first plurality of video frames including the person uttering the speech to produce a second plurality of video frames of a representation of a user uttering the speech in order to produce output video with synchronized, realistic, and expressive body dynamics at low cost as taught by Liao.

Song does not expressly disclose, but Park discloses that the generation is without corresponding video frames and based on a single image corresponding to the user (fig. 3; fig. 6-7; p. 4, sections 0065-0070; p. 5, sections 0076-0077; p. 6, section 0088; p. 7, sections 0102-0108; p. 8, sections 0115-0116; generation of a character animation based on driving information including mouth movements and words for a character to speak is performed; at this stage, the character is synthetic like the characters in the figures and not based on input video; later, a user inputs “a facial image” and movements, which would include mouth motions associated with speaking, are mapped such that the input facial image replaces one of the synthetic character images in the animation). The motivation for this is to allow more personal and realistic content so that a user can experience the content more dynamically (p. 1, section 0005). It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify Song and Liao to generate without corresponding video frames and based on a single image corresponding to the user in order to allow more personal and realistic content so that a user can experience the content more dynamically as taught by Park.

As to claim 2, Song does not disclose, but Liao does disclose wherein the plurality of frames includes a representation of one or more three-dimensional character models uttering a corresponding portion of the speech information (p. 1, section 0006; p. 6, sections 0065-0067; p. 6-7, section 0070; p. 7, section 0072; p. 7, section 0079-p. 8, section 0083; using a neural network, a 3D skeleton model is used to create poses for video frames corresponding to spoken voice/speech information). Motivation for the combination is given in the rejection to claim 1.

As to claim 3, Song does not disclose, but Liao does disclose wherein the plurality of frames includes a representation of the one or more users uttering the corresponding portion of the speech information as represented by the one or more three-dimensional character models in the plurality of video frames (p. 1, sections 0006-0007; p. 5, section 0059; p. 6, sections 0065-0067; p. 6-7, section 0070; p. 7, section 0072; p. 7, section 0079-p. 8, section 0083; the 3D body model first video information, input video of a person, and speech-to-text information are used to generate video frames of a speaking person). Motivation for the combination is given in the rejection to claim 1.

As to claim 4, Song does not disclose, but Liao does disclose wherein the one or more neural networks is trained to correlate key points between the one or more three-dimensional character models represented in the plurality of video frames and at least one of shape information or pose information for the user (p. 1, section 0006; p. 5, section 0060; p. 6, section 0062; p. 6, section 0064-0067; correspondence between points in the 3D model and 2D point positions of a person speaking in input video that correspond to projected pose information is found using the neural network). Motivation for the combination is given in the rejection to claim 1.

As to claim 5, Song does not disclose, but Liao does disclose wherein the one or more circuits are further to use the one or more neural networks to synthesize the speech information as voice information from text (fig. 3, element 325; p. 4-5, section 0053; p. 5, section 0056; p. 5, section 0060; a neural network is shown that trains speech video synthesis and mapping from text input). Motivation for the combination is given in the rejection to claim 1.

As to claim 6, Song discloses wherein the plurality of video frames is representative of an amount of emotion or pattern of speech determined from the speech information (fig. 3; p. 3, sections 0046-0052; p. 4, section 0060; p. 9, section 0146-0147; the voice/speech information is analyzed to determine a facial expression representing a particular emotional state; the inpainted image from the second neural network uses the expression information and video is generated from the inpainted image).

As to claim 7, see the rejection to claim 1.
As to claim 8, see the rejection to claim 2. As to claim 9, see the rejection to claim 3. As to claim 10, see the rejection to claim 4. As to claim 11, see the rejection to claim 5. As to claim 12, see the rejection to claim 6. As to claim 13, see the rejection to claim 1. As to claim 14, see the rejection to claim 2. As to claim 15, see the rejection to claim 3. As to claim 16, see the rejection to claim 4. As to claim 17, see the rejection to claim 5. As to claim 18, see the rejection to claim 6. As to claim 19, see the rejection to claim 1. As to claim 20, see the rejection to claim 2. As to claim 21, see the rejection to claim 3. As to claim 22, see the rejection to claim 4. As to claim 23, see the rejection to claim 5. As to claim 24, see the rejection to claim 6. As to claim 25, see the rejection to claim 1. As to claim 26, see the rejection to claim 2. As to claim 27, see the rejection to claim 3. As to claim 28, see the rejection to claim 4. As to claim 29, see the rejection to claim 5. As to claim 30, see the rejection to claim 6.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AARON M RICHER whose telephone number is (571)272-7790. The examiner can normally be reached 9AM-5PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Poon, can be reached at (571)272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users.
To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AARON M RICHER/
Primary Examiner, Art Unit 2617

Prosecution Timeline

16 earlier events are not shown.
May 20, 2025: Request for Continued Examination
May 21, 2025: Response after Non-Final Action
Aug 27, 2025: Non-Final Rejection mailed (§103)
Nov 26, 2025: Response Filed
Jan 09, 2026: Final Rejection mailed (§103)
Apr 09, 2026: Request for Continued Examination
Apr 16, 2026: Response after Non-Final Action
Apr 23, 2026: Non-Final Rejection mailed (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12620205: USING TWO-DIMENSIONAL IMAGES AND MACHINE LEARNING TO IDENTIFY INFORMATION PERTAINING TO FACIAL FEATURES (2y 0m to grant, granted May 05, 2026)
Patent 12608918: MICROSCOPY SYSTEM AND METHOD FOR PROCESSING MICROSCOPY IMAGES (3y 7m to grant, granted Apr 21, 2026)
Patent 12608122: Image Synthesis with Multiple Input Modalities (3y 6m to grant, granted Apr 21, 2026)
Patent 12586151: Frame Rate Extrapolation (4y 6m to grant, granted Mar 24, 2026)
Patent 12579600: SEAMLESS VIDEO IN HETEROGENEOUS CORE INFORMATION HANDLING SYSTEM (3y 7m to grant, granted Mar 17, 2026)

Based on the 5 most recent grants. Study what changed in each case to get past this examiner.

Prosecution Projections

Expected OA Rounds: 8-9
Grant Probability: 51%
With Interview: 73% (+21.4%)
Median Time to Grant: 3y 9m (~0m remaining)
PTA Risk: High
Based on 470 resolved cases by this examiner. Grant probability derived from career allowance rate.