Prosecution Insights
Last updated: April 19, 2026
Application No. 19/034,368

Video Diffusion Model
Non-Final OA (§103, §112)

Filed: Jan 22, 2025
Examiner: RICHER, AARON M
Art Unit: 2617
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 3 (Non-Final)

Grant Probability: 51% (Moderate)
Expected OA Rounds: 3-4
Time to Grant: 4y 0m
With Interview: 70%

Examiner Intelligence

Career Allow Rate: 51% (236 granted / 465 resolved; -11.2% vs TC avg)
Interview Lift: +19.5% (strong) for resolved cases with an interview
Typical Timeline: 4y 0m average prosecution; 28 applications currently pending
Career History: 493 total applications across all art units
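The headline figures in this card can be reproduced from the raw counts, which is a useful sanity check when comparing examiner dashboards. A minimal sketch (the rounding convention is an assumption inferred from the displayed numbers):

```python
# Reproduce the examiner's headline statistics from the raw counts above.
granted = 236
resolved = 465
pending = 28

# Career allow rate: share of resolved cases that ended in a grant.
allow_rate = 100 * granted / resolved
print(f"Career allow rate: {allow_rate:.1f}%")  # 50.8%, displayed as 51%

# Total applications: resolved cases plus those currently pending.
total_applications = resolved + pending
print(f"Total applications: {total_applications}")  # 493
```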

Statute-Specific Performance

§101: 9.4% (-30.6% vs TC avg)
§103: 54.7% (+14.7% vs TC avg)
§102: 13.1% (-26.9% vs TC avg)
§112: 19.9% (-20.1% vs TC avg)
Tech Center average is an estimate • Based on career data from 465 resolved cases
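Each per-statute figure is paired with a delta against the Tech Center average, so the baseline estimate can be recovered by subtraction. A minimal sketch (assuming delta = examiner rate minus TC average, which is how the displayed numbers reconcile):

```python
# Recover the Tech Center baseline implied by each statute's rate and its
# "vs TC avg" delta, assuming delta = examiner rate - TC average.
stats = {
    "§101": (9.4, -30.6),
    "§103": (54.7, +14.7),
    "§102": (13.1, -26.9),
    "§112": (19.9, -20.1),
}
baselines = {s: round(rate - delta, 1) for s, (rate, delta) in stats.items()}
for statute, baseline in baselines.items():
    print(f"{statute}: TC average estimate ~{baseline}%")
# Every delta implies the same ~40.0% Tech Center baseline estimate.
```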

Office Action

§103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 18 February 2026 have been fully considered but they are not persuasive.

As to claim 1, applicant argues that, in Westcott, the next frame is used with the previous frame and that the key idea is to run the previous frame in the forward process as part of the diffusion process. While the examiner agrees that one embodiment in Westcott requires these frame dependencies, which would appear to diverge from true simultaneous processing of an entire series of frames, there is another embodiment in Westcott that starts from a Gaussian distribution (p. 9, section 0109). Westcott recognizes at p. 10, section 0116 that using the previous frame as a starting point for a subsequent frame is "not necessary to implement the disclosed technique for real-time video diffusion". Such an embodiment would lend itself to being improved with simultaneous processing in a much less challenging way than a temporal dependency-based embodiment where sequential processing would be favored.

Applicant's arguments with respect to the Green reference have been considered but are moot because the new ground of rejection does not rely on this reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-3, 8, 9, 14-17, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Westcott (U.S. Publication 2025/0133238) in view of Linzer (U.S. Patent 12,327,176).

As to claim 1, Westcott discloses a computer-implemented method to perform video generation, the method comprising: generating, by a computing system comprising one or more computing devices, a plurality of inputs that contain noise (fig. 2, elements 130 and 134; p. 4, section 0055; noisy inputs are created using a noising structure), wherein the plurality of inputs respectively correspond to a plurality of timestamps that span a temporal dimension of a video (p. 3-4, section 0053; p. 5, section 0062; p. 8, section 0081; the frames generated correspond to a sequence of frames which would inherently be associated with a time stamp in a temporal dimension, either a particular time or a particular position in a sequence); processing, by the computing system, the plurality of noisy inputs with a machine-learned denoising diffusion model to generate, as an output of the machine-learned denoising diffusion model, a plurality of synthetic frames for the video that respectively correspond to the plurality of timestamps of the video (fig. 2, elements 131 and 136; p. 4, section 0055; p. 5, sections 0062-0063; p. 9, section 0109; frames are synthesized from the noisy frames to create new views; as noted above, the frames correspond to video timestamps), wherein the machine-learned denoising diffusion model comprises a plurality of layers (p. 5, sections 0065-0067; p. 7, sections 0076-0077; at least trainable and cross-attention layers are included in the model), wherein at least a first layer of the plurality of layers performs a temporal downsampling operation to generate a first layer output having a reduced size in the temporal dimension, and wherein at least a second layer of the plurality of layers performs a temporal upsampling operation to generate a second layer output having an increased size in the temporal dimension (p. 4, section 0055; p. 9, section 0112; p. 10, section 0121; a U-Net, which is a structure that downsamples/contracts an input and then upsamples/expands the result, is used in space and time dimensions with frames previous and subsequent in a temporal dimension); and providing, by the computing system, the video as an output (fig. 2, elements 115'; the frames are output from the decoder).

Westcott discloses performing some acts simultaneously but does not expressly disclose simultaneous processing of every input of the plurality of inputs that contain noise and generation of frames. Linzer, however, does disclose this (col. 2, lines 15-31; col. 13, lines 44-59; col. 14-15, claim 1; simultaneous processing of an entire time series of noisy input frames to generate clean output frames is disclosed). The motivation for this is to improve the speed of the network. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify Westcott to simultaneously process every input in a plurality of noisy inputs and generate frames in order to improve the speed of the network as taught by Linzer.

As to claim 2, Westcott discloses wherein the plurality of synthetic frames simultaneously generated by the machine-learned denoising diffusion model comprise an entirety of the video (p. 3, section 0052; a whole video is processed).

As to claim 3, Westcott discloses wherein the machine-learned denoising diffusion model comprises a space-time U-Net (p. 10, section 0121).

As to claim 8, Westcott discloses wherein the machine-learned denoising diffusion model operates in a pixel-space of the video (p. 4, section 0056; the model operates in pixel-space to convert an image to latent space and then back to an image).

As to claim 9, Westcott discloses wherein the machine-learned denoising diffusion model operates in a latent-space of the video, and wherein the machine-learned denoising diffusion model comprises at least a decoder to transform from the latent space of the video to a pixel-space of the video (p. 4, section 0056; after the model operates in latent space, a decoder transforms the space back to an image/pixel space).

As to claim 14, Westcott discloses receiving, by the computing system, a conditioning input; and conditioning, by the computing system, the machine-learned denoising diffusion model on the conditioning input (p. 4, section 0054; image frames can be provided as a conditioning input to condition the model).

As to claim 15, Westcott discloses wherein the conditioning input comprises a textual input (p. 6, section 0094; the unconditional model is replaced with a conditional model where a text embedding is input as guidance).

As to claim 16, Westcott discloses wherein the conditioning input comprises an image input (p. 4, section 0054; image frames can be provided as a conditioning input to condition the model).

As to claim 17, Westcott discloses wherein the image input comprises a masked image input (p. 4, section 0054; an image with face landmarks or Canny edges highlighted would read on a masked image).

As to claim 19, see the rejection to claim 1. Further, Westcott discloses a computing system comprising one or more processors and one or more non-transitory computer-readable media that store computer-readable instructions for performing the operations (p. 7, section 0080-p. 8, section 0081).

As to claim 20, see the rejections to claims 1 and 19.

Claims 4 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Westcott in view of Linzer and further in view of Carreira (U.S. Publication 2022/0012898).

As to claim 4, Westcott discloses wherein the space-time U-Net comprises a pre-trained U-Net (p. 4, section 0055; p. 5, section 0065; p. 7, section 0076; the model is pre-trained and implemented with a U-Net). Westcott does not disclose, but Carreira discloses, a U-Net that has been inflated with temporal layers (p. 2, section 0013; p. 2, section 0016; p. 5, sections 0055-0056; the network, which is a U-Net or similar, is inflated with layers of an extra temporal dimension). The motivation for this is to match motion of features in time. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify Westcott and Linzer to inflate a U-Net with temporal layers in order to match motion of features in time as taught by Carreira.

As to claim 5, Westcott does not disclose, but Carreira discloses, wherein an initial layer of the machine-learned denoising diffusion model and a final layer of the machine-learned denoising diffusion model each have a size in the temporal dimension that matches a number of frames included in the video (fig. 4a; fig. 4b; p. 5, sections 0050-0054; for a video with T frames, where T=64 in the figure, the temporal size of an input layer is T=64 frames and the final layer producing output is also T=64 frames). Motivation for the combination is given in the rejection to claim 4.

Claims 10-13 are rejected under 35 U.S.C. 103 as being unpatentable over Westcott in view of Linzer and further in view of Mann (U.S. Publication 2024/0193835).

As to claim 10, Linzer discloses simultaneous frame generation, as discussed in the rejection to claim 1. Westcott in view of Linzer does not disclose, but Mann discloses, that the plurality of synthetic frames generated by the machine-learned denoising diffusion model (fig. 4; fig. 7; p. 6, section 0059; p. 11, section 0085; p. 15, section 0109; the machine-learning model synthesizing frames can perform denoising and be a diffusion model) comprise a plurality of lower resolution synthetic frames (p. 6, section 0059; p. 9, section 0073; p. 10, sections 0080-0082; low-resolution candidate frames are generated), and the method further comprises, prior to providing the video as an output: processing, by the computing system, the plurality of lower resolution synthetic frames with a machine-learned spatial-super resolution model to generate a plurality of high resolution synthetic frames for the video, wherein the plurality of higher resolution synthetic frames have a relatively larger resolution than the plurality of lower resolution synthetic frames (p. 16, section 0113; the neural/machine-learning renderer includes a super-resolution network/model to generate higher-resolution frames from the lower-resolution frames). The motivation for this is to allow for rapid feedback while still generating a more refined final image (p. 2, section 0010). It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify Westcott and Linzer to have the plurality of synthetic frames generated by the machine-learned denoising diffusion model comprise a plurality of low resolution synthetic frames, and, prior to providing the video as an output: processing, by the computing system, the plurality of low resolution synthetic frames with a machine-learned spatial-super resolution model to generate a plurality of high resolution synthetic frames for the video in order to allow for rapid feedback while still generating a more refined final image as taught by Mann.

As to claim 11, Mann discloses wherein processing, by the computing system, the plurality of lower resolution synthetic frames with the machine-learned spatial-super resolution model comprises processing, by the computing system with the machine-learned spatial-super resolution model, each of a plurality of groups of the low resolution synthetic frames that respectively correspond to a plurality of temporal windows (p. 14, section 0105-p. 15, section 0106; as part of the high-resolution reconstruction, some number of frames X is taken as a temporal window to determine one of the candidate frames; the window slides temporally to create a new frame group which is used to reconstruct another frame). Motivation for the combination is given in the rejection to claim 10.

As to claim 12, Mann discloses wherein processing, by the computing system with the machine-learned spatial-super resolution model, each of the plurality of groups of the lower resolution synthetic frames comprises performing, by the computing system, multi-diffusion across the temporal dimension of two or more of the plurality of groups (p. 14, section 0105-p. 15, section 0106; p. 15, section 0115; as part of the high-resolution reconstruction, some number of frames X is taken as a temporal window to determine one of the candidate frames; the window slides temporally to create a new frame group which is used to reconstruct another frame; by sliding one frame for each input image frame, the window/group would overlap windows/groups for adjacent frames; each module can be implemented as a diffusion network, meaning that the process would include multiple diffusions on the windows/groups). Motivation for the combination is given in the rejection to claim 10.

As to claim 13, Mann discloses wherein the plurality of temporal windows are overlapping, and wherein performing, by the computing system, multi-diffusion comprises performing, by the computing system, multi-diffusion on overlapping temporal portions of the two or more of the plurality of groups (p. 14, section 0105-p. 15, section 0106; p. 15, section 0115; as part of the high-resolution reconstruction, some number of frames X is taken as a temporal window to determine one of the candidate frames; the window slides temporally to create a new frame group which is used to reconstruct another frame; by sliding one frame for each input image frame, the window/group would overlap windows/groups for adjacent frames; each module can be implemented as a diffusion network, meaning that the process would include multiple diffusions on the windows/groups). Motivation for the combination is given in the rejection to claim 10.

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Westcott in view of Linzer and further in view of Kuang (CN 116883545 A, herein represented by a translation).

As to claim 18, Westcott does not disclose, but Kuang discloses, wherein the machine-learned denoising diffusion model (p. 2, Background technology; the diffusion model creates noisy samples and then denoises) comprises a plurality of weights that have been derived by interpolating between a base set of weights and a style-specific set of weights (p. 3-4, S2; p. 4, S201; p. 7-9, S2-S203; weights are derived from a weighted combination/interpolation of original/base weights and a residual for a particular style calculation, which can read on style-specific weights). The motivation for this is to expand a current data set (p. 3). It would have been obvious to one skilled in the art before the effective filing date to modify Westcott and Linzer to use a plurality of weights that have been derived by interpolating between a base set of weights and a style-specific set of weights in order to expand a current data set as taught by Kuang.

Conclusion

Claims 6 and 7 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AARON M RICHER whose telephone number is (571)272-7790. The examiner can normally be reached 9AM-5PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, King Poon, can be reached at (571)272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AARON M RICHER/
Primary Examiner, Art Unit 2617
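The sliding, overlapping temporal windows that the rejections of claims 11-13 attribute to Mann can be illustrated with a short sketch. The window size and stride below are hypothetical examples chosen for illustration, not values from Mann; the one-frame slide matches the examiner's reasoning that adjacent frame groups overlap.

```python
# Illustrative sketch of the sliding temporal-window grouping described in
# the rejections of claims 11-13: the window advances one frame at a time,
# so adjacent frame groups overlap. Window size and stride are hypothetical.
def temporal_windows(num_frames: int, window: int = 4, stride: int = 1):
    """Return [start, end) frame-index windows over a clip."""
    return [(start, start + window)
            for start in range(0, num_frames - window + 1, stride)]

windows = temporal_windows(num_frames=8)
print(windows)  # [(0, 4), (1, 5), (2, 6), (3, 7), (4, 8)]

# Adjacent windows share window - stride = 3 frames: the overlapping
# temporal portions on which multi-diffusion would operate (claim 13).
overlap = set(range(*windows[0])) & set(range(*windows[1]))
print(sorted(overlap))  # [1, 2, 3]
```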

Prosecution Timeline

Jan 22, 2025: Application Filed
Jul 12, 2025: Non-Final Rejection — §103, §112
Oct 13, 2025: Response Filed
Nov 14, 2025: Final Rejection — §103, §112
Feb 04, 2026: Applicant Interview (Telephonic)
Feb 04, 2026: Examiner Interview Summary
Feb 18, 2026: Request for Continued Examination
Feb 23, 2026: Response after Non-Final Action
Mar 06, 2026: Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586151: Frame Rate Extrapolation (granted Mar 24, 2026; 2y 5m to grant)
Patent 12579600: SEAMLESS VIDEO IN HETEROGENEOUS CORE INFORMATION HANDLING SYSTEM (granted Mar 17, 2026; 2y 5m to grant)
Patent 12571669: DETECTING AND GENERATING A RENDERING OF FILL LEVEL AND DISTRIBUTION OF MATERIAL IN RECEIVING VEHICLE(S) (granted Mar 10, 2026; 2y 5m to grant)
Patent 12555305: Systems And Methods For Generating And/Or Using 3-Dimensional Information With Camera Arrays (granted Feb 17, 2026; 2y 5m to grant)
Patent 12548233: 3D TEXTURING VIA A RENDERING LOSS (granted Feb 10, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 51% (70% with interview, +19.5%)
Median Time to Grant: 4y 0m
PTA Risk: High
Based on 465 resolved cases by this examiner. Grant probability derived from career allow rate.
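The interview-adjusted figure reconciles with the others as baseline plus lift, a simple additive model (the additive assumption and the rounding convention are inferred from the displayed numbers):

```python
# Interview-adjusted grant probability as a simple additive model:
# baseline career allow rate plus the observed interview lift.
baseline = 51.0   # grant probability (%)
lift = 19.5       # interview lift (percentage points)

with_interview = baseline + lift
print(f"With interview: {with_interview:.1f}%")  # 70.5%, displayed as 70%
```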
