Prosecution Insights
Last updated: April 19, 2026
Application No. 19/034,368

Video Diffusion Model
Non-Final OA (§103, §112)

Filed: Jan 22, 2025
Examiner: RICHER, AARON M
Art Unit: 2617
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 3 (Non-Final)

Grant Probability: 51% (Moderate)
Expected OA Rounds: 3-4
Time to Grant: 4y 0m
With Interview: 70%

Examiner Intelligence

Career Allow Rate: 51% (236 granted / 465 resolved; -11.2% vs TC avg)
Interview Lift: +19.5% (strong) for resolved cases with an interview
Typical Timeline: 4y 0m average prosecution; 28 applications currently pending
Career History: 493 total applications across all art units
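The headline figures in this card can be reproduced from the raw counts, which is a useful sanity check when comparing examiner dashboards. A minimal sketch (the rounding convention is an assumption inferred from the displayed numbers):

```python
# Reproduce the examiner's headline statistics from the raw counts above.
granted = 236
resolved = 465
pending = 28

# Career allow rate: share of resolved cases that ended in a grant.
allow_rate = 100 * granted / resolved
print(f"Career allow rate: {allow_rate:.1f}%")  # 50.8%, displayed as 51%

# Total applications: resolved cases plus those currently pending.
total_applications = resolved + pending
print(f"Total applications: {total_applications}")  # 493
```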

Statute-Specific Performance

§101: 9.4% (-30.6% vs TC avg)
§103: 54.7% (+14.7% vs TC avg)
§102: 13.1% (-26.9% vs TC avg)
§112: 19.9% (-20.1% vs TC avg)
Tech Center average is an estimate • Based on career data from 465 resolved cases
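Each per-statute figure is paired with a delta against the Tech Center average, so the baseline estimate can be recovered by subtraction. A minimal sketch (assuming delta = examiner rate minus TC average, which is how the displayed numbers reconcile):

```python
# Recover the Tech Center baseline implied by each statute's rate and its
# "vs TC avg" delta, assuming delta = examiner rate - TC average.
stats = {
    "§101": (9.4, -30.6),
    "§103": (54.7, +14.7),
    "§102": (13.1, -26.9),
    "§112": (19.9, -20.1),
}
baselines = {s: round(rate - delta, 1) for s, (rate, delta) in stats.items()}
for statute, baseline in baselines.items():
    print(f"{statute}: TC average estimate ~{baseline}%")
# Every delta implies the same ~40.0% Tech Center baseline estimate.
```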

Office Action

§103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 18 February 2026 have been fully considered but they are not persuasive.

As to claim 1, applicant argues that, in Westcott, the next frame is used with the previous frame and that the key idea is to run the previous frame in the forward process as part of the diffusion process. While the examiner agrees that one embodiment in Westcott requires these frame dependencies, which would appear to diverge from true simultaneous processing of an entire series of frames, there is another embodiment in Westcott that starts from a Gaussian distribution (p. 9, section 0109). Westcott recognizes at p. 10, section 0116 that using the previous frame as a starting point for a subsequent frame is "not necessary to implement the disclosed technique for real-time video diffusion". Such an embodiment would lend itself to being improved with simultaneous processing in a much less challenging way than a temporal dependency-based embodiment where sequential processing would be favored.

Applicant's arguments with respect to the Green reference have been considered but are moot because the new ground of rejection does not rely on this reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-3, 8, 9, 14-17, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Westcott (U.S. Publication 2025/0133238) in view of Linzer (U.S. Patent 12,327,176).

As to claim 1, Westcott discloses a computer-implemented method to perform video generation, the method comprising: generating, by a computing system comprising one or more computing devices, a plurality of inputs that contain noise (fig. 2, elements 130 and 134; p. 4, section 0055; noisy inputs are created using a noising structure), wherein the plurality of inputs respectively correspond to a plurality of timestamps that span a temporal dimension of a video (p. 3-4, section 0053; p. 5, section 0062; p. 8, section 0081; the frames generated correspond to a sequence of frames which would inherently be associated with a time stamp in a temporal dimension, either a particular time or a particular position in a sequence); processing, by the computing system, the plurality of noisy inputs with a machine-learned denoising diffusion model to generate, as an output of the machine-learned denoising diffusion model, a plurality of synthetic frames for the video that respectively correspond to the plurality of timestamps of the video (fig. 2, elements 131 and 136; p. 4, section 0055; p. 5, sections 0062-0063; p. 9, section 0109; frames are synthesized from the noisy frames to create new views; as noted above, the frames correspond to video timestamps), wherein the machine-learned denoising diffusion model comprises a plurality of layers (p. 5, sections 0065-0067; p. 7, sections 0076-0077; at least trainable and cross-attention layers are included in the model), wherein at least a first layer of the plurality of layers performs a temporal downsampling operation to generate a first layer output having a reduced size in the temporal dimension, and wherein at least a second layer of the plurality of layers performs a temporal upsampling operation to generate a second layer output having an increased size in the temporal dimension (p. 4, section 0055; p. 9, section 0112; p. 10, section 0121; a U-Net, which is a structure that downsamples/contracts an input and then upsamples/expands the result, is used in space and time dimensions with frames previous and subsequent in a temporal dimension); and providing, by the computing system, the video as an output (fig. 2, elements 115'; the frames are output from the decoder).

Westcott discloses performing some acts simultaneously but does not expressly disclose simultaneous processing of every input of the plurality of inputs that contain noise and generation of frames. Linzer, however, does disclose this (col. 2, lines 15-31; col. 13, lines 44-59; col. 14-15, claim 1; simultaneous processing of an entire time series of noisy input frames to generate clean output frames is disclosed). The motivation for this is to improve the speed of the network. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify Westcott to simultaneously process every input in a plurality of noisy inputs and generate frames in order to improve the speed of the network as taught by Linzer.

As to claim 2, Westcott discloses wherein the plurality of synthetic frames simultaneously generated by the machine-learned denoising diffusion model comprise an entirety of the video (p. 3, section 0052; a whole video is processed).

As to claim 3, Westcott discloses wherein the machine-learned denoising diffusion model comprises a space-time U-Net (p. 10, section 0121).

As to claim 8, Westcott discloses wherein the machine-learned denoising diffusion model operates in a pixel-space of the video (p. 4, section 0056; the model operates in pixel-space to convert an image to latent space and then back to an image).

As to claim 9, Westcott discloses wherein the machine-learned denoising diffusion model operates in a latent-space of the video, and wherein the machine-learned denoising diffusion model comprises at least a decoder to transform from the latent space of the video to a pixel-space of the video (p. 4, section 0056; after the model operates in latent space, a decoder transforms the space back to an image/pixel space).

As to claim 14, Westcott discloses receiving, by the computing system, a conditioning input; and conditioning, by the computing system, the machine-learned denoising diffusion model on the conditioning input (p. 4, section 0054; image frames can be provided as a conditioning input to condition the model).

As to claim 15, Westcott discloses wherein the conditioning input comprises a textual input (p. 6, section 0094; the unconditional model is replaced with a conditional model where a text embedding is input as guidance).

As to claim 16, Westcott discloses wherein the conditioning input comprises an image input (p. 4, section 0054; image frames can be provided as a conditioning input to condition the model).

As to claim 17, Westcott discloses wherein the image input comprises a masked image input (p. 4, section 0054; an image with face landmarks or Canny edges highlighted would read on a masked image).

As to claim 19, see the rejection to claim 1. Further, Westcott discloses a computing system comprising one or more processors and one or more non-transitory computer-readable media that store computer-readable instructions for performing the operations (p. 7, section 0080-p. 8, section 0081).

As to claim 20, see the rejections to claims 1 and 19.

Claims 4 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Westcott in view of Linzer and further in view of Carreira (U.S. Publication 2022/0012898).

As to claim 4, Westcott discloses wherein the space-time U-Net comprises a pre-trained U-Net (p. 4, section 0055; p. 5, section 0065; p. 7, section 0076; the model is pre-trained and implemented with a U-Net). Westcott does not disclose, but Carreira discloses, a U-Net that has been inflated with temporal layers (p. 2, section 0013; p. 2, section 0016; p. 5, sections 0055-0056; the network, which is a U-Net or similar, is inflated with layers of an extra temporal dimension). The motivation for this is to match motion of features in time. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify Westcott and Linzer to inflate a U-Net with temporal layers in order to match motion of features in time as taught by Carreira.

As to claim 5, Westcott does not disclose, but Carreira discloses, wherein an initial layer of the machine-learned denoising diffusion model and a final layer of the machine-learned denoising diffusion model each have a size in the temporal dimension that matches a number of frames included in the video (fig. 4a; fig. 4b; p. 5, sections 0050-0054; for a video with T frames, where T=64 in the figure, the temporal size of an input layer is T=64 frames and the final layer producing output is also T=64 frames). Motivation for the combination is given in the rejection to claim 4.

Claims 10-13 are rejected under 35 U.S.C. 103 as being unpatentable over Westcott in view of Linzer and further in view of Mann (U.S. Publication 2024/0193835).

As to claim 10, Linzer discloses simultaneous frame generation, as discussed in the rejection to claim 1. Westcott in view of Linzer does not disclose, but Mann discloses, that the plurality of synthetic frames generated by the machine-learned denoising diffusion model (fig. 4; fig. 7; p. 6, section 0059; p. 11, section 0085; p. 15, section 0109; the machine-learning model synthesizing frames can perform denoising and be a diffusion model) comprise a plurality of lower resolution synthetic frames (p. 6, section 0059; p. 9, section 0073; p. 10, sections 0080-0082; low-resolution candidate frames are generated), and the method further comprises, prior to providing the video as an output: processing, by the computing system, the plurality of lower resolution synthetic frames with a machine-learned spatial-super resolution model to generate a plurality of high resolution synthetic frames for the video, wherein the plurality of higher resolution synthetic frames have a relatively larger resolution than the plurality of lower resolution synthetic frames (p. 16, section 0113; the neural/machine-learning renderer includes a super-resolution network/model to generate higher-resolution frames from the lower-resolution frames). The motivation for this is to allow for rapid feedback while still generating a more refined final image (p. 2, section 0010). It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify Westcott and Linzer to have the plurality of synthetic frames generated by the machine-learned denoising diffusion model comprise a plurality of low resolution synthetic frames, and, prior to providing the video as an output: processing, by the computing system, the plurality of low resolution synthetic frames with a machine-learned spatial-super resolution model to generate a plurality of high resolution synthetic frames for the video in order to allow for rapid feedback while still generating a more refined final image as taught by Mann.

As to claim 11, Mann discloses wherein processing, by the computing system, the plurality of lower resolution synthetic frames with the machine-learned spatial-super resolution model comprises processing, by the computing system with the machine-learned spatial-super resolution model, each of a plurality of groups of the low resolution synthetic frames that respectively correspond to a plurality of temporal windows (p. 14, section 0105-p. 15, section 0106; as part of the high-resolution reconstruction, some number of frames X is taken as a temporal window to determine one of the candidate frames; the window slides temporally to create a new frame group which is used to reconstruct another frame). Motivation for the combination is given in the rejection to claim 10.

As to claim 12, Mann discloses wherein processing, by the computing system with the machine-learned spatial-super resolution model, each of the plurality of groups of the lower resolution synthetic frames comprises performing, by the computing system, multi-diffusion across the temporal dimension of two or more of the plurality of groups (p. 14, section 0105-p. 15, section 0106; p. 15, section 0115; as part of the high-resolution reconstruction, some number of frames X is taken as a temporal window to determine one of the candidate frames; the window slides temporally to create a new frame group which is used to reconstruct another frame; by sliding one frame for each input image frame, the window/group would overlap windows/groups for adjacent frames; each module can be implemented as a diffusion network, meaning that the process would include multiple diffusions on the windows/groups). Motivation for the combination is given in the rejection to claim 10.

As to claim 13, Mann discloses wherein the plurality of temporal windows are overlapping, and wherein performing, by the computing system, multi-diffusion comprises performing, by the computing system, multi-diffusion on overlapping temporal portions of the two or more of the plurality of groups (p. 14, section 0105-p. 15, section 0106; p. 15, section 0115; as part of the high-resolution reconstruction, some number of frames X is taken as a temporal window to determine one of the candidate frames; the window slides temporally to create a new frame group which is used to reconstruct another frame; by sliding one frame for each input image frame, the window/group would overlap windows/groups for adjacent frames; each module can be implemented as a diffusion network, meaning that the process would include multiple diffusions on the windows/groups). Motivation for the combination is given in the rejection to claim 10.

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Westcott in view of Linzer and further in view of Kuang (CN 116883545 A, herein represented by a translation).

As to claim 18, Westcott does not disclose, but Kuang discloses, wherein the machine-learned denoising diffusion model (p. 2, Background technology; the diffusion model creates noisy samples and then denoises) comprises a plurality of weights that have been derived by interpolating between a base set of weights and a style-specific set of weights (p. 3-4, S2; p. 4, S201; p. 7-9, S2-S203; weights are derived from a weighted combination/interpolation of original/base weights and a residual for a particular style calculation, which can read on style-specific weights). The motivation for this is to expand a current data set (p. 3). It would have been obvious to one skilled in the art before the effective filing date to modify Westcott and Linzer to use a plurality of weights that have been derived by interpolating between a base set of weights and a style-specific set of weights in order to expand a current data set as taught by Kuang.

Conclusion

Claims 6 and 7 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AARON M RICHER whose telephone number is (571)272-7790. The examiner can normally be reached 9AM-5PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, King Poon, can be reached at (571)272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AARON M RICHER/
Primary Examiner, Art Unit 2617
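The sliding, overlapping temporal windows that the rejections of claims 11-13 attribute to Mann can be illustrated with a short sketch. The window size and stride below are hypothetical examples chosen for illustration, not values from Mann; the one-frame slide matches the examiner's reasoning that adjacent frame groups overlap.

```python
# Illustrative sketch of the sliding temporal-window grouping described in
# the rejections of claims 11-13: the window advances one frame at a time,
# so adjacent frame groups overlap. Window size and stride are hypothetical.
def temporal_windows(num_frames: int, window: int = 4, stride: int = 1):
    """Return [start, end) frame-index windows over a clip."""
    return [(start, start + window)
            for start in range(0, num_frames - window + 1, stride)]

windows = temporal_windows(num_frames=8)
print(windows)  # [(0, 4), (1, 5), (2, 6), (3, 7), (4, 8)]

# Adjacent windows share window - stride = 3 frames: the overlapping
# temporal portions on which multi-diffusion would operate (claim 13).
overlap = set(range(*windows[0])) & set(range(*windows[1]))
print(sorted(overlap))  # [1, 2, 3]
```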

Prosecution Timeline

Jan 22, 2025: Application Filed
Jul 12, 2025: Non-Final Rejection — §103, §112
Oct 13, 2025: Response Filed
Nov 14, 2025: Final Rejection — §103, §112
Feb 04, 2026: Applicant Interview (Telephonic)
Feb 04, 2026: Examiner Interview Summary
Feb 18, 2026: Request for Continued Examination
Feb 23, 2026: Response after Non-Final Action
Mar 06, 2026: Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586151: Frame Rate Extrapolation (granted Mar 24, 2026; 2y 5m to grant)
Patent 12579600: SEAMLESS VIDEO IN HETEROGENEOUS CORE INFORMATION HANDLING SYSTEM (granted Mar 17, 2026; 2y 5m to grant)
Patent 12571669: DETECTING AND GENERATING A RENDERING OF FILL LEVEL AND DISTRIBUTION OF MATERIAL IN RECEIVING VEHICLE(S) (granted Mar 10, 2026; 2y 5m to grant)
Patent 12555305: Systems And Methods For Generating And/Or Using 3-Dimensional Information With Camera Arrays (granted Feb 17, 2026; 2y 5m to grant)
Patent 12548233: 3D TEXTURING VIA A RENDERING LOSS (granted Feb 10, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 51% (70% with interview, +19.5%)
Median Time to Grant: 4y 0m
PTA Risk: High
Based on 465 resolved cases by this examiner. Grant probability derived from career allow rate.
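The interview-adjusted figure reconciles with the others as baseline plus lift, a simple additive model (the additive assumption and the rounding convention are inferred from the displayed numbers):

```python
# Interview-adjusted grant probability as a simple additive model:
# baseline career allow rate plus the observed interview lift.
baseline = 51.0   # grant probability (%)
lift = 19.5       # interview lift (percentage points)

with_interview = baseline + lift
print(f"With interview: {with_interview:.1f}%")  # 70.5%, displayed as 70%
```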
