Prosecution Insights
Last updated: April 19, 2026
Application No. 18/607,813

MULTI-COMPONENT LATENT PYRAMID SPACE FOR GENERATIVE MODELS

Non-Final OA — §103, §DP
Filed
Mar 18, 2024
Examiner
SALVUCCI, MATTHEW D
Art Unit
2613
Tech Center
2600 — Communications
Assignee
Adobe Inc.
OA Round
1 (Non-Final)
Grant Probability: 72% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 12m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 72% — above average (348 granted / 485 resolved; +9.8% vs TC avg)
Interview Lift: +28.5% across resolved cases with an interview — a strong lift
Avg Prosecution: 2y 12m typical timeline; 17 applications currently pending
Career History: 502 total applications across all art units

Statute-Specific Performance

§101: 4.6% (-35.4% vs TC avg)
§103: 60.8% (+20.8% vs TC avg)
§102: 17.0% (-23.0% vs TC avg)
§112: 14.3% (-25.7% vs TC avg)
Tech Center average shown as an estimate for comparison • Based on career data from 485 resolved cases

Office Action

§103 §DP
DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Election/Restrictions
Restriction to one of the following inventions is required under 35 U.S.C. 121:
I. Claims 1-8 and 16-20, drawn to image synthesis, classified in G06T11/00.
II. Claims 9-15, drawn to training of an auto-encoder, classified in G06N3/08.
The inventions are independent or distinct, each from the other because: Inventions I and II are related as subcombinations disclosed as usable together in a single combination. The subcombinations are distinct if they do not overlap in scope and are not obvious variants, and if it is shown that at least one subcombination is separately usable. In the instant case, subcombination II has separate utility such as the training of an autoencoder. See MPEP § 806.05(d).
The examiner has required restriction between subcombinations usable together. Where applicant elects a subcombination and claims thereto are subsequently found allowable, any claim(s) depending from or otherwise requiring all the limitations of the allowable subcombination will be examined for patentability in accordance with 37 CFR 1.104. See MPEP § 821.04(a). Applicant is advised that if any claim presented in a divisional application is anticipated by, or includes all the limitations of, a claim that is allowable in the present application, such claim may be subject to provisional statutory and/or nonstatutory double patenting rejections over the claims of the instant application.
Restriction for examination purposes as indicated is proper because all the inventions listed in this action are independent or distinct for the reasons given above and there would be a serious search and/or examination burden if restriction were not required because one or more of the following reasons apply:
--the species or groupings of patentably indistinct species have acquired a separate status in the art in view of their different classification;
--the species or groupings of patentably indistinct species require a different field of search (e.g., searching different classes/subclasses or electronic resources, or employing different search strategies or search queries).
Applicant is advised that the reply to this requirement to be complete must include (i) an election of an invention to be examined even though the requirement may be traversed (37 CFR 1.143) and (ii) identification of the claims encompassing the elected invention. The election of an invention may be made with or without traverse. To reserve a right to petition, the election must be made with traverse. If the reply does not distinctly and specifically point out supposed errors in the restriction requirement, the election shall be treated as an election without traverse. Traversal must be presented at the time of election in order to be considered timely. Failure to timely traverse the requirement will result in the loss of right to petition under 37 CFR 1.144. If claims are added after the election, applicant must indicate which of these claims are readable upon the elected invention.
Should applicant traverse on the ground that the inventions are not patentably distinct, applicant should submit evidence or identify such evidence now of record showing the inventions to be obvious variants or clearly admit on the record that this is the case.
In either instance, if the examiner finds one of the inventions unpatentable over the prior art, the evidence or admission may be used in a rejection under 35 U.S.C. 103 or pre-AIA 35 U.S.C. 103(a) of the other invention.
During a telephone conversation with Michael Carey on 15 December 2025 a provisional election was made with traverse to prosecute the invention of group I, claims 1-8 and 16-20. Affirmation of this election must be made by applicant in replying to this Office action. Claims 9-15 are withdrawn from further consideration by the examiner, 37 CFR 1.142(b), as being drawn to a non-elected invention.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-8 and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Ho et al. (US Patent 11908180), hereinafter Ho, in view of Rombach et al. (NPL: High-Resolution Image Synthesis with Latent Diffusion Models).
Regarding claim 1, Ho discloses a method comprising: obtaining a text prompt (Fig. 1B; Column 1, lines 40-58: receiving a text prompt describing a scene; processing the text prompt using a text encoder neural network to generate a contextual embedding of the text prompt; and processing the contextual embedding using a sequence of generative neural networks to generate a final video depicting the scene. The sequence of generative neural networks includes: an initial generative neural network configured to: receive the contextual embedding; and process the contextual embedding to generate, as output, an initial output video having: (i) an initial spatial resolution, and (ii) an initial temporal resolution; and one or more subsequent generative neural networks each configured to: receive a respective input including an input video generated as output by a preceding generative neural network in the sequence; and process the respective input to generate, as output, a respective output video having at least one of: (i) a higher spatial resolution, or (ii) a higher temporal resolution, than the input video); generating, using a generator of an image generation model, a feature embedding based on the text prompt, wherein the feature embedding includes a first set of channels that encodes a first value of an image characteristic and a second set of channels that encodes a residual between the first value of the image characteristic and a second value of the image characteristic (Fig.
2A; Column 5, line 59-Column 6, line 14: modularity of the video generation system allows multiple devices to implement individual components of the system separately from one another. Particularly, different GNNs in the sequence can be executed on different devices and can transmit their outputs and/or inputs to one another (e.g., via telecommunications). As one example, the text encoder and a subset of the GNNs may be implemented on a client device (e.g., a mobile device) and the remainder of the GNNs may be implemented on a remote device (e.g., in a data center). The client device can receive an input (e.g., a text prompt) and process the text prompt using the text encoder and the subset of GNNs to generate an output video with a particular resolution. The client device can then transmit its outputs (e.g., the output video and a contextual embedding of the text prompt) which is received at the remote device as input. The remote device can then process the input using the remainder of the GNNs to generate a final output video having a higher resolution than the output video; Column 9, line 49-Column 10, line 15: text encoder 110 is configured to process the text prompt 102 to generate a contextual embedding (u) of the text prompt 102. In some implementations, the text encoder 110 is a pre-trained natural language text encoder, e.g., a T5 text encoder such as T5-XXL, a CLIP text encoder, among others. The contextual embedding 104 can also be referred to as an encoded representation of the text prompt 102 that provides a computationally amenable representation for processing by the system 100. For example, the contextual embedding 104 can be a set, vector, or array of values (e.g., in UNICODE or Base64 encoding), alphanumeric values, symbols, or any convenient encoding…sequence of GNNs 121 includes multiple GNNs 120 that are each configured to receive a respective input (c). Each GNN 120 is configured to process their respective input to generate a respective output video (x). In general, the sequence 121 includes an initial GNN that generates an initial output video (e.g., at low resolution) and one or more subsequent GNNs that progressively refine the resolution of the initial output video, i.e., the spatial resolution, the temporal resolution, or both. For example, each subsequent GNN can be a spatial super-resolution (SSR) model to increase spatial resolution, a temporal super-resolution (TSR) model to increase temporal resolution, or a joint spatial-temporal super-resolution (STSR) model to increase both spatial and temporal resolution. Accordingly, the respective input for the initial GNN includes the contextual embedding 104, while the respective input for each subsequent GNN includes the output video generated by a preceding GNN in the sequence 121. In some cases, the respective input to each subsequent GNN may include one or more output videos generated at lower depth in the sequence 121, as opposed to only the preceding GNN. Such cases are not described in detail but can be realized using the techniques outlined herein. In some implementations, the respective input for each subsequent GNN also includes the contextual embedding 104 which generally improves performance of the system 100, e.g., such that each subsequent GNN generates a respective output video that is strongly conditioned on the text prompt 102. The system 100 processes the contextual embedding 104 through the sequence 121 to generate an output video 106 at a high resolution, with few (if any) artifacts. 
The output video 106 is usually the respective output video of a final GNN in the sequence 121 but can, in principle, be provided by any GNN 120 in the sequence 121). Ho does not explicitly disclose generating, using a decoder of the image generation model, a synthetic image corresponding to the second value of the image characteristic based on the feature embedding. However, Rombach teaches image synthesis from a text prompt (Sections 3.2-3.3), further comprising generating, using a decoder of the image generation model, a synthetic image corresponding to the second value of the image characteristic based on the feature embedding (Section 3.1: given an image x ∈ R H×W×3 in RGB space, the encoder E encodes x into a latent representation z = E(x), and the decoder D reconstructs the image from the latent; Section 3.2: With our trained perceptual compression models consisting of E and D, we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower dimensional, computationally much more efficient space…our model is realized as a time-conditional UNet [71]. Since the forward process is fixed, zt can be efficiently obtained from E during training, and samples from p(z) can be decoded to image space with a single pass through D). Rombach teaches that this will reduce the computational cost (Section 1). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ho with the features of above as taught by Rombach so as to reduce the computational cost as presented by Rombach. Regarding claim 2, Ho, in view of Rombach teaches the method of claim 1, Ho discloses wherein: the generator and the decoder of the image generation model are trained using a third encoder that outputs a third set of channels (Fig. 3A; Column 14, lines 31-64: EM algorithms and certain objective functions Lθ(x, c) can be computationally intractable in some cases, e.g., when the training engine uses considerably large training sets, the prior and/or conditional distributions are particularly complex, etc. In these cases, the training engine can simultaneously model posterior distributions qϕ(z|x, c) over the latent representations which can speed up the calculus during training, e.g., when the training engine maximizes the evidence lower bound (ELBO). The posterior distribution describes how data (x, c) is encoded into latent representations z. Here, ϕ is another set of network parameters that can be included in a respective GNN 120 or another neural network, e.g., a discriminative neural network (DNN). A GNN 120 can sample from the posterior distribution instead of the prior distribution during training, which can significantly reduce the number of latents z needed to converge to a suitable parameterization θ, e.g., when the training engine simultaneously optimizes an objective function Lθϕ((x, c) with respect to θ and ϕ. After training, the GNN 120 can continue sampling from the prior distribution. In some implementations, the training engine can model the posterior distribution as a normal distribution qϕ(z|x, c)=N(z; μϕ(x, c), σϕ2(x, c)I), where μϕ((x, c) and σϕ2(x, c) are the mean and variance, respectively, as a function of x and c. 
As mentioned above with respect to the conditional distribution, a parametrization of this form can aid in optimizing stochastic terms (e.g., using gradient descent methods) that would otherwise be non-differentiable. For reference, the conditional distribution pθ(x|z, c) in combination with the posterior distribution qϕ(z|x, c) is usually referred to as a variational auto-encoder (VAE), with θ being the decoder parameters and ϕ being the encoder parameters; Column 21, lines 49-61: training engine 300 obtains multiple training examples 310, for instance, from a publically available training set or any suitably labeled text-video training set. Each training example 310 includes: (i) a respective input text prompt (T) 302 describing a particular scene, and (ii) a corresponding target video (x) 306 depicting the particular scene. The text encoder neural network 110 processes the respective input text prompt 302 of each training example 310 to generate a corresponding contextual embedding (u) 304 of the input text prompt 302. In some implementations, the text encoder 110 is pre-trained and held frozen 111 by the training engine 300 during the joint training of the GNNs 120.0-n). Regarding claim 3, Ho, in view of Rombach teaches the method of claim 1, Ho discloses wherein generating the feature embedding comprises: performing a reverse latent diffusion process (Fig. 1A; Column 20, lines 1-67: the initial GNN 120.0 is a DBGNN. As explained in Section III, the initial DBGNN 120.0 can perform a reversed process starting from t=1 and ending at t=0 to generate the initial output video 106.0. For example, the initial DBGNN 120.0 can sample a latent representation z1 (0) from its (reverse) prior distribution z1 (0)˜N(z1 (0); 0, I) at t=1 and continually update the latent zs(0) at each sampling step using the ancestral sampler. That is, the initial DBGNN 120.0 can process a current latent z t (0) and generate a current estimate {circumflex over (x)}θ(zt (0), c (0)). The initial DBGNN 120.0 can then determine a new latent zs (0) from the current estimate using the update rule for s<t. The initial DBGNN 120.0 updates the latent until reaching z 0(0) at t=0 and thereafter outputs the corresponding estimate as the initial output video {circumflex over (x)}(0)={circumflex over (x)}θ(z0 (0), c (0)). In some implementations, the initial DBGNN 120.0 uses one or more of a v-parametrization, progressive distillation, and/or classifier-free guidance when generating the initial output video {circumflex over (x)}(0)…Each subsequent GNN 120.i is configured to receive a respective input c.(i)=({circumflex over (x)}(i-1)) that includes an input video {circumflex over (x)} (i-1) generated as output by a preceding GNN in the sequence 121. Each subsequent GNN 120.i is configured to process the respective input to generate a respective output video ({circumflex over (x)} (1)) 106.i. As explained in Section II, each subsequent GNN 120.i can also apply noise conditioning augmentation to their input video {circumflex over (x)} (i-1), e.g., Gaussian noise conditioning. In some implementations, the respective input c(i)=({circumflex over (x)} (i-1), u) of each subsequent GNN 120.i also includes the contextual embedding 104 of the text prompt 102…each subsequent GNN 120.i is a DBGNN. As explained in Section III, a subsequent DBGNN 120.i can perform a reversed process starting from t=1 and ending at t=0 to generate an output video 106.i. 
For example, the subsequent DBGNN 120.i can sample a latent representation z1(i) from its (reverse) prior distribution). Regarding claim 4, Ho, in view of Rombach teaches the method of claim 1, Ho discloses wherein the first set of channels encodes a different modality from the second set of channels (Column 9, lines 17-36: the text prompt 102 can describe any particular scene and the system 100, when appropriately trained (e.g., by a training engine), is capable of generating high resolution videos that faithfully depict the scene. The text prompt 102 can also include text modifiers such as “Smooth”, “Studio lighting”, “Pixel art”, “in the style of Van Gough”, etc. that impart various styles, modifications and/or characteristics on final videos 108 generated by the system 100. The system 100 can generate various different types of videos such as photorealistic videos, cartoon videos, abstract visualizations, imaging videos of different modalities, among others. For example, the system 100 can generate medical videos, e.g., videos depicting a sequence of MRI, CT or ultrasound video frames; Column 24, lines 48-61: GNNs 120.0-n with any of these neural network configurations can employ the base video generation models and super-resolution models described herein (see FIGS. 5A-5C for particular examples). Note, subsequent GNNs 120.1-n employing super-resolution models can condition on their input videos by spatially and/or temporally up-sampling the input videos and thereafter concatenating channel-wise to the noisy data z(i), or to zt(i) in the case of DBGNNs. For example, subsequent GNNs 120.1-n can perform spatial up-sampling on input videos using bilinear resizing before concatenation. As another example, subsequent GNNs 120.1-n can perform temporal up-sampling on input videos by repeating frames and/or by filling in blank frames before concatenation). Regarding claim 5, Ho, in view of Rombach teaches the method of claim 1, Ho discloses wherein: the image characteristic comprises spatial resolution of the image, the first value of the image characteristic comprises a low spatial resolution, and the second value of the image characteristic comprises a high spatial resolution that is higher than the low spatial resolution (Column 24, line 61-Column 25, line 17: SSRs 120.S increase the spatial resolution of input videos, TSRs 120.T increase the temporal resolution of input videos, while STSRs 120.ST increase both the spatial and temporal resolution of input videos. For example, the SSRs 120.S can increase the number of independent pixels for all input frames of an input video, while the TSRs 120.T can generate independent frames between input frames of an input video. On the other hand, the STSRs 120.ST can increase the number of independent pixels of all input frames of an input video while also generating independent frames between the input frames. The super-resolution models are general purpose video super-resolution models that can be applied to real videos and/or samples from any type of GNN. Moreover, the GNNs 120.0-n can generate all output frames of their respective output video simultaneously so, for instance, SSRs 120.S and STSRs 120.ST do not suffer from artifacts that would occur from naively running spatial super-resolution on individual frames. 
Operating on input frames and generating output frames simultaneously can help capture the temporal coherence across the entire length of a final output video 106.SjTk compared to, for example, frame-autoregressive approaches that have generally struggled maintaining temporal coherence; Column 7, line 51-Column 8, line 11: the term “spatial resolution” generally refers to how close lines in the video can be to each other and still be visibly resolved. That is, how close two lines can be to each other without them appearing as a single line in the video. In some implementations, the spatial resolution can be identified with the per frame pixel resolution which, in this case, corresponds to the number of independent pixels per unit length (or per unit area) for each video frame—not necessarily the total number of pixels per unit length (or per unit area) for each video frame. In particular, a first video can have a higher pixel count per frame than a second video but is still of worse spatial resolution than the second video. For example, naively up-sampling the pixels of each frame of a video increases the pixel count per frame but does not increase the spatial resolution. Generally, a relative length scale is also assumed to have a definite comparison of spatial resolution between videos. For example, a digital video with 2048×1536 independent pixels per frame may appear as low resolution (˜72 pixels per inch (ppi)) if viewed at 28.5 inches wide, but may appear as high resolution (˜300 ppi) if viewed at 7 inches wide). Regarding claim 6, Ho, in view of Rombach teaches the method of claim 1, Ho discloses wherein: the synthetic image comprises a frame of a video, the image characteristic comprises temporal resolution of the video, the first value of the image characteristic comprises a low temporal resolution and the second value of the image characteristic comprises a high temporal resolution that is higher than the low temporal resolution (Column 8, lines 13-39: the term “temporal resolution” generally refers to how close different events can occur in the video while still being resolvable. That is, how close two events can occur without them appearing to occur simultaneously in the video. In some implementations, the temporal resolution can be identified with the framerate which, in this case, corresponds to the number of independent video frames per unit time—not necessarily the total number of video frames per unit time. In particular, a first video can have a higher frame count than a second video but is till of worse temporal resolution than the second video. For example, naively up-sampling the frames of a video increases the frame count but does not increase the temporal resolution. Generally, a relative time scale is also assumed to have a definite comparison of temporal resolution between videos. For example, a digital video with 240 independent frames may appear as low resolution (˜4 frames per second (fps)) if viewed over 60 seconds, but may appear as high resolution (˜24 fps) if viewed over 10 seconds; Column 24, line 61-Column 25, line 17: SSRs 120.S increase the spatial resolution of input videos, TSRs 120.T increase the temporal resolution of input videos, while STSRs 120.ST increase both the spatial and temporal resolution of input videos. For example, the SSRs 120.S can increase the number of independent pixels for all input frames of an input video, while the TSRs 120.T can generate independent frames between input frames of an input video. 
On the other hand, the STSRs 120.ST can increase the number of independent pixels of all input frames of an input video while also generating independent frames between the input frames. The super-resolution models are general purpose video super-resolution models that can be applied to real videos and/or samples from any type of GNN. Moreover, the GNNs 120.0-n can generate all output frames of their respective output video simultaneously so, for instance, SSRs 120.S and STSRs 120.ST do not suffer from artifacts that would occur from naively running spatial super-resolution on individual frames. Operating on input frames and generating output frames simultaneously can help capture the temporal coherence across the entire length of a final output video 106.SjTk compared to, for example, frame-autoregressive approaches that have generally struggled maintaining temporal coherence). Regarding claim 7, Ho, in view of Rombach teaches the method of claim 1, Ho discloses wherein: the image generation model is trained in a first stage using a first encoder and a second stage using the first encoder and a second encoder (Fig. 1A; Column 4, lines 31-58: the system can use a pre-trained text encoder neural network to process a text prompt and generate a contextual embedding of the text prompt. The text prompt can describe a scene (e.g., as a sequence of tokens in a natural language) and the contextual embedding can represent the scene in a computationally amendable form (e.g., as a set or vector of numeric values, alphanumeric values, symbols, or other encoded representation). The training engine can also hold the text encoder frozen when the sequence of GNNs is trained to improve alignment between text prompts and videos generated at inference. A frozen text encoder can be particularly effective because it can enable the system to learn deep language encodings of scenes that may otherwise be infeasible if the text encoder were trained in parallel, e.g., due to the text encoder being biased on the particular scenes described by text-video training pairs. Furthermore, text-based training sets are generally more plentiful and sophisticated than currently available text-video or text-image training sets which allows the text encoder to be pre-trained and thereafter implemented in a highly optimized fashion; Column 9, lines 36-48: text encoder 110 is configured to process the text prompt 102 to generate a contextual embedding (u) of the text prompt 102. In some implementations, the text encoder 110 is a pre-trained natural language text encoder, e.g., a T5 text encoder such as T5-XXL, a CLIP text encoder, among others. The contextual embedding 104 can also be referred to as an encoded representation of the text prompt 102 that provides a computationally amenable representation for processing by the system 100. For example, the contextual embedding 104 can be a set, vector, or array of values (e.g., in UNICODE or Base64 encoding), alphanumeric values, symbols, or any convenient encoding; Column 10, lines 35-61: the post-processor 130 may perform analysis on the output video 106 such as video classification and/or video quality analysis. The post-processor 130 may include one or more neural networks such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a video encoder to perform such analysis. 
For example, the post-processor 130 can determine if the output video 106 accurately depicts the scene described by the text prompt 102 by encoding the output video 106 into a video embedding and comparing it with the contextual embedding 104. In these cases, the post-processor 130 may include a video encoder that is paired with the text encoder 110, such as pre-trained text-video encoder pair (similar to a CLIP text-image encoder pair)). Regarding claim 8, Ho, in view of Rombach teaches the method of claim 1, Ho discloses wherein: the synthetic image includes an element from the text prompt based on the feature embedding (Column 4, lines 31-58: provide high fidelity text-to-video generation, the system can use a pre-trained text encoder neural network to process a text prompt and generate a contextual embedding of the text prompt. The text prompt can describe a scene (e.g., as a sequence of tokens in a natural language) and the contextual embedding can represent the scene in a computationally amendable form (e.g., as a set or vector of numeric values, alphanumeric values, symbols, or other encoded representation). The training engine can also hold the text encoder frozen when the sequence of GNNs is trained to improve alignment between text prompts and videos generated at inference). Regarding claim 16, Ho discloses an apparatus comprising: at least one processor (Column 31, lines 25-35: Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry); at least one memory storing instruction executable by the at least one processor (Column 31, lines 25-35: Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry); and wherein the generator is trained to generate a feature embedding based on a text prompt, wherein the feature embedding includes a first set of channels that encodes a first value of an image characteristic and a second set of channels that encodes a residual between the first value of the image characteristic and a second value of the image characteristic (Fig. 2A; Column 5, line 59-Column 6, line 14: modularity of the video generation system allows multiple devices to implement individual components of the system separately from one another. Particularly, different GNNs in the sequence can be executed on different devices and can transmit their outputs and/or inputs to one another (e.g., via telecommunications). 
As one example, the text encoder and a subset of the GNNs may be implemented on a client device (e.g., a mobile device) and the remainder of the GNNs may be implemented on a remote device (e.g., in a data center). The client device can receive an input (e.g., a text prompt) and process the text prompt using the text encoder and the subset of GNNs to generate an output video with a particular resolution. The client device can then transmit its outputs (e.g., the output video and a contextual embedding of the text prompt) which is received at the remote device as input. The remote device can then process the input using the remainder of the GNNs to generate a final output video having a higher resolution than the output video; Column 9, line 49-Column 10, line 15: text encoder 110 is configured to process the text prompt 102 to generate a contextual embedding (u) of the text prompt 102. In some implementations, the text encoder 110 is a pre-trained natural language text encoder, e.g., a T5 text encoder such as T5-XXL, a CLIP text encoder, among others. The contextual embedding 104 can also be referred to as an encoded representation of the text prompt 102 that provides a computationally amenable representation for processing by the system 100. For example, the contextual embedding 104 can be a set, vector, or array of values (e.g., in UNICODE or Base64 encoding), alphanumeric values, symbols, or any convenient encoding…sequence of GNNs 121 includes multiple GNNs 120 that are each configured to receive a respective input (c). Each GNN 120 is configured to process their respective input to generate a respective output video (x). In general, the sequence 121 includes an initial GNN that generates an initial output video (e.g., at low resolution) and one or more subsequent GNNs that progressively refine the resolution of the initial output video, i.e., the spatial resolution, the temporal resolution, or both. For example, each subsequent GNN can be a spatial super-resolution (SSR) model to increase spatial resolution, a temporal super-resolution (TSR) model to increase temporal resolution, or a joint spatial-temporal super-resolution (STSR) model to increase both spatial and temporal resolution. Accordingly, the respective input for the initial GNN includes the contextual embedding 104, while the respective input for each subsequent GNN includes the output video generated by a preceding GNN in the sequence 121. In some cases, the respective input to each subsequent GNN may include one or more output videos generated at lower depth in the sequence 121, as opposed to only the preceding GNN. Such cases are not described in detail but can be realized using the techniques outlined herein. In some implementations, the respective input for each subsequent GNN also includes the contextual embedding 104 which generally improves performance of the system 100, e.g., such that each subsequent GNN generates a respective output video that is strongly conditioned on the text prompt 102. The system 100 processes the contextual embedding 104 through the sequence 121 to generate an output video 106 at a high resolution, with few (if any) artifacts. 
The output video 106 is usually the respective output video of a final GNN in the sequence 121 but can, in principle, be provided by any GNN 120 in the sequence 121), wherein the decoder is trained to generate the synthetic image corresponding to a second image characteristic based on the feature embedding, and wherein the generator and the decoder are trained using a first encoder that outputs the first set of channels and a second encoder that outputs the second set of channels (Fig. 1A; Column 4, lines 31-58: the system can use a pre-trained text encoder neural network to process a text prompt and generate a contextual embedding of the text prompt. The text prompt can describe a scene (e.g., as a sequence of tokens in a natural language) and the contextual embedding can represent the scene in a computationally amendable form (e.g., as a set or vector of numeric values, alphanumeric values, symbols, or other encoded representation). The training engine can also hold the text encoder frozen when the sequence of GNNs is trained to improve alignment between text prompts and videos generated at inference. A frozen text encoder can be particularly effective because it can enable the system to learn deep language encodings of scenes that may otherwise be infeasible if the text encoder were trained in parallel, e.g., due to the text encoder being biased on the particular scenes described by text-video training pairs. Furthermore, text-based training sets are generally more plentiful and sophisticated than currently available text-video or text-image training sets which allows the text encoder to be pre-trained and thereafter implemented in a highly optimized fashion; Column 9, lines 17-48: the text prompt 102 can describe any particular scene and the system 100, when appropriately trained (e.g., by a training engine), is capable of generating high resolution videos that faithfully depict the scene. The text prompt 102 can also include text modifiers such as “Smooth”, “Studio lighting”, “Pixel art”, “in the style of Van Gough”, etc. that impart various styles, modifications and/or characteristics on final videos 108 generated by the system 100. The system 100 can generate various different types of videos such as photorealistic videos, cartoon videos, abstract visualizations, imaging videos of different modalities, among others. For example, the system 100 can generate medical videos, e.g., videos depicting a sequence of MRI, CT or ultrasound video frames… text encoder 110 is configured to process the text prompt 102 to generate a contextual embedding (u) of the text prompt 102. In some implementations, the text encoder 110 is a pre-trained natural language text encoder, e.g., a T5 text encoder such as T5-XXL, a CLIP text encoder, among others. The contextual embedding 104 can also be referred to as an encoded representation of the text prompt 102 that provides a computationally amenable representation for processing by the system 100. For example, the contextual embedding 104 can be a set, vector, or array of values (e.g., in UNICODE or Base64 encoding), alphanumeric values, symbols, or any convenient encoding; Column 10, lines 35-61: the post-processor 130 may perform analysis on the output video 106 such as video classification and/or video quality analysis. The post-processor 130 may include one or more neural networks such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a video encoder to perform such analysis. 
For example, the post-processor 130 can determine if the output video 106 accurately depicts the scene described by the text prompt 102 by encoding the output video 106 into a video embedding and comparing it with the contextual embedding 104. In these cases, the post-processor 130 may include a video encoder that is paired with the text encoder 110, such as pre-trained text-video encoder pair (similar to a CLIP text-image encoder pair); Column 24, lines 48-61: GNNs 120.0-n with any of these neural network configurations can employ the base video generation models and super-resolution models described herein (see FIGS. 5A-5C for particular examples). Note, subsequent GNNs 120.1-n employing super-resolution models can condition on their input videos by spatially and/or temporally up-sampling the input videos and thereafter concatenating channel-wise to the noisy data z(i), or to zt(i) in the case of DBGNNs. For example, subsequent GNNs 120.1-n can perform spatial up-sampling on input videos using bilinear resizing before concatenation. As another example, subsequent GNNs 120.1-n can perform temporal up-sampling on input videos by repeating frames and/or by filling in blank frames before concatenation). Ho does not explicitly disclose an image generation model comprising instruction stored in the at least one memory and trained to generate a synthetic image, where the image generation model includes a generator and a decoder. However, Rombach teaches image synthesis from a text prompt (Sections 3.2-3.3), further comprising an image generation model comprising instruction stored in the at least one memory and trained to generate a synthetic image, where the image generation model includes a generator and a decoder (Section 3.1: given an image x ∈ R H×W×3 in RGB space, the encoder E encodes x into a latent representation z = E(x), and the decoder D reconstructs the image from the latent; Section 3.2: With our trained perceptual compression models consisting of E and D, we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower dimensional, computationally much more efficient space…our model is realized as a time-conditional UNet [71]. Since the forward process is fixed, zt can be efficiently obtained from E during training, and samples from p(z) can be decoded to image space with a single pass through D). Rombach teaches that this will reduce the computational cost (Section 1). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ho with the features of above as taught by Rombach so as to reduce the computational cost as presented by Rombach. Regarding claim 17, Ho, in view of Rombach teaches the apparatus of claim 16, Rombach discloses wherein: the generator comprises a latent diffusion model (Section 3.2: Diffusion Models [82] are probabilistic models designed to learn a data distribution p(x) by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed Markov Chain of length T. For image synthesis, the most successful models [15,30,72] rely on a reweighted variant of the variational lower bound on p(x), which mirrors denoising score-matching [85]. 
These models can be interpreted as an equally weighted sequence of denoising autoencoders θ(xt, t); t = 1 . . . T, which are trained to predict a denoised variant of their input xt, where xt is a noisy version of the input x). Regarding claim 18, Ho, in view of Rombach teaches the apparatus of claim 16, Ho discloses wherein: the decoder is trained using a variational autoencoder (VAE) model based on an output of the first encoder and the second encoder (Column 5, lines 29-44: the video generation system uses diffusion-based models for each of the GNNs, although any combination of generative models can be utilized by the system, e.g., variational auto-encoders (VAEs), generative adversarial networks (GANs), etc. Diffusion models can be particularly effective in the modular setting of the system due to their controllability and scalability. For example, compared to some generative models, diffusion models can be efficiently trained by the training engine on computationally tractable objective functions with respect to a given training set. These objective functions can be straightforwardly optimized by the training engine to increase the speed and performance of diffusion-based GNNs (DBGNNs), as well as enable techniques such as classifier-free guidance and progressive distillation which further improve performance). Regarding claim 19, Ho, in view of Rombach teaches the apparatus of claim 18, Ho discloses wherein: the decoder is trained using a VAE model based on an output of a third encoder that outputs a third set of channels (Fig. 3A; Column 14, lines 31-64: EM algorithms and certain objective functions Lθ(x, c) can be computationally intractable in some cases, e.g., when the training engine uses considerably large training sets, the prior and/or conditional distributions are particularly complex, etc. In these cases, the training engine can simultaneously model posterior distributions qϕ(z|x, c) over the latent representations which can speed up the calculus during training, e.g., when the training engine maximizes the evidence lower bound (ELBO). The posterior distribution describes how data (x, c) is encoded into latent representations z. Here, ϕ is another set of network parameters that can be included in a respective GNN 120 or another neural network, e.g., a discriminative neural network (DNN). A GNN 120 can sample from the posterior distribution instead of the prior distribution during training, which can significantly reduce the number of latents z needed to converge to a suitable parameterization θ, e.g., when the training engine simultaneously optimizes an objective function Lθϕ((x, c) with respect to θ and ϕ. After training, the GNN 120 can continue sampling from the prior distribution. In some implementations, the training engine can model the posterior distribution as a normal distribution qϕ(z|x, c)=N(z; μϕ(x, c), σϕ2(x, c)I), where μϕ((x, c) and σϕ2(x, c) are the mean and variance, respectively, as a function of x and c. As mentioned above with respect to the conditional distribution, a parametrization of this form can aid in optimizing stochastic terms (e.g., using gradient descent methods) that would otherwise be non-differentiable. 
For reference, the conditional distribution pθ(x|z, c) in combination with the posterior distribution qϕ(z|x, c) is usually referred to as a variational auto-encoder (VAE), with θ being the decoder parameters and ϕ being the encoder parameters; Column 21, lines 49-61: training engine 300 obtains multiple training examples 310, for instance, from a publically available training set or any suitably labeled text-video training set. Each training example 310 includes: (i) a respective input text prompt (T) 302 describing a particular scene, and (ii) a corresponding target video (x) 306 depicting the particular scene. The text encoder neural network 110 processes the respective input text prompt 302 of each training example 310 to generate a corresponding contextual embedding (u) 304 of the input text prompt 302. In some implementations, the text encoder 110 is pre-trained and held frozen 111 by the training engine 300 during the joint training of the GNNs 120.0-n). Regarding claim 20, Ho, in view of Rombach teaches the apparatus of claim 16, Ho discloses wherein: the first encoder is trained using a VAE model based on an output of the first encoder that takes the first set of channels as input (Column 11, line 59-Column 12, line 23: the sequence 121 can utilize any of multiple types of generative models for the GNNs 120. Such generative models include, but are not limited to, diffusion-based models, generative adversarial networks (GANs), variational auto-encoders (VAEs), auto-regressive models, energy-based models, Bayesian networks, flow-based models, hierarchal versions of any of these models (e.g., continuous or discrete time), among others. Section II provides a methodology for using generic GNNs to generate output videos from a conditioning input, followed by Section III describing diffusion-based GNNs (DBGNNs)…the goal of the sequence 121 is to generate new instances of high definition videos with a high degree of controllability, i.e., that are strongly conditioned on a conditioning input (e.g., text prompts). As explained in Section I, each GNN 120 in the sequence 121 processes a respective conditioning input c to generate a respective output video {circumflex over (x)}, where the respective input c includes a contextual embedding (u) of a text prompt and/or an output video generated by a preceding GNN in the sequence 121. That being said, the contextual embedding 104 can also be substituted with a different conditioning input such as a noise input, a pre-existing video, an image, an audio waveform, embeddings of any of these, combinations thereof, etc. Although this specification is generally concerned with text-to-video generation, the video generation systems disclosed herein is not limited to such. The video generation systems can be applied to any conditioned video generation problem by changing the conditioning input into the sequence 121. An example of such an implementation is described in Section VII which outlines generating videos from noise). Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW D SALVUCCI whose telephone number is (571)270-5748. The examiner can normally be reached M-F: 7:30-4:00PT. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, XIAO WU can be reached at (571) 272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /MATTHEW SALVUCCI/Primary Examiner, Art Unit 2613
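To make the claim 1 limitation at the heart of the rejection concrete for claim-mapping purposes: the feature embedding carries one set of channels encoding a first value of an image characteristic (e.g., a low spatial resolution) and a second set encoding the residual toward a second value (e.g., a high spatial resolution), which a decoder then turns into the synthetic image. The sketch below illustrates only that channel layout; the array shapes, the nearest-neighbor upsampling, and all function names are illustrative assumptions, not taken from the application or from Ho or Rombach.

import numpy as np

def build_feature_embedding(low_res, high_res):
    # First channel group: the low-resolution value, upsampled to the target grid
    # (nearest-neighbor upsampling is an assumption for illustration only).
    scale = high_res.shape[0] // low_res.shape[0]
    base = np.repeat(np.repeat(low_res, scale, axis=0), scale, axis=1)
    # Second channel group: the residual between the two values of the characteristic.
    residual = high_res - base
    return np.concatenate([base, residual], axis=-1)

def decode(embedding):
    # Toy stand-in for a decoder: recombine the two channel groups into the second value.
    base, residual = np.split(embedding, 2, axis=-1)
    return base + residual

low = np.random.rand(8, 8, 3)     # first value: low spatial resolution
high = np.random.rand(16, 16, 3)  # second value: high spatial resolution
emb = build_feature_embedding(low, high)   # shape (16, 16, 6): two channel groups
assert np.allclose(decode(emb), high)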
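The claim 3 mapping relies on Ho's reverse (ancestral-sampling) process: sample a latent at t = 1, repeatedly form an estimate x̂_θ(z_t, c) and step to s < t, and emit the estimate at t = 0. The loop below is a schematic of that flow only; the placeholder denoiser and the simplified update rule are assumptions standing in for Ho's v-parametrized sampler, not the reference's actual math.

import numpy as np

def reverse_diffusion_sample(denoise_fn, cond, shape, num_steps=50, seed=0):
    # Schematic ancestral-sampling loop: start from Gaussian noise at t = 1 and
    # repeatedly refine the latent toward t = 0 using the model's current estimate.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)                 # z_1 ~ N(0, I)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    x_hat = None
    for t, s in zip(ts[:-1], ts[1:]):
        x_hat = denoise_fn(z, t, cond)             # current estimate x̂_θ(z_t, c)
        noise = rng.standard_normal(shape) if s > 0 else 0.0
        # Simplified placeholder update: move the latent toward the estimate,
        # injecting noise at every step except the last.
        z = s * z + (1.0 - s) * x_hat + np.sqrt(s) * noise
    return x_hat                                   # estimate at t = 0 is the output

# Toy denoiser that simply returns the conditioning signal as its "clean" estimate.
toy_denoiser = lambda z, t, cond: cond
target = np.ones((4, 4))
print(np.abs(reverse_diffusion_sample(toy_denoiser, target, target.shape) - target).max())  # ~0.0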
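For the claims 18-19 discussion, the VAE machinery quoted from Ho (posterior q_ϕ(z|x, c), conditional p_θ(x|z, c), ELBO maximization) fits together through the standard conditional evidence lower bound. The identity below is textbook VAE material restated for reference, not language from the cited reference:

\log p_\theta(x \mid c) \;\ge\; \mathbb{E}_{q_\phi(z \mid x,\, c)}\!\left[\log p_\theta(x \mid z, c)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x, c)\,\Vert\, p(z \mid c)\right),
\qquad q_\phi(z \mid x, c) = \mathcal{N}\!\left(z;\ \mu_\phi(x, c),\ \sigma_\phi^2(x, c)\, I\right),

where p(z | c) is the prior from which sampling resumes after training, θ are the decoder parameters, and ϕ are the encoder parameters, matching the roles described in Column 14 of Ho.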

Prosecution Timeline

Mar 18, 2024
Application Filed
Dec 18, 2025
Non-Final Rejection — §103, §DP
Mar 16, 2026
Examiner Interview Summary
Mar 16, 2026
Applicant Interview (Telephonic)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597198
RAY TRACING METHOD AND APPARATUS BASED ON ATTENTION FOR DYNAMIC SCENES
2y 5m to grant • Granted Apr 07, 2026
Patent 12597207
Camera Reprojection for Faces
2y 5m to grant • Granted Apr 07, 2026
Patent 12579753
Phased Capture Assessment and Feedback for Mobile Dimensioning
2y 5m to grant • Granted Mar 17, 2026
Patent 12561899
Vector Graphic Parsing and Transformation Engine
2y 5m to grant • Granted Feb 24, 2026
Patent 12548256
IMAGE PROCESSING APPARATUS FOR GENERATING SURFACE PROFILE OF THREE-DIMENSIONAL GEOMETRIC MODEL, CONTROL METHOD THEREFOR, AND STORAGE MEDIUM
2y 5m to grant • Granted Feb 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 72%
Grant Probability With Interview: 99% (+28.5%)
Median Time to Grant: 2y 12m
PTA Risk: Low
Based on 485 resolved cases by this examiner. Grant probability derived from career allow rate.
