Prosecution Insights
Last updated: April 19, 2026
Application No. 18/439,157

LEARNING CONTINUOUS CONTROL FOR 3D-AWARE IMAGE GENERATION ON TEXT-TO-IMAGE DIFFUSION MODELS

Status: Final Rejection (§102, §103)
Filed: Feb 12, 2024
Examiner: PROVIDENCE, VINCENT ALEXANDER
Art Unit: 2617
Tech Center: 2600 — Communications
Assignee: Adobe Inc.
OA Round: 2 (Final)

Grant Probability: 83% (Favorable)
Expected OA Rounds: 3-4
Median Time to Grant: 2y 5m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 83% (15 granted / 18 resolved; +21.3% vs TC avg), above average
Interview Lift: +25.0% among resolved cases with interview (strong)
Typical Timeline: 2y 5m average prosecution; 38 applications currently pending
Career History: 56 total applications across all art units

Statute-Specific Performance

§101: 0.9% (-39.1% vs TC avg)
§103: 82.4% (+42.4% vs TC avg)
§102: 14.8% (-25.2% vs TC avg)
§112: 0.9% (-39.1% vs TC avg)

Tech Center averages are estimates. Based on career data from 18 resolved cases.

Office Action

Rejections under §102 and §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

The Amendment filed December 30, 2025 has been entered. Claims 1-20 are pending in the application. Applicant's amendments to Claims 1, 8, 12, and 17 have overcome the rejections previously set forth in the Non-Final Office Action mailed September 30, 2025. A second search has been performed to address the material amended in the aforementioned claims. Burgess (NPL: Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models) was used for the amended claim limitations.

Response to Arguments

The Examiner appreciates the Applicant's thorough review of the previous Non-Final Office Action. Applicant's arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. However, in acknowledgment of Applicant's arguments being moot, the Examiner would like to address an argument made by the Applicant that could also be made against the current rejections. With respect to Gal, the Applicant argues that: "The pseudo-word S* is treated like any other word token and represents a learned concept, not a separate, controllable input." The Examiner respectfully disagrees that the pseudo-word does not represent a separate, controllable input, because Gal demonstrates controlling the impact of the pseudo-word via input images in Figure 13 on Pg. 22 of Gal. In other words, the input images are utilized to determine the meaning of the pseudo-word and therefore control the content of the output image. However, the Examiner agrees that Gal fails to teach the amended limitation "obtaining a text prompt and an attribute value, wherein the text prompt describes an element and the attribute value comprises a numerical value for a continuous attribute of the element."

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 12, 13, 14, and 16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Burgess (NPL: Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models).

Regarding claim 12: Burgess teaches: A method comprising: initializing a machine learning model (see Note 12A); obtaining a training set including a plurality of training images depicting an object with a plurality of values of a continuous attribute, respectively (Burgess: Left: we have a small multi-view dataset, DMV containing images, xi, and known camera poses, Ri, Pg. 3, Figure 3; see Note 12B); training, using the training set, an image generation model (Burgess: We do diffusion model training on this dataset while optimizing only Mv and Mo, Pg. 3, Figure 3; see Note 12C) to generate synthetic images with the plurality of values of the continuous attribute (Burgess: Examples of novel view synthesis using ViewNeTI where the input camera parameters are in spherical coordinate system. We do single-scene NVS using an input dataset of nine multiview images of a ShapeNet car with random forward-facing poses, Pg. 21, Figure 20; see Note 12C); and training, using the training set, a continuous control model to generate text embeddings corresponding to the continuous attribute (Burgess: We train a small neural mapper to take camera viewpoint parameters and predict text encoder latents, Abstract; Burgess: Right: textual inversion training of our neural mappers, Mv and Mo. The mappers predict the word embeddings for SRi and So respectively. Pg. 3, Figure 3) as an input for the image generation model (Burgess: The ViewNeTI mappers control the text encoding for the diffusion model, Pg. 4, Section 4.1: ViewNeTI mapper architecture and inference).

Note 12A: Burgess discusses pre-training both the mappers (Pg. 6, Section 5.2.1: View-mapper pretraining) and the diffusion model (Pg. 3, col. 2, par. 1), and therefore initializes at least one machine learning model. The Examiner notes that the "machine learning model" does not appear to be referenced elsewhere in claim 12 or in any of the dependent claims of claim 12.

Note 12B: The Examiner interprets the known camera poses Ri as the plurality of values of a continuous attribute, because the specification of the present application teaches that: "The continuous attribute, such [as] orientation of an object or an apparent camera view of the scene can be difficult to describe precisely using text" [0068] (emphasis added). As best understood by the Examiner, camera poses describe an apparent camera view of the scene.

Note 12C: The Examiner understands Burgess to teach "training an image generation model using the training set to generate synthetic images with the plurality of values of the continuous attribute", because Burgess teaches that a diffusion model, specifically Stable Diffusion, will generate images (Burgess: Viewpoint Neural Textual Inversion (ViewNeTI) controls the viewpoint of objects in images generated by diffusion models, Pg. 4, Section 4: Method) after having been trained on the dataset discussed in Note 12B (see Pg. 3, Figure 3 cited above). In Note 12B, it was also discussed that the continuous attribute, the known camera poses Ri, was utilized as part of the training dataset. The images generated by the diffusion model in Burgess are generated with the plurality of values of the continuous attribute, because on Pg. 21, Burgess showcases interpolated images generated from multiple training images, where each training image has an apparent camera view. Therefore, when Burgess "do[es] single-scene NVS using an input dataset of nine multiview images of a ShapeNet car", Burgess is generating synthetic images while utilizing the camera poses from each training image, or in other words, the plurality of values of the continuous attribute from each training image.

Regarding claim 13: Burgess teaches: The method of claim 12 (as shown above), wherein obtaining the training set comprises: rendering the plurality of training images based on a 3D model of the object (Burgess: Novel view synthesis trained on a single scene, Pg. 6, Fig. 4; see Note 13A).

Note 13A: Burgess showcases in Fig. 4 that a 3D model of the object may be utilized to capture the training set or "ground truth".
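For orientation, the following is a minimal sketch of the kind of "continuous control model" the claim 12 rejection reads onto Burgess's neural mapper: a small MLP that takes continuous camera parameters and predicts a vector in the text-embedding space. The class name, dimensions, and parameterization are illustrative assumptions, not code from Burgess or the application.

```python
import torch
import torch.nn as nn

class ViewMapper(nn.Module):
    """Illustrative 2-layer MLP mapping continuous camera parameters
    (e.g., polar and azimuth angles) to a text-embedding-space vector,
    in the spirit of Burgess's view-mapper Mv. All sizes are assumptions."""

    def __init__(self, n_params: int = 2, hidden: int = 64, embed_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_params, hidden),
            nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, camera_params: torch.Tensor) -> torch.Tensor:
        # camera_params: (batch, n_params), e.g., (theta, phi) per image
        return self.net(camera_params)

# A continuous attribute value (theta, phi) becomes a pseudo-word embedding:
mapper = ViewMapper()
v_r = mapper(torch.tensor([[0.8, 1.6]]))  # shape (1, 768)
```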
Regarding claim 14: Burgess teaches: The method of claim 12 (as shown above), wherein obtaining the training set comprises: generating, using a training image generation model, a training image based on a 3D model of the object (Burgess: To encourage the image to be close to the diffusion model training distribution, we generated an augmented training set by outfilling the background around the car, similar to in Appendix A, Pg. 14, Section G: Validation of spherical coordinate parameterization, par. 1; see Note 14A).

Note 14A: Burgess teaches that the training image set may be generated, and uses Stable Diffusion to outfill training set images: "In Fig. 12 we ask a diffusion model to do outfilling around a real car." (Pg. 13, Section A: Evidence for 3D capabilities in diffusion models with image outfilling, par. 1)

Regarding claim 16: Burgess teaches: The method of claim 12 (as shown above), wherein training the image generation model comprises: computing a reconstruction loss (Burgess: We optimize the weights of Mv and Mo using the loss in Eq. (1), Pg. 5, Section 4.2: Single-Scene Novel View Synthesis) based on the training set (Burgess: SD's are Latent Diffusion Models (LDMs) for image generation [41] and are typically trained on web-scale datasets of text-image pairs (x,y)∼D, Pg. 3, Section 3.1: Text-to-image latent diffusion models; see Note 16A); and updating parameters of the image generation model (see Note 16B) and parameters of the continuous control model based on the reconstruction loss (Burgess: We optimize the weights of Mv and Mo using the loss in Eq. (1), Pg. 5, Section 4.2: Single-Scene Novel View Synthesis).

Note 16A: The dataset of text-image pairs to be trained on, "(x, y) ~ D", appears in the loss function, Equation 1 on Pg. 4 of Burgess.

Note 16B: The specification of the current application teaches: "The training component then updates parameters of the diffusion model 900 based on the comparison. For example, parameters of a U-Net may be updated using gradient descent. Time-dependent parameters of the Gaussian transitions can also be learned." Burgess teaches: "Both neural mappers, (Mv,Mo) are conditioned on the denoising timestep and the diffusion model UNet layer, (t,ℓ)." In other words, Burgess teaches a UNet for the diffusion model as well as a time-dependent timestep parameter. As best understood by the Examiner, Burgess also teaches that the UNet parameters may be updated: "The text is passed through a pretrained CLIP [36] text encoder, giving a d-dim conditioning vector for each token, c(y) ∈ Rd×77, which is mixed with each Unet layer via cross-attention" (Pg. 4, col. 1, par. 2, emphasis added). Therefore, the Examiner understands Burgess to teach updating parameters of the image generation model.
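To make the claim 16 mapping concrete, here is a minimal sketch of a textual-inversion-style training step with a denoising (reconstruction) loss, loosely following the Eq. (1) objective the rejection cites. The `unet` call signature, the simplified noising step, and all names are assumptions for illustration, not the actual ViewNeTI code.

```python
import torch
import torch.nn.functional as F

def training_step(unet, mapper, latents, prompt_embeds, camera_params, t):
    """One illustrative denoising step: compute a reconstruction loss on a
    training example and backpropagate to whatever parameters require grad
    (the mapper alone, or the mapper and the UNet jointly)."""
    noise = torch.randn_like(latents)
    noisy = latents + noise  # stand-in for the true forward process q(z_t | z_0)

    # Inject the mapper's predicted pseudo-word embedding into the conditioning.
    v_r = mapper(camera_params)                       # (batch, embed_dim)
    cond = torch.cat([v_r.unsqueeze(1), prompt_embeds], dim=1)

    eps_pred = unet(noisy, t, cond)                   # predicted noise (assumed API)
    loss = F.mse_loss(eps_pred, noise)                # reconstruction loss
    loss.backward()
    return loss
```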
Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 3, 4, 5, 6, 7, 8, 10, 17, 18, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Gal (NPL: An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion) in view of Burgess (NPL: Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models).

Regarding claim 1: Gal teaches: A method comprising: obtaining a text prompt describing an element and an attribute value for a continuous attribute of the element (Gal: "A painting of S∗ sitting next to a serene pond", Figure 11, Pg. 20; see Note 1A); embedding the text prompt to obtain a text embedding in a text embedding space (Gal: Typical text encoder models, […] begin with a text processing step (Figure 2, left). First, each word or sub-word in an input string is converted to a token […] Each token is then linked to a unique embedding vector […] In our work, we choose this embedding space as the target for inversion, Pg. 5, Text Embeddings, par. 1-2); embedding, using a model, the attribute value to obtain an attribute embedding in the text embedding space (Gal: we designate a placeholder string, S∗, to represent the new concept we wish to learn. We intervene in the embedding process and replace the vector associated with the tokenized string with a new, learned embedding v∗, in essence "injecting" the concept into our vocabulary. In doing so, we can then compose new sentences containing the concept, just as we would with any other word, Pg. 5, Text embeddings, par. 2; see Note 1B); and generating, using an image generation model, a synthetic image based on a text embedding of a text prompt (Gal: each word or sub-word in an input string is converted to a token, which is an index in some pre-defined dictionary. Each token is then linked to a unique embedding vector that can be retrieved through an index-based lookup, Pg. 5, Text embeddings, par. 1; see Note 1C) and the attribute embedding (Gal: We represent a new embedding vector with a new pseudo-word (Rathvon, 2004) which we denote by S∗. This pseudo-word is then treated like any other word, and can be used to compose novel textual queries for the generative models, Pg. 2, Section 1: Introduction, par. 5), wherein the synthetic image depicts the continuous attribute based on the attribute value (see Note 1D).

Note 1A: The specification of the present application recites: "In one aspect, the continuous attribute input includes a 3-dimensional characteristic of the element such as orientation, illumination direction, nonrigid shape transformation, object pose, zoom effect, etc." [0026]. In Note 1B below, it is shown that the pseudo-word "S*" taught by Gal is analogous to a continuous attribute. Gal teaches in Figure 11 a prompt of the form: "A painting of S* sitting next to a serene pond". The prompt of Gal describes the orientation of an element (in this case, the element is the statue seen in the input samples, and the object is oriented "next to a serene pond").

Note 1B: Gal teaches: "It is natural to search for candidates for such a representation in the word-embedding stage of the text encoders typically employed by text-to-image models. There, the discrete input text is first converted into a continuous vector representation that is amenable to direct optimization," (Pg. 4, Section 3: Method, par. 2). That is, vectors corresponding to the input text prompt may be considered "continuous". Gal further teaches in Pg. 5, Text embeddings, par. 2 as cited above that the vector may be replaced with a learned embedding v*. Where the learned embedding is analogous to an attribute embedding and the vector associated with the pseudo-word is analogous to an attribute value of a continuous attribute, Gal teaches that a model is trained to embed an attribute value of a continuous attribute to obtain an attribute embedding. Gal further teaches that v* is essentially injected into the "vocabulary" of the model (cited above) and therefore, it is reasonable to conclude that v* resides in the text embedding space.

Note 1C: Gal teaches that each word in a string may be used to generate a text embedding. In Figure 2 on Pg. 4, Gal further shows that "the embedding vectors are transformed into a single conditioning code cθ(y) which guides the generative model."

Note 1D: Gal showcases a plurality of examples where the text-to-image model generates a synthetic image on Pg. 22-25. For example, on Pg. 22, Gal showcases a synthetic image labelled "The Scream in the style of S*" based on the attribute embedding obtained from the "pseudo-word" S*. Given that the attribute embedding was generated based on a continuous vector (as shown in Note 1B), it is reasonable to conclude that Gal teaches generating synthetic images based on a continuous attribute based on the attribute value.

Gal fails to explicitly teach: obtaining a text prompt and an attribute value, wherein the text prompt describes an element and the attribute value comprises a numerical value for a continuous attribute of the element; embedding, using a continuous control model, the attribute value to obtain an attribute embedding in the text embedding space, wherein the continuous control model is trained to generate text embeddings for a plurality of values of the continuous attribute.

Burgess teaches: A method comprising: obtaining a text prompt (Burgess: We create a caption for each image. Pg. 3, Figure 3) and an attribute value (Burgess: The caption contains a camera-specific token SRi that is different for each view, Pg. 3, Figure 3), wherein the text prompt describes an element (Burgess: The caption contains […] an object token, So, Pg. 3, Figure 3) and the attribute value comprises a numerical value for a continuous attribute of the element (Burgess: … in Fig. 20, we show NVS results where the camera is parameterized by spherical coordinates. We assume a central object is fixed at the origin, and that the camera is at a fixed radius with variable polar and azimuth angles, (θ,φ), Pg. 14, Section G: Validation of spherical coordinate parameterization, par. 1; see also Note 1E); embedding the text prompt to obtain a text embedding in a text embedding space (Burgess: The prompt is passed through the CLIP text encoder, Pg. 3, Figure 3); embedding, using a continuous control model (Burgess: We parameterize Mv as a 2-layer MLP with 64 dimensions, Pg. 5, par. 1), the attribute value to obtain an attribute embedding in the text embedding space (Burgess: Right: textual inversion training of our neural mappers, Mv and Mo. The mappers predict the word embeddings for SRi and So respectively. Pg. 3, Figure 3), wherein the continuous control model is trained to generate text embeddings (Burgess: We optimize the weights of Mv and Mo using the loss in Eq. (1), except we replace D with DMV′. Intuitively, we are learning text space latents, Pg. 5, Section 4.2: Single-Scene Novel View Synthesis) for a plurality of values of the continuous attribute (Burgess: We propose an architecture for predicting text embeddings that control camera viewpoint, Pg. 2, Section 2: Related work, par. 2; see Note 1F); and generating, using an image generation model (Burgess: It produces views with photorealistic details for real-world objects that are in the massive 2D training distribution of 2D diffusion models like Stable Diffusion, Pg. 2, col. 2, par. 1), a synthetic image based on the text embedding and the attribute embedding, wherein the synthetic image depicts the continuous attribute of the element based on the attribute value (Burgess: Right: our viewpoint control mechanism enables leveraging pretrained 2D diffusion models for novel view synthesis from only a single view; we can learn from more views if available, Pg. 1, Figure 1; Burgess: ViewNeTI controls viewpoint in text-to-image generation by composing the view-mapper text encoder, represented by Ri, with novel text prompts, Pg. 8, Figure 6).

Note 1E: The specification of the present application teaches that: "The continuous attribute, such [as] […] an apparent camera view of the scene can be difficult to describe precisely using text. For example, it can include one or more numerical parameters such as distance and angle (e.g., the distance between an object and the viewpoint, or angles describing the relationship between an object and a light source). […] The attribute can be described in terms of one or more continuous variables such as 3D position coordinates, Euler angles, or orientation angles such as yaw, pitch and roll." Burgess teaches: "in Fig. 20, we show NVS results where the camera is parameterized by spherical coordinates. We assume a central object is fixed at the origin, and that the camera is at a fixed radius with variable polar and azimuth angles, (θ,φ);" (Pg. 14, Section G: Validation of spherical coordinate parameterization, par. 1). In other words, Burgess teaches a camera view of the scene parameterized by angles.

Note 1F: The specification of the current application states: "the continuous control model includes a multilayer perceptron (MLP), where the MLP is able to receive a continuous input (e.g., the attribute value) and generate a continuous output (e.g., the attribute embedding)" [0075]. Burgess teaches: "We pass the text encoder a prompt like, 'SR. A photo of a So', where SR has word embedding vR and So has word embedding vo. The embedding for SR controls the viewpoint for the image that is be[ing] generated." Furthermore, Burgess teaches that "the neural mappers [] learn a predictor of word-embedding space that is sufficiently sensitive to small changes in camera parameters; that is, it can represent high frequency changes in word embedding space." Therefore, the Examiner understands that the encoder taught by Burgess accepts a continuous input (attribute value) and outputs a continuous output (attribute embedding). Burgess also teaches: "Following the base architecture of [1], the encoding is passed through an MLP with two blocks". (Pg. 15, Section I: ViewNeTI implementation details) Therefore, the encoder parsing the prompts is an MLP, similar to what is described in the specification at paragraph [0075]. It follows that Burgess teaches the continuous control model of claim 1.

Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of Burgess with Gal. Obtaining a text prompt and an attribute value, wherein the text prompt describes an element and the attribute value comprises a numerical value for a continuous attribute of the element; and embedding, using a continuous control model, the attribute value to obtain an attribute embedding in the text embedding space, wherein the continuous control model is trained to generate text embeddings for a plurality of values of the continuous attribute, as in Burgess, would benefit the Gal teachings by enabling controlled modification of the subject and camera angle in a generated image: "The pretrained ViewNeTI mapper generalizes to novel scenes, enabling synthesis of novel views far from the input views with little data; it can even do [novel view synthesis (NVS)] from a single image. Compared to existing single-image NVS methods, ViewNeTI has several advantages, especially in single-view NVS. It produces views with photorealistic details for real-world objects that are in the massive 2D training distribution of 2D diffusion models like Stable Diffusion. Once trained, it can generate diverse predictions under uncertainty in close to real time" (Burgess, Pg. 2, col. 2, par. 1).
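The pseudo-word mechanism at the center of this combination is easy to state in code. Below is a minimal sketch, assuming hypothetical tensor shapes, of Gal-style embedding injection: the placeholder token's row in the embedding sequence is overwritten with the learned vector v* (or, in the Burgess combination, with a mapper's predicted embedding) before the text encoder produces the conditioning code.

```python
import torch

def inject_pseudo_word(token_embeds, token_ids, placeholder_id, v_star):
    """Replace the placeholder token's embedding with the learned vector v*,
    leaving every other word embedding untouched (textual inversion)."""
    out = token_embeds.clone()          # (batch, seq_len, embed_dim)
    out[token_ids == placeholder_id] = v_star
    return out

# e.g., for "A painting of S* sitting next to a serene pond": embed the tokens
# normally, swap in v* at the S* position, then run the transformer layers of
# the text encoder to obtain the conditioning code c(y) for the generator.
```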
Regarding claim 2: Gal in view of Burgess teaches: The method of claim 1 (as shown above), wherein: the continuous attribute comprises a 3-dimensional characteristic of the element (Burgess: the per-scene object mappers learned the pose of objects within that scene, Pg. 8, Section 7: Conclusions, par. 2).

Note 2A: The specification of the present application recites: "In one aspect, the text prompt and the continuous attribute input are combined and input into an image generation model to generate the synthetic image. In one aspect, the continuous attribute input includes a 3-dimensional characteristic of the element such as orientation, illumination direction, nonrigid shape transformation, object pose, zoom effect, etc." [0026]. As cited above, Burgess teaches that the mapper learns the object pose. As previously discussed in the rejection of claim 1, Burgess teaches "mappers" that "predict the word embeddings for SRi and So respectively", where SRi and So are a camera-specific token and an object token. In the rejection of claim 1, the camera-specific token was analogized to the claimed "continuous attribute". However, because the object token is also utilized to assist the Stable Diffusion model in generating an output image, the Examiner understands that the continuous attribute may include the object pose, or at least that one of ordinary skill in the art would include the object pose within the continuous attribute.

Regarding claim 3: Gal in view of Burgess teaches: The method of claim 1 (as shown above), wherein embedding the text prompt comprises: dividing the text prompt into a plurality of tokens (Gal: Here, an input string is first converted to a set of tokens, Pg. 2, Section 1: Introduction, par. 4); and embedding each of the plurality of tokens using a text embedding model (Gal: Each token is then replaced with its own embedding vector, and these vectors are fed through the downstream model, Pg. 2, Section 1: Introduction, par. 4).

Regarding claim 4: Gal in view of Burgess teaches: The method of claim 1 (as shown above), wherein: the text prompt includes a nonce token corresponding to the attribute value.

Note 4A: The specification of the present application teaches: "In some cases, the text prompt includes a nonce token that corresponds to the attribute. For example, the text prompt states 'A <V*> photo of a horse,' where <V*> represents the nonce token" [0041]. Similarly, Burgess teaches: "We pass the text encoder a prompt like, 'SR. A photo of a So', where SR has word embedding vR and So has word embedding vo." (Pg. 5, par. 2) Therefore, as best understood by the Examiner, SR is a nonce token. Additionally, SR was previously discussed to correspond to the continuous attribute in the rejection of claim 1.
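As an aside on mechanics, registering a nonce token like <V*> with a real tokenizer is straightforward. The sketch below uses the Hugging Face transformers API with a public CLIP checkpoint; the token string and the surrounding workflow are illustrative assumptions, not part of the record.

```python
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokenizer.add_tokens("<V*>")                          # register the nonce token
text_encoder.resize_token_embeddings(len(tokenizer))  # give it an embedding slot
nonce_id = tokenizer.convert_tokens_to_ids("<V*>")

ids = tokenizer("A <V*> photo of a horse", return_tensors="pt").input_ids
# The embedding-table row at `nonce_id` is the slot a continuous control
# model would overwrite with its predicted attribute embedding.
```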
Regarding claim 5: Gal in view of Burgess teaches: The method of claim 1 (as shown above), wherein: the text prompt includes a word corresponding to the continuous attribute (Burgess: The caption contains a camera-specific token SRi that is different for each view, and an object token, So, which is common across views, Pg. 3, Figure 3; see Note 5A).

Note 5A: Burgess teaches that SR and So have word embeddings: "We pass the text encoder a prompt like, 'SR. A photo of a So', where SR has word embedding vR and So has word embedding vo." (Pg. 5, par. 2). Therefore, the Examiner understands SR and So to be analogous to words. Additionally, SR was previously discussed to correspond to the continuous attribute in the rejection of claim 1.

Regarding claim 6: Gal in view of Burgess teaches: The method of claim 1 (as shown above), further comprising: encoding the text embedding and the attribute embedding to obtain guidance information for the image generation model (Gal: the embedding vectors are transformed into a single conditioning code cθ(y) which guides the generative model, Pg. 4, Figure 2; see Note 18A), wherein the synthetic image is generated based on the guidance information (Gal: Figure 2; see Note 6B).

Note 6B: Figure 2 of Gal depicts the conditioning code (cθ(y)) being sent from the Text Encoder to the Generator, which produces the synthetic image (specifically, the output synthetic image is depicted in the figure to the left of the other clock image labelled "Input sample").

Regarding claim 7: Gal in view of Burgess teaches: The method of claim 1 (as shown above), wherein generating the synthetic image comprises: performing a diffusion process on a noise input to obtain the synthetic image (Gal: At inference time, a random noise tensor is sampled and iteratively denoised to produce a new image latent, z0. Finally, this latent code is transformed into an image through the pre-trained decoder x0 = D(z0), Pg. 5, Latent Diffusion Models, par. 3).

Regarding claim 8: Gal in view of Burgess teaches: The method of claim 1 (as shown above), wherein: the image generation model is trained using a training set (Burgess: We do diffusion model training on this dataset while optimizing only Mv and Mo (this is textual inversion training [1,13]), Pg. 3, Figure 3) including the plurality of training images depicting an object (Burgess: we have a small multi-view dataset, DMV containing images, xi, Pg. 3, Figure 3) with a plurality of values of the continuous attribute, respectively (Burgess: Examples of novel view synthesis using ViewNeTI where the input camera parameters are in spherical coordinate system. We do single-scene NVS using an input dataset of nine multiview images of a ShapeNet car with random forward-facing poses, Pg. 21, Figure 20).
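The claim 7 limitation, iterative denoising from a random noise tensor followed by decoding, corresponds to a standard latent-diffusion sampling loop. Here is a minimal sketch using the diffusers library's scheduler API; the model instances and latent resolution are assumptions.

```python
import torch
from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel

@torch.no_grad()
def sample_from_noise(unet: UNet2DConditionModel, vae: AutoencoderKL,
                      cond: torch.Tensor, steps: int = 50):
    """Sample a random noise latent, iteratively denoise it under the text
    conditioning, then decode the final latent (cf. Gal's x0 = D(z0))."""
    scheduler = DDIMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(steps)
    z = torch.randn(1, unet.config.in_channels, 64, 64)      # random noise input
    for t in scheduler.timesteps:
        eps = unet(z, t, encoder_hidden_states=cond).sample  # predict noise
        z = scheduler.step(eps, t, z).prev_sample            # denoising update
    return vae.decode(z / vae.config.scaling_factor).sample  # latent -> image
```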
Regarding claim 10: Gal in view of Burgess teaches: The method of claim 1 (as shown above), further comprising: obtaining an additional attribute value corresponding to an additional continuous attribute, wherein the synthetic image is generated to depict the additional attribute value (Gal: Let {xi}ni=1 be the set of input images. Rather than optimizing a single word vector shared across all images, we introduce both a universal placeholder, S∗, and an additional placeholder unique to each image, {Si}ni=1, associated with a unique embedding vi. We then compose sentences of the form "A photo of S∗ with Si", where every image is matched to sentences containing its own, unique string, Pg. 12, Section 5.2: Evaluation steps, par. 4 (emphasis added); see Note 10A).

Note 10A: Gal teaches that instead of using just one attribute value S*, an indefinite number of attribute values may be added to a prompt for image generation.

Regarding claim 17: Gal teaches: An apparatus comprising: at least one processor (Gal: Our experiments were conducted using 2×V100 GPUs with a batch size of 4, Pg. 5, Implementation details, par. 1); at least one memory storing instructions executable by the at least one processor (see Note 17A); a model trained to embed an attribute value of a continuous attribute to obtain an attribute embedding in a text embedding space (Gal: we designate a placeholder string, S∗, to represent the new concept we wish to learn. We intervene in the embedding process and replace the vector associated with the tokenized string with a new, learned embedding v∗, in essence "injecting" the concept into our vocabulary. In doing so, we can then compose new sentences containing the concept, just as we would with any other word, Pg. 5, Text embeddings, par. 2; see Note 1B); and an image generation model (Gal: text-to-image model, Pg. 2, Section 1: Introduction, par. 3) comprising parameters (Gal: We employ the publicly available 1.4 billion parameter text-to-image model of Rombach et al, Pg. 5, Latent Diffusion Models, par. 5) stored in the at least one memory (see Note 17A) and trained to generate a synthetic image based on a text embedding of a text prompt (Gal: each word or sub-word in an input string is converted to a token, which is an index in some pre-defined dictionary. Each token is then linked to a unique embedding vector that can be retrieved through an index-based lookup, Pg. 5, Text embeddings, par. 1; see Note 1C) and the attribute embedding (Gal: We represent a new embedding vector with a new pseudo-word (Rathvon, 2004) which we denote by S∗. This pseudo-word is then treated like any other word, and can be used to compose novel textual queries for the generative models, Pg. 2, Section 1: Introduction, par. 5), wherein the synthetic image depicts the continuous attribute based on the attribute value (see Note 1D).

Note 17A: Gal teaches executing the method on a graphical processing unit (GPU) (Gal: Our experiments were conducted using 2×V100 GPUs with a batch size of 4, Pg. 5, Implementation details, par. 1), which inherently comprises a memory and processor.

Gal fails to teach: a continuous control model comprising parameters stored in the at least one memory and trained to embed an attribute value comprising a numerical value for a continuous attribute to obtain an attribute embedding in a text embedding space, and wherein the continuous control model is trained to generate text embeddings for a plurality of values of the continuous attribute.

Burgess teaches: An apparatus comprising: at least one processor (Burgess: the training time was modest at only 2 days on one GPU, Pg. 9, par. 1); at least one memory storing instructions executable by the at least one processor (see Note 17B); a continuous control model comprising parameters stored in the at least one memory and trained to embed an attribute value (Burgess: Right: textual inversion training of our neural mappers, Mv and Mo. The mappers predict the word embeddings for SRi and So respectively. Pg. 3, Figure 3) comprising a numerical value (Burgess: … in Fig. 20, we show NVS results where the camera is parameterized by spherical coordinates. We assume a central object is fixed at the origin, and that the camera is at a fixed radius with variable polar and azimuth angles, (θ,φ), Pg. 14, Section G: Validation of spherical coordinate parameterization, par. 1; see also Note 1E) for a continuous attribute to obtain an attribute embedding in a text embedding space, and wherein the continuous control model is trained to generate text embeddings for a plurality of values of the continuous attribute (Burgess: We propose an architecture for predicting text embeddings that control camera viewpoint, Pg. 2, Section 2: Related work, par. 2; see Note 1F); and an image generation model comprising parameters stored in the at least one memory and trained to generate a synthetic image (Burgess: We then sample from a frozen Stable Diffusion model conditioned on this text latent to produce an image of the So object from camera view Ri, Pg. 1, Figure 1) based on a text embedding of a text prompt and the attribute embedding (Burgess: Neural Textual Inversion (ViewNeTI), takes camera viewpoint parameters, Ri, and a scene-specific token, So, to predict a latent in the CLIP output text space, Pg. 1, Figure 1), wherein the synthetic image depicts the continuous attribute based on the attribute value (Burgess: ViewNeTI controls viewpoint in text-to-image generation by composing the view-mapper text encoder, represented by Ri, with novel text prompts, Pg. 8, Figure 6).

Note 17B: Training via a GPU necessarily requires a form of memory to store the training dataset and instructions on how to train the model.

Before the effective filing date of the present application, it would have been obvious to one of ordinary skill in the art to combine the teachings of Burgess with Gal. Utilizing a continuous control model comprising parameters stored in the at least one memory and trained to embed an attribute value comprising a numerical value for a continuous attribute to obtain an attribute embedding in a text embedding space, and wherein the continuous control model is trained to generate text embeddings for a plurality of values of the continuous attribute, as in Burgess, would benefit the Gal teachings by enabling controlled modification of the subject and camera angle in a generated image: "The pretrained ViewNeTI mapper generalizes to novel scenes, enabling synthesis of novel views far from the input views with little data; it can even do [novel view synthesis (NVS)] from a single image. Compared to existing single-image NVS methods, ViewNeTI has several advantages, especially in single-view NVS. It produces views with photorealistic details for real-world objects that are in the massive 2D training distribution of 2D diffusion models like Stable Diffusion. Once trained, it can generate diverse predictions under uncertainty in close to real time" (Burgess, Pg. 2, col. 2, par. 1).

Regarding claim 18: Gal in view of Burgess teaches: The apparatus of claim 17 (as shown above), further comprising: a text encoder comprising parameters (Gal: Here, cθ is realized through a BERT (Devlin et al., 2018) text encoder, with y being a text prompt, Pg. 5, Latent Diffusion Models, par. 5) stored in the at least one memory (see Note 17A) and configured to encode the text embedding and the attribute embedding to obtain guidance information for the image generation model (Gal: the embedding vectors are transformed into a single conditioning code cθ(y) which guides the generative model, Pg. 4, Figure 2; see Note 18A).

Note 18A: Gal teaches that the embedding vectors generated from the tokens may be used to generate a "conditioning code" cθ that guides the generative model, and therefore, it is reasonable to call the conditioning code "guidance information."

Regarding claim 19: Gal in view of Burgess teaches: The apparatus of claim 17 (as shown above), wherein: the continuous control model comprises a multilayer perceptron (MLP) (Burgess: We parameterize Mv as a 2-layer MLP with 64 dimensions, Pg. 5, par. 1).

Regarding claim 20: Gal in view of Burgess teaches: The apparatus of claim 17 (as shown above), wherein: the image generation model comprises a diffusion model (Gal: we outline the core details of applying our approach to a specific class of generative models — Latent Diffusion Models, Pg. 4, Section 3: Method, par. 4; Burgess: We then sample from a frozen Stable Diffusion model conditioned on this text latent to produce an image of the So object from camera view Ri, Pg. 1, Figure 1).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Gal (NPL: An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion) in view of Burgess (NPL: Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models) and Andrew (NPL: How to use negative prompts?).

Regarding claim 9: Gal in view of Burgess teaches: The method of claim 8 (as shown above), further comprising:

Gal in view of Burgess fails to teach: identifying a negative prompt based on the object from the plurality of training images, wherein the synthetic image is generated based on the negative prompt.

Andrew teaches: identifying a negative prompt based on the object from the plurality of training images, wherein the synthetic image is generated based on the negative prompt (Andrew: "You want to generate another one but with an empty street. You can use the same seed value, which specifies the image, and add the negative prompt 'people'. You get an image with most people removed", Pg. 4-5, par. 1; see Note 9A).

Note 9A: Andrew teaches that a negative prompt may be used to remove objects from an already generated synthetic image. Specifically, Andrew identifies the negative prompt "people" by inspecting the input image on Pg. 4, and then generates the synthetic image on Pg. 5, and because of the negative prompt, the people in the image are removed.
Therefore, when the method of Andrew is applied to the teachings of Gal in view of Burgess, it would have been obvious to one of ordinary skill in the art to remove the main element from, for example, some of the input sample images in pages 22-25 of Gal using a negative prompt. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Andrew with Gal in view of Burgess. Identifying a negative prompt based on the object from the plurality of training images, wherein the synthetic image is generated based on the negative prompt, as in Andrew, would benefit the Gal in view of Burgess teachings by enabling removal of elements in the image.

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Gal (NPL: An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion) in view of Burgess (NPL: Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models) and AI Prompt Directory (NPL: How to Use Seeds for Better Control in Stable Diffusion).

Regarding claim 11: Gal in view of Burgess teaches: The method of claim 1 (as shown above), further comprising: obtaining a plurality of attribute values for the continuous attribute (Burgess: Novel view synthesis trained on a single scene, […] inference views that are 'interpolated' from the input (green), and inference views that are 'extrapolated' (red). Pg. 6, Figure 4); and generating, using the image generation model, a plurality of synthetic images based on the plurality of attribute values, respectively (Burgess: Examples of novel view synthesis using ViewNeTI where the input camera parameters are in spherical coordinate system. We do single-scene NVS using an input dataset of nine multiview images of a ShapeNet car with random forward-facing poses, Pg. 21, Figure 20; see Note 11A).

Note 11A: Burgess showcases that multiple synthetic images may be generated with various camera angles. In the rejection of claim 1, the camera angle was analogized to the continuous attribute.

Gal in view of Burgess fails to teach: generating, using the image generation model, a plurality of synthetic images based on a same random input.

AI Prompt Directory teaches: generating, using the image generation model (AI Prompt Directory: Stable Diffusion, Pg. 1, par. 1), a plurality of synthetic images based on a same random input (AI Prompt Directory: Seeds are numbers that control the randomness. By fixing the seed value, you can reliably reproduce an image you've previously generated, Pg. 2, par. 1).

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of AI Prompt Directory with Gal in view of Burgess. Generating, using the image generation model, a plurality of synthetic images based on a same random input, as in AI Prompt Directory, would benefit the Gal in view of Burgess teachings by enabling regeneration of images that were generated randomly.
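Both combinations reduce to two generation-time controls on a Stable Diffusion pipeline: a negative prompt (claim 9) and a fixed random seed (claim 11). A minimal usage sketch with the diffusers library, assuming a public checkpoint and a CUDA device:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

# A fixed seed gives the same random input across runs (claim 11), so prompt
# edits are comparable; the negative prompt removes an unwanted object (claim 9).
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(
    "a busy city street at dusk",
    negative_prompt="people",
    generator=generator,
).images[0]
image.save("street_no_people.png")
```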
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Burgess (NPL: Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models) and Kowalski (US 20210335029 A1).

Regarding claim 15: Burgess teaches: The method of claim 12 (as shown above), wherein:

Burgess fails to explicitly teach: the image generation model is trained individually in a first stage, and the image generation model is trained together with the continuous control model in a second stage.

Kowalski teaches: the image generation model is trained individually in a first stage (Kowalski: a first stage 900 involves omitting the real data encoder 904 and randomly generating 906 embeddings of real images. During the first stage the synthetic data encoder and the decoder are trained using backpropagation 908 and using synthetic images, [0079]; see Note 15A), and the image generation model is trained together with the continuous control model in a second stage (Kowalski: In the second stage 902 the real data encoder is included 910, [0079]).

Note 15A: Omitting the real data encoder may be considered training the "generative model" (Kowalski, [0035]) individually. When the real data encoder is included with the autoencoder, this may be considered training together.

Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Kowalski with Burgess. Training the image generation model individually in a first stage, and training the image generation model together with the continuous control model in a second stage, as in Kowalski, would benefit the Burgess teachings because "experiments show that this two-stage training improves controllability and image quality," (Kowalski, [0086]).
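For reference, the two-stage schedule the Kowalski combination describes can be sketched as follows. Everything here, the `loss` method, the optimizer factory, and the epoch split, is a hypothetical scaffold to show the training-stage structure, not code from Kowalski or Burgess.

```python
def two_stage_training(image_model, control_model, data, make_optimizer,
                       epochs=(5, 5)):
    """Stage 1 trains the image generation model alone; stage 2 trains it
    jointly with the continuous control model (illustrative structure only)."""
    # Stage 1: image generation model only.
    opt = make_optimizer(image_model.parameters())
    for _ in range(epochs[0]):
        for batch in data:
            loss = image_model.loss(batch)            # hypothetical loss API
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: joint training with the continuous control model.
    params = list(image_model.parameters()) + list(control_model.parameters())
    opt = make_optimizer(params)
    for _ in range(epochs[1]):
        for batch in data:
            embedding = control_model(batch["attribute_values"])
            loss = image_model.loss(batch, extra_cond=embedding)
            opt.zero_grad()
            loss.backward()
            opt.step()
```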
Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT ALEXANDER PROVIDENCE whose telephone number is (571) 270-5765. The examiner can normally be reached Monday-Thursday 8:30-5:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, King Poon, can be reached at (571) 270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/VINCENT ALEXANDER PROVIDENCE/
Examiner, Art Unit 2617

/KING Y POON/
Supervisory Patent Examiner, Art Unit 2617

Prosecution Timeline

Feb 12, 2024
Application Filed
Sep 23, 2025
Non-Final Rejection — §102, §103
Dec 22, 2025
Applicant Interview (Telephonic)
Dec 22, 2025
Examiner Interview Summary
Dec 30, 2025
Response Filed
Mar 16, 2026
Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586303: GEOMETRY-AWARE THREE-DIMENSIONAL SYNTHESIS IN ALL ANGLES
2y 5m to grant; granted Mar 24, 2026

Patent 12530847: IMAGE GENERATION FROM TEXT AND 3D OBJECT
2y 5m to grant; granted Jan 20, 2026

Patent 12530808: Predictive Encoding/Decoding Method and Apparatus for Azimuth Information of Point Cloud
2y 5m to grant; granted Jan 20, 2026

Patent 12524946: METHOD FOR GENERATING FIREWORK VISUAL EFFECT, ELECTRONIC DEVICE, AND STORAGE MEDIUM
2y 5m to grant; granted Jan 13, 2026

Patent 12380621: COMPUTER-IMPLEMENTED SYSTEMS AND METHODS FOR GENERATING ENHANCED MOTION DATA AND RENDERING OBJECTS
2y 5m to grant; granted Aug 05, 2025
Based on the 5 most recent grants by this examiner.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 83%
With Interview: 99% (+25.0%)
Median Time to Grant: 2y 5m
PTA Risk: Moderate

Based on 18 resolved cases by this examiner. Grant probability derived from career allow rate.
