Last updated: April 19, 2026
Application No. 18/053,556
EMBEDDING AN INPUT IMAGE TO A DIFFUSION MODEL

Non-Final OA §103
Filed
Nov 08, 2022
Examiner
RODGERS, ALEXANDER JOHN
Art Unit
2661
Tech Center
2600 — Communications
Assignee
Adobe Inc.
OA Round
3 (Non-Final)
Interview Optional

— +7.0% interview lift. This examiner has a relatively high allow rate; a written response may suffice.
Based on 33 resolved cases, 2023–2026
Examiner Intelligence

RODGERS, ALEXANDER JOHN View full profile →
Grants 70% — above average
Career Allow Rate
23 granted / 33 resolved
+7.7% vs TC avg
Moderate +7% lift
Without
With
+7.0%
Interview Lift
resolved cases with interview
Typical timeline
3y 2m
Avg Prosecution
12 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101
10.1%
-29.9% vs TC avg
§103
43.4%
+3.4% vs TC avg
§102
26.0%
-14.0% vs TC avg
§112
19.8%
-20.2% vs TC avg
Black line = Tech Center average estimate • Based on career data from 33 resolved cases
Office Action

§103
DETAILED ACTION

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 12 December 2025 has been entered.

Response to Arguments
Applicant’s arguments, filed 12 December 2025, with respect to the rejection under 35 U.S.C. 103 of claims 1, 12, and 17 have been fully considered but are not persuasive. 
Addressing the first point, Kim teaches “fine-tuning a pre-trained diffusion model with the image as a single target image”. This is described in paragraph 3 of Section 3.1 “Diffusion CLIP Fine-Tuning” where it is stated the purpose of using the original image to fine tune the DiffusionClip diffusion model as to preserve the identity of the object in the original image. One can also see that in all of the Figure 5-8 which were referenced to show examples of the input image along with prompts used to create output images, that each entry for the DiffusionClip model Kim teaches shows a single target in each image such as a woman or a building being modified by the respective prompts.
Further, Kim “retains an identity of an object in the image based on the fine-tuning”. In fact, as stated above this is part of the motivation for fine-tuning on the same input image. However, we can also see Reference “x0-” in Section 3.1 “Diffusion CLIP Fine-Tuning” first paragraph, where the identity loss is defined as using the input image x0. Also See Figure 2 showing x0 as the original image and as an input to the CLIP loss. See paragraph 3 of Section 3.1 “Diffusion CLIP Fine-Tuning”  as well which describes the purpose of using the original image as to preserve the identity of the object of image.
The rejections below have been updated to address the amended material and arguments as noted above.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 6-9, 11-13, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al (“DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation”) in view of Xiao et al (US Publication No. 20230095092 A1).
Regarding Claim 1, Kim discloses A method comprising: obtaining an image (Reference “input image”, see Figure 2 where an input image is shown as input to the diffusion model) and a prompt for editing the image (Reference “text prompts”, see Introduction paragraph 4. Also note Section 2.2 CLIP Guidance for Image Manipulation describing the text description for the target image);  fine-tuning a pre-trained diffusion model with the image as a single target image to obtain (Reference “original image”, see paragraph 3 of Section 3.1 “Diffusion CLIP Fine-Tuning” the purpose of using the original image as to preserve the identity of the object in the original image.) a fine-tuned diffusion model by computing a plurality of loss values (Reference  “CLIP loss”, “identity loss”, and “weight for each loss”, see Section 3.1 DiffusionClip Fine-Tuning where first the CLIP loss is noted as taking place during training. Next, note the identity loss which is added in this fine-tuning step being described. Finally, note in paragraph 3 the weights for each loss specifically called out which would read as a first and second loss with same respective uses for training and fine-tuning. Further, it is noted in applicant specification paragraph 0068: “In some cases, the system converts the text prompt (or other guidance) into a conditional guidance vector or other multi-dimensional representation. For example, text may be converted into a vector or a series of vectors using a transformer model, or a multi-modal encoder.” Returning to reference Kim, see Section 2.2 CLIP Guidance for Image Manipulation describing the text description for the target image which is then passed into an encoder. Further, note the “encoded vectors” specifically stated in Section 2.2 CLIP Guidance for Image Manipulation, where the text description of target is an encoded vector which is then used to compare against the generated image); generating, using the fine-tuned diffusion model, after the fine-tuning (Reference “fine-tuned model”, see Section 4.3  Image Translation between Unseen Domains describing use of the fine-tuned diffusion models and are fine-tuned which would be after the fine tuning and see 4.4 Noise Combination where the fine-tuned model is used to control the degree of change the attributes in sampling), a modified image by denoising a noise map based on the prompt, wherein the modified image depicts the image with the change indicated by the prompt (See Figure 5 as an example of the different modified images capable of being generated by this invention using this diffusion model. Note textual prompts such as “grey hair” and “Zuckerberg” are overlaid onto the original image to create different versions of the original image, a woman, with the result of the text encoding such as grey hair or even mixed with the attributes of another known entity. Note the first row shows the samples labelled DiffusionClip are produced by the diffusion model disclosed by Kim and are in a pretrained domain. Further also note the denoising of a noisy image or map as shown in Figure 4 by this model based on the text input such as Pixar or Gogh showing text inputs describing an artistic style which has modified the original image) and retains an identity of an object in the image based on the fine-tuning with the image as the single target image (Reference “x0-” see Section 3.1 “Diffusion CLIP Fine-Tuning” first paragraph, where the identity loss is defined as using the input image x0. Also See Figure 2 showing x0 as the original image and as an input to the CLIP loss. See paragraph 3 of Section 3.1 “Diffusion CLIP Fine-Tuning”  as well which describes the purpose of using the original image as to preserve the identity of the object of image) using a diffusion model that has been trained on the image to generate different versions of the image (Reference “Images in the pretrained domain”, see Section 3.3 Image Translation between Unseen Domains, where the diffusion model is trained on these images. Further also note the generation of images as a result of this training as shown in Figure 4 and the previously mentioned Figure 5 where images generated are in a pretrained domain.). However, Kim fails to disclose computing a plurality of loss values corresponding to a plurality of diffusion timesteps, respectively, wherein each of the plurality of loss values is based on a difference between the image and a generated image at the respective diffusion timestep;
Instead, Xiao discloses computing a plurality of loss values corresponding to a plurality of diffusion timesteps (Reference “loss” and “denoising step”, see Specification paragraph 0071 where a further training of a generator is described using a loss that minimizes divergence at each denoising step. Further, it is noted in paragraph 066 which describes the diffusion process as a number of denoising steps T and Equation 2 which shows the steps corresponding to different progressive times t and t-1 describing the current and previous time step in these diffusion steps), respectively, wherein each of the plurality of loss values is based on a difference between the image and a generated image at the respective diffusion timestep (Examiner’s Note: The plurality of loss values described here to which each of them are based on a difference between the image and a generated image at that step refers specifically to the finetuning weight loss function. As noted in paragraph 0037 there are multiple weights for loss functions—one for pre-training and one for fine tuning. Returning to Xiao, a similar loss function is described in Equation 2 which describes the diffusion steps as described above in more detail and as paragraph 0065 describes the diffusion process occurs on one or more inputs such as an input image. Further note in paragraph 0066 where the diffusion steps T numbers in the thousands of steps in generating these intermediate images which are further described in paragraph 0083.). Motivation is also taught by Xiao by reducing the number of steps (See Specification paragraph 0081) required to diffuse these images which a reduction of steps in Kim or Xiao would reduce time required and improve speed. Therefore, it would have been obvious to one of ordinary skill in the art to modify Kim in view of Xiao.
Regarding Claim 3 Kim discloses The method of claim 1, further comprising: initializing a plurality of noise maps (Reference “noise is gradually added to the data”, see 2.1 Diffusion Models showing the equation to create a series of noisy images from the original image which is the data being referred to); generating a plurality of intermediate images corresponding to the plurality of noise maps (See Figure 4 as an example showing different intermediate images which are generated and which each show a corresponding noise) at different noise levels based on the plurality of noise maps using the diffusion model (Reference “noise is gradually added to the data”, see 2.1 Diffusion Models where as noted above noise is added to create a series of images, but further note the series of steps in this forward process which receives a different gaussian noise and therefore reads as different noise levels as these are applied in iterative nature with the different gaussian noise); and computing a loss function by comparing each of the plurality of intermediate images to the image (Reference “ ID loss”, see Section 3.1 Diffusion Clip Fine-Tuning, where prior to fine tuning, an identity loss which is shown between each generated image and the original image and this ID loss shows the original and modified image as inputs to its function), wherein the diffusion model is based on the loss function (Reference “reverse diffusion model”, see Section 3.1 Diffusion Clip Fine tuning where the above loss function is used to fine tune the diffusion model and therefore the diffusion model is based on the loss function).
Regarding Claim 6, Kim discloses The method of claim 1, wherein: the prompt comprises text that describes a modification to the image, wherein the modified image includes the modification (Reference “prompt”, see Section 3.1 Diffusion Clip Fine-Tuning where the prompt includes a text and this text is included in an edited generation of the target image such as “face” versus “angry face”).
Regarding Claim 7, Kim discloses The method of claim 1, wherein: the modified image retains an identity of an object in the image (Reference “Identity loss”, see Section 3.1 Section 3.1 Diffusion Clip Fine-Tuning, where prior to fine tuning, an identity loss which is shown between each generated image and the original image. This identity loss shows how much of the generated or modified image retains an identity of an object in the image, which is tuned to improve this loss or improve the retention of this identity).
Regarding Claim 8, Kim discloses The method of claim 1, further comprising: combining a guidance vector (First, it is noted in applicant specification paragraph 0068: “In some cases, the system converts the text prompt (or other guidance) into a conditional guidance vector or other multi-dimensional representation. For example, text may be converted into a vector or a series of vectors using a transformer model, or a multi-modal encoder.” Returning to reference Kim, see Section 2.2 CLIP Guidance for Image Manipulation describing the text description for the target image which is then passed into a text encoder) with image features within the diffusion model (Reference “image encoder”, see Section 2.2 Clip Guidance for Image Manipulation paragraph 2 where the image and text encoders within this Clip diffusion model encode the text and image features which are later described in detail with respect to losses and tuning in Section 3.1), wherein the modified image is based on the guidance vector. (Finally note Section 3.1 Diffusion Clip Fine-Tuning  paragraph 3 where the prompt includes a text and this text is included in an edited generation of the target image such as “face” versus “angry face”. This generated image comes from the diffusion model as noted previously in Figures 4 and 5 showing example outputs).
Regarding Claim 9, Kim discloses The method of claim 1, further comprising: initializing the diffusion model (See Section 3.1 Diffusion Models, where reception of an input image and conversions are shown within the diffusion model prior to creating noisy versions, training, etc.); training the diffusion model based on a diverse training set to obtain a pre-trained diffusion model (Reference “CLIP”, see Section 2.2 CLIP Guidance for Image Manipulation where the pretrained model is CLIP, or contrastive language image pretraining. Section 3.3 Image Translation between Unseen Domains where DiffusionCLIP which as previously noted generated the images noted in Figures 4 and 5 specifically utilizes pre-trained diffusion model); and fine-tuning the pre-trained diffusion model based on the image (Reference “DiffusionCLIP”, see Section 3. Diffusion Clip and Figure 2 where the input image is used to fine-tune the diffusion model DiffusionClip).
Regarding Claim 11, Kim discloses The method of claim 9, wherein: a first weight for a loss function is used for training the diffusion model and a second weight for the loss function that is different from the first weight is used for fine-tuning the pre-trained diffusion model (Reference  “CLIP loss”, “identity loss”, and “weight for each loss”, see Section 3.1 DiffusionClip Fine-Tuning where first the CLIP loss is noted as taking place during training. Next, note the identity loss which is added in this fine-tuning step being described. Finally, note in paragraph 3 the weights for each loss specifically called out which would read as a first and second loss with same respective uses for training and fine-tuning).
Regarding Claim 12, Kim discloses A non-transitory computer-readable medium (Reference “code” and “server”, see Abstract where a server stores the computer programming code) comprising instructions, that, when executed by a processor, are configured to perform operations  (Reference “Python code”, see Abstract where the programming code is Python which is executable or portable to generic personal computers which execute instructions on processors; inherently for the system to work as discussed, the code would have to implemented by a processor at some point) of fine-tuning a pre-trained diffusion model based on an image as single target image to obtain a tuned diffusion model  (Reference “original image”, see paragraph 3 of Section 3.1 “Diffusion CLIP Fine-Tuning” the purpose of using the original image as to preserve the identity of the object in the original image), wherein the fine-tuning is performed by computing a plurality of loss values (Reference  “CLIP loss”, “identity loss”, and “weight for each loss”, see Section 3.1 DiffusionClip Fine-Tuning where first the CLIP loss is noted as taking place during training. Next, note the identity loss which is added in this fine-tuning step being described. Finally, note in paragraph 3 the weights for each loss specifically called out which would read as a first and second loss with same respective uses for training and fine-tuning. Finally also note the steps shows a plurality of respective losses found for their respective timesteps t and t-1.); receiving a prompt including additional content for the single image (Reference “text prompts”, see Introduction paragraph 4 describing the text prompts. Also note Section 2.2 CLIP Guidance for Image Manipulation describing the text description for the target image); and generating a modified image based on the single image and the prompt using the tuned diffusion model (Reference “prompt”, see Section 3.1 Diffusion Clip Fine-Tuning where the prompt includes a text and this text is included in an edited generation of the target image such as “face” versus “angry face”) after the fine-tuning (Reference “fine-tuned model”, see Section 4.3  Image Translation between Unseen Domains describing use of the fine-tuned diffusion models and are fine-tuned which would be after the fine tuning and see 4.4 Noise Combination where the fine-tuned model is used to control the degree of change the attributes in sampling which can even change multiple attributes as shown in Figure 8 “Multi Attribute Transfer” where a woman is layered with multiple effects of Makeup, Curly Hair and Super Saiyan effects while retaining many of the physical features or identity of the original image), wherein the modified image depicts the image with the additional content indicated by the prompt (See Figure 5 as an example of the different modified images capable of being generated by this invention using this diffusion model. Note textual prompts such as “grey hair” and “Zuckerberg” are overlaid onto the original image to create different versions of the original image, a woman, with the result of the text encoding such as grey hair or even mixed with the attributes of another known entity but retain many of the woman’s original features or identity. Note the first row shows the samples labelled DiffusionClip are produced by the diffusion model disclosed by Kim and are in a pretrained domain and retains an identity of an object in the image based on the tuning with the image as the single target image  (Reference “x0-” see Section 3.1 “Diffusion CLIP Fine-Tuning” first paragraph, where the identity loss is defined as using the input image x0. Also See Figure 2 showing x0 as the original image and as an input to the CLIP loss. See paragraph 3 of Section 3.1 “Diffusion CLIP Fine-Tuning”  as well which describes the purpose of using the original image as to preserve the identity of the object of image). However, Kim fails to disclose wherein the fine-tuning is performed by computing a plurality of loss values corresponding to a plurality of diffusion timesteps, respectively, wherein each of the plurality of loss values is based on a difference between the image and a generated image at the respective diffusion timestep;
Instead, Xiao discloses wherein the fine-tuning is performed by computing a plurality of loss values corresponding to a plurality of diffusion timesteps, respectively (Reference “loss” and “denoising step”, see Specification paragraph 0071 where a further training of a generator is described using a loss that minimizes divergence at each denoising step. Further, it is noted in paragraph 066 which describes the diffusion process as a number of denoising steps T and Equation 2 which shows the steps corresponding to different progressive times t and t-1 describing the current and previous time step in these diffusion steps), wherein each of the plurality of loss values is based on a difference between the image and a generated image at the respective diffusion timestep, respectively, wherein each of the plurality of loss values is based on a difference between the image and a generated image at the respective diffusion timestep (Examiner’s Note: The plurality of loss values described here to which each of them are based on a difference between the image and a generated image at that step refers specifically to the finetuning weight loss function. As noted in paragraph 0037 there are multiple weights for loss functions—one for pre-training and one for fine tuning. Returning to Xiao, a similar loss function is described in Equation 2 which describes the diffusion steps as described above in more detail and as paragraph 0065 describes the diffusion process occurs on one or more inputs such as an input image. Further note in paragraph 0066 where the diffusion steps T numbers in the thousands of steps in generating these intermediate images which are further described in paragraph 0083). Motivation is also taught by Xiao by reducing the number of steps (See Specification paragraph 0081) required to diffuse these images which a reduction of steps in Kim or Xiao would reduce time required and improve speed. Therefore, it would have been obvious to one of ordinary skill in the art to modify Kim in view of Xiao.
Claim 13 is rejected for containing similar limitations to Claim 3. See rejection of Claim 3 above which traverses these limitations.
Regarding Claim 16, Kim discloses The non-transitory computer-readable medium of claim 12, wherein the instructions are further configured to perform: encoding the prompt to obtain a guidance vector (First, it is noted in applicant specification paragraph 0068: “In some cases, the system converts the text prompt (or other guidance) into a conditional guidance vector or other multi-dimensional representation. For example, text may be converted into a vector or a series of vectors using a transformer model, or a multi-modal encoder.” Returning to reference Kim, see Section 2.2 CLIP Guidance for Image Manipulation describing the text description for the target image which is then passed into an encoder); and combining the guidance vector with image features within the tuned diffusion model, wherein the modified image is based on the guidance vector  (See Figure 5, where textual prompts such as “grey hair” and “Zuckerberg” are overlaid onto the original image to create different versions of the original image, a woman, with the result of the text encoding such as grey hair or even mixed with the attributes of another known entity. Note the first row shows the samples labelled DiffusionClip are produced by the diffusion model disclosed by Kim).
Regarding Claim 21, Kim discloses The method of claim 1, wherein fine-tuning the pre-trained diffusion model comprises: embedding the image in a latent text embedding space (Reference “CLIP” see Section 2.2 CLIP Guidance for Image Manipulation paragraph 3 where “EI and ET are CLIPs image and text encoders”. Which follows a very similar use case to applicant’s latent text encoding/embedding see Application specification paragraph 0042, CLIP: “In one example, text encoder 230 comprises a Contrastive Language-Image Pre-training (CLIP) model. CLIP is a contrastive learning model trained for image representation learning using natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples”), wherein the difference is computed in the latent text embedding space (Reference “CLIP loss”, see Section 3.1 DiffusionCLIP Fine-Tuning where the directional CLIP loss as described above is a combination of deltas or differences as shown in the definition supplied in Equation 9 and these deltas are of the pair of encoders described above and therefore is computed in the latent text embedding space).
Regarding Claim 22, Kim discloses The method of claim 1, wherein: the difference is computed between the generated image at the respective diffusion timestep and the image as it was input independent of the respective diffusion timestep (Reference x0- see Section 3.1 “Diffusion CLIP Fine-Tuning” first paragraph, where the identity loss is defined as using the input image x0. Also See Figure 2 showing x0 as the original image and as an input to the CLIP loss. See paragraph 3 of Section 3.1 “Diffusion CLIP Fine-Tuning”  as well which describes the purpose of using the original image as to preserve the identity of the object of image).

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over  Kim et al (“DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation”) in view of Xiao et al (US Publication No. 20230095092 A1) further in view of Liu et al (US Publication No. 20240153152 A1).
Regarding Claim 2, Kim discloses The method of claim 1, but fails to disclose further comprising: receiving the prompt from a user via a text field of a user interface; and displaying the modified image to the user via the user interface.
Instead, Liu discloses The method of claim 1, further comprising: receiving the prompt from a user via a text field of a user interface; and displaying the modified image to the user via the user interface (Reference “display”, “interface”, “text”, see Specification paragraph 0058 where text is received in a user interface. Also note the interface displays the rendered image). Citing KSR Rationale A, such a modification is the result of combining prior art elements according to known methods to yield predictable results. Specifically, interfaces for text prompts dating back to MS-DOS systems for example have always been a convenient method of accepting a text prompt.  Thus, a person of ordinary skill would have appreciated including in Kim’s text prompt some sort of interface or GUI to accept said text prompt since the claimed invention is merely a combination of old elements, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable. Therefore, it would have been obvious to one of ordinary skill in the art prior to the time of filing to modify Kim to include a text interface as taught by Liu.

Claims 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al (US Publication No. 20240153152 A1) in view of Kim et al (“DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation”) further in view of Xiao et al (US Publication No. 20230095092 A1).
Regarding Claim 17, Liu discloses An apparatus for image processing, comprising: one or more processors (Reference “processor”, see Specification paragraph 0020); and one or more memories including instructions executable by the one or more processors to (Reference “non-volatile memory”, see Specification paragraph 0020 where this memory stores an image rendering program, for modifying an image based on a prompt using a DDIM model). However, Liu fails to disclose instructions executable by the one or more processors to: obtain an image and a prompt for editing the image; fine-tune a pre-trained diffusion model based on the image to obtain a tuned diffusion model; and generate a modified image based on the image and the prompt using the tuned diffusion model. 
Instead, Kim discloses instructions executable by one or more processors (Reference “Python code”, see Abstract where the programming code is Python which is executable or portable to generic personal computers which execute instructions on processors)  to: obtain an image  (Reference “input image”, see Figure 2 where an input image is shown as input to the diffusion model) and a prompt indicating a change to the image (Reference “text prompts”, see Introduction paragraph 4. Also note Section 2.2 CLIP Guidance for Image Manipulation describing the text description for the target image); fine-tune a pre-trained diffusion model with the image as a single target image based on the image to obtain a tuned diffusion model (Reference “DiffusionCLIP”, see Section 3. Diffusion Clip and Figure 2 where the input image is used to fine-tune the diffusion model DiffusionClip) wherein the fine-tuning is performed by computing a plurality of loss values (Reference  “CLIP loss”, “identity loss”, and “weight for each loss”, see Section 3.1 DiffusionClip Fine-Tuning where first the CLIP loss is noted as taking place during training. Next, note the identity loss which is added in this fine-tuning step being described. Finally, note in paragraph 3 the weights for each loss specifically called out which would read as a first and second loss with same respective uses for training and fine-tuning. Finally also note the steps shows a plurality of respective losses found for their respective timesteps t and t-1.); and generate, using the tuned diffusion model, a modified image by denoising a noise map based on the prompt, wherein the modified image depicts the image with the change indicated by the prompt (Finally note Section 3.1 Diffusion Clip Fine-Tuning  paragraph 3 where the prompt includes a text and this text is included in an edited generation of the target image such as “face” versus “angry face”. This generated image comes from the diffusion model as noted previously in Figures 4 and 5 showing example outputs. Further also note the generation of images as a result of this training as shown in Figure 4 and the previously mentioned Figure 5 where images generated are in a pretrained domain) and retains an identity of an object in the image based on the fine-tuning with the image as the single target image (Reference “x0-” see Section 3.1 “Diffusion CLIP Fine-Tuning” first paragraph, where the identity loss is defined as using the input image x0. Also See Figure 2 showing x0 as the original image and as an input to the CLIP loss. See paragraph 3 of Section 3.1 “Diffusion CLIP Fine-Tuning”  as well which describes the purpose of using the original image as to preserve the identity of the object of image). Kim teaches that their model for modifying has benefit such as image translation between unseen domains (see Section 3) and reduced noise over other types of models (see the abstract). Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date to modify Liu for the reasons stated above. However, Kim fails to disclose computing a plurality of loss values corresponding to a plurality of diffusion timesteps, respectively, wherein each of the plurality of loss values is based on a difference between the image and a generated image at the respective diffusion timestep;
Instead, Xiao discloses computing a plurality of loss values corresponding to a plurality of diffusion timesteps (Reference “loss” and “denoising step”, see Specification paragraph 0071 where a further training of a generator is described using a loss that minimizes divergence at each denoising step. Further, it is noted in paragraph 066 which describes the diffusion process as a number of denoising steps T and Equation 2 which shows the steps corresponding to different progressive times t and t-1 describing the current and previous time step in these diffusion steps), respectively, wherein each of the plurality of loss values is based on a difference between the image and a generated image at the respective diffusion timestep (Examiner’s Note: The plurality of loss values described here to which each of them are based on a difference between the image and a generated image at that step refers specifically to the finetuning weight loss function. As noted in paragraph 0037 there are multiple weights for loss functions—one for pre-training and one for fine tuning. Returning to Xiao, a similar loss function is described in Equation 2 which describes the diffusion steps as described above in more detail and as paragraph 0065 describes the diffusion process occurs on one or more inputs such as an input image. Further note in paragraph 0066 where the diffusion steps T numbers in the thousands of steps in generating these intermediate images which are further described in paragraph 0083.). Motivation is also taught by Xiao by reducing the number of steps (See Specification paragraph 0081) required to diffuse these images which a reduction of steps in Kim or Xiao would reduce time required and improve speed. Therefore, it would have been obvious to one of ordinary skill in the art to modify Kim in view of Xiao.
Regarding Claim 18, Liu discloses The apparatus of claim 17, but fails to disclose wherein the instructions are further executable by the one or more processors to: encode the prompt to obtain a guidance vector using a text encoder. 
Instead, Kim discloses wherein the instructions are further executable by the one or more processors to: encode the prompt to obtain a guidance vector using a text encoder (Reference “text encoder”, see Section 2.2 CLIP Guidance for Image Manipulation describing the text description for the target image which is then passed into an encoder), wherein the modified image is based on the guidance vector (Finally note Section 3.1 Diffusion Clip Fine-Tuning  paragraph 3 where the prompt includes a text and this text is included in an edited generation of the target image such as “face” versus “angry face”. This generated image comes from the diffusion model as noted previously in Figures 4 and 5 showing example outputs). Kim teaches that their model for modifying has benefit such as image translation between unseen domains (see Section 3) and reduced noise over other types of models (see the abstract). Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date to modify Liu for the reasons stated above.
Regarding claim 19, Liu discloses The apparatus of claim 17, further comprising: receiving the prompt from a user via a text field of a user interface; and displaying the modified image to the user via the user interface (Reference “display”, “interface”, “text”, see Specification paragraph 0058 where text is received in a user interface. Also note the interface displays the rendered image). 
Regarding Claim 20, Liu discloses The apparatus of claim 17, but fails to disclose The apparatus of claim 17, wherein: the diffusion model comprises a Denoising Diffusion Probabilistic Model (DDPM). 
Instead, Kim discloses wherein: the diffusion model comprises a Denoising Diffusion Probabilistic Model (DDPM) (Reference “DDPM”, see Figure 4 where a forward DDPM is used which as noted in Introduction paragraph 3 DDPM are denoising diffusion probabilistic models). Further, motivation for using the DDPM in such a model is given where recent research shows higher quality image synthesis (See Introduction paragraph 3, where this diffusion model is compared to other machine learning models). Therefore, it would have been obvious to one of ordinary skill in the art before the time of filing to use a denoising diffusion probabilistic model as taught by Kim specifically over other machine learning models to modify Liu for image synthesis or generation needs.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER JOHN RODGERS whose telephone number is (703)756-1993. The examiner can normally be reached 5:30AM to 2:30PM ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John Villecco can be reached on (571) 272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ALEXANDER JOHN RODGERS/Examiner, Art Unit 2661                                                                                                                                                                                                        
/JOHN VILLECCO/Supervisory Patent Examiner, Art Unit 2661
Read full office action
Prosecution Timeline

Nov 08, 2022
Application Filed
Apr 05, 2025
Non-Final Rejection — §103
Jun 09, 2025
Interview Requested
Jun 20, 2025
Examiner Interview Summary
Jun 26, 2025
Response Filed
Nov 01, 2025
Final Rejection — §103
Nov 24, 2025
Interview Requested
Dec 04, 2025
Examiner Interview Summary
Dec 04, 2025
Applicant Interview (Telephonic)
Dec 12, 2025
Request for Continued Examination
Jan 13, 2026
Response after Non-Final Action
Feb 07, 2026
Non-Final Rejection — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/758,003
Patent 12548181
INFORMATION PROCESSING APPARATUS, SENSING APPARATUS, MOBILE OBJECT, METHOD FOR PROCESSING INFORMATION, AND INFORMATION PROCESSING SYSTEM
2y 5m to grant Granted Feb 10, 2026
17/979,164
Patent 12541961
INFORMATION EXTRACTION METHOD OF OFFSHORE RAFT CULTURE BASED ON MULTI-TEMPORAL OPTICAL REMOTE SENSING IMAGES
2y 5m to grant Granted Feb 03, 2026
17/952,002
Patent 12494058
RELATIONSHIP MODELING AND KEY FEATURE DETECTION BASED ON VIDEO DATA
2y 5m to grant Granted Dec 09, 2025
18/117,102
Patent 12453511
SYSTEMS AND METHODS FOR CONFIRMATION OF INTOXICATION DETERMINATION
2y 5m to grant Granted Oct 28, 2025
17/932,544
Patent 12430771
LIGHT FIELD RECONSTRUCTION METHOD AND APPARATUS OF A DYNAMIC SCENE
2y 5m to grant Granted Sep 30, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.
AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Prosecution Projections

3-4
Expected OA Rounds
70%
Grant Probability
77%
With Interview (+7.0%)
3y 2m
Median Time to Grant
High
PTA Risk
Based on 33 resolved cases by this examiner. Grant probability derived from career allow rate.