Prosecution Insights
Last updated: April 19, 2026
Application No. 18/601,325

MODIFYING IMAGES FOR IMPROVED SEARCH

Final Rejection under §103

Filed: Mar 11, 2024
Examiner: WANG, JIN CHENG
Art Unit: 2617
Tech Center: 2600 (Communications)
Assignee: Etsy Inc.
OA Round: 2 (Final)

Grant Probability: 59% (Moderate)
Expected OA Rounds: 3-4
Expected Time to Grant: 3y 7m
Grant Probability with Interview: 69%

Examiner Intelligence

Career Allow Rate: 59% (492 granted / 832 resolved; -2.9% vs Tech Center average)
Interview Lift: +10.3% higher allowance rate for resolved cases with an interview (moderate lift)
Average Prosecution: 3y 7m (typical timeline; 40 applications currently pending)
Total Applications: 872 across all art units
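The headline figures above are simple ratios over the examiner's resolved cases. A quick check of the arithmetic in Python; the with/without-interview group rates are not broken out in this report, so those two inputs are placeholders, not data from the record:

    # Career allow rate, from the counts shown above.
    granted, resolved = 492, 832
    allow_rate = granted / resolved        # 0.5913... -> reported as 59%

    # Interview lift = allow rate with an interview minus the rate without.
    rate_with, rate_without = 0.690, 0.587 # placeholders for illustration only
    lift = rate_with - rate_without        # ~0.103 -> reported as +10.3%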

Statute-Specific Performance

§101: 11.8% (-28.2% vs TC avg)
§103: 62.7% (+22.7% vs TC avg)
§102: 7.6% (-32.4% vs TC avg)
§112: 15.5% (-24.5% vs TC avg)

Comparisons are against estimated Tech Center averages; figures are based on career data from 832 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification

The amended title of the invention is still not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.

Response to Amendment

Applicant's submission filed 1/15/2026 has been entered. Claims 1, 14, and 20 have been amended. Claims 1-20 are pending in the current application.

Response to Arguments

Applicant's arguments filed 1/15/2026 have been fully considered but they are not persuasive. In Remarks, applicant argued in essence with respect to the new claim limitation of identifying, using the third embedding, a set of one or more other images that correspond to a modified image represented by the third embedding.

However, Tanjim teaches the claim limitation of identifying, using the third embedding, a set of one or more other images that correspond to a modified image represented by the third embedding. For example, Tanjim teaches at Paragraph 0097 that the second image generator 680 receives the optimized latent code 665 (third embedding) and generates synthetic image 685. Tanjim teaches at Paragraph 0095 that the second text encoder 670 encodes text prompt 605 to generate a text embedding and encodes preliminary image 640 to generate an image embedding (first embedding). Tanjim teaches at FIG. 6 and Paragraphs 0088-0095 that the second text encoder 670, together with the optimization component 660, generates the latent code 665 (third embedding) to optimize the preliminary latent code 635 (second embedding) comprising the text embedding and the image embedding. Tanjim teaches at Paragraphs 0090-0091 that image generation model 615 is a diffusion-based image generation model and includes a variational autoencoder. Tanjim teaches at Paragraph 0089 that the image generation model 615 generates a preliminary latent code 635 (second embedding) based on text prompt 605 and input image 610. The preliminary latent code 635 includes the text embedding and the image embedding (first embedding).

Ravi also teaches the claim limitation of identifying, using the third embedding, a set of one or more other images that correspond to a modified image represented by the third embedding. Ravi teaches at Paragraphs 0045, 0074, and 0087 that the diffusion prior image editing system 102 performs structural editing to generate a third image embedding. Ravi teaches at Paragraph 0087 that the diffusion prior image editing system 102 runs the sampling process of the diffusion decoder conditioned on the conceptually edited embedding (second embedding) to get the final edited latent (third embedding). The generated VAE latent can then be passed through the pre-trained and fixed VAE decoder to get the final edited image. Ravi teaches at Paragraph 0071 that the prior gets to modify the injected embedding according to the edit text, and the closer the generated embedding (the final embedding, i.e., the third embedding) will be to the edit text. In one or more implementations, the diffusion prior image editing system 102 controls the injection timestep using a conceptual edit strength parameter. Ravi teaches at Paragraph [0074]: As mentioned previously, the diffusion prior image editing system 102 can also perform structural editing within a diffusion neural network to generate a modified digital image. For example, FIG. 5 illustrates utilizing a diffusion neural network 524 to generate a modified digital image 506 through structural editing of a base image embedding 502 and a text-edited image embedding 504 (second embedding) in accordance with one or more embodiments. Ravi teaches at Paragraph [0045]: In addition, as shown in FIG. 2, the diffusion prior image editing system 102 can utilize the diffusion neural network 210 to implement structural editing 214. For example, the diffusion prior image editing system 102 can perform structural editing 214 by dynamically controlling the degree/extent to which the modified digital image 212 reflects structure of the base digital image 202 (or the degree/extent to which the diffusion neural network 210 can deviate from the base digital image 202). In particular, the diffusion prior image editing system 102 can dynamically select a structural transition step of the diffusion neural network 210 that varies the amount of structure to retain from the base digital image 202. Additional detail regarding structure editing is provided below (e.g., in relation to FIGS. 3 and 5).

Aggarwal '144 also teaches the claim limitation of identifying, using the third embedding, a set of one or more other images that correspond to a modified image represented by the third embedding. Aggarwal '144 teaches at Paragraph [0091]: In an embodiment, diffusion prior model 615 generates a set of image embeddings based on text embedding 610. Diffusion prior model 615 scores and ranks the set of image embeddings by comparing each image embedding of image embeddings 620 to text embedding 610. In an embodiment, diffusion prior model 615 calculates a similarity score between the text embedding 610 and each image embedding of image embeddings 620 and selects one or more image embeddings 620 with the highest similarity score (e.g., select top k image embeddings that correspond to the top k highest similarity scores). A high similarity score shows that image embedding 620 is similar to text embedding 610 in a common embedding space. Text embedding 610 and image embedding 620 are in a multi-modal embedding space. For example, diffusion prior model 615 ranks the set of image CLIP embeddings and selects an image CLIP embedding (the third embedding) that is closest to the text CLIP embedding.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 5-15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Tanjim et al., US-PGPUB No. 2025/0225683 (hereinafter Tanjim), in view of Aggarwal et al., US-PGPUB No. 2024/0404144 (hereinafter Aggarwal '144); Ravi et al., US-PGPUB No. 2024/0362842 (hereinafter Ravi); Song et al., US-PGPUB No. 2025/0022099 (hereinafter Song); and Aggarwal et al., US-PGPUB No. 2025/0278816 (hereinafter Aggarwal '816).
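The ranking step the examiner quotes from Aggarwal '144 Paragraph 0091 is a standard similarity search in a shared embedding space. A minimal sketch of that top-k selection, assuming unit-length CLIP-style vectors and NumPy; the names (text_emb, image_embs) are illustrative and do not appear in either reference:

    import numpy as np

    def top_k_image_embeddings(text_emb, image_embs, k=1):
        """Rank candidate image embeddings against a text embedding in a
        shared multi-modal space and keep the k most similar (cf. the top-k
        selection described in Aggarwal '144, Paragraph 0091)."""
        # Cosine similarity reduces to a dot product after L2 normalization.
        t = text_emb / np.linalg.norm(text_emb)
        m = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
        scores = m @ t                       # one similarity score per candidate
        top = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
        return top, scores[top]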
Re Claim 1: Tanjim in view of Aggarwal '144 teaches a method comprising:

generating a first embedding that represents an input image using a first encoder, wherein a dimension of the first embedding matches a first dimension (Tanjim teaches at Paragraph 0089 that the image encoder 620 generates an image embedding based on input image 610. Aggarwal '144 teaches at FIG. 5 and Paragraphs 0078-0081 that the color encoder generates a color embedding (image embedding) having a predetermined size that matches that of the text embedding or the image embedding of the input image. With respect to the dimension of the image embedding and/or text embedding, Aggarwal '144 teaches that the text embedding and the image embedding have a dimension size of 768. Aggarwal '144 teaches at Paragraph 0081 that the image embeddings and text CLIP embeddings have a dimension size of 768. Aggarwal '144 teaches at Paragraph [0121]: In an embodiment, machine learning model 500 uses a histogram size of [10, 8, 9] which generates a color embedding having a dimension size 720 (or a batch size of 1, 720). In some cases, the image embeddings and text CLIP embeddings have a dimension size 768. For the color embedding and text CLIP embedding to have the same embedding dimension, machine learning model 500 fills 48 0's to the color embedding to modify the color embedding to have a dimension size 768 (e.g., batch size of 1, 768). Then, the square root of each number in the feature vectors is taken to obtain the color embedding. In some cases, taking the square root can penalize the dominant color and give more weight to the other colors in the image. Aggarwal '144 teaches at Paragraph [0137]: According to an embodiment, the color conditioning is optional and the color embeddings are modified to have 0's vector of the same dimension as the dimension of text prompt. For example, the modified color embedding is applied to the sampling process during training to train the machine learning model to turn on/off color conditioning. Aggarwal '144 teaches at Paragraph 0141 that the machine learning model extracts color embeddings/histograms of the training samples with the same dimensions as the text embedding and image CLIP embedding);

generating, using the first embedding, a second embedding that represents (i) the input image and (ii) a modification to the input image, wherein a dimension of the second embedding matches the first dimension (Tanjim teaches at Paragraphs 0090-0091 that image generation model 615 is a diffusion-based image generation model and includes a variational autoencoder. Tanjim teaches at Paragraph 0089 that the image generation model 615 generates a preliminary latent code 635 (second embedding) based on text prompt 605 and input image 610. The preliminary latent code 635 includes the text embedding and the image embedding (first embedding). Aggarwal '144 teaches at Paragraph 0037 that a color embedding is concatenated with a text embedding and the concatenated embedding is used to predict an image embedding (e.g., image CLIP embedding), and at Paragraph 0105 that at operation 810, the system generates image embeddings based on the text prompt and the color prompt and the machine learning model extracts color embeddings/histograms of the training samples with the same dimensions as the text embedding and image CLIP embedding. Aggarwal '144 teaches at Paragraph 0022 that the diffusion prior model maps a text embedding of the text prompt (e.g., text CLIP embedding using a multi-modal encoder such as CLIP model) to an image embedding (e.g., image CLIP embedding), and at Paragraph 0092 that diffusion prior model 615 generates an image embedding(s) 620 based on the text embedding and the color embedding. Aggarwal '144 teaches at FIG. 5 and Paragraph 0082 that diffusion prior model 525 receives color embedding, token embeddings, and text embedding from multi-modal encoder 515 as input and generates an image embedding based on the text embedding, token embedding, and the color embedding. These embeddings are concatenated. Aggarwal '144 teaches at Paragraph 0121 that the image embeddings and the text embeddings have a dimension size 768);

generating, using the second embedding, a third embedding that represents (i) the input image and (ii) the modification to the input image using a second encoder (Tanjim teaches at Paragraph 0095 that the second text encoder 670 encodes text prompt 605 to generate a text embedding and encodes preliminary image 640 to generate an image embedding (first embedding). Tanjim teaches at FIG. 6 and Paragraphs 0088-0095 that the second text encoder 670, together with the optimization component 660, generates the latent code 665 (third embedding) to optimize the preliminary latent code 635 (second embedding) comprising the text embedding and the image embedding. Tanjim teaches at Paragraphs 0090-0091 that image generation model 615 is a diffusion-based image generation model and includes a variational autoencoder. Tanjim teaches at Paragraph 0089 that the image generation model 615 generates a preliminary latent code 635 (second embedding) based on text prompt 605 and input image 610. The preliminary latent code 635 includes the text embedding and the image embedding (first embedding). Aggarwal '144 teaches at Paragraph 0137 that the color embeddings are modified to have 0's vector of the same dimension as the dimension of text prompt and the modified color embedding is applied to the sampling process during the training, and at Paragraph 0141 that the machine learning model extracts color embeddings/histograms of the training samples with the same dimensions as the text embedding and image CLIP embedding. Aggarwal '144 teaches at Paragraph 0022 that the diffusion prior model maps a text embedding of the text prompt (e.g., text CLIP embedding using a multi-modal encoder such as CLIP model) to an image embedding (e.g., image CLIP embedding), and at Paragraph 0092 that diffusion prior model 615 generates an image embedding(s) 620 based on the text embedding and the color embedding. Aggarwal '144 teaches at Paragraph [0085]: Diffusion prior model 525 maps the text CLIP embedding to a corresponding image CLIP embedding. In some cases, a text CLIP embedding may correspond to a set of image CLIP embeddings (including a second image embedding). Diffusion prior model 525 ranks the set of image CLIP embeddings and selects an image CLIP embedding (the closest image embedding, representing a third embedding) that is closest to the text CLIP embedding based on a metric (e.g., a similarity score). Diffusion prior model 525 is pre-trained and may be retrained. Diffusion prior model 525 is an example of, or includes aspects of, the corresponding element described with reference to FIGS. 2, 6, and 7.
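The dimension matching cited above from Aggarwal '144 Paragraphs 0121 and 0137 (a [10, 8, 9] histogram yields a 720-dimensional color embedding, which is zero-padded by 48 entries up to the 768-dimensional CLIP size and then square-rooted) is mechanical. A minimal sketch under those stated numbers, with illustrative names and no claim to match either reference's actual code:

    import numpy as np

    def match_color_embedding(color_emb, target_dim=768):
        """Zero-pad a color-histogram embedding (e.g., 10*8*9 = 720 dims) up
        to the text/image CLIP dimension, then take square roots, which
        softens dominant colors (per Aggarwal '144, Paragraph 0121)."""
        pad = target_dim - color_emb.shape[-1]   # 768 - 720 = 48 zeros
        padded = np.pad(color_emb, (0, pad))     # histogram counts are non-negative
        return np.sqrt(padded)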
Aggarwal '144 teaches at Paragraph [0091]: In an embodiment, diffusion prior model 615 generates a set of image embeddings based on text embedding 610. Diffusion prior model 615 scores and ranks the set of image embeddings by comparing each image embedding of image embeddings 620 to text embedding 610. In an embodiment, diffusion prior model 615 calculates a similarity score between the text embedding 610 and each image embedding of image embeddings 620 and selects one or more image embeddings 620 with the highest similarity score (e.g., select top k image embeddings that correspond to the top k highest similarity scores). A high similarity score shows that image embedding 620 is similar to text embedding 610 in a common embedding space. Text embedding 610 and image embedding 620 are in a multi-modal embedding space. For example, diffusion prior model 615 ranks the set of image CLIP embeddings and selects an image CLIP embedding (the third embedding) that is closest to the text CLIP embedding. Aggarwal '144 teaches at Paragraph 0037 that a color embedding is concatenated with a text embedding and the concatenated embedding is used to predict an image embedding (e.g., image CLIP embedding), and at Paragraph 0105 that at operation 810, the system generates image embeddings based on the text prompt and the color prompt and the machine learning model extracts color embeddings/histograms of the training samples with the same dimensions as the text embedding and image CLIP embedding. Aggarwal '144 teaches at FIG. 5 and Paragraph 0082 that diffusion prior model 525 receives color embedding, token embeddings, and text embedding from multi-modal encoder 515 as input and generates an image embedding based on the text embedding, token embedding, and the color embedding. These embeddings are concatenated. Aggarwal '144 teaches at Paragraph 0121 that the image embeddings and the text embeddings have a dimension size 768);

and identifying, using the third embedding, a set of one or more other images that correspond to a modified image represented by the third embedding (Tanjim teaches at Paragraph 0097 that the second image generator 680 receives the optimized latent code 665 (third embedding) and generates synthetic image 685. Tanjim teaches at Paragraph 0095 that the second text encoder 670 encodes text prompt 605 to generate a text embedding and encodes preliminary image 640 to generate an image embedding (first embedding). Tanjim teaches at FIG. 6 and Paragraphs 0088-0095 that the second text encoder 670, together with the optimization component 660, generates the latent code 665 (third embedding) to optimize the preliminary latent code 635 (second embedding) comprising the text embedding and the image embedding. Tanjim teaches at Paragraphs 0090-0091 that image generation model 615 is a diffusion-based image generation model and includes a variational autoencoder. Tanjim teaches at Paragraph 0089 that the image generation model 615 generates a preliminary latent code 635 (second embedding) based on text prompt 605 and input image 610. The preliminary latent code 635 includes the text embedding and the image embedding (first embedding). Aggarwal '144 teaches at Paragraph 0022 that the diffusion prior model maps a text embedding of the text prompt (e.g., text CLIP embedding using a multi-modal encoder such as CLIP model) to an image embedding (e.g., image CLIP embedding). Aggarwal '144 teaches at Paragraph 0086 that latent diffusion model 530 may receive the image embedding and output generated image 535, and at Paragraph 0105 that at operation 810, the system generates image embeddings based on the text prompt and the color prompt and the machine learning model extracts color embeddings/histograms of the training samples with the same dimensions as the text embedding and image CLIP embedding).

It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Aggarwal '144's teaching that the image CLIP embedding is generated by the diffusion prior model using the multi-modal encoder having the same dimension as the text embedding and the color embedding (Aggarwal '144 Paragraphs 0121 and 0141) into Tanjim's diffusion-based image generator to have optimized the modified image embedding based on the text embedding and the image embedding of the original input image. One of ordinary skill in the art would have been motivated to have provided a diffusion prior model to have utilized the CLIP encoder to have selected an image CLIP embedding closest to the text CLIP embedding (Aggarwal '144 Paragraph 0091).

Tanjim in view of Ravi teaches a method comprising:

generating a first embedding that represents an input image using a first encoder, wherein a dimension of the first embedding matches a first dimension (Tanjim teaches at Paragraph 0089 that the image encoder 620 generates an image embedding based on input image 610. Ravi teaches at Paragraph 0047 generating a base image embedding 306 using a trained text-image encoder 304 from the base digital image 302);

generating, using the first embedding, a second embedding that represents (i) the input image and (ii) a modification to the input image, wherein a dimension of the second embedding matches the first dimension (Tanjim teaches at Paragraph 0097 that the second image generator 680 receives the optimized latent code 665 (third embedding) and generates synthetic image 685. Tanjim teaches at Paragraph 0095 that the second text encoder 670 encodes text prompt 605 to generate a text embedding and encodes preliminary image 640 to generate an image embedding (first embedding). Tanjim teaches at FIG. 6 and Paragraphs 0088-0095 that the second text encoder 670, together with the optimization component 660, generates the latent code 665 (third embedding) to optimize the preliminary latent code 635 (second embedding) comprising the text embedding and the image embedding. Tanjim teaches at Paragraphs 0090-0091 that image generation model 615 is a diffusion-based image generation model and includes a variational autoencoder. Tanjim teaches at Paragraph 0089 that the image generation model 615 generates a preliminary latent code 635 (second embedding) based on text prompt 605 and input image 610. The preliminary latent code 635 includes the text embedding and the image embedding (first embedding). Ravi teaches that the conceptual editing 208 can include generating a text-edited image embedding (second embedding) using the diffusion prior image editing system 102. Ravi teaches at Paragraph [0042]: As shown in FIG. 2, the diffusion prior image editing system 102 can perform conceptual editing 208 utilizing the diffusion prior neural network 206. The conceptual editing 208 can include combining features of the base digital image 202 and the edit text 204. For example, as described in greater detail below in relation to FIG. 3 and FIG. 4, the diffusion prior image editing system 102 can generate an image embedding utilizing a trained text-image encoder from the base digital image 202. Similarly, the diffusion prior image editing system 102 can generate an edit text embedding utilizing the trained text-image encoder from the edit text 204. The diffusion prior neural network 206 can inject the base image embedding to a dynamically selected conceptual editing denoising step of the diffusion prior neural network 206 and condition subsequent steps of the diffusion prior neural network 206 based on the edit text embedding. Utilizing this approach, the diffusion prior image editing system 102 can generate a text-edited image embedding utilizing the diffusion prior neural network 206. Ravi teaches at Paragraph [0070]: Moreover, although the foregoing description of FIG. 4 focuses on the second set of steps 412, in one or more embodiments, the diffusion prior image editing system 102 still utilizes the first set of steps 410. For instance, in some implementations the first set of steps 410 are utilized to generate intermediate embeddings (second embedding). Ravi teaches at Paragraph 0078 that the diffusion prior image editing system 102 can condition the denoising step utilizing the text-edited image embedding 504. Thus, as shown, the diffusion prior image editing system 102 conditions the denoising step 520n based on the text-edited image embedding 504. Moreover, the diffusion prior image editing system 102 conditions the remaining denoising steps based on the text-edited image embedding 504 (second embedding). Ravi teaches at Paragraphs 0045, 0074, and 0087 that the diffusion prior image editing system 102 performs structural editing to generate a third image embedding. Ravi teaches at Paragraph 0087 that the diffusion prior image editing system 102 runs the sampling process of the diffusion decoder conditioned on the conceptually edited embedding (second embedding) to get the final edited latent (third embedding). The generated VAE latent can then be passed through the pre-trained and fixed VAE decoder to get the final edited image. Ravi teaches at Paragraph 0071 that the prior gets to modify the injected embedding according to the edit text, and the closer the generated embedding (the final embedding, i.e., the third embedding) will be to the edit text. In one or more implementations, the diffusion prior image editing system 102 controls the injection timestep using a conceptual edit strength parameter. Ravi teaches at Paragraph [0074]: As mentioned previously, the diffusion prior image editing system 102 can also perform structural editing within a diffusion neural network to generate a modified digital image. For example, FIG. 5 illustrates utilizing a diffusion neural network 524 to generate a modified digital image 506 through structural editing of a base image embedding 502 and a text-edited image embedding 504 (second embedding) in accordance with one or more embodiments. Ravi teaches at Paragraph [0045]: In addition, as shown in FIG. 2, the diffusion prior image editing system 102 can utilize the diffusion neural network 210 to implement structural editing 214. For example, the diffusion prior image editing system 102 can perform structural editing 214 by dynamically controlling the degree/extent to which the modified digital image 212 reflects structure of the base digital image 202 (or the degree/extent to which the diffusion neural network 210 can deviate from the base digital image 202).
In particular, the diffusion prior image editing system 102 can dynamically select a structural transition step of the diffusion neural network 210 that varies the amount of structure to retain from the base digital image 202. Additional detail regarding structure editing is provided below (e.g., in relation to FIGS. 3 and 5). Ravi teaches at Paragraph 0043 that a diffusion prior neural network generates an image embedding (e.g., a CLIP image embedding) from random noise, conditioned on a text embedding (e.g., a CLIP text embedding). Ravi teaches at Paragraph 0050 that the diffusion prior image editing system 102 utilizes the diffusion prior neural network 316 to analyze the base image embedding 306 and the edit text embedding 312 to generate the text-edited image embedding 318. The text-edited image embedding 318 can include a combination of the trained text-image encoder 304 and the base image embedding 306 according to the learned parameters of the diffusion prior neural network 316. Ravi teaches at Paragraph 0062 that FIG. 4 illustrates the diffusion prior image editing system 102 utilizing a set of steps 412 of a diffusion prior neural network 416 to generate a text-edited image embedding 406 from a base image embedding 402 and edit text embedding 404. It is known from FIG. 3 that the text-edited image embedding 318 has the same dimension as the base image embedding 306. Ravi teaches at Paragraph [0124] that the diffusion prior image editing system 102 also includes the embedding manager 1106. In particular, the embedding manager 1106 can generate, encode, and/or create embeddings from inputs. For example, as described above, the embedding manager 1106 can generate base image embeddings from base digital images. Similarly, the embedding manager 1106 can also generate edit text embeddings from edit text. Ravi teaches at Paragraph 0126 that the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational auto encoder). Ravi teaches at Paragraph 0142: receiving a conceptual edit strength parameter based on user interaction with the conceptual edit controller; determining a conceptual editing step based on the conceptual edit strength parameter; generating, utilizing a diffusion prior neural network, a text-edited image embedding by utilizing a base image embedding of the base digital image and an edit text embedding from the edit text according to the conceptual editing step; and generating a modified digital image from the text-edited image embedding. Ravi teaches at Paragraph [0143]: In addition, in one or more embodiments, generating the modified digital image from the text-edited image embedding comprises generating, utilizing a diffusion neural network, the modified digital image from the text-edited image embedding and the base image embedding. Further, in one or more implementations, generating, utilizing a diffusion prior neural network, the text-edited image embedding comprises injecting the base image embedding at the conceptual editing step of the diffusion prior neural network. Moreover, in some implementations, generating, utilizing a diffusion prior neural network, the text-edited image embedding comprises conditioning a set of steps of the diffusion prior neural network after the conceptual editing step utilizing the edit text embedding. Ravi thus teaches that, using the user-controlled conceptual editing parameters, second and third text-edited image embeddings can be generated);

generating, using the second embedding, a third embedding that represents (i) the input image and (ii) the modification to the input image using a second encoder (Tanjim teaches at Paragraph 0097 that the second image generator 680 receives the optimized latent code 665 (third embedding) and generates synthetic image 685. Tanjim teaches at Paragraph 0095 that the second text encoder 670 encodes text prompt 605 to generate a text embedding and encodes preliminary image 640 to generate an image embedding (first embedding). Tanjim teaches at FIG. 6 and Paragraphs 0088-0095 that the second text encoder 670, together with the optimization component 660, generates the latent code 665 (third embedding) to optimize the preliminary latent code 635 (second embedding) comprising the text embedding and the image embedding. Tanjim teaches at Paragraphs 0090-0091 that image generation model 615 is a diffusion-based image generation model and includes a variational autoencoder. Tanjim teaches at Paragraph 0089 that the image generation model 615 generates a preliminary latent code 635 (second embedding) based on text prompt 605 and input image 610. The preliminary latent code 635 includes the text embedding and the image embedding (first embedding). Ravi teaches that the conceptual editing 208 can include generating a text-edited image embedding (second embedding) using the diffusion prior image editing system 102. Ravi teaches at Paragraph [0042]: As shown in FIG. 2, the diffusion prior image editing system 102 can perform conceptual editing 208 utilizing the diffusion prior neural network 206. The conceptual editing 208 can include combining features of the base digital image 202 and the edit text 204. For example, as described in greater detail below in relation to FIG. 3 and FIG. 4, the diffusion prior image editing system 102 can generate an image embedding utilizing a trained text-image encoder from the base digital image 202. Similarly, the diffusion prior image editing system 102 can generate an edit text embedding utilizing the trained text-image encoder from the edit text 204. The diffusion prior neural network 206 can inject the base image embedding to a dynamically selected conceptual editing denoising step of the diffusion prior neural network 206 and condition subsequent steps of the diffusion prior neural network 206 based on the edit text embedding. Utilizing this approach, the diffusion prior image editing system 102 can generate a text-edited image embedding utilizing the diffusion prior neural network 206. Ravi teaches at Paragraph [0070]: Moreover, although the foregoing description of FIG. 4 focuses on the second set of steps 412, in one or more embodiments, the diffusion prior image editing system 102 still utilizes the first set of steps 410. For instance, in some implementations the first set of steps 410 are utilized to generate intermediate embeddings (second embedding). Ravi teaches at Paragraph 0078 that the diffusion prior image editing system 102 can condition the denoising step utilizing the text-edited image embedding 504. Thus, as shown, the diffusion prior image editing system 102 conditions the denoising step 520n based on the text-edited image embedding 504. Moreover, the diffusion prior image editing system 102 conditions the remaining denoising steps based on the text-edited image embedding 504 (second embedding).
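The Ravi passages quoted in this claim analysis (Paragraphs 0142-0143) describe a control flow: map a user edit-strength value to a denoising step, inject the base image embedding there, and condition the later steps on the edit text embedding. A minimal sketch of that flow; the denoise callable, the noise initialization, and the direction of the strength-to-step mapping are assumptions for illustration, not Ravi's implementation:

    import numpy as np

    def conceptual_edit(base_img_emb, edit_text_emb, strength, denoise, steps=50):
        """Pick an injection step from the user's edit-strength value, inject
        the base image embedding there, and condition the remaining denoising
        steps on the edit text embedding. denoise(z, t, cond) is a placeholder
        for one step of the diffusion prior network."""
        inject_at = round(strength * (steps - 1))  # assumed strength-to-step mapping
        z = np.random.default_rng(0).normal(size=base_img_emb.shape)  # start from noise
        for t in range(steps):
            if t == inject_at:
                z = base_img_emb                   # inject the base image embedding
            cond = edit_text_emb if t >= inject_at else None
            z = denoise(z, t, cond)                # later steps see the edit text
        return z                                   # the text-edited image embedding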
Ravi teaches at Paragraphs 0045, 0074, and 0087 that the diffusion prior image editing system 102 performs structural editing to generate a third image embedding. Ravi teaches at Paragraph 0087 that the diffusion prior image editing system 102 runs the sampling process of the diffusion decoder conditioned on the conceptually edited embedding (second embedding) to get the final edited latent (third embedding). The generated VAE latent can then be passed through the pre-trained and fixed VAE decoder to get the final edited image. Ravi teaches at Paragraph 0071 that the prior gets to modify the injected embedding according to the edit text, and the closer the generated embedding (the final embedding, i.e., the third embedding) will be to the edit text. In one or more implementations, the diffusion prior image editing system 102 controls the injection timestep using a conceptual edit strength parameter. Ravi teaches at Paragraph [0074]: As mentioned previously, the diffusion prior image editing system 102 can also perform structural editing within a diffusion neural network to generate a modified digital image. For example, FIG. 5 illustrates utilizing a diffusion neural network 524 to generate a modified digital image 506 through structural editing of a base image embedding 502 and a text-edited image embedding 504 (second embedding) in accordance with one or more embodiments. Ravi teaches at Paragraph [0045]: In addition, as shown in FIG. 2, the diffusion prior image editing system 102 can utilize the diffusion neural network 210 to implement structural editing 214. For example, the diffusion prior image editing system 102 can perform structural editing 214 by dynamically controlling the degree/extent to which the modified digital image 212 reflects structure of the base digital image 202 (or the degree/extent to which the diffusion neural network 210 can deviate from the base digital image 202). In particular, the diffusion prior image editing system 102 can dynamically select a structural transition step of the diffusion neural network 210 that varies the amount of structure to retain from the base digital image 202. Additional detail regarding structure editing is provided below (e.g., in relation to FIGS. 3 and 5). Ravi teaches at Paragraph 0043 that a diffusion prior neural network generates an image embedding (e.g., a CLIP image embedding) from random noise, conditioned on a text embedding (e.g., a CLIP text embedding). Ravi teaches at Paragraph 0050 that the diffusion prior image editing system 102 utilizes the diffusion prior neural network 316 to analyze the base image embedding 306 and the edit text embedding 312 to generate the text-edited image embedding 318. The text-edited image embedding 318 can include a combination of the trained text-image encoder 304 and the base image embedding 306 according to the learned parameters of the diffusion prior neural network 316. Ravi teaches at Paragraph 0062 that FIG. 4 illustrates the diffusion prior image editing system 102 utilizing a set of steps 412 of a diffusion prior neural network 416 to generate a text-edited image embedding 406 from a base image embedding 402 and edit text embedding 404. It is known from FIG. 3 that the text-edited image embedding 318 has the same dimension as the base image embedding 306. Ravi teaches at Paragraph [0124] that the diffusion prior image editing system 102 also includes the embedding manager 1106. In particular, the embedding manager 1106 can generate, encode, and/or create embeddings from inputs. For example, as described above, the embedding manager 1106 can generate base image embeddings from base digital images. Similarly, the embedding manager 1106 can also generate edit text embeddings from edit text. Ravi teaches at Paragraph 0126 that the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational auto encoder). Ravi teaches at Paragraph 0142: receiving a conceptual edit strength parameter based on user interaction with the conceptual edit controller; determining a conceptual editing step based on the conceptual edit strength parameter; generating, utilizing a diffusion prior neural network, a text-edited image embedding by utilizing a base image embedding of the base digital image and an edit text embedding from the edit text according to the conceptual editing step; and generating a modified digital image from the text-edited image embedding. Ravi teaches at Paragraph [0143]: In addition, in one or more embodiments, generating the modified digital image from the text-edited image embedding comprises generating, utilizing a diffusion neural network, the modified digital image from the text-edited image embedding and the base image embedding. Further, in one or more implementations, generating, utilizing a diffusion prior neural network, the text-edited image embedding comprises injecting the base image embedding at the conceptual editing step of the diffusion prior neural network. Moreover, in some implementations, generating, utilizing a diffusion prior neural network, the text-edited image embedding comprises conditioning a set of steps of the diffusion prior neural network after the conceptual editing step utilizing the edit text embedding. Ravi thus teaches that, using the user-controlled conceptual editing parameters, second and third text-edited image embeddings can be generated);

and identifying, using the third embedding, a set of one or more other images that correspond to a modified image represented by the third embedding (Tanjim teaches at Paragraph 0097 that the second image generator 680 receives the optimized latent code 665 and generates synthetic image 685. Ravi teaches that the conceptual editing 208 can include generating a text-edited image embedding (second embedding) using the diffusion prior image editing system 102. Ravi teaches at Paragraph [0042]: As shown in FIG. 2, the diffusion prior image editing system 102 can perform conceptual editing 208 utilizing the diffusion prior neural network 206. The conceptual editing 208 can include combining features of the base digital image 202 and the edit text 204. For example, as described in greater detail below in relation to FIG. 3 and FIG. 4, the diffusion prior image editing system 102 can generate an image embedding utilizing a trained text-image encoder from the base digital image 202. Similarly, the diffusion prior image editing system 102 can generate an edit text embedding utilizing the trained text-image encoder from the edit text 204. The diffusion prior neural network 206 can inject the base image embedding to a dynamically selected conceptual editing denoising step of the diffusion prior neural network 206 and condition subsequent steps of the diffusion prior neural network 206 based on the edit text embedding. Utilizing this approach, the diffusion prior image editing system 102 can generate a text-edited image embedding utilizing the diffusion prior neural network 206. Ravi teaches at Paragraph [0070]: Moreover, although the foregoing description of FIG. 4 focuses on the second set of steps 412, in one or more embodiments, the diffusion prior image editing system 102 still utilizes the first set of steps 410. For instance, in some implementations the first set of steps 410 are utilized to generate intermediate embeddings (second embedding). Ravi teaches at Paragraph 0078 that the diffusion prior image editing system 102 can condition the denoising step utilizing the text-edited image embedding 504. Thus, as shown, the diffusion prior image editing system 102 conditions the denoising step 520n based on the text-edited image embedding 504. Moreover, the diffusion prior image editing system 102 conditions the remaining denoising steps based on the text-edited image embedding 504 (second embedding). Ravi teaches at Paragraphs 0045, 0074, and 0087 that the diffusion prior image editing system 102 performs structural editing to generate a third image embedding. Ravi teaches at Paragraph 0087 that the diffusion prior image editing system 102 runs the sampling process of the diffusion decoder conditioned on the conceptually edited embedding (second embedding) to get the final edited latent (third embedding). The generated VAE latent can then be passed through the pre-trained and fixed VAE decoder to get the final edited image. Ravi teaches at Paragraph 0071 that the prior gets to modify the injected embedding according to the edit text, and the closer the generated embedding (the final embedding, i.e., the third embedding) will be to the edit text. In one or more implementations, the diffusion prior image editing system 102 controls the injection timestep using a conceptual edit strength parameter. Ravi teaches at Paragraph [0074]: As mentioned previously, the diffusion prior image editing system 102 can also perform structural editing within a diffusion neural network to generate a modified digital image. For example, FIG. 5 illustrates utilizing a diffusion neural network 524 to generate a modified digital image 506 through structural editing of a base image embedding 502 and a text-edited image embedding 504 (second embedding) in accordance with one or more embodiments. Ravi teaches at Paragraph [0045]: In addition, as shown in FIG. 2, the diffusion prior image editing system 102 can utilize the diffusion neural network 210 to implement structural editing 214. For example, the diffusion prior image editing system 102 can perform structural editing 214 by dynamically controlling the degree/extent to which the modified digital image 212 reflects structure of the base digital image 202 (or the degree/extent to which the diffusion neural network 210 can deviate from the base digital image 202). In particular, the diffusion prior image editing system 102 can dynamically select a structural transition step of the diffusion neural network 210 that varies the amount of structure to retain from the base digital image 202. Additional detail regarding structure editing is provided below (e.g., in relation to FIGS. 3 and 5). Ravi teaches at Paragraph 0106 that the modified digital images 716a-716h progressively emphasize the input text 712 and de-emphasize the base digital image as the conceptual strength parameter increases. Indeed, the concept of 'oranges' is gradually encoded into the embedding as c increases.
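Ravi's Paragraph 0106, cited just above, describes a sweep of the conceptual strength parameter c: as c grows, the output drifts from the base image toward the edit text. Reusing the hypothetical conceptual_edit helper sketched earlier, the sweep is a loop; decode_image and denoise remain stand-ins for the diffusion decoder plus VAE decoder described at Paragraph 0087:

    # Produce a spectrum of edits, as in Ravi's images 716a-716h:
    # low c keeps the base image's character; high c emphasizes the edit text.
    strengths = [0.1, 0.25, 0.4, 0.55, 0.7, 0.85, 1.0]  # illustrative values
    variants = [
        decode_image(conceptual_edit(base_emb, text_emb, c, denoise))
        for c in strengths
    ]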
Ravi teaches at Paragraph 0074 that FIG. 5 illustrates utilizing a diffusion neural network 524 to generate a modified digital image 506 through structural editing of a base image embedding 502 and a text-edited image embedding 504).

It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Ravi's teaching of the diffusion prior image editing system for generating a text-edited image embedding or CLIP image embedding having the same dimension as the base image embedding into Tanjim's diffusion-based image generator to have generated text-edited image embeddings based on the text embedding and the image embedding of the original input image. One of ordinary skill in the art would have been motivated to have provided a diffusion prior model comprising an encoder to have generated a text-edited image embedding having the same dimension as the base image embedding.

Song teaches at Paragraph 0175 that noised image 850 is a visual representation of noisy features obtained by adding noise to image features of a first image obtained from an image encoder. Song teaches at Paragraph 0039 that the adapter network modifies the dimensions of the image embedding to match dimensions of a text embedding, and at Paragraph 0095 that image generation apparatus 500 provides a descriptive embedding of the second image as guidance to image generation model 535 for generating the composite image. In some aspects, the descriptive embedding includes a same number of dimensions as a text embedding used to train the image generation model. Song teaches at Paragraph 0180: encoding the second image using an image encoder to obtain an image embedding; generating a descriptive embedding based on the image embedding using an adapter network; and generating a composite image based on the descriptive embedding and the first image using an image generation model.

It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Song's teaching of generating a descriptive embedding based on the image embedding and the text embedding into Tanjim's diffusion-based image generator to have optimized the modified image embedding based on the text embedding and the image embedding of the original input image. One of ordinary skill in the art would have been motivated to have provided a diffusion prior model to have utilized the descriptive embedding.

Aggarwal '816 teaches at Paragraph 0045 that the image embedding is a 512-dimensional vector that represents the content and visual features of the input image. Aggarwal '816 teaches at Paragraph 0048 that the text embedding is a 512-dimensional vector. Aggarwal '816 teaches at Paragraph [0053]: The image encoder 204 (e.g., an autoencoder and/or any other encoder) is communicatively coupled to an image embedding generator 206, where the image encoder 204 and the generator 206 generate one or more image embeddings based on a ground truth image 202. The image encoder 204 receives the ground truth image 202. The image 202 shows a screaming cat wearing a chef's hat and standing in a professional kitchen holding a utensil. The image encoder 204 analyzes the structure of the image to determine its specific structural features, e.g., "a screaming cat", "a cat wearing a chef's hat", a "professional kitchen", "a cat standing in a professional kitchen", "a cat holding a utensil", etc. The encoder 204 then uses structural features to generate one or more image embeddings having a predetermined dimension. The image embeddings are resized (e.g., by the image embedding generator 206) to ensure that all inputs that are provided to the ML model 122 for training have uniform dimension. For example, the image encoder 204 generates latents of size 8×32×32, which are resized to 8×1024. The latents have structural information of the ground truth image 202 (e.g., as shown in FIG. 2, cat in a chef's hat holding a utensil and standing in professional kitchen). As can be understood, any desired dimensions can be used by the system 200 in connection with generating and/or resizing dimensions of embeddings provided to the ML model (either during training and/or during inferencing). The generator 206 then provides the image embeddings to the ML model 122 for training. Aggarwal '816 teaches at Paragraph [0054] that the text encoder 210 (e.g., a T5 encoder, a CLIP encoder, etc.) is communicatively coupled to a text embedding generator 220, where the text encoder 210 and the generator 220 generate one or more text embeddings based on a text input and/or prompt (terms may be used interchangeably herewith) 208. The text input 208 can describe one or more specific features related to the ground truth image 202. For example, the text input 208 states "a cat chef screaming at a dish in a professional kitchen." Similar to the processing performed by the image encoder 204, the text encoder 210 generates one or more text embeddings (as discussed herein) having a predetermined dimension. The text embeddings are resized (e.g., by the text embedding generator 220) to ensure that all inputs that are provided to the ML model 122 for training have uniform dimension. The generator 220 then provides the text embeddings to the ML model 122 for training.

It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Aggarwal '816's teaching that the image CLIP embedding is generated by the diffusion prior model using the multi-modal encoder having the same dimension as the text embedding and the color embedding into Tanjim's diffusion-based image generator to have optimized the modified image embedding based on the text embedding and the image embedding of the original input image. One of ordinary skill in the art would have been motivated to have provided a diffusion prior model to have utilized the CLIP encoder to have selected an image CLIP embedding closest to the text CLIP embedding.

Re Claim 2: Claim 2 encompasses the same scope of invention as claim 1 except for the additional claim limitation that the modification input is text provided by a user of an input device. Tanjim, Ravi, and Aggarwal '144 further teach the claim limitation that the modification input is text provided by a user of an input device (Aggarwal '144 teaches at Paragraph 0035 that user 100 provides a text prompt and a color prompt via user device 105 and the text prompt may include a natural language statement. Tanjim teaches at Paragraph 0032 that user 100 provides a text prompt and an input image to image processing apparatus 100. Ravi teaches at Paragraph 0038 that the diffusion prior image editing system 102 receives the edit text 204 based on user interaction with a user interface of a client device).
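Claims 6 and 7, addressed below, turn on identifying images with an approximate nearest neighbor (ANN) algorithm run over embeddings of the same dimension. Neither quoted passage names a specific ANN method, so the sketch below uses an exact brute-force scan as a stand-in for an approximate index; names are illustrative:

    import numpy as np

    def nearest_images(query_emb, catalog_embs, k=5):
        """Return the indices of the k catalog embeddings closest to the
        query (the 'third embedding'). An exact scan stands in here for an
        ANN index, which trades a little recall for sub-linear search time
        at catalog scale."""
        dists = np.linalg.norm(catalog_embs - query_emb, axis=1)  # Euclidean distance
        return np.argsort(dists)[:k]                              # k smallest distances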
Re Claim 5: Claim 5 encompasses the same scope of invention as claim 1 except for the additional claim limitation that the first encoder and the second encoder are autoencoders. Ravi and Tanjim further teach the claim limitation that the first encoder and the second encoder are autoencoders (Ravi teaches at Paragraph [0125]: As further illustrated in FIG. 11, the diffusion prior image editing system 102 includes the diffusion prior conceptual editing engine 1108. In particular, the diffusion prior conceptual editing engine 1108 can perform conceptual editing processes utilizing a diffusion prior neural network (e.g., by creating, generating, or encoding a text-edited image embedding). For example, as described above, the diffusion prior conceptual editing engine 1108 can select a conceptual editing step for injecting a base image embedding and condition denoising steps of a diffusion prior neural network on edit text embeddings. Ravi teaches at Paragraph [0126]: Additionally, the diffusion prior image editing system 102 includes the diffusion structural editing engine 1110. In particular, the diffusion structural editing engine 1110 can perform structural editing processes utilizing a diffusion neural network. For example, as described above, the diffusion structural editing engine 1110 can select a structural transition step and/or a structural number and utilize a diffusion noising model and/or diffusion neural network to generate a modified digital image. In one or more implementations, the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational auto encoder). Tanjim teaches at Paragraphs 0090-0091 that image generation model 615 is a diffusion-based image generation model and includes a variational autoencoder. Tanjim teaches at Paragraph 0089 that the image generation model 615 generates a preliminary latent code 635 based on text prompt 605 and input image 610. The preliminary latent code 635 includes the text embedding and the image embedding).

Re Claim 6: Claim 6 encompasses the same scope of invention as claim 1 except for the additional claim limitation that identifying the set of one or more images that are different from the input image comprises: performing one or more operations of an approximate nearest neighbor (ANN) algorithm. Ravi and Aggarwal '144 further teach this claim limitation (Aggarwal '144 teaches at Paragraph [0091]: In an embodiment, diffusion prior model 615 generates a set of image embeddings based on text embedding 610. Diffusion prior model 615 scores and ranks the set of image embeddings by comparing each image embedding of image embeddings 620 to text embedding 610. In an embodiment, diffusion prior model 615 calculates a similarity score between the text embedding 610 and each image embedding of image embeddings 620 and selects one or more image embeddings 620 with the highest similarity score (e.g., select top k image embeddings that correspond to the top k highest similarity scores). A high similarity score shows that image embedding 620 is similar to text embedding 610 in a common embedding space. Text embedding 610 and image embedding 620 are in a multi-modal embedding space. For example, diffusion prior model 615 ranks the set of image CLIP embeddings and selects an image CLIP embedding that is closest to the text CLIP embedding. Ravi teaches at Paragraph [0125]: As further illustrated in FIG. 11, the diffusion prior image editing system 102 includes the diffusion prior conceptual editing engine 1108. In particular, the diffusion prior conceptual editing engine 1108 can perform conceptual editing processes utilizing a diffusion prior neural network (e.g., by creating, generating, or encoding a text-edited image embedding). For example, as described above, the diffusion prior conceptual editing engine 1108 can select a conceptual editing step for injecting a base image embedding and condition denoising steps of a diffusion prior neural network on edit text embeddings. Ravi teaches at Paragraph [0126]: Additionally, the diffusion prior image editing system 102 includes the diffusion structural editing engine 1110. In particular, the diffusion structural editing engine 1110 can perform structural editing processes utilizing a diffusion neural network. For example, as described above, the diffusion structural editing engine 1110 can select a structural transition step and/or a structural number and utilize a diffusion noising model and/or diffusion neural network to generate a modified digital image. In one or more implementations, the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational auto encoder). Ravi teaches at Paragraph 0043 that a diffusion prior neural network generates an image embedding (e.g., a CLIP image embedding) from random noise, conditioned on a text embedding (e.g., a CLIP text embedding). Ravi teaches at Paragraph 0050 that the diffusion prior image editing system 102 utilizes the diffusion prior neural network 316 to analyze the base image embedding 306 and the edit text embedding 312 to generate the text-edited image embedding 318. The text-edited image embedding 318 can include a combination of the trained text-image encoder 304 and the base image embedding 306 according to the learned parameters of the diffusion prior neural network 316. Ravi teaches at Paragraph 0062 that FIG. 4 illustrates the diffusion prior image editing system 102 utilizing a set of steps 412 of a diffusion prior neural network 416 to generate a text-edited image embedding 406 from a base image embedding 402 and edit text embedding 404. It is known from FIG. 3 that the text-edited image embedding 318 has the same dimension as the base image embedding 306. Ravi teaches at Paragraph [0124] that the diffusion prior image editing system 102 also includes the embedding manager 1106. In particular, the embedding manager 1106 can generate, encode, and/or create embeddings from inputs. For example, as described above, the embedding manager 1106 can generate base image embeddings from base digital images. Similarly, the embedding manager 1106 can also generate edit text embeddings from edit text. Ravi teaches at Paragraph 0126 that the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational auto encoder).
Ravi teaches at Paragraph 0142 receiving a conceptual edit strength parameter based on user interaction with the conceptual edit controller; determining a conceptual editing step based on the conceptual edit strength parameter; generating, utilize a diffusion prior neural network, a text-edited image embedding by utilizing a base image embedding of the base digital image and an edit text embedding from the edit text according to the conceptual editing step; and generating a modified digital image from the text-edited image embedding. Ravi teaches at Paragraph [0143] In addition, in one or more embodiments, generating the modified digital image from the text-edited image embedding comprises generating, utilizing a diffusion neural network, the modified digital image from the text-edited image embedding and the base image embedding. Further, in one or more implementations, generating, utilize a diffusion prior neural network, the text-edited image embedding comprises injecting the base image embedding at the conceptual editing step of the diffusion prior neural network. Moreover, in some implementations, generating, utilize a diffusion prior neural network, the text-edited image embedding comprises conditioning a set of steps of the diffusion prior neural network after the conceptual editing step utilizing the edit text embedding.Ravi thus teaches using the user-controlled conceptual editing parameters, second and third text-edited image embedding can be generated). Re Claim 7: The claim 7 encompasses the same scope of invention as that of the claim 6 except additional claim limitation that, prior to performing the one or more operations of the ANN algorithm, the method comprises: generating, using the second encoder, one or more embeddings of a same dimension as the third embedding; and performing, using the one or more embeddings of the same dimension as the third embedding and the third embedding, the one or more operations of the ANN algorithm. Ravi and Aggarwal ‘144 further teach the claim limitation that prior to performing the one or more operations of the ANN algorithm, the method comprises: generating, using the second encoder, one or more embeddings of a same dimension as the third embedding; and performing, using the one or more embeddings of the same dimension as the third embedding and the third embedding, the one or more operations of the ANN algorithm (Ravi teaches at Paragraph [0125] As further illustrated in FIG. 11, the diffusion prior image editing system 102 includes the diffusion prior conceptual editing engine 1108. In particular, the diffusion prior conceptual editing engine 1108 can perform conceptual editing processes utilizing a diffusion prior neural network (e.g., by creating, generating, or encoding a text-edited image embedding). For example, as described above, the diffusion prior conceptual editing engine 1108 can select a conceptual editing step for injecting a base image embedding and condition denoising steps of a diffusion prior neural network on edit text embeddings. Ravi teaches at Paragraph [0126] Additionally, the diffusion prior image editing system 102 includes the diffusion structural editing engine 1110. In particular, the diffusion structural editing engine 1110 can perform structural editing processes utilizing a diffusion neural network. 
For example, as described above, the diffusion structural editing engine 1110 can select a structural transition step and/or a structural number and utilize a diffusion noising model and/or diffusion neural network to generate a modified digital image. In one or more implementations, the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational autoencoder). Ravi teaches at Paragraph 0043 that a diffusion prior neural network generates an image embedding (e.g., a CLIP image embedding) from random noise, conditioned on a text embedding (e.g., a CLIP text embedding). Ravi teaches at Paragraph 0050 that the diffusion prior image editing system 102 utilizes the diffusion prior neural network 316 to analyze the base image embedding 306 and the edit text embedding 312 to generate the text-edited image embedding 318. The text-edited image embedding 318 can include a combination of the trained text-image encoder 304 and the base image embedding 306 according to the learned parameters of the diffusion prior neural network 316. Ravi teaches at Paragraph 0062 that FIG. 4 illustrates the diffusion prior image editing system 102 utilizing a set of steps 412 of a diffusion prior neural network 416 to generate a text-edited image embedding 406 from a base image embedding 402 and edit text embedding 404. It is known from FIG. 3 that the text-edited image embedding 318 has the same dimension as the base image embedding 306. Ravi teaches at Paragraph [0124] that the diffusion prior image editing system 102 also includes the embedding manager 1106. In particular, the embedding manager 1106 can generate, encode, and/or create embeddings from inputs. For example, as described above, the embedding manager 1106 can generate base image embeddings from base digital images. Similarly, the embedding manager 1106 can also generate edit text embeddings from edit text. Ravi teaches at Paragraph 0126 that the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational autoencoder). Ravi teaches at Paragraph 0142 receiving a conceptual edit strength parameter based on user interaction with the conceptual edit controller; determining a conceptual editing step based on the conceptual edit strength parameter; generating, utilizing a diffusion prior neural network, a text-edited image embedding by utilizing a base image embedding of the base digital image and an edit text embedding from the edit text according to the conceptual editing step; and generating a modified digital image from the text-edited image embedding. Ravi teaches at Paragraph [0143] that, in one or more embodiments, generating the modified digital image from the text-edited image embedding comprises generating, utilizing a diffusion neural network, the modified digital image from the text-edited image embedding and the base image embedding. Further, in one or more implementations, generating, utilizing a diffusion prior neural network, the text-edited image embedding comprises injecting the base image embedding at the conceptual editing step of the diffusion prior neural network.
Moreover, in some implementations, generating, utilizing a diffusion prior neural network, the text-edited image embedding comprises conditioning a set of steps of the diffusion prior neural network after the conceptual editing step utilizing the edit text embedding. Ravi thus teaches that, using the user-controlled conceptual editing parameters, second and third text-edited image embeddings can be generated. Aggarwal ‘144 teaches at Paragraph 0037 that a color embedding is concatenated with a text embedding and the concatenated embedding is used to predict an image embedding (e.g., image CLIP embedding) and at Paragraph 0105 that at operation 810, the system generates image embeddings based on the text prompt and the color prompt and the machine learning model extracts color embeddings/histograms of the training samples with the same dimensions as the text embedding and image CLIP embedding. Aggarwal ‘144 teaches at Paragraph 0022 that the diffusion prior model maps a text embedding of the text prompt (e.g., text CLIP embedding using a multi-modal encoder such as CLIP model) to an image embedding (e.g., image CLIP embedding) and at Paragraph 0092 that diffusion prior model 615 generates image embedding(s) 620 based on the text embedding and the color embedding. Aggarwal ‘144 teaches at FIG. 5 and Paragraph 0082 that diffusion prior model 525 receives color embedding, token embeddings, and text embedding from multi-modal encoder 515 as input and generates an image embedding based on the text embedding, token embedding and the color embedding. These embeddings are concatenated. Aggarwal ‘144 teaches at Paragraph 0121 that the image embeddings and the text embeddings have a dimension size 768. Song teaches at Paragraph 0175 that noised image 850 is a visual representation of noisy features obtained by adding noise to image features of a first image obtained from an image encoder. Song teaches at Paragraph 0039 that the adapter network modifies the dimensions of the image embedding to match dimensions of a text embedding and at Paragraph 0095 that image generation apparatus 500 provides a descriptive embedding of the second image as guidance to image generation model 535 for generating the composite image. In some aspects, the descriptive embedding includes a same number of dimensions as a text embedding used to train the image generation model).

Re Claim 8: The claim 8 encompasses the same scope of invention as that of the claim 1 except the additional claim limitation that generating the third embedding that represents (i) the input image and (ii) the modification to the input image comprises: compressing the second embedding from the first dimension to a second dimension (Tanjim teaches at Paragraph [0091] that an autoencoder is a type of ANN used to learn efficient data encoding in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, e.g., for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side may also be learned. The reconstructing network tries to generate, from the reduced encoding, a representation as close as possible to the original input.
Several variants exist to the basic model, with the aim of forcing the learned representations of the input to assume useful properties).

Re Claim 9: The claim 9 encompasses the same scope of invention as that of the claim 8 except the additional claim limitation that the first dimension includes 16,000 values and the second dimension includes 512 values. Tanjim further teaches the claim limitation that the first dimension includes 16,000 values and the second dimension includes 512 values (Tanjim teaches at Paragraph [0091] that an autoencoder is a type of ANN used to learn efficient data encoding in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, e.g., for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side may also be learned. The reconstructing network tries to generate, from the reduced encoding, a representation as close as possible to the original input. Several variants exist to the basic model, with the aim of forcing the learned representations of the input to assume useful properties).

Re Claim 10: The claim 10 encompasses the same scope of invention as that of the claim 1 except the additional claim limitation that generating the first embedding that represents the input image comprises: generating an initial embedding using the first encoder; and generating a diffused embedding as the first embedding using a diffusion model. Ravi, Tanjim and Aggarwal ‘144 further teach the claim limitation that generating the first embedding that represents the input image comprises: generating an initial embedding using the first encoder; and generating a diffused embedding as the first embedding using a diffusion model (Ravi teaches at Paragraph [0125] that, as further illustrated in FIG. 11, the diffusion prior image editing system 102 includes the diffusion prior conceptual editing engine 1108. In particular, the diffusion prior conceptual editing engine 1108 can perform conceptual editing processes utilizing a diffusion prior neural network (e.g., by creating, generating, or encoding a text-edited image embedding). For example, as described above, the diffusion prior conceptual editing engine 1108 can select a conceptual editing step for injecting a base image embedding and condition denoising steps of a diffusion prior neural network on edit text embeddings. Ravi teaches at Paragraph [0126] that, additionally, the diffusion prior image editing system 102 includes the diffusion structural editing engine 1110. In particular, the diffusion structural editing engine 1110 can perform structural editing processes utilizing a diffusion neural network. For example, as described above, the diffusion structural editing engine 1110 can select a structural transition step and/or a structural number and utilize a diffusion noising model and/or diffusion neural network to generate a modified digital image. In one or more implementations, the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational autoencoder). Ravi teaches at Paragraph 0043 that a diffusion prior neural network generates an image embedding (e.g., a CLIP image embedding) from random noise, conditioned on a text embedding (e.g., a CLIP text embedding).
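The compression recited in claims 8 and 9, from a first dimension of 16,000 values to a second dimension of 512 values, is the kind of dimensionality reduction the autoencoder of Tanjim's Paragraph [0091] describes. A minimal PyTorch-style sketch under that reading; the layer sizes, hidden width, and names are illustrative assumptions, not Tanjim's model:

    import torch
    from torch import nn

    class EmbeddingCompressor(nn.Module):
        # Encoder squeezes a 16,000-value embedding down to 512 values;
        # the decoder learns to reconstruct the input, mirroring the
        # "reduction side" and "reconstructing side" Tanjim describes.
        def __init__(self, in_dim=16_000, code_dim=512):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(in_dim, 2048), nn.ReLU(), nn.Linear(2048, code_dim))
            self.decoder = nn.Sequential(
                nn.Linear(code_dim, 2048), nn.ReLU(), nn.Linear(2048, in_dim))

        def forward(self, x):
            code = self.encoder(x)      # compressed embedding (512 values)
            recon = self.decoder(code)  # reconstruction, used only in training
            return code, recon

Training would minimize a reconstruction loss such as nn.MSELoss()(recon, x); at search time only the 512-value code would be kept for nearest-neighbor lookup.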
Ravi teaches at Paragraph 0050 that the diffusion prior image editing system 102 utilizes the diffusion prior neural network 316 to analyze the base image embedding 306 and the edit text embedding 312 to generate the text-edited image embedding 318. The text-edited image embedding 318 can include a combination of the trained text-image encoder 304 and the base image embedding 306 according to the learned parameters of the diffusion prior neural network 316. Ravi teaches at Paragraph 0062 that FIG. 4 illustrates the diffusion prior image editing system 102 utilizing a set of steps 412 of a diffusion prior neural network 416 to generate a text-edited image embedding 406 from a base image embedding 402 and edit text embedding 404. It is known from FIG. 3 that the text-edited image embedding 318 has the same dimension as the base image embedding 306. Ravi teaches at Paragraph [0124] that the diffusion prior image editing system 102 also includes the embedding manager 1106. In particular, the embedding manager 1106 can generate, encode, and/or create embeddings from inputs. For example, as described above, the embedding manager 1106 can generate base image embeddings from base digital images. Similarly, the embedding manager 1106 can also generate edit text embeddings from edit text. Ravi teaches at Paragraph 0126 that the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational autoencoder). Ravi teaches at Paragraph 0142 receiving a conceptual edit strength parameter based on user interaction with the conceptual edit controller; determining a conceptual editing step based on the conceptual edit strength parameter; generating, utilizing a diffusion prior neural network, a text-edited image embedding by utilizing a base image embedding of the base digital image and an edit text embedding from the edit text according to the conceptual editing step; and generating a modified digital image from the text-edited image embedding. Ravi teaches at Paragraph [0143] that, in one or more embodiments, generating the modified digital image from the text-edited image embedding comprises generating, utilizing a diffusion neural network, the modified digital image from the text-edited image embedding and the base image embedding. Further, in one or more implementations, generating, utilizing a diffusion prior neural network, the text-edited image embedding comprises injecting the base image embedding at the conceptual editing step of the diffusion prior neural network. Moreover, in some implementations, generating, utilizing a diffusion prior neural network, the text-edited image embedding comprises conditioning a set of steps of the diffusion prior neural network after the conceptual editing step utilizing the edit text embedding. Ravi thus teaches that, using the user-controlled conceptual editing parameters, second and third text-edited image embeddings can be generated. Tanjim teaches at Paragraph 0089 that the image encoder 620 generates an image embedding based on input image 610. Tanjim teaches at Paragraph 0090-0091 that image generation model 615 is a diffusion-based image generation model and includes a variational autoencoder. Tanjim teaches at Paragraph 0089 that the image generation model 615 generates a preliminary latent code 635 based on text prompt 605 and input image 610.
The preliminary latent code 635 includes the text embedding and the image embedding. Aggarwal ‘144 teaches at FIG. 5 and Paragraph 0078-0081 that the color encoder generates a color embedding (image embedding) having a predetermined size that matches that of the text embedding or the image embedding of the input image. Aggarwal ‘144 teaches at Paragraph 0037 that a color embedding is concatenated with a text embedding and the concatenated embedding is used to predict an image embedding (e.g., image CLIP embedding) and at Paragraph 0105 that at operation 810, the system generates image embeddings based on the text prompt and the color prompt and the machine learning model extracts color embeddings/histograms of the training samples with the same dimensions as the text embedding and image CLIP embedding. Aggarwal ‘144 teaches at Paragraph 0022 that the diffusion prior model maps a text embedding of the text prompt (e.g., text CLIP embedding using a multi-modal encoder such as CLIP model) to an image embedding (e.g., image CLIP embedding) and at Paragraph 0092 that diffusion prior model 615 generates image embedding(s) 620 based on the text embedding and the color embedding. Aggarwal ‘144 teaches at FIG. 5 and Paragraph 0082 that diffusion prior model 525 receives color embedding, token embeddings, and text embedding from multi-modal encoder 515 as input and generates an image embedding based on the text embedding, token embedding and the color embedding. These embeddings are concatenated. Aggarwal ‘144 teaches at Paragraph 0121 that the image embeddings and the text embeddings have a dimension size 768).

Re Claim 11: The claim 11 encompasses the same scope of invention as that of the claim 10 except the additional claim limitation that generating the first embedding occurs in batch prior to generating the second embedding. Aggarwal ‘144 and Tanjim further teach the claim limitation that generating the first embedding occurs in batch prior to generating the second embedding (Aggarwal ‘144 teaches at Paragraph [0141] that the machine learning model extracts color embeddings/histograms of the training samples with the same dimensions as the text embedding and image CLIP embedding and at Paragraph 0091 that diffusion prior model 615 scores and ranks the set of image embeddings by comparing each image embedding of image embeddings 620 to text embedding 610. Aggarwal ‘144 teaches at FIG. 5 and Paragraph 0078-0081 that the color encoder generates a color embedding (image embedding) having a predetermined size that matches that of the text embedding or the image embedding of the input image. Aggarwal ‘144 teaches at Paragraph 0037 that a color embedding is concatenated with a text embedding and the concatenated embedding is used to predict an image embedding (e.g., image CLIP embedding) and at Paragraph 0105 that at operation 810, the system generates image embeddings based on the text prompt and the color prompt and the machine learning model extracts color embeddings/histograms of the training samples with the same dimensions as the text embedding and image CLIP embedding. Tanjim teaches at Paragraph [0091] that an autoencoder is a type of ANN used to learn efficient data encoding in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, e.g., for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side may also be learned.
The reconstructing network tries to generate, from the reduced encoding, a representation as close as possible to the original input. Several variants exist to the basic model, with the aim of forcing the learned representations of the input to assume useful properties).

Re Claim 12: The claim 12 encompasses the same scope of invention as that of the claim 1 except the additional claim limitation that generating the second embedding comprises: providing (i) the first embedding and (ii) the modification to the input image to a reverse diffusion model, wherein the reverse diffusion model generates the second embedding. Ravi and Tanjim further teach the claim limitation that generating the second embedding comprises: providing (i) the first embedding and (ii) the modification to the input image to a reverse diffusion model, wherein the reverse diffusion model generates the second embedding (Ravi teaches at Paragraph [0054] that the diffusion noising model 322 can include a variety of computer implemented models or architectures. For example, in some embodiments the diffusion noising model 322 includes a reverse diffusion neural network. As described above, a diffusion neural network can iteratively denoise a noise map to generate a digital image. A reverse diffusion neural network utilizes a neural network to predict noise that, when analyzed by a diffusion neural network, will result in a particular (e.g., deterministic) digital image. Thus, a reverse diffusion neural network includes a neural network that iteratively adds noise to an input signal that will reflect a deterministic outcome or result when processed through denoising layers of a diffusion neural network. The diffusion prior image editing system 102 can utilize a variety of reverse diffusion neural networks. For example, in one or more implementations, the diffusion prior image editing system 102 utilizes the architecture described by Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020 (hereinafter Reverse DDIM), which is incorporated herein by reference in its entirety. Ravi teaches at Paragraph [0055] that, in addition to a reverse diffusion neural network, the diffusion prior image editing system 102 can also utilize other architectures for the diffusion noising model 322. For example, in some implementations the diffusion prior image editing system 102 can utilize a diffusion model that iteratively adds noise to an input signal utilizing a stochastic or other statistical process. To illustrate, in some embodiments the diffusion prior image editing system 102 utilizes a diffusion noising model as described by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon, Sdedit: Guided image synthesis and editing with stochastic differential equations, 2021. Ravi teaches at Paragraph [0125] that, as further illustrated in FIG. 11, the diffusion prior image editing system 102 includes the diffusion prior conceptual editing engine 1108. In particular, the diffusion prior conceptual editing engine 1108 can perform conceptual editing processes utilizing a diffusion prior neural network (e.g., by creating, generating, or encoding a text-edited image embedding). For example, as described above, the diffusion prior conceptual editing engine 1108 can select a conceptual editing step for injecting a base image embedding and condition denoising steps of a diffusion prior neural network on edit text embeddings.
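Ravi's conceptual editing step, cited throughout this action (Paragraphs 0125 and 0142), amounts to mapping a user-facing edit strength to how early in the diffusion prior's schedule the base image embedding is injected. A hedged sketch of one such mapping; the linear rule and the names are assumptions for illustration, not Ravi's published code:

    def conceptual_editing_step(edit_strength, total_steps):
        """Map an edit strength in [0, 1] to an injection step index.

        Low strength injects the base image embedding late, preserving the
        image; high strength injects early, so more of the remaining
        denoising steps are conditioned on the edit text embedding.
        """
        if not 0.0 <= edit_strength <= 1.0:
            raise ValueError("edit_strength must be in [0, 1]")
        return round((1.0 - edit_strength) * (total_steps - 1))

    # Example: with 50 prior steps, strength 0.8 injects at step 10, and
    # steps 10 through 49 are conditioned on the edit text embedding.
    step = conceptual_editing_step(0.8, 50)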
Ravi teaches at Paragraph [0126] that, additionally, the diffusion prior image editing system 102 includes the diffusion structural editing engine 1110. In particular, the diffusion structural editing engine 1110 can perform structural editing processes utilizing a diffusion neural network. For example, as described above, the diffusion structural editing engine 1110 can select a structural transition step and/or a structural number and utilize a diffusion noising model and/or diffusion neural network to generate a modified digital image. In one or more implementations, the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational autoencoder). Ravi teaches at Paragraph 0043 that a diffusion prior neural network generates an image embedding (e.g., a CLIP image embedding) from random noise, conditioned on a text embedding (e.g., a CLIP text embedding). Ravi teaches at Paragraph 0050 that the diffusion prior image editing system 102 utilizes the diffusion prior neural network 316 to analyze the base image embedding 306 and the edit text embedding 312 to generate the text-edited image embedding 318. The text-edited image embedding 318 can include a combination of the trained text-image encoder 304 and the base image embedding 306 according to the learned parameters of the diffusion prior neural network 316. Ravi teaches at Paragraph 0062 that FIG. 4 illustrates the diffusion prior image editing system 102 utilizing a set of steps 412 of a diffusion prior neural network 416 to generate a text-edited image embedding 406 from a base image embedding 402 and edit text embedding 404. It is known from FIG. 3 that the text-edited image embedding 318 has the same dimension as the base image embedding 306. Ravi teaches at Paragraph [0124] that the diffusion prior image editing system 102 also includes the embedding manager 1106. In particular, the embedding manager 1106 can generate, encode, and/or create embeddings from inputs. For example, as described above, the embedding manager 1106 can generate base image embeddings from base digital images. Similarly, the embedding manager 1106 can also generate edit text embeddings from edit text. Ravi teaches at Paragraph 0126 that the diffusion structural editing engine 1110 utilizes a diffusion neural network to generate a latent representation that is then converted to the modified digital image (e.g., utilizing a neural network such as a variational autoencoder). Ravi teaches at Paragraph 0142 receiving a conceptual edit strength parameter based on user interaction with the conceptual edit controller; determining a conceptual editing step based on the conceptual edit strength parameter; generating, utilizing a diffusion prior neural network, a text-edited image embedding by utilizing a base image embedding of the base digital image and an edit text embedding from the edit text according to the conceptual editing step; and generating a modified digital image from the text-edited image embedding. Ravi teaches at Paragraph [0143] that, in one or more embodiments, generating the modified digital image from the text-edited image embedding comprises generating, utilizing a diffusion neural network, the modified digital image from the text-edited image embedding and the base image embedding.
Further, in one or more implementations, generating, utilizing a diffusion prior neural network, the text-edited image embedding comprises injecting the base image embedding at the conceptual editing step of the diffusion prior neural network. Moreover, in some implementations, generating, utilizing a diffusion prior neural network, the text-edited image embedding comprises conditioning a set of steps of the diffusion prior neural network after the conceptual editing step utilizing the edit text embedding. Ravi thus teaches that, using the user-controlled conceptual editing parameters, second and third text-edited image embeddings can be generated. Tanjim teaches at Paragraph [0067] that a reverse diffusion process 340 (e.g., a U-Net ANN) gradually removes the noise from the noisy features 335 at the various noise levels to obtain denoised image features 345 in latent space 325. In some examples, the denoised image features 345 are compared to the original image features 320 at each of the various noise levels, and parameters of the reverse diffusion process 340 of the diffusion model are updated based on the comparison. Finally, an image decoder 350 decodes the denoised image features 345 to obtain an output image 355 in pixel space 310. In some cases, an output image 355 is created at each of the various noise levels. The output image 355 can be compared to the original image 305 to train the reverse diffusion process 340. Tanjim teaches at Paragraph [0152] that the training component trains a decoder-only Transformer with a causal attention mask on a sequence including the CLIP text embedding, an embedding for the diffusion timestep and a final embedding whose output from the Transformer is used to predict the unnoised (or denoised) CLIP image embedding. This is implemented using a U-Net architecture. The diffusion prior model is trained to predict the unnoised (or denoised) image embedding and is based on using a mean-squared error loss on this prediction. Tanjim teaches at Paragraph [0153] that, during inference time, the diffusion prior model samples k different image embeddings and picks the image embedding that has a high similarity score with respect to the text embedding of the text prompt.).

Re Claim 13: The claim 13 encompasses the same scope of invention as that of the claim 12 except the additional claim limitation that the reverse diffusion model includes U-Net architecture. Ravi and Tanjim further teach the claim limitation that the reverse diffusion model includes U-Net architecture (Tanjim teaches at Paragraph [0067] that a reverse diffusion process 340 (e.g., a U-Net ANN) gradually removes the noise from the noisy features 335 at the various noise levels to obtain denoised image features 345 in latent space 325. In some examples, the denoised image features 345 are compared to the original image features 320 at each of the various noise levels, and parameters of the reverse diffusion process 340 of the diffusion model are updated based on the comparison. Finally, an image decoder 350 decodes the denoised image features 345 to obtain an output image 355 in pixel space 310. In some cases, an output image 355 is created at each of the various noise levels. The output image 355 can be compared to the original image 305 to train the reverse diffusion process 340.
Tanjim teaches at Paragraph [0152] that the training component trains a decoder-only Transformer with a causal attention mask on a sequence including the CLIP text embedding, an embedding for the diffusion timestep and a final embedding whose output from the Transformer is used to predict the unnoised (or denoised) CLIP image embedding. This is implemented using a U-Net architecture. The diffusion prior model is trained to predict the unnoised (or denoised) image embedding and is based on using a mean-squared error loss on this prediction. Tanjim teaches at Paragraph [0153] that, during inference time, the diffusion prior model samples k different image embeddings and picks the image embedding that has a high similarity score with respect to the text embedding of the text prompt. Ravi teaches at Paragraph [0054] that the diffusion noising model 322 can include a variety of computer implemented models or architectures. For example, in some embodiments the diffusion noising model 322 includes a reverse diffusion neural network. As described above, a diffusion neural network can iteratively denoise a noise map to generate a digital image. A reverse diffusion neural network utilizes a neural network to predict noise that, when analyzed by a diffusion neural network, will result in a particular (e.g., deterministic) digital image. Thus, a reverse diffusion neural network includes a neural network that iteratively adds noise to an input signal that will reflect a deterministic outcome or result when processed through denoising layers of a diffusion neural network. The diffusion prior image editing system 102 can utilize a variety of reverse diffusion neural networks. For example, in one or more implementations, the diffusion prior image editing system 102 utilizes the architecture described by Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020 (hereinafter Reverse DDIM), which is incorporated herein by reference in its entirety. Ravi teaches at Paragraph [0055] that, in addition to a reverse diffusion neural network, the diffusion prior image editing system 102 can also utilize other architectures for the diffusion noising model 322. For example, in some implementations the diffusion prior image editing system 102 can utilize a diffusion model that iteratively adds noise to an input signal utilizing a stochastic or other statistical process. To illustrate, in some embodiments the diffusion prior image editing system 102 utilizes a diffusion noising model as described by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon, Sdedit: Guided image synthesis and editing with stochastic differential equations, 2021).
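For claims 12 and 13, both references describe the same core loop: a U-Net predicts the noise present in the current latent, and the latent is stepped toward the clean signal. A simplified DDIM-style denoising sketch, assuming unet is any noise-prediction network and alphas_cumprod is the usual cumulative noise schedule; this is an illustration, not either reference's implementation:

    import torch

    @torch.no_grad()
    def ddim_denoise(unet, x_t, timesteps, alphas_cumprod):
        # timesteps runs from high noise to low, e.g. [999, 979, ..., 0].
        for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
            a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
            eps = unet(x_t, t)                                    # U-Net noise estimate
            x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
            x_t = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
        return x_t

Running the same update with the timestep list ascending is the inversion trick Ravi attributes to Reverse DDIM: it adds noise deterministically, so the forward loop can later remove it and recover a particular image.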
Re Claim 14: The claim 14 recites a system comprising one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: generating a first embedding that represents an input image using a first encoder, wherein a dimension of the first embedding matches a first dimension; generating, using the first embedding, a second embedding that represents (i) the input image and (ii) a modification to the input image, wherein a dimension of the second embedding matches the first dimension; generating, using the second embedding, a third embedding that represents (i) the input image and (ii) the modification to the input image using a second encoder; and identifying, using the third embedding, a set of one or more other images that correspond to a modified image represented by the third embedding. The claim 14 is in parallel with the claim 1 in the form of an apparatus claim. The claim 14 is subject to the same rationale of rejection as the claim 1. Moreover, Tanjim in view of Aggarwal ‘144 further teach the claim limitation of a system comprising one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations [of the claim 1] (Aggarwal ‘144 teaches at Paragraph [0046] that processor unit 205 is an intelligent hardware device (e.g., a general-purpose processing component, a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic device, a discrete gate or transistor logic component, a discrete hardware component, or any combination thereof). In some cases, processor unit 205 is configured to operate a memory array using a memory controller. In other cases, a memory controller is integrated into the processor. In some cases, processor unit 205 is configured to execute computer-readable instructions stored in a memory to perform various functions. In some embodiments, processor unit 205 includes special-purpose components for modem processing, baseband processing, digital signal processing, or transmission processing. Processor unit 205 is an example of, or includes aspects of, the processor described with reference to FIG. 15. Tanjim teaches at Paragraph [0153] that the described methods may be implemented or performed by devices that include a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, a conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration). Thus, the functions described herein may be implemented in hardware or software and may be executed by a processor, firmware, or any combination thereof.
If implemented in software executed by a processor, the functions may be stored in the form of instructions or code on a computer-readable medium).

Re Claim 15: The claim 15 encompasses the same scope of invention as that of the claim 1 except the additional claim limitation that the modification input is text provided by a user of an input device. The claim 15 is in parallel with the claim 2 in the form of an apparatus claim. The claim 15 is subject to the same rationale of rejection as the claim 2.

Re Claim 18: The claim 18 encompasses the same scope of invention as that of the claim 14 except the additional claim limitation that the first encoder and the second encoder are autoencoders. The claim 18 is in parallel with the claim 5 in the form of an apparatus claim. The claim 18 is subject to the same rationale of rejection as the claim 5.

Re Claim 19: The claim 19 encompasses the same scope of invention as that of the claim 14 except the additional claim limitation that identifying the set of one or more images that are different from the input image comprises: performing one or more operations of an approximate nearest neighbor (ANN) algorithm. The claim 19 is in parallel with the claim 6 in the form of an apparatus claim. The claim 19 is subject to the same rationale of rejection as the claim 6.

Re Claim 20: The claim 20 recites one or more computer storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising: generating a first embedding that represents an input image using a first encoder, wherein a dimension of the first embedding matches a first dimension; generating, using the first embedding, a second embedding that represents (i) the input image and (ii) a modification to the input image, wherein a dimension of the second embedding matches the first dimension; generating, using the second embedding, a third embedding that represents (i) the input image and (ii) the modification to the input image using a second encoder; and identifying, using the third embedding, a set of one or more other images that correspond to a modified image represented by the third embedding. The claim 20 is in parallel with the claim 1 in the form of a computer program product. The claim 20 is subject to the same rationale of rejection as the claim 1. Moreover, Tanjim in view of Aggarwal ‘144 further teach the claim limitation of one or more computer storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations [of the claim 1] (Aggarwal ‘144 teaches at Paragraph [0046] that processor unit 205 is an intelligent hardware device (e.g., a general-purpose processing component, a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic device, a discrete gate or transistor logic component, a discrete hardware component, or any combination thereof). In some cases, processor unit 205 is configured to operate a memory array using a memory controller. In other cases, a memory controller is integrated into the processor. In some cases, processor unit 205 is configured to execute computer-readable instructions stored in a memory to perform various functions.
In some embodiments, processor unit 205 includes special-purpose components for modem processing, baseband processing, digital signal processing, or transmission processing. Processor unit 205 is an example of, or includes aspects of, the processor described with reference to FIG. 15. Tanjim teaches at Paragraph [0153] that the described methods may be implemented or performed by devices that include a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, a conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration). Thus, the functions described herein may be implemented in hardware or software and may be executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored in the form of instructions or code on a computer-readable medium).

Claims 3, 4, 16 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Tanjim et al. US-PGPUB No. 2025/0225683 (hereinafter Tanjim) in view of Aggarwal et al. US-PGPUB No. 2024/0404144 (hereinafter Aggarwal ‘144); Ravi et al. US-PGPUB No. 2024/0362842 (hereinafter Ravi); Song et al. US-PGPUB No. 2025/0022099 (hereinafter Song); Aggarwal et al. US-PGPUB No. 2025/0278816 (hereinafter Aggarwal ‘816); and Zhang et al. US-PGPUB No. 2023/0245418 (hereinafter Zhang).

Re Claim 3: The claim 3 encompasses the same scope of invention as that of the claim 1 except the additional claim limitation that the input image represents a product or service listing on an e-commerce platform. Zhang teaches the claim limitation that the input image represents a product or service listing on an e-commerce platform (Zhang teaches at Paragraph 0084 that in the field of e-commerce, images from different categories of products carry different information. A garment image in the fashion field generally carries more useful information than an electronics image. Zhang teaches at Paragraph 0065 that the customer can use text, image or the combination of text and image as a query to search one or multiple products and the product database 144 includes information of the product, such as title, description, main image and at Paragraph 0114 that the product search application 118 only needs to run the query text feature module 120, the query text embedding module 122, the query transformer 128, the relevance module 140, and the user interface 142. In certain embodiments, the product search application 118 may further need to run the query image feature module 124 and the query image embedding module 126 when the query also includes an image).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Zhang’s teaching of querying multiple products from the product database of a product search application using a text prompt and an image prompt, thereby providing a list of products on an e-commerce platform, into the digital media management system of Ravi, Tanjim and Aggarwal ‘144, so that the input image is a specific product image and the modified images are presented as modified product images in response to a search query for product images. One of ordinary skill in the art would have been motivated to modify the input image of Ravi, Tanjim and Aggarwal ‘144 to be a product image such that the modified product image can be used to search for a set of similar products in the product database of an e-commerce platform.

Re Claim 4: The claim 4 encompasses the same scope of invention as that of the claim 1 except the additional claim limitation that identifying the set of one or more images that are different from the input image comprises: identifying product listings, wherein each of the set of one or more images represents one or more of the product listings. Zhang teaches the claim limitation that identifying the set of one or more images that are different from the input image comprises: identifying product listings, wherein each of the set of one or more images represents one or more of the product listings (Zhang teaches at Paragraph 0084 that in the field of e-commerce, images from different categories of products carry different information. A garment image in the fashion field generally carries more useful information than an electronics image. Zhang teaches at Paragraph 0065 that the customer can use text, image or the combination of text and image as a query to search one or multiple products and the product database 144 includes information of the product, such as title, description, main image and at Paragraph 0114 that the product search application 118 only needs to run the query text feature module 120, the query text embedding module 122, the query transformer 128, the relevance module 140, and the user interface 142. In certain embodiments, the product search application 118 may further need to run the query image feature module 124 and the query image embedding module 126 when the query also includes an image). It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Zhang’s teaching of querying multiple products from the product database of a product search application using a text prompt and an image prompt, thereby providing a list of products on an e-commerce platform, into the digital media management system of Ravi, Tanjim and Aggarwal ‘144, so that the input image is a specific product image and the modified images are presented as modified product images in response to a search query for product images. One of ordinary skill in the art would have been motivated to modify the input image of Ravi, Tanjim and Aggarwal ‘144 to be a product image such that the modified product image can be used to search for a set of similar products in the product database of an e-commerce platform.
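Zhang's text-plus-image product query, which the examiner combines with the other references for claims 3 and 4, reduces to embedding both query modalities and retrieving the nearest product-listing embeddings. A minimal sketch under that reading; the simple averaging of the two query embeddings is an assumption for illustration (Zhang instead learns the fusion with a query transformer):

    import numpy as np

    def search_products(text_emb, image_emb, catalog_embs, top_k=5):
        # Fuse the two query embeddings; a learned fusion would go here.
        q = (text_emb + image_emb) / 2.0
        q = q / np.linalg.norm(q)
        catalog = catalog_embs / np.linalg.norm(catalog_embs, axis=1, keepdims=True)
        # Indices of the top-k product listings by cosine similarity.
        return np.argsort(-(catalog @ q))[:top_k]

In the claimed method, the catalog embeddings would be precomputed in batch from the product-listing images and the query would be the third embedding representing the modified image.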
Re Claim 16: The claim 16 encompasses the same scope of invention as that of the claim 14 except the additional claim limitation that the input image represents a product or service listing on an e-commerce platform. The claim 16 is in parallel with the claim 3 in the form of an apparatus claim. The claim 16 is subject to the same rationale of rejection as the claim 3.

Re Claim 17: The claim 17 encompasses the same scope of invention as that of the claim 14 except the additional claim limitation that identifying the set of one or more images that are different from the input image comprises: identifying product listings, wherein each of the set of one or more images represents one or more of the product listings. The claim 17 is in parallel with the claim 4 in the form of an apparatus claim. The claim 17 is subject to the same rationale of rejection as the claim 4.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JIN CHENG WANG whose telephone number is (571)272-7665. The examiner can normally be reached Mon-Fri 8:00-5:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Poon, can be reached at 571-270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JIN CHENG WANG/
Primary Examiner, Art Unit 2617

Prosecution Timeline

Mar 11, 2024
Application Filed
Sep 10, 2025
Non-Final Rejection — §103
Jan 15, 2026
Response Filed
Mar 24, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12594883
DISPLAY DEVICE FOR DISPLAYING PATHS OF A VEHICLE
2y 5m to grant Granted Apr 07, 2026
Patent 12597086
Tile Region Protection in a Graphics Processing System
2y 5m to grant Granted Apr 07, 2026
Patent 12592012
METHOD, APPARATUS, ELECTRONIC DEVICE AND READABLE MEDIUM FOR COLLAGE MAKING
2y 5m to grant Granted Mar 31, 2026
Patent 12586270
GENERATING AND MODIFYING DIGITAL IMAGES USING A JOINT FEATURE STYLE LATENT SPACE OF A GENERATIVE NEURAL NETWORK
2y 5m to grant Granted Mar 24, 2026
Patent 12579709
IMAGE SPECIAL EFFECT PROCESSING METHOD AND APPARATUS
2y 5m to grant Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
59%
Grant Probability
69%
With Interview (+10.3%)
3y 7m
Median Time to Grant
Moderate
PTA Risk
Based on 832 resolved cases by this examiner. Grant probability derived from career allow rate.
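As a rough cross-check on these projections, and assuming the interview lift is additive in percentage points: a 59% base grant probability plus the 10.3-point lift gives roughly 69%, matching the With Interview figure above.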
