Prosecution Insights
Last updated: April 19, 2026
Application No. 18/439,036

TEXT-BASED IMAGE GENERATION USING AN IMAGE-TRAINED TEXT ENCODER

Status: Final Rejection (§103)
Filed: Feb 12, 2024
Examiner: YICK, JORDAN WAN
Art Unit: 2612
Tech Center: 2600 — Communications
Assignee: Adobe Inc.
OA Round: 2 (Final)

Grant Probability: 95% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 6m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 95%, above average (18 granted / 19 resolved; +32.7% vs TC avg)
Interview Lift: +7.7% (moderate lift, measured over resolved cases with interview)
Typical Timeline: 2y 6m average prosecution; 17 applications currently pending
Career History: 36 total applications across all art units

Statute-Specific Performance

§101: 12.4% (-27.6% vs TC avg)
§103: 64.2% (+24.2% vs TC avg)
§102: 8.0% (-32.0% vs TC avg)
§112: 15.3% (-24.7% vs TC avg)

Tech Center averages are estimates. Based on career data from 19 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims

2. Claims 1, 9, and 14 are amended.

3. Claims 2-5, 7-8, 10-13, 15-17, and 19-20 are as previously presented.

4. Claims 6 and 18 have been cancelled.

Claim Rejections - 35 USC § 103

5. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

6. The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

    A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

7. Claims 1-3 are rejected under 35 U.S.C. 103 as being unpatentable over Liu (US-20240282016-A1), hereinafter Liu, in view of Gafni (US-20240221235-A1), hereinafter Gafni, and in view of Yu (US-20240185035-A1), hereinafter Yu.

Regarding claim 1, Liu teaches a method for training a machine learning model, comprising: generating, using an image generation model, a provisional image based on a provisional text embedding, wherein the provisional text embedding is generated based on the text prompt (Fig. 3, paragraphs 42-43, generating a set of predicted synthesized images based on a set of training feature vectors, wherein the set of training feature vectors is generated by a text encoder and is interpreted as provisional text embeddings, and the text encoder is fed image captions of training images); and training a text encoder to generate text embeddings as input for generating images with the image generation model based on the provisional image and the ground-truth image (Fig. 3, paragraphs 43-44, wherein the text encoder generates training feature vectors, which are interpreted as text embeddings used to generate synthesized images, and is trained based on the differences between a predicted synthesized image, interpreted as a provisional image, and the training image, interpreted as the ground-truth image), wherein the text encoder is trained jointly with the image generation model (Fig. 3, paragraphs 43-44, wherein the diffusion model that generates images and the text encoder can be trained together, which is interpreted as jointly training the image generation model and the text encoder).

Liu does not teach obtaining training data including a ground-truth image and a text prompt of the ground-truth image, or wherein the text encoder is pre-trained. Gafni teaches obtaining training data including a ground-truth image and a text prompt of the ground-truth image (Fig. 2, paragraph 31, machine-learning model has training data comprising text and a ground-truth image, wherein the input text is interpreted as a text prompt of a respective ground-truth image).
Neither Liu nor Gafni teaches wherein the text encoder is pre-trained. Yu teaches pre-training the text encoder (Fig. 1, paragraph 19, wherein a pre-trained language model replaces and acts as the text encoder in the text-to-image framework, which is interpreted as the text encoder being pre-trained).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Gafni and Yu for this method of training a model for generating images. Liu, Gafni, and Yu all discuss machine learning models trained for generating images based on text prompts, each including a text encoder to encode text input as part of the image generation process. Liu discusses training the machine learning model as a whole based on a set of training images. Gafni also discusses training the machine learning model based on both a set of training images and a set of associated text inputs. Similarly, Yu discusses training the machine learning model by feeding it text captions and backpropagating based on the generated image to minimize loss. As all three references discuss analogous art and non-conflicting ways of training machine learning models for the purposes of image generation, it would have been obvious to combine them.

Regarding claim 2, Liu in view of Gafni and Yu discloses the method of claim 1. Additionally, Liu teaches training the image generation model to generate images based on the provisional image (Fig. 3, paragraphs 43-44, wherein the diffusion model that generates images is trained based on predicted synthesized images, which are interpreted as provisional images).

Regarding claim 3, Liu in view of Gafni and Yu discloses the method of claim 1. Additionally, Liu teaches computing an image generation loss based on the provisional image and the ground-truth image, wherein the text encoder and the image generation model are trained based on the image generation loss (Fig. 3, paragraphs 43-44, wherein the diffusion model that generates images and the text encoder are trained by minimizing a loss function between the predicted synthesized images and the training images, which is interpreted as training based on the image generation loss).
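To make the training scheme at issue in claims 1-3 concrete, here is a minimal PyTorch-style sketch of joint training: a single optimizer updates both a (pre-trained) text encoder and an image generation model from one image generation loss computed between the provisional image and the ground-truth image. The module and variable names are illustrative assumptions, not taken from the application or the cited references, and the tiny modules stand in for a real transformer encoder and diffusion model.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the real components.
text_encoder = nn.Embedding(10_000, 256)       # assume pre-trained weights are loaded here
image_generator = nn.Linear(256, 3 * 64 * 64)  # maps a text embedding to a flattened image

# One optimizer over BOTH parameter sets = "trained jointly".
params = list(text_encoder.parameters()) + list(image_generator.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

def training_step(token_ids, ground_truth_image):
    # 1. Encode the text prompt into a provisional text embedding.
    text_embedding = text_encoder(token_ids).mean(dim=1)
    # 2. Generate a provisional image from that embedding.
    provisional_image = image_generator(text_embedding)
    # 3. Image generation loss between provisional and ground-truth image.
    loss = nn.functional.mse_loss(
        provisional_image, ground_truth_image.flatten(start_dim=1))
    # 4. Backpropagation reaches the text encoder too, so one loss
    #    trains encoder and generator together (claims 1 and 3).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = training_step(torch.randint(0, 10_000, (2, 8)), torch.rand(2, 3, 64, 64))
```

A real diffusion model would predict noise residuals rather than raw pixels; the only point of the sketch is that a single loss drives both parameter sets.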
8. Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Gafni and Yu as applied to claim 1 above, and further in view of Benedetto (US-20240264718-A1), hereinafter Benedetto.

Regarding claim 4, Liu in view of Gafni and Yu discloses the method of claim 1. Additionally, Benedetto teaches obtaining a complex text prompt describing a plurality of objects and a relationship between the objects (Fig. 2A, paragraph 42, wherein taking in a multiple-sentence text description as input that can describe a scene or characteristics is interpreted as a complex text prompt that can include describing a plurality of objects and their relationships), wherein the provisional text embedding represents the complex text prompt and the provisional image depicts the plurality of objects and the relationship between the objects (Fig. 2A, paragraphs 42-44, wherein input text is encoded into latent space data, which is interpreted as a provisional text embedding, and the preprocessed output image is interpreted as a provisional image).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu in view of Gafni to incorporate the teachings of Benedetto for this method of training a model for generating images. All four references discuss a diffusion-based machine-learning model for the purposes of generating images, and each includes an encoder that can encode input text into an embedding representing that input. Each reference additionally discusses methods of training its machine-learning model using training data including text prompts and captions. As all four references discuss analogous art and discuss encoding text inputs as part of a machine-learning model for generating images, it would have been obvious to combine them.

9. Claims 5 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Gafni and Yu as applied to claim 1 above, and further in view of Pham (US-20230154161-A1), hereinafter Pham.

Regarding claim 5, Liu in view of Gafni and Yu discloses the method of claim 1. Additionally, Pham teaches fixing parameters of the text encoder during a first training phase (Fig. 2, paragraphs 47, 56, wherein during training a first forward pass is performed through the text encoder, which is interpreted as a first training phase, and is performed according to the current neural network parameters, which suggests that the parameters are fixed in the first training phase), wherein the text encoder is trained during a second training phase (Fig. 4, paragraphs 79-83, wherein performing a second forward pass and backward pass for training a text encoder is interpreted as a second training phase).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu in view of Gafni to incorporate the teachings of Pham for this method of training a model for generating images. Both Liu and Gafni discuss machine learning models trained for generating images based on text prompts, and both include a text and image encoder to encode image and text input as part of the image generation process. While Yu does not discuss an image encoder, it still discusses a text encoder as part of its image generation process. Liu discusses training the machine learning model as a whole based on a set of training images. Gafni and Yu also discuss training the machine learning model based on a set of associated text inputs. Pham, on the other hand, teaches a neural network system for training image and text encoders through contrastive learning, for the purposes of improving the performance of learning while not utilizing more memory. As Pham discusses a method for using a neural network to train a text and image encoder, and Liu, Gafni, and Yu all discuss image generation models that include a trained text and image encoder, it would have been obvious to combine them.

Regarding claim 7, Liu in view of Gafni and Yu discloses the method of claim 1. Additionally, Pham teaches identifying a first subset of parameters of the text encoder and a second subset of parameters of the text encoder, wherein the first subset of parameters is updated based on the training and the second subset of parameters is fixed during the training (paragraph 108, wherein the neural network parameters for training a text encoder include fixed constants, which suggests that the fixed constants are a second, fixed subset of parameters, while the rest of the network parameters are updated based on training).
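The two-phase schedule in claim 5 and the parameter split in claim 7 both come down to toggling which parameters receive gradients. Here is a minimal sketch, assuming a PyTorch text encoder built from an embedding table and a projection layer (both names, and the choice of which layers to freeze, are hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical text encoder: an embedding table plus a projection layer.
text_encoder = nn.Sequential(nn.Embedding(10_000, 256), nn.Linear(256, 256))

# Phase 1 (claim 5): fix ALL text-encoder parameters; only the image
# generation model would be updated during this phase.
for p in text_encoder.parameters():
    p.requires_grad = False

# Phase 2 (claim 5): unfreeze the text encoder so it is trained as well.
for p in text_encoder.parameters():
    p.requires_grad = True

# Claim 7 variant: update only a first subset of parameters while a
# second subset stays fixed. The split chosen here is illustrative.
embedding, projection = text_encoder[0], text_encoder[1]
for p in embedding.parameters():
    p.requires_grad = False   # second subset: fixed during training
for p in projection.parameters():
    p.requires_grad = True    # first subset: updated during training

# Hand the optimizer only the trainable subset.
optimizer = torch.optim.Adam(
    [p for p in text_encoder.parameters() if p.requires_grad], lr=1e-5)
```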
10. Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Gafni and Yu as applied to claim 1 above, and further in view of Zhou (US-20220130499-A1), hereinafter Zhou.

Regarding claim 8, Liu in view of Gafni and Yu discloses the method of claim 1. Additionally, Zhou teaches training an additional encoder for a modality other than text based on the provisional image (Fig. 1, paragraphs 26-27, training a multi-modal encoder; paragraphs 21-22, wherein the neural network is trained on features of an input training image, which is interpreted as including provisional images).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu in view of Gafni to incorporate the teachings of Zhou for this method of training a model for generating an image. Both Liu and Gafni discuss machine learning models trained for generating images based on text prompts, and both include a text and image encoder to encode image and text input as part of the image generation process. While Yu does not discuss an image encoder, it still discusses a text encoder as part of its image generation process. Zhou discusses training a multi-modal encoder to encode a joint representation of an image and text data, for the purposes of better mapping relationships between the features of an image and the text and generating a more accurate embedding representing the relationship between the image and text. It would have been obvious to modify the image generation models of Liu, Gafni, and Yu to utilize the multi-modal encoder described in Zhou for the purposes of generating embeddings that represent their inputs, and the relationships between the inputs and outputs, more accurately.
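Zhou's multi-modal encoder, as characterized in the rejection, projects each modality into one shared embedding space so image-text relationships can be compared directly. The sketch below is a generic CLIP-style construction under that assumption; the class name, layer sizes, and contrastive loss are all illustrative and not drawn from Zhou.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalEncoder(nn.Module):
    """Encodes text and images into one shared embedding space."""

    def __init__(self, vocab_size=10_000, dim=256):
        super().__init__()
        self.text_branch = nn.Embedding(vocab_size, dim)
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, dim),
        )

    def forward(self, token_ids, images):
        text_emb = F.normalize(self.text_branch(token_ids).mean(dim=1), dim=-1)
        image_emb = F.normalize(self.image_branch(images), dim=-1)
        return text_emb, image_emb

# Contrastive-style alignment: matching image/text pairs should score
# highest on the diagonal of the similarity matrix.
encoder = MultiModalEncoder()
text_emb, image_emb = encoder(torch.randint(0, 10_000, (4, 12)),
                              torch.randn(4, 3, 64, 64))
logits = text_emb @ image_emb.T
loss = F.cross_entropy(logits, torch.arange(4))
```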
11. Claims 9-13 are rejected under 35 U.S.C. 103 as being unpatentable over Surya (US-10713821-B1), hereinafter Surya, in view of Liu and Yu.

Regarding claim 9, Surya teaches a method for image generation, comprising: obtaining a text prompt (Fig. 5, Col. 12 lines 1-20, receiving text data describing an object, interpreted as obtaining a text prompt); encoding, using a text encoder, the text prompt to obtain a text embedding (Fig. 5, Col. 12 lines 21-28, generating a text embedding of the text data); and generating, using the image generation model, a synthetic image based on the text embedding (Fig. 5, Col. 12 line 65 - Col. 13 line 10, generating synthetic image data representing the object described in the text data through a GAN based on the conditioning data, wherein the conditioning data is defined as including the text embedding).

Surya does not teach using a text encoder jointly trained with an image generation model, or wherein the text encoder is pre-trained prior to training the text encoder jointly with the image generation model. Liu teaches using a text encoder jointly trained with an image generation model (Fig. 3, paragraphs 43-44, wherein the diffusion model that generates images and the text encoder can be trained together, which is interpreted as jointly trained). Neither Surya nor Liu teaches wherein the text encoder is pre-trained. Yu teaches pre-training the text encoder (Fig. 1, paragraph 19, wherein a pre-trained language model replaces and acts as the text encoder in the text-to-image framework, which is interpreted as the text encoder being pre-trained).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Surya to incorporate the teachings of Liu and Yu for generating an image using a machine learning model. Surya, Liu, and Yu all discuss a machine learning model for generating images, each utilizing encoders to create embeddings as part of the image generation process. Surya in particular describes a context-aware model that iteratively generates images based on modifications to an initial text query. Liu similarly discusses a machine learning model for generating images, as well as methods for training both the image generation model and the text encoders. Furthermore, Yu discusses using a pre-trained text encoder to improve the quality of generated embeddings. As all three references discuss analogous art on generating images using a machine learning model, as well as non-conflicting methods for training the model, it would have been obvious to combine them.

Regarding claim 10, Surya in view of Liu and Yu discloses the method of claim 9. Additionally, Surya teaches generating, using a generative adversarial network (GAN), a high-resolution image based on the synthetic image (Fig. 6, Col. 13 lines 11-54, using a GAN to generate high-resolution image data based on the synthetic image data).

Regarding claim 11, Surya in view of Liu and Yu discloses the method of claim 10. Additionally, Surya teaches wherein the image generation model and the GAN each take the text embedding as input (Fig. 5, Col. 12 line 65 - Col. 13 line 10, generating synthetic image data representing the object described in the text data in a first GAN based on the conditioning data, wherein the conditioning data is defined as including the text embedding; Fig. 6, Col. 13 lines 24-54, the second GAN generates high-resolution image data based on color embedding and hidden state data, which are both defined as being based on the text embedding input).

Regarding claim 12, Surya in view of Liu and Yu discloses the method of claim 10. Additionally, Surya teaches generating, using an image encoder, an image embedding, wherein the high-resolution image is generated based on the image embedding (Fig. 6, Col. 13 lines 24-54, wherein an encoder generates a feature representation of the synthetic image data, which is interpreted as an image embedding, and is used to generate the high-resolution image data).

Regarding claim 13, Surya in view of Liu and Yu discloses the method of claim 12. Additionally, Surya teaches wherein the GAN takes the image embedding as input (Fig. 6, Col. 13 lines 24-54, wherein the second GAN takes the feature representation of the synthetic image data, interpreted as an image embedding, as input for generating an image). Surya does not teach wherein the image generation model takes the image embedding as input. Liu teaches wherein the image generation model takes the image embedding as input (paragraph 15, diffusion model configured to generate a synthesized image based on an input feature vector defined by the embeddings generated by the image encoder). The motivation to combine would be the same as that set forth for claim 9.
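Claims 9-13 describe a two-stage inference pipeline: prompt to text embedding, embedding to a low-resolution synthetic image, then a GAN upscales that image to high resolution. A minimal sketch of that data flow follows, with every module a deliberately toy stand-in (an actual system would use a transformer encoder, a diffusion or GAN generator, and a trained super-resolution GAN):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the pipeline stages named in claims 9-13.
text_encoder = nn.Embedding(10_000, 256)
image_generator = nn.Linear(256, 3 * 64 * 64)                      # low-res synthesis
upscaler_gan = nn.ConvTranspose2d(3, 3, kernel_size=4, stride=4)   # 64x64 -> 256x256

def generate(token_ids):
    # Claim 9: prompt -> text embedding -> synthetic image.
    text_embedding = text_encoder(token_ids).mean(dim=1)           # (B, 256)
    low_res = image_generator(text_embedding).view(-1, 3, 64, 64)
    # Claim 10: a GAN produces a high-resolution image from it.
    high_res = upscaler_gan(low_res)                               # (B, 3, 256, 256)
    return high_res

image = generate(torch.randint(0, 10_000, (1, 8)))
```

Claims 11-13 additionally feed the text embedding (and an image embedding from an image encoder) into both stages; in the sketch that would mean passing `text_embedding` to `upscaler_gan` as extra conditioning input.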
12. Claims 14-17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Pham (US-20230154161-A1), hereinafter Pham, and Yu.

Regarding claim 14, Liu teaches a system for image generation, comprising: one or more processors; one or more memory components coupled with the one or more processors (Fig. 1, paragraph 18); a text encoder, the text encoder trained to encode a text prompt to obtain a text embedding (Fig. 3, paragraphs 43-44, wherein the text encoder generates training feature vectors based on text captions, which is interpreted as encoding text prompts to obtain text embeddings); and an image generation model comprising image generation parameters stored in the one or more memory components (Fig. 6, paragraphs 52-53, wherein the machine learning diffusion model can be based on user identifiers, which are interpreted as parameters that influence image generation), the image generation model trained to generate a synthetic image based on the text embedding (Fig. 3, paragraphs 42-43, generating synthesized images based on a training feature vector generated by a text encoder), wherein the text encoder is trained jointly with the image generation model based on an output of the image generation model (Fig. 3, paragraphs 43-44, wherein the text encoder and image generation model may be trained together, and are trained based on the loss function of the synthesized images, which is interpreted as the output of the image generation model).

Liu does not teach a text encoder comprising text encoding parameters stored in the one or more memory components, or wherein the text encoder is pre-trained. Pham teaches a text encoder comprising text encoding parameters stored in the one or more memory components (Fig. 1, paragraph 23, text encoder neural network has parameters; paragraph 31, neural network is trained in memory). Neither Liu nor Pham teaches wherein the text encoder is pre-trained. Yu teaches pre-training the text encoder (Fig. 1, paragraph 19, wherein a pre-trained language model replaces and acts as the text encoder in the text-to-image framework, which is interpreted as the text encoder being pre-trained).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Pham and Yu for this system for training a model to generate images. Liu discusses machine learning models trained for generating images based on text prompts, including a text and image encoder to encode image and text input as part of the image generation process. Similarly, Yu discusses a machine learning model for generating images based on text prompts using pre-trained text encoders to improve image generation quality and plausibility. Pham, on the other hand, teaches a neural network system for training image and text encoders through contrastive learning, for the purposes of improving the performance of learning while not utilizing more memory. As Pham discusses a method for using a neural network to train a text and image encoder, and Liu and Yu discuss image generation models that include trained encoders, it would have been obvious to combine them in order to improve the performance of both text and image encoders.

Regarding claim 15, Liu in view of Pham and Yu discloses the system of claim 14. Additionally, Liu teaches the system further comprising: a training component configured to train the text encoder to generate text embeddings as input for generating images with the image generation model based on a provisional image and a ground-truth image (Fig. 3, paragraphs 43-44, wherein the machine learning diffusion model trains the text encoder to generate training feature vectors, which are input for generating images, and is trained based on the predicted synthesized image, interpreted as a provisional image, and the training images, interpreted as a ground-truth image).
Regarding claim 16, Liu in view of Pham and Yu discloses the system of claim 15. Additionally, Liu teaches wherein the training component is further configured to train the image generation model to generate images based on the provisional image (Fig. 3, paragraphs 43-44, wherein the diffusion model that generates images is trained based on predicted synthesized images, which are interpreted as a provisional image).

Regarding claim 17, Liu in view of Pham and Yu discloses the system of claim 15. Additionally, Pham teaches wherein the training component is further configured to fix parameters of the text encoder during a first training phase (Fig. 2, paragraphs 47, 56, wherein during training a first forward pass is performed through the text encoder, which is interpreted as a first training phase, and is performed according to the current neural network parameters, which suggests that the parameters are fixed in the first training phase), wherein the text encoder is trained during a second training phase (Fig. 4, paragraphs 79-83, wherein performing a second forward pass and backward pass for training a text encoder is interpreted as a second training phase). The motivation to combine would be the same as that set forth for claim 14.

Regarding claim 20, Liu in view of Pham and Yu discloses the system of claim 14. Additionally, Pham teaches the system further comprising: an image encoder comprising image encoding parameters stored in the one or more memory components, the image encoder trained to generate an image embedding (Fig. 1, paragraph 21, image encoder neural network has parameters for generating an image embedding; paragraph 31, neural network is trained in memory). The motivation to combine would be the same as that set forth for claim 14.

13. Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Pham and Yu as applied to claim 14 above, and further in view of Surya (US-10713821-B1), hereinafter Surya.

Regarding claim 19, Liu in view of Pham and Yu discloses the system of claim 14. Additionally, Surya teaches the system further comprising: a generative adversarial network (GAN) comprising GAN parameters stored in the one or more memory components (Col. 6 line 61 - Col. 7 line 34, GAN contains parameters that are updated during training; Fig. 3, Col. 8 line 65 - Col. 9 line 26, memory architecture), the GAN trained to generate a high-resolution image based on a low-resolution image generated by the image generation model (Fig. 6, Col. 13 lines 11-54, using a GAN to generate high-resolution image data based on the generated synthetic image data, which is interpreted as a low-resolution image).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu in view of Pham and Yu to incorporate the teachings of Surya for this system for training a machine learning model to generate images. Surya, Liu, and Yu all discuss a machine learning model for generating images, each utilizing encoders to create embeddings as part of the image generation process.
Surya in particular describes a context-aware model that iteratively generates images based on modifications to an initial text query. Liu similarly discusses a machine learning model for generating images, as well as methods for training both the image generation model and the text encoders. Pham, on the other hand, teaches a neural network system for training image and text encoders through contrastive learning, for the purposes of improving the performance of learning while not utilizing more memory. As Pham discusses a method for using a neural network to train a text and image encoder, and Liu, Yu, and Surya all discuss image generation models that include trained encoders, it would have been obvious to combine these references in order to improve the performance of the encoders.

Response to Arguments

14. Applicant's arguments with respect to claims 1, 9, and 14 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Conclusion

15. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

16. Any inquiry concerning this communication or earlier communications from the examiner should be directed to JORDAN W YICK, whose telephone number is (571) 272-4063. The examiner can normally be reached M-F 8-5. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Said Broome, can be reached at (571) 272-2931. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JORDAN WAN YICK/
Examiner, Art Unit 2612

/Said Broome/
Supervisory Patent Examiner, Art Unit 2612

Prosecution Timeline

Feb 12, 2024: Application Filed
Sep 05, 2025: Non-Final Rejection — §103
Nov 25, 2025: Interview Requested
Dec 04, 2025: Examiner Interview Summary
Dec 04, 2025: Applicant Interview (Telephonic)
Dec 16, 2025: Response Filed
Jan 10, 2026: Final Rejection — §103
Mar 04, 2026: Interview Requested
Mar 11, 2026: Examiner Interview Summary
Mar 11, 2026: Applicant Interview (Telephonic)
Apr 09, 2026: Request for Continued Examination
Apr 13, 2026: Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592026: NEURAL RADIANCE FIELD FOR VEHICLE
Granted Mar 31, 2026 (2y 5m to grant)

Patent 12586312: METHOD AND DEVICE FOR DETERMINING CONCEALED OBJECTS IN A 3D POINT CLOUD REPRESENTING AN ENVIRONMENT
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12579744: 3D GLOBAL GEOSPATIAL RENDERING SERVER SYSTEMS AND METHODS
Granted Mar 17, 2026 (2y 5m to grant)

Patent 12573143: Systems and Methods for Identifying Suitability of Stormwater Management Measures Using Spatial Analysis
Granted Mar 10, 2026 (2y 5m to grant)

Patent 12573142: SPATIAL LOCALITY FOR FIRST-HIT RAY BVH TRAVERSALS
Granted Mar 10, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 95% (99% with interview, +7.7%)
Median Time to Grant: 2y 6m
PTA Risk: Moderate

Based on 19 resolved cases by this examiner. Grant probability is derived from the career allow rate.
