Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Response to Arguments
Applicant’s arguments, filed 01/21/2026, with respect to how the newly amended claim features differ from the prior art cited in the last Office action have been fully considered. These arguments are found to be persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in this Office action.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-6, 8-15 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Fan (Fan, Ying, et al. "Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models." Advances in Neural Information Processing Systems 36 (2023): 79858-79885.) in view of Black (Black, Kevin, et al. "Training diffusion models with reinforcement learning." arXiv preprint arXiv:2305.13301 (2023).).
As per claim 1, Fan teaches the claimed:
1. A method for image generation, comprising:
obtaining an input text prompt and an indication of a level of a target characteristic, (Fan teaches an input text prompt. Fan introduction: “Recent advances in diffusion models [10, 37, 38], together with pre-trained text encoders (e.g., CLIP [27], T5 [28]) have led to impressive results in text-to-image generation. Large-scale text-to-image models, such as Imagen [32], Dalle-2 [29], and Stable Diffusion [30], generate high-quality, creative images given novel text prompts. However, despite these advances, current models have systematic weaknesses. For example, current models have a limited ability to compose multiple objects [6, 7, 25]. They also frequently encounter difficulties when generating objects with specified colors and counts [12, 17].” Fan thus teaches generating images from text using a model pre-trained on text prompts. Fan Figure 12 depicts a text prompt with a level of a target characteristic. “Figure 12: Text prompt: "oil portrait of archie andrews holding a picture of among us, intricate, elegant, highly detailed, lighting, painting, artstation, smooth, illustration, art by greg rutowski and alphonse mucha".” Fan teaches an aesthetic quality and a degree of that quality, such as “highly detailed”. This is the target characteristic.).
Fan alone does not explicitly teach the remaining claim limitations.
However, Fan in combination with Black teaches the claimed:
wherein the target characteristic comprises a characteristic used to train an image generation model and the level comprises a value included in a range of values, wherein the range of values is used to train the image generation model; (Black 5.2 “Aesthetic Quality”: “To capture a reward function that would be useful to a human user, we define a task based on perceived aesthetic quality. We use the LAION aesthetics predictor [43], which is trained on 176,000 human image ratings. The predictor is implemented as a linear model on top of CLIP embeddings [37]. Annotations range between 1 and 10, with the highest-rated images mostly containing artwork. Since the aesthetic quality predictor is trained on human judgments, this task constitutes reinforcement learning from human feedback [34, 7, 61].” The annotation range of 1 to 10 in the training data is the claimed range of values.).
generating an augmented text prompt comprising the input text prompt and an objective text corresponding to the indication of the level of the target characteristic, wherein the objective text comprises the target characteristic and the value; (Fan appendix figure 12. “Figure 12: Text prompt: "oil portrait of archie andrews holding a picture of among us, intricate, elegant, highly detailed, lighting, painting, artstation, smooth, illustration, art by greg rutowski and alphonse mucha". It would be obvious to combine these input text prompts with the trait of being “aesthetic” and the set of numerical ratings for them to indicate the amount of an abstract trait desired for an image, as taught by Black. For example, the prompts discuss characteristics like “detailed” or “intricate”.).
and generating, using the image generation model, an image based on the augmented text prompt, wherein the image depicts content of the input text prompt and has the level of the target characteristic.
(Fan teaches generating images by the model based on text prompts, including the augmented one described above. Fan Figure 2 description: “Figure 2: Comparison of images generated by the original Stable Diffusion model, supervised fine-tuned (SFT) model, and RL fine-tuned model. Images in the same column are generated with the same random seed. Images from seen text prompts: “A green colored rabbit” (color), “A cat and a dog” (composition), “Four wolves in the park” (count), and “A dog on the moon” (location).” These prompts for image generation can be combined with the descriptions of an image with a characteristic and a value from 1 to 10 as taught by Black. An input prompt for generation can be used with this format since training a model in this format would aim to allow for such generation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the training data labeled with aesthetic quality and an associated level in a range of levels as taught by Black with the system of Fan in order to indicate in the training data a quantifiable, ordered value for an aesthetic quality and allow an image to be generated with a value in this scale.
As per claims 9 and 15, these claims are similar in scope to limitations recited in claim 1, and thus are rejected under the same rationale. Claim 9 discloses a plurality of training prompts for a plurality of training images, where the training images are labeled with the prompts. Black 5.2 teaches a number of training images that have the characteristic of “aesthetic” and an annotation of a number between 1 and 10. The combination of the characteristic and the number comprises the prompt. The annotations are the labels described in claim 9.
As per claim 2, Fan teaches the claimed:
2. The method of claim 1, further comprising:
determining the level of the target characteristic based on the objective text using a classifier model. (Fan top of pg. 4: “Text-to-image diffusion models. Diffusion models are especially well-suited to conditional data
generation, as required by text-to-image models: one can plug in a classifier as guidance function.” The classifier is part of the model. The model is trained using prompts including the one described above. Fan pg. 8: “Figure 3(a) compares ImageReward scores of images generated by the different models (with the same random seed). We see that both SFT and RL fine-tuning improve the ImageReward scores on the training text prompt. This implies the fine-tuned models can generate images that are better aligned with the input text prompts than the original model because ImageReward is trained on human feedback datasets to evaluate image-text alignment. Figure 2 indeed shows that fine-tuned models add objects to match the number (e.g., adding more wolves in “Four wolves in the park”), and replace incorrect objects with target objects (e.g., replacing an astronaut with a dog in “A dog on the moon”) compared to images from the original model.”).
As per claim 3, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Black teaches the claimed:
3. The method of claim 1, wherein:
the image generation model is trained using annotated training data including a training image that is labeled based on the target characteristic and a training prompt corresponding to the training image. (Black 5.2: “To capture a reward function that would be useful to a human user, we define a task based on perceived aesthetic quality. We use the LAION aesthetics predictor (Schuhmann, 2022), which is trained on 176,000 human image ratings. The predictor is implemented as a linear model on top of CLIP embeddings (Radford et al., 2021). Annotations range between 1 and 10, with the highest-rated images mostly containing artwork. Since the aesthetic quality predictor is trained on human judgments, this task constitutes reinforcement learning from human feedback (Ouyang et al., 2022; Christiano et al., 2017; Ziegler et al., 2019).” Black 5.1 teaches that the training images for the reward function have prompts that are measured by the model. “Figure 2 (VLM reward function): Illustration of the VLM-based reward function for prompt-image alignment. LLaVA (Liu et al., 2023) provides a short description of a generated image; the reward is the similarity between this description and the original prompt as measured by BERTScore (Zhang et al., 2020).”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the annotated training data with prompts and corresponding scores as taught by Black with the system of Fan in order to train the image generation system beforehand on previous images with known prompts and targeted aesthetic scores.
As per claim 20, this claim is similar in scope to limitations recited in claim 3, and thus is rejected under the same rationale.
As per claim 4, Fan teaches the claimed:
4. The method of claim 3, wherein: the training prompt includes training objective text at a same location as the objective text within the augmented text prompt. (Fan Figure 12: Text prompt: "oil portrait of archie andrews holding a picture of among us, intricate, elegant, highly detailed, lighting, painting, artstation, smooth, illustration, art by greg rutowski and alphonse mucha". The objective text is at the same location in the text as the subject of the prompt. Fan pg. 20 describes training with multiple prompts of different complexity. This would include Figure 12, shown on the next page.).
As per claim 5, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Black teaches the claimed:
5. The method of claim 1, wherein:
the objective text indicates a level of image quality. (Fan Figure 3 description: “(a) ImageReward scores and (b) Aesthetic scores of three models: the original model, supervised fine-tuned (SFT) model, and RL fine-tuned model. ImageReward and Aesthetic scores are averaged over 50 samples from each model. (c) Human preference rates between RL model and SFT model in terms of image-text alignment and image quality. The results show the mean and standard deviation averaged over eight independent human raters.” One of the characteristics being measured is image quality. Additionally, Fan fig. 12 shows a prompt with the phrase “highly detailed” as described above. This is an indicator of image quality. Likewise, Black uses training data indicating an “aesthetic” characteristic and a score of 1 to 10. This could be the format of the prompts relating to aesthetic quality.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the format of training images scored for an aesthetic quality as taught by Black with the system of Fan in order to format a prompt relating to aesthetics on a clear scale with different levels.
As per claim 6, Fan teaches the claimed:
6. The method of claim 1, wherein:
the objective text indicates a number of objects described by the input text prompt. (Fan teaches prompts describing numbers of objects. Figure 2: “Comparison of images generated by the original Stable Diffusion model, supervised fine-tuned (SFT) model, and RL fine-tuned model. Images in the same column are generated with the same random seed. Images from seen text prompts: “A green colored rabbit” (color), “A cat and a dog” (composition), “Four wolves in the park” (count), and “A dog on the moon” (location).” The graphics included in Figure 2 measure data based on count, among other traits.).
As per claim 10, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Black teaches the claimed:
10. The method of claim 9, wherein obtaining the training data comprises:
applying a classifier model to the plurality of training images to determine the levels of the target characteristic; (Fan top of pg. 4: “Text-to-image diffusion models. Diffusion models are especially well-suited to conditional data generation, as required by text-to-image models: one can plug in a classifier as guidance function [4], or can directly train the diffusion model’s conditional distribution with classifier-free guidance [9]. Given text prompt z ∼ p(z), let q(x0|z) be the data distribution conditioned on z. This induces a joint distribution p(x0, z). During training, the same noising process q is used regardless of input z, and both the unconditional ϵθ(xt, t) and conditional ϵθ(xt, t, z) denoising models are learned. For data sampling, let ϵ̄θ = wϵθ(xt, t, z) + (1 − w)ϵθ(xt, t), where w ≥ 1 is the guidance scale. At test…” Black teaches using a plurality of images and teaches defined levels for the characteristic used in the training of the images. Black 5.2: “To capture a reward function that would be useful to a human user, we define a task based on perceived aesthetic quality. We use the LAION aesthetics predictor [43], which is trained on 176,000 human image ratings. The predictor is implemented as a linear model on top of CLIP embeddings [37]. Annotations range between 1 and 10, with the highest-rated images mostly containing artwork. Since the aesthetic quality predictor is trained on human judgments, this task constitutes reinforcement learning from human feedback.”).
and generating the objective text based on an output of the classifier model. (Fan teaches using the diffusion model to generate images based on text prompts. Fan Figure 2 description: “Figure 2: Comparison of images generated by the original Stable Diffusion model, supervised fine-tuned (SFT) model, and RL fine-tuned model. Images in the same column are generated with the same random seed. Images from seen text prompts: “A green colored rabbit” (color), “A cat and a dog” (composition), “Four wolves in the park” (count), and “A dog on the moon” (location).” Thus, the images are based on text prompts generated by the model.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the plurality of training images as taught by Black with the system of Fan in order to train the image generation system beforehand on a larger number of images with discrete and ordered scores, so the model learns to measure a level of an abstract characteristic more accurately.
As per claim 11, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Black teaches the claimed:
11. The method of claim 10, wherein:
the classifier model comprises an aesthetic classifier. (Black teaches a classifier that predicts image aesthetics. Black pg. 15, appendix B: “Classifier guidance (Dhariwal & Nichol, 2021) was originally introduced as a way to improve sample
quality for conditional generation using the gradients from an image classifier. For a differentiable
reward function such as the LAION aesthetics predictor (Schuhmann, 2022), one could naturally imagine an extension to classifier guidance that uses gradients from such a predictor to improve aesthetic score. The issue is that classifier guidance uses gradients with respect to the noisy images in the intermediate stages of the denoising process, which requires retraining the guidance network on noisy images.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the aesthetic classifier as taught by Black with the system of Fan in order to use the classifier of Fan to predict aesthetics specifically and obtain an aesthetic score for different images.
As per claim 12, Fan teaches the claimed:
12. The method of claim 9, wherein:
the training is based on a diffusion process.
(Fan teaches using a diffusion model to train text-to-image generation. Fan abstract: “Learning from human feedback has been shown to improve text-to-image models. These techniques first learn a reward function that captures what humans care about in the task and then improve the models based on the learned reward function. Even though relatively simple approaches (e.g., rejection sampling based on reward scores) have been investigated, fine-tuning text-to-image models with the reward function remains challenging. In this work, we propose using online reinforcement learning (RL) to fine-tune text-to-image models. We focus on diffusion models, defining the fine-tuning task as an RL problem, and updating the pre-trained text-to-image diffusion models using policy gradient to maximize the feedback-trained reward. Our approach, coined DPOK, integrates policy optimization with KL regularization.”).
As per claim 13, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Black teaches the claimed:
13. The method of claim 9, wherein:
the training data comprises a plurality of different images corresponding to a plurality of levels of the target characteristic, respectively. (Black pg. 6, 5.2: “To capture a reward function that would be useful to a human user, we define a task based on perceived aesthetic quality. We use the LAION aesthetics predictor (Schuhmann, 2022), which is trained on 176,000 human image ratings. The predictor is implemented as a linear model on top of CLIP embeddings (Radford et al., 2021). Annotations range between 1 and 10, with the highest-rated images mostly containing artwork. Since the aesthetic quality predictor is trained on human judgments, this task constitutes reinforcement learning from human feedback (Ouyang et al.,
2022; Christiano et al., 2017; Ziegler et al., 2019).” The annotations of the aesthetic quality are the plurality of levels of a target characteristic.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the annotated training data with a plurality of levels of the target characteristic as taught by Black with the system of Fan in order to train the image generation system with different levels of an aesthetic characteristic so it can respond to a level given in a prompt.
As per claim 14, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Black teaches the claimed:
14. The method of claim 9, further comprising:
pretraining the image generation model based on unlabeled images. (Black teaches using a vision-language model (VLM) to train the diffusion model. Black allows training without human-made labels. Black pg. 1 “In the case of text-to-image diffusion, we propose a method for replacing such labeling with feedback from a vision-language
model (VLM). Similar to RLAIF finetuning for language models (Bai et al., 2022b), the resulting procedure allows for diffusion models to be adapted to reward functions that would otherwise require additional human annotations. We use this procedure to improve prompt-image alignment for unusual
subject-setting compositions.” Black pg. 8 “We next evaluate the ability of VLMs, in conjunction with DDPO, to automatically improve the
image-prompt alignment of the pretrained model without additional human labels. We focus on DDPOIS for this experiment, as we found it to be the most effective algorithm in Section 6.1. The prompts for this task all have the form “a(n) [animal] [activity]”, where the animal comes from the same list of 45 common animals used in Section 6.1 and the activity is chosen from a list of 3 activities: “riding a bike”, “playing chess”, and “washing dishes”.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the vision-language model on unlabeled data as taught by Black with the system of Fan in order to allow unlabeled data as an input option and to test the image-prompt alignment of the image generation system of Fan.
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Fan.
As per claim 8, Fan teaches the claimed:
8. The method of claim 1, wherein generating the augmented text prompt comprises:
prepending the objective text to the input text prompt. (Fan Figure 12 shows the text prompt with the objective. It would be obvious to enter the prompt with the objective “highly detailed” at the beginning of the prompt, since the exact ordering of the words in the input text prompt may be based upon the user’s own preference. For instance, a user may place the objective “highly detailed” at the beginning as one of several different ways to describe their desired output.).
Claims 7, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Fan in view of Black and further in view of Cao (Cao, Yuanjiang, et al. "Reinforcement learning for generative AI: A survey." arXiv preprint arXiv:2308.14328).
As per claim 7, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Cao teaches the claimed:
7. The method of claim 1, further comprising:
encoding the augmented text prompt to obtain a text embedding, wherein the image generation model takes the text embedding as an input. (Cao pg. 8, bottom of left column: “1) The generated variable is non-differentiable: Discrete
values are prevalent in various generative applications such as
computer vision, natural language processing, and molecule
generation. In language applications and molecular design,
elements of text and molecules are tokenized and embedded
into high-dimensional space in order to capture a better
representation. The tokens are discrete values or one-hot
vectors. In computer vision, a common format of images is the
RGB format which comprises discrete values of three color
channels. Although it is feasible and easy to normalize the
discrete values that are fed into a continuous generative model,
transforming discrete values into continuous values leads to
adverse effects such as weaker robustness [59].
Reinforcement learning is a suitable tool for such problems.
The policy gradient method is a widely adopted approach”
Cao is being used for image generation, and its representation of images would be used for an image generation system. The image with embedded text would be input into that system. Cao pg. 19: “4) Text-to-Image Generation: Recent advancement in image generation is diffusion models, thereby it might be beneficial explore how to combine reinforcement learning with diffusion models by improvement on images characteristics that are hard to be described by prompts [229] and online reinforcement learning methods [230]. …The reward takes file size and human aesthetic preference acquired from another predictor into consideration. An extra alignment using a vision language model is incorporated for RLAIF [96]. [230] proposes to compare the RL-directed fine-tuning and supervised fine-tuning in the context of KL divergence as a regularizer. The RL fine-tuning uses a policy gradient with a KL term to constrain models on a pre-trained model. The reward in RL fine-tuning is typically from human preference matching.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the embedding of text into a high-dimensional vector space for a training image as taught by Cao with the system of Fan in order to convert the prompt text of Fan for an image of Fan into a vector format that is easier for a machine learning system to read.
As per claim 16, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Cao teaches the claimed:
16. The system of claim 15, the system further comprising:
a classifier model comprising classification parameters stored in the one or more memory components, the classifier model trained to determine the level of the target characteristic. (Cao teaches a model that compares generated outputs to human judgments about their quality and their similarity to certain characteristics. Cao pg. 15 left column: “4) Human Value Alignment and Constraints: The output of
generators is not well matched with human values. Models
sometimes have hallucinations, generating fake information
that they do not understand at all. Sometimes models are
impacted by datasets and spit out sentences that do not match
human values in some cultures. Reinforcement learning can
be used to adjust the model to work better on value matching
[158], impose constraints [159], or even help people to combat
problems like fake news [160]. SENSEI [158] proposes to use
actor-critic to align text generation with human judgments.
The reward is predicted by a binary classifier trained a human-labeled text data. RCR [159] uses a discriminator to model the
violations and computes a penalty accordingly. This penalty
is added to the reward to regulate the actions of the text
generator. FakeGAN [160] trains a deceptive reviews classifier
with a two-discriminator GAN model. Although it addresses
a classification problem, the method contains generating deceptive reviews, which is a generation subtask.” The rewards are the indication that the generated text aligns with human perception of the characteristics being targeted. It would be obvious to combine these with the aesthetic qualities in the images and described in the text prompts as taught in claim 1.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the classification to measure the aesthetic characteristics as taught by Cao with the system of Fan in order to classify images in the training model according to their alignment with the related text and their closeness to a human ideal of aesthetics.
As per claim 18, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Cao teaches the claimed:
18. The system of claim 15, the system further comprising:
an encoder comprising encoding parameters stored in the one or more memory components, the encoder configured to encode the augmented text prompt to obtain a text embedding. (Cao pg. 19: “3) Visual Dialog System: RL enhances the visual dialog system by incorporating discriminators [61], [228]. Fan et al.
[61] borrows SeqGAN [63] model into a visual dialog system.
They devise a model that contains two modules, an encoder
to embed images, captions, and questions into the embedding
vector, which is fed into an RL-based decoder as the state.
The decoder is an RL-based GAN. The generator is the agent
that outputs answers as actions. The discriminator learns to
classify the generated answers from real ones in the embedding
space. SCH-GAN [228] learn a cross-modal hashing GAN
with reinforcement learning. Text and image modalities are
considered. The generator tries to retrieve an image from
texts or vice versa. The discriminator aims to distinguish true
examples of the query.” The embedding parameters would include the image modalities, the goal (image from text vs. text from image) and the dimension of the vector being encoded. These procedures use a computer which necessarily comprises memory components.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the encoding of a text prompt or question or description into an image vector being input into a system as taught by Cao with the system of Fan in order to allow the prompts to be combined with an input image and efficiently read by the training system.
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Fan in view of Black in further view of Li (Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv:2201.12086 [cs.CV]).
As per claim 17, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Li teaches the claimed:
17. a language generation model comprising language generation parameters stored in the one or more memory components, the language generation model configured to generate the objective text based on an output of the classifier model. (Li teaches text generation to generate text about images. It covers conversions relating images to language for training. Li introduction: “Model perspective: most methods either adopt an
encoder-based model (Radford et al., 2021; Li et al., 2021a),
or an encoder-decoder (Cho et al., 2021; Wang et al., 2021)
model. However, encoder-based models are less straightforward to directly transfer to text generation tasks (e.g. image
captioning), whereas encoder-decoder models have not been
successfully adopted for image-text retrieval tasks.” Li teaches using vision-language understanding and generation. Li introduction left column: “To this end, we propose BLIP: Bootstrapping Language-Image Pre-training for unified vision-language understanding and generation. BLIP is a new VLP framework which
enables a wider range of downstream tasks than existing
methods. It introduces two contributions from the model
and data perspective, respectively:
(a) Multimodal mixture of Encoder-Decoder (MED): a new
model architecture for effective multi-task pre-training and
flexible transfer learning. An MED can operate either as
a unimodal encoder, or an image-grounded text encoder,
or an image-grounded text decoder. The model is jointly
pre-trained with three vision-language objectives: image-text contrastive learning, image-text matching, and image-conditioned language modeling.” It would be obvious to use this to generate language such as a prompt and base the vision-language relation on the output of the classifier model described above.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the prompt-modifying language generation as taught by Li with the system of Fan in order to use the vision-language relationship data and training to generate image-based text and refine the prompts to be used in the image generation process of Fan.
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Fan in view of Black in further view of Liu (Liu, Vivian, Han Qiao, and Lydia Chilton. "Opal: Multimodal image generation for news illustration." Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 2022.).
As per claim 19, Fan alone does not explicitly teach the claimed limitations.
However, Fan in combination with Liu teaches the claimed:
19. The system of claim 15, the system further comprising:
a user interface configured to obtain the input text prompt from a user. (Liu, in the description for Figure 1, shows a user interface for entering text into an image generation system. Liu Figure 1 description: “A screenshot of the Opal system, which helps users create news illustrations using a text-to-image generative AI
model. The system here has generated a gallery of images for an article on "climate change". The participant is guided through
the generation process with a structured pipeline of suggestions based off of GPT-3 generated keywords, tones, and styles.” Liu teaches input of article text, which can be a short one-line description. Liu 3.3.6: “User Interface. Opal, seen in Figure 3, is an interface composed of two components: Oeuvre and Palette. An oeuvre is by definition "the works of a painter, composer, or author regarded collectively". The oeuvre provides users with a birds-eye view of all the generations they have created, and the generations stream in in real time as they are generated. The Palette provides users with a pipeline of tools to help create a news illustration. The user begins with the Article Area, which asks the user to input article text, which could be short one-line descriptions or long-form text.” The short one-line description corresponds to a text prompt.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use the user interface as taught by Liu with the system of Fan in order to give the user an easy tool to input a prompt into the reinforcement learning model of Fan and generate an image.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to THOMAS JOHN FOSTER whose telephone number is (571)272-5053. The examiner can normally be reached Mon, Fri 8:30-6. Tues-Thurs 7:30-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Hajnik can be reached at 571-272-7642. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/THOMAS JOHN FOSTER/Examiner, Art Unit 2616
/DANIEL F HAJNIK/Supervisory Patent Examiner, Art Unit 2616