Prosecution Insights
Last updated: April 19, 2026
Application No. 18/407,232

NEURAL NETWORK TUNING USING TEXT ENCODER

Status: Non-Final OA (§103)
Filed: Jan 08, 2024
Examiner: OAKES, JUSTIN MONTGOMERY
Art Unit: 2662
Tech Center: 2600 — Communications
Assignee: Snap Inc.
OA Round: 1 (Non-Final)
Grant Probability: Favorable
Expected OA Rounds: 1-2
Estimated Time to Grant: 2y 9m

Examiner Intelligence

Career Allow Rate: 0% (0 granted / 0 resolved; -62.0% vs TC avg)
Interview Lift: +0.0% (minimal; based on resolved cases with interview)
Avg Prosecution: 2y 9m (typical timeline)
Career History: 7 total applications across all art units; 7 currently pending

Statute-Specific Performance

§101: 12.5% (-27.5% vs TC avg)
§103: 81.3% (+41.3% vs TC avg)
§102: 6.3% (-33.7% vs TC avg)
Tech Center averages are estimates. Based on career data from 0 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The two information disclosure statements ("IDS") filed on 07/11/2025 were reviewed and the listed references were noted.

Drawings

The 2-page drawings have been considered and placed on record in the file.

Status of Claims

Claims 1-20 are pending.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.
Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-8, 11-17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Karpman et al. (US 11995803 B1) in view of Chen et al. ("Enhancing Diffusion Models with Text-Encoder Reinforcement Learning").

Regarding claim 1, Karpman teaches a "generative text-to-image neural network comprising a pre-trained text encoder and a pre-trained diffusion model" (Karpman, Abstract and Column 5, lines 25-28 discloses; "The base image diffusion model 120 can therefore: receive one or more text embeddings from the set of pre-trained text encoders"); "with the method comprising: processing text prompts using the pre-trained text encoder to obtain embedded text prompts" (Karpman, Column 5, line 67 through Column 6, line 1 discloses; "execute the set of pre-trained text encoders 118 on the text prompt to generate one or more embedding representations"); and "generating images responsive to the embedded text prompts using the pretrained diffusion model" (Karpman, Column 2, lines 60-63 discloses; "Text encoders 118 interpret a text query and generate an embedding of the text query. Base image diffusion models 120 generate a base image (e.g., an initial, low-resolution image) from the embedding").

Karpman does not teach reward scores or updating the text encoder responsive to the reward scores. Since Karpman does not explicitly disclose these limitations, the Examiner relies on the teachings of Chen, in an analogous field of endeavor. Specifically, Chen discloses "iteratively determining reward scores for the images to convergence with the pre-trained diffusion model fixed" (Chen, Introduction, Para. 3 discloses; "In this paper, we introduce TexForce, an innovative method that applies reinforcement learning combined with low-rank adaptation to enhance the text encoder using task-specific rewards. We utilize the DDPO (denoising diffusion policy optimization) [2] algorithm to update the text encoder, which is based on PPO (proximal policy optimization [39]) in the iterative denoising process"); and "updating the weights of the pre-trained text encoder responsive to the iteratively determined reward scores" (Chen, Abstract and Introduction, Para. 3 discloses; "In this paper, we introduce TexForce, an innovative method that applies reinforcement learning combined with low-rank adaptation to enhance the text encoder using task-specific rewards").

Karpman and Chen are both considered analogous to the claimed invention because they are in the same field of generative text-to-image neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Karpman to incorporate the teachings of Chen and provide reward scores for the images and update the weights of the text encoder responsive to the reward scores. Doing so would improve the performance of diffusion models and enhance the text-image alignment of the results (Chen, Abstract discloses; "In this paper, we demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving the visual quality").
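The claim 1 mapping describes a concrete tuning loop: generate images with the diffusion model held fixed, iteratively score them with a reward function to convergence, and update only the text encoder's weights from those scores. The following toy sketch illustrates that control flow only; the scalar "models", the reward function, and the finite-difference gradient ascent are hypothetical stand-ins for the actual networks and the DDPO/PPO updates that Chen uses.

```python
# Toy sketch: tune a "text encoder" against a FIXED "diffusion model" by
# maximizing a reward on the generated image. All components are stand-ins.
W_DIFFUSION = 0.5   # pre-trained diffusion model weight: frozen throughout

def encode(w_enc, prompt):
    return w_enc * prompt                 # stand-in text encoder

def diffuse(embedding):
    return W_DIFFUSION * embedding        # stand-in (frozen) diffusion model

def reward(image, target=1.0):
    # Hypothetical scalar reward; real systems use learned aesthetics/alignment models.
    return -(image - target) ** 2

def tune_text_encoder(prompt=1.0, w_enc=0.0, lr=0.1, eps=1e-4, tol=1e-10):
    # Iteratively determine reward scores to convergence, updating only w_enc.
    while True:
        score = lambda w: reward(diffuse(encode(w, prompt)))
        grad = (score(w_enc + eps) - score(w_enc - eps)) / (2 * eps)
        w_new = w_enc + lr * grad         # gradient ascent on the reward
        if abs(w_new - w_enc) < tol:      # converged
            return w_new
        w_enc = w_new

w_tuned = tune_text_encoder()
final_reward = reward(diffuse(encode(w_tuned, 1.0)))
```

With the frozen diffusion weight at 0.5, the loop drives the encoder weight toward the value that maximizes the reward, without ever touching the diffusion model, mirroring the division of labor the claim recites.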
Regarding claim 2, the combination of Karpman in view of Chen teaches the method for tuning a text-to-image neural network as claimed in claim 1, the method "further comprising: iteratively determining reward scores for the images to convergence with the pre-trained text encoder with the updated weights fixed" (Chen, Figure 3 shows that the text encoder can be frozen with updated weights); "and updating weights of the pre-trained diffusion model responsive to the iteratively determined reward scores for the images to convergence with the pre-trained text encoder with the updated weights fixed" (Karpman, Column 9, Lines 28-33; "As described in more detail below, the system can therefore leverage the reward model 114 (e.g., preference scores generated by the reward model 114) to update, optimize, and/or fine-tune parameters of the text-to-image diffusion model"). The proposed combination, as well as the motivation for combining the Karpman and Chen references presented in the rejection of claim 1, apply to claim 2 and are incorporated herein by reference. Thus, the method recited in claim 2 is met by Karpman and Chen.

Regarding claim 3, the combination of Karpman in view of Chen teaches the method for tuning a text-to-image neural network as claimed in claim 1, "wherein the iteratively determining reward scores comprises: assessing quality of the image using an aesthetics model" (Karpman, Column 18, Lines 24-30 discloses; "Thus, in addition to recruiting the reward model 114 during training and/or fine-tuning the reward model 114, the system can also leverage the reward model 114 rank and select images generated by the text-to-image diffusion model 112 in order to increase the aesthetic quality and/or text alignment of images that are ultimately output/served in response to image generation tasks").
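Claims 3 and 5-8 tie the reward scores to one or more reward models, e.g., an aesthetics model plus a text-image alignment model whose own weights stay fixed during tuning. Below is a minimal sketch of such a composite reward; both scorers and the weighting are hypothetical stand-ins, not Karpman's reward model 114 or Chen's reward function R.

```python
# Minimal composite reward: weighted sum of an image-quality ("aesthetics")
# score and a text-image alignment score. Both scorers are toy stand-ins.
def aesthetic_score(image):
    # Stand-in: favors mid-gray pixel statistics; a real model would be learned.
    mean = sum(image) / len(image)
    return 1.0 - abs(mean - 0.5)

def alignment_score(image, prompt_embedding):
    # Stand-in: raw dot product between "image" values and the prompt embedding.
    return sum(i * p for i, p in zip(image, prompt_embedding))

def composite_reward(image, prompt_embedding, w_aes=0.5, w_align=0.5):
    # The reward-model weights (w_aes, w_align) are held fixed while the
    # text encoder / diffusion model are tuned (cf. claim 8).
    return (w_aes * aesthetic_score(image)
            + w_align * alignment_score(image, prompt_embedding))

score = composite_reward([0.4, 0.6], [1.0, 1.0])
```

The point of the sketch is the structure, not the scorers: the generated image is scored by several fixed models, and only the weighted scalar output feeds the tuning updates.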
Regarding claim 4, the combination of Karpman in view of Chen teaches the method for tuning a text-to-image neural network as claimed in claim 1, "further comprising: processing the images with a variational autoencoder decoder before iteratively determining the reward scores for the images" (Karpman, Column 15, Lines 5-6 discloses; "the multimodal encoder-decoder 126 operating as an image aware text encoder, and so on", which is interpreted to encapsulate a variational autoencoder decoder).

Regarding claim 5, the combination of Karpman in view of Chen teaches the method for tuning a text-to-image neural network as claimed in claim 1, "wherein the iteratively determining reward scores for the images to convergence with the pre-trained diffusion model fixed comprises: maximizing quality scores predicted by one or more reward models" (Chen, Section 3.2 discloses; "Let R be the reward function that evaluates the quality of the generated images, which could encapsulate various aspects, such as image-text alignment and image quality, and adherence to specific attributes desired in the output. Then the objective of RL is to maximize the expected reward"). The proposed combination, as well as the motivation for combining the Karpman and Chen references presented in the rejection of claim 1, apply to claim 5 and are incorporated herein by reference. Thus, the method recited in claim 5 is met by Karpman and Chen.

Regarding claim 6, the combination of Karpman in view of Chen teaches the method for tuning a text-to-image neural network and reward model as claimed in claims 1 and 5, "wherein the one or more reward models comprises an image-based reward model" (Chen, Section 3.2 discloses; "Let R be the reward function that evaluates the quality of the generated images, which could encapsulate various aspects, such as image-text alignment and image quality, and adherence to specific attributes desired in the output").
The proposed combination, as well as the motivation for combining the Karpman and Chen references presented in the rejection of claim 1, apply to claim 6 and are incorporated herein by reference. Thus, the method recited in claim 6 is met by Karpman and Chen.

Regarding claim 7, the combination of Karpman in view of Chen teaches the method for tuning a text-to-image neural network and reward model as claimed in claims 1, 5, and 6, "wherein the one or more reward models further comprises a text-image alignment-based reward model" (Chen, Section 3.2 discloses; "Let R be the reward function that evaluates the quality of the generated images, which could encapsulate various aspects, such as image-text alignment and image quality, and adherence to specific attributes desired in the output"). The proposed combination, as well as the motivation for combining the Karpman and Chen references presented in the rejection of claim 1, apply to claim 7 and are incorporated herein by reference. Thus, the method recited in claim 7 is met by Karpman and Chen.

Regarding claim 8, the combination of Karpman in view of Chen teaches the method for tuning a text-to-image neural network and reward model as claimed in claims 1 and 5, "wherein the iteratively determining reward scores for the images to convergence with the pre-trained diffusion model fixed further comprises: fixing the weights of the one or more reward models" (Chen, Section 4.6 discloses; "TexForce demonstrates remarkable adaptability to diverse tasks, as it does not require differentiable rewards"). The proposed combination, as well as the motivation for combining the Karpman and Chen references presented in the rejection of claim 1, apply to claim 8 and are incorporated herein by reference. Thus, the method recited in claim 8 is met by Karpman and Chen.

Claims 11-17 recite a system with elements corresponding to the steps recited in method claims 1-7, respectively.
Therefore, the recited elements of these claims are mapped to the proposed combination in the same manner as the corresponding steps in their corresponding method claims. Additionally, the rationale and motivation to combine the Karpman and Chen references, presented in the rejection of claim 1, apply to these claims. The combination of Karpman and Chen thus discloses all of the elements claimed in claims 11-17.

In addition, the combination of Karpman and Chen discloses a "pre-trained text encoder", "a pre-trained diffusion model", and "a reward model" (Karpman, Column 13, Lines 35-38 discloses; "More specifically, for each text caption in the modified training corpus, the system can: execute a first pre-trained text encoder in the set of pre-trained text encoders 118". Karpman, Column 15, Lines 36-39 discloses; "the system 100 can execute a reinforcement learning algorithm to: input the text prompt to the (pre-trained) text-to-image diffusion model 112 to generate a first image". Karpman, Column 8, Lines 25-27 discloses; "A reward model 114 that is pre-trained on aggregated human assessments of quality and preferability for images created by the text-to-image diffusion model 112".).

Claims 19 and 20 recite a computer-readable storage medium storing a program with instructions corresponding to the steps recited in claims 1 and 2, respectively. Therefore, the recited programming instructions of these claims are mapped to the proposed combination in the same manner as the corresponding steps in their corresponding method claims. Additionally, the rationale and motivation to combine the Karpman and Chen references, presented in the rejection of claim 1, apply to these claims.
Finally, the combination of Karpman and Chen discloses a computer-readable storage medium (Karpman, Column 24, Lines 20-23 discloses; "Instructions for performing such operations may be embodied in the memory 503, on one or more non-transitory computer readable media, or on some other storage device.").

Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Karpman in view of Chen, and in further view of Baldrati et al. ("Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features").

Regarding claim 9, the combination of Karpman in view of Chen teaches the method for tuning a text-to-image neural network as claimed in claim 1, wherein the pre-trained text encoder comprises a contrastive language-image pre-training (CLIP) model. The combination of Karpman and Chen does not teach setting similarity of the CLIP model as always on and maintaining the weights of the CLIP model greater than zero. Since the combination of Karpman and Chen does not explicitly disclose these limitations, the Examiner relies on the teachings of Baldrati, in an analogous field of endeavor.

Specifically, Baldrati discloses "setting similarity of the CLIP model as an always on constraint" (Baldrati, Section 3 discloses; "The goal is to retrieve target images that satisfy similarity constraints imposed by both the input components") "while maintaining weights of the CLIP model greater than zero" (Baldrati, Section 4.1 discloses; "we employed AdamW optimizer [38] with a learning rate of 2e-6 and a weight decay coefficient of 1e-2." The Examiner interprets the weight decay to be equivalent to the weights of the CLIP model).

Karpman, Chen, and Baldrati are all considered analogous to the claimed invention because they are all in the same field of generative text-to-image neural networks.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Karpman and Chen used to reject claim 1 to incorporate the teachings of Baldrati and include setting similarity of the CLIP model as an always-on constraint and maintaining weights of the CLIP model greater than zero (Baldrati, Section 4.1). Doing so would allow the device to calculate the similarity between the text and images, allowing the device to adjust the weights accordingly.

Claim 18 recites a system with elements corresponding to the steps recited in claim 9. Therefore, the recited elements of this claim are mapped to the proposed combination in the same manner as the corresponding steps in its corresponding method claim. Additionally, the rationale and motivation to combine the Karpman, Chen, and Baldrati references, presented in the rejection of claim 9, apply to this claim. Finally, the combination of Karpman, Chen, and Baldrati references discloses all of the elements claimed in claim 18.

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Karpman in view of Chen, in further view of Baldrati, and in still further view of Zbinden ("Implementing and Experimenting with Diffusion Models for Text-to-Image Generation").

Regarding claim 10, the combination of Karpman in view of Chen, in further view of Baldrati, teaches the method for tuning a text-to-image neural network and CLIP settings as claimed in claims 1 and 9, including setting the similarity of the CLIP model as an always-on constraint while maintaining the weights of the CLIP model greater than zero. The combination of Karpman, Chen, and Baldrati does not teach maximizing cosine similarity between the textual and image embeddings. Since the combination of Karpman, Chen, and Baldrati does not explicitly disclose this limitation, the Examiner relies on the teachings of Zbinden, in an analogous field of endeavor.
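For context on the limitation at issue in claim 10, maximizing cosine similarity between a text embedding and an image embedding can be sketched as follows. This is an illustrative toy, assuming 2-D stand-in embeddings and a simple interpolation step; it is not CLIP's, Zbinden's, or the applicant's actual training code.

```python
import math

# Toy illustration of driving up the cosine similarity between a text
# embedding and an image embedding (the CLIP-style objective).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nudge(u, v, lr=0.1):
    # One update step: move the image embedding toward the text embedding,
    # which here increases their cosine similarity.
    return [a + lr * (b - a) for a, b in zip(u, v)]

text_emb = [1.0, 0.0]
img_emb = [0.0, 1.0]     # starts orthogonal to the text embedding
before = cosine(img_emb, text_emb)
for _ in range(50):
    img_emb = nudge(img_emb, text_emb)
after = cosine(img_emb, text_emb)
```

Starting from orthogonal vectors (cosine 0), repeated steps pull the image embedding into alignment with the text embedding, pushing the cosine toward 1; contrastive training optimizes the encoders toward the same alignment rather than the embeddings directly.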
Specifically, Zbinden discloses "maximizing cosine similarity between textual embedding by the pre-trained text encoder and image embeddings by the pre-trained diffusion model" (Zbinden, page 21, Para. 2 discloses; "The two encoders are jointly trained in a contrastive way, by maximizing the cosine similarity between two embeddings").

Karpman, Chen, Baldrati, and Zbinden are all considered analogous to the claimed invention because they are all in the same field of generative text-to-image neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Karpman and Chen used to reject claim 1 and the teachings of Baldrati used to reject claim 9 to incorporate the teachings of Zbinden and set the CLIP encoder to maximize cosine similarity (Zbinden, page 21, Para. 2). Doing so would allow the device to create images that are more similar to the text embeddings.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUSTIN M. OAKES, whose telephone number is (571) 272-9379. The examiner can normally be reached 7:30am-5pm.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Amandeep Saini, can be reached at (571) 272-3382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center.
Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JUSTIN M OAKES/
Examiner, Art Unit 2662

/Siamak Harandi/
Primary Examiner, Art Unit 2662

Prosecution Timeline

Jan 08, 2024
Application Filed
Jan 23, 2026
Non-Final Rejection — §103 (current)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: Favorable
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 0 resolved cases by this examiner. Grant probability derived from career allow rate.
