Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on January 7, 2026, has been entered.
Allowable Subject Matter
Claims 8 and 14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim Objections
Claim 13 is objected to because of the following informalities: In claim 13, line 1 recites “A of claim 9, wherein …”. It appears that this is a typographical error and that the claim was intended to recite: “A system of claim 9, wherein …”. Appropriate correction is required.
Response to Arguments
Applicant’s arguments, filed January 7, 2026, with respect to how the newly amended claim features differ from the prior art cited in the last Office action have been fully considered. These arguments are found to be persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground of rejection is made in this Office action.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-3, 5-6, 9-10, 12-13, 15-17, and 19-24 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhou (NPL Doc, “LAFITE: Towards Language-Free Training for Text-to-Image Generation”).
As per claim 1, Zhou teaches the claimed:
1. A method comprising:
generating a plurality of generated text-image pairs by inputting a plurality
of bare images (Zhou in section 3, 1st paragraph “A natural idea to avoid human captioning in constructing image-text pair training data is using an off-the-shelf image captioning model that can automatically generate captions for the collected training images … In this paper, we resort to solving an easier problem: one may directly generate text features rather than text descriptions, to avoid the use of image captioning models.”
In this passage, the “collected training images” correspond to the claimed “a plurality of bare images”. These bare images are used to generate text-image pairs by automatically generating text captions or text features for these images) to a pre-trained multimodal model having a text encoder and an image encoder implementing a multimodal embedding space (Zhou in section 3, 2nd paragraph “Throughout the paper, (x, t) denotes an image-text pair, x′ is the corresponding generated image of t … We use fimg and ftxt to denote the pre-trained text encoder and image encoder, which map text descriptions and image samples into a joint multi-modal feature space”) in which cosine similarity of the plurality of bare images in the multimodal embedding space to a plurality of text, respectively, is used to generate the text-image pairs (Zhou in section 3.1, 1st paragraph “The cosine similarity between matched image-text features is maximized, while cosine similarity of the mis-matched pair is minimized. This naturally provides a high-dimensional hyper-sphere2 for the multimodal features, where paired image-text should be close to each other, with a small angle between their feature vectors”); and
training a text-to-image generation model using the plurality of generated text-image pairs (Zhou in the abstract “One of the major challenges in training text-to-image generation models is the need of a large number of high quality image-text pairs … In this paper, we propose the first work to train text-to-image generation models” and the 2nd page upper middle of the 1st column “To the best of our knowledge, LAFITE is the first work that enables the language-free training for the text-to image generation task. We propose two novel schemes to construct pseudo image-text feature pairs, and conduct comprehensive study for the new setting. The effectiveness is validated with quantitative results on several datasets with different training schemes (training from scratch and fine-tuning from pre-trained generative models)”), the text-to-image generation model trained as a generative adversarial network (GAN) (Zhou in section 3.2, 1st paragraph “We propose to adapt the unconditional StyleGAN2 to a conditional generative model for our goal. Note that although we discuss our model in a language-free setting, it can be directly generalized to standard text-to-image generation by using h (real text feature) instead of h′ (pseudo text feature)”) having a generator and discriminator to train the generator to produce an image based upon text (Zhou in section 3, 2nd paragraph “Throughout the paper, (x, t) denotes an image-text pair, x′ is the corresponding generated image of t. G and D denote the generator and discriminator respectively … Our idea to achieve language-free training is to generate pseudo text features h′, which aims to approximating h, by leveraging the image text feature alignment of a pre trained model. The generated features are then fed into the text-to-image generator to synthesize the corresponding images”. In this instance, the synthesized corresponding images correspond to the claimed “to produce an image based upon text”).
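For illustration only (this sketch is not part of the Zhou reference or the claims), the cosine-similarity pairing mechanism cited above can be shown in a minimal Python example; the toy feature vectors and the helper `pair_images_with_text` are hypothetical stand-ins for the outputs of pre-trained image and text encoders such as those Zhou describes:

```python
import numpy as np

def normalize(v):
    # Project feature vectors onto the unit hypersphere, as in a joint
    # multimodal embedding space, so a dot product equals cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def pair_images_with_text(image_feats, text_feats):
    """For each bare-image feature, pick the candidate text whose embedding
    has the highest cosine similarity, yielding generated text-image pairs."""
    img = normalize(np.asarray(image_feats, dtype=float))
    txt = normalize(np.asarray(text_feats, dtype=float))
    sims = img @ txt.T          # cosine-similarity matrix (images x texts)
    return sims.argmax(axis=1)  # index of best-matching text per image

# Toy 2-D features standing in for encoder outputs (not real embeddings).
images = [[1.0, 0.1], [0.1, 1.0]]
texts = [[0.9, 0.0], [0.0, 0.9]]
print(pair_images_with_text(images, texts))  # image 0 -> text 0, image 1 -> text 1
```

The sketch reflects the principle in the quoted passage from section 3.1: matched image-text features have a small angle between their vectors, so selecting the maximum-cosine-similarity text pairs each bare image with its closest text.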
As per claim 2, Zhou teaches the claimed:
2. A method as in claim 1, wherein the pre-trained multimodal model includes the image encoder and the text encoder and has been trained with a set of text-image pairs, the set of text-image pairs including at least 10,000,000 text-image pairs (Zhou in section 4, 1st paragraph “Arguably, our pretraining dataset CC3M is much smaller4, compared to the pre-training dataset used in DALL-E” and the footnote 4 located on the bottom of this page which recites: “4Though we acknowledge that LAFITE is based on an off-the-shelf discriminate model CLIP, which is trained on 400 million image-text pairs”).
As per claim 3, Zhou teaches the claimed:
3. A method as in claim 1, wherein the plurality of bare images includes at least 1,000,000 images and the plurality of generated text- image pairs includes at least 1,000,000 generated text-image pairs (Zhou in section 4, 1st paragraph: “Arguably, our pretraining dataset CC3M is much smaller4, compared to the pre-training dataset used in DALL-E” and the footnote 4 located on the bottom of this page which recites: “4Though we acknowledge that LAFITE is based on an off-the-shelf discriminate model CLIP, which is trained on 400 million image-text pairs”. Zhou in section 3, 1st paragraph “A natural idea to avoid human captioning in constructing image-text pair training data is using an off-the-shelf image captioning model that can automatically generate captions for the collected training images … In this paper, we resort to solving an easier problem: one may directly generate text features rather than text descriptions, to avoid the use of image captioning models.”
In this passage, the “collected training images” correspond to the claimed “a plurality of bare images”. These bare images are used to generate text-image pairs by automatically generating text captions or text features for these images.
Also, please see Zhou in section 3.4, 2nd paragraph: “Pre-training. To demonstrate the zero-shot task transfer ability of our model, we also consider a variant that is pretrained on the Google Conceptual Captions 3M (CC3M) dataset [41], which consists of 3.3 millions of image-text pairs … The pre-trained models can be fine-tuned with LAFITE under language-free setting on different datasets”).
As per claim 5, Zhou teaches the claimed:
5. A method as in claim 1, wherein the training of the text-to-image generation model is accomplished using the generated text-image pairs generated by the pre-trained multimodal model (Zhou in the abstract “One of the major challenges in training text-to-image generation models is the need of a large number of high quality image-text pairs … In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multimodal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated via generating text features from image features” and on the 2nd page upper middle of the 1st column “To the best of our knowledge, LAFITE is the first work that enables the language-free training for the text-to image generation task. We propose two novel schemes to construct pseudo image-text feature pairs, and conduct comprehensive study for the new setting. The effectiveness is validated with quantitative results on several datasets with different training schemes (training from scratch and fine-tuning from pre-trained generative models)”).
As per claim 6, Zhou teaches the claimed:
6. A method as in claim 1, wherein the text-image pairs are not manually created (Zhou in section 3, 1st paragraph “A natural idea to avoid human captioning in constructing image-text pair training data is using an off-the-shelf image captioning model that can automatically generate captions for the collected training images … In this paper, we resort to solving an easier problem: one may directly generate text features rather than text descriptions, to avoid the use of image captioning models.” Also, please see Zhou in section 3.2, 1st paragraph “… Note that although we discuss our model in a language-free setting, it can be directly generalized to standard text-to-image generation by using h (real text feature) instead of h′ (pseudo text feature)”.
In this passage, since the text captions are automatically generated, the resulting text-image pairs are not manually created).
Regarding claim 9, this claim is similar in scope to limitations recited in claim 1, and thus is rejected under the same rationale. Zhou teaches the claimed: the text-to-image generation model being trained using the plurality of bare images and the plurality of generated text-image pairs: Zhou in the abstract “One of the major challenges in training text-to-image generation models is the need of a large number of high quality image-text pairs … In this paper, we propose the first work to train text-to-image generation models” and the 2nd page upper middle of the 1st column “To the best of our knowledge, LAFITE is the first work that enables the language-free training for the text-to image generation task. We propose two novel schemes to construct pseudo image-text feature pairs, and conduct comprehensive study for the new setting. The effectiveness is validated with quantitative results on several datasets with different training schemes (training from scratch and fine-tuning from pre-trained generative models)”. The bare images are used to help the training process. For example, please see Zhou in section 3, 1st paragraph “A natural idea to avoid human captioning in constructing image-text pair training data is using an off-the-shelf image captioning model that can automatically generate captions for the collected training images … In this paper, we resort to solving an easier problem: one may directly generate text features rather than text descriptions, to avoid the use of image captioning models.”
In this passage, the “collected training images” correspond to the claimed “a plurality of bare images”. These bare images are used to generate text-image pairs (which are used for training) by automatically generating text captions or text features for these images.
Regarding claim 10, this claim is similar in scope to limitations recited in claims 2 and 3, and thus is rejected under the same rationale.
Regarding claims 12 and 13, these claims are similar in scope to limitations recited in claims 5 and 6, respectively, and thus are rejected under the same rationale.
As per claim 15, the reasons and rationale for the rejection of claim 1 are incorporated herein.
Zhou teaches the claimed: A non-transitory computer readable storage media (The system of Zhou would necessarily include some type of non-transitory computer readable storage media in order to function and run on a computer-based system as described by the reference).
Regarding claims 16-17, and 19, these claims are similar in scope to limitations recited in claims 2-3, and 5, respectively, and thus are rejected under the same rationale.
As per claim 20, Zhou teaches the claimed:
20. A non-transitory computer-readable storage media as in claim 15, wherein the training of the text-to-image generation model is performed independent of manually created text-image pairs (Zhou in section 3, 1st paragraph “A natural idea to avoid human captioning in constructing image-text pair training data is using an off-the-shelf image captioning model that can automatically generate captions for the collected training images … In this paper, we resort to solving an easier problem: one may directly generate text features rather than text descriptions, to avoid the use of image captioning models.” Also, please see Zhou in section 3.2, 1st paragraph “… Note that although we discuss our model in a language-free setting, it can be directly generalized to standard text-to-image generation by using h (real text feature) instead of h′ (pseudo text feature)”
In this passage, since the training is being performed using images with automatically generated text captions (text-image pairs), this training of the text-to-image generation model is performed independent of manually created text-image pairs).
As per claim 21, Zhou teaches the claimed:
21. A method as in claim 1, wherein the pre-trained multimodal model has few-shot or zero-shot capability with few-shot meaning less than 100 (Zhou in section 4.2, 1st paragraph “Zero shot is a setting to evaluate a pre-trained text-to-image generation model, without training the model on any of downstream data. MS-COCO dataset is used for evaluating our model pre-trained on CC3M. The main results are shown in Table 2. Compared to DALL-E [38] and CogView [7], LAFITE achieves better quantitative results in most cases”).
Regarding claims 22 and 23, these claims are similar in scope to limitations recited in claim 21, and thus are rejected under the same rationale.
Regarding claim 24, this claim is similar in scope to limitations recited in claim 20, and thus is rejected under the same rationale.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL F HAJNIK whose telephone number is (571) 272-7642. The examiner can normally be reached Mon-Fri 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DANIEL F HAJNIK/Supervisory Patent Examiner, Art Unit 2616