Last updated: May 29, 2026

Application No. 18/402,415

IMAGE GENERATION USING TEXT

Final Rejection §102 §103

Filed

Jan 02, 2024

Priority

Dec 18, 2023 — CIP of 18/543,898

Examiner

CHEN, YU

Art Unit

2613

Tech Center

2600 — Communications

Assignee

Nvidia Corporation

OA Round

2 (Final)

Interview Optional

An interview has already been conducted in this application's prosecution history. This examiner has a 68% grant rate with a +29.9% interview lift. Because an interview has already been tried, the recommendation is a written response with narrowed claims, based on precedent claim evolution patterns.

Based on 1063 resolved cases, 2023–2026

Examiner Intelligence

CHEN, YU

Grants 68% — above average

Career Allowance Rate

720 granted / 1063 resolved

+5.7% vs TC avg

Strong +30% interview lift

+29.9%

Interview Lift

resolved cases with interview

Typical timeline

2y 10m

Avg Prosecution

67 currently pending

Career history

1169

Total Applications

across all art units

Statute-Specific Performance

§101

0.9%

-39.1% vs TC avg

§103

77.2%

+37.2% vs TC avg

§102

11.9%

-28.1% vs TC avg

§112

5.5%

-34.5% vs TC avg

Black line = Tech Center average estimate • Based on career data from 1063 resolved cases

Office Action

§102 §103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
Response to Amendment
This is in response to applicant’s amendment/response filed on 01/08/2026, which has been entered and made of record.  Claims 1-7 have been amended.  No claim has been cancelled.  No claim has been added.  Claims 1-20 are pending in the application. 

Response to Arguments
Applicant's arguments filed on 01/08/2026 have been fully considered but they are not persuasive. 
Applicant submits “Yuan fails to teach or suggest at least "circuitry to use one or more neural networks to generate one or more images from text based, at least in part, on one or more first images without text indicating content of the one or more first images and one or more second images with text indicating content of the one or more second images," as recited in claim 1. Yuan discusses an image generation apparatus configured to generate images based on a text prompt. See Yuan at [0004], p. 1. Yuan also mentions that an image search may be based on a text prompt, and then using one or more retrieved images based on a text prompt. Id. However, Yuan is silent at least regarding "one or more first images without text indicating content of the one or more first images," as recited by claim 1, and further silent regarding the one or more processors to use a neural network to generate images from a text prompt from images without any text embeddings indicating their content. Rather, Yuan merely describes retrieving search images from a database based on a query of images with text embeddings and, thus, the system appears incapable of obtaining an image "without text indicating content" of the image, as claimed. Yuan does not anticipate each and every element of amended claim 1. Withdrawal of the pending rejection of claim 1, therefore, is respectfully requested.” (Remarks, Page 6).
The examiner disagrees with Applicant’s premises and conclusion. The examiner mapped the “search images” to “one or more first images without text indicating content of the one or more first images” because a search image is an input without captions. See Fig. 5 and ¶0071. The image with captions is mapped to “one or more second images with text indicating content of the one or more second images” because, when image and text features are embedded into a common space, the image is an image with captions. See ¶0075, “Image and text features are embedded into a common space and image-semantic contrastive losses are designed to force the features of semantically similar input examples to be closer.”
To further explain the examiner’s point of view, and without changing the ground of rejection, Applicant is referred to Cho et al. (US Pub 2023/0153522 A1), ¶0037: “image search apparatus 110 can encode images and generate captions for the images according to a machine learning model.”

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


Claims 1-6, 8-13, 15-19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Yuan et al. (US Pub 2023/0260164 A1).

As to claim 1, Yuan discloses a processor, comprising: one or more circuits to use one or more neural networks to generate one or more images from text based, at least in part, on one or more first images without text indicating content of the one or more first images and one or more second images with text indicating content of the one or more second images (Yuan, Fig. 2, Fig. 5, ¶0004, “The same cross-modal encoder can then encode a text phrase to obtain a text phrase representation. An image search component is used to select a search image from the candidate images by comparing each of the search image representations to the text phrase representation. A search image is used as guidance along with the text phrase to generate a target image.” The search image is a first image without text indicating content. ¶0071, “identifying a training set comprising a set of images and a set of captions corresponding to the images, encoding the images using an image encoder to produce encoded images, encoding the captions using a text encoder to produce encoded text, computing a multi-modal loss function based on the encoded images and the encoded text, the multi-modal loss function comprising at least one image loss term, at least one text loss term, and at least one cross-modal term, and training the image encoder and the text encoder based on the multi-modal loss function.” Image with captions is a second image with text indicating content. ¶0075, “Image features are adjusted according to captions via back propagation, and vice versa.”).
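For orientation only, the retrieval step quoted from Yuan ¶0004 (selecting a search image by comparing each search image representation to the text phrase representation) can be sketched as a nearest-neighbor lookup in a shared embedding space. This is a minimal illustration and not Yuan's implementation: the cosine-similarity measure, the function names, and the toy vectors are all assumptions introduced here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_search_image(text_repr, image_reprs):
    """Return the name of the candidate image whose representation is closest to the text."""
    return max(
        image_reprs,
        key=lambda name: cosine_similarity(text_repr, image_reprs[name]),
    )

# Hypothetical embeddings in a common space (illustrative values only).
text_repr = [0.9, 0.1, 0.0]
image_reprs = {
    "img_a": [0.8, 0.2, 0.1],
    "img_b": [0.0, 0.1, 0.9],
}

best = select_search_image(text_repr, image_reprs)
print(best)
```

Note that the candidate images in this sketch carry no captions; only their encoded representations are compared against the text phrase, which is the point of contention between the examiner's mapping and Applicant's argument.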

As to claim 2, claim 1 is incorporated and Yuan discloses the text indicating content is a caption embedded using one or more encoders (Yuan, ¶0071, “identifying a training set comprising a set of images and a set of captions corresponding to the images, encoding the images using an image encoder to produce encoded images, encoding the captions using a text encoder to produce encoded text, computing a multi-modal loss function based on the encoded images and the encoded text, the multi-modal loss function comprising at least one image loss term, at least one text loss term, and at least one cross-modal term, and training the image encoder and the text encoder based on the multi-modal loss function.”).

As to claim 3, claim 1 is incorporated and Yuan discloses the one or more first images without text indicating content of the one or more first images are compared to the one or more second images based, at least in part, on the one or more neural networks performing one or more loss operations (Yuan, ¶0071, “computing a multi-modal loss function based on the encoded images and the encoded text, the multi-modal loss function comprising at least one image loss term, at least one text loss term, and at least one cross-modal term, and training the image encoder and the text encoder based on the multi-modal loss function.” ¶0072, “the multi-modal loss function includes an image-text contrastive loss. The image-text contrastive loss is based on a distance between a query image and a positive or negative sample of encoded text. For example, a low image-text contrastive loss indicates that a query image may be similar to a phrase of text, and a high image-text contrastive loss indicates that an image may not be similar to a phrase of text, based on the associated encoded query image and the encoded text.” ¶0075, “Image and text features are embedded into a common space and image-semantic contrastive losses are designed to force the features of semantically similar input examples to be closer.” ¶0082, “The visual-semantic contrastive learning involves multi-modal training with multiple types of contrastive losses.” ¶0110, “computing a discriminator contrastive learning loss based on the target image and an original image. Some examples further include updating parameters of the discriminator network based on the discriminator contrastive learning loss.”).
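The behavior described in the quoted ¶0072 (a low image-text contrastive loss when a query image is similar to a phrase of text, a high loss when it is not) can be illustrated with a softmax-based, InfoNCE-style loss. This is a sketch under stated assumptions: the cited passages do not give a formula, so the cosine similarity, the temperature value, and all names below are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def image_text_contrastive_loss(image_emb, text_embs, positive_index, temperature=0.1):
    """Negative log-softmax of the positive text among all candidate texts."""
    logits = [cosine_similarity(image_emb, t) / temperature for t in text_embs]
    # Numerically stable log-sum-exp over the candidate texts.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[positive_index]

# Hypothetical encoded query image and two encoded text phrases.
query_image = [1.0, 0.0]
texts = [[0.9, 0.1], [0.0, 1.0]]

loss_matched = image_text_contrastive_loss(query_image, texts, positive_index=0)
loss_mismatched = image_text_contrastive_loss(query_image, texts, positive_index=1)
print(loss_matched, loss_mismatched)
```

Treating the nearby text as the positive sample yields a loss near zero, while treating the distant text as the positive yields a much larger loss, matching the low/high relationship the quoted passage describes.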

As to claim 4, claim 1 is incorporated and Yuan discloses the one or more neural networks generates the one or more images based, at least in part, on one or more captions of one or more portions of the one or more second images (Yuan, ¶0071, “identifying a training set comprising a set of images and a set of captions corresponding to the images, encoding the images using an image encoder to produce encoded images, encoding the captions using a text encoder to produce encoded text, computing a multi-modal loss function based on the encoded images and the encoded text, the multi-modal loss function comprising at least one image loss term, at least one text loss term, and at least one cross-modal term, and training the image encoder and the text encoder based on the multi-modal loss function.” ¶0075, “Image features are adjusted according to captions via back propagation, and vice versa. After the completion of multi-modal contrastive training, cross-modal encoder 505 can be directly applied to or fine-tuned for cross-modal search 500.”).

As to claim 5, claim 1 is incorporated and Yuan discloses the one or more second images comprise one or more ground truth images and the one or more ground truth images are compared to the one or more first images based, at least in part, on one or more loss operations (Yuan, Fig.3, ¶0044, “The example shown includes text phrase 300, search images 305, ground truth image 310, and target image 315.” ¶0045, “ground truth image 310 is a validation ground truth image.” ¶0060, “training component 420 computes a generator contrastive learning loss based on the target image and an original image, where the parameters of image generation network 440 are updated based on the generator contrastive learning loss. In some examples, training component 420 computes a discriminator contrastive learning loss based on the target image and an original image. In some examples, training component 420 updates parameters of the discriminator network based on the discriminator contrastive learning loss.” ¶0071, “cross-modal encoder 505 (or a retrieval module including cross-modal encoder 505) is pre-trained on cross-modal search tasks using contrastive learning.”).

As to claim 6, claim 5 is incorporated and Yuan discloses the one or more loss operations include a visual reconstruction loss, a contrastive caption loss, or a generative caption loss (Yuan, ¶0060, “training component 420 computes a generator contrastive learning loss based on the target image and an original image, where the parameters of image generation network 440 are updated based on the generator contrastive learning loss. In some examples, training component 420 computes a discriminator contrastive learning loss based on the target image and an original image. In some examples, training component 420 updates parameters of the discriminator network based on the discriminator contrastive learning loss.” ¶0071, “cross-modal encoder 505 (or a retrieval module including cross-modal encoder 505) is pre-trained on cross-modal search tasks using contrastive learning.”).

As to claim 8, Yuan discloses a system, comprising: one or more processors to use one or more neural networks to generate one or more images from text based, at least in part, on one or more first images without text indicating content of the one or more first images and one or more second images with text indicating content of the one or more second images (See claim 1 for detailed analysis.).

As to claim 9, claim 8 is incorporated and Yuan discloses the text indicating content is a caption embedded using one or more encoders (See claim 2 for detailed analysis.).

As to claim 10, claim 8 is incorporated and Yuan discloses the one or more first images without text indicating content of the one or more first images are compared to the one or more second images based, at least in part, on the one or more neural networks performing one or more loss operations (See claim 3 for detailed analysis.).

As to claim 11, claim 8 is incorporated and Yuan discloses the one or more neural networks generates the one or more images based, at least in part, on one or more captions of one or more portions of the one or more second images (See claim 4 for detailed analysis.).

As to claim 12, claim 8 is incorporated and Yuan discloses the one or more second images comprise one or more ground truth images and the one or more ground truth images are compared to the one or more first images based, at least in part, on one or more loss operations (See claim 5 for detailed analysis.).

As to claim 13, claim 12 is incorporated and Yuan discloses the one or more loss operations include a visual reconstruction loss, a contrastive caption loss, or a generative caption loss (See claim 6 for detailed analysis.).

As to claim 15, Yuan discloses a method, comprising: using one or more neural networks to generate one or more images from text based, at least in part, on one or more first images without text indicating content of the one or more first images and one or more second images with text indicating content of the one or more second images (See claim 1 for detailed analysis.).

As to claim 16, claim 15 is incorporated and Yuan discloses the text indicating content is a caption embedded using one or more encoders (See claim 2 for detailed analysis.).

As to claim 17, claim 15 is incorporated and Yuan discloses the one or more first images without text indicating content of the one or more first images are compared to the one or more second images based, at least in part, on the one or more neural networks performing one or more loss operations (See claim 3 for detailed analysis.).

As to claim 18, claim 15 is incorporated and Yuan discloses the one or more neural networks generates the one or more images based, at least in part, on one or more captions of one or more portions of the one or more second images (See claim 4 for detailed analysis.).

As to claim 19, claim 15 is incorporated and Yuan discloses the one or more second images comprise one or more ground truth images and the one or more ground truth images are compared to the one or more first images based, at least in part, on one or more loss operations (See claim 5 for detailed analysis.).


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 7, 14, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yuan et al. (US Pub 2023/0260164 A1) in view of Cho et al. (US Pub 2023/0153522 A1).

As to claim 7, claim 1 is incorporated and Yuan discloses the one or more circuits are to cause the one or more neural networks to be trained based, at least in part, on using one or more visual encoders to embed one or more captions (Yuan, ¶0071, “training cross-modal encoder 505 includes identifying a training set comprising a set of images and a set of captions corresponding to the images, encoding the images using an image encoder to produce encoded images, encoding the captions using a text encoder to produce encoded text, computing a multi-modal loss function based on the encoded images and the encoded text”). 
Yuan does not disclose comparing the one or more captions with one or more ground truth captions.
Cho teaches the one or more neural networks to be trained based, at least in part, on using one or more visual encoders to embed one or more captions and comparing the one or more captions with one or more ground truth captions (Cho, ¶0004, “Image captioning is an NLP task of generating a textual description (i.e., a caption) of an image. Words in a caption can be used to index an image so that it can be retrieved from an image search database. Existing deep learning based approaches for image captioning train an image-conditioned language model on an image-caption dataset. For example, an image captioning model can be trained by maximizing likelihood over ground truth captions, then maximizing n-gram based metrics between predicted captions and ground truth captions.”).
Yuan and Cho are considered to be analogous art because both pertain to natural language processing. It would have been obvious before the effective filing date of the claimed invention to have modified Yuan with the features of “comparing the one or more captions with one or more ground truth captions” as taught by Cho. The claim would have been obvious because the technique for improving a particular class of devices was part of the ordinary capabilities of a person of ordinary skill in the art, in view of the teaching of the technique for improvement in other situations.
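As an illustration of what “n-gram based metrics between predicted captions and ground truth captions” in the quoted Cho ¶0004 might look like, the sketch below computes a clipped n-gram precision, the building block used by BLEU-style scores. The quoted passage does not name a specific metric, so this choice, the whitespace tokenization, and the example captions are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(predicted, reference, n=2):
    """Fraction of predicted n-grams that also occur in the reference.

    Each predicted n-gram is credited at most as many times as it
    appears in the reference (clipping).
    """
    pred_counts = Counter(ngrams(predicted.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total

# Hypothetical predicted caption and ground truth caption.
predicted = "a dog runs on the beach"
ground_truth = "a dog runs along the beach"

score = clipped_ngram_precision(predicted, ground_truth, n=2)
print(score)
```

Three of the five predicted bigrams appear in the ground truth caption here, so the score reflects partial agreement between the two captions.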

As to claim 14, claim 8 is incorporated and the combination of Yuan and Cho discloses the one or more processors are to cause the one or more neural networks to be trained based, at least in part, on using one or more visual encoders to embed one or more captions and comparing the one or more captions with one or more ground truth captions (See claim 7 for detailed analysis.).

As to claim 20, claim 15 is incorporated and the combination of Yuan and Cho discloses causing the one or more neural networks to be trained based, at least in part, on using one or more visual encoders to embed one or more captions and comparing the one or more captions with one or more ground truth captions (See claim 7 for detailed analysis.).

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
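The reply-period rules in the paragraph above can be worked through as date arithmetic. The following is a rough sketch only: it assumes plain calendar-month addition, ignores any carry-over for weekends or federal holidays, and takes the January 28, 2026 mailing date from the Prosecution Timeline on this page. The function names are hypothetical.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def reply_periods(mailing, first_reply=None, advisory_mailed=None):
    """Return (end of shortened statutory period, six-month statutory limit)."""
    shortened = add_months(mailing, 3)
    statutory_max = add_months(mailing, 6)
    # A first reply within two months, followed by an advisory action mailed
    # after the three-month date, moves the shortened period to the advisory
    # action's mailing date.
    if (
        first_reply is not None
        and advisory_mailed is not None
        and first_reply <= add_months(mailing, 2)
        and advisory_mailed > shortened
    ):
        shortened = advisory_mailed
    # The period never runs past six months from the mailing date.
    return min(shortened, statutory_max), statutory_max

mailing_date = date(2026, 1, 28)  # "Final Rejection mailed" per the timeline
shortened, statutory_max = reply_periods(mailing_date)
print(shortened, statutory_max)
```

Under these assumptions the shortened period ends on April 28, 2026 and the six-month limit falls on July 28, 2026; actual due dates should be confirmed against the mailed action.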
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YU CHEN whose telephone number is (571)270-7951.  The examiner can normally be reached on M-F 8-5 PST Mid-day flex.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao Wu, can be reached at 571-272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/YU CHEN/
Primary Examiner, Art Unit 2613

Prosecution Timeline

Aug 12, 2025

Applicant Interview (Telephonic)

Aug 12, 2025

Examiner Interview Summary

Jan 08, 2026

Response Filed

Jan 28, 2026

Final Rejection mailed — §102, §103

Mar 12, 2026

Applicant Interview (Telephonic)

Mar 12, 2026

Examiner Interview Summary

Apr 27, 2026

Request for Continued Examination

May 01, 2026

Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/355,399

Patent 12639861

GENERATING IMAGES BASED ON GENERATED CLUSTERS

2y 10m to grant Granted May 26, 2026

17/484,633

Patent 12628465

METHOD FOR MANUFACTURING INORGANIC LIGHT EMITTER

4y 7m to grant Granted May 12, 2026

18/473,462

Patent 12616076

SEMICONDUCTOR DEVICE

2y 7m to grant Granted Apr 28, 2026

18/545,206

Patent 12615353

VIEW SYNTHESIS UTILIZING SCENE-LEVEL FEATURES AND PIXEL-LEVEL FEATURES

2y 4m to grant Granted Apr 28, 2026

18/539,595

Patent 12610535

SEMICONDUCTOR STRUCTURE INCLUDING A BIT LINE STRUCTURE AND METHOD OF MANUFACTURING THE SAME

2y 4m to grant Granted Apr 21, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

3-4

Expected OA Rounds

68%

Grant Probability

98%

With Interview (+29.9%)

2y 10m (~5m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 1063 resolved cases by this examiner. Grant probability derived from career allowance rate.
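The projection figures can be reproduced from the counts shown on this page. The sketch below assumes that the interview lift is added to the career allowance rate in percentage points and capped at 100%; that combination rule is an inference from the displayed 68% and 98% values, not something the page states.

```python
granted = 720                # "720 granted"
resolved = 1063              # "1063 resolved"
interview_lift_points = 29.9 # "+29.9%" interview lift

# Career allowance rate as a percentage.
allowance_rate = 100 * granted / resolved

# Displayed grant probability, rounded to a whole percent.
grant_probability = round(allowance_rate)

# Assumed additive lift, capped at 100%.
with_interview = min(100, round(allowance_rate + interview_lift_points))

print(f"{allowance_rate:.1f}% -> {grant_probability}% / {with_interview}% with interview")
```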