DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claim(s) 1, 3, 8, 10, 15, and 17 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Im (US 20240177507 A1).
Regarding Claim 1, representative of Claims 8 and 15, Im teaches a method, comprising:
obtaining a first machine learning model, the first machine learning model being a pre-trained model trained to map content of a high-dimensional data modality to a low-dimensional data modality, wherein the low-dimensional data modality is a text caption of the content ([0056] S200 of controlling an image-to-text translation model so that the image-to-text translation model learns a function of extracting text related to the content of an image);
obtaining a set of training data, wherein the set of training data comprises at least one or more uncaptioned training examples that are of the high-dimensional data modality ([0060]: generating text from an image according to an example embodiment of the present disclosure may further include an operation S300 of generating image-text synthetic data for the image by combining image and text information, [0062]: generating text/image synthetic data so that a text-based image may be generated from an image-only dataset which has only images without text information);
generating a set of synthetic captions for the training data by applying the first machine learning model to the set of training data, wherein each synthetic caption describes a respective training example of the set of training data in the low-dimensional data modality ([0060]: generating text from an image according to an example embodiment of the present disclosure may further include an operation S300 of generating image-text synthetic data for the image by combining image and text information, [0062]: generating text/image synthetic data so that a text-based image may be generated from an image-only dataset which has only images without text information, see Fig. 5 caption describing the image); and
for one or more iterations, training a second machine learning model, the second machine learning model trained to map content of the low-dimensional data modality to the high-dimensional data modality ([0066] Virtual synthetic data which describes the image may be generated for the image-only dataset without text information using the back-translation model (S300). The text-based image generation model may be trained using virtual text/image pair data generated as described above (S400)), by:
generating a set of estimations by applying a set of parameters of the second machine learning model to the synthetic captions for a batch of the training examples, wherein the set of estimations include reconstructions of the content of the high-dimensional data modality for the batch of training examples ([0089] E is an encoder, G is a decoder, Z is a codebook, and D is a discriminator. L.sub.VQ is a loss function related to codebook learning which is set so that loss is reduced when an image is reconstructed in an encoding and decoding process),
computing a loss for the batch of training examples based on the set of estimations, the loss representing the difference between the estimations and the batch of training examples ([0089] E is an encoder, G is a decoder, Z is a codebook, and D is a discriminator. L.sub.VQ is a loss function related to codebook learning which is set so that loss is reduced when an image is reconstructed in an encoding and decoding process), and
updating the set of parameters of the second model to reduce the loss ([0089]: Training may be performed to reduce the sum of the two loss functions).
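For illustration only, the training procedure recited in Claim 1, as mapped above to Im's operations S200-S400, can be sketched as a toy Python loop. Every model, function name, and numeric value below is a hypothetical stand-in (a frozen brightness-bucketing "captioner" and a one-parameter-per-caption "generator"), not Im's actual disclosure; updates are shown per example for simplicity, the batched case being analogous.

```python
# Toy sketch of the claimed loop: a frozen first model (image-to-text)
# labels uncaptioned images with synthetic captions, then a second model
# (text-to-image) is trained to reconstruct the images from those captions.

def captioner(image):
    # First model: maps high-dimensional content (a pixel list) to a
    # low-dimensional text caption. Hypothetical stand-in: mean brightness.
    return "bright" if sum(image) / len(image) >= 0.5 else "dark"

def generate(params, caption, size):
    # Second model: reconstructs an image from a caption. Hypothetical
    # stand-in: one learned pixel value per caption.
    return [params[caption]] * size

def train(images, steps=200, lr=0.1):
    captions = [captioner(img) for img in images]   # synthetic captions
    params = {"bright": 0.0, "dark": 0.0}           # second-model parameters
    for _ in range(steps):
        for img, cap in zip(images, captions):
            est = generate(params, cap, len(img))   # estimation/reconstruction
            # Gradient of the MSE loss between estimation and training example
            grad = sum(2 * (e - x) for e, x in zip(est, img)) / len(img)
            params[cap] -= lr * grad                # update to reduce the loss
    return params

params = train([[0.9, 0.8, 1.0], [0.1, 0.2, 0.0]])  # uncaptioned training set
```

Under these assumptions the learned parameters converge toward the mean pixel value of each caption's training examples, i.e. the loss-reducing reconstruction.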
Regarding Claim 3, representative of Claims 10 and 17, Im teaches the method of claim 1. In addition, Im teaches wherein the first machine learning model is an image-to-text model ([0056]: S200 of controlling an image-to-text translation model, [0060]: S300 of generating image-text synthetic data) and the second machine learning model is a text-to-image model ([0066]: text-based image generation model may be trained using virtual text/image pair data generated as described above (S400)).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 2, 5-6, 9, 12-13, 16, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Im (US 20240177507 A1) in view of Karpman (US 11995803 B1).
Regarding Claim 2, representative of Claims 9 and 16, Im teaches the method of claim 1. In addition, Im teaches a first and second machine learning model. However, Im does not explicitly teach the remaining limitations of Claim 2. Karpman teaches wherein the first machine learning model is trained on a different set of training data than the second machine learning model, wherein the different set of training data for the first machine learning model has a higher number of training examples than the set of training data for the second machine learning model ([0047]: the system can first execute the captioner module on the training corpus and then execute the filter module on the resulting training set in order to remove image-text pairs that include either misaligned alternate text or misaligned (e.g., inaccurate) synthetic captions generated by the captioner module, thereby increasing overall caption fidelity of the training corpus but yielding a (slightly) smaller set of training examples. Examiner notes the modification of Im with Karpman’s synthetic caption filter would reduce the resulting image/caption pairs generated for a training dataset).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have modified the teachings of Im to include the teachings of Karpman by evaluating the generated synthetic captions, and filtering those determined to be misaligned/inaccurate. Doing so would improve the fidelity of the training data used in Im’s secondary processing of text to image generation.
Regarding Claim 5, representative of Claims 12 and 19, Im teaches the method of claim 1. However, Im does not explicitly teach the remaining limitations of Claim 5. Karpman teaches wherein the set of training data comprises both captioned and uncaptioned training examples that are of the high-dimensional data ([0046]: system can automatically infer, generate, and/or bootstrap high-fidelity text captions for images (e.g., images without associated alternate text) retrieved by the web intelligence engine 108 and identify and correct noisy, incorrect and/or mismatched image-alternate-text pairs retrieved by the web intelligence engine 108 in order to create a large set of instructive text-image training examples. Examiner notes, captions are generated for images without alternate text (uncaptioned) and existing image-alternate text pairs may be corrected (captioned)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have modified the teachings of Im to include the teachings of Karpman by substituting the dataset of uncaptioned images for a dataset that may include uncaptioned and captioned images. Doing so would provide the predictable result of providing a training dataset.
Regarding Claim 6, representative of Claims 13 and 20, Im teaches the method of claim 1. However, Im does not explicitly teach the remaining limitations of Claim 6. Karpman teaches further comprising: receiving a confidence level associated with each synthetic caption; and filtering the training data to exclude training examples where the confidence level of the corresponding synthetic caption does not exceed a threshold confidence level ([0047]: the system can first execute the captioner module on the training corpus and then execute the filter module on the resulting training set in order to remove image-text pairs that include either misaligned alternate text or misaligned (e.g. inaccurate) synthetic captions generated by the captioner module, thereby increasing overall caption fidelity).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have modified the teachings of Im to include the teachings of Karpman by evaluating the generated synthetic captions, and filtering those determined to be misaligned/inaccurate. Doing so would improve the fidelity of the training data.
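For illustration only, the confidence-threshold filtering discussed for Claim 6 can be sketched as follows; the function, data, threshold value, and tuple layout are hypothetical examples, not Karpman's disclosed filter module.

```python
# Toy sketch of confidence-based filtering of synthetic captions: keep only
# training examples whose caption confidence exceeds a threshold, excluding
# those that do not exceed it (per the claim language).

def filter_by_confidence(examples, threshold):
    """examples: (image_id, synthetic_caption, confidence) triples."""
    return [ex for ex in examples if ex[2] > threshold]

batch = [
    ("img1", "a dog on grass", 0.92),
    ("img2", "a cat indoors", 0.40),   # below threshold: excluded
    ("img3", "a red car", 0.75),
]
kept = filter_by_confidence(batch, threshold=0.5)
```

Filtering out low-confidence captions before training corresponds to the fidelity improvement the rejection relies on.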
Claim(s) 4, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Im (US 20240177507 A1) in view of Jayaswal (V. Jayaswal, S. Ji, Satyankar, V. Singh, Y. Singh and V. Tiwari, "Image Captioning Using VGG-16 Deep Learning Model," 2024 2nd International Conference on Disruptive Technologies (ICDT), Greater Noida, India, 2024, pp. 1428-1433, doi: 10.1109/ICDT61202.2024.10489470).
Regarding Claim 4, representative of Claims 11 and 18, Im teaches the method of claim 3. However, Im does not explicitly teach the remaining limitations of Claim 4. Jayaswal teaches wherein the uncaptioned training examples are images and further comprising pre-processing the images to reduce resolutions of the images before applying the first machine learning model ([Section V. A.]: image captioning process involves preprocessing and data loading to convert unprocessed image and text data into a machine learning model. Preprocessing [17] involves cropping).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have modified the teachings of Im to include the teachings of Jayaswal by including a preprocessing step of cropping images before captioning them. Doing so would reduce the amount of data needed to be processed, thereby improving processing speed.
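For illustration only, the resolution-reducing preprocessing recited in Claim 4 can be sketched as follows; 2x2 average pooling is one hypothetical concrete form of resolution reduction, chosen for the sketch and not drawn from Jayaswal, whose cited preprocessing involves cropping.

```python
# Toy sketch of Claim 4's preprocessing step: halve an image's resolution
# (here by 2x2 average pooling) before running the first (image-to-text)
# model on it, reducing the amount of data to be processed.

def downsample(image):
    """image: list of rows of pixel values; returns a half-resolution image."""
    out = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[r]) - 1, 2):
            block = (image[r][c] + image[r][c + 1]
                     + image[r + 1][c] + image[r + 1][c + 1])
            row.append(block / 4.0)  # average of each 2x2 block
        out.append(row)
    return out

small = downsample([[0, 2, 4, 6],
                    [2, 4, 6, 8],
                    [8, 10, 12, 14],
                    [10, 12, 14, 16]])  # 4x4 input -> 2x2 output
```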
Claim(s) 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Im (US 20240177507 A1) in view of Bai (US 20250029289 A1).
Regarding Claim 7, representative of Claim 14, Im teaches the method of claim 1. However, Im does not explicitly teach the remaining limitations of claim 7.
Bai teaches wherein the first machine learning model is a pre-trained text-to-image model trained to map content of a low-dimensional data modality to a high-dimensional data modality, wherein the high-dimensional data modality is an image caption of the content ([0034] In embodiments of the present disclosure, the model training system 200 uses a generative model 210 for image generation, to generate a plurality of sample images 214-1, 214-2, . . . , 214-N…sample images 214 may also be referred to as synthetic images, and can be used as training data of the target model 232), and wherein the second machine learning model is an image-to-text model ([0033]: target model 232 may be configured for…image-text retrieval).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have modified Im by substituting the first and second models of Bai for Im’s first and second models. Doing so would improve the accuracy of an image-text retrieval task by enabling both image-to-text model training and text-to-image model training depending on available training data.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JANICE VAZ whose telephone number is (703)756-4685. The examiner can normally be reached Monday-Friday, 9:00 am-5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella can be reached at (571) 272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JANICE E. VAZ/Examiner, Art Unit 2667
/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667