Prosecution Insights
Last updated: April 19, 2026
Application No. 18/653,469

GENERATING SYNTHETIC CAPTIONS FOR TRAINING TEXT-TO-IMAGE GENERATIVE MODELS

Non-Final OA (§102, §103)
Filed: May 02, 2024
Examiner: VAZ, JANICE EZVI
Art Unit: 2667
Tech Center: 2600 — Communications
Assignee: Databricks Inc.
OA Round: 1 (Non-Final)
Grant Probability: 77% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 1m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 77% (48 granted / 62 resolved; +15.4% vs TC avg) — above average
Interview Lift: +27.5% among resolved cases with an interview (strong)
Typical Timeline: 3y 1m average prosecution; 21 applications currently pending
Career History: 83 total applications across all art units

Statute-Specific Performance

§101: 9.2% (-30.8% vs TC avg)
§103: 45.8% (+5.8% vs TC avg)
§102: 36.5% (-3.5% vs TC avg)
§112: 8.5% (-31.5% vs TC avg)

Tech Center averages are estimates. Based on career data from 62 resolved cases.
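As a quick consistency check on the statute-specific figures (plain Python; the dashboard's formula is an assumption, taken here as delta = examiner rate minus Tech Center average), each displayed delta implies the same baseline:

```python
# Figures as displayed in the Statute-Specific Performance panel.
examiner_rate = {"§101": 9.2, "§103": 45.8, "§102": 36.5, "§112": 8.5}
delta_vs_tc   = {"§101": -30.8, "§103": 5.8, "§102": -3.5, "§112": -31.5}

# Implied Tech Center average estimate per statute
# (assuming delta = examiner rate - TC average).
implied_tc_avg = {s: round(examiner_rate[s] - delta_vs_tc[s], 1)
                  for s in examiner_rate}
print(implied_tc_avg)  # every statute implies a 40.0% baseline
```

All four deltas back out to a single 40.0% Tech Center baseline, consistent with the chart drawing one shared average line rather than a per-statute average.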

Office Action

§102 §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 1, 3, 8, 10, 15, and 17 are rejected under 35 U.S.C. 102(a)(2) as being unpatentable by Im (US 20240177507 A1).

Regarding Claim 1, representative of Claims 8 and 15, Im teaches a method, comprising:

- obtaining a first machine learning model, the first machine learning model being a pre-trained model trained to map content of a high-dimensional data modality to a low-dimensional data modality, wherein the low-dimensional data modality is a text caption of the content ([0056]: S200 of controlling an image-to-text translation model so that the image-to-text translation model learns a function of extracting text related to the content of an image);
- obtaining a set of training data, wherein the set of training data comprises at least one or more uncaptioned training examples that are of the high-dimensional data modality ([0060]: generating text from an image according to an example embodiment of the present disclosure may further include an operation S300 of generating image-text synthetic data for the image by combining image and text information; [0062]: generating text/image synthetic data so that a text-based image may be generated from an image-only dataset which has only images without text information);
- generating a set of synthetic captions for the training data by applying the first machine learning model to the set of training data, wherein each synthetic caption describes a respective training example of the set of training data in the low-dimensional data modality ([0060] and [0062], quoted above; see also the Fig. 5 caption describing the image); and
- for one or more iterations, training a second machine learning model, the second machine learning model trained to map content of the low-dimensional data modality to the high-dimensional data modality ([0066]: Virtual synthetic data which describes the image may be generated for the image-only dataset without text information using the back-translation model (S300). The text-based image generation model may be trained using virtual text/image pair data generated as described above (S400)), by:
  - generating a set of estimations by applying a set of parameters of the second machine learning model to the synthetic captions for a batch of the training examples, wherein the set of estimations include reconstructions of the content of the high-dimensional data modality for the batch of training examples ([0089]: E is an encoder, G is a decoder, Z is a codebook, and D is a discriminator. L.sub.VQ is a loss function related to codebook learning which is set so that loss is reduced when an image is reconstructed in an encoding and decoding process);
  - computing a loss for the batch of training examples based on the set of estimations, the loss representing the difference between the estimations and the batch of training examples ([0089], quoted above); and
  - updating the set of parameters of the second model to reduce the loss ([0089]: Training may be performed to reduce the sum of the two loss functions).

Regarding Claim 3, representative of Claims 10 and 17, Im teaches the method of claim 1. In addition, Im teaches wherein the first machine learning model is an image-to-text model ([0056]: S200 of controlling an image-to-text translation model; [0060]: S300 of generating image-text synthetic data) and the second machine learning model is a text-to-image model ([0066]: text-based image generation model may be trained using virtual text/image pair data generated as described above (S400)).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 2, 5-6, 9, 12-13, 16, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Im (US 20240177507 A1) in view of Karpman (US 11995803 B1).

Regarding Claim 2, representative of Claims 9 and 16, Im teaches the method of claim 1. In addition, Im teaches a first and second machine learning model. However, Im does not explicitly teach the remaining limitations of Claim 2. Karpman teaches wherein the first machine learning model is trained on a different set of training data than the second machine learning model, wherein the different set of training data for the first machine learning model has a higher number of training examples than the set of training data for the second machine learning model ([0047]: the system can first execute the captioner module on the training corpus and then execute the filter module on the resulting training set in order to remove image-text pairs that include either misaligned alternate text or misaligned (e.g., inaccurate) synthetic captions generated by the captioner module, thereby increasing overall caption fidelity of the training corpus but yielding a (slightly) smaller set of training examples. Examiner notes the modification of Im with Karpman's synthetic caption filter would reduce the resulting image/caption pairs generated for a training dataset).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have modified the teachings of Im to include the teachings of Karpman by evaluating the generated synthetic captions and filtering those determined to be misaligned or inaccurate. Doing so would improve the fidelity of the training data used in Im's secondary processing of text-to-image generation.

Regarding Claim 5, representative of Claims 12 and 19, Im teaches the method of claim 1. However, Im does not explicitly teach the remaining limitations of Claim 5. Karpman teaches wherein the set of training data comprises both captioned and uncaptioned training examples that are of the high-dimensional data ([0046]: system can automatically infer, generate, and/or bootstrap high-fidelity text captions for images (e.g., images without associated alternate text) retrieved by the web intelligence engine 108 and identify and correct noisy, incorrect and/or mismatched image-alternate-text pairs retrieved by the web intelligence engine 108 in order to create a large set of instructive text-image training examples. Examiner notes captions are generated for images without alternate text (uncaptioned) and existing image-alternate-text pairs may be corrected (captioned)).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have modified the teachings of Im to include the teachings of Karpman by substituting the dataset of uncaptioned images for a dataset that may include uncaptioned and captioned images. Doing so would provide the predictable result of providing a training dataset.

Regarding Claim 6, representative of Claims 13 and 20, Im teaches the method of claim 1. However, Im does not explicitly teach the remaining limitations of Claim 6. Karpman teaches further comprising: receiving a confidence level associated with each synthetic caption; and filtering the training data to exclude training examples where the confidence level of the corresponding synthetic caption does not exceed a threshold confidence level ([0047]: the system can first execute the captioner module on the training corpus and then execute the filter module on the resulting training set in order to remove image-text pairs that include either misaligned alternate text or misaligned (e.g., inaccurate) synthetic captions generated by the captioner module, thereby increasing overall caption fidelity).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have modified the teachings of Im to include the teachings of Karpman by evaluating the generated synthetic captions and filtering those determined to be misaligned or inaccurate. Doing so would improve the fidelity of the training data.

Claim(s) 4, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Im (US 20240177507 A1) in view of Jayaswal (V. Jayaswal, S. Ji, Satyankar, V. Singh, Y. Singh and V. Tiwari, "Image Captioning Using VGG-16 Deep Learning Model," 2024 2nd International Conference on Disruptive Technologies (ICDT), Greater Noida, India, 2024, pp. 1428-1433, doi: 10.1109/ICDT61202.2024.10489470).

Regarding Claim 4, representative of Claims 11 and 18, Im teaches the method of claim 3. However, Im does not explicitly teach the remaining limitations of Claim 4. Jayaswal teaches wherein the uncaptioned training examples are images and further comprising pre-processing the images to reduce resolutions of the images before applying the first machine learning model ([Section V.A]: the image captioning process involves preprocessing and data loading to convert unprocessed image and text data into a machine learning model. Preprocessing [17] involves cropping).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have modified the teachings of Im to include the teachings of Jayaswal by including a preprocessing step of cropping images before captioning them. Doing so would reduce the amount of data to be processed, thereby improving processing speed.

Claim(s) 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Im (US 20240177507 A1) in view of Bai (US 20250029289 A1).

Regarding Claim 7, representative of Claim 14, Im teaches the method of claim 1. However, Im does not explicitly teach the remaining limitations of Claim 7. Bai teaches wherein the first machine learning model is a pre-trained text-to-image model trained to map content of a low-dimensional data modality to a high-dimensional data modality, wherein the high-dimensional data modality is an image caption of the content ([0034]: In embodiments of the present disclosure, the model training system 200 uses a generative model 210 for image generation, to generate a plurality of sample images 214-1, 214-2, . . . , 214-N… sample images 214 may also be referred to as synthetic images, and can be used as training data of the target model 232), and wherein the second machine learning model is an image-to-text model ([0033]: target model 232 may be configured for… image-text retrieval).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have modified Im by substituting the first and second models for the first and second models of Bai. Doing so would improve the accuracy of an image-text retrieval task by enabling both image-to-text model training and text-to-image model training depending on available training data.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JANICE VAZ, whose telephone number is (703) 756-4685. The examiner can normally be reached Monday-Friday, 9:00-5:00 pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Matthew Bella, can be reached at (571) 272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JANICE E. VAZ/
Examiner, Art Unit 2667

/MATTHEW C BELLA/
Supervisory Patent Examiner, Art Unit 2667
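The Claim 1 mapping describes a two-stage pipeline: a pre-trained image-to-text model produces synthetic captions for an uncaptioned image set, and a second text-to-image model is then trained on the resulting pairs by reconstructing each image from its caption and updating parameters to reduce the loss. A toy NumPy sketch of that loop (all architectures here are hypothetical linear/tanh stand-ins; neither the claims nor the cited references specify them):

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_TXT = 8, 4                        # high- and low-dimensional modalities
PROJ = rng.normal(size=(D_IMG, D_TXT))

# First model (pre-trained, frozen): a stand-in image-to-text captioner.
def caption(images: np.ndarray) -> np.ndarray:
    return np.tanh(images @ PROJ)          # synthetic caption embeddings

# Training data: uncaptioned images only.
images = rng.normal(size=(32, D_IMG))
synthetic_captions = caption(images)       # one synthetic caption per example

# Second model: linear text-to-image mapper, trained from scratch.
W = np.zeros((D_TXT, D_IMG))               # parameters of the second model
lr = 0.1
for _ in range(500):
    batch = rng.choice(32, size=8, replace=False)
    estimations = synthetic_captions[batch] @ W    # reconstructed images
    err = estimations - images[batch]
    loss = (err ** 2).mean()                       # batch reconstruction loss
    grad = synthetic_captions[batch].T @ err / len(batch)
    W -= lr * grad                                 # update to reduce the loss

initial_loss = (images ** 2).mean()                # loss at W = 0
final_loss = ((synthetic_captions @ W - images) ** 2).mean()
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}")
```

The same shape appears in the examiner's characterization of Im's S200/S300/S400 flow; a real system would use a learned captioning model and a diffusion or VQ-based generator rather than these stand-ins.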

Prosecution Timeline

May 02, 2024: Application Filed
Feb 19, 2026: Non-Final Rejection, §102 and §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602831: METHOD AND SYSTEM FOR ENHANCING IMAGES USING MACHINE LEARNING (granted Apr 14, 2026; 2y 5m to grant)
Patent 12602811: IMAGE PROCESSING SYSTEM (granted Apr 14, 2026; 2y 5m to grant)
Patent 12602935: DRIVING ASSISTANCE DEVICE AND DRIVING ASSISTANCE METHOD (granted Apr 14, 2026; 2y 5m to grant)
Patent 12591847: SYSTEMS AND METHODS OF TRANSFORMING IMAGE DATA TO PRODUCT STORAGE FACILITY LOCATION INFORMATION (granted Mar 31, 2026; 2y 5m to grant)
Patent 12591977: AUTOMATICALLY AUTHENTICATING AND INPUTTING OBJECT INFORMATION (granted Mar 31, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 77%
With Interview: 99% (+27.5% lift)
Median Time to Grant: 3y 1m
PTA Risk: Low
Based on 62 resolved cases by this examiner. Grant probability derived from career allow rate.
