DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Affidavit under 37 CFR 1.130(a)
The affidavit lists Gabriel Goh alongside 11 other authors. It states that “none of those Co-authors invented the subject matter” and that they merely “contributed to Betker et al., but they did not conceive of or invent the elements.” This affidavit is ineffective because of contrary evidence in the paper itself, its failure to explain Linjie Li’s involvement, and its misspelling of Li Jing’s name without any explanation of Li Jing’s involvement.
The paper notes on page 1 that Gabriel Goh is marked with a symbol (“*”) indicating “Equal contribution.” MPEP § 717.01(a)(1) states, “A mere statement from the inventor or a joint inventor, without any accompanying reasonable explanation, may not be sufficient where there is evidence to the contrary.” An author credited with “equal contribution” on a technical paper describing the same subject matter as the claims creates a reasonable basis to question whether that author contributed to conception of the invention.
It is highly implausible that an author credited with equal contribution on a technical report titled “Improving Image Generation with Better Captions” did not invent the core captioning improvements described in the text. Authorship contribution to a technical paper describing the claimed invention raises a reasonable inference of contribution to conception, which requires explanation to rebut. See MPEP § 717.01(a)(1); Ex parte Kroger, 219 USPQ 370 (Bd. App. 1982). The affidavit must explain how Gabriel Goh and Li Jing could each be credited as equal contributors to the paper yet not be inventors of the core subject matter (recaptioning) relied upon in the rejection. The affidavit also does not explain Li Jing’s involvement, instead swearing that “Li Jeng” had no involvement; in the paper, Li Jing’s name is marked with a symbol (“*”) indicating “Equal contribution.” Finally, the affidavit is silent as to Linjie Li’s involvement. Thus, the affidavit does not successfully invoke the 35 U.S.C. 102(b)(1)(A) exception, and the 35 U.S.C. 102(a)(1) rejection is maintained.
Response to Arguments
Applicant's arguments filed 11/5/2025 have been fully considered but they are not persuasive.
Claims 1-20 are pending in this application and have been considered below.
Arguments:
The applicant argues the claims do not recite mathematical concepts or mental processes because they only "involve" or are "based on" mathematical concepts rather than explicitly reciting them. Applicant distinguishes from Example 47, Claim 2, which explicitly recites "backpropagation algorithm and a gradient descent algorithm."
Examiner’s Response:
This argument is unpersuasive for the following reasons. MPEP § 2106.04(a)(2) does state that a claim merely “based on or involving” a mathematical concept does not automatically recite one. However, “training” and “applying” machine learning models directly recite executing mathematical operations, not merely using the results of mathematics. See SAP America, Inc. v. InvestPic, LLC, 898 F.3d 1161, 1163 (Fed. Cir. 2018).
While applicant correctly notes that the claims do not explicitly recite "backpropagation" or "gradient descent," this does not resolve the issue. The absence of specific algorithm names does not mean the claims avoid reciting mathematical concepts. The claims still describe training and tuning processes that inherently involve mathematical optimization. The specification references mathematical likelihood objectives (see ¶¶ [0052]-[0053]) with explicit mathematical formulas.
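For illustration only, a maximum-likelihood objective of the general kind referenced above may be written in generic notation (this is an illustrative formulation, not the specification’s formula from ¶¶ [0052]-[0053]):

\hat{\theta} = \arg\max_{\theta} \sum_{(t,\,i) \in \mathcal{D}} \log p_{\theta}(t \mid i)

where \theta denotes the model parameters, \mathcal{D} a dataset of image-caption pairs (t, i), and p_{\theta}(t \mid i) the model’s probability of caption t given image i. Maximizing this sum requires iteratively adjusting \theta through a series of mathematical calculations.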
Arguments:
The applicant argues that:
(1) the claims improve computational efficiency by using image captioner models instead of LLMs for large datasets;
(2) the two-stage tuning process improves caption quality, which in turn improves image generation models; and
(3) upsampling improves image generation model outputs.
Examiner’s Response:
None of the claims explicitly recites the improved computational efficiency, the improved image quality, or the improved model accuracy. The claims describe how to create training data but do not tie this to a specific technical improvement in the claim language itself.
The alleged improvements are improvements to the results of the mathematical processes, not improvements to computer functionality. See MPEP § 2106; Intellectual Ventures I LLC v. Capital One Bank (USA), N.A., 792 F.3d 1363, 1366 (Fed. Cir. 2015) ("An abstract idea does not become nonabstract by limiting the invention to a particular field of use or technological environment, such as the Internet [or] a computer").
Arguments:
The applicant argues the multi-faceted tuning approach and use of synthetic captions provide non-routine activity that qualifies as significantly more.
Examiner’s Response:
The applicant conflates technical novelty with the Step 2B inquiry. The tuning stages are part of the judicial exception (mathematical optimization), not additional elements. Novelty of the abstract idea does not render claims eligible. See SAP America, Inc. v. InvestPic, LLC, 898 F.3d at 1163 (“no matter how much of an advance in the field the claims recite, the advance lies entirely in the realm of abstract ideas”).
To show that elements are not well-understood, routine, and conventional, the applicant should provide evidence per MPEP § 2106.07(a)(III) and Berkheimer v. HP, Inc., 881 F.3d 1360, 1368 (Fed. Cir. 2018). Asserting that the approach is "non-conventional" is insufficient.
Therefore, the argued limitations are written broadly such that they read upon the cited references or are shown explicitly by the references. As a result, the claims stand rejected as follows.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to abstract ideas (mathematical concepts) without integration into a practical application and without providing significantly more than the abstract ideas.
Independent Claims 1, 12, and 16 Analysis
Step 1: Statutory Category
Claim 1 recites a "method for enhancing a training dataset for a machine learning model," which is a process. Claims 12 and 16 each recite a system comprising "at least one memory" and "at least one processor," which is a machine. Accordingly, claims 1, 12, and 16 fall within the statutory categories of 35 U.S.C. § 101. (Step 1: YES)
Step 2A, Prong One: Does the Claim Recite a Judicial Exception?
Claims 1, 12, and 16 recite limitations directed to training, tuning, and applying machine learning models to process data:
Claim 1 recites "generating a recaptioned dataset by applying an image captioner model to images in the text-to -image dataset, the image captioner model trained with an image dataset, a first tuning stage, and a second tuning stage"
Claim 12 recites "generating an image captioner model configured to generate captions from input images, the image captioner model trained using a text-to-image dataset"; "performing a first tuning stage ... training the image captioner model using a first set of captions"; "performing a second tuning stage ... training the image captioner using the set of synthetic captions;" and "generating a captioned dataset by applying the tuned image captioner model to images in a dataset."
Claim 16 recites "upsampling the text description with a language model" and "providing the upsampled text description to an image generation model, the image generation model trained with a dataset comprising image-caption pairs, wherein at least a portion of captions are generated with an image captioner model"
These limitations recite mathematical concepts. Under the broadest reasonable interpretation:
Applying a trained model (claims 1, 12, 16) requires executing the mathematical operations learned during training to transform input data into output data. This is the execution of learned mathematical relationships.
Training and tuning machine learning models (claims 1, 12, 16) involves optimization algorithms that iteratively adjust model parameters through mathematical calculations to minimize a loss function, as illustrated in the sketch following this analysis. The specification confirms this at ¶¶ [0052]-[0053], which describe maximizing a likelihood function objective and updating the "θ parameter" during tuning.
Upsampling with a language model (claim 16) involves applying a trained language model to transform input text into expanded output text through learned mathematical transformations.
See MPEP § 2106.04(a)(2); July 2024 SME Example 47 (training an ANN using optimization algorithms encompasses mathematical concepts as these are "optimization algorithms, which compute neural network parameters using a series of mathematical calculations"). The claims recite mathematical concepts falling within the enumerated groupings of abstract ideas. (Step 2A, Prong One: YES)
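For purposes of illustration only, the following minimal sketch (hypothetical code; not taken from the specification or from Betker) shows why training is understood as iterative mathematical calculation: a gradient-descent loop adjusting parameters to reduce a loss function.

import numpy as np

# Hypothetical illustration: gradient descent on a mean-squared-error loss.
# All names (X, y, theta, lr) are generic placeholders, not claim terms.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # input data
y = X @ np.array([1.0, -2.0, 0.5])      # targets from a known linear rule

theta = np.zeros(3)                     # model parameters to be "trained"
lr = 0.1                                # learning rate
for _ in range(500):
    residual = X @ theta - y            # prediction error
    grad = X.T @ residual / len(y)      # gradient of the loss
    theta -= lr * grad                  # parameter update: pure arithmetic

print(theta)                            # approaches [1.0, -2.0, 0.5]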
Step 2A, Prong Two: Does the Claim Integrate the Judicial Exception into a Practical Application?
This requires identifying additional elements beyond the judicial exception and evaluating whether they integrate the exception into a practical application. See MPEP § 2106.04(d).
The additional elements in claims 1, 12, and 16 are:
Generic computer components (Claims 12, 16): "at least one memory storing instructions" and "at least one processor configured to execute the instructions"
Data gathering (Claim 1): "obtaining a text-to-image dataset comprising one or more digital image-caption pairs"
Data gathering (Claim 12): "obtaining a set of synthetic captions" and "wherein the text-to-image dataset comprises one or more digital image-caption pairs"
Data gathering (Claim 16): "receiving a text description corresponding to an image"
Data output (Claim 16): "providing the upsampled text description to an image generation model"
Generic Computer Components (Claims 12, 16): The memory and processor are recited at a high level of generality and amount to no more than mere instructions to apply the judicial exception using generic computer components. See MPEP § 2106.05(f).
Data Gathering (Claims 1, 12, 16): Obtaining datasets, receiving text descriptions, and obtaining captions are insignificant extra-solution activity in the form of mere data gathering. These limitations are recited at a high level of generality and merely describe obtaining input data necessary for the recited mathematical processes. All uses of the recited judicial exception would require obtaining such input data. See MPEP § 2106.05(g); July 2024 SME Example 47, Claim 2 analysis (receiving training data is "mere data gathering and output recited at a high level of generality, and thus [is] insignificant extra-solution activity").
Data Output (Claim 16): "Providing the upsampled text description to an image generation model" is insignificant post-solution activity, outputting data without reciting any technical outcome (e.g., generating an image). Unlike Example 49, Claim 2 (particular treatment administered to an identified patient population), claim 16 recites no result from providing the data. See MPEP § 2106.05(g).
Field of Use Limitations: The preamble of claim 1 ("for enhancing a training dataset for a machine learning model") and the "wherein" clause of claim 16 (describing how the image generation model was trained) indicate a field of use or technological environment without imposing meaningful limits on the claims. See MPEP § 2106.05(h).
Consideration of Improvement to Technology: The examiner has considered whether the claims reflect an improvement to the functioning of a computer or an improvement to another technology or technical field. See MPEP § 2106.04(d)(1), § 2106.05(a).
The specification describes improvements including: (1) more efficient captioning compared to large language models (¶ [0066]), (2) improved caption quality leading to better image generation models (¶¶ [0010]-[0012]), (3) computational resource savings (¶ [0007]), and (4) improved image generation from upsampled prompts (¶ [0088]).
However, the claims do not reflect these improvements:
Claim 1 recites generating a recaptioned dataset but does not recite any use of that dataset to train an image generation model, nor does it recite any technical outcome.
Claim 12 terminates at "generating a captioned dataset by applying the tuned image captioner model to images in a dataset" without reciting any use of that dataset or technical outcome.
Claim 16 terminates at providing input to an image generation model without reciting generating any image or technical result.
Compare July 2024 SME Example 47, Claim 3 (eligible because the claim reflected the improvement by reciting specific remediation steps of detecting source addresses, dropping malicious packets, and blocking traffic in real time) with Example 47, Claim 2 (ineligible because the claim only recited detecting/analyzing anomalies and outputting data without reflecting any improvement). See also Example 48, Claim 2 (eligible because it recited synthesizing speech waveforms and combining them to generate a mixed speech signal excluding unwanted sources, thereby reflecting the speech separation improvement).
The examiner has considered the additional elements in combination. Claim 1 recites: data gathering (obtain dataset) followed by the abstract idea (applying the trained model). Claim 12 recites: generic computer components, the abstract idea (training/tuning stages), data gathering (obtaining synthetic captions), and the abstract idea (applying tuned model). Claim 16 recites: generic computer components, data gathering (receiving text), the abstract idea (upsampling), and data output (providing to model). In each case, the additional elements serve only to gather data, apply the exception on generic computers, and output data without imposing any meaningful limits. See Alice Corp. Pty. Ltd. v. CLS Bank Int’l, 573 U.S. 208, 224 (2014); MPEP § 2106.05(f), (g).
(Step 2A, Prong Two: NO)
The claims are directed to an abstract idea. (Step 2A: YES)
Step 2B: Does the Claim Provide Significantly More?
The additional elements identified at Step 2A, Prong Two are re-evaluated to determine whether they provide an inventive concept.
Generic Computer Components (Claims 12, 16): Generic computer components do not provide significantly more. See MPEP § 2106.05(d)(II) (computer functions such as storing and executing instructions are well-understood, routine, and conventional).
Data Gathering (Claims 1, 12, 16): The data gathering activities were identified as insignificant extra-solution activity at Step 2A, Prong Two. Re-evaluating under Step 2B, these elements are also well-understood, routine, and conventional. Obtaining datasets for machine learning training is routine activity in the field. The specification itself states that datasets "may be stored in a database, stored as part of benchmark training datasets, or available as public datasets" (¶ [0046]), indicating this is conventional data gathering.
Data Output (Claim 16): Providing data to another system is well-understood, routine, and conventional computer activity. See MPEP § 2106.05(d)(II); OIP Techs., Inc. v. Amazon.com, Inc., 788 F.3d 1359, 1363 (Fed. Cir. 2015).
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. (Step 2B: NO)
Claims 1, 12, and 16 are not eligible.
Claims 2 and 13 (depending from claims 1 and 12, respectively)
Claim 2 adds "updating one or more captions in the text-to-image dataset using the image captioner model." Claim 13 adds "wherein the image dataset is a subset of the text-to-image dataset."
These limitations further describe data manipulation using the model (claim 2) or data relationships (claim 13). Neither adds a practical application nor significantly more. Claims 2 and 13 are not eligible.
Claims 3, 6, and 17 (depending from claims 1, 5, and 16, respectively)
Claim 3 adds the first tuning stage comprising "obtaining a first set of captions corresponding to at least a first subset of the image dataset; and updating, based on the first set of captions, the image captioner model."
Claim 6 adds "the image dataset is a subset of the text-to-image dataset; and the first subset and the second subset are inclusive of each other."
Claim 17 adds "wherein the image captioner model is trained with a first tuning stage and a second tuning stage."
These limitations further describe the training/tuning process (mathematical concepts) and data organization. Obtaining captions is data gathering. Updating the model is part of the mathematical training process. Specifying dataset relationships is data characterization. None adds a practical application or significantly more. Claims 3, 6, and 17 are not eligible.
Claims 4, 14, and 18 (depending from claims 3, 12, and 16, respectively)
Claim 4 adds "the image captioner model is configured to generate short synthetic captions, and the first set of captions describe a main subject of an image in the image dataset."
Claim 14 adds "the first set of captions comprises short captions, the short captions describing a main subject of an image in the image dataset."
Claim 18 adds "the image captioner model is configured to generate short synthetic captions."
These limitations describe characteristics of the model output or training data (short captions describing main subjects). Describing what type of captions the model generates or is trained with does not add a practical application; it further characterizes the mathematical process. None adds significantly more. Claims 4, 14, and 18 are not eligible.
Claims 5 and 8 (depending from claims 3 and 5, respectively)
Claim 5 adds "obtaining a second set of captions corresponding to at least a second subset of the image dataset, wherein captions of the second set of captions have a length that is longer than captions of the first set of captions; and updating, based on the second set of captions, the image captioner model."
Claim 8 adds "at least one of the first set of captions or the second set of captions are generated with a machine learning model."
Claim 5 elaborates on the second tuning stage with data gathering (obtaining captions) and mathematical processes (updating the model). Claim 8 specifies that training data is generated by another machine learning model, adding another layer of mathematical processes. Neither adds a practical application or significantly more. Claims 5 and 8 are not eligible.
Claims 7 and 19 (depending from claims 5 and 16, respectively)
Claim 7 adds "the image captioner model is configured to generate descriptive synthetic captions, and the second set of captions describe the main subject plus at least one of surroundings, background, image text, style, or coloration of an image in the image dataset."
Claim 19 adds "the image captioner model is configured to generate descriptive synthetic captions."
These limitations describe characteristics of the model output (descriptive synthetic captions with specific content types). Describing what type of captions the model generates does not add a practical application; it further characterizes the mathematical process. Neither adds significantly more. Claims 7 and 19 are not eligible.
Claim 9 (depending from claim 1)
Claim 9 adds "augmenting the image captioner model with an image embedding, the image embedding corresponding to a compressed representation space."
Image embeddings are mathematical representations of images. The specification at ¶ [0051] describes image embeddings as a "numerical representation of an image, such as a vector" generated through neural networks. Augmenting the model with embeddings adds mathematical concepts to the claim without adding a practical application or significantly more. Claim 9 is not eligible.
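As a purely illustrative sketch (hypothetical code; the specification’s embedding network is not reproduced here), an image embedding maps an image to a lower-dimensional numerical vector, i.e., a compressed representation computed by mathematical operations. A fixed random projection stands in below for a learned encoder such as the CLIP function noted in Betker at p. 5.

import numpy as np

# Hypothetical sketch: a fixed random projection standing in for a learned
# image encoder. A real embedding model learns the projection from data.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))            # toy image as a pixel array
W = rng.normal(size=(64 * 64 * 3, 512))    # projection to 512 dimensions

embedding = image.reshape(-1) @ W          # flatten, then project
embedding /= np.linalg.norm(embedding)     # unit-normalize the vector

print(embedding.shape)                     # (512,), a compressed representation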
Claims 10 and 15 (depending from claims 1 and 12, respectively)
Claim 10 adds "training an image generation model with the recaptioned dataset."
Claim 15 adds "training a text-to-image machine learning model with the captioned dataset."
Training another model is itself a mathematical process. The claims do not recite any technical outcome. Example 48, Claim 2 was eligible because it recited a technical output (generating a mixed speech signal excluding unwanted sources). Claims 10 and 15 terminate at training without any technical output.
Claims 10 and 15 are not eligible.
Claims 11 and 20 (depending from claims 1 and 16, respectively)
Claim 11 adds "upsampling a caption in the recaptioned dataset using a large language model."
Claim 20 adds "the upsampling increases the length of the text description."
Claim 11 adds another mathematical process (applying an LLM to transform text). Claim 20 describes a characteristic of the upsampling output (increased length). Neither adds a practical application: applying a language model to expand text is executing learned mathematical transformations, and describing that the text length increases merely characterizes the mathematical output. Neither adds significantly more. Claims 11 and 20 are not eligible.
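For illustration only, the following hypothetical sketch shows the kind of caption "upsampling" at issue: a language model is prompted to expand a short caption into a longer, more descriptive one (cf. Betker p. 9, § 3.5 and Appendix C). The function call_llm is a placeholder stub, not a real API.

def call_llm(prompt: str) -> str:
    # Placeholder standing in for a request to a large language model.
    return prompt.split(":", 1)[1].strip() + ", rendered in soft morning light"

def upsample_caption(caption: str) -> str:
    prompt = f"Rewrite the following caption to be highly descriptive: {caption}"
    upsampled = call_llm(prompt)
    assert len(upsampled) > len(caption)   # claim 20's characteristic: longer text
    return upsampled

print(upsample_caption("a cat on a windowsill"))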
Conclusion
Claims 1-20 are rejected under 35 U.S.C. § 101 as being directed to abstract ideas (mathematical concepts) without integration into a practical application and without providing significantly more than the abstract ideas.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-11 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Betker et al. (Improving Image Generation with Better Captions – hereinafter “Betker”).
Claim 1.
Betker discloses a method for enhancing a training dataset for a machine learning model (Abstract discloses “training a bespoke image captioner and use it to recaption the training dataset. We then train several text-to-image models and find that training on these synthetic captions reliably improves prompt following ability.”), the method comprising:
obtaining a text-to-image dataset comprising one or more digital image-caption pairs (p. 5, ¶1 discloses “a large quantity of pairings (t, i) where i is an image and t is text that describes that image”); and
generating a recaptioned dataset by applying an image captioner model to images in the text-to-image dataset (p. 5, § 2 Data Recaptioning), the image captioner model trained with an image dataset (p. 6, § 2.1.1 Fine-tuning the captioner, ¶3 discloses “ground truth … captions”; Fig. 3 discloses “alt-text accompanying selected images scraped from the internet”), a first tuning stage (p. 6, § 2.1.1 Fine-tuning the captioner, ¶1 discloses “In our first attempt, we build a small dataset of captions that describe only the main subject of the image”; Fig. 3, short synthetic captions (SSC)), and a second tuning stage (p. 6, § 2.1.1 Fine-tuning the captioner, ¶2 discloses “We repeat this process a second time, creating a dataset of long, highly-descriptive captions describing the contents of each image in our fine-tuning dataset”; Fig. 3, descriptive synthetic captions (DSC)).
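For illustration only, the following hypothetical sketch summarizes the recaptioning pipeline the examiner reads onto claim 1 (Betker §§ 2-2.1.1): a base captioner is fine-tuned in two stages (short, then descriptive captions) and then applied to the text-to-image dataset. All function and variable names are illustrative placeholders, not Betker’s implementation.

def fine_tune(captioner, captions):
    # Placeholder for continued training of the captioner on a caption set.
    captioner["tuning_stages"].append(captions)
    return captioner

def recaption(dataset, captioner):
    # Apply the tuned captioner to every image, replacing its caption.
    return [(image, f"synthetic caption for {image}") for image, _ in dataset]

captioner = {"tuning_stages": []}                               # base captioner
captioner = fine_tune(captioner, ["main-subject caption"])      # first stage (SSC)
captioner = fine_tune(captioner, ["long descriptive caption"])  # second stage (DSC)

text_to_image_dataset = [("img0", "alt-text"), ("img1", "alt-text")]
recaptioned_dataset = recaption(text_to_image_dataset, captioner)
print(recaptioned_dataset)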
Claim 2.
Betker discloses the method of claim 1, wherein generating the recaptioned dataset comprises updating one or more captions in the text-to-image dataset (Betker p. 6, Fig. 3, short synthetic captions (SSC)) using the image captioner model (Betker p. 5, §2.1, ¶1 discloses “An image captioner is very similar to a traditional language model that predicts text.”).
Claim 3.
The method of claim 1, wherein the first tuning stage comprises: obtaining a first set of captions corresponding to at least a first subset of the image dataset (Betker p. 6, § 2.1.1 Fine-tuning the captioner, ¶1 discloses “In our first attempt, we build a small dataset of captions that describe only the main subject of the image.”); and updating, based on the first set of captions, the image captioner model (Betker p. 6, § 2.1.1, “We then continue to train our captioner on this dataset”).
Claim 4.
Betker discloses the method of claim 3, wherein: the image captioner model is configured to generate short synthetic captions (p. 6, § 2.1.1: “We refer to captions generated by this fine-tune as ‘short synthetic captions’.”; Fig. 3 shows SSC examples), and the first set of captions describe a main subject of an image in the image dataset (p. 6, § 2.1.1: “captions that describe only the main subject of the image”).
Claim 5.
Betker discloses the method of claim 3, wherein the second tuning stage comprises:
obtaining a second set of captions corresponding to at least a second subset of the image dataset (p. 6, § 2.1.1: “creating a dataset of long, highly-descriptive captions describing the contents of each image in our fine-tuning dataset” i.e. second subset), wherein captions of the second set of captions have a length that is longer than captions of the first set of captions (Fig. 3: shows Descriptive Synthetic Captions (DSC) are longer than Short Synthetic Captions (SSC)); and updating, based on the second set of captions, the image captioner model (p. 6, § 2.1.1: “We again fine-tune our base captioner on this dataset.”).
Claim 6.
Betker discloses the method of claim 5, wherein: the image dataset is a subset of the text-to-image dataset (p. 6, § 2.1.1 Fine-tuning the captioner, ¶1 discloses “In our first attempt, we build a small dataset” – small dataset i.e. subset); and the first subset and the second subset are inclusive of each other (inclusive of each other – could potentially be the same set of images (spec ¶79), or one set could contain the other; p. 6, § 2.1.1 and Fig. 3 disclose using the same images for the SSC and DSC).
Claim 7.
Betker discloses the method of claim 5, wherein: the image captioner model is configured to generate descriptive synthetic captions (p. 6, § 2.1.1: “We refer to captions generated by this captioner as ‘descriptive synthetic captions’.”; Fig. 3 shows DSC examples), and the second set of captions describe the main subject plus at least one of surroundings, background, image text, style, or coloration of an image in the image dataset (p. 6, § 2.1.1: “These captions describe not only the main subject of the image, but also its surroundings, background, text found in the image, styles, coloration, etc.”).
Claim 8.
The method of claim 5, wherein at least one of the first set of captions or the second set of captions are generated with a machine learning model (p. 9, § 3.5, ¶3: “we found that GPT-4 will readily ‘upsample’ any caption into a highly descriptive one.”).
Claim 9.
The method of claim 1, further comprising augmenting the image captioner model with an image embedding, the image embedding corresponding to a compressed representation space (p. 5, §2.1: discloses the need for a compressed representation space and using a pre-trained CLIP image embedding function F(i) to augment the language model).
Claim 10.
The method of claim 1, further comprising training an image generation model with the recaptioned dataset (Abstract: “train several text-to-image models and find that training on these synthetic captions reliably improves prompt following ability.”; p. 7, §3: discloses evaluating models trained on synthetic text).
Claim 11.
The method of claim 1, further comprising upsampling a caption in the recaptioned dataset using a large language model (p. 9, §3.5: “utilizing a LLM to 'upsample' captions”; p. 17, Appendix C shows a prompt used with GPT-4 for upsampling).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Betker et al. (Improving Image Generation with Better Captions – hereinafter “Betker”) in view of Zhao et al. (US 20220012544 A1 – hereinafter “Zhao”).
Claim 12.
Betker discloses a system comprising:
generating an image captioner model configured to generate captions from input images, the image captioner model trained using a text-to-image dataset (p. 5, §2.1), wherein the text-to-image dataset comprises one or more digital image-caption pairs (p. 5, ¶1 discloses “a large quantity of pairings (t, i)”);
performing a first tuning stage for the image captioner model, the first tuning stage comprising: training the image captioner model using a first set of captions corresponding to at least a first subset of an image dataset (p. 6, § 2.1.1 Fine-tuning the captioner, ¶1 discloses “In our first attempt, we build a small dataset of captions that describe only the main subject of the image” – small dataset i.e. subset; Fig. 3, short synthetic captions (SSC));
obtaining a set of synthetic captions (p. 6, §2.1.1: “We refer to captions generated by this fine-tune as ‘short synthetic captions’.”; Fig. 3 shows SSC examples);
after the first tuning stage, performing a second tuning stage for the trained image captioner model (p. 6, § 2.1.1 Fine-tuning the captioner, ¶2 discloses “We repeat this process a second time, creating a dataset of long, highly-descriptive captions describing the contents of each image in our fine-tuning dataset”), the second tuning stage comprising:
training the image captioner using the set of synthetic captions (p. 6, § 2.1.1: “We then continue to train our captioner on this dataset.”; where the output of the first stage are referred to as synthetic captions (SSC)); and
generating a captioned dataset by applying the tuned image captioner model to images in a dataset (p. 6, § 2.1.1: “Once built, we apply our image captioner fine-tunes to every image in our text-to-image dataset, resulting in a set of synthetic captions which we use for subsequent experiments.”).
Betker discloses all of the subject matter as described above except for specifically teaching “at least one memory storing instructions; at least one processor configured to execute the instructions to perform operations, the operations comprising.” However, Zhao in the same field of endeavor teaches at least one memory storing instructions (¶¶32-33 and Fig. 4 discloses a memory 402 storing instructions); at least one processor configured to execute the instructions to perform operations, the operations comprising (¶¶32-33, 40 and Fig. 4 discloses a processor 401 that executes instructions).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify Betker to include Zhao because such a modification is the result of combining prior art elements according to known methods to yield predictable results. More specifically, Betker as modified by Zhao yields the predictable result of using standard computer hardware to implement Betker’s machine learning method. Thus, a person of ordinary skill would have appreciated incorporating into Betker’s method for improving text-to-image datasets using a tuned captioner the ability to perform computer-implemented methods for augmenting caption datasets and training captioning models, since the claimed invention is merely a combination of old elements, in the combination each element merely performs the same function as it does separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
Claim 13.
The combination of Betker and Zhao discloses the system of claim 12, wherein the image dataset is a subset of the text-to-image dataset (Betker p. 6, § 2.1.1 Fine-tuning the captioner, ¶1 discloses “In our first attempt, we build a small dataset of captions that describe only the main subject of the image” – small dataset i.e. subset; Fig. 3, short synthetic captions (SSC); p. 7, §3.2: “All models were trained to 500,000 training steps at a batch size of 2048, corresponding to 1B training images total (emphasis added).”).
Claim 14.
The combination of Betker and Zhao discloses the system of claim 12, wherein the first set of captions comprises short captions, the short captions describing a main subject of an image in the image dataset (Betker p. 6, § 2.1.1 Fine-tuning the captioner, ¶1 discloses “In our first attempt, we build a small dataset of captions that describe only the main subject of the image.”).
Claim 15.
The combination of Betker and Zhao discloses the system of claim 12, further comprising training a text-to-image machine learning model with the captioned dataset (Betker Abstract: “train several text-to-image models and find that training on these synthetic captions reliably improves prompt following ability.”).
Claim 16.
Betker discloses a system comprising:
receiving a text description corresponding to an image (p. 6, § 2.1.1 Fine-tuning the captioner, ¶3 discloses “ground truth … captions”; Fig. 3 discloses “alt-text accompanying selected images scraped from the internet”);
upsampling the text description with a language model (p. 9, §3.5: “utilizing a LLM to 'upsample' captions”; p. 17, Appendix C shows a prompt used with GPT-4 for upsampling); and
providing the upsampled text description to an image generation model (p. 9, § 3.5), the image generation model trained with a dataset comprising image-caption pairs, wherein at least a portion of captions are generated with an image captioner model (p. 6, § 2.1.1).
Betker discloses all of the subject matter as described above except for specifically teaching “at least one memory storing instructions; at least one processor configured to execute the instructions to perform operations, the operations comprising.” However, Zhao in the same field of endeavor teaches at least one memory storing instructions (¶¶32-33 and Fig. 4 discloses a memory 402 storing instructions); at least one processor configured to execute the instructions to perform operations, the operations comprising (¶¶32-33, 40 and Fig. 4 discloses a processor 401 that executes instructions).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify Betker to include Zhao because such a modification is the result of combining prior art elements according to known methods to yield predictable results. More specifically, Betker as modified by Zhao yields the predictable result of using standard computer hardware to implement Betker’s machine learning method. Thus, a person of ordinary skill would have appreciated incorporating into Betker’s method for improving text-to-image datasets using a tuned captioner the ability to perform computer-implemented methods for augmenting caption datasets and training captioning models, since the claimed invention is merely a combination of old elements, in the combination each element merely performs the same function as it does separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
Claim 17.
The combination of Betker and Zhao discloses the system of claim 16, wherein the image captioner model is trained with a first tuning stage and a second tuning stage (Betker p. 6, § 2.1.1).
Claim 18.
The combination of Betker and Zhao discloses the system of claim 16, wherein the image captioner model is configured to generate short synthetic captions (Betker p. 6, § 2.1.1; Fig. 3, SSC).
Claim 19.
The combination of Betker and Zhao discloses the system of claim 16, wherein the image captioner model is configured to generate descriptive synthetic captions (Betker p. 6, § 2.1.1: “We refer to captions generated by this captioner as ‘descriptive synthetic captions’.”; Fig. 3 shows DSC examples).
Claim 20.
The combination of Betker and Zhao discloses the system of claim 16, wherein the upsampling increases the length of the text description (Betker Fig. 3: Shows Descriptive Synthetic Captions (DSC) are longer than Short Synthetic Captions (SSC)).
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Ross Varndell whose telephone number is (571)270-1922. The examiner can normally be reached M-F, 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, O’Neal Mistry can be reached at (313)446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Ross Varndell/Primary Examiner, Art Unit 2674