DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 9, 10, 13, 14, 21, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over XIE et al. (US 20250371348) in view of Rao et al. (US 20250225627).
Regarding claim 1, XIE discloses A method of generating high-performance images, the method comprising: training or finetuning, by the one or more processors, a second generative AI model using the first plurality of images and the first plurality of captions (XIE, “[0008] According to some embodiments, a text-to-image model training method, performed by a computing device, comprises training a text-to-image model by performing cyclic iterative training using a training set comprising sample image and text pairs. [0086] The image-text sample pair may be a pair of samples consisting of a sample image and description text describing the sample image. The sample image may be an image used as a sample to train the text-to-image model, and the description text corresponding to the sample image is text used to describe content of the sample image. [0094] A core idea thereof is to continuously improve performance of a model by repeatedly performing a process or algorithm, until a predetermined condition is satisfied or a predetermined number of iteration times is reached. In a model training process, cyclic iteration is performed on a full image-text sample pair (for example, an image-text sample pair training set) for a total of a plurality of rounds (for example, 100 rounds). [0099] The image-text sample pair training set may be a set formed by a plurality of image-text sample pairs, and is configured to train a text-to-image model”. Therefore, the sample image and text pairs correspond to the first plurality of images and the first plurality of captions); and
generating, by the one or more processors, a second plurality of images, wherein generating the second plurality of images includes inputting a plurality of text prompts into the trained or finetuned second generative AI model (XIE, fig. 10, "[0140] Referring to FIG. 10, FIG. 10 is a schematic diagram of image generation based on a text-to-image model according to some embodiments. Based on FIG. 10, a server obtains, by using a client, specified text of at least two target class names inputted by an object (for example, a user), the specified text being: "a man with headphone, holding a cup"…the server inputs the specified text into a text-to-image model, to obtain an image corresponding to the specified text. Then, the server returns the generated image to the client for presentation").
XIE does not explicitly disclose, but Rao discloses, generating, by one or more processors, a first plurality of captions each corresponding to a different one of a first plurality of images, wherein generating the first plurality of captions includes inputting the first plurality of images into a first generative artificial intelligence (AI) model (Rao, fig. 7, "[0061] The input image 201 is provided to an image-to-text model 202 for generation of a text prompt 203. The image-to-text model 202 generates a textual description (or "caption") and keywords for the input image 201. [0085] FIG. 7 illustrates an example image-to-text model 202, an example prompt engineering 204, and an example personalization block 207 of FIG. 2 in accordance with this disclosure. In this example, the image-to-text model 202 creates prompts using one or more LLMs. For example, in the image-to-text model 202, the input image 201 is received at a bootstrapping language image pretraining (BLIP) model 701, which is a decoder-based model that produces an image caption 702").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined Rao with XIE to include all limitations of claim 1, that is, to add the image-to-text model of Rao to generate the input captions used by XIE. The motivation/suggestion would have been to provide accurate prompts for high-quality image outpainting generation (Rao, [0084]).
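For illustration only (this sketch is not part of the record, and every class and method name in it is a hypothetical placeholder), the claim 1 pipeline as mapped above can be read as three steps: caption a set of images with an image-to-text model (the Rao mapping), finetune a text-to-image model on the resulting image-caption pairs by cyclic iteration (the XIE mapping), and then generate new images from text prompts.

```python
# Hypothetical placeholders only; not from XIE, Rao, or the application.

def caption_images(image_to_text_model, images):
    """Rao mapping: generate one caption per image with a generative AI model."""
    return [image_to_text_model.caption(img) for img in images]

def finetune_text_to_image(text_to_image_model, images, captions, rounds=100):
    """XIE mapping: cyclic iterative training over the full set of
    image-text sample pairs for a plurality of rounds (cf. XIE [0094])."""
    pairs = list(zip(images, captions))
    for _ in range(rounds):                 # e.g., 100 rounds per XIE [0094]
        for image, caption in pairs:        # one image-text sample pair per step
            text_to_image_model.train_step(image, caption)
    return text_to_image_model

def generate_images(text_to_image_model, prompts):
    """XIE fig. 10 mapping: prompt the trained/finetuned model for new images."""
    return [text_to_image_model.generate(prompt) for prompt in prompts]
```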
Regarding claim 2, XIE in view of Rao discloses The method of claim 1.
XIE further discloses the second generative AI model comprises a second LLM (XIE, “[0032] A low-rank adaptation (LoRA) weight is low-rank adaptation of a large language model. The LoRA freezes a weight of a pre-trained model, and injects a trainable rank decomposition matrix to each layer of a transformer architecture, greatly reducing a quantity of trainable parameters of downstream tasks. In some embodiments, the LoRA mainly injects a trainable network parameter into a denoising network in a text-to-image model”).
XIE does not explicitly disclose, but Rao discloses, that the first generative AI model comprises a first large language model (LLM) (Rao, "[0067] An initial text prompt may be produced by the image-to-text model 202, and an LLM may be used (step 312) to contextualize the initial text prompt. [0085] FIG. 7 illustrates an example image-to-text model 202, an example prompt engineering 204, and an example personalization block 207 of FIG. 2 in accordance with this disclosure. In this example, the image-to-text model 202 creates prompts using one or more LLMs"). The same motivation as in claim 1 applies here.
Regarding claim 9, XIE in view of Rao discloses The method of claim 1.
XIE further discloses wherein the second generative AI model is a pre-trained model, and wherein training or finetuning the second generative AI model includes finetuning the pre-trained model (XIE, “[0033] Fine tuning is to train some tasks in a customized manner by using a pre-trained model, and modify a network for a task. A to-be-trained model in some embodiments is a pre-trained model”).
Regarding claim 10, XIE in view of Rao discloses The method of claim 9.
XIE further discloses wherein finetuning the pre-trained model includes using low-rank adaptation (LoRA) finetuning to finetune the pre-trained model (XIE, “[0032] A low-rank adaptation (LoRA) weight is low-rank adaptation of a large language model. The LoRA freezes a weight of a pre-trained model, and injects a trainable rank decomposition matrix to each layer of a transformer architecture, greatly reducing a quantity of trainable parameters of downstream tasks”).
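As a non-record illustration of the LoRA technique quoted from XIE [0032], a minimal PyTorch sketch (generic, not XIE's implementation) of freezing a pretrained weight and injecting a trainable rank decomposition matrix is:

```python
# Generic LoRA sketch in PyTorch; an assumption for illustration, not XIE's code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze a pretrained linear layer and inject a trainable rank
    decomposition B @ A, per the general LoRA technique (cf. XIE [0032])."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # freeze the pretrained weight
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero initial update
        self.scale = alpha / r

    def forward(self, x):
        # Effective weight is W + scale * (B @ A); only A and B are trainable,
        # greatly reducing the quantity of trainable parameters.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```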
Regarding claim 13, it recites limitations similar to those of claim 1, except that it further recites "One or more non-transitory, computer-readable media storing instructions that, when executed by one or more processors of a computing system, cause the computing system to…".
XIE further discloses this limitation in its claim 17: "A non-transitory computer-readable storage medium, storing computer code which, when executed by at least one processor, causes the at least one processor to at least".
Regarding claims 14, 21, and 22, they are interpreted and rejected for the same reasons set forth for claims 2, 9, and 10, respectively.
Claims 3, 7, 15, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over XIE et al. (US 20250371348) in view of Rao et al. (US 20250225627), and further in view of Hamedi et al. (US 20250014314).
Regarding claim 3, XIE in view of Rao discloses The method of claim 1.
XIE in view of Rao does not explicitly disclose, but Hamedi discloses, wherein training or finetuning the second generative AI model further includes using a plurality of performance labels each corresponding to a different one of the first plurality of images (Hamedi, "[0874] the web page scoring machine learning model can be trained to generate a web page score for a web page based on visual metrics such as a score (e.g., a performance score, which may be determined or generated using another machine learning model, as described herein) of each image (e.g., the quality or effectiveness of those images)…").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined Hamedi with XIE and Rao to include all limitations of claim 3, that is, to use the performance-score-based training of Hamedi to train the text-to-image model of XIE and Rao. The motivation/suggestion would have been to generate web page scores, based on the quality or effectiveness of each image, using a web page scoring machine learning model (Hamedi, [0874]).
Regarding claim 7, XIE in view of Rao and Hamedi discloses The method of claim 3.
XIE in view of Rao does not explicitly disclose, but Hamedi discloses, wherein training or finetuning the second generative AI model further includes using a plurality of visual quality labels each corresponding to a different one of the first plurality of images (Hamedi, "[0874] the web page scoring machine learning model can be trained to generate a web page score for a web page based on visual metrics such as a score (e.g., a performance score, which may be determined or generated using another machine learning model, as described herein) of each image (e.g., the quality or effectiveness of those images)…"). The same motivation as in claim 3 applies here.
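For illustration only (the label format below is a hypothetical assumption, not anything disclosed by Hamedi or the application), the per-image performance and visual quality labels of claims 3 and 7 could be folded into the training captions as follows:

```python
def build_labeled_caption(caption, performance_label, quality_label):
    """Hypothetical conditioning format: prepend per-image labels (claims 3 and 7)
    to the caption used during training or finetuning."""
    return f"[performance: {performance_label}] [quality: {quality_label}] {caption}"

# Example: one training caption per image, each carrying its own labels.
print(build_labeled_caption("a man with headphone, holding a cup", "high", "high"))
```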
Regarding claims 15 and 19, they are interpreted and rejected for the same reasons set forth for claims 3 and 7, respectively.
Claims 4 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over XIE et al. (US 20250371348) in view of Rao et al. (US 20250225627), and further in view of Hamedi et al. (US 20250014314) and Brewer et al. (US 20250258878).
Regarding claim 4, XIE in view of Rao and Hamedi discloses The method of claim 3.
XIE in view of Rao and Hamedi does not explicitly disclose, but Brewer discloses, wherein each label of the plurality of performance labels is indicative of past performance of a respective image of the first plurality of images, and wherein the past performance is a measure of user interest in a content item that included the respective image (Brewer, "[0025] By leveraging unsupervised techniques, the interest graph can continuously adapt to capture emerging trends and interests reflected in the most recent user-generated content. [0077] To train a pairwise content tagging model 400 for interest-based tagging, the text encoder 410 is provided with keywords and key phrases (e.g., interest tags 408) from the interest graph to output text embeddings. [0078] The model is trained on matching pairs of images/videos and corresponding interest descriptors extracted from accompanying text. [0079] Specifically, relevant interest tags are extracted from the text accompanying each historical item").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined Brewer with XIE, Rao, and Hamedi to include all limitations of claim 4, that is, to add the interest tags of Brewer to the text-to-image training of XIE, Rao, and Hamedi. The motivation/suggestion would have been to provide an improved solution for content understanding compared to conventional systems (Brewer, [0033]).
Regarding claim 16, it is interpreted and rejected for the same reasons set forth for claim 4.
Claims 5 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over XIE et al. (US 20250371348) in view of Rao et al. (US 20250225627), and further in view of Hamedi et al. (US 20250014314) and SINGH et al. (US 20250265751).
Regarding claim 5, XIE in view of Rao and Hamedi discloses The method of claim 3.
XIE in view of Rao and Hamedi does not explicitly disclose, but SINGH discloses, wherein each label of the plurality of performance labels is indicative of predicted performance of a respective image of the first plurality of images (SINGH, "[0043] In one embodiment, the pipeline further provides prompt refinement through at least another generative model call, such as calling the LLM 126a and/or the text-to-image model 126b based on user feedback data 146a sent via a feedback loop (e.g., a quality prediction model, and/or a reflection loop based on a confidence threshold)").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined SINGH with XIE, Rao, and Hamedi to include all limitations of claim 5, that is, to use the feedback data from the quality prediction model of SINGH in the text-to-image training of XIE, Rao, and Hamedi. The motivation/suggestion would have been to improve visual content collage generation using generative models by providing users with AI-based collage generations through a novel generation pipeline, streamlining the user experience by eliminating the need to select a collage template or manually drag and drop photos into it, and minimizing user editing of the collage (SINGH, [0015]).
Regarding claim 17, it is interpreted and rejected for the same reasons set forth for claim 5.
Claims 6 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over XIE et al. (US 20250371348) in view of Rao et al. (US 20250225627), and further in view of Hamedi et al. (US 20250014314) and ZHU (US 20250349042).
Regarding claim 6, XIE in view of Rao and Hamedi discloses The method of claim 3.
XIE in view of Rao and Hamedi does not explicitly disclose, but ZHU discloses, wherein generating the second plurality of images further includes inputting a plurality of desired performance labels into the trained or finetuned second generative AI model, each of the plurality of desired performance labels corresponding to a different one of the plurality of text prompts (ZHU, "[0062] Therefore, when the determined image generation model is used, a large number of image samples having expected labels that accurately reflects content of the guidance information may be rapidly generated by inputting the guidance information. [0066] A basic description of a to-be-generated image is required, to ensure that the generated guidance information includes descriptive information of at least one expected label (an expected to-be-annotated object) in the to-be-generated image").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined ZHU with XIE, Rao, and Hamedi to include all limitations of claim 6, that is, to add the expected-label input of ZHU to the trained text-to-image model of XIE, Rao, and Hamedi. The motivation/suggestion would have been to help reduce the costs of obtaining the image samples and improve the quality of the image samples (ZHU, [0062]).
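Continuing the same hypothetical label format (again an assumption for illustration, not drawn from ZHU or the record), claim 6's inference-time counterpart pairs each text prompt with its own desired performance label before the prompt is input to the trained model:

```python
def build_guided_prompt(text_prompt, desired_label):
    """Claim 6 mapping, sketched: each prompt carries its own desired label,
    analogous to ZHU's guidance information with expected labels ([0062])."""
    return f"[performance: {desired_label}] {text_prompt}"

prompts = ["a man with headphone, holding a cup", "a dog running on a beach"]
desired = ["high", "high"]
guided_prompts = [build_guided_prompt(p, d) for p, d in zip(prompts, desired)]
```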
Regarding claim 18, it is interpreted and rejected for the same reasons set forth for claim 6.
Claims 11 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over XIE et al. (US 20250371348) in view of Rao et al. (US 20250225627), and further in view of Desnoyer et al. (US 20160188997).
Regarding claim 11, XIE in view of Rao discloses The method of claim 1.
XIE in view of Rao does not explicitly disclose, but Desnoyer discloses, identifying, by the one or more processors, the first plurality of images, wherein identifying the first plurality of images includes filtering out, from a larger set of images, images containing more than a threshold amount of text (Desnoyer, "[0066] In particular embodiments, the filtering engine 312 may perform a quick analysis of the image frames to determine an amount of text in the image frames. As an example and not by way of limitation, the filtering engine 312 may filter out image frames determined to include an amount of text that exceeds an acceptable threshold amount of text. As a result, image frames that contain too much text, such as movie credits, may be filtered out as being unsuitable for selection as a representative image frame").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined Desnoyer with XIE and Rao to include all limitations of claim 11, that is, to add the filtering engine of Desnoyer to filter the first plurality of images of XIE and Rao. The motivation/suggestion would have been that image frames containing too much text, such as movie credits, may be filtered out as unsuitable for selection as a representative image frame (Desnoyer, [0066]).
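As a non-record sketch of the claim 11 filtering step (the OCR library choice and the threshold value are assumptions, not details disclosed by Desnoyer):

```python
# Assumes Pillow and pytesseract are installed; the threshold is hypothetical.
from PIL import Image
import pytesseract

TEXT_CHAR_THRESHOLD = 50  # hypothetical "acceptable threshold amount of text"

def filter_text_heavy_images(image_paths):
    """Keep only images whose OCR-detected text does not exceed the threshold,
    e.g., to drop frames of movie credits (cf. Desnoyer [0066])."""
    kept = []
    for path in image_paths:
        detected_text = pytesseract.image_to_string(Image.open(path))
        if len(detected_text.strip()) <= TEXT_CHAR_THRESHOLD:
            kept.append(path)
    return kept
```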
Regarding claim 23, it is interpreted and rejected for the same reasons set forth for claim 11.
Claims 12 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over XIE et al. (US 20250371348) in view of Rao et al. (US 20250225627), and further in view of ZHU2 et al. (US 20260004475).
Regarding claim 12, XIE in view of Rao discloses The method of claim 1.
XIE in view of Rao does not explicitly disclose, but ZHU2 discloses, training or finetuning, by the one or more processors, the first generative AI model using a third plurality of images and a second plurality of captions, each of the second plurality of captions corresponding to a different one of the third plurality of images (ZHU2, "[0062] The model loss of the image-text model includes the image loss. Further, in a training stage, the image-text model may be obtained through training according to the image loss. The image loss is constructed according to the first sample image and the second sample image that is obtained by converting the first sample text configured for describing the first sample image").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined ZHU2 with XIE and Rao to include all limitations of claim 12, that is, to use the training process of ZHU2 to train the image-to-text model of XIE and Rao. The motivation/suggestion would have been to enable a generated target text to describe a target image as accurately as possible and to ensure the accuracy of the target text (ZHU2, [0004]).
Regarding claim 24, it is interpreted and rejected for the same reasons set forth for claim 12.
Allowable Subject Matter
Claims 8 and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
Regarding claim 8, it recites, wherein training or finetuning the second generative AI model includes: generating a plurality of training or finetuning text prompts each including (i) a different one of the first plurality of captions, and (ii) a text indication of an image type of the image, of the first plurality of images, that corresponds to the different one of the first plurality of captions; and training or finetuning the second generative AI model using the first plurality of images and the plurality of training or finetuning text prompts. None of the prior art of record, nor any of the prior art searched, alone or in combination, renders obvious the combination of elements recited in the claim as a whole.
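For illustration only (the prompt template below is one hypothetical reading of the claim language, not the applicant's disclosed format), the claim 8 training prompts combine each caption with a text indication of the corresponding image's type:

```python
def build_training_prompt(caption, image_type):
    """Claim 8, sketched: one training/finetuning text prompt per image, built
    from (i) that image's caption and (ii) a text indication of its image type."""
    return f"image type: {image_type}. {caption}"

captions = ["a man with headphone, holding a cup"]
image_types = ["photograph"]
training_prompts = [build_training_prompt(c, t) for c, t in zip(captions, image_types)]
```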
Regarding claim 20, it is interpreted and indicated as allowable under a rationale similar to that of claim 8.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GRACE Q LI whose telephone number is (571)270-0497. The examiner can normally be reached Monday - Friday, 8:00 am-5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, DEVONA FAULK, can be reached at 571-272-7515. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/GRACE Q LI/Primary Examiner, Art Unit 2618 2/3/2026