Prosecution Insights
Last updated: April 19, 2026
Application No. 18/785,665

GENERATING ARTIFICIAL INTELLIGENCE (AI)-BASED IMAGES USING LARGE LANGUAGE MACHINE-LEARNED MODELS

Status: Non-Final OA (§103)
Filed: Jul 26, 2024
Examiner: STATZ, BENJAMIN TOM
Art Unit: 2611
Tech Center: 2600 — Communications
Assignee: Maplebear Inc.
OA Round: 1 (Non-Final)

Grant Probability: 0% (At Risk)
Expected OA Rounds: 1-2
Time to Grant: 2y 9m
Grant Probability With Interview: 0%

Examiner Intelligence

Grants only 0% of cases.

Career Allow Rate: 0% (0 granted / 2 resolved; -62.0% vs TC avg)
Interview Lift: +0.0% (minimal lift among resolved cases with interview)
Typical Timeline: 2y 9m avg prosecution; 33 applications currently pending
Career History: 35 total applications across all art units

Statute-Specific Performance

§101: 1.9% (-38.1% vs TC avg)
§103: 65.2% (+25.2% vs TC avg)
§102: 10.8% (-29.2% vs TC avg)
§112: 13.3% (-26.7% vs TC avg)

Tech Center averages are estimates. Based on career data from 2 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Applicant claims the benefit of US Provisional Application No. 63/529,100, filed 07/26/2023. Claims 1-24 have been afforded the benefit of this filing date.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 6-10, 15-19, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Unnikrishnan et al. (US 20240420205 A1, hereinafter "Unnikrishnan") in view of Rezaeian et al. (US 20250022595 A1, hereinafter "Rezaeian").

Regarding claim 1, Unnikrishnan discloses: A method comprising: generating a second prompt for input to the machine-learned language model or another machine-learned language model ([0028] "Further, in some embodiments, any input (e.g., text or images, customer data, etc.) can be utilized to guide the output textual prompt of the language model."), the second prompt specifying at least the theme ([0029] "In some embodiments, with respect to the example shown in FIG. 8F, the language model can generate the textual prompt based on a selected style, recent purchases, moodboards, recent searches, the customer's locations, the customer style affinity score, etc. In this regard, customer data can be used to generate personalized textual prompts by the language model."; prompts are also generated based on a "knowledge graph" which includes information about types/categories of products, see [0025] to [0026] for details) and a second request to generate a third prompt for input to an image generation model, the third prompt comprising a third request to generate one or more images of one or more items associated with the theme ([0030] "In this regard, an image (or images) of a set of products (or sets of products) are generated through a text-to-image diffusion model based on the textual prompt generated by the language model."); providing the second prompt to the one or more model serving systems for execution by the machine-learned language model ([0028] "For example, with respect to the example of FIGS. 8B and 8C, the customer may begin by inputting text and/or image(s)… into the search query and the language model generates textual prompts… based on the input from the customer."); receiving, from the one or more model serving systems, a second response generated by executing the machine-learned language model on the second prompt, the second response comprising the requested third prompt (fig. 7, output of the prompt generation LLM 704); providing the third prompt to the one or more model serving systems for execution by the image generation model (fig. 7, output of the prompt generation LLM 704 is eventually received by text-to-image diffusion model 710); receiving, from the one or more model serving systems, one or more images generated by executing the image generation model on the third prompt ([0030] "In this regard, an image (or images) of a set of products (or sets of products) are generated through a text-to-image diffusion model based on the textual prompt generated by the language model. For example, with respect to the example of FIG. 7, the text-to-image diffusion model generates three (3) images of different sets of products, where each set of products includes a number of products from the catalog of products, based on the textual prompt generated by a language model."); and presenting a recommendation to the user, the recommendation including at least one of the generated images (fig. 7, final output 712).

Unnikrishnan is not relied upon to teach: generating a first prompt for input to a machine-learned language model, the first prompt specifying at least contextual information of a user and a first request to generate a theme for item recommendations for the user; providing the first prompt to one or more model serving systems for execution by the machine-learned language model; receiving, from the one or more model serving systems, a first response generated by executing the machine-learned language model on the first prompt, the first response comprising at least the requested theme; or that the theme specified in the first prompt is included in the first response.

Rezaeian teaches: generating a first prompt for input to a machine-learned language model ([0024] "Prompt engine 134 receives a summary note, automatically generates a prompt..."), the first prompt specifying at least contextual information of a user and a first request to generate a theme for item recommendations for the user ([0023] "Each summary note may comprise one or more of the following types of information: potential diagnoses, one or more risk factors, and potential treatments, such as medicine, tests, and/or physical therapy. Additionally, a summary note may include information about the patient's medical history and/or a general impression of the physician.", where the summary note is included with the first prompt; [0026] "In an embodiment, prompt engine 134 generates a prompt that is tailored to a particular medical field or medical department… Examples of medical fields or departments include one medical department handling ocular matters and another medical department handling cardiac matters. Identification of a particular medical field/department may be based on field data that is stored in a summary note."); providing the first prompt to one or more model serving systems for execution by the machine-learned language model ([0024] "Prompt engine 134 receives a summary note, automatically generates a prompt, and sends the prompt and summary note to LLM 136."); receiving, from the one or more model serving systems, a first response generated by executing the machine-learned language model on the first prompt ([0029] "LLM 136 receives a prompt and a summary note from prompt engine 134 and converts the unstructured text (e.g., a string of characters) of the summary note into a structured data format, such as a JSON file."), the first response comprising at least the requested theme ([0031] "After training, given a prompt and a summary note, LLM 136 compares the summary note with all existing medical data upon which the model was trained and then converts, according to the prompt, the summary note into structured features, such as potential diagnoses features, risk factor features, and potential treatment features.", where types of medical diagnoses/treatment may be considered a "theme"); generating a second prompt for input to the machine-learned language model or another machine-learned language model ([0031] "Thus, the output of LLM 136 is in a structured data format and is sent to recommendation model 140.", where the recommendation model outputs a list of relevant medical items for prescription to a patient as described in [0039]), the second prompt specifying at least the theme included in the first response (output sent to recommendation model includes the previously discussed medical data described in [0031]).

To summarize: Rezaeian teaches generating a prompt using contextual user information (prompt 1) which is used as input for a machine learning algorithm (model 1) to generate another prompt indicating a category (prompt 2), then passing this prompt to another machine learning model (model 2) to generate a recommendation. Unnikrishnan teaches using user input (prompt 2) as a prompt for a language model (model 2), which generates a prompt (prompt 3) for a diffusion model (model 3), which generates the final output image(s).

Unnikrishnan and Rezaeian are both analogous to the claimed invention because they are in the same field of generating item recommendations based on user data. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Unnikrishnan with the teachings of Rezaeian to use a machine learning model to generate prompts specifying recommended items and item categories for which the invention of Unnikrishnan can generate images. The motivation would have been to further automate the product image generation system of Unnikrishnan to automatically generate personalized visual advertisements; Rezaeian claims its invention will "reduce medical costs, reduce time to medical resolution, increase quality of life measures for patients", which when translated to a general advertising context would result in cheaper, faster to produce, and more personally relevant ads.
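The examiner's "prompt 1 → model 1 → prompt 2 → model 2 → prompt 3 → model 3" summary can be made concrete with a short sketch. The Python below is illustrative only: the ModelServing type, its llm and image_model callables, and both prompt templates are invented for this sketch as stand-ins for the claimed "one or more model serving systems", not code from either reference or the application.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class ModelServing:
        """Hypothetical stand-in for the claimed 'model serving systems'."""
        llm: Callable[[str], str]                  # machine-learned language model
        image_model: Callable[[str], List[bytes]]  # image generation model

    def recommend(serving: ModelServing, user_context: str) -> List[bytes]:
        # First prompt (the step read onto Rezaeian): contextual information
        # of the user plus a request to generate a theme.
        first_prompt = (
            "Given this user context, propose a theme for item "
            f"recommendations: {user_context}"
        )
        theme = serving.llm(first_prompt)  # first response contains the theme

        # Second prompt: the theme plus a request to write a prompt for the
        # image generation model.
        second_prompt = (
            "Write a text-to-image prompt for images of items matching "
            f"this theme: {theme}"
        )
        third_prompt = serving.llm(second_prompt)  # second response

        # Third prompt (the step read onto Unnikrishnan): generate images,
        # which are then presented as the recommendation.
        return serving.image_model(third_prompt)

A stubbed call such as recommend(ModelServing(llm=lambda p: "cozy autumn kitchen", image_model=lambda p: [b""]), "recent purchases: candles") exercises the same three-prompt chain the rejection maps across the two references.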
Regarding claim 6, the combination of Unnikrishnan in view of Rezaeian teaches: The method of claim 1, wherein inputting the third prompt into the image generation model comprises inputting the third prompt into a diffusion-based model (Unnikrishnan [0021] "Generally, and at a high level, embodiments described herein facilitate… using a generative language model trained on the relationships stored in the knowledge graph to generate textual prompts for a text-to-image diffusion model.").

Regarding claim 7, the combination of Unnikrishnan in view of Rezaeian teaches: The method of claim 1, further comprising: receiving, via a user interface, a user input for modifying the third prompt (Unnikrishnan [0028] "In other embodiments, the language model can begin by generating a textual prompt, the customer can refine the textual prompt through input text or images, and the language model can refine the textual prompt based on the customer's input."; fig. 8D shows user interface); and providing the modified third prompt to the one or more model serving systems for execution by the image generation model (Unnikrishnan fig. 8D shows image results of prompt, which can be modified).

Regarding claim 8, the combination of Unnikrishnan in view of Rezaeian teaches: The method of claim 7, further comprising: updating the machine-learned language model using the user input as feedback (Unnikrishnan [0035] "In some embodiments, the customer style affinity score is updated each time a customer enters the website to perform the search query or at automated intervals in order to capture the evolving style affinities of the customer."; the customer style affinity score is used by the language model when generating a prompt as mentioned in [0029]).

Regarding claim 9, the combination of Unnikrishnan in view of Rezaeian teaches: The method of claim 1, wherein the second prompt and the third prompt are generated by a computing platform which includes the one or more model serving systems (Unnikrishnan [0066] "User device 102 can be a client device on a client-side of operating environment 100, while generative AI product search query manager 108 can be on a server-side of operating environment 100. Generative AI product search query manager 108 may comprise server-side software designed to work in conjunction with client-side software on user device 102 so as to implement any combination of the features and functionalities discussed in the present disclosure.").

Claims 2-5, 11-14, and 20-23 are rejected under 35 U.S.C. 103 as being unpatentable over Unnikrishnan (US 20240420205 A1) in view of Rezaeian (US 20250022595 A1) as applied to claims 1, 10, and 19 above, and further in view of Ruiz et al. ("DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation", arXiv preprint (15 Mar 2023), https://arxiv.org/abs/2208.12242v2; hereinafter "Ruiz").

Regarding claim 2, the combination of Unnikrishnan in view of Rezaeian teaches: The method of claim 1, but is not relied upon to teach the limitations: further comprising: obtaining a fine-tuning data set including images of one or more items in different settings for promoting the one or more items, a class name for each item, and a unique identifier representing each item; and providing the fine-tuning data set to the one or more model serving systems for fine-tuning the image generation model.

Ruiz teaches the limitations: further comprising: obtaining a fine-tuning data set including images of one or more items in different settings for promoting the one or more items, a class name for each item, and a unique identifier representing each item; and providing the fine-tuning data set to the one or more model serving systems for fine-tuning the image generation model (Ruiz fig. 3 shows example of fine-tuning data including variety of input images, class name, and identifier; pg. 2 col. 1 "We fine-tune the text-to-image model with the input images and text prompts containing a unique identifier followed by the class name of the subject (e.g., "A [V] dog"). The latter enables the model to use its prior knowledge on the subject class while the class-specific instance is bound with the unique identifier.").

Ruiz is analogous to the claimed invention because it is in the same field of generating images of an item using a diffusion model, and it pertains to the same issue in which both the item category and the specific item type need to be taken into account. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Unnikrishnan in view of Rezaeian with the teachings of Ruiz to incorporate a system where the diffusion model is trained on data containing both an item class and a unique identifier. The motivation would have been to increase the potential variety of generated images for a particular item by using prior data from images of other items from the same class, while still preserving that particular item's appearance, as taught by Ruiz (pg. 4 col. 1 section "Designing Prompts for Few-Shot Personalization"): "In essence, we seek to leverage the model's prior of the specific class and entangle it with the embedding of our subject's unique identifier so we can leverage the visual prior to generate new poses and articulations of the subject in different contexts".

[Image: Ruiz fig. 3, reproduced in the Office Action.]
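Ruiz's labeling scheme (unique identifier followed by class name) is simple enough to sketch. This is a hedged illustration of the caption format only, not the applicant's claimed data set or Ruiz's released code; the FineTuneExample fields and the rare token "sks" (a common community choice of identifier in DreamBooth implementations) are assumptions.

    from dataclasses import dataclass

    @dataclass
    class FineTuneExample:
        image_path: str   # image of the item in a particular setting
        class_name: str   # coarse class descriptor, e.g. "dog" or "watch"
        identifier: str   # rare-token unique identifier bound to this item

    def caption(ex: FineTuneExample) -> str:
        # DreamBooth-style label: "a [identifier] [class noun]" (Ruiz pg. 4)
        return f"a {ex.identifier} {ex.class_name}"

    dataset = [
        FineTuneExample("watch_front.png", "watch", "sks"),
        FineTuneExample("watch_side.png", "watch", "sks"),
    ]
    captions = [caption(ex) for ex in dataset]  # ["a sks watch", "a sks watch"]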
Regarding claim 3, the combination of Unnikrishnan in view of Rezaeian and further in view of Ruiz teaches: The method of claim 2, wherein fine-tuning the image generation model comprises: fixing one or more parameters of the image generation model (Ruiz pg. 2 col. 1 "To that end, we propose a technique to represent a given subject with rare token identifiers and fine-tune a pre-trained, diffusion-based text-to-image framework."; pre-training involves fixing one or more parameters); applying the image generation model to a fine-tuning prompt and a noisy version of an image of a particular item to output denoised images of the particular item, the fine-tuning prompt including a unique identifier for the item and a class of the item (fig. 3 (see above) top right, the model being trained generates item-specific images based on a prompt containing both the identifier and class ("A [V] dog")); applying the fixed parameters of the image generation model to a class-specific prompt to output an image of the class, the class-specific prompt including a class of the item (fig. 3 bottom left, the pre-trained "locked" model generates class images based on a prompt containing only the class ("A dog")); applying the image generation model to the class-specific prompt and a noisy version of the image of the class to output denoised images of the class (fig. 3 bottom right, the model being trained generates class images based on a prompt containing only the class ("A dog")); generating a reconstruction loss that indicates a difference between known images of the particular item and the denoised output images of the particular item (fig. 3, the loss between the ground-truth input images and the newly generated item-specific images is labeled "Reconstruction Loss"; pg. 3 col. 2 section 3.1 "Text-to-Image Diffusion Models" provides the loss function, which compares a generated (denoised) image with a ground-truth image); and generating a class-specific prior preservation loss that indicates a difference between the image of the class and the denoised output images of the class (fig. 3, the loss between the pre-trained class images and the newly generated class images is labeled "Class-Specific Prior Preservation Loss"; pg. 4 col. 2 section 3.3 "Class-specific Prior Preservation Loss": "To mitigate the two aforementioned issues, we propose an autogenous class-specific prior preservation loss that encourages diversity and counters language drift. In essence, our method is to supervise the model with its own generated samples, in order for it to retain the prior once the few-shot fine-tuning begins."; the conditioning vector for this term is based on the item class - see the rest of section 3.3 "Class-specific Prior Preservation Loss" for more details); and backpropagating terms obtained from the reconstruction loss and the class-specific prior preservation loss to update parameters of the image generation model (this limitation is suggested or implied by the use of a training process involving a loss function).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Unnikrishnan in view of Rezaeian and further in view of Ruiz with the additional teachings of Ruiz to use a diffusion model which includes a class-specific prior preservation loss in addition to the usual reconstruction loss. The motivation would have been to prevent "language drift", in which a class becomes too closely associated with a particular item, and to maintain high output diversity (explained in Ruiz section 3.3 "Class-specific Prior Preservation Loss").
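For reference, the reconstruction and prior-preservation terms the examiner maps in claim 3 appear in Ruiz §3.3 as a single training objective. The LaTeX below is a transcription from the paper and should be checked against the original before relying on it; notation is Ruiz's, with \mathbf{x} the subject image, \mathbf{c} the identifier-plus-class prompt, \mathbf{x}_{\mathrm{pr}} a class image sampled from the frozen model, and \mathbf{c}_{\mathrm{pr}} the class-only prompt:

    \mathbb{E}_{\mathbf{x},\mathbf{c},\boldsymbol{\epsilon},\boldsymbol{\epsilon}',t}\Big[
      w_t \,\big\| \hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x} + \sigma_t \boldsymbol{\epsilon}, \mathbf{c}) - \mathbf{x} \big\|_2^2
      \;+\; \lambda\, w_{t'} \,\big\| \hat{\mathbf{x}}_\theta(\alpha_{t'} \mathbf{x}_{\mathrm{pr}} + \sigma_{t'} \boldsymbol{\epsilon}', \mathbf{c}_{\mathrm{pr}}) - \mathbf{x}_{\mathrm{pr}} \big\|_2^2
    \Big]

The first term is the reconstruction loss on the item's own images; the second, weighted by λ, is the class-specific prior preservation loss against the frozen model's class samples. Backpropagating through both terms is the step the examiner treats as "suggested or implied" by the training process.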
Regarding claim 4, the combination of Unnikrishnan in view of Rezaeian and further in view of Ruiz teaches: The method of claim 2, wherein obtaining the images of a particular item in the one or more items comprises obtaining images of the particular item at different angles, perspectives, or arrangements (Ruiz figs. 6 and 7 show input images of a subject at different angles). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Unnikrishnan in view of Rezaeian and further in view of Ruiz with the additional teachings of Ruiz to include input images of a given item at different angles in order to improve the variety of the training data.

Regarding claim 5, the combination of Unnikrishnan in view of Rezaeian teaches: The method of claim 1, but is not relied upon to teach: wherein the third prompt further comprises a unique identifier for a particular item and a class for the particular item, and wherein the one or more generated images include the particular item.

Ruiz teaches: wherein the third prompt further comprises a unique identifier for a particular item and a class for the particular item, and wherein the one or more generated images include the particular item (pg. 4 col. 1 section "Designing Prompts for Few-Shot Personalization": "Our goal is to "implant" a new (unique identifier, subject) pair into the diffusion model's "dictionary". In order to bypass the overhead of writing detailed image descriptions for a given image set we opt for a simpler approach and label all input images of the subject "a [identifier] [class noun]", where [identifier] is a unique identifier linked to the subject and [class noun] is a coarse class descriptor of the subject (e.g. cat, dog, watch, etc.). The class descriptor can be provided by the user or obtained using a classifier."; fig. 4 gives several examples of generated images and their corresponding prompts, each of which contains a unique identifier and a category).

Ruiz is analogous to the claimed invention because it is in the same field of generating images of an item using a diffusion model, and it pertains to the same issue in which both the item category and the specific item type need to be taken into account. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Unnikrishnan in view of Rezaeian with the teachings of Ruiz to incorporate a system where the diffusion model is trained and used by inputting prompts containing both an item class and a unique identifier. The motivation would have been to increase the potential variety of generated images for a particular item by using prior data from images of other items from the same class, while still preserving that particular item's appearance, as taught by Ruiz (pg. 4 col. 1 section "Designing Prompts for Few-Shot Personalization"): "In essence, we seek to leverage the model's prior of the specific class and entangle it with the embedding of our subject's unique identifier so we can leverage the visual prior to generate new poses and articulations of the subject in different contexts".

Regarding claims 10-18, they are rejected using the same references, rationale, and motivations to combine as claims 1-9 respectively because their limitations substantially correspond with the limitations of claims 1-9 respectively, as well as the additional limitation of: A computer program product comprising a non-transitory computer readable storage medium (Unnikrishnan fig. 12 element 1212; [0160] to [0161]) having instructions encoded thereon that, when executed by a processor, cause the processor to perform steps (Unnikrishnan [0158] "The technology described herein may be described in the general context of computer code or machine-usable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device.").

Regarding claims 19-24, they are rejected using the same references, rationale, and motivations to combine as claims 1-5 and 7 respectively because their limitations substantially correspond with the limitations of claims 1-5 and 7 respectively, as well as the additional limitation of: A computer system (Unnikrishnan fig. 12, computing device 1200) comprising: a processor (Unnikrishnan fig. 12 element 1214); and a non-transitory computer-readable storage medium (Unnikrishnan fig. 12 element 1212; [0160] to [0161]) having instructions that, when executed by the processor, cause the computer system to perform steps (Unnikrishnan [0158] "The technology described herein may be described in the general context of computer code or machine-usable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device.").

References Cited

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Daha et al. (US 20240386661 A1) teaches a method of generating a prompt for a text-to-image diffusion model which operates under the same principles as Ruiz ("DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation"), including a fine-tuning data set featuring images, class names, and unique identifiers. The teachings of Daha et al. overlap portions of both Ruiz and Unnikrishnan.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN STATZ, whose telephone number is (571) 272-6654. The examiner can normally be reached Mon-Fri, 8am-5pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Tammy Goddard, can be reached at (571) 272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/BENJAMIN TOM STATZ/
Examiner, Art Unit 2611

/TAMMY PAIGE GODDARD/
Supervisory Patent Examiner, Art Unit 2611

Prosecution Timeline

Jul 26, 2024: Application Filed
Mar 17, 2026: Non-Final Rejection, §103 (current)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 0%
With Interview: 0% (+0.0%)
Median Time to Grant: 2y 9m
PTA Risk: Low

Based on 2 resolved cases by this examiner. Grant probability derived from career allow rate.
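The tool's exact methodology is not disclosed, but the note above suggests a direct derivation from the examiner's career data. A hedged Python sketch follows, using only the figures shown on this page; the 0.62 TC average is inferred from the "-62.0% vs TC avg" delta, not independently known.

    granted, resolved = 0, 2                 # "0 granted / 2 resolved"
    career_allow_rate = granted / resolved   # 0.0 -> "0%"

    tc_avg_allow_rate = 0.62                 # inferred from "-62.0% vs TC avg"
    delta_vs_tc = career_allow_rate - tc_avg_allow_rate  # -0.62 -> "-62.0%"

    # "Grant probability derived from career allow rate"; a +0.0% interview
    # lift leaves the with-interview figure unchanged.
    grant_probability = career_allow_rate
    with_interview = grant_probability + 0.000

    print(f"{grant_probability:.0%} grant probability, "
          f"{delta_vs_tc:+.1%} vs TC avg, "
          f"{with_interview:.0%} with interview")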
