Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-7, 11-13, 15-16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. (US20250077765).
Regarding Claim 1. Xu teaches A computer-implemented method (Xu, abstract, the invention describes systems and methods to enhance the process of creating an artificial intelligence (AI) generated content or content items, such as images, text, video, sounds, etc., using a text or other suitable prompt, such as via voice input. The systems and methods disclosed provide streamlined content generation with, e.g., reduced processing power and computing time. In an embodiment the systems and methods receive a prompt for generating a first content item using a generative artificial intelligence (AI) model and retrieve, based on the prompt, a collection of matching content items. The systems and methods may then receive input selecting one of the content items from the collection and identify a prompt used to generate the selected content item. The systems and methods may then merge using a trained natural language processing model, the received prompt with the prompt of the selected content item to create a third prompt. In an embodiment the systems and methods may modify the third prompt based on additional input and, based on the modified third prompt, generate a second content item.) comprising:
receiving a user-identified content item (Xu, [0026] FIG.1A shows an example embodiment of the systems and methods described herein. In step 101, the system 110 presents a user interface 130 through which the system receives a prompt 102 to generate an AI-generated content item such as image 109. The disclosure in some embodiment also or alternatively generates other content items such as video, text, audio, 3-D and 2-D models, animation, and multimedia among others. Prompt 102 may be, for example, text describing a requested image 109.);
based on the user-identified content item, generating first content
items using one or more generative models (Xu, [0003] Text-to-image models, for instance, are a type of neural network that generates images based on a textual
input, e.g., a prompt, such as a sentence or a paragraph describing the requested image.
[0026] FIG.1A shows an example embodiment of the systems and methods described herein. In step 101, the system 110 presents a user interface 130 through which the system receives a prompt 102 to generate an AI-generated content item such as image 109. The disclosure in some embodiment also or alternatively generates other content items such as video, text, audio, 3-D and 2-D models, animation, and multimedia among others. Prompt 102 may be, for example, text describing a requested image 109.
[0027] FIG. 1B shows an illustrative process of creating AI-generated images using an existing system 150 rather than system 110. In such a process, a user interface provides a number of variables to direct and begin the process. First, at step 112, the system 150 receives an initial prompt that includes a subject and qualifiers. … At step 120 it generates images, image 1 through image n. The system 150 repeats the process with parameter adjustment until the generated images begin resembling an intended image.
Therefore, images 1 through n correspond to the claimed first content items.);
presenting the first content items in one or more first generative
containers (Xu, as shown in Fig. 1A, generated image content is displayed at the bottom of the GUI.
Although Xu does not explicitly use the term "generative containers," Xu teaches displaying generated image content in a predefined section of the user interface. It would have been obvious to a person of ordinary skill in the art that the bottom section of the user interface holding the generated images corresponds to the claimed generative containers.);
receiving a user selection of a selected first content item from a
selected first generative container (Xu, [0041] FIG. 5 shows a continuation of the embodiment of the inquiry shown in FIG. 4. In FIG. 5, the system has received a suggested prompt 501, "an old priest in red." Cursor 402 indicates a selection of the search option 502, instructing the system to execute a search for an image in search engine 205 using the inputs including the prompt 501.
[0043], … The system may then display the search results 504, a collection of previously generated images matching the search parameters. The system 201 can then receive a selection of one chosen image of the previously generated images to download. In another embodiment this chosen image is designated as closest match 104a either as well as or instead of downloading. At this time, system 201 adds this chosen image to the generated-image database 203 indicating via metadata that the image is a match for the prompt, model, and parameters. If the results are not satisfactory, the process can repeat the steps of modifying the input prompt and changing parameters. At any stage whenever one of the returned images from the image-search engine meets the expectation, the system can directly download the returned image.);
receiving a requested refinement to the selected first content item (Xu, [0043], … The system may then display the search results 504, a collection of previously generated images matching the search parameters. The system 201 can then receive a selection of one chosen image of the previously generated images to download. In another embodiment this chosen image is designated as closest match 104a either as well as or instead of downloading. At this time, system 201 adds this chosen image to the generated-image database 203 indicating via metadata that the image is a match for the prompt, model, and parameters. If the results are not satisfactory, the process can repeat the steps of modifying the input prompt and changing parameters. At any stage whenever one of the returned images from the image-search engine meets the expectation, the system can directly download the returned image.);
based on the selected first content item and the requested refinement, generating second content items using the one or more generative models (Xu, [0044] FIG. 6 shows the system receiving a selection of a particular search result 601 in the example embodiment shown in FIGS. 4 and 5. Scrolling may display more results. Box 602 displays metadata associated with result 601 which includes generation parameters in the form of a model 604, prompt 606, and parameters 608.
[0045] FIG. 7 shows the embodiment of FIGS. 4-6 receiving a request to merge the prompt 606 with prompt 501, as indicated by the position of cursor 402 over the merge option 701. Upon receiving the selection of merge option 701, the system uses the prompt analysis and merge engine 206 to merge the prompts 501 and 606 to create prompt 702 "a 68 year old priest with red robe, Vincent Van Gogh." The model 604 and parameters 608 are input upon receiving the instructions to merge as well.
Therefore, merging prompt 606 with prompt 501 refines the content item created from prompt 606, with prompt 501 serving as the requested refinement.); and
presenting the second content items in one or more second generative containers (Xu, [0046] FIG. 8 shows the embodiment of FIGS. 4-7 after the system 201 has merged the inputs. The system 201 displays the results of the search 801 using the merged search elements, that is prompt 702, model 604, and parameters 608.
[0047] When the system receives instruction 805 to generate an image it may generate and display a newly created image. The system may receive this instruction after, for example, the prompt and other search elements are satisfactory. It may also store the new image with its metadata to previously generated image database 203.).
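For illustration only, the following is a minimal sketch of the claim 1 sequence as mapped onto Xu. It is not Xu's code; every name in it (GenerativeContainer, generate_items, merge_prompts) is a hypothetical placeholder standing in for Xu's text-to-image inference engine 204 and prompt analysis and merge engine 206.

# Illustrative sketch only; all names are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class GenerativeContainer:
    """A UI region holding a batch of generated content items."""
    items: list = field(default_factory=list)

def generate_items(prompt: str, n: int = 3) -> list[str]:
    # Placeholder for a call into a text-to-image model (Xu's engine 204).
    return [f"{prompt} [variant {i}]" for i in range(n)]

def merge_prompts(base: str, refinement: str) -> str:
    # Placeholder for Xu's prompt analysis and merge engine 206.
    return f"{base}, {refinement}"

# Step 1: receive a user-identified content item (represented by its prompt).
user_prompt = "an old priest in red"

# Step 2: generate first content items and present them in a first container.
first_container = GenerativeContainer(items=generate_items(user_prompt))

# Step 3: user selects one item from the container and requests a refinement.
selected = first_container.items[0]
refinement = "Vincent Van Gogh style"

# Step 4: refine the prompt behind the selected item and generate second
# content items into a second generative container.
refined_prompt = merge_prompts(user_prompt, refinement)
second_container = GenerativeContainer(items=generate_items(refined_prompt))
print(selected, "->", second_container.items)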
Regarding Claim 2. Xu further teaches The computer-implemented method of claim 1, wherein the generating the second content items comprises:
refining a prompt used to generate the selected first content item based on the requested refinement to obtain a refined prompt; and
inputting the refined prompt to the one or more generative models (Xu, [0044] FIG. 6 shows the system receiving a selection of a particular search result 601 in the example embodiment shown in FIGS. 4 and 5. Scrolling may display more results. Box 602 displays metadata associated with result 601 which includes generation parameters in the form of a model 604, prompt 606, and parameters 608.
[0045] FIG. 7 shows the embodiment of FIGS. 4-6 receiving a request to merge the prompt 606 with prompt 501, as indicated by the position of cursor 402 over the merge option 701. Upon receiving the selection of merge option 701, the system uses the prompt analysis and merge engine 206 to merge the prompts 501 and 606 to create prompt 702 "a 68 year old priest with red robe, Vincent Van Gogh." The model 604 and parameters 608 are input upon receiving the instructions to merge as well.
Therefore, merging prompt 606 with prompt 501 refines the prompt that was used to generate the selected content item, yielding a refined prompt.
[0046] FIG. 8 shows the embodiment of FIGS. 4-7 after the system 201 has merged the inputs. The system 201 displays the results of the search 801 using the merged search elements, that is prompt 702, model 604, and parameters 608. In this example, the system 201 has used prompt analysis and merge engine 206 to merge "68 year old priest with red robe, Vincent Van Gogh" and "an old priest in red." … The system 201 then receives an instruction to modify the input including updating the prompt 702 to prompt 802 and updating the height 803 and width 804 of the generated image.).
Regarding Claim 3. Xu further teaches The computer-implemented method of claim 1, wherein the one or more generative models comprise a generative image model, the user-identified content item comprises a user-identified image, the first content items comprise first images, the second content items comprise second images, and the selected first content item comprises a selected first image (Xu, [0028] FIG. 2 shows an example environment of an embodiment of the disclosure including a text-to-image system 201 such as may exist within system 110. … The backend of system 201 also contains, in some embodiments, a text-to-image model inference engine 204, an image search engine 205, a prompt analysis and merge engine 206, a prompt-based model classifier 207, and a prompt-based sampler classifier 208, all of which interact to drive the image generation described in FIG. 1A. The text-to-image model inference engine 204 may in some embodiments, using the prompt and parameters as input, generate an output image.
[0044] FIG. 6 shows the system receiving a selection of a particular search result 601 in the example embodiment shown in FIGS. 4 and 5. Scrolling may display more results. Box 602 displays metadata associated with result 601 which includes generation parameters in the form of a model 604, prompt 606, and parameters 608.
[0047] When the system receives instruction 805 to generate an image it may generate and display a newly created image. The system may receive this instruction after, for example, the prompt and other search elements are satisfactory. It may also store the new image with its metadata to previously generated image database 203. Once a set of output images are generated, they may be shown on the display 209. A user can choose one of the generated images to download.
Therefore, the generated “set of output images” is equivalent to the claimed second content items.).
Regarding Claim 4. Xu further teaches The computer-implemented method of claim 3, further comprising:
obtaining image metadata relating to the user-identified image (Xu, [0044], FIG. 6 shows the system receiving a selection of a particular search result 601 in the example embodiment shown in FIGS. 4 and 5. Scrolling may display more results. Box 602 displays metadata associated with result 601 which includes generation parameters in the form of a model 604, prompt 606, and parameters 608. In some embodiments, when a cursor hovers over an image in the search results, for example search result 601, a box displaying the associated metadata, for example box 602, may be displayed. In one embodiment, box 602 may be displayed for any selected search result.);
generating first image generation prompts based on the image metadata (Xu, [0045] FIG. 7 shows the embodiment of FIGS. 4-6 receiving a request to merge the prompt 606 with prompt 501, as indicated by the position of cursor 402 over the merge option 701. Upon receiving the selection of merge option 701, the system uses the prompt analysis and merge engine 206 to merge the prompts 501 and 606 to create prompt 702 "a 68 year old priest with red robe, Vincent Van Gogh." The model 604 and parameters 608 are input upon receiving the instructions to merge as well.); and
inputting the first image generation prompts to the generative image model, the generative image model generating the first images based on the first image generation prompts (Xu, [0046] FIG. 8 shows the embodiment of FIGS. 4-7 after the system 201 has merged the inputs. The system 201 displays the results of the search 801 using the merged search elements, that is prompt 702, model 604, and parameters 608.
[0047] When the system receives instruction 805 to generate an image it may generate and display a newly created image.).
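The following hypothetical sketch illustrates deriving image generation prompts from stored image metadata, in the spirit of Xu's box 602 (model 604, prompt 606, parameters 608). The dictionary fields and the prompt variations are assumptions for illustration, not Xu's schema.

# Hypothetical sketch; field names are assumptions, not Xu's schema.
metadata = {
    "model": "text-to-image-v1",
    "prompt": "68 year old priest with red robe, Vincent Van Gogh",
    "parameters": {"height": 512, "width": 512},
}

def prompts_from_metadata(meta: dict) -> list[str]:
    """Build candidate image generation prompts from stored metadata."""
    base = meta["prompt"]
    # Simple string variations; a deployed system might instead query a
    # language model to propose alternatives.
    return [base, f"{base}, detailed", f"{base}, portrait"]

for p in prompts_from_metadata(metadata):
    print(p)  # each prompt would be fed to the generative image model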
Regarding Claim 5. Xu further teaches The computer-implemented method of claim 4, wherein the one or more generative models comprise a generative language model, the computer-implemented method further comprising:
generating a first language generation prompt based on the image metadata (Xu, abstract, the invention describes systems and methods to enhance the process of creating an artificial intelligence (AI) generated content or content items, such as images, text, video, sounds, etc., using a text or other suitable prompt, such as via voice input. The systems and methods disclosed provide streamlined content generation with, e.g., reduced processing power and computing time. In an embodiment the systems and methods receive a prompt for generating a first content item using a generative artificial intelligence (AI) model and retrieve, based on the prompt, a collection of matching content items. The systems and methods may then receive input selecting one of the content items from the collection and identify a prompt used to generate the selected content item. The systems and methods may then merge using a trained natural language processing model, the received prompt with the prompt of the selected content item to create a third prompt. In an embodiment the systems and methods may modify the third prompt based on additional input and, based on the modified third prompt, generate a second content item.
[0045] FIG. 7 shows the embodiment of FIGS. 4-6 receiving a request to merge the prompt 606 with prompt 501, as indicated by the position of cursor 402 over the merge option 701. Upon receiving the selection of merge option 701, the system uses the prompt analysis and merge engine 206 to merge the prompts 501 and 606 to create prompt 702 "a 68 year old priest with red robe, Vincent Van Gogh." The model 604 and parameters 608 are input upon receiving the instructions to merge as well.); and
inputting the first language generation prompt to the generative language model, the generative language model outputting the first image generation prompts in response to the first language generation prompt (Xu, [0030] The prompt analysis and merge engine 206, in some embodiments, includes a sentence merging model that can be trained to merge the main descriptions of two prompts together by fine tuning a large pretrained language model, like OpenAI's GPT BERT, XLNet, or ROBERTa with collected training data.
[0046] FIG. 8 shows the embodiment of FIGS. 4-7 after the system 201 has merged the inputs. The system 201 displays the results of the search 801 using the merged search elements, that is prompt 702, model 604, and parameters 608. In this example, the system 201 has used prompt analysis and merge engine 206 to merge "68 year old priest with red robe, Vincent Van Gogh" and "an old priest in red." The engine 206, in one embodiment, recognizes the individual words as main descriptions and modifiers. In both prompts "priest" is a main description. "Old," "68 year old," "with red robe," "Vincent Van Gogh," and "in red" are modifiers. It also recognizes that "red robe" and "in red" are overlaps along with "old" and 68 year old." These terms are therefore reduced in the example. The system 201 then receives an instruction to modify the input including updating the prompt 702 to prompt 802 and updating the height 803 and width 804 of the generated image.).
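As a worked illustration of the overlap reduction Xu describes in [0046], the sketch below merges the two example prompts by naive string matching. Xu instead performs the merge with a fine-tuned pretrained language model (e.g., GPT, BERT, XLNet, or RoBERTa), so this is only a simplified stand-in, not Xu's method.

# Minimal sketch of overlap-reducing prompt merging; simplified stand-in.
def merge_with_dedup(prompt_a: str, prompt_b: str) -> str:
    parts_a = [p.strip() for p in prompt_a.split(",")]
    parts_b = [p.strip() for p in prompt_b.split(",")]
    merged = list(parts_a)
    for part in parts_b:
        # Drop a modifier from prompt B if its words overlap a kept part,
        # mimicking the reduction of "in red" against "with red robe".
        words = set(part.lower().split())
        if not any(words & set(kept.lower().split()) for kept in merged):
            merged.append(part)
    return ", ".join(merged)

print(merge_with_dedup(
    "68 year old priest with red robe, Vincent Van Gogh",
    "an old priest in red",
))
# -> "68 year old priest with red robe, Vincent Van Gogh"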
Regarding Claim 6. Xu further teaches The computer-implemented method of claim 5, further comprising:
identifying a selected first image generation prompt that was used to generate the selected first image (Xu, [0044] FIG. 6 shows the system receiving a selection of a particular search result 601 in the example embodiment shown in FIGS. 4 and 5. Scrolling may display more results. Box 602 displays metadata associated with result 601 which includes generation parameters in the form of a model 604, prompt 606, and parameters 608. In some embodiments, when a cursor hovers over an image in the search results, for example search result 601, a box displaying the associated metadata, for example box 602, may be displayed. In one embodiment, box 602 may be displayed for any selected search result.);
generating a second language generation prompt based on the selected first image generation prompt and the requested refinement (Xu, [0045] FIG. 7 shows the embodiment of FIGS. 4-6 receiving a request to merge the prompt 606 with prompt 501, as indicated by the position of cursor 402 over the merge option 701. Upon receiving the selection of merge option 701, the system uses the prompt analysis and merge engine 206 to merge the prompts 501 and 606 to create prompt 702 "a 68 year old priest with red robe, Vincent Van Gogh." The model 604 and parameters 608 are input upon receiving the instructions to merge as well.); and
inputting the second language generation prompt to the generative language model (Xu, abstract, the invention describes systems and methods to enhance the process of creating an artificial intelligence (AI) generated content or content items, such as images, text, video, sounds, etc., using a text or other suitable prompt, such as via voice input. The systems and methods disclosed provide streamlined content generation with, e.g., reduced processing power and computing time. In an embodiment the systems and methods receive a prompt for generating a first content item using a generative artificial intelligence (AI) model and retrieve, based on the prompt, a collection of matching content items. The systems and methods may then receive input selecting one of the content items from the collection and identify a prompt used to generate the selected content item. The systems and methods may then merge using a trained natural language processing model, the received prompt with the prompt of the selected content item to create a third prompt. In an embodiment the systems and methods may modify the third prompt based on additional input and, based on the modified third prompt, generate a second content item.
[0046] FIG. 8 shows the embodiment of FIGS. 4-7 after the system 201 has merged the inputs. The system 201 displays the results of the search 801 using the merged search elements, that is prompt 702, model 604, and parameters 608.
[0047] When the system receives instruction 805 to generate an image it may generate and display a newly created image. The system may receive this instruction after, for example, the prompt and other search elements are satisfactory. It may also store the new image with its metadata to previously generated image database 203.).
Regarding Claim 7. Xu further teaches The computer-implemented method of claim 6, further comprising:
receiving second image generation prompts from the generative language model in response to the second language generation prompt (Xu, [0045] FIG. 7 shows the embodiment of FIGS. 4-6 receiving a request to merge the prompt 606 with prompt 501, as indicated by the position of cursor 402 over the merge option 701. Upon receiving the selection of merge option 701, the system uses the prompt analysis and merge engine 206 to merge the prompts 501 and 606 to create prompt 702 "a 68 year old priest with red robe, Vincent Van Gogh." The model 604 and parameters 608 are input upon receiving the instructions to merge as well.); and
inputting the second image generation prompts to the generative image model, the generative image model generating the second images based on the second image generation prompts (Xu, abstract, the invention describes systems and methods to enhance the process of creating an artificial intelligence (AI) generated content or content items, such as images, text, video, sounds, etc., using a text or other suitable prompt, such as via voice input. The systems and methods disclosed provide streamlined content generation with, e.g., reduced processing power and computing time. In an embodiment the systems and methods receive a prompt for generating a first content item using a generative artificial intelligence (AI) model and retrieve, based on the prompt, a collection of matching content items. The systems and methods may then receive input selecting one of the content items from the collection and identify a prompt used to generate the selected content item. The systems and methods may then merge using a trained natural language processing model, the received prompt with the prompt of the selected content item to create a third prompt. In an embodiment the systems and methods may modify the third prompt based on additional input and, based on the modified third prompt, generate a second content item.
[0046] FIG. 8 shows the embodiment of FIGS. 4-7 after the system 201 has merged the inputs. The system 201 displays the results of the search 801 using the merged search elements, that is prompt 702, model 604, and parameters 608.
[0047] When the system receives instruction 805 to generate an image it may generate and display a newly created image. The system may receive this instruction after, for example, the prompt and other search elements are satisfactory. It may also store the new image with its metadata to previously generated image database 203.).
Regarding Claim 11. Xu further teaches The computer-implemented method of claim 7, wherein the generative language model and the generative image model are separate models (Xu, [0028] FIG. 2 shows an example environment of an embodiment of the disclosure including a text-to-image system 201 such as may exist within system 110. … The backend of system 201 also contains, in some embodiments, a text-to-image model inference engine 204, an image search engine 205, a prompt analysis and merge engine 206, a prompt-based model classifier 207, and a prompt-based sampler classifier 208, all of which interact to drive the image generation described in FIG. 1A. The text-to-image model inference engine 204 may in some embodiments, using the prompt and parameters as input, generate an output image.
[0030] The prompt analysis and merge engine 206, in some embodiments, includes a sentence merging model that can be trained to merge the main descriptions of two prompts together by fine tuning a large pretrained language model, like OpenAI's GPT BERT, XLNet, or ROBERTa with collected training data.).
Claim 12 is similar in scope to Claim 1 and is thus rejected under the same rationale. Claim 12 further requires:
a processor; and
a storage medium storing instructions (Xu, [0055] The system 201 may be implemented using any suitable architecture….Control circuitry may retrieve instructions of the application from storage and process the instructions to provide image generation and selection discussed herein.).
Regarding Claim 13. Xu further teaches The system of claim 12, wherein the instructions, when executed by the processor, cause the system to: display the one or more first generative containers and the one or more second generative containers on a user interface comprising a new container area (Xu, as shown in Fig. 1A, generated image content is displayed at the bottom of the GUI.
Although Xu does not explicitly use the term "generative containers," Xu teaches displaying generated image content in a predefined section of the user interface. It would have been obvious to a person of ordinary skill in the art that the bottom section of the user interface holding the generated images corresponds to the claimed generative containers.).
Regarding Claim 15. Xu further teaches The system of claim 12, wherein the instructions, when executed by the processor, cause the system to: receiving the requested refinement via the one or more second generative containers (Xu, [0044] FIG. 6 shows the system receiving a selection of a particular search result 601 in the example embodiment shown in FIGS. 4 and 5. Scrolling may display more results. Box 602 displays metadata associated with result 601 which includes generation parameters in the form of a model 604, prompt 606, and parameters 608.
[0045] FIG. 7 shows the embodiment of FIGS. 4-6 receiving a request to merge the prompt 606 with prompt 501, as indicated by the position of cursor 402 over the merge option 701. Upon receiving the selection of merge option 701, the system uses the prompt analysis and merge engine 206 to merge the prompts 501 and 606 to create prompt 702 "a 68 year old priest with red robe, Vincent Van Gogh." The model 604 and parameters 608 are input upon receiving the instructions to merge as well.).
Regarding Claim 16. Xu further teaches The system of claim 12, wherein at least some of the first content items and at least some of the second content items comprise natural language content items (Xu, [0045] FIG. 7 shows the embodiment of FIGS. 4-6 receiving a request to merge the prompt 606 with prompt 501, as indicated by the position of cursor 402 over the merge option 701. Upon receiving the selection of merge option 701, the system uses the prompt analysis and merge engine 206 to merge the prompts 501 and 606 to create prompt 702 "a 68 year old priest with red robe, Vincent Van Gogh." The model 604 and parameters 608 are input upon receiving the instructions to merge as well.).
Claim 20 is similar in scope to Claim 1 and is thus rejected under the same rationale.
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. (US20250077765) in view of Aberman et al. (US20250349040).
Regarding Claim 8. Xu fails to explicitly teach, however, Aberman teaches The computer-implemented method of claim 7, further comprising constraining the generative image model based on a depth map obtained from the user-identified image (Aberman, abstract, the invention describes method of personalized image generation using combined image features. A plurality of input images is provided by a user of an interaction application. Each of the plurality of input images depicts at least part of a subject. Each input image is encoded to obtain an identity representation. The identity representations obtained from the plurality of input images are combined to obtain a combined identity representation associated with the subject. A personalized output image is generated via a generative machine learning model. The generative machine learning model processes the combined identity representation and at least one additional image generation control to generate the personalized output image. At a user device, the personalized output image is presented in a user interface of the interaction application.
[0031] As mentioned above, a text prompt is an example of an image generation control. More specifically, a text prompt representation, obtained by processing the text prompt via a text encoder, can be used as the additional image generation control. Alternatively, or additionally, one or more structural conditions can be used as image generation controls. Examples of structural conditions include structural maps, edge maps, depth maps, or pose maps that guide image generation from a structural or spatial perspective. A structural condition might, for example, be provided as an additional input to specify where to position one or more objects relative to each other in the personalized output image.).
Xu and Aberman are analogous art because they both teach text-to-image generation using a generative image model. Aberman further teaches generating images based on additional conditions, including depth maps. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the text-to-image generation method (taught in Xu) to further incorporate a depth map obtained from the input image (taught in Aberman), so as to automatically generate corresponding visual outputs (Aberman, [0002]).
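For illustration, a minimal sketch of depth-constrained generation in the spirit of Aberman's structural conditions ([0031]) follows. The depth estimator and the conditioned generator are invented placeholders, not Aberman's implementation; real systems pass the depth map to a structure-conditioned diffusion model.

# Hypothetical sketch; estimator and generator are placeholders.
from dataclasses import dataclass

@dataclass
class GenerationRequest:
    prompt: str
    depth_map: list[list[float]]  # per-pixel depth used as a structural condition

def estimate_depth(image_pixels: list[list[int]]) -> list[list[float]]:
    # Placeholder: a monocular depth estimator would run here.
    return [[float(px) / 255.0 for px in row] for row in image_pixels]

def generate_constrained(req: GenerationRequest) -> str:
    # Placeholder: the depth map would spatially guide the image model.
    return f"image for '{req.prompt}' guided by {len(req.depth_map)}-row depth map"

user_image = [[0, 128], [255, 64]]  # toy 2x2 grayscale user-identified image
req = GenerationRequest("an old priest in red", estimate_depth(user_image))
print(generate_constrained(req))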
Claims 9 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. (US20250077765) in view of Doken et al. (US20250328934).
Regarding Claim 9. Xu fails to explicitly teach, however, Doken teaches The computer-implemented method of claim 7, further comprising:
prompting the generative image model by at least one of the first image generation prompts or the second image generation prompts to inpaint part of the user-identified image, outpaint the user-identified image, or restyle the user-identified image (Doken, abstract, the invention describes methods are described for identifying content on a social media platform and reactions thereto. The system may input, to a first machine learning model, data indicating the reactions, and receive, as output, sentiment data for the reactions. The system may determine, based on the sentiment data, a reaction having a negative sentiment. The system may identify, as a portion of the content to be modified, a portion of the content corresponding to a portion of the reaction having the negative sentiment, and input, to a second machine learning model, data indicating at least a portion of the content and data indicating the identified portion of the content. The system may receive, as output, a regenerated version of the content, and cause the content on the social media platform to be modified based on, or supplemented with the regenerated version of the content.
[0072] In some embodiments, the system may determine that a specific portion of text of one or more of the reactions to content 108 corresponds to a specific portion of audio, image, text or video frame of content 108. For example, if a reaction to content 108 was "Why can't we see the daughter's face?," the system may determine that that a suitable modification to content 108 when regenerating content 108 would be to generate a full view of face of the model wearing the sweatshirt (e.g., by retrieving an uncropped version of the image, or outpainting or generating the remaining portion of the face of the girl using an AI model, for inclusion in the regenerated version 136 of content 108). As another example, the system may prompt the generative model to replace the specific portion of audio, image, text or video frame of content 108 with a positive or neutral counterpart, resulting in a modified content (e.g., modified image, text, audio and/or video) with respect to the original content 108. In some embodiments, such generative model may be a general-purpose model fine-tuned on the product being advertised in content 108. For example, a new image may be generated by a text-to-image model that includes a low-rank adaptation (low-rank adaptation of large language models or LoRA, or any other suitable machine learning model, or custom implementations of a machine learning model, or any combination thereof) of a generative model, which may have been built and/or trained using images of the product or service being advertised).
Xu and Doken are analogous art because they both teach text-to-image generation using a generative image model. Doken further teaches regenerating images, including replacing and outpainting portions of an image. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the text-to-image generation method (taught in Xu) to further provide image modification including replacing and outpainting (taught in Doken), so as to provide a machine learning model that automatically regenerates content based on sentiment and/or semantic analysis of users' reactions to the content (Doken, [0001]).
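The sketch below illustrates, under assumed names, the three edit modes the claim recites (inpaint, outpaint, restyle). The mode dispatch and mask convention are inventions for illustration; Doken's system drives comparable edits from sentiment analysis of user reactions rather than from a direct mode argument.

# Illustrative sketch only; the dispatch and mask convention are assumed.
from typing import Optional

def edit_image(image: str, mode: str, prompt: str, mask: Optional[str] = None) -> str:
    if mode == "inpaint":
        # Regenerate only the masked region, keeping the rest fixed.
        return f"{image} with region '{mask}' regenerated as '{prompt}'"
    if mode == "outpaint":
        # Extend the canvas beyond its original borders.
        return f"{image} extended outward per '{prompt}'"
    if mode == "restyle":
        # Re-render the whole image in a new style.
        return f"{image} restyled as '{prompt}'"
    raise ValueError(f"unknown mode: {mode}")

print(edit_image("portrait.png", "outpaint", "show the full face"))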
Regarding Claim 17. The combination of Xu and Doken further teaches The system of claim 12, wherein at least some of the first content items and at least some of the second content items comprise video content items or audio content items (Doken, [0072] In some embodiments, the system may determine that a specific portion of text of one or more of the reactions to content 108 corresponds to a specific portion of audio, image, text or video frame of content 108. For example, if a reaction to content 108 was "Why can't we see the daughter's face?," the system may determine that that a suitable modification to content 108 when regenerating content 108 would be to generate a full view of face of the model wearing the sweatshirt (e.g., by retrieving an uncropped version of the image, or outpainting or generating the remaining portion of the face of the girl using an AI model, for inclusion in the regenerated version 136 of content 108). As another example, the system may prompt the generative model to replace the specific portion of audio, image, text or video frame of content 108 with a positive or neutral counterpart, resulting in a modified content (e.g., modified image, text, audio and/or video) with respect to the original content 108. In some embodiments, such generative model may be a general-purpose model fine-tuned on the product being advertised in content 108. For example, a new image may be generated by a text-to-image model that includes a low-rank adaptation (low-rank adaptation of large language models or LoRA, or any other suitable machine learning model, or custom implementations of a machine learning model, or any combination thereof) of a generative model, which may have been built and/or trained using images of the product or service being advertised).
The reasoning for combination of Xu and Doken is the same as described in Claim 9.
Regarding Claim 18. The combination of Xu and Doken further teaches The system of claim 12, wherein the instructions, when executed by the processor, cause the system to:
receive a user selection of a portion of the selected first content item (Doken, [0069] In some embodiments, the system may receive input specifying, or automatically specify, a desired output length, which may or may not be equal to a number of words or characters (and/or may or may not comprise a same amount of imagery) as the portions of content 108 being replaced. In some embodiments, if the system is not able to locate an equivalent word or other portion, the output length may be extended to accommodate multiple portions to replace the identified problematic portion. In some embodiments, if no equivalent can be found, the offending word can be removed from content 108 and content 108 may be regenerated (e.g., using a generative AI model). In some embodiments, if the LLM or other machine learning model, which may be used in relation to step 134, is not general purpose, but more vertical specific, further embeddings may be provided such as "As an advertisement specialist of product X" as preambles.); and
prompt the one or more generative models to generate the second content items by modifying the selected portion of the selected first content item (Doken, [0072] In some embodiments, the system may determine that a specific portion of text of one or more of the reactions to content 108 corresponds to a specific portion of audio, image, text or video frame of content 108. For example, if a reaction to content 108 was "Why can't we see the daughter's face?," the system may determine that that a suitable modification to content 108 when regenerating content 108 would be to generate a full view of face of the model wearing the sweatshirt (e.g., by retrieving an uncropped version of the image, or outpainting or generating the remaining portion of the face of the girl using an AI model, for inclusion in the regenerated version 136 of content 108). As another example, the system may prompt the generative model to replace the specific portion of audio, image, text or video frame of content 108 with a positive or neutral counterpart, resulting in a modified content (e.g., modified image, text, audio and/or video) with respect to the original content 108. In some embodiments, such generative model may be a general-purpose model fine-tuned on the product being advertised in content 108. For example, a new image may be generated by a text-to-image model that includes a low-rank adaptation (low-rank adaptation of large language models or LoRA, or any other suitable machine learning model, or custom implementations of a machine learning model, or any combination thereof) of a generative model, which may have been built and/or trained using images of the product or service being advertised).
The reasoning for combination of Xu and Doken is the same as described in Claim 9.
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. (US20250077765) in view of Doken et al. (US20250328934), further in view of Perkins (US20250377996).
Regarding Claim 10. The combination of Xu and Doken fails to explicitly teach, however, Perkins teaches The computer-implemented method of claim 7, wherein the generative language model and the generative image model are implemented as a multi-modal generative model (Perkins, abstract, the invention describes methods for visual troubleshooting a network device setup. Images of the network device setup are provided to the system. A GenAI component processes the images to generate one or more device identifying features. The features are further processed to identify the device. The system utilizes hardware-specific information to prompt the GenAI component to answer troubleshooting-related questions concerning the device setup. The images may be pre-processed to include one or more visual guides to assist the GenAI component.
[0022] In general, GenAI component 102b may comprise one or more multimodal generative artificial intelligence models (e.g., multimodal transformer models) configured to perform and/or capable of being prompted to perform, one or more tasks, including generating one or more insights (e.g. natural language insights) based on one or more files, data structures, streams, etc. containing natural language text, videos, images, etc. In particular, in the embodiments described herein, GenAI component 102b is capable of being prompted to perform one or more tasks in relation to one or more images of network hardware devices.
Doken, [0100] At 512, content platform 504 (e.g., social media platform 104 of FIG. 1) transmits measurements associated with content (e.g., an advertisement) to DMP 510. For example, the social media platform may store information such as likes, comments, engagements, other post metrics, or a combination thereof, regarding a social media post on the DMP's databases. At 514, content platform 504 performs an analysis of reactions (e.g., user comments) associated with the content (e.g., sentiment analysis machine) and/or performs a correction of such content. For example, the content platform may use multi-modal machine learning models to determine a sentiment and/or semantic analysis of comments of a content item. The content platform may use secondary generative multi-modal machine learning models to regenerate portions of or the complete content.).
Xu, Doken, and Perkins are analogous art because they all teach methods of using artificial intelligence generative models to analyze and create content, including images and/or text. The combination of Xu and Doken further teaches text-to-image generation using a generative image model. Doken further teaches a multi-modal generative model for creating images. Perkins further teaches using a multi-modal generative model to create language content. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the generative models of the text-to-image generation method (taught in Xu and Doken) to use a single multi-modal generative model to create both images and text (taught in Perkins), so as to provide a compact machine learning model that automatically regenerates various content.
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. (US20250077765) in view of Kim et al. (US20240169631).
Regarding Claim 14. Xu fails to explicitly teach, however, Kim teaches The system of claim 13, wherein the user selection of the selected first content item comprises a movement of the selected first content item from a selected first generative container to the new container area, the one or more second generative containers being generated in response to the movement (Kim, abstract, the invention describes a method for modifying digital images via scene-based editing to remove a shadow for an object. For instance, in one or more embodiments, the disclosed systems receive a digital image depicting a scene. The disclosed systems access a shadow mask of the shadow in a first location. Further, the disclosed systems generate the modified digital image without the shadow by generating a fill for the first location that preserves a visible location of the first location. Moreover, the disclosed systems generate the digital image without the shadow for the object by combining the fill with the digital image.
[0155] As shown, the scene-based image editing system 106 utilizes the cascaded modulation inpainting neural network 420 to generate replacement pixels for the replacement region 404. In one or more embodiments, the cascaded modulation inpainting neural network 420 includes a generative adversarial neural network for generating replacement pixels. In some embodiments, a generative adversarial neural network (or "GAN") includes a neural network that is tuned or trained via an adversarial process to generate an output digital image (e.g., from an input digital image).
[0620] As shown in FIG. 57, the scene-based image editing system 106 receives an input digital image 5700 depicting a person walking along the cement with a shadow associated with the person. Further, as shown, the scene-based image editing system 106 further receives an indication via a drag-and-drop action 5702 to move the person and the associated shadow in the input digital image 5700. As also shown, the scene-based image editing system 106 utilizes a shadow proxy generation model 5706 to generate a proxy shadow for the person in a first intermediate position 5710 within the digital image.
Xu, [0046] FIG. 8 shows the embodiment of FIGS. 4-7 after the system 201 has merged the inputs. The system 201 displays the results of the search 801 using the merged search elements, that is prompt 702, model 604, and parameters 608.
[0047] When the system receives instruction 805 to generate an image it may generate and display a newly created image. The system may receive this instruction after, for example, the prompt and other search elements are satisfactory. It may also store the new image with its metadata to previously generated image database 203.
Drag-and-drop is a common operation in a graphical user interface. For example, a user can drag a file and drop it at a new location, and the new location will then contain a copy of the original file. Therefore, it would have been obvious to a person of ordinary skill in the art to use such a drag-and-drop operation (taught in Kim) in the text-to-image method (taught in Xu).).
Xu and Kim are analogous art because they both teach methods of using artificial intelligence generative models to analyze and create content, including images and/or text. Xu further teaches text-to-image generation using a generative image model. Kim further teaches a drag-and-drop operation in an image editing process. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the text-to-image generation method (taught in Xu) to use the drag-and-drop operation when creating a new image based on user input (taught in Kim), so as to provide an intuitive user interface for refining the images generated from text.
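For illustration, a hypothetical sketch of the claimed interaction follows, in which dropping a selected item onto the new-container area spawns a second generative container. The event model is invented for this sketch and is not Kim's scene-editor implementation.

# Hypothetical sketch; the drop handler and container model are assumed.
from dataclasses import dataclass, field

@dataclass
class Container:
    items: list = field(default_factory=list)

def on_drop(item: str, source: Container, new_container_area: list) -> Container:
    """Handle moving `item` from `source` into the new-container area."""
    source.items.remove(item)                  # the item leaves its first container
    new_container = Container(items=[item])    # a second container is created
    new_container_area.append(new_container)   # in response to the movement
    return new_container

workspace: list = []  # the "new container area" of the user interface
first = Container(items=["image_1", "image_2"])
second = on_drop(first.items[0], first, workspace)
print(first.items, second.items)  # ['image_2'] ['image_1']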
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. (US20250077765) in view of Lott et al. (US20240320433).
Regarding Claim 19. Xu fails to explicitly teach, however, Lott teaches The system of claim 12, wherein the instructions, when executed by the processor, cause the system to:
store a tree data structure having nodes representing the containers, each node having one or more corresponding prompts and associated context (Lott, abstract, the invention describes methods for generating a response to an input query using generative models. The method generally includes generating, based on an input query and a first generative model, a first plurality of sets of tokens. The first plurality of sets of tokens are output to a second generative model for verification. While waiting to receive an indication of a selected set of tokens from the first plurality of sets of tokens, a second plurality of sets of tokens are speculatively generated. The indication of a selected set of tokens from the first plurality of sets of tokens is received. Tokens from the second plurality of sets of tokens associated with the selected set of tokens are output to the second generative model for verification, and the selected set of tokens is output as a response to the input query.
[0003] Generative artificial intelligence models can be used in various environments in order to generate a response to an input query. For example, generative artificial intelligence models can be used in chatbot applications in which large language models (LLMs) are used to generate an answer, or at least a response, to an input query. Other examples in which generative artificial intelligence models can be used include stable diffusion, in which a model generates an image from an input text description of the content of the desired image, and decision transformers, in which future actions are predicted based on sequences of prior actions within a given environment.
[0039] When a sampled group of tokens is input into the draft model in the next iteration of executing the draft model, the tokens in the sampled group of tokens are input at the sample location and treated independently. The result may be a tree data structure 110, with a prompt as a root node 111 of the tree data structure, and subsequent levels within the tree data structure 110 representing different tokens (or groups of tokens), combined with each of the previously selected token combinations. At some point in time (e.g., after generating a tree with a defined depth, corresponding to a maximum length of a sequence generated by the draft model), the draft model may output the generated tree data structure 110 to the target model for further processing. The tree data structure 110 may, in some aspects, be output to the target model with groupings and selection probabilities generated by the draft model.).
Xu and Lott are analogous art because they both teach methods of using artificial intelligence generative models to analyze and create content, including images and/or text. Xu further teaches text-to-image generation using a generative image model. Lott further teaches storing a tree data structure for prompts. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the text-to-image generation method (taught in Xu) to use the tree data structure to process user prompts and associate them with their respective image outputs (taught in Lott), so as to provide an efficient method to process, store, and query prompts from a user in a large language model (Lott, [0001]-[0004]).
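The following is a minimal sketch of the claimed tree data structure, with each node representing a generative container and recording the prompts and context that produced it, analogous to Lott's tree with the prompt at root node 111. The node layout is an assumption for illustration, not Lott's data structure.

# Minimal sketch; the node fields are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class ContainerNode:
    prompts: list[str]
    context: dict = field(default_factory=dict)
    children: list["ContainerNode"] = field(default_factory=list)

    def refine(self, refinement: str) -> "ContainerNode":
        """Create a child container whose prompts extend this node's."""
        child = ContainerNode(
            prompts=self.prompts + [refinement],
            context={**self.context, "parent_prompt": self.prompts[-1]},
        )
        self.children.append(child)
        return child

root = ContainerNode(prompts=["an old priest in red"], context={"model": "t2i-v1"})
child = root.refine("Vincent Van Gogh style")
print(child.prompts)  # ['an old priest in red', 'Vincent Van Gogh style']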
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIN SHENG whose telephone number is (571) 272-5734. The examiner can normally be reached M-F 9:30AM-3:30PM and 6:00PM-8:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jason Chan, can be reached at 571-272-3022. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Xin Sheng/Primary Examiner, Art Unit 2619