Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed on 4/27/2026 has been entered and made of record. Claims 1-2, 8-9, 15-16 and 20 are amended. Claims 1-20 are pending.
Response to Arguments
Applicant’s arguments with respect to the rejections of independent claims 1, 9 and 16 have been fully considered but they are not persuasive.
Applicant asserts that the combination of cited art fails to teach or suggest "generating, utilizing a second denoising step of the diffusion neural network, a prompt noise representation from the first noise representation by conditioning the second denoising step with text tokens of a first text concept and a second text concept of a text prompt ... " (p. 15 of Remarks).
Examiner notes that Xiao teaches augmented conditioning with two or more text tokens in Fig 3; “To address this, we introduce delayed subject conditioning, preserving the subject’s identity while following text instructions. It employs text-only conditioning in the early denoising stage to generate the image layout, followed by subject-augmented conditioning in the remaining denoising steps to refine the subject appearance. This simple technique effectively preserves subject identity without sacrificing editability (Figure 5)” at p. 2-3. Here, the remaining denoising steps may refer to a second denoising step. Therefore, Xiao teaches the above-argued limitations.
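For illustration only, the delayed subject conditioning quoted above can be sketched as a per-step choice of conditioning. This is a minimal sketch and not code from Xiao; the function name `conditioning_schedule`, the `text_only_ratio` parameter, and the string labels are assumptions introduced here.

```python
def conditioning_schedule(num_steps, text_only_ratio):
    """Return the conditioning label used at each denoising step.

    Hypothetical helper: the first `text_only_ratio` fraction of the steps
    uses text-only conditioning, and the remaining steps use
    subject-augmented conditioning.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= text_only_ratio <= 1.0:
        raise ValueError("text_only_ratio must be in [0, 1]")
    # Index of the first step that switches to subject-augmented conditioning.
    switch_step = int(num_steps * text_only_ratio)
    return [
        "text_only" if step < switch_step else "subject_augmented"
        for step in range(num_steps)
    ]


# Ten denoising steps, with the first half conditioned on text only.
schedule = conditioning_schedule(num_steps=10, text_only_ratio=0.5)
print(schedule)
```

Under these assumptions, every step after the switch point falls within "the remaining denoising steps" of the quoted passage.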
Applicant also alleges that the combination of art fails to teach or suggest "combining the first concept noise representation and the second concept noise representation to generate a combined concept noise representation for the second denoising step" (p. 18 of Remarks).
Examiner notes that Xiao teaches delayed subject conditioning as shown in Fig 3. Here, each text token in the text prompt may generate a corresponding noise representation that is combined with the previous noise representation to generate a combined noise representation, such as two men in a park in this example.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for pre-AIA the inventor(s), at the time the application was filed, had possession of the claimed invention.
Independent claims 1 and 16 recite the limitation “generating, utilizing the second denoising step of the diffusion neural network, a second concept noise representation for the second denoising step from the first noise representation by conditioning the second denoising step with an additional subset of the text tokens corresponding to the second text concept, wherein the prompt noise representation, the first concept noise representation, and the second concept noise representation are generated at the second denoising step”. Examiner notes that applicant discloses “As shown, FIG. 4 illustrates an initial noise representation 400, a denoising neural network generating a first noise representation 404 corresponding with a first denoising step 402. Additionally, FIG. 4 also shows a second noise representation 408 from a second denoising step 406. Moreover, FIG. 4 shows the text-to-image enhancement system 102 conditioning a third denoising step 410…” in [0068]. Here, the disclosure neither conditions the second denoising step nor generates the prompt noise representation, the first concept noise representation, and the second concept noise representation at the second denoising step; see also Fig 4 as shown below. Claims 2-8 and 17-20 are dependent claims and are rejected under the same rationale. For the purpose of the prosecution of the application, the claim language “the second denoising step” is given no patentable weight.
[Image: FIG. 4 of the present application (media_image1.png)]
Independent claim 9 recites “wherein parameters of the static diffusion neural network remain fixed during the training of the training diffusion neural network”. Applicant fails to provide a description in the specification to support this new amendment. Examiner notes that applicant discloses “In particular, the static diffusion neural network includes a diffusion neural network where the text-to-image enhancement system 102 does not modify parameters in response to determining a measure of loss.” in [0066]. Here, the parameters are left unmodified in response to determining a measure of loss, which does not mean that all the parameters remain fixed during the training of the training diffusion neural network. Claims 10-15 depend on claim 9 and are rejected under the same rationale.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-8 and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention) in view of Avrahami (WO 2024/243527 A1).
As to Claim 1, Xiao teaches A method comprising:
generating, utilizing a first denoising step of a diffusion neural network, a first noise representation (Xiao teaches a training stage in Fig 3 as shown below:
[Image: Fig 3 of Xiao (media_image2.png)]
);
generating, utilizing a second denoising step of the diffusion neural network, a prompt noise representation from the first noise representation by conditioning the second denoising step with text tokens of a first text concept and a second text concept of a text prompt; generating, utilizing the second denoising step of the diffusion neural network, a first concept noise representation for the second denoising step from the first noise representation by conditioning the second denoising step with a subset of the text tokens corresponding to the first text concept (Xiao discloses inference stage as a second diffusion neural network with delayed subject conditioning in Fig 3);
generating, utilizing the second denoising step of the diffusion neural network, a second concept noise representation for the second denoising step from the first noise representation by conditioning the second denoising step with an additional subset of the text tokens corresponding to the second text concept, wherein the prompt noise representation, the first concept noise representation, and the second concept noise representation are generated at the second denoising step (Xiao discloses delayed subject conditioning on the input text prompt in Fig 3; “To address this, we introduce delayed subject conditioning, preserving the subject’s identity while following text instructions. It employs text-only conditioning in the early denoising stage to generate the image layout, followed by subject-augmented conditioning in the remaining denoising steps to refine the subject appearance. This simple technique effectively preserves subject identity without sacrificing editability (Figure 5)” at p. 2-3.);
combining the first concept noise representation and the second concept noise representation to generate a combined concept noise representation for the second denoising step (Xiao discloses generated image at inference stage in Fig 3, for example, a man and a man sitting in a park, see also section Text-Conditioning via Cross-Attention Mechanism at p. 4 and section 4.3 Delayed Subject Conditioning in Iterative Denoising at p. 6).
Xiao teaches a loss function without a detailed description. Avrahami, in combination with Xiao, further teaches the following limitations:
comparing the combined concept noise representation, generated from the first concept noise representation and the second concept noise representation for the second denoising step, with the prompt noise representation, generated from the text prompt also for the second denoising step, to determine a concept-prompt noise representation measure of loss; and modifying parameters of the second denoising step of the diffusion neural network according to the concept-prompt noise representation measure of loss (Xiao discloses “denoising loss (Figure 3)” at p. 5 and cross-attention localization loss under section 5.4 Ablation Study; “At inference time, a random noise zT is sampled from N(0, 1) and iteratively denoised by the U-Net to the initial latent representation z0” at p. 4. Avrahami further discloses “in step 405, the system evaluates a loss function. The function includes a reconstruction loss term that generates a loss value based on a comparison of the input image and the synthetic image. As an example, in cases where a latent diffusion model is used, the reconstruction loss can be a latent diffusion loss that measures the difference between a predicted set of noise linked to the synthetic image and a set of noise intentionally added to the input image during the generation of the noised latent image” in [0053]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)… Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations” in [0075]; see also additional loss term in [0026], masked diffusion loss in [0040-0041], cross-attention loss in [0043] and Fig 2. Here, the loss functions can be used to calculate the difference between an input image and an output image during neural network processing.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the invention of Xiao with the invention of Avrahami so as to calculate a reconstruction loss or any other loss functions based on a comparison of the input image and the synthetic image.
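For illustration only, the comparing limitation of claim 1 can be sketched numerically. This is a minimal sketch under stated assumptions and is not taken from the application, Xiao, or Avrahami: combining by element-wise sum and comparing by mean squared error are choices made here, and the names `combine_concepts` and `concept_prompt_loss` and all numeric values are hypothetical.

```python
def combine_concepts(concept_noises):
    """Combine per-concept noise representations by element-wise sum.

    Assumption: the sum is one possible way of combining; it is used here
    only to make the example concrete.
    """
    length = len(concept_noises[0])
    if any(len(noise) != length for noise in concept_noises):
        raise ValueError("all noise representations must have the same length")
    return [sum(values) for values in zip(*concept_noises)]


def concept_prompt_loss(combined_noise, prompt_noise):
    """Mean squared error between the combined and prompt noise (assumed)."""
    if len(combined_noise) != len(prompt_noise):
        raise ValueError("noise representations must have the same length")
    squared = [(c - p) ** 2 for c, p in zip(combined_noise, prompt_noise)]
    return sum(squared) / len(squared)


# Toy noise representations, all taken at the same (second) denoising step.
first_concept_noise = [1.0, 0.0, 2.0, 1.0]
second_concept_noise = [0.0, 1.0, 1.0, 1.0]
prompt_noise = [1.0, 1.0, 2.0, 2.0]

combined = combine_concepts([first_concept_noise, second_concept_noise])
loss = concept_prompt_loss(combined, prompt_noise)
print(combined, loss)  # [1.0, 1.0, 3.0, 2.0] 0.25
```

In a training loop, a measure of loss of this kind would be the quantity backpropagated to modify parameters, as in the gradient-based updates quoted from Avrahami [0075].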
As to Claim 2, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the prompt noise representation further comprises: selecting the second denoising step of the diffusion neural network from a plurality of denoising steps to generate the prompt noise representation from the text prompt (Xiao, section Stable Diffusion at p. 3. Avrahami discloses “For instance, the image generation model could be… or a more complex latent diffusion model. The latter uses a series of noise-adding and denoising steps to generate the synthetic image. Thus, in some implementations, the image generation model can be a latent diffusion model that creates the synthetic image from a noised latent image” in [0052]; see also [0040, 0045].); applying a stop gradient operation to the second denoising step of the static diffusion neural network utilized to generate the first concept noise representation and the second concept noise representation, wherein the stop gradient operation controls a gradient flow by stopping a concept-prompt noise representation measure of loss from being backpropagated to more than the second denoising step selected from the plurality of denoising steps (Avrahami discloses “The update could be performed using gradient-based optimization algorithms such as stochastic gradient descent (SGD), RMSprop, or Adam” in [0056]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)” in [0075].)
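For illustration only, the general notion of a stop gradient operation can be sketched with a toy scalar automatic-differentiation class. This is a generic sketch and not the claimed operation or any cited reference's implementation; the `Value` class, its `detach` method, and the variable names are assumptions introduced here.

```python
class Value:
    """Scalar with reverse-mode gradient tracking (toy autodiff)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents  # tuple of (parent Value, local derivative)

    def __sub__(self, other):
        return Value(self.data - other.data, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def detach(self):
        """Stop gradient: same value, but no link back to its parents."""
        return Value(self.data)

    def backward(self):
        # Topologically order the graph so each node's gradient is complete
        # before it is propagated to that node's parents.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node._parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node._parents:
                parent.grad += local * node.grad


x = Value(2.0)         # shared input (stands in for a noise representation)
w_train = Value(3.0)   # parameter on the trainable path
w_static = Value(1.0)  # parameter on the path that must not be updated

prompt_out = w_train * x
concept_out = (w_static * x).detach()  # gradient flow is cut here
diff = prompt_out - concept_out
loss = diff * diff
loss.backward()

print(loss.data, w_train.grad, w_static.grad)  # 16.0 16.0 0.0
```

The gradient reaches `w_train` but not `w_static`, which is the effect a stop gradient operation has on the branch it is applied to.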
As to Claim 3, Xiao in view of Avrahami teaches The method of claim 1, further comprises: generating a third concept noise representation from a third text concept included within the text prompt; and combining the first concept noise representation, the second concept noise representation, and the third concept noise representation to generate the combined concept noise representation (Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]"” in [0039]. Here, multiple concepts within the text prompt can be extracted for individual noise representations. Xiao, Fig 3, 5, 7.)
As to Claim 4, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the prompt noise representation comprises conditioning the second denoising step of the diffusion neural network with the text tokens of the first text concept and the second text concept of the text prompt by providing the text tokens to attention mechanisms of the second denoising step, wherein the attention mechanisms focus on removing noise from portions of the first noise representation indicated by the text tokens (Xiao discloses “We use a vision encoder to derive this identity embedding from a referenced image, and then augment the generic text tokens with features from this identity embedding. This enables image generation based on subject-augmented conditioning… To tackle the multi-subject identity blending issue, we identify unregulated cross-attention as the primary reason (Figure 4). When the text includes two "person" tokens, each token’s attention map attends to both person in the image rather than linking each token to a distinct person in the image” at p. 2; “Figure 4: In the absence of cross-attention regularization (top), the diffusion model attends to multiple subjects’ input tokens and merge their identity. By applying cross-attention regularization (bottom), the diffusion model learns to focus on only one reference token while generating a subject. This ensures that the features of multiple subjects in the generated image are more separated” in Fig 4. Avrahami also discloses “In addition, (2) in order to avoid overfitting, the illustrated example uses a two-phase training regime, which starts by optimizing only the newly-added tokens…” in [0036].)
As to Claim 5, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the first concept noise representation and the second concept noise representation comprises:
conditioning the second denoising step by guiding a removal of noise from the first noise representation according to the subset of the text tokens corresponding to the first text concept; and conditioning the second denoising step by guiding an additional removal of noise from the first noise representation according to the additional subset of the text tokens corresponding to the second text concept (Xiao discloses “Figure 4: In the absence of cross-attention regularization (top), the diffusion model attends to multiple subjects’ input tokens and merge their identity. By applying cross-attention regularization (bottom), the diffusion model learns to focus on only one reference token while generating a subject. This ensures that the features of multiple subjects in the generated image are more separated”; see also section 4.2 Localizing cross-attention maps with subject segmentation masks.)
As to Claim 6, Xiao in view of Avrahami teaches The method of claim 1, further comprises: selecting an additional denoising step of the diffusion neural network from a plurality of denoising steps; and generating, utilizing the additional denoising step of the diffusion neural network, an additional prompt noise representation from an additional text prompt comprising a third text concept and a fourth text concept (Xiao teaches delayed subject conditioning in Fig 3. Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]"” in [0039]. Here, multiple concepts within the text prompt can be extracted for individual noise representations.)
As to Claim 7, Xiao in view of Avrahami teaches The method of claim 6, further comprises: generating, utilizing the additional denoising step of the diffusion neural network, a third concept noise representation and a fourth concept noise representation; generating an additional combined concept noise representation by combining the third concept noise representation and the fourth concept noise representation; and modifying parameters of the diffusion neural network by comparing the additional combined concept noise representation and the additional prompt noise representation (Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]"” in [0039]; “In some implementations, the proposed approach can be performed in two phases. In the first phase, a computing system can designate a set of dedicated text tokens (or handles), freeze the model weights, and optimize the handles to reconstruct the input image. In the second phase, the computing system can switch to fine-tuning the model weights, while continuing to optimize the handles” in [0023]; “in step 405, the system evaluates a loss function. The function includes a reconstruction loss term that generates a loss value based on a comparison of the input image and the synthetic image. As an example, in cases where a latent diffusion model is used, the reconstruction loss can be a latent diffusion loss that measures the difference between a predicted set of noise linked to the synthetic image and a set of noise intentionally added to the input image during the generation of the noised latent image” in [0053]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)… Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations” in [0075]; fine-tuning in [0018, 0028, 0037]. Xiao, Fig 2-3.)
As to Claim 8, Xiao in view of Avrahami teaches The method of claim 7, further comprises: identifying an inference-time text prompt comprising multiple text concepts from a client device; and generating, utilizing the diffusion neural network fine-tuned according to the concept-prompt noise representation measure of loss with the parameters modified, a digital image comprising the multiple text concepts (Avrahami discloses “The method includes initializing, by the computing system, a plurality of embeddings respectively for the plurality of visual concepts. The method includes, for each of one or more learning iterations: generating, by the computing system, a text prompt comprising one or more of the plurality of embeddings; processing, by the computing system, the text prompt with an image generation model to generate a synthetic image that depicts the visual concepts associated with the one or more embeddings included in the text prompt” in [0006]; “the goal is to extract a dedicated text token for each concept. This enables generation of novel images from textual prompts, featuring individual concepts or combinations of multiple concepts, as demonstrated in Figure 5” in [0019]. Xiao, Fig 3. See also Claim 1.)
Claim 16 recites similar limitations as claim 1 but in a computer-readable medium form. Therefore, the same rationale used for claim 1 is applied.
Claim 17 is rejected based upon similar rationale as Claim 3.
Claim 18 is rejected based upon similar rationale as Claim 3.
Claim 19 is rejected based upon similar rationale as Claims 4 & 5.
Claim 20 is rejected based upon similar rationale as Claim 15.
Claims 9-15 are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention) in view of Avrahami and Hu et al. (LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS, arXiv:2106.09685v2 [cs.CL] 16 Oct 2021).
Claim 9 recites similar limitations as claim 1 in a system form, and further recites a static diffusion neural network and a training diffusion neural network (Xiao teaches Stable Diffusion at p. 3). Hu further teaches wherein parameters of the static diffusion neural network remain fixed during the training of the training diffusion neural network (Hu discloses “We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights” in Abstract; “A pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can freeze the shared model and efficiently switch tasks by replacing the matrices A and B in Figure 1, reducing the storage requirement and task-switching overhead significantly” at p. 2.) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the invention of Xiao and Avrahami with the invention of Hu so as to greatly reduce the number of trainable parameters for downstream tasks.
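For illustration only, the frozen-weight arrangement quoted from Hu can be sketched as a layer whose pretrained matrix W is never modified while the low-rank matrices A and B are updated. This is a minimal sketch and not Hu's implementation; the helper names `matvec`, `lora_forward`, and `lora_sgd_step`, the squared-error loss, and all numeric values are assumptions introduced here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x):
    """Compute y = W x + B (A x), with W frozen and A, B trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


def lora_sgd_step(W, A, B, x, target, lr):
    """One gradient step on 0.5 * ||y - target||^2 that updates only A and B."""
    Ax = matvec(A, x)
    y = lora_forward(W, A, B, x)
    err = [yi - ti for yi, ti in zip(y, target)]
    rank = len(A)
    # dL/dB[i][k] = err[i] * (A x)[k];  dL/dA[k][j] = (B^T err)[k] * x[j]
    Bt_err = [sum(B[i][k] * err[i] for i in range(len(B))) for k in range(rank)]
    new_B = [[B[i][k] - lr * err[i] * Ax[k] for k in range(rank)]
             for i in range(len(B))]
    new_A = [[A[k][j] - lr * Bt_err[k] * x[j] for j in range(len(x))]
             for k in range(rank)]
    return new_A, new_B  # W is read but never written


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weights
A = [[1.0, 1.0]]              # rank-1 down-projection (trainable)
B = [[0.0], [0.0]]            # up-projection, zero-initialized (trainable)
x, target = [1.0, 2.0], [2.0, 2.0]

new_A, new_B = lora_sgd_step(W, A, B, x, target, lr=0.125)
print(W, new_A, new_B)
```

After the step, W is unchanged while B has moved away from zero, which reflects freezing the shared model and training only the added matrices.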
Claim 10 is rejected based upon similar rationale as Claim 2.
Claim 11 is rejected based upon similar rationale as Claim 2.
Claim 12 is rejected based upon similar rationale as Claim 7.
As to Claim 13, Xiao in view of Avrahami and Hu teaches The system of claim 9, wherein comparing the prompt noise representation and a combined concept noise representation from the first concept noise representation and the second concept noise representation comprises utilizing a loss function to determine a concept-prompt noise representation measure of loss to backpropagate through one or more denoising steps of the training diffusion neural network (Xiao discloses “denoising loss (Figure 3)” at p. 5 and cross-attention localization loss under section 5.4 Ablation Study. Avrahami discloses “The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)” in [0075], see also [0076].)
As to Claim 14, Xiao in view of Avrahami and Hu teaches The system of claim 9, wherein training the training diffusion neural network further comprises:
applying a stop gradient operation to the additional denoising step of the static diffusion neural network utilized to generate the first concept noise representation and the second concept noise representation, wherein the stop gradient operation controls a gradient flow by stopping a concept-prompt noise representation measure of loss from being backpropagated to more than the additional denoising step selected from a plurality of denoising steps (Avrahami discloses “The update could be performed using gradient-based optimization algorithms such as stochastic gradient descent (SGD), RMSprop, or Adam” in [0056]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)” in [0075].)
Claim 15 is rejected based upon similar rationale as Claims 1 & 8.
Conclusion
THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WEIMING HE whose telephone number is (571)270-1221. The examiner can normally be reached Monday-Friday, 8:30am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard can be reached on 571-272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Weiming He/
Primary Examiner, Art Unit 2611