Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/17/2025 has been entered.
Examiner's Note
(1) In the case of amending the claimed invention, Applicant is respectfully requested to indicate the portion(s) of the specification which dictate(s) the structure relied on for proper interpretation, and also to verify and ascertain the metes and bounds of the claimed invention. This will assist in expediting compact prosecution. MPEP 714.02 recites: "Applicant should also specifically point out the support for any amendments made to the disclosure. See MPEP § 2163.06. An amendment which does not comply with the provisions of 37 CFR 1.121(b), (c), (d), and (h) may be held not fully responsive. See MPEP § 714." Amendments not pointing to specific support in the disclosure may be deemed as not complying with the provisions of 37 CFR 1.121(b), (c), (d), and (h), and therefore held not fully responsive. Generic statements such as "Applicants believe no new matter has been introduced" may be deemed insufficient.
(2) Examiner has cited particular columns/paragraphs and line numbers in the references applied to the claims above for the convenience of the applicant. Although the specified citations are representative of the teachings of the art and are applied to specific limitations within the individual claims, other passages and figures may apply as well. In preparing responses, Applicant is respectfully requested to fully consider each reference in its entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.
Response to Amendment
The amendment filed on 12/17/2025 has been entered and made of record. Claims 1-2, 4-5, 9-11, 13-14, 16 and 19-20 are amended. Claims 1-20 are pending.
Response to Arguments
Applicant’s arguments with respect to the rejections of independent claims 1, 9 and 16 have been fully considered but they are moot because the arguments do not apply to the references being used in the current rejection.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention) in view of Avrahami (WO 2024/243527 A1).
As to Claim 1, Xiao teaches A method comprising:
generating, utilizing a first denoising step of a diffusion neural network, a first noise representation (Xiao teaches a training stage in Fig 3, reproduced below:
[media_image1.png: Xiao, Fig 3 (greyscale)]
);
generating, utilizing a second denoising step of the diffusion neural network, a prompt noise representation from the first noise representation by conditioning the second denoising step with text tokens of a first text concept and a second text concept of a text prompt; generating, utilizing the second denoising step of the diffusion neural network, a first concept noise representation for the second denoising step from the first noise representation (Xiao discloses inference stage as a second diffusion neural network with delayed subject conditioning in Fig 3);
generating, utilizing the second denoising step of the diffusion neural network, a second concept noise representation for the second denoising step from the first noise representation by conditioning the second denoising step with an additional subset of the text tokens corresponding to the second text concept (Xiao discloses delayed subject conditioning on the input text prompt in Fig 3);
combining the first concept noise representation and the second concept noise representation to generate a combined concept noise representation for the second denoising step (Xiao discloses generated image at inference stage in Fig 3, see also section Text-Conditioning via Cross-Attention Mechanism at p. 4 and section 4.3 Delayed Subject Conditioning in Iterative Denoising at p. 6).
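For illustration only, and not as part of the claim mapping above, the recited combining of two per-concept noise predictions at a single denoising step may be sketched as a simple weighted composition of denoiser outputs; the function and variable names below are hypothetical:

```python
import numpy as np

def combine_concept_noise(eps_concept_1, eps_concept_2, weights=(0.5, 0.5)):
    """Combine two per-concept noise predictions for the same denoising
    step into a single combined concept noise representation."""
    w1, w2 = weights
    return w1 * eps_concept_1 + w2 * eps_concept_2

rng = np.random.default_rng(1)
eps_1 = rng.normal(size=(8, 8))  # noise predicted when conditioning on concept-1 tokens
eps_2 = rng.normal(size=(8, 8))  # noise predicted when conditioning on concept-2 tokens
eps_combined = combine_concept_noise(eps_1, eps_2)
```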
Xiao teaches a loss function without a detailed description. Avrahami, in combination, further teaches the following limitations:
comparing the combined concept noise representation, generated from the first concept noise representation and the second concept noise representation for the second denoising step, with the prompt noise representation, generated from the text prompt also for the second denoising step, to determine a concept-prompt noise representation measure of loss; and modifying parameters of the second denoising step of the diffusion neural network according to the concept-prompt noise representation measure of loss (Xiao discloses “denoising loss (Figure 3)” at p. 5 and cross-attention localization loss under section 5.4 Ablation Study; “At inference time, a random noise zT is sampled from N(0, 1) and iteratively denoised by the U-Net to the initial latent representation z0” at p. 4. Avrahami further discloses “in step 405, the system evaluates a loss function. The function includes a reconstruction loss term that generates a loss value based on a comparison of the input image and the synthetic image. As an example, in cases where a latent diffusion model is used, the reconstruction loss can be a latent diffusion loss that measures the difference between a predicted set of noise linked to the synthetic image and a set of noise intentionally added to the input image during the generation of the noised latent image” in [0053]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)… Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations” in [0075]; see also the additional loss term in [0026], masked diffusion loss in [0040-0041], cross-attention loss in [0043] and Fig 2. Here, the loss functions can be used to calculate the difference between an input image and an output image during neural network processing.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the invention of Xiao with the invention of Avrahami so as to calculate a reconstruction loss, or another loss function, based on a comparison of the input image and the synthetic image.
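The latent diffusion reconstruction loss that Avrahami describes in [0053] and [0075] — a comparison between the noise predicted for the synthetic image and the noise added to the input latent, followed by a gradient-based parameter update — may be sketched as follows. This is an illustrative toy, not the reference's implementation: a scalar parameter stands in for the U-Net, and all names are hypothetical.

```python
import numpy as np

def reconstruction_loss(predicted_noise, true_noise):
    """Mean-squared error between the noise predicted for the synthetic
    image and the noise intentionally added to the input latent."""
    return float(np.mean((predicted_noise - true_noise) ** 2))

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 4))    # clean latent representation
eps = rng.normal(size=(4, 4))   # noise added during training
z_t = z0 + eps                  # noised latent image
w = 0.5                         # toy "denoiser" parameter (stand-in for U-Net weights)
eps_hat = w * z_t               # predicted noise

loss = reconstruction_loss(eps_hat, eps)
# Gradient-descent update, in the spirit of Avrahami [0075]:
grad_w = float(np.mean(2 * (eps_hat - eps) * z_t))
w = w - 0.1 * grad_w
```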
As to Claim 2, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the prompt noise representation further comprises selecting the second denoising step of the diffusion neural network from a plurality of denoising steps to generate the prompt noise representation from the text prompt (Xiao, section Stable Diffusion at p. 3. Avrahami discloses “For instance, the image generation model could be… or a more complex latent diffusion model. The latter uses a series of noise-adding and denoising steps to generate the synthetic image. Thus, in some implementations, the image generation model can be a latent diffusion model that creates the synthetic image from a noised latent image” in [0052]; see also [0040, 0045].)
As to Claim 3, Xiao in view of Avrahami teaches The method of claim 1, further comprises: generating a third concept noise representation from a third text concept included within the text prompt; and combining the first concept noise representation, the second concept noise representation, and the third concept noise representation to generate the combined concept noise representation (Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]"” in [0039]. Here, multiple concepts within the text prompt can be extracted for individual noise representation. Xiao, Fig 3, 5, 7.)
As to Claim 4, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the prompt noise representation comprises conditioning the second denoising step of the diffusion neural network with the text tokens of the first text concept and the second text concept of the text prompt by providing the text tokens to attention mechanisms of the second denoising step, wherein the attention mechanisms focus on removing noise from portions of the first noise representation indicated by the text tokens (Xiao discloses “We use a vision encoder to derive this identity embedding from a referenced image, and then augment the generic text tokens with features from this identity embedding. This enables image generation based on subject-augmented conditioning… To tackle the multi-subject identity blending issue, we identify unregulated cross-attention as the primary reason (Figure 4). When the text includes two "person" tokens, each token’s attention map attends to both person in the image rather than linking each token to a distinct person in the image” at p. 2; “Figure 4: In the absence of cross-attention regularization (top), the diffusion model attends to multiple subjects’ input tokens and merge their identity. By applying cross-attention regularization (bottom), the diffusion model learns to focus on only one reference token while generating a subject. This ensures that the features of multiple subjects in the generated image are more separated” in Fig 4. Avrahami also discloses “In addition, (2) in order to avoid overfitting, the illustrated example uses a two-phase training regime, which starts by optimizing only the newly-added tokens…” in [0036].)
As to Claim 5, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the first concept noise representation and the second concept noise representation comprises:
conditioning the second denoising step by guiding a removal of noise from the first noise representation according to the subset of the text tokens corresponding to the first text concept; and conditioning the second denoising step by guiding an additional removal of noise from the first noise representation according to the additional subset of the text tokens corresponding to the second text concept (Xiao discloses “Figure 4: In the absence of cross-attention regularization (top), the diffusion model attends to multiple subjects’ input tokens and merge their identity. By applying cross-attention regularization (bottom), the diffusion model learns to focus on only one reference token while generating a subject. This ensures that the features of multiple subjects in the generated image are more separated”; see also section 4.2 Localizing cross-attention maps with subject segmentation masks.)
As to Claim 6, Xiao in view of Avrahami teaches The method of claim 1, further comprises: selecting, an additional denoising step of the diffusion neural network from a plurality of denoising steps; and generating, utilizing the additional denoising step of the diffusion neural network, an additional prompt noise representation from an additional text prompt comprising a third text concept and a fourth text concept (Xiao teaches a delayed subject conditioning in Fig 3. Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]" in [0039]. Here, multiple concepts within the text prompt can be extracted for individual noise representation.)
As to Claim 7, Xiao in view of Avrahami teaches The method of claim 6, further comprises: generating, utilizing the additional denoising step of the diffusion neural network, a third concept noise representation and a fourth concept noise representation; generating an additional combined concept noise representation by combining the third concept noise representation and the fourth concept noise representation; and modifying parameters of the diffusion neural network by comparing the additional combined concept noise representation and the additional prompt noise representation (Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]" in [0039]; “In some implementations, the proposed approach can be performed in two phases. In the first phase, a computing system can designate a set of dedicated text tokens (or handles), freeze the model weights, and optimize the handles to reconstruct the input image. In the second phase, the computing system can switch to fine-tuning the model weights, while continuing to optimize the handles” in [0023]; “in step 405, the system evaluates a loss function. The function includes a reconstruction loss term that generates a loss value based on a comparison of the input image and the synthetic image. 
As an example, in cases where a latent diffusion model is used, the reconstruction loss can be a latent diffusion loss that measures the difference between a predicted set of noise linked to the synthetic image and a set of noise intentionally added to the input image during the generation of the noised latent image” in [0053]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)… Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations” in [0075]; fine-tuning in [0018, 0028, 0037]. Xiao, Fig 2-3)
As to Claim 8, Xiao in view of Avrahami teaches The method of claim 7, further comprises: identifying a text prompt comprising multiple text concepts from a client device; and generating, utilizing the diffusion neural network with the parameters modified, a digital image comprising the multiple text concepts (Avrahami discloses “The method includes initializing, by the computing system, a plurality of embeddings respectively for the plurality of visual concepts. The method includes, for each of one or more learning iterations: generating, by the computing system, a text prompt comprising one or more of the plurality of embeddings; processing, by the computing system, the text prompt with an image generation model to generate a synthetic image that depicts the visual concepts associated with the one or more embeddings included in the text prompt” in [0006]; “the goal is to extract a dedicated text token for each concept. This enables generation of novel images from textual prompts, featuring individual concepts or combinations of multiple concepts, as demonstrated in Figure 5” in [0019]. Xiao, Fig 3.)
Claim 9 recites similar limitations as claim 1 in a system form, and further recites a static and a training diffusion neural network (Xiao teaches Stable Diffusion at p. 3). Therefore, the same rationale used for claim 1 is applied.
Claim 10 is rejected based upon similar rationale as Claim 2.
Claim 11 is rejected based upon similar rationale as Claim 2.
Claim 12 is rejected based upon similar rationale as Claim 7.
As to Claim 13, Xiao in view of Avrahami teaches The system of claim 9, wherein comparing the prompt noise representation and a combined concept noise representation from the first concept noise representation, and the second concept noise representation comprises utilizing a loss function to determine a concept-prompt noise representation measure of loss to backpropagate through one or more denoising steps of the training diffusion neural network (Xiao discloses “denoising loss (Figure 3)” at p. 5 and cross-attention localization loss under section 5.4 Ablation Study. Avrahami discloses “The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example. backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)” in [0075], see also [0076].)
As to Claim 14, Xiao in view of Avrahami teaches The system of claim 9, wherein training the training diffusion neural network further comprises:
applying a stop gradient operation to the additional denoising step of the static diffusion neural network utilized to generate the first concept noise representation and the second concept noise representation, wherein the stop gradient operation controls a gradient flow by stopping a concept-prompt noise representation measure of loss from being backpropagated to more than the additional denoising step selected from a plurality of denoising steps (Avrahami discloses “The update could be performed using gradient-based optimization algorithms such as stochastic gradient descent (SGD), RMSprop, or Adam” in [0056]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)” in [0075].)
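For illustration only, the stop gradient operation recited in claim 14 — passing a value forward while blocking the measure of loss from backpropagating past the selected denoising step — may be sketched with a manual backward pass. The two-step "pipeline" below is a toy stand-in for the denoising steps, and all names are hypothetical:

```python
import numpy as np

def stop_gradient(x):
    """Treat x as a constant during backpropagation: its value passes
    through unchanged, but no gradient flows to the step that produced it."""
    return np.copy(x)  # detached copy; upstream parameters receive no update

# Toy two-step pipeline: an earlier step produces a representation,
# and only the selected step's parameter should be trained.
a_param, b_param = 1.5, 0.7
x = np.array([1.0, 2.0])
h = a_param * x          # output of the earlier (frozen) step
h = stop_gradient(h)     # gradient flow is cut here
y = b_param * h          # output of the selected denoising step
loss = float(np.mean(y ** 2))

# Manual backprop: only b_param is updated; a_param receives zero
# gradient because the chain was cut at stop_gradient.
grad_b = float(np.mean(2 * y * h))
b_param -= 0.1 * grad_b
grad_a = 0.0             # blocked by the stop-gradient
```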
Claim 15 is rejected based upon similar rationale as Claims 1 & 8.
Claim 16 recites similar limitations as claim 1 but in a computer-readable medium form. Therefore, the same rationale used for claim 1 is applied.
Claim 17 is rejected based upon similar rationale as Claim 3.
Claim 18 is rejected based upon similar rationale as Claim 3.
Claim 19 is rejected based upon similar rationale as Claims 4 & 5.
Claim 20 is rejected based upon similar rationale as Claim 15.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WEIMING HE whose telephone number is (571)270-1221. The examiner can normally be reached on Monday-Friday, 8:30am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard can be reached on 571-272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/WEIMING HE/
Primary Examiner, Art Unit 2611