Last updated: April 19, 2026

Application No. 18/457,895

UTILIZING INDIVIDUAL-CONCEPT TEXT-IMAGE ALIGNMENT TO ENHANCE COMPOSITIONAL CAPACITY OF TEXT-TO-IMAGE MODELS

Non-Final OA §103

Filed: Aug 29, 2023
Examiner: HE, WEIMING
Art Unit: 2611
Tech Center: 2600 (Communications)
Assignee: Adobe Inc.
OA Round: 3 (Non-Final)

Interview Optional: +13.8% interview lift (moderate); a written response may suffice.

Based on 410 resolved cases, 2023–2026

Examiner Intelligence

HE, WEIMING

Career Allow Rate: grants 46% of resolved cases (190 granted / 410 resolved); -15.7% vs TC avg

Interview Lift: +13.8% (moderate), comparing resolved cases without and with an interview

Typical timeline: 3y 4m average prosecution; 40 currently pending

Career history: 450 total applications across all art units

Statute-Specific Performance
§101: 7.4% (-32.6% vs TC avg)
§103: 59.2% (+19.2% vs TC avg)
§102: 12.4% (-27.6% vs TC avg)
§112: 15.0% (-25.0% vs TC avg)
TC avg = Tech Center average estimate. Based on career data from 410 resolved cases.

Office Action

§103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 12/17/2025 has been entered.

Examiner's Note
(1) When amending the claimed invention, Applicant is respectfully requested to indicate the portion(s) of the specification that dictate(s) the structure relied on for proper interpretation, and to verify and ascertain the metes and bounds of the claimed invention. This will assist in expediting compact prosecution. MPEP 714.02 recites: "Applicant should also specifically point out the support for any amendments made to the disclosure. See MPEP § 2163.06. An amendment which does not comply with the provisions of 37 CFR 1.121(b), (c), (d), and (h) may be held not fully responsive. See MPEP § 714." Amendments not pointing to specific support in the disclosure may be deemed not to comply with the provisions of 37 CFR 1.121(b), (c), (d), and (h) and therefore held not fully responsive. Generic statements such as "Applicants believe no new matter has been introduced" may be deemed insufficient.
(2) Examiner has cited particular columns, paragraphs, and line numbers in the references applied to the claims above for the convenience of the applicant. Although the specified citations are representative of the teachings of the art and are applied to specific limitations within the individual claim, other passages and figures may apply as well. In preparing responses, the applicant is respectfully requested to fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.

Response to Amendment
The amendment filed on 12/17/2025 has been entered and made of record. Claims 1-2, 4-5, 9-11, 13-14, 16 and 19-20 are amended. Claims 1-20 are pending.

Response to Arguments
Applicant’s arguments with respect to the rejections of independent claims 1, 9 and 16 have been fully considered but they are moot because the arguments do not apply to the references being used in the current rejection.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention) in view of Avrahami (WO 2024/243527 A1).
As to Claim 1, Xiao teaches A method comprising: 
 generating, utilizing a first denoising step of a diffusion neural network, a first noise representation (Xiao teaches a training stage in Fig 3 [figure: Xiao, Fig. 3, training stage]);
 generating, utilizing a second denoising step of the diffusion neural network, a prompt noise representation from the first noise representation by conditioning the second denoising step with text tokens of a first text concept and a second text concept of a text prompt; generating, utilizing the second denoising step of the diffusion neural network, a first concept noise representation for the second denoising step from the first noise representation (Xiao discloses an inference stage, as a second diffusion neural network, with delayed subject conditioning in Fig 3);
generating, utilizing the second denoising step of the diffusion neural network, a second concept noise representation for the second denoising step from the first noise representation by conditioning the second denoising step with an additional subset of the text tokens corresponding to the second text concept (Xiao discloses delayed subject conditioning on the input text prompt in Fig 3);
combining the first concept noise representation and the second concept noise representation to generate a combined concept noise representation for the second denoising step (Xiao discloses the generated image at the inference stage in Fig 3; see also section Text-Conditioning via Cross-Attention Mechanism at p. 4 and section 4.3 Delayed Subject Conditioning in Iterative Denoising at p. 6).
Xiao teaches a loss function without a detailed description. Avrahami, in combination with Xiao, further teaches the following limitations:
comparing the combined concept noise representation, generated from the first concept noise representation and the second concept noise representation for the second denoising step, with the prompt noise representation, generated from the text prompt also for the second denoising step, to determine a concept-prompt noise representation measure of loss; and modifying parameters of the second denoising step of the diffusion neural network according to the concept-prompt noise representation measure of loss (Xiao discloses “denoising loss (Figure 3)” at p. 5 and cross-attention localization loss under section 5.4 Ablation Study; “At inference time, a random noise z_T is sampled from N(0, 1) and iteratively denoised by the U-Net to the initial latent representation z_0” at p. 4. Avrahami further discloses “in step 405, the system evaluates a loss function. The function includes a reconstruction loss term that generates a loss value based on a comparison of the input image and the synthetic image. As an example, in cases where a latent diffusion model is used, the reconstruction loss can be a latent diffusion loss that measures the difference between a predicted set of noise linked to the synthetic image and a set of noise intentionally added to the input image during the generation of the noised latent image” in [0053]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)… Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations” in [0075]; see also an additional loss term in [0026], masked diffusion loss in [0040-0041], cross-attention loss in [0043] and Fig 2. Here, the loss functions can be used to calculate the difference between an input image and output image during neural network processing.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the invention of Xiao with the invention of Avrahami so as to calculate a reconstruction loss or any other loss function based on a comparison of the input image and the synthetic image.
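The concept-prompt comparison recited in claim 1 can be illustrated with a minimal Python sketch. It assumes that "combining" the concept noise representations means an element-wise mean and that the "measure of loss" is a mean squared error; neither choice is specified in the claim language quoted above. The names `toy_denoise`, `combine_concept_noise`, and `concept_prompt_loss` are hypothetical stand-ins, not Xiao's, Avrahami's, or the applicant's implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: (noisy latent, conditioning tokens) -> predicted noise.
Denoiser = Callable[[Vector, List[str]], Vector]


def combine_concept_noise(concept_preds: Sequence[Vector]) -> Vector:
    """Element-wise mean of per-concept noise predictions (assumed combination rule)."""
    if not concept_preds:
        raise ValueError("need at least one concept noise representation")
    n = len(concept_preds)
    return [sum(vals) / n for vals in zip(*concept_preds)]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error, used here as the assumed measure of loss."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def concept_prompt_loss(denoise: Denoiser, z_t: Vector, prompt_tokens: List[str],
                        concept_token_subsets: List[List[str]]) -> float:
    # Prompt noise representation: the step conditioned on the whole prompt.
    prompt_pred = denoise(z_t, prompt_tokens)
    # One concept noise representation per token subset, same step, same input z_t.
    concept_preds = [denoise(z_t, subset) for subset in concept_token_subsets]
    combined = combine_concept_noise(concept_preds)
    return mse(combined, prompt_pred)


def toy_denoise(z: Vector, tokens: List[str]) -> Vector:
    """Hypothetical stand-in for one conditioned denoising step."""
    bias = 0.1 * len(tokens)
    return [0.5 * zi + bias for zi in z]


loss = concept_prompt_loss(toy_denoise, [1.0, -1.0], ["cat", "dog"], [["cat"], ["dog"]])
print(round(loss, 6))  # 0.01
```

Because `concept_prompt_loss` accepts any number of token subsets, the same sketch covers a third concept noise representation (as in claim 3) by passing a third subset.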

As to Claim 2, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the prompt noise representation further comprises selecting the second denoising step of the diffusion neural network from a plurality of denoising steps to generate the prompt noise representation from the text prompt (Xiao, section Stable Diffusion at p. 3. Avrahami discloses “For instance, the image generation model could be… or a more complex latent diffusion model. The latter uses a series of noise-adding and denoising steps to generate the synthetic image. Thus, in some implementations, the image generation model can be a latent diffusion model that creates the synthetic image from a noised latent image” in [0052]; see also [0040, 0045].)
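Selecting one denoising step from a plurality of steps, and measuring the difference between predicted noise and noise intentionally added to the latent (Avrahami [0053]), can be sketched as the standard noise-prediction training objective. The uniform choice of step, the linear beta schedule and its endpoints, and the `predict_noise(z_t, t)` signature are all assumptions for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def linear_alpha_bars(num_steps: int, beta_start: float = 1e-4,
                      beta_end: float = 0.02) -> List[float]:
    """alpha_bar_t = prod_{s<=t} (1 - beta_s) for an assumed linear beta schedule."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def noise_prediction_loss(
    predict_noise: Callable[[List[float], int], List[float]],
    z0: Sequence[float],
    alpha_bars: Sequence[float],
    rng: random.Random,
) -> Tuple[int, float]:
    # Select one denoising step t from the plurality of steps (uniformly, by assumption).
    t = rng.randrange(len(alpha_bars))
    # Noise intentionally added to the clean latent.
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a = alpha_bars[t]
    # Forward-noised latent: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps.
    z_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(z0, eps)]
    eps_hat = predict_noise(z_t, t)
    # Mean squared difference between predicted and added noise.
    loss = sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(z0)
    return t, loss


alpha_bars = linear_alpha_bars(1000)


def oracle(z_t: List[float], t: int) -> List[float]:
    # Recovers eps exactly only because the example latent z0 is all zeros.
    return [v / math.sqrt(1.0 - alpha_bars[t]) for v in z_t]


t, loss = noise_prediction_loss(oracle, [0.0] * 4, alpha_bars, random.Random(0))
print(t, loss)
```

With an all-zero latent, the oracle predictor divides z_t by sqrt(1 - alpha_bar_t) and recovers the added noise, so the loss is zero up to floating-point rounding; a trained network would replace `oracle`.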

As to Claim 3, Xiao in view of Avrahami teaches The method of claim 1, further comprises: generating a third concept noise representation from a third text concept included within the text prompt; and combining the first concept noise representation, the second concept noise representation, and the third concept noise representation to generate the combined concept noise representation (Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]"” in [0039]. Here, multiple concepts within the text prompt can be extracted for individual noise representation. Xiao, Fig 3, 5, 7.)

As to Claim 4, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the prompt noise representation comprises conditioning the second denoising step of the diffusion neural network with the text tokens of the first text concept and the second text concept of the text prompt by providing the text tokens to attention mechanisms of the second denoising step, wherein the attention mechanisms focus on removing noise from portions of the first noise representation indicated by the text tokens (Xiao discloses “We use a vision encoder to derive this identity embedding from a referenced image, and then augment the generic text tokens with features from this identity embedding. This enables image generation based on subject-augmented conditioning… To tackle the multi-subject identity blending issue, we identify unregulated cross-attention as the primary reason (Figure 4). When the text includes two "person" tokens, each token’s attention map attends to both person in the image rather than linking each token to a distinct person in the image” at p. 2; “Figure 4: In the absence of cross-attention regularization (top), the diffusion model attends to multiple subjects’ input tokens and merge their identity. By applying cross-attention regularization (bottom), the diffusion model learns to focus on only one reference token while generating a subject. This ensures that the features of multiple subjects in the generated image are more separated” in Fig 4. Avrahami also discloses “In addition, (2) in order to avoid overfitting, the illustrated example uses a two-phase training regime, which starts by optimizing only the newly-added tokens…” in [0036].)

As to Claim 5, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the first concept noise representation and the second concept noise representation comprises: 
conditioning the second denoising step by guiding a removal of noise from the first noise representation according to the subset of the text tokens corresponding to the first text concept; and conditioning the second denoising step by guiding an additional removal of noise from the first noise representation according to the additional subset of the text tokens corresponding to the second text concept (Xiao discloses “Figure 4: In the absence of cross-attention regularization (top), the diffusion model attends to multiple subjects’ input tokens and merge their identity. By applying cross-attention regularization (bottom), the diffusion model learns to focus on only one reference token while generating a subject. This ensures that the features of multiple subjects in the generated image are more separated”; see also section 4.2 Localizing cross-attention maps with subject segmentation masks.)
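Claims 4 and 5 recite conditioning a denoising step by providing text tokens, or a subset of them, to attention mechanisms. One common way to restrict cross-attention to a subset of tokens is to set the excluded tokens' scores to negative infinity before the softmax. The minimal sketch below assumes that masking approach, a single query vector, and hypothetical function names; it is not Xiao's cross-attention localization, which instead supervises attention maps with subject segmentation masks.

```python
import math
from typing import List, Sequence, Set

Matrix = Sequence[Sequence[float]]


def masked_attention_weights(query: Sequence[float], token_keys: Matrix,
                             keep: Set[int]) -> List[float]:
    """Scaled dot-product attention weights; tokens not in `keep` get weight 0."""
    if not keep or not keep <= set(range(len(token_keys))):
        raise ValueError("keep must be a non-empty set of valid token indices")
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) * scale if i in keep else -math.inf
        for i, key in enumerate(token_keys)
    ]
    m = max(scores)  # finite, since at least one token is kept
    exps = [math.exp(s - m) for s in scores]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float], token_keys: Matrix, token_values: Matrix,
           keep: Set[int]) -> List[float]:
    """Attention output for one latent position: weighted sum of token value vectors."""
    weights = masked_attention_weights(query, token_keys, keep)
    dim = len(token_values[0])
    return [sum(w * v[j] for w, v in zip(weights, token_values)) for j in range(dim)]


# Two hypothetical tokens, e.g. "cat" (index 0) and "dog" (index 1).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
print(attend([0.0, 0.0], keys, values, {0, 1}))  # full prompt: [0.5, 0.5]
print(attend([0.0, 0.0], keys, values, {0}))     # first concept only: [1.0, 0.0]
```

Passing the full index set corresponds to the prompt-conditioned step, while passing one concept's indices corresponds to a concept-conditioned step on the same latent.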

As to Claim 6, Xiao in view of Avrahami teaches The method of claim 1, further comprises: selecting, an additional denoising step of the diffusion neural network from a plurality of denoising steps; and generating, utilizing the additional denoising step of the diffusion neural network, an additional prompt noise representation from an additional text prompt comprising a third text concept and a fourth text concept (Xiao teaches delayed subject conditioning in Fig 3. Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]"” in [0039]. Here, multiple concepts within the text prompt can be extracted for individual noise representation.)

As to Claim 7, Xiao in view of Avrahami teaches The method of claim 6, further comprises: generating, utilizing the additional denoising step of the diffusion neural network, a third concept noise representation and a fourth concept noise representation; generating an additional combined concept noise representation by combining the third concept noise representation and the fourth concept noise representation; and modifying parameters of the diffusion neural network by comparing the additional combined concept noise representation and the additional prompt noise representation (Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]"” in [0039]; “In some implementations, the proposed approach can be performed in two phases. In the first phase, a computing system can designate a set of dedicated text tokens (or handles), freeze the model weights, and optimize the handles to reconstruct the input image. In the second phase, the computing system can switch to fine-tuning the model weights, while continuing to optimize the handles” in [0023]; “in step 405, the system evaluates a loss function. The function includes a reconstruction loss term that generates a loss value based on a comparison of the input image and the synthetic image. As an example, in cases where a latent diffusion model is used, the reconstruction loss can be a latent diffusion loss that measures the difference between a predicted set of noise linked to the synthetic image and a set of noise intentionally added to the input image during the generation of the noised latent image” in [0053]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)… Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations” in [0075]; fine-tuning in [0018, 0028, 0037]. Xiao, Fig 2-3.)

As to Claim 8, Xiao in view of Avrahami teaches The method of claim 7, further comprises: identifying a text prompt comprising multiple text concepts from a client device; and generating, utilizing the diffusion neural network with the parameters modified, a digital image comprising the multiple text concepts (Avrahami discloses “The method includes initializing, by the computing system, a plurality of embeddings respectively for the plurality of visual concepts. The method includes, for each of one or more learning iterations: generating, by the computing system, a text prompt comprising one or more of the plurality of embeddings; processing, by the computing system, the text prompt with an image generation model to generate a synthetic image that depicts the visual concepts associated with the one or more embeddings included in the text prompt” in [0006]; “the goal is to extract a dedicated text token for each concept. This enables generation of novel images from textual prompts, featuring individual concepts or combinations of multiple concepts, as demonstrated in Figure 5” in [0019]. Xiao, Fig 3.)

Claim 9 recites limitations similar to those of claim 1 in system form and further recites a static diffusion neural network and a training diffusion neural network (Xiao teaches Stable Diffusion at p. 3). Therefore, the same rationale used for claim 1 is applied.
Claim 10 is rejected based upon similar rationale as Claim 2.
Claim 11 is rejected based upon similar rationale as Claim 2.
Claim 12 is rejected based upon similar rationale as Claim 7.

As to Claim 13, Xiao in view of Avrahami teaches The system of claim 9, wherein comparing the prompt noise representation and a combined concept noise representation from the first concept noise representation, and the second concept noise representation comprises utilizing a loss function to determine a concept-prompt noise representation measure of loss to backpropagate through one or more denoising steps of the training diffusion neural network (Xiao discloses “denoising loss (Figure 3)” at p. 5 and cross-attention localization loss under section 5.4 Ablation Study. Avrahami discloses “The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)” in [0075]; see also [0076].)

As to Claim 14, Xiao in view of Avrahami teaches The system of claim 9, wherein training the training diffusion neural network further comprise:
applying a stop gradient operation to the additional denoising step of the static diffusion neural network utilized to generate the first concept noise representation and the second concept noise representation, wherein the stop gradient operation controls a gradient flow by stopping a concept-prompt noise representation measure of loss from being backpropagated to more than the additional denoising step selected from a plurality of denoising steps (Avrahami discloses “The update could be performed using gradient-based optimization algorithms such as stochastic gradient descent (SGD), RMSprop, or Adam” in [0056]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)” in [0075].)
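The effect of a stop-gradient operation on gradient flow, as recited in claim 14, can be shown with a hand-differentiated scalar example. The sketch assumes a two-stage chain in which an earlier step (hypothetical parameter `a`) produces the latent and the selected step (hypothetical parameter `w`) predicts noise, with a constant `target` standing in for the static network's concept noise representation. In an autodiff framework the same effect is typically obtained with `torch.Tensor.detach()` or `jax.lax.stop_gradient`; this sketch only illustrates the chain rule and is not the applicant's training procedure.

```python
from typing import Tuple


def step_gradients(a: float, w: float, z0: float, target: float,
                   stop_gradient: bool) -> Tuple[float, float]:
    """Return (dL/da, dL/dw) for L = (w * (a * z0) - target) ** 2.

    `target` plays the role of a static-network output and is always a constant.
    With stop_gradient=True the latent a * z0 is treated as a constant, so no
    gradient reaches the earlier step's parameter `a`.
    """
    z1 = a * z0            # output of the earlier (non-selected) step
    pred = w * z1          # output of the selected denoising step
    dl_dpred = 2.0 * (pred - target)
    dl_dw = dl_dpred * z1
    dl_da = 0.0 if stop_gradient else dl_dpred * w * z0
    return dl_da, dl_dw


def sgd_update(a: float, w: float, grads: Tuple[float, float],
               lr: float = 0.1) -> Tuple[float, float]:
    """One plain gradient-descent step on both parameters."""
    dl_da, dl_dw = grads
    return a - lr * dl_da, w - lr * dl_dw


grads = step_gradients(a=2.0, w=0.5, z0=1.0, target=0.0, stop_gradient=True)
print(grads)                        # (0.0, 4.0)
print(sgd_update(2.0, 0.5, grads))  # a is unchanged; only w moves
```

Setting `stop_gradient=False` lets the same loss also update `a`, which is the gradient flow that the recited stop-gradient operation prevents.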

Claim 15 is rejected based upon similar rationale as Claims 1 & 8.
Claim 16 recites limitations similar to those of claim 1, but in computer-readable medium form. Therefore, the same rationale used for claim 1 is applied.
Claim 17 is rejected based upon similar rationale as Claim 3.
Claim 18 is rejected based upon similar rationale as Claim 3.
Claim 19 is rejected based upon similar rationale as Claims 4 & 5.
Claim 20 is rejected based upon similar rationale as Claim 15.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WEIMING HE whose telephone number is (571)270-1221.  The examiner can normally be reached on Monday-Friday, 8:30am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard, can be reached on 571-272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/WEIMING HE/
Primary Examiner, Art Unit 2611


Prosecution Timeline
Aug 29, 2023: Application Filed
Aug 01, 2025: Non-Final Rejection (§103)
Sep 02, 2025: Interview Requested
Sep 09, 2025: Examiner Interview Summary
Sep 09, 2025: Applicant Interview (Telephonic)
Sep 19, 2025: Response Filed
Oct 14, 2025: Final Rejection (§103)
Dec 17, 2025: Request for Continued Examination
Jan 15, 2026: Response after Non-Final Action
Feb 11, 2026: Non-Final Rejection (§103)
Mar 20, 2026: Interview Requested
Apr 01, 2026: Applicant Interview (Telephonic)
Apr 05, 2026: Examiner Interview Summary

Precedent Cases
Applications granted by this same examiner with similar technology:
18/580,103 (Patent 12567135): MULTIMEDIA PLAYBACK MONITORING SYSTEM AND METHOD, AND ELECTRONIC APPARATUS. 2y 5m to grant; granted Mar 03, 2026.
18/059,377 (Patent 12561876): System and method for an audio-visual avatar creation. 2y 5m to grant; granted Feb 24, 2026.
18/513,815 (Patent 12514672): System, Method And Software Program For Aiding In Positioning Of Objects In A Surgical Environment. 2y 5m to grant; granted Jan 06, 2026.
18/001,120 (Patent 12494003): AUTOMATIC LAYER FLATTENING WITH REAL-TIME VISUAL DEPICTION. 2y 5m to grant; granted Dec 09, 2025.
16/532,321 (Patent 12468949): SYSTEMS AND METHODS FOR FEW-SHOT TRANSFER LEARNING. 2y 5m to grant; granted Nov 11, 2025.

Based on the 5 most recent grants.


Prosecution Projections
Expected OA Rounds: 3-4
Grant Probability: 46%
With Interview (+13.8%): 60%
Median Time to Grant: 3y 4m
PTA Risk: High

Based on 410 resolved cases by this examiner. Grant probability derived from career allow rate.
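The headline projections are consistent with simple arithmetic on the examiner figures reported on this page. The sketch assumes that the interview lift is added in percentage points to the career allow rate and that every application counted in the career history is either resolved or currently pending; the page does not state its exact method.

```python
granted, resolved, pending = 190, 410, 40
interview_lift = 0.138  # +13.8 percentage points, as reported

allow_rate = granted / resolved               # career allow rate
with_interview = allow_rate + interview_lift  # assumed additive lift
total_applications = resolved + pending       # assumed: resolved plus pending

print(f"{allow_rate:.0%}")      # 46%
print(f"{with_interview:.0%}")  # 60%
print(total_applications)       # 450
```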