Prosecution Insights
Last updated: May 29, 2026
Application No. 18/457,895

UTILIZING INDIVIDUAL-CONCEPT TEXT-IMAGE ALIGNMENT TO ENHANCE COMPOSITIONAL CAPACITY OF TEXT-TO-IMAGE MODELS

Final Rejection §103 §112
Filed
Aug 29, 2023
Examiner
HE, WEIMING
Art Unit
2611
Tech Center
2600 — Communications
Assignee
Adobe Inc.
OA Round
4 (Final)
Grant Probability
46% (Moderate)
OA Rounds
5-6
Est. Remaining
8m
With Interview
59%

Examiner Intelligence

Grants 46% of resolved cases
Career Allowance Rate
46% (191 granted / 414 resolved, -15.9% vs TC avg)
Interview Lift
+13.0% (moderate lift, resolved cases with interview)
Avg Prosecution
3y 4m (25 currently pending)
Total Applications
453 (across all art units)

Statute-Specific Performance

§101
0.9%
-39.1% vs TC avg
§103
93.4%
+53.4% vs TC avg
§102
3.2%
-36.8% vs TC avg
§112
1.9%
-38.1% vs TC avg
Based on career data from 414 resolved cases

Office Action

§103 §112
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
The amendment filed on 4/27/2026 has been entered and made of record. Claims 1-2, 8-9, 15-16 and 20 are amended. Claims 1-20 are pending.

Response to Arguments
Applicant’s arguments with respect to the rejections of independent claims 1, 9 and 16 have been fully considered but they are not persuasive.
Applicant asserts that the combination of cited art fails to teach or suggest "generating, utilizing a second denoising step of the diffusion neural network, a prompt noise representation from the first noise representation by conditioning the second denoising step with text tokens of a first text concept and a second text concept of a text prompt ... " (p. 15 of Remarks). Examiner notices that Xiao teaches an augmented conditioning with two or more text tokens in Fig 3; “To address this, we introduce delayed subject conditioning, preserving the subject’s identity while following text instructions. It employs text-only conditioning in the early denoising stage to generate the image layout, followed by subject-augmented conditioning in the remaining denoising steps to refine the subject appearance. This simple technique effectively preserves subject identity without sacrificing editability (Figure 5)” at p. 2-3. Here, the remaining denoising steps may refer to a second denoising step. Therefore, Xiao teaches the above-argued limitations.
Applicant also alleges that the combination of art fails to teach or suggest "combining the first concept noise representation and the second concept noise representation to generate a combined concept noise representation for the second denoising step" (p. 18 of Remarks). Examiner notices that Xiao teaches delayed subject conditioning as shown in Fig 3. Here, each text token in the text prompt may generate a corresponding noise representation and combine with the previous noise representation to generate a combined noise representation, such as two men in a park in this example.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.--The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for pre-AIA the inventor(s), at the time the application was filed, had possession of the claimed invention.
Independent claims 1 and 16 recite the limitation “generating, utilizing the second denoising step of the diffusion neural network, a second concept noise representation for the second denoising step from the first noise representation by conditioning the second denoising step with an additional subset of the text tokens corresponding to the second text concept, wherein the prompt noise representation, the first concept noise representation, and the second concept noise representation are generated at the second denoising step”. Examiner notices that applicant discloses “As shown, FIG. 4 illustrates an initial noise representation 400, a denoising neural network generating a first noise representation 404 corresponding with a first denoising step 402. Additionally, FIG. 4 also shows a second noise representation 408 from a second denoising step 406. Moreover, FIG. 4 shows the text-to-image enhancement system 102 conditioning a third denoising step 410…” in [0068]. Here, the disclosure neither conditions the second denoising step nor shows the prompt noise representation, the first concept noise representation, and the second concept noise representation being generated at the second denoising step; see also Fig 4 as shown below. Claims 2-8 and 17-20 are dependent claims and are rejected under the same rationale. For the purpose of the prosecution of the application, the claim language “the second denoising step” has no weight on the patentable subject matter.
[Figure: FIG. 4 of the application, media_image1.png]
Independent claim 9 recites “wherein parameters of the static diffusion neural network remain fixed during the training of the training diffusion neural network”. Applicant fails to provide the description in the specification to support this new amendment. Examiner notices that applicant discloses “In particular, the static diffusion neural network includes a diffusion neural network where the text-to-image enhancement system 102 does not modify parameters in response to determining a measure of loss.” in [0066]. Here, the parameters are frozen in response to determining a measure of loss, which does not mean that all the parameters remain fixed during the training of the training diffusion neural network. Claims 10-15 depend on claim 9 and are rejected under the same rationale.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-8 and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention) in view of Avrahami (WO 2024/243527 A1).
As to Claim 1, Xiao teaches A method comprising:
generating, utilizing a first denoising step of a diffusion neural network, a first noise representation (Xiao teaches a training stage in Fig 3 as shown below: [Figure: Xiao, Fig 3, media_image2.png]);
generating, utilizing a second denoising step of the diffusion neural network, a prompt noise representation from the first noise representation by conditioning the second denoising step with text tokens of a first text concept and a second text concept of a text prompt;
generating, utilizing the second denoising step of the diffusion neural network, a first concept noise representation for the second denoising step from the first noise representation by conditioning the second denoising step with a subset of the text tokens corresponding to the first text concept (Xiao discloses inference stage as a second diffusion neural network with delayed subject conditioning in Fig 3);
generating, utilizing the second denoising step of the diffusion neural network, a second concept noise representation for the second denoising step from the first noise representation by conditioning the second denoising step with an additional subset of the text tokens corresponding to the second text concept, wherein the prompt noise representation, the first concept noise representation, and the second concept noise representation are generated at the second denoising step (Xiao discloses delayed subject conditioning on the input text prompt in Fig 3; “To address this, we introduce delayed subject conditioning, preserving the subject’s identity while following text instructions. It employs text-only conditioning in the early denoising stage to generate the image layout, followed by subject-augmented conditioning in the remaining denoising steps to refine the subject appearance. This simple technique effectively preserves subject identity without sacrificing editability (Figure 5)” at p. 2-3.);
combining the first concept noise representation and the second concept noise representation to generate a combined concept noise representation for the second denoising step (Xiao discloses generated image at inference stage in Fig 3, for example, a man and a man sitting in a park, see also section Text-Conditioning via Cross-Attention Mechanism at p. 4 and section 4.3 Delayed Subject Conditioning in Iterative Denoising at p. 6).
Xiao teaches a loss function without a detailed description. Avrahami, in combination, further teaches the following limitations:
comparing the combined concept noise representation, generated from the first concept noise representation and the second concept noise representation for the second denoising step, with the prompt noise representation, generated from the text prompt also for the second denoising step, to determine a concept-prompt noise representation measure of loss; and
modifying parameters of the second denoising step of the diffusion neural network according to the concept-prompt noise representation measure of loss (Xiao discloses “denoising loss (Figure 3)” at p. 5 and cross-attention localization loss under section 5.4 Ablation Study; “At inference time, a random noise zT is sampled from N(0, 1) and iteratively denoised by the U-Net to the initial latent representation z0” at p. 4. Avrahami further discloses “in step 405, the system evaluates a loss function. The function includes a reconstruction loss term that generates a loss value based on a comparison of the input image and the synthetic image. As an example, in cases where a latent diffusion model is used, the reconstruction loss can be a latent diffusion loss that measures the difference between a predicted set of noise linked to the synthetic image and a set of noise intentionally added to the input image during the generation of the noised latent image” in [0053]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)… Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations” in [0075]; see also additional loss term in [0026], masked diffusion loss in [0040-0041], cross-attention loss in [0043] and Fig 2. Here, the loss functions can be used to calculate the difference between an input image and output image during a neural network processing.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the invention of Xiao with the invention of Avrahami so as to calculate a reconstruction loss or any other loss functions based on a comparison of the input image and the synthetic image.
As to Claim 2, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the prompt noise representation further comprises:
selecting the second denoising step of the diffusion neural network from a plurality of denoising steps to generate the prompt noise representation from the text prompt (Xiao, section Stable Diffusion at p. 3. Avrahami discloses “For instance, the image generation model could be… or a more complex latent diffusion model. The latter uses a series of noise-adding and denoising steps to generate the synthetic image. Thus, in some implementations, the image generation model can be a latent diffusion model that creates the synthetic image from a noised latent image” in [0052]; see also [0040, 0045].);
applying a stop gradient operation to the second denoising step of the static diffusion neural network utilized to generate the first concept noise representation and the second concept noise representation, wherein the stop gradient operation controls a gradient flow by stopping a concept-prompt noise representation measure of loss from being backpropagated to more than the second denoising step selected from the plurality of denoising steps (Avrahami discloses “The update could be performed using gradient-based optimization algorithms such as stochastic gradient descent (SGD), RMSprop, or Adam” in [0056]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)” in [0075].)
As to Claim 3, Xiao in view of Avrahami teaches The method of claim 1, further comprises: generating a third concept noise representation from a third text concept included within the text prompt; and combining the first concept noise representation, the second concept noise representation, and the third concept noise representation to generate the combined concept noise representation (Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]"” in [0039]. Here, multiple concepts within the text prompt can be extracted for individual noise representation. Xiao, Fig 3, 5, 7.)
As to Claim 4, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the prompt noise representation comprises conditioning the second denoising step of the diffusion neural network with the text tokens of the first text concept and the second text concept of the text prompt by providing the text tokens to attention mechanisms of the second denoising step, wherein the attention mechanisms focus on removing noise from portions of the first noise representation indicated by the text tokens (Xiao discloses “We use a vision encoder to derive this identity embedding from a referenced image, and then augment the generic text tokens with features from this identity embedding. This enables image generation based on subject-augmented conditioning… To tackle the multi-subject identity blending issue, we identify unregulated cross-attention as the primary reason (Figure 4). When the text includes two "person" tokens, each token’s attention map attends to both person in the image rather than linking each token to a distinct person in the image” at p. 2; “Figure 4: In the absence of cross-attention regularization (top), the diffusion model attends to multiple subjects’ input tokens and merge their identity. By applying cross-attention regularization (bottom), the diffusion model learns to focus on only one reference token while generating a subject. This ensures that the features of multiple subjects in the generated image are more separated” in Fig 4. Avrahami also discloses “In addition, (2) in order to avoid overfitting, the illustrated example uses a two-phase training regime, which starts by optimizing only the newly-added tokens…” in [0036].)
As to Claim 5, Xiao in view of Avrahami teaches The method of claim 1, wherein generating the first concept noise representation and the second concept noise representation comprises: conditioning the second denoising step by guiding a removal of noise from the first noise representation according to the subset of the text tokens corresponding to the first text concept; and conditioning the second denoising step by guiding an additional removal of noise from the first noise representation according to the additional subset of the text tokens corresponding to the second text concept (Xiao discloses “Figure 4: In the absence of cross-attention regularization (top), the diffusion model attends to multiple subjects’ input tokens and merge their identity. By applying cross-attention regularization (bottom), the diffusion model learns to focus on only one reference token while generating a subject. This ensures that the features of multiple subjects in the generated image are more separated”; see also section 4.2 Localizing cross-attention maps with subject segmentation masks.)
As to Claim 6, Xiao in view of Avrahami teaches The method of claim 1, further comprises: selecting, an additional denoising step of the diffusion neural network from a plurality of denoising steps; and generating, utilizing the additional denoising step of the diffusion neural network, an additional prompt noise representation from an additional text prompt comprising a third text concept and a fourth text concept (Xiao teaches a delayed subject conditioning in Fig 3. Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]"” in [0039]. Here, multiple concepts within the text prompt can be extracted for individual noise representation.)
As to Claim 7, Xiao in view of Avrahami teaches The method of claim 6, further comprises: generating, utilizing the additional denoising step of the diffusion neural network, a third concept noise representation and a fourth concept noise representation; generating an additional combined concept noise representation by combining the third concept noise representation and the fourth concept noise representation; and modifying parameters of the diffusion neural network by comparing the additional combined concept noise representation and the additional prompt noise representation (Avrahami discloses “A text prompt can then be constructed that includes the selected concepts. One example text prompt is "a photo of [vi1] and ... [vik]"” in [0039]; “In some implementations, the proposed approach can be performed in two phases. In the first phase, a computing system can designate a set of dedicated text tokens (or handles), freeze the model weights, and optimize the handles to reconstruct the input image. In the second phase, the computing system can switch to fine-tuning the model weights, while continuing to optimize the handles” in [0023]; “in step 405, the system evaluates a loss function. The function includes a reconstruction loss term that generates a loss value based on a comparison of the input image and the synthetic image. As an example, in cases where a latent diffusion model is used, the reconstruction loss can be a latent diffusion loss that measures the difference between a predicted set of noise linked to the synthetic image and a set of noise intentionally added to the input image during the generation of the noised latent image” in [0053]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)… Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations” in [0075]; fine-tuning in [0018, 0028, 0037]. Xiao, Fig 2-3)
As to Claim 8, Xiao in view of Avrahami teaches The method of claim 7, further comprises: identifying an inference-time text prompt comprising multiple text concepts from a client device; and generating, utilizing the diffusion neural network fine-tuned according to the concept-prompt noise representation measure of loss with the parameters modified, a digital image comprising the multiple text concepts (Avrahami discloses “The method includes initializing, by the computing system, a plurality of embeddings respectively for the plurality of visual concepts. The method includes, for each of one or more learning iterations: generating, by the computing system, a text prompt comprising one or more of the plurality of embeddings; processing, by the computing system, the text prompt with an image generation model to generate a synthetic image that depicts the visual concepts associated with the one or more embeddings included in the text prompt” in [0006]; “the goal is to extract a dedicated text token for each concept. This enables generation of novel images from textual prompts, featuring individual concepts or combinations of multiple concepts, as demonstrated in Figure 5” in [0019]. Xiao, Fig 3. See also Claim 1.)
Claim 16 recites similar limitations as claim 1 but in a computer-readable medium form. Therefore, the same rationale used for claim 1 is applied.
Claim 17 is rejected based upon similar rationale as Claim 3.
Claim 18 is rejected based upon similar rationale as Claim 3.
Claim 19 is rejected based upon similar rationale as Claims 4 & 5.
Claim 20 is rejected based upon similar rationale as Claim 15.
Claims 9-15 are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention) in view of Avrahami and Hu et al. (LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS, arXiv:2106.09685v2 [cs.CL] 16 Oct 2021).
Claim 9 recites similar limitations as claim 1 in a system form, and further recites static and training diffusion neural networks (Xiao teaches stable diffusion at p. 3). Hu further teaches wherein parameters of the static diffusion neural network remain fixed during the training of the training diffusion neural network (Hu discloses “We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights” in Abstract; “A pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can freeze the shared model and efficiently switch tasks by replacing the matrices A and B in Figure 1, reducing the storage requirement and task-switching overhead significantly” at p. 2.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the invention of Xiao and Avrahami with the invention of Hu so as to greatly reduce the number of trainable parameters for downstream tasks.
Claim 10 is rejected based upon similar rationale as Claim 2.
Claim 11 is rejected based upon similar rationale as Claim 2.
Claim 12 is rejected based upon similar rationale as Claim 7.
As to Claim 13, Xiao in view of Avrahami and Hu teaches The system of claim 9, wherein comparing the prompt noise representation and a combined concept noise representation from the first concept noise representation, and the second concept noise representation comprises utilizing a loss function to determine a concept-prompt noise representation measure of loss to backpropagate through one or more denoising steps of the training diffusion neural network (Xiao discloses “denoising loss (Figure 3)” at p. 5 and cross-attention localization loss under section 5.4 Ablation Study. Avrahami discloses “The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)” in [0075], see also [0076].)
As to Claim 14, Xiao in view of Avrahami and Hu teaches The system of claim 9, wherein training the training diffusion neural network further comprise: applying a stop gradient operation to the additional denoising step of the static diffusion neural network utilized to generate the first concept noise representation and the second concept noise representation, wherein the stop gradient operation controls a gradient flow by stopping a concept-prompt noise representation measure of loss from being backpropagated to more than the additional denoising step selected from a plurality of denoising steps (Avrahami discloses “The update could be performed using gradient-based optimization algorithms such as stochastic gradient descent (SGD), RMSprop, or Adam” in [0056]; “For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function)” in [0075].)
Claim 15 is rejected based upon similar rationale as Claims 1 & 8.

Conclusion
THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WEIMING HE whose telephone number is (571)270-1221. The examiner can normally be reached Monday-Friday, 8:30am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard can be reached on 571-272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Weiming He/ Primary Examiner, Art Unit 2611
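The comparison recited in claim 1, as quoted in the rejection above, can be illustrated with a toy sketch. This is not the applicant's implementation and not anything disclosed by Xiao or Avrahami: `predict_noise`, the elementwise-sum `combine` rule, the mean-squared-error loss, and the token weights are all hypothetical stand-ins chosen only to show the data flow (one prompt-conditioned prediction, two concept-conditioned predictions, a combination, and a loss between the two).

```python
import math

def predict_noise(latent, tokens, weights):
    """Hypothetical stand-in for one conditioned denoising step.

    The conditioning strength is the sum of the (made-up) weights of the
    tokens passed in; a real diffusion model would use cross-attention.
    """
    scale = sum(weights[t] for t in tokens)
    return [math.tanh(scale * x) for x in latent]

def combine(first, second):
    """Assumed combination rule: elementwise sum of two concept predictions."""
    return [a + b for a, b in zip(first, second)]

def mse(xs, ys):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def concept_prompt_loss(first_noise, prompt_tokens, concept_a, concept_b, weights):
    """Compare the combined concept prediction with the full-prompt prediction.

    All three predictions start from the same first noise representation
    and use the same (second) denoising step, mirroring the claim wording.
    """
    prompt_rep = predict_noise(first_noise, prompt_tokens, weights)
    rep_a = predict_noise(first_noise, concept_a, weights)
    rep_b = predict_noise(first_noise, concept_b, weights)
    return mse(combine(rep_a, rep_b), prompt_rep)

# Illustrative values only.
weights = {"cat": 0.3, "dog": 0.5}
first_noise = [0.2, -0.4, 1.0]
loss = concept_prompt_loss(first_noise, ["cat", "dog"], ["cat"], ["dog"], weights)
print(loss)
```

In an actual training loop this scalar would be backpropagated to update model parameters; the stop-gradient behavior recited in claims 2 and 14 needs an automatic-differentiation framework and is not shown here.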

Prosecution Timeline

Dec 17, 2025
Request for Continued Examination
Jan 15, 2026
Response after Non-Final Action
Feb 13, 2026
Non-Final Rejection mailed — §103, §112
Mar 20, 2026
Interview Requested
Apr 01, 2026
Applicant Interview (Telephonic)
Apr 05, 2026
Examiner Interview Summary
Apr 27, 2026
Response Filed
May 19, 2026
Final Rejection mailed — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12639877
REFINEMENT OF FACIAL KEYPOINT METADATA GENERATION FOR VIDEO CONFERENCING OR OTHER APPLICATIONS
3y 6m to grant Granted May 26, 2026
Patent 12632615
DATA SERIALIZATION EXTRUSION FOR CONVERTING TWO-DIMENSIONAL IMAGES TO THREE-DIMENSIONAL GEOMETRY
5y 11m to grant Granted May 19, 2026
Patent 12633000
TEXT-TO-IMAGE SYNTHESIS UTILIZING DIFFUSION MODELS WITH TEST-TIME ATTENTION SEGREGATION AND RETENTION OPTIMIZATION
2y 11m to grant Granted May 19, 2026
Patent 12608891
INFORMATION PROCESSING DEVICE, HEAD-MOUNTED DISPLAY DEVICE, CONTROL METHOD OF INFORMATION PROCESSING DEVICE, AND NON-TRANSITORY COMPUTER READABLE MEDIUM WITH WHITE-BALANCE CORRECTION VALUE CORRESPONDING TO COLOR TEMPERATURE OF ENVIRONMENT LIGHT-SOURCE
2y 9m to grant Granted Apr 21, 2026
Patent 12567135
MULTIMEDIA PLAYBACK MONITORING SYSTEM AND METHOD, AND ELECTRONIC APPARATUS
2y 1m to grant Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds
5-6
Grant Probability
46%
With Interview
59% (+13.0%)
Median Time to Grant
3y 4m (~8m remaining)
PTA Risk
High
Based on 414 resolved cases by this examiner. Grant probability derived from career allowance rate.
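The figures above can be reproduced with simple arithmetic. The sketch below assumes that the grant probability is the career allowance rate rounded to a whole percent and that the interview lift is added in percentage points; the page does not state its exact method, and the function names are illustrative.

```python
def allowance_rate_pct(granted, resolved):
    """Career allowance rate as a percentage of resolved cases."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return 100.0 * granted / resolved

def with_interview_pct(base_pct, lift_points):
    """Assumed additive model: base rate plus lift in percentage points."""
    return base_pct + lift_points

career = allowance_rate_pct(191, 414)
grant_probability = round(career)
interview_probability = with_interview_pct(grant_probability, 13)

print(f"{career:.1f}%")        # 46.1%
print(grant_probability)       # 46
print(interview_probability)   # 59
```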
