Prosecution Insights
Last updated: April 19, 2026
Application No. 18/817,915

SELF ATTENTION REFERENCE FOR IMPROVED DIFFUSION PERSONALIZATION

Status: Non-Final OA (§103)
Filed: Aug 28, 2024
Examiner: WANG, YUEHAN
Art Unit: 2617
Tech Center: 2600 — Communications
Assignee: Adobe Inc.
OA Round: 1 (Non-Final)

Grant Probability: 83% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 7m
With Interview: 96%

Examiner Intelligence

Career Allow Rate: 83% — above average (404 granted / 485 resolved; +21.3% vs TC avg)
Interview Lift: +12.9% (moderate) across resolved cases with an interview
Avg Prosecution: 2y 7m typical timeline; 47 applications currently pending
Career History: 532 total applications across all art units

Statute-Specific Performance

§101: 4.3% (-35.7% vs TC avg)
§103: 69.6% (+29.6% vs TC avg)
§102: 8.3% (-31.7% vs TC avg)
§112: 6.6% (-33.4% vs TC avg)

Comparisons are against the Tech Center average estimate • Based on career data from 485 resolved cases

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections

Claim 1 is objected to because of the following informalities: Claim 1 recites "obtaining a reference image an input prompt". It should read "obtaining a reference image and an input prompt." Appropriate correction is required.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (US 12346995 B2), referred to herein as Liu, in view of Zeng et al. (US 20250166237 A1), referred to herein as Zeng.

Regarding Claim 1, Liu in view of Zeng teaches a method comprising: obtaining a reference image an input prompt describing an image element (Liu col 7, ln 10-13: The user input text 220 may include style and/or scene words that describe visual features that the user desires to have represented in the synthesized image 130; FIG. 2: 124: input image, 220: user input); identifying an object from the reference image (Liu col 5, ln 65-68: the image encoder 200 includes a plurality of pre-trained layers 208 (e.g. 14 pre-trained layers) that are trained for general object recognition; col 6, ln 28-30: a user identifier 214 that acts as a placeholder in a text description that is used by the diffusion model 204 to generate the synthesized image 130); generating, using an image generation model, image features representing the object based on the reference image (Liu Abst: The diffusion model is configured to receive the input feature vector and generate a synthesized image of the user based at least on the input feature vector; col 4, ln 19-30: A trained machine learning diffusion model 128 is configured to receive the image 124 of the user and generate a synthesized image 130 of the user based at least on the image 124 of the user captured via the camera 110. The synthesized image 130 includes a character having the same or similar visual features as the user (e.g., the same eye shape, eye color, nose shape, mouth shape, cheek bone shape, complexion etc.)); and generating, using the image generation model, a synthetic image (Liu col 4, ln 19-30: the synthesized image 130 includes additional stylized visual features. For example, the character may have different clothes, assume different body poses, and/or may be placed in a different scene).
However, Zeng explicitly teaches depicting the image element and the object based on the input prompt and the image features from the reference image (Zeng [0115]: The generated images have as the foreground image input 601 with the subject "fox" 611 in different poses… The background of images 603, 604, 605, and 606 are scenes described by prompt 602. Image 603 is input 601 with a background scene of a forest in spring). Zeng discloses methods for using neural networks for generating multiple related images, which is analogous to the present patent application. It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Zeng, and apply the generated images from a text prompt and an input image to methods for generating a synthesized image of a user with a trained machine learning diffusion model. Doing so would improve neural networks that generate images as well as ways of training these neural networks.

Regarding Claim 2, Liu in view of Zeng teaches the method of claim 1, and further teaches wherein: the reference image depicts the object in a first scene and the synthetic image depicts the object in a second scene described by the input prompt (Liu col 4, ln 26-29: the synthesized image 130 includes additional stylized visual features. For example, the character may have different clothes, assume different body poses, and/or may be placed in a different scene).

Regarding Claim 3, Liu in view of Zeng teaches the method of claim 1, and further teaches wherein generating the image features comprises: generating an object mask that indicates a location of the object in the reference image, wherein the image features are generated based on the object mask (Zeng [0076]: Post processing 208 includes, for each image of the image-subject pairs, processing the image by object detection and segmentation to separate the subject (e.g., fox) of each image in the image set (e.g., the foxes in image set {z.sub.l, z.sub.l+1, z.sub.l+2 . . . . L}) and extract foreground masks; [0092]: the self-attention layers 405 separate the foreground image from the background image through self-attention; [0093]: the output of the self-attention layer 511 is processed to generate one or more processed feature maps (e.g., 512, 513)).

Regarding Claim 4, Liu in view of Zeng teaches the method of claim 1, and further teaches wherein: the input prompt describes a location of the object in the reference image, wherein generating the image features comprises determining the location of the object based on the input prompt, and wherein the image features are generated based on the location of the object (Zeng [0107]: prompts 602 can be one or more words, phrases, or sentences describing a background scene. In at least one embodiment, prompts 602 include one or more indications by one or more users indicating content of at least one of the two or more different images other than the one or more objects (e.g., words describing backgrounds or features of an image other than a subject in an image)… prompts 602 includes a phrase such as "generate images of the fox in a forest with different seasons," where said fox is referring to input 601).
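For orientation, here is a minimal sketch of the claimed flow that the examiner is mapping in claims 1-4 (reference image plus prompt in, personalized image out). The `segmenter`, `image_encoder`, and `diffusion` interfaces are placeholder assumptions for illustration, not the implementation of Liu, Zeng, or the application under examination.

```python
import torch

@torch.no_grad()
def personalize(reference_image, prompt, segmenter, image_encoder, diffusion):
    """Claim-1-style pipeline: reference image + input prompt -> synthetic image."""
    mask = segmenter(reference_image)                  # object mask (cf. claim 3)
    ref_feats = image_encoder(reference_image * mask)  # image features for the object
    text_emb = diffusion.encode_prompt(prompt)         # prompt describes the image element
    # Sample an image conditioned on both the prompt and the reference features
    return diffusion.sample(text_emb=text_emb, ref_feats=ref_feats)
```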
Regarding Claim 5, Liu in view of Zeng teaches the method of claim 1, and further teaches wherein generating the image features comprises: generating a plurality of layer-specific image features at a plurality of layers of the image generation model, respectively (Liu col 6, ln 1-14: the plurality of pre-trained layers are from a pre-trained Contrastive Language-Image Pre-training (CLIP) ViT. In some implementations, the image encoder 200 further includes a plurality of fine-tuned layers 210 (e.g., 8 fine-tuned layers) that are re-trained specifically to extract visual features of the user from the image 124 of the user. In some implementations, the plurality of fine-tuned layers are re-trained to extract visual features of a face of the user. In other implementations, the plurality of fine-tuned layers are re-trained to extract visual features of a body of the user. In some implementations, the image encoder 200 includes a fully connected layer 212 configured to generate the set of embeddings 206 based at least on the visual features of the face of the user extracted by the plurality of fine-tuned layers 210).

Regarding Claim 6, Liu in view of Zeng teaches the method of claim 1, and further teaches wherein: the image features are generated based on a plurality of reference images (Liu col 10, ln 18-21: FIGS. 5 and 6 show how different input images of different users produce different synthesized images even though the same word embeddings are used to generate the different synthesized images).

Regarding Claim 7, Liu in view of Zeng teaches the method of claim 1, and further teaches wherein: the input prompt comprises a nonce token corresponding to the object (Liu col 6, ln 30-36: the user identifier is implanted in different word embeddings or combined with different sentences by the text encoder 202, and the diffusion model 204 synthesizes a character (i.e., a synthetically generated image of a person) having the visual features of the user in different contexts based at least on the user identifier and the word embeddings and/or sentences; FIG. 5: a portrait of user 1 as …).

Regarding Claim 8, Liu in view of Zeng teaches the method of claim 1, and further teaches wherein: the image generation model is fine-tuned to generate images depicting the object based on the reference image (Liu col 2, ln 38-41: these training images require specific visual characteristics in order for the conventional diffusion model to be fine-tuned to accurately learn visual features of the user).
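Claim 7's "nonce token" corresponds to Liu's "user identifier": a placeholder token that stands in for the personalized object inside an ordinary text prompt. A minimal sketch of the common DreamBooth-style recipe, assuming Hugging Face-style tokenizer and text-encoder interfaces; the token string and initializer word are arbitrary choices made for this illustration.

```python
import torch

NONCE = "<sks>"  # placeholder token that stands in for the personalized object

def build_prompt(template: str) -> str:
    # e.g. build_prompt("a portrait of {obj} as an astronaut")
    return template.format(obj=NONCE)

def register_nonce(tokenizer, text_encoder, init_word: str = "person"):
    """Add the nonce to the vocabulary and seed its embedding from a real word."""
    tokenizer.add_tokens([NONCE])
    text_encoder.resize_token_embeddings(len(tokenizer))
    emb = text_encoder.get_input_embeddings().weight
    src = tokenizer.convert_tokens_to_ids(init_word)
    dst = tokenizer.convert_tokens_to_ids(NONCE)
    with torch.no_grad():
        emb[dst] = emb[src].clone()  # start the nonce near a semantically close word
```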
Claims 9, 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Swaminathan et al. (US 11210831 B2), referred to herein as Swaminathan, in view of Zeng et al. (US 20250166237 A1), referred to herein as Zeng.

Regarding Claim 9, Swaminathan in view of Zeng teaches a method of training an image generation model, comprising (Swaminathan col 1, ln 59: to train the generative model): obtaining a training set (Swaminathan col 1, ln 59-60: to train the generative model using a plurality of training sets), including a reference image depicting an object (Swaminathan col 1, ln 49-53: An image generation system receives a two-dimensional reference image depicting a person and a textual description describing target clothing in which the person is to be depicted as wearing), an input prompt describing an image element (Swaminathan col 1, ln 60-63: where each training set includes a textual description of target clothing and a ground truth image depicting a human subject wearing the target clothing), and a ground-truth image depicting the object and the image element (Swaminathan col 1, ln 60-63: where each training set includes a textual description of target clothing and a ground truth image depicting a human subject wearing the target clothing); and training, using the training set, the image generation model to generate image features for the object based on the reference image (Swaminathan col 5, ln 12-17: the images output by the image generation system described herein include sharp boundaries between depicted objects as a result of the dual loss training approach. With better color correlation, sharp boundaries, and preservation of pixel values corresponding to personally identifiable human aspects (e.g., face, hair, and so forth)).

However, Zeng explicitly teaches and to generate a synthetic image depicting the object based on the input prompt and the image features (Zeng [0115]: The generated images have as the foreground image input 601 with the subject "fox" 611 in different poses. The background of images 603, 604, 605, and 606 are scenes described by prompt 602. In at least one embodiment, image 603 is input 601 with a background scene of a forest in spring). Zeng discloses methods for using neural networks for generating multiple related images, which is analogous to the present patent application. It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Zeng, and apply the generated images from a text prompt and an input image to the system configured to train the generative model to output visually realistic images depicting the human subject in the target clothing. Doing so would improve neural networks that generate images as well as ways of training these neural networks.

Regarding Claim 13, Swaminathan in view of Zeng teaches the method of claim 9, and further teaches wherein training the image generation model comprises: computing a diffusion loss (Swaminathan col 16, ln 1-6: In order to improve performance of the trained generative model 212, the generative model module 122 is further configured to determine a perceptual quality loss 216 associated with an output image 414 produced by the generative model 212 during training); and updating parameters of the image generation model based on the diffusion loss (Swaminathan col 20, ln 40-45: Following each determination of the discriminator loss 214 and the perceptual quality loss 216 for a given training set, the generative model module 122 updates one or more parameters of the generative model 212 to guide the parameters toward their optimal state).
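Claim 13's "diffusion loss" is ordinarily the noise-prediction objective used to train diffusion models (note that the Swaminathan passages cited above actually describe perceptual and discriminator losses). A generic sketch of that objective and the parameter update, using standard DDPM notation; the `model(x_t, t, cond)` signature is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, optimizer, x0, cond, alphas_cumprod):
    """One step of the standard noise-prediction loss and parameter update."""
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward noising
    loss = F.mse_loss(model(x_t, t, cond), noise)           # diffusion loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                        # update model parameters
    return loss.item()
```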
Regarding Claim 14, Swaminathan in view of Zeng teaches the method of claim 9, and further teaches wherein: the image generation model is trained to receive layer-specific image features for the object at a plurality of different layers (Swaminathan col 10, ln 42-45: the human parsing module 114 may extract red, green, blue (RGB) channels of face, skin, and hair regions of the person depicted in the reference image 108 in the form of feature maps to be preserved in generating the image 106; col 14, ln 9-12: the generative model module 122 provides a text vector 204 generated from the corresponding textual description of target clothing for the training set to the generator portion 402; col 14, ln 23-25: This pixel block is subsequently scaled through a plurality of upsampling stages, such as upsampling stage 408 and upsampling stage 410).

Claims 10-12 are rejected under 35 U.S.C. 103 as being unpatentable over Swaminathan et al. (US 11210831 B2), referred to herein as Swaminathan, in view of Zeng et al. (US 20250166237 A1), referred to herein as Zeng, and over Liu et al. (US 12346995 B2), referred to herein as Liu.

Regarding Claim 10, Swaminathan in view of Zeng teaches the method of claim 9. However, Liu teaches wherein: the image generation model is pre-trained in a first training phase to receive the image features at the attention layer (Liu FIG. 2: 208: pre-trained layers, 210: fine-tuned layers, 204: diffusion model; col 2, ln 19-28: One type of conventional generative diffusion model is pre-trained on a large amount of image and corresponding text data to generate image content based on text inputs. For this conventional diffusion model to generate synthesized photorealistic images of a particular user, it needs to be fine-tuned with multiple (10 or more) training images of the user. Once this conventional diffusion model is fine tuned for a particular user, it can generate various synthesized images of a synthetic person who has similar visual features as the user). Zeng further teaches without receiving the image features at an attention layer (Zeng [0069]: neural network 102 includes a convolution layer 103, a self-attention layer 104, and a cross-attention layer 105. In at least one embodiment, neural network 102 includes a diffusion model that is pre-trained model and learned to reverse a diffusion process). Liu discloses systems and methods for generating a synthesized image of a user with a trained machine learning diffusion model, which is analogous to the present patent application. It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to have modified Swaminathan to incorporate the teachings of Liu, and apply the trained diffusion model and fine-tuned layers to the system configured to train the generative model to output visually realistic images depicting the human subject in the target clothing.
Doing so, using the technical feature of fine-tuning the image encoder instead of fine-tuning the diffusion model itself, would provide the technical benefit of reducing the overall time and consumption of computer resources needed to enable the trained machine learning diffusion model to generate a synthesized image of a user from an input image of the user, relative to previous diffusion models.

Regarding Claim 11, Swaminathan in view of Zeng and Liu teaches the method of claim 10, and further teaches wherein: each layer of the image generation model is updated during the second training phase (Swaminathan col 20, ln 40-52: Following each determination of the discriminator loss 214 and the perceptual quality loss 216 for a given training set, the generative model module 122 updates one or more parameters of the generative model 212 to guide the parameters toward their optimal state. Training the generative model 212 may continue until one or more of the discriminator loss 214 or the perceptual quality loss 216 satisfy one or more threshold loss values. Alternatively, training the generative model 212 may continue until all the training datasets have been processed by the image generation system 104. Upon completion of training, the generative model module 122 is configured to output the trained generative model 212; Zeng [0066]: specific layers of a neural network are trained to identify what features of a subject are common in different images and how different text prompts correspond to image features; [0067]: a generated feature map, which includes identified features and weight values, is then input into another layer that identifies image features that correspond to text prompts. In at least one embodiment, a layer receives a feature map (for each input image) and text prompts corresponding to each feature map).

Regarding Claim 12, Swaminathan in view of Zeng and Liu teaches the method of claim 10, and further teaches wherein: the attention layer receives a different number of input tokens during the first training phase and the second training phase (Zeng [0084]: receives one or more inputs 402 and generates one or more output images 408 where each image of output images 407 includes a shared image object (e.g., one of the subjects from subject set 203 {x.sub.n, x.sub.n+1, x.sub.n+2, . . . N} where each of x.sub.n, x.sub.n+1, x.sub.n+2, N are different subjects among the list of subjects) displayed over one or more different backgrounds; [0085]: The feature image maps generated by convolution layer 403 are received by one or more concatenation layers 404 that concatenate each of the feature image maps into a single feature map, where said single feature map is provided to one or more self-attention layers 405. In at least one embodiment, self-attention layers 405, performed by one or more processors, compare each element of said single feature map to determine which elements are more alike each other element).
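Claims 10-12 turn on where the reference features enter the model (an attention layer) and on the token count that layer sees with versus without those features. Below is a self-contained sketch of a self-attention block whose key/value stream is optionally extended with reference tokens: with `ref_tokens` supplied, the layer attends over N+M tokens rather than N (cf. claim 12). Shapes and names are illustrative assumptions, not any party's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefSelfAttention(nn.Module):
    """Self-attention that can also attend to tokens from a reference image."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, hidden, ref_tokens=None):
        # hidden: (B, N, D) denoising-stream tokens; ref_tokens: (B, M, D) or None
        kv = hidden if ref_tokens is None else torch.cat([hidden, ref_tokens], dim=1)
        B, N, D = hidden.shape
        h, d = self.heads, D // self.heads
        q = self.to_q(hidden).view(B, N, h, d).transpose(1, 2)
        k = self.to_k(kv).view(B, kv.shape[1], h, d).transpose(1, 2)
        v = self.to_v(kv).view(B, kv.shape[1], h, d).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # queries come only from `hidden`
        return self.to_out(out.transpose(1, 2).reshape(B, N, D))
```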
Claims 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Karpman et al. (US 11995803 B1), referred to herein as Karpman, in view of Zeng et al. (US 20250166237 A1), referred to herein as Zeng.

Regarding Claim 15, Karpman in view of Zeng teaches an apparatus comprising (Karpman FIG. 1): at least one processor; at least one memory storing instructions executable by the at least one processor (Karpman col 2, ln 23-28: FIG. 1 depicts a system 100 for generating images according to some embodiments. System 100 includes a server system 102 and client device(s) 104. Server system 102 includes processors and memory resources storing computer-executable instructions that define training data processing models 106); and an image generation model comprising parameters stored in the at least one memory (Karpman col 2, ln 51-53: As shown in FIG. 1, the system includes a text-to-image diffusion model 112. Text-to-image diffusion model 112 may be a probabilistic generative model used to generate image data; col 3, ln 15-21: Text-to-image diffusion model 112 can execute the base image diffusion model 120 (and the high-resolution diffusion models 116) on the assembled training set (e.g., text-image pairs) to infer and/or encode custom parameters for iteratively transforming randomly sampled visual noise into a visually appealing synthetic image that aligns with visual concepts described by a text prompt; FIG. 1: 110: storage devices) and trained to generate image features (Karpman col 4, ln 53-65: a vision transformer configured to divide an input image into segments and generate a corresponding sequence of embeddings for the input image… (2) as an image-aware text encoder that modifies embeddings generated by the pre-trained text encoder and/or unimodal encoder module to include visual information based on visual features of an input image (e.g., via cross-attention to the input image)).

However, Zeng explicitly teaches: an object depicted in a reference image (Zeng [0074]: a subject of the inputs 201 is an object of two or more outputs 209 (e.g., a text includes fox, and images each include that fox)) and to generate a synthetic image depicting the object based on an input prompt and the image features (Zeng [0115]: The generated images have as the foreground image input 601 with the subject "fox" 611 in different poses. The background of images 603, 604, 605, and 606 are scenes described by prompt 602. Image 603 is input 601 with a background scene of a forest in spring). Karpman in view of Zeng further teaches wherein the image generation model receives the image features via an attention layer (Karpman col 17, ln 14-25: at Block M200, the method can execute the base image diffusion model 120 on the embedding representation to:… iteratively (e.g., progressively) denoise the random distribution according to the fine-tuned set of image generation parameters (e.g., parameters learned by the base image diffusion model 120 during training and/or fine-tuning) and a semantic representation of the image description encoded in the one or more embedding representations (e.g., according to outputs of the base image diffusion model 120's cross attention layers)). Zeng discloses methods for using neural networks for generating multiple related images, which is analogous to the present patent application. It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Zeng, and apply the generated images from a text prompt and an input image to the system configured to train the generative model to output visually realistic images depicting the human subject in the target clothing. Doing so would improve neural networks that generate images as well as ways of training these neural networks.
Regarding Claim 16, Karpman in view of Zeng teaches the apparatus of claim 15, and further teaches wherein: the image generation model comprises a diffusion U-Net (Karpman col 5, ln 20-21: The base image diffusion model 120 can include a U-net architecture (e.g., Efficient U-Net)).

Regarding Claim 17, Karpman in view of Zeng teaches the apparatus of claim 16, and further teaches wherein: the image generation model receives the image features at a plurality of attention layers corresponding to a plurality of decoder layers of the diffusion U-Net (Karpman col 5, ln 20-28: The base image diffusion model 120 can include a U-net architecture (e.g., Efficient U-Net) defined from residual and multi-head attention blocks that enable the base image diffusion model 120 to progressively denoise (e.g., infill, generate, augment) image data according to cross-attention inputs based on the text prompt. The base image diffusion model 120 can therefore: receive one or more text embeddings from the set of pre-trained text encoders 118).

Regarding Claim 18, Karpman in view of Zeng teaches the apparatus of claim 15, and further teaches the apparatus further comprising: a mask generation network configured to generate an object mask that indicates a location of the object in the reference image (Zeng [0076]: Post processing 208 includes, for each image of the image-subject pairs, processing the image by object detection and segmentation to separate the subject (e.g., fox) of each image in the image set (e.g., the foxes in image set {z.sub.l, z.sub.l+1, z.sub.l+2 . . . . L}) and extract foreground masks. In at least one embodiment, the extracted foreground masks are representations of each subject's pose separated from any background image that may be present in image set 206. In at least one embodiment, post processing 208 of the collage or set of extracted foreground masks of the image-subject pairs includes separating the collage of images 206 into individual outputs 209).

Regarding Claim 19, Karpman in view of Zeng teaches the apparatus of claim 15, and further teaches the apparatus further comprising: a text encoder configured to encode the input prompt (Karpman col 2, ln 60-61: Text encoders 118 interpret a text query and generate an embedding of the text query).

Regarding Claim 20, Karpman in view of Zeng teaches the apparatus of claim 15, and further teaches wherein: the attention layer comprises a self-attention layer (Zeng [0069]: neural network 102 includes a convolution layer 103, a self-attention layer 104, and a cross-attention layer 105).
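Claims 16-17 place the reference features at attention layers on the decoder side of a diffusion U-Net. A sketch of that wiring, assuming a hypothetical `unet` interface (`encode`, `decoder_blocks`, `to_noise`) and decoder blocks built around something like the `RefSelfAttention` module above; none of this is Karpman's or the applicant's actual architecture.

```python
def denoise_with_layer_refs(unet, x_t, t, text_emb, per_layer_ref_tokens):
    """Feed layer-specific reference tokens to the attention layer of each
    U-Net decoder stage (cf. claims 16-17)."""
    h, skips = unet.encode(x_t, t, text_emb)   # encoder path plus skip features
    for block, ref in zip(unet.decoder_blocks, per_layer_ref_tokens):
        h = block(h, skips.pop(), text_emb=text_emb, ref_tokens=ref)
    return unet.to_noise(h)                    # predicted noise residual
```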
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Samantha (Yuehan) Wang, whose telephone number is (571) 270-5011. The examiner can normally be reached Monday-Friday, 8am-5pm. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, King Poon, can be reached at (571) 272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Samantha (YUEHAN) WANG/
Primary Examiner, Art Unit 2617

Prosecution Timeline

Aug 28, 2024
Application Filed
Feb 06, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597178
VECTOR OBJECT PATH SEGMENT EDITING
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12597506
ENDOSCOPIC EXAMINATION SUPPORT APPARATUS, ENDOSCOPIC EXAMINATION SUPPORT METHOD, AND RECORDING MEDIUM
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12586286
DIFFERENTIABLE REAL-TIME RADIANCE FIELD RENDERING FOR LARGE SCALE VIEW SYNTHESIS
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12586261
IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD, AND STORAGE MEDIUM
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12567182
USING AUGMENTED REALITY TO VISUALIZE OPTIMAL WATER SENSOR PLACEMENT
Granted Mar 03, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
With Interview: 96% (+12.9%)
Median Time to Grant: 2y 7m
PTA Risk: Low

Based on 485 resolved cases by this examiner. Grant probability derived from career allow rate.
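The derivation behind these headline figures appears to be simple arithmetic on the career data quoted above; a quick check (the additive interview adjustment is an inference from the dashboard's own numbers, not a documented formula):

```python
granted, resolved = 404, 485
allow_rate = granted / resolved               # 0.833 -> "83% Grant Probability"
interview_lift = 0.129                        # "+12.9% Interview Lift"
with_interview = allow_rate + interview_lift  # 0.962 -> "96% With Interview"
print(f"{allow_rate:.1%} base, {with_interview:.1%} with interview")
```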
