Office Action Analysis: 18785914 — SCORE BASED FINE-GRAINED CONTROL OF CONCEPT GENERATION

Office Action

§102 §103
Notice of Pre-AIA  or AIA  Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Claim Rejections - 35 USC § 102
2.	In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
3.	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.



4.	Claims 1-4, 6, 8-9, and 11-19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning (Shi et al. hereinafter Shi).
Regarding claim 1, Shi teaches A method comprising: obtaining an input prompt, (Sect 3.1, Par 2 “We first inject a unique identifier ˆV to the input prompt to represent the object concept, then use a learnable image encoder to map the input images to a concept textual embedding.”)
a reference image, (Sect 3.1, Par 1 “Given a few images of a concept, the goal is to generate new high-quality images of this concept from text description p. The generated image variations should preserve the identity of the input concept.”, Sect 4.4 Ablation Study, Par 1 “Single image as input. Since our model is flexible for the number of input images, we evaluate our model using a single image as the input image condition, i.e., N = 1.”)
and a transform input, (Sect 3.4 Concept Token Renormalization Par 1-2 “The adapter weight β is set to 0.3 in this case. The attention of the identifier is significantly higher than the other words, while the key words such as “night” and “witcher” are assigned low attention weights, showing a sign of language forgetting. To address this issue, we renormalize the concept token with a factor of α ∈ (0,1]”, Par 3 “Without concept renormalization, the model failed to generate the “witcher” style or the “night” back ground. With renormalization, the attentions of the nouns are more balanced, and the model successfully generates the “witcher” style and the “night” background.” Where the adapter weight β and concept token α adjust how the input prompt and reference image are weighted towards the output image.)
wherein the input prompt describes a scene, (Introduction, Par 1 “generate new scenes or styles of the concept from input prompts”)
the reference image depicts an object, (Sect 3.1, Par 1 “Given a few images of a concept, the goal is to generate new high-quality images of this concept from text description p. The generated image variations should preserve the identity of the input concept. As DreamBooth [34] summarized, the variations include changing the concept’s location, property or style, modifying the subject’s pose, structure, expression or material, etc.” where the subject is the depicted object)
and the transform input indicates a target level of transformation for the object; (Sect 3.4 Concept Token Renormalization Par 1-2 “The adapter weight β is set to 0.3 in this case. The attention of the identifier is significantly higher than the other words, while the key words such as “night” and “witcher” are assigned low attention weights, showing a sign of language forgetting. To address this issue, we renormalize the concept token with a factor of α ∈ (0,1]”, Par 3 “Without concept renormalization, the model failed to generate the “witcher” style or the “night” back ground. With renormalization, the attentions of the nouns are more balanced, and the model successfully generates the “witcher” style and the “night” background.” Where the adapter weight β and concept token α adjust how the input prompt and reference image are weighted towards the output image.)
generating, using an object encoder of an image generation model, an object embedding based on the reference image and the transform input, wherein the object embedding represents the object and the target level of transformation; (Sect 3.2 Concept Embedding Learning Par 1-2 “we adopt an image encoder Ec to map the images to a compact concept feature vector fc in the textual space. Specifically, fc is the average feature vector of the global features of all input images. We have: N fc = i=1 Ec(xi s)/N (2). To obtain the final textual embeddings of the input prompt, we first obtain the CLIP [28] Text embeddings cs of the modified prompt cs = CLIP(ps), then replace the embedding of identifier ˆ V with the concept feature fc to obtain the concept injected textual embedding c. This final embedding will be the condition in the cross-attention layers of the text-to-image diffusion model.”, Sect 3.4 Concept Token Renormalization Par 2 “To address this issue, we renormalize the concept token with a factor of α ∈ (0,1]. We have: fc = α·fc.” Where the final embedding c represents the concept feature vector fc of the input images as modified by the concept token α.)
and generating, using the image generation model, a synthetic image based on the input prompt and the object embedding, wherein the synthetic image depicts the object in the scene from the input prompt with the target level of transformation (Sect 3.1, par 1 “Given a few images of a concept, the goal is to generate new high-quality images of this concept from text description p. The generated image variations should preserve the identity of the input concept.”)

	Regarding claim 2, Shi teaches the method of claim 1, wherein obtaining the reference image comprises: obtaining a preliminary image depicting the object; and removing a background from the preliminary object to obtain the reference image (Sect 3.2, Par 1 “Since the object of the input concept in the images may not be large enough, we crop out the object from each image to obtain a set of conditional image Xs = {xi s}N 1 . To further enforce the model to focus on the exact object, we mask out the background of each cropped image” where the cropped image is the preliminary image.)

	Regarding claim 3, Shi teaches the method of claim 1, wherein generating the object embedding comprises: generating a preliminary embedding representing the object; (Sect 3.2 Concept Embedding Learning: “To this end, we convert the input images into a textual concept embedding”)
and transforming the preliminary embedding based on the transform input to obtain the object embedding (Sect 3.2 Concept Embedding Learning Par 1-2 “we adopt an image encoder Ec to map the images to a compact concept feature vector fc in the textual space. Specifically, fc is the average feature vector of the global features of all input images. We have: N fc = i=1 Ec(xi s)/N (2). To obtain the final textual embeddings of the input prompt, we first obtain the CLIP [28] Text embeddings cs of the modified prompt cs = CLIP(ps), then replace the embedding of identifier ˆ V with the concept feature fc to obtain the concept injected textual embedding c. This final embedding will be the condition in the cross-attention layers of the text-to-image diffusion model.”, Sect 3.4 Concept Token Renormalization Par 2 “To address this issue, we renormalize the concept token with a factor of α ∈ (0,1]. We have: fc = α·fc.” Where the final embedding c represents the concept feature vector fc of the input images as modified by the concept token α.)
	Regarding claim 4, Shi teaches the method of claim 3, further comprising: encoding the transform input to obtain a projection vector, wherein the preliminary embedding is transformed based on the projection vector (Sect 3.2 Concept Embedding Learning “Since the identifier ˆV has indicated the location of the textual embedding, we adopt an image encoder Ec to map the images to a compact concept feature vector fc in the textual space. Specifically, fc is the average feature vector of the global features of all input images … To obtain the final textual embeddings of the input prompt, we first obtain the CLIP [28] Text embeddings cs of the modified prompt cs = CLIP(ps), then replace the embedding of identifier ˆV with the concept feature fc to obtain the concept injected textual embedding c”).

	Regarding claim 6, Shi teaches the method of claim 1, further comprising: encoding the input prompt to obtain a text embedding, wherein the synthetic image is generated based on the text embedding (Sect 3.2 Concept Embedding Learning "Since the identifier ˆV has indicated the location of the textual embedding, we adopt an image encoder Ec to map the images to a compact concept feature vector fc in the textual space. Specifically, fc is the average feature vector of the global features of all input images … To obtain the final textual embeddings of the input prompt, we first obtain the CLIP [28] Text embeddings cs of the modified prompt cs = CLIP(ps), then replace the embedding of identifier ˆV with the concept feature fc to obtain the concept injected textual embedding c")

	Regarding claim 8, Shi teaches the method of claim 1, wherein: the transform input includes a size parameter, an identity parameter, or both (Sect 3.4, Balanced Sampling “During training, β in Eq. 3 is set to 1. During inference, however, we observe that setting β to 1 results in a strong reconstruction of the input images with good identity preservation, while the language-image alignment is weakened. Since the original pre-trained model has a deep understanding of the language, we reduce the value of β during inference so that the adapter layer takes both the visual information from the original pre-trained model and the conditioning images. We observe that β actually plays the primary role for achieving a good balance between language understanding and identity preservation.” Where β determines identity preservation, acting as an identity parameter.)

	Regarding claim 9, Shi teaches the method of claim 8, wherein: the identity parameter indicates a pose of the object, a view angle of the object, or both (Sect 4.3, Qualitative Results “We observe that our method can also support large pose and structure variations, such as ‘riding bycicle’ and ‘open arms’”)

	Regarding claim 11, Shi teaches the method of claim 1, wherein: the transform input indicates a target level of identity preservation for the object (Sect 3.4, Balanced Sampling “During training, β in Eq. 3 is set to 1. During inference, however, we observe that setting β to 1 results in a strong reconstruction of the input images with good identity preservation, while the language-image alignment is weakened. Since the original pre-trained model has a deep understanding of the language, we reduce the value of β during inference so that the adapter layer takes both the visual information from the original pre-trained model and the conditioning images. We observe that β actually plays the primary role for achieving a good balance between language understanding and identity preservation.” Where β determines identity preservation, acting as an identity parameter.)

	Regarding claim 12, Shi teaches A method comprising: obtaining a training set including a training input image (Sect 4.1 Datasets and Metric – Datasets Par 1 “We select 50 identity in the test split of PPR10k [22], where each selected identity is guaranteed to have more than 5 images and we only keep the first 5 images in naming order as our test input”), 
a training target image (Sect 4.1 Datasets and Metric – Metrics Par 2 “It is measured by the similarity of CLIP visual features between the input image and the generated image”, Sect 3.3 Model Training Par 1 “During training, we use heavy augmentation A to obtain variations of masked images Xs. The original image set Xt (without cropping out the object region or masking out the background) is regarded as the ground-truth.”), 
and a training transform input (Sect 4.1 Datasets and Metric – Metrics Par 4 “We construct various prompts ranging from background modifications (“A photo of ˆV [class noun] on the moon”), to style changes (“An oil painting of ˆV [class noun]”), and a compositional prompt (“ˆV [class noun] shaking hand with Biden”).”), 
wherein the training target image depicts an object from the training input image with a target level of transformation indicated by the transform input (Section 4.1 Datasets and Metric – Metrics Par 2 “It is measured by the similarity of CLIP visual features between the input image and the generated image”, Metrics Par 4 “measure the vision-language alignment between the input prompt and the output image” , Sect 3.3 Model Training Par 1 “During training, we use heavy augmentation A to obtain variations of masked images Xs. The original image set Xt (without cropping out the object region or masking out the background) is regarded as the ground-truth.”), 
and training, using the training set, an image generation model to generate an object embedding that represents the object with the target level of the transformation (Sect 3.2 Concept Embedding Learning Par 1-2 “we adopt an image encoder Ec to map the images to a compact concept feature vector fc in the textual space. Specifically, fc is the average feature vector of the global features of all input images. We have: N fc = i=1 Ec(xi s)/N (2). To obtain the final textual embeddings of the input prompt, we first obtain the CLIP [28] Text embeddings cs of the modified prompt cs = CLIP(ps), then replace the embedding of identifier ˆ V with the concept feature fc to obtain the concept injected textual embedding c. This final embedding will be the condition in the cross-attention layers of the text-to-image diffusion model.”, Sect 3.4 Concept Token Renormalization Par 2 “To address this issue, we renormalize the concept token with a factor of α ∈ (0,1]. We have: fc = α·fc.” Where the final embedding c represents the concept feature vector fc of the input images as modified by the concept token α.)
and to generate a synthetic image based on the object embedding, wherein the synthetic image depicts the object with the target level of the transformation (Sect 3.1, Par 1 “Given a few images of a concept, the goal is to generate new high-quality images of this concept from text description p. The generated image variations should preserve the identity of the input concept.”)

	Regarding claim 13, Shi teaches the method of claim 12, wherein training the image generation model comprises: jointly training an object encoder that generates the object embedding and a diffusion model that generates the synthetic image (Sect 4.2 Implementation Details “We utilize the Stable Diffusion [33] V1-4 as our pre trained text-to-image model, which is the current leading model available to the public. For all experiments of our model, we use “sks” as the unique identifier ˆV as suggested in DreamBooth [40]. For both the concept encoder Ec and the patch encoder Ep, we use the pre-trained CLIP image encoder as the backbone followed by a randomly initialized fully-connected layer. During training, we freeze the back bone of the image encoders and only update the FC layers and the adapter layers. The weights of CLIP text encoder and the original weights in the U-Net of the pre-trained text to-image model are also frozen.”)

	Regarding claim 14, Shi teaches The method of claim 12, wherein obtaining the training set comprises: obtaining a preliminary image; and applying an image transformation to the preliminary image to obtain the training input image. (Sect 3.3 Model Training, Par 1 “During training, we use heavy augmentation A to obtain variations of masked images Xs. The original image set Xt (without cropping out the object region or masking out the background) is regarded as the ground-truth”, Section 3.4 Model Inference – Arbitrary Number of Input Images, Par 1 “During model’s inference, we still mask out the background of the cropped images, but do not perform any augmentations to the masked images, i.e., A = None.”)

	Regarding claim 15, Shi teaches The method of claim 12, wherein training the image generation model comprises: generating an intermediate output image, computing a reconstruction loss between the intermediate output image and the training target image (Sect 4.1 Dataset and Metric – Metrics Par 2 “Reconstruction is to evaluate whether the identity can be fully preserved by the default prompt “A photo of ˆV [class noun]”, where [class noun] can be person or cat. It is measured by the similarity of CLIP visual features between the input image and the generated image”); 
and updating parameters of the image generation model based on the reconstruction loss (Sect 4.4 Ablation Study – Par 7 “Adjust the adapter weight β and concept renormalization factor α. Tab.4 shows different compositions of β and α. The results indicate that larger β or α can both contribute to better identity preservation but weaker language comprehension ability. We finally choose the model with β = 0.3, α = 0.4 as a trade-off.”). 

    PNG
    media_image1.png
    225
    315
    media_image1.png
    Greyscale


	Regarding claim 16, the apparatus claim 16 is similar in scope to the method claim 1, and is rejected under the same rationale.

	Regarding claim 17, Shi teaches the apparatus of claim 16, wherein: the image generation model comprises an object encoder trained to generate the object embedding. (Sect 3.1, Par 2 “We first inject a unique identifier ˆV to the input prompt to represent the object concept, then use a learnable image encoder to map the input images to a concept textual embedding. The pre-trained diffusion model takes the concept embedding along with the embedding of the original prompts to generate new images of the input concept. To enhance the identity of the generated images, we introduce adapter layers to the pre-trained model to take rich patch features extracted from the input images for better identity preservation”).

	Regarding claim 18, Shi teaches the apparatus of claim 16, further comprising: the image generation model comprises a diffusion model trained to generate the synthetic image (Sect 3.1, Par 2 “We first inject a unique identifier ˆV to the input prompt to represent the object concept, then use a learnable image encoder to map the input images to a concept textual embedding. The pre-trained diffusion model takes the concept embedding along with the embedding of the original prompts to generate new images of the input concept. To enhance the identity of the generated images, we introduce adapter layers to the pre-trained model to take rich patch features extracted from the input images for better identity preservation”)

	Regarding claim 19, Shi teaches the apparatus of claim 16, further comprising: a text encoder configured to encode the input prompt to obtain a text embedding, wherein the synthetic image is generated based on the text embedding (Sect 3.2 Concept Embedding Learning "Since the identifier ˆV has indicated the location of the textual embedding, we adopt an image encoder Ec to map the images to a compact concept feature vector fc in the textual space. Specifically, fc is the average feature vector of the global features of all input images … To obtain the final textual embeddings of the input prompt, we first obtain the CLIP [28] Text embeddings cs of the modified prompt cs = CLIP(ps), then replace the embedding of identifier ˆV with the concept feature fc to obtain the concept injected textual embedding c")

	Claim Rejections - 35 USC § 103
5.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

6.	Claims 5, 7 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shi as applied to claim 1 above, and further in view of eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers (Yogesh et al, hereinafter Yogesh).
Regarding claim 5, Shi teaches the method of claim 1, but fails to explicitly teach generating the synthetic image comprises: obtaining a noise map; and denoising the noise map based on the object embedding. In related field of endeavor, Yogesh teaches generating the synthetic image comprises: obtaining a noise map; and denoising the noise map based on the object embedding (Figure 2 Description: “Synthesis in diffusion models corresponds to an iterative denoising process that gradually generates images from random noise; a corresponding stochastic process is visualized for a one dimensional distribution. Usually, the same denoiser neural network is used throughout the entire denoising process.”)
It would have been obvious to one of ordinary skill in the art to have modified Shi to include generating a synthetic image comprises: obtaining a noise map and denoising the noise map based on the object embedding as taught by Yogesh. Doing so would allow synthetic images to be generated by iteratively denoising random noise (Figure 2 Description: “Synthesis in diffusion models corresponds to an iterative denoising process that gradually generates images from random noise”).

 	Regarding claim 7, Shi teaches the method of claim 1, but fails to explicitly teach obtaining an additional reference image depicting the scene; and encoding the additional reference image to obtain a reference embedding, wherein the synthetic image is generated based on the reference embedding. In related field of endeavor, Yogesh teaches obtaining an additional reference image depicting the scene; and encoding the additional reference image to obtain a reference embedding, wherein the synthetic image is generated based on the reference embedding (Figure 5 Description: “eDiff-I also allows the user to optionally provide an additional CLIP image embedding. This can enable detailed stylistic control over the output”).
	It would have been obvious to one of ordinary skill in the art to have modified Shi to include obtaining an additional reference image depicting the scene; and encoding the additional reference image to obtain a reference embedding, wherein the synthetic image is generated based on the reference embedding as taught by Yogesh. Doing so would provide additional control over the styling of the output (Figure 5 Description: “This can enable detailed stylistic control over the output”).

	Regarding claim 20, the apparatus claim 20 is similar in scope to claim 7 and is rejected under the same rationale.

7.	Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Shi as applied to claim 8 above, and further in view of US 20240157114 A1 (Yuan et al, Hereinafter Yuan).
	Regarding claim 10, Shi teaches the method of claim 8, but fails to explicitly teach the size parameter indicates a target scale of the object relative to the reference image. In related field of endeavor, Yuan teaches a size parameter indicates a target scale of the object relative to the reference image (Par 100 “a scale parameter indicative of a scale of the at least one first element in a 3D frame of reference in which the subject is positioned”).
	It would have been obvious to one of ordinary skill in the art to have modified Shi to include a size parameter indicates a target scale of the object relative to the reference image as taught by Yuan. Doing so would allow the size of an element of to be adjusted (Par 100 “a scale parameter indicative of a scale of the at least one first element in a 3D frame of reference in which the subject is positioned”).

Conclusion
8.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHN PATRICK GOCO whose telephone number is (571)272-5872. The examiner can normally be reached M-Th, 7:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jason Chan can be reached at (571) 272-3022. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JOHN P GOCO/Examiner, Art Unit 2619                                                                                                                                                                                                        
/JASON CHAN/Supervisory Patent Examiner, Art Unit 2619
Read full office action
SCORE BASED FINE-GRAINED CONTROL OF CONCEPT GENERATION

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

Strategy Recommendation AI-generated — please review before filing

Prosecution Projections

Ready to respond to this office action?

SCORE BASED FINE-GRAINED CONTROL OF CONCEPT GENERATION

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

Strategy Recommendation AI-generated — please review before filing

Prosecution Projections

Ready to respond to this office action?

Sign in with your work email