Prosecution Insights
Last updated: April 19, 2026
Application No. 18/614,405

FINE-LEVEL TEXT CONTROL FOR IMAGE GENERATION

Status: Non-Final Office Action (§103)
Filed: Mar 22, 2024
Examiner: NGUYEN, DAVID VAN
Art Unit: 2617
Tech Center: 2600 — Communications
Assignee: Samsung Electronics Co., Ltd.
OA Round: 1 (Non-Final)
Grant Probability: Favorable
Expected OA Rounds: 1-2
Estimated Time to Grant: 3y 3m

Examiner Intelligence

Career Allow Rate: 0% (0 granted / 0 resolved; -62.0% vs. Tech Center average)
Interview Lift: +0.0% (minimal lift in resolved cases with an interview)
Avg Prosecution (typical timeline): 3y 3m
Total Applications: 14 across all art units (14 currently pending)

Statute-Specific Performance

§101: 10.7% (-29.3% vs. TC avg)
§103: 78.6% (+38.6% vs. TC avg)
§102: 10.7% (-29.3% vs. TC avg)
Black line = Tech Center average estimate. Based on career data from 0 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4, 6-11, 13, and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Not4Talent (“Complex INTERACTIONS with MULTIPLE characters | Stable Diffusion”) in view of Shi et al. (US 20240355022 A1) and Joachim (US 20240378832 A1), hereinafter Not4Talent, Shi, and Joachim respectively.

[Image omitted: media_image1.png]

Regarding claim 1, Not4Talent teaches a method, performed by at least one processor of an electronic device, the method comprising:

obtaining a geometric identifier for a target image (Not4Talent creates a target image that depicts the geometric identifier. The first picture below shows the pose image created based on an image of three women outside on a walk. However, a photo image is not necessary, and a target image with geometric identifiers can be created by the user, as seen in the second image below; Timestamp 2:37-2:44);

[Images omitted: media_image2.png, media_image3.png, media_image4.png]

obtaining a description of a scene of the target image (Not4Talent creates an example text prompt, of a couple walking through a busy street, with the second picture clearly showing the general description of the scene; Timestamp 3:30-3:45);

[Image omitted: media_image5.png]

parsing the geometric identifier and the description of the scene to obtain a plurality of instances (Not4Talent highlights his skeleton of the poses for the characters and a general prompt to describe the scene; Timestamp 4:52-5:11);

[Images omitted: media_image6.png, media_image7.png]

[NOTE: When Not4Talent clicks the generate button, the software parses the geometric identifier (skeleton poses in the bottom left) and the description of the scene (“couple walking through a busy street” as shown in the top text prompt). The cartoon characters represent the plurality of instances that are modified by individual text prompts from the user.];

for each instance, obtaining a two-dimensional skeleton map (Not4Talent sections off two sides of the target image to map out both skeleton poses; Timestamp 3:17-3:25), an occupancy map (Not4Talent modifies the target image to segment the individual poses. He uses a marker to draw over the area that the pose occupies and uses a different color to distinguish between the two poses; Timestamp 2:41-3:10), and a prompt specific to the instance (Not4Talent writes a different prompt to modify each pose individually. The left instance is a cartoon male from a TV show, and the right instance is a cartoon female from a different TV show);

[Images omitted: media_image8.png, media_image9.png]

[NOTE: The two skeleton poses can be modified by the individual prompts based on the mask that is applied to each pose. For example, the user can prompt for the left character to raise their hand, and the generated image will show the left character raising their hand while the right character does not.]

However, Not4Talent does not teach: for each instance, obtaining a copied noise image; obtaining an intermediate image based on the copied noise image; denoising the intermediate image; generating the target image based on the denoised intermediate image; and controlling a display to output the generated target image.

However, Shi teaches, for each instance, obtaining a copied noise image and obtaining an intermediate image based on the copied noise image (“noise component 350 generates a noise map based on the original image 115 and a mask” – Par. 64, Lines 1-3); denoising the intermediate image (“During the reverse diffusion process 610, the model begins with noisy data x.sub.T, such as a noisy image 615 and denoises the data to obtain the p(x.sub.t-1|x.sub.t). At each step t−1, the reverse diffusion process 610 takes x.sub.t, such as first intermediate image 620, and t as input” – Par. 96, Lines 2-6 [NOTE: the denoising step of the reverse diffusion process takes the intermediate image as input]); and generating the target image based on the denoised intermediate image (“The output image can be generated based on the guidance embedding using an image generation model, where the output image depicts the subject and the input description.” – Par. 10, Lines 3-6, Fig. 5 [NOTE: the guidance embedding is generated in the diffusion process, where the denoising of intermediate images occurs]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Not4Talent’s obtaining of a skeleton map and occupancy map with Shi’s noise image for each instance in the target image, to further obtain the intermediate image associated with these elements, to denoise it, and to generate the target image from the denoised intermediate image. This combination would be useful so that the masking of the objects can clearly define what the instances are by separating background and object (such as a human pose). [NOTE: After the combination, the obtaining of the copied noise image and the obtaining of the intermediate image based on the copied noise image, as taught by Shi, can be included with Not4Talent’s teaching of obtaining a two-dimensional skeleton map, an occupancy map, and a prompt specific to the instance, such that the intermediate image is obtained based on the two-dimensional skeleton map, the occupancy map, and the instance-specific prompt.]

Not4Talent in view of Shi still does not teach controlling a display to output the generated target image. However, Joachim teaches controlling a display to output the generated target image (“The server(s) 102 then provides the modified digital images to the client device 110n for display” – Par. 127, Lines 8-10). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to modify Not4Talent to incorporate Joachim’s teaching of displaying the modified image on the client device. It would be a logical step to display the result of the newly generated image on a device after processing the geometric identifier and text prompt.

Regarding claim 10, the claim describes an electronic device performing the method of claim 1. Therefore, electronic device claim 10 corresponds to the method disclosed in claim 1 and is rejected for the same reasons of obviousness as used above.
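For orientation only, the reverse-diffusion loop that the rejection quotes from Shi (begin with noisy data x_T; at each step, take the intermediate image x_t and t as input) can be illustrated with a generic toy sketch. This is not part of the Office Action record and is not Shi's actual model: the noise predictor and the step update rule below are placeholder assumptions.

```python
import numpy as np

def denoise_step(x_t, t, predict_noise):
    # One reverse-diffusion step: subtract a fraction of the predicted
    # noise (toy update rule; real models use a learned noise schedule).
    return x_t - predict_noise(x_t, t) / (t + 1)

def reverse_diffusion(noise_image, num_steps, predict_noise):
    # Start from a copy of the noise image (the "copied noise image"),
    # then run a fixed number of denoising steps; each x is an
    # intermediate image passed, together with t, into the next step.
    x = noise_image.copy()
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t, predict_noise)
    return x

# With a zero noise predictor, the image passes through unchanged.
rng = np.random.default_rng(0)
noise = rng.normal(size=(4, 4))
out = reverse_diffusion(noise, num_steps=10,
                        predict_noise=lambda x, t: np.zeros_like(x))
```

The fixed `num_steps` loop is also the sense in which a "predetermined number" of repetitions (claim 5) controls the process.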
Regarding claim 19, the claim describes a non-transitory computer readable memory (CRM) performing the method of claim 1. Therefore, CRM claim 19 corresponds to the method disclosed in claim 1 and is rejected for the same reasons of obviousness as used above.

Regarding claim 2, Not4Talent in view of Shi and Joachim teaches the method of claim 1. Not4Talent further teaches wherein the geometric identifier is a pose input (Not4Talent uses a target image of multiple human poses as input; Timestamp 2:37-2:44). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to modify Not4Talent to further incorporate Joachim’s teaching to have the geometric identifier be a pose input. It is common in the field of image generation to modify or transform pictures of human poses based on a text prompt. Combining the pose with the skeleton map, occupancy map, noise image, and prompt allows for a clear segmentation of the poses of the different instances, ensuring that the text prompts modify their respective instances.

Regarding claim 11, the claim describes an electronic device that performs the method of claim 2. Therefore, electronic device claim 11 corresponds to the method disclosed in claim 2 and is rejected for the same reasons of obviousness as used above.

Regarding claim 20, the claim describes a computer readable medium (CRM) that performs the method of claim 2. Therefore, CRM claim 20 corresponds to the method disclosed in claim 2 and is rejected for the same reasons of obviousness as used above.

Regarding claim 4, Not4Talent in view of Shi and Joachim teaches the method of claim 1. Not4Talent does not teach wherein the denoising the intermediate image comprises a reverse diffusion process. However, Shi further teaches wherein the denoising the intermediate image comprises a reverse diffusion process (“During the reverse diffusion process 610, the model begins with noisy data x.sub.T, such as a noisy image 615 and denoises the data to obtain the p(x.sub.t-1|x.sub.t). At each step t−1, the reverse diffusion process 610 takes x.sub.t, such as first intermediate image 620, and t as input” – Par. 96, Lines 2-6). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to modify Not4Talent to further incorporate the teachings of Shi to use reverse diffusion to perform the denoising of the intermediate image. Reverse diffusion is a widely known process for iteratively removing noise from an image using a neural network to predict and subtract noise. Specifically using it on the intermediate images offers more control and helps guide the step-by-step process toward the content of the input text prompt.

Regarding claim 13, the claim describes an electronic device performing the method of claim 4. Therefore, electronic device claim 13 corresponds to the method disclosed in claim 4 and is rejected for the same reasons of obviousness as used above.

Regarding claim 7, Not4Talent in view of Shi and Joachim teaches the method of claim 1. Not4Talent does not teach wherein the obtaining the occupancy map comprises dilating the two-dimensional skeleton map. However, Joachim further teaches wherein the obtaining the occupancy map comprises dilating the two-dimensional skeleton map (“Thus, the scene-based image editing system 106 dilates (e.g., expands) the object mask of an object to avoid associated artifacts when removing the object. Dilating objects masks, however, presents the risk of removing portions of other objects portrayed in the digital image.” – Par. 496, Lines 1-5 [NOTE: The object masks work in a similar way to occupancy maps, which separate the poses in the target image so that modifications to one do not affect the other. These masks are applied over the 2D poses in the target image.]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to modify Not4Talent to further incorporate the teachings of Joachim to obtain the occupancy map by dilating the 2D skeleton map. Dilating the poses in the target image is a common technique that increases the area considered to be the pose relative to the background. The occupancy map or segmentation mask can then define each instance in the target image with respect to the increased area of the poses in the skeleton map.

Regarding claim 16, the claim describes an electronic device performing the method of claim 7. Therefore, electronic device claim 16 corresponds to the method disclosed in claim 7 and is rejected for the same reasons of obviousness as used above.

Regarding claim 8, Not4Talent in view of Shi and Joachim teaches the method of claim 1. Not4Talent does not teach wherein the generating the target image comprises providing the target image in at least one of smart glasses, a mobile application, or a fitness tracking apparatus. However, Joachim further teaches this limitation (“In one or more embodiments, the client devices 110a-110n include computing devices that access, view, modify, store, and/or provide, for display, digital images. For example, the client devices 110a-110n include smartphones, tablets, desktop computers, laptop computers, head-mounted-display devices, or other electronic devices.” – Par. 125, Lines 1-6 [NOTE: specifically, smartphones providing a mobile application]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to modify Not4Talent to further incorporate the teachings of Joachim to generate the target image in a mobile application. Generating the new target image on some form of electronic device would be a logical step in order to display the result of the image generation process.

Regarding claim 17, the claim describes an electronic device performing the method of claim 8. Therefore, electronic device claim 17 corresponds to the method disclosed in claim 8 and is rejected for the same reasons of obviousness as used above.

Regarding claim 9, Not4Talent in view of Shi and Joachim teaches the method of claim 1. Not4Talent does not teach controlling a signal to output the generated target image to at least one of smart glasses, a mobile device, or a fitness tracking apparatus. However, Shi further teaches this limitation (“According to some aspects, I/O interface 1240 is controlled by an I/O controller to manage input and output signals for computing device 1200” – Par. 175, Lines 1-3 [NOTE: computing device 1200 can be a mobile device, as disclosed in Par. 35]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to modify Not4Talent in view of Joachim to further incorporate the teachings of Shi to control a signal that outputs the generated target image to a mobile device. Combining Joachim’s outputting of the generated image to a mobile application with Shi’s I/O controller is a very common step when displaying the generated image on a device. Rather than automatically displaying the results, one of ordinary skill in the art could control when the output is displayed by having the system wait for a signal before displaying the generated image.
Regarding claim 18, the claim describes an electronic device performing the method of claim 9. Therefore, electronic device claim 18 corresponds to the method disclosed in claim 9 and is rejected for the same reasons of obviousness as used above.

Claims 3 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Not4Talent in view of Shi, Joachim, and Albrecht (US 9528847 B2), hereinafter Albrecht.

Regarding claim 3, Not4Talent in view of Shi and Joachim teaches the method of claim 1. Not4Talent does not teach wherein the geometric identifier is a sketch input. However, Albrecht teaches wherein the geometric identifier is a sketch input (“In one embodiment, the tools and techniques can include receiving a graphical sketch (such as receiving such a sketch from a user input at a computing device or receiving such a sketch from another computing environment where the sketch was provided as user input), the sketch including one or more representations of text.” – Col. 1, Lines 38-43). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to modify Not4Talent in view of Shi and Joachim to incorporate the teachings of Albrecht to have the geometric identifier be a sketch input. It is common in the art for a user to input a basic sketch marking the position at which to generate the image and have the image generator produce the image based on a text prompt and the sketch.

Regarding claim 12, the claim describes an electronic device performing the method of claim 3. Therefore, electronic device claim 12 corresponds to the method disclosed in claim 3 and is rejected for the same reasons of obviousness as used above.

Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Not4Talent in view of Shi, Joachim, and Saharia et al. (US 20230067841 A1), hereinafter Saharia.

Regarding claim 5, Not4Talent in view of Shi and Joachim teaches the method of claim 4. Not4Talent does not teach wherein the reverse diffusion process comprises repeating a diffusion process a predetermined number of times. However, Saharia teaches wherein the reverse diffusion process comprises repeating a diffusion process a predetermined number of times (“SR3 can generate high resolution images, e.g., 1024×1024, but with a constant number of refinement steps (often no more than 100). SR3 uses a series of reverse diffusion steps to transform a Gaussian distribution to an image distribution while flows require a deep and invertible network.” – Par. 38, Lines 13-18 [NOTE: Saharia highlights that there is a constant number of refinement steps, not exceeding 100. This implies control over how many times the reverse diffusion is performed.]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to modify Not4Talent to incorporate the teachings of Saharia to have the reverse diffusion process repeat a predetermined number of times. It is a very common procedure in machine learning to repeat training multiple times by setting the number of epochs (iterations). Setting a predetermined number of iterations allows the user to balance the accuracy of the model against the utilization of computing resources.

Regarding claim 14, the claim describes an electronic device performing the method of claim 5. Therefore, electronic device claim 14 corresponds to the method disclosed in claim 5 and is rejected for the same reasons of obviousness as used above.

Allowable Subject Matter

Claims 6 and 15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Regarding claim 6, Not4Talent in view of Shi and Joachim teaches the method of claim 1. However, Not4Talent does not teach wherein the denoising the intermediate image comprises: for each denoising step, copying a latent embedding in a Unet feature level; obtaining pose embeddings; and obtaining a batch-wise sum based on the Unet feature level and the pose embeddings. None of the prior art searched, alone or in combination, renders obvious the limitations of claim 6.

Regarding claim 15, the claim describes an electronic device performing the method of claim 6. Therefore, electronic device claim 15 corresponds to the method disclosed in claim 6 and would be allowable for the same reasons as stated above.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID V. NGUYEN, whose telephone number is (571) 272-6111. The examiner can normally be reached M-F 7:30-5:00. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Y Poon, can be reached at 571-270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DAVID VAN NGUYEN/
Examiner, Art Unit 2617

/KING Y POON/
Supervisory Patent Examiner, Art Unit 2617
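As an illustrative footnote to the claim 7 rejection above (not part of the Office Action record), the occupancy-map-by-dilation idea can be sketched in plain NumPy: a thin rasterized skeleton mask is expanded by binary dilation so that the occupancy map covers the area around each pose. The function names, the 3x3 structuring element, and the `radius` parameter are assumptions made for this sketch only.

```python
import numpy as np

def dilate(mask, iterations=1):
    # Binary dilation with a 3x3 structuring element, in plain NumPy:
    # a pixel is set if any pixel in its 3x3 neighborhood was set.
    m = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(m, 1)                      # pad border with False
        m = (p[1:-1, 1:-1]
             | p[:-2, 1:-1] | p[2:, 1:-1]     # up / down neighbors
             | p[1:-1, :-2] | p[1:-1, 2:]     # left / right neighbors
             | p[:-2, :-2] | p[:-2, 2:]       # diagonal neighbors
             | p[2:, :-2] | p[2:, 2:])
    return m

def occupancy_map(skeleton_map, radius=3):
    # Expand a thin 2D skeleton rasterization into an occupancy mask
    # covering the area the pose occupies.
    return dilate(np.asarray(skeleton_map) > 0, iterations=radius)

# A single skeleton pixel grows into a filled 3x3 neighborhood.
skeleton = np.zeros((7, 7))
skeleton[3, 3] = 1
occ = occupancy_map(skeleton, radius=1)
```

Running per-instance dilations with different instance labels would give the color-coded, non-overlapping masks the rejection describes.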

Prosecution Timeline

Mar 22, 2024: Application Filed
Jan 14, 2026: Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12573160: INTIMACY-BASED MASKING OF THREE DIMENSIONAL (3D) FACE LANDMARKS
Granted Mar 10, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on the 1 most recent grant.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: Favorable
Median Time to Grant: 3y 3m
PTA Risk: Low
Based on 0 resolved cases by this examiner. Grant probability derived from career allow rate.
