DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The title of the invention is not descriptive. The title “computer-implemented systems and methods for image generation” covers any general computer system that generates an image. For example, a general PC/laptop equipped with a Microsoft operating system is capable of generating a desktop image, and a general PC/laptop running the Microsoft Word application is capable of generating an image of a Word document. A new title is required that is clearly indicative of the invention to which the claims are directed.
Response to Arguments
Applicant's arguments filed 11/11/2025 have been fully considered but they are not persuasive.
In the Remarks, applicant argued in essence against Stannus's limited disclosures at FIGS. 20-21 without considering the overall disclosure. Applicant argued that Stannus does not disclose generation of the output image by a machine learning model blending a scene image and an object image. Applicant's argument is unfounded.
Stannus teaches that a style transfer unit 203, implemented as a neural network (a machine learning model), generates an output image by blending one or more mask object images (meeting the claimed object image) with an input image (meeting the claimed scene image).
Applicant also argued with respect to identifying at least one area of the output image that is similar to a corresponding area of the scene image. The claim limitation is subject to the broadest reasonable interpretation consistent with applicant's specification.
Stannus teaches at Paragraph 0148 that “image data in which a dog is reflected” corresponds to an output image wherein the dog region of the output image is similar to the dog region of the input image. Stannus teaches at Paragraph 0053 that the style transfer unit 203 performs a style transfer on the region in the image, and the image referred to there includes a composite image (output image). The region in the composite image (output image) is similar to the region of the input image before the style transfer.
Moreover, FIGS. 7-10 show separately applying style A and style B to an input image to generate two separate output images, and Stannus teaches at Paragraph 0069 that the optimization function is defined based on the plurality of style images. The optimization requires successively generating output images p to be optimized (see Paragraph 0081, Paragraph 0090-0092, and the iterative optimization of Paragraph 0095-0104 for successively generating the output images p), and the similarity of the output image and the input image in the optimization is formulated by minimizing an L2 loss function (L2 similarity).
The similarity between the output image and the input image is described in formula 2 of Paragraph 0080.
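For context, the content term of such an optimization is conventionally a squared-error (L2) loss over the feature responses of the output image p and the content image c. The following is a minimal sketch of the standard formulation used in Gatys-style neural style transfer, offered as an illustration of the cited L2 similarity rather than a verbatim reproduction of Stannus's formula 2:

    \mathcal{L}_{\mathrm{content}}(p, c) = \tfrac{1}{2} \sum_{i,j} \big( F^{p}_{ij} - F^{c}_{ij} \big)^{2}

where F^p and F^c denote the feature responses of the output image p and the content image c at a given layer of the neural network. Minimizing this loss drives each region of the output image toward the content of the corresponding region of the input image.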
For example, Stannus teaches at Paragraph 0144-0150 that the style application range (the region) of the output image is similar to the corresponding region of the input image. For example, after applying style C to the input image, a first output image is generated in which the non-dog region is style-transferred to style C; the dog region in the first output image thus remains unchanged, and the dog region of the first output image is similar to the dog region of the input image.
Subsequently, when style D is further applied to the first output image (now serving as an input image), the dog region of the second output image is style-transferred with style D, while the non-dog region of the second output image remains similar to the non-dog region of the first output image. Stannus teaches at Paragraph 0148-0150 that the dog region of an intermediate output image is similar to the dog region of the input image.
Applicant repeatedly argued that there is no identification of an area of an output image, but that argument rests on applicant's own misinterpretation of Stannus's teaching in relation to the claimed invention.
Stannus further teaches at FIGS. 20-22 and Paragraph 0184 that the style transfer can be further applied to the composite image (output image) after demarcating one or more regions of the composite image (e.g., demarcating a dog region of the composite image that is similar to the corresponding dog region of the input image, such that a dog is reflected in both the output image and the input image; see Paragraph 0146 and Paragraph 0175, wherein the output image reflects the scenery and objects of the real world, including a dog).
Applicant failed to acknowledge that Stannus teaches at Paragraph 0061-0062 that the region defining unit 202 may demarcate either a portion closer than the tower or a portion farther than the tower as a region, and that the demarcated region may include or exclude the tower.
Stannus teaches at Paragraph 0062 that the region defining unit 202 demarcates the portion of the building reflected in the image as a region. The region to which the style transfer is applied may be the portion of the building in the image. On the other hand, the region to which the style transfer is applied may be the region of the image excluding the building portion.
Stannus thus teaches that the region defining unit 202 demarcates both the region for style transfer and the exclusion region not for style transfer.
Stannus further teaches at FIGS. 20-22 and Paragraph 0184 that the style transfer can be further applied to the composite image (output image) after demarcating one or more regions of the composite image (e.g., demarcating a dog region of the composite image that is similar to the corresponding dog region of the input image). Even after demarcation of one or more regions in the composite image of FIG. 21, such as the front region of the tower object, for style transfer, the demarcated region of the composite image corresponds to the demarcation boundary (meeting the claimed edge of the object image), and the front region of the tower object in the composite image of FIG. 21 is still similar to the front region of the tower object in the input scene image of FIG. 20. Accordingly, the claimed identifying step is met by Stannus.
Moreover, the front region of the tower object, once demarcated, is further modified by style transfer to the composite image to replace the person objects in the front region of the output image with the person objects in the front region of the input image.
The claimed at least one area corresponds to one of the region and the exclusion region defined by the region defining unit 202 of Stannus. For example, the region defined by the region defining unit 202 may correspond to a dog object or a tower object subject to style transfer, while the exclusion region is not subject to style transfer and is maintained.
When the at least one area is mapped to the region, Stannus meets the claim limitation of identifying at least one region of a dog object or a tower object for style transfer that corresponds to the demarcation shape boundary of the dog object or tower object and is similar to a corresponding area (dog object or tower object) of the input image, using the content optimization function of the output image p and the input content image c. Stannus teaches modifying the region of the output image to replace visual elements of the dog object with visual elements of the dog object of the input scene image.
When the at least one area includes both the region of the dog object or tower object and the exclusion region (the region other than the dog object, or the exclusion region in front of the tower object), Stannus meets the claim limitation of identifying at least one area, including the region for style transfer and the exclusion region not for style transfer, that still corresponds to the demarcation shape boundary of the dog object or tower object and is similar in content to the corresponding area of the input scene image. Stannus still teaches modifying the output image in that at least one area to replace visual elements of the output image with the visual elements of the input image.
When the at least one area is mapped to the exclusion region (the region other than the dog object, or the exclusion region in front of the tower object), Stannus meets the claim limitation of identifying at least one exclusion region not for style transfer that still corresponds to the demarcation shape boundary of the dog object or tower object and is similar in content to the corresponding area of the input scene image. Stannus still teaches modifying the output image in the identified at least one area (e.g., the exclusion region in the composite image) to replace visual elements of the exclusion region of the output image with the visual elements of the exclusion region of the input image, and thereafter applying a style transfer to the exclusion region to modify the output image in that region (see the repeated iterative style transfer procedure of FIGS. 12-13 in relation to FIGS. 20-22, where a further style transfer is applied to a region including the exclusion region in the composite image of FIG. 21).
In one embodiment, the claimed at least one area is mapped to Stannus's exclusion region (e.g., the front side of the tower, which may be demarcated by the region defining unit 202). For example, Stannus teaches that the region defining unit 202 demarcates a region on a composite image (the output image). Stannus teaches at FIG. 22 and Paragraph 0053 that the region defining unit 202 demarcates one or more regions from the image based on the estimated distance (St12), and the style transfer unit 203 performs a style transfer on the region in the image (St13). The image referred to there includes a composite image, which is described with reference to FIG. 22.
Accordingly, the region defining unit 202 identifies a region of the output image for style transfer, while the other region, excluding the region for style transfer, is maintained similar to the input scene image. In other words, the exclusion region is also defined by the region defining unit 202 via the demarcation shape: it corresponds to the demarcation shape and is similar to, or exactly the same as, the content of the corresponding region in the input scene image. When the region defining unit 202 demarcates a dog object or a tower object, the exclusion region outside the shape boundary of the dog object or tower object still corresponds to the demarcation boundary of the dog object or tower object and is maintained similar to the content of the input scene image.
It is known from the style transfer of FIGS. 12-13 that the region for style transfer corresponds to the shape boundary of the mask image, such as the shape boundary of the dog object or the shape boundary of the tower object. The background of the tower object and the front of the tower object, including the two person objects, are maintained with the same content as the input image. Accordingly, the front region of the tower object, including the two person objects, corresponds to the shape boundary of the tower object (i.e., the outside exclusion region still corresponds to the shape boundary of the region demarcated by the region defining unit 202, in the sense that the region defining unit 202 defines the region as well as the exclusion region) and to the background of the tower object, and is similar in content to the input image. Moreover, the portion of the first building and/or the second building is maintained the same as the content in the input image and corresponds to the shape boundary of the region demarcated by the region defining unit 202; i.e., the region defining unit 202 defines the region as well as an exclusion region using the shape boundary of the demarcated region (see Paragraph 0062).
In another embodiment, the claimed area is mapped to the region identified by the style transfer based on the content loss minimization of the content optimization function of the output image p and the content image c.
In Stannus, the dog or tower region of the output image for style transfer corresponds to an edge (shape) of an object shape of a mask image or an inverted mask image and is similar to the corresponding dog or tower region of the input image, as shown in FIGS. 12-13 and 20-21. The identification is performed using the content optimization function disclosed in Paragraph 0080-0085, which minimizes an L2 distance between the output image p and the content image c such that the content of the output image from the machine learning model is similar to the content of the input scene image c.
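For illustration only, the following is a minimal sketch of such an iterative L2 content optimization; the function and variable names are hypothetical and, for brevity, the loss is computed over raw pixels rather than over neural-network feature responses:

    import numpy as np

    def content_loss(p, c):
        # Squared-error (L2) similarity between output image p and content image c.
        return 0.5 * np.sum((p - c) ** 2)

    def optimize_output(c, steps=200, lr=0.1):
        # Start from noise and iteratively update the output image p so that
        # its content approaches the content image c (loss minimization).
        p = np.random.rand(*c.shape)
        for _ in range(steps):
            grad = p - c          # gradient of 0.5 * ||p - c||^2 with respect to p
            p = p - lr * grad
        return p

    c = np.random.rand(32, 32, 3)   # stand-in content (scene) image
    p = optimize_output(c)
    print(content_loss(p, c))       # decreases toward 0 over the iterations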
For example, Stannus teaches at Paragraph 0062 that the region for the style transfer is the region demarcated by the region defining unit 202 or a region other than the region demarcated by the region defining unit 202. Accordingly, the region defining unit 202 identifies a region of an output image for style transfer that corresponds to an edge (boundary) of the shape of the region defined by the region defining unit 202 and is similar to the region of the content image. Stannus teaches at Paragraph 0062 that the region defining unit 202 demarcates the portion of the building reflected in the image as a region. The region to which the style transfer is applied may be the portion of the building in the image. On the other hand, the region to which the style transfer is applied may be the region of the image excluding the building portion.
Stannus teaches that the region defining unit 202 demarcates a region on a composite image (the output image). Stannus teaches at FIG. 22 and Paragraph 0053 that the region defining unit 202 demarcates one or more regions from the image based on the estimated distance (St12). The style transfer unit 203 performs a style transfer on the region in the image (St13). The image referred to here includes a composite image, which will be described later with reference to FIG. 22.
The region of the output image identified for style transfer meets the claimed area of the output image (see FIGS. 13-14, FIGS. 20-21, and Paragraph 0169-0174), and the identification is performed by using a mask object image as an input to the style transfer unit 203 (a machine learning model). The region for style transfer is subject to processing by the machine learning model, as evidenced in Paragraph 0080-0085: the region of the output image p that is similar to the region of the input image c is identified using the content optimization function of the output image p and the input image c for style transfer over the identified region.
Moreover, Stannus teaches at FIGS. 13-14 and 20-21 that the dog object or the tower object in the output image is identified by the optimization function as similar to the dog object or the tower object in the input image, as characterized by the loss function at each activation node of the machine learning model. The content optimization requires the minimization of a loss function of the output image p and the input image c (see Paragraph 0077) by training a machine learning model. The loss function characterizes the similarity of the output image p and the input image c with respect to the corresponding demarcated region for style transfer.
Stannus teaches that the area of the output image subject to style transfer is identified by the mask image (see FIGS. 13-14 and Paragraph 0169-0174), where the style transfer is applied to an object (e.g., a dog object or a tower object) by identifying the dog object or tower object of the output image for style transfer. The area for style transfer corresponds to an edge (shape) of the mask image or an inverted mask image (e.g., the dog mask, the tower mask, an inverted dog mask, or an inverted tower mask) and is similar to a corresponding area (dog area or tower area) of the input image. Applicant also failed to recognize that the mask image includes both the shape of the region for style transfer and the shape of the region other than the region for style transfer. The mask image thus identifies a region for style transfer as well as the region outside it.
Applicant argued in essence with respect to Stannus's embodiments at FIGS. 20-21 and Paragraph 0169-0174 with the allegation that “this style transfer does not disclose a blending of a scene image and an object image”. Applicant failed to acknowledge that the style transfer requires the blending of the input image (scene image) and a mask object image, where the style transfer is applied to the tower T in the same manner as the style transfer disclosed in FIGS. 13-14 of Stannus.
Stannus teaches at FIGS. 20-21 and Paragraph 0169-0174 that the style transfer requires the blending of the input image (scene image) and a mask object image, where the style transfer is applied to the tower T using a mask object image in the same manner as in the style transfer disclosed in FIGS. 13-14 of Stannus. The one or more demarcated regions, including the portion of the tower T and the portion of the background of the tower T, constitute a mask object. Stannus clearly teaches at Paragraph 0173 that the style transfer unit 203 generates a mask corresponding to the shape of the defined region and uses the generated mask (the mask object image) to apply style transfer to the region in the image.
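As a minimal sketch of this kind of mask-gated blending (the function and variable names below are hypothetical and not drawn from Stannus), a binary mask selects the stylized pixels inside the demarcated region while the excluded region retains the input image's pixels:

    import numpy as np

    def apply_masked_style_transfer(image, stylized, mask):
        # Pixels where mask == 1 take the stylized values; pixels where
        # mask == 0 retain the original image, so the excluded region stays
        # identical (and therefore similar) to the input image.
        m = mask[..., np.newaxis]               # broadcast over color channels
        return m * stylized + (1.0 - m) * image

    h, w = 64, 64
    scene = np.random.rand(h, w, 3)             # stand-in scene image
    stylized = np.random.rand(h, w, 3)          # stand-in style-transferred image
    mask = np.zeros((h, w))
    mask[8:56, 8:56] = 1.0                      # demarcated region for style transfer
    output = apply_masked_style_transfer(scene, stylized, mask)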
Applicant failed to consult Paragraph 0174, which states that the region corresponding to the front side of the tower T is not subjected to style transfer and remains as the original image of the real world.
Stannus teaches at Paragraph [0169]: “FIG. 20 is a conceptual diagram illustrating an image before style transfer, according to at least one embodiment of the present disclosure. FIG. 21 is a conceptual diagram illustrating a state in which an image after style transfer is output to a user terminal, according to at least one embodiment of the present disclosure.”
Stannus thus teaches at Paragraph 0173-0174 that the output image is generated by the style transfer unit 203 (a machine learning model) by modifying the output image in the identified region to replace visual elements (e.g., the person objects) in the front side of the tower T of the output image with the visual elements (e.g., the person objects) in the front side of the tower T of the input image, while the style of the tower T and the background of the tower T of the output image are modified from the input image (the scene image).
Stannus teaches at Paragraph [0173]: “In step St13, the style transfer unit 203 performs a style transfer on the region in the image. More specifically, the style transfer unit 203 generates a mask corresponding to the shape of the defined region, and uses the generated mask to apply style transfer to the region in the image.”
Stannus teaches at Paragraph [0174]: “In the output image after the style transfer is applied, the style of the tower T and the background of the tower T are style-transferred based on the style image, as illustrated in FIG. 21. On the other hand, in the output image, the region corresponding to the front side of the tower T is not subjected to style transfer and remains as the original image of the real world.”
Applicant also completely ignored other paragraphs of Stannus.
For example, Stannus teaches at Paragraph 0105 that the mask style transfer can be performed only on the selected regions. Stannus teaches at Paragraph 0107 that the style transfer unit 203 (machine learning model) may generate a total of two masks including a mask that prevents style transfer for regions other than the region corresponding to a first building reflected in the image, and a mask that prevents style transfer for regions other than the region corresponding to a second building reflected in the image.
This teaching clearly shows that the output image is generated by the style transfer unit 203 by blending the input image (the scene image) and the object mask image (object image), and that at least one other region of the output image is identified that corresponds to an edge of the mask object (e.g., of a first building or a second building) and is similar to a corresponding region of the input image other than the first building or second building. In response to the identifying, the output image is modified in the identified region other than the first building or second building to replace visual elements of the first building or the second building of the output image with visual elements of the first building or second building of the input image.
Accordingly, the output image generated by the style transfer is modified in the identified regions to replace visual elements (the first and second buildings) of the output image with visual elements (first and second buildings) of the input image.
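For illustration only, a minimal sketch of this two-mask procedure under assumed names and mask shapes (none of which are drawn from Stannus); each mask permits style transfer only within one building's region, so the pixels outside both regions remain those of the input image:

    import numpy as np

    def blend(base, stylized, mask):
        # mask == 1 selects the stylized pixels; mask == 0 keeps the base pixels.
        m = mask[..., np.newaxis]
        return m * stylized + (1.0 - m) * base

    h, w = 64, 64
    scene = np.random.rand(h, w, 3)
    styled_first = np.random.rand(h, w, 3)    # stand-in rendering for the first building
    styled_second = np.random.rand(h, w, 3)   # stand-in rendering for the second building

    # Each mask prevents style transfer everywhere except one building's region.
    mask_first = np.zeros((h, w)); mask_first[10:30, 5:25] = 1.0
    mask_second = np.zeros((h, w)); mask_second[10:40, 35:60] = 1.0

    output = blend(scene, styled_first, mask_first)      # first building stylized
    output = blend(output, styled_second, mask_second)   # second building stylized
    # Pixels outside both building regions still equal the scene image.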
Even if applicant were to argue with respect to particular figures of Stannus, applicant failed to appreciate the overall features of Stannus. Stannus teaches using two mask objects in stages to generate an output image, as evidenced in FIGS. 7-10. An output image generated based on the blending of the input image and one particular mask object would prevent the style transfer unit 203 (the machine learning model) from transferring a style to at least one region, and thereby the visual elements of at least one region of the output image are replaced by the visual elements of at least one region of the input image.
Moreover, even if applicant were to argue against Stannus alone in an obviousness-type rejection, applicant failed to recognize that Stannus still teaches at FIGS. 13-14 that the visual elements (ears, nose, eyes) of the dog in the output image are replaced with the visual elements (eyes, nose, and ears) of the dog in the input image.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-22 are rejected under 35 U.S.C. 103 as being unpatentable over Stannus et al., US-PGPUB No. 2023/0377230 (hereinafter Stannus), in view of Mishra et al., US-PGPUB No. 2025/0037415 (hereinafter Mishra); Joachim, US-PGPUB No. 2024/0378832 (hereinafter Joachim); Smith, US-PGPUB No. 2024/0331247 (hereinafter Smith); and Kim et al., US-PGPUB No. 2024/0169631 (hereinafter Kim).
Re Claim 1:
Stannus implicitly teaches a computer-implemented method for processing a composite image, the method comprising:
receiving an output image and a scene image, wherein the output image was generated by a machine learning model blending of the scene image with an object image (
Stannus teaches at FIG. 13 receiving an input image (meeting the claimed scene image) where the dog is placed in an indoor scene/environment. Stannus further teaches at Paragraph 0065-0069 that the output image was generated by the style transfer unit 203 using a neural network (machine learning model) that blends the input image with an object mask image (see Paragraph 0069 “blending a plurality of styles (object image) to the same portion of the input image”).
For example, Stannus teaches at Paragraph 0105 that the mask style transfer can be performed only on the selected regions. Stannus teaches at Paragraph 0107 that the style transfer unit 203 (machine learning model) may generate a total of two masks including a mask that prevents style transfer for regions other than the region corresponding to a first building reflected in the image, and a mask that prevents style transfer for regions other than the region corresponding to a second building reflected in the image.
This teaching clearly shows that the output image is generated by the style transfer unit 203 by blending the input image (the scene image) and the object mask image (object image), and that at least one other region of the output image is identified that corresponds to an edge of the mask object (e.g., of a first building or a second building) and is similar to a corresponding region of the input image other than the first building or second building. In response to the identifying, the output image is modified in the identified region other than the first building or second building to replace visual elements of the first building or the second building of the output image with visual elements of the first building or second building of the input image.
Stannus teaches at FIGS. 20-21 and Paragraph 0169-0174 that the style transfer requires the blending of the input image (scene image) and a mask object image, where the style transfer is applied to the tower T using a mask object image in the same manner as in the style transfer disclosed in FIGS. 13-14 of Stannus. The one or more demarcated regions, including the portion of the tower T and the portion of the background of the tower T, constitute a mask object. Stannus clearly teaches at Paragraph 0173 that the style transfer unit 203 generates a mask corresponding to the shape of the defined region and uses the generated mask (the mask object image) to apply style transfer to the region in the image.
Applicant failed to consult Paragraph 0174, which states that the region corresponding to the front side of the tower T is not subjected to style transfer and remains as the original image of the real world.
Stannus teaches at Paragraph [0169]: “FIG. 20 is a conceptual diagram illustrating an image before style transfer, according to at least one embodiment of the present disclosure. FIG. 21 is a conceptual diagram illustrating a state in which an image after style transfer is output to a user terminal, according to at least one embodiment of the present disclosure.”
Stannus thus teaches at Paragraph 0173-0174 that the output image is generated by the style transfer unit 203 (a machine learning model) by modifying the output image in the identified region to replace visual elements (e.g., the person objects) in the front side of the tower T of the output image with the visual elements (e.g., the person objects) in the front side of the tower T of the input image, while the style of the tower T and the background of the tower T of the output image are modified from the input image (the scene image).
Stannus teaches at Paragraph [0173]: “In step St13, the style transfer unit 203 performs a style transfer on the region in the image. More specifically, the style transfer unit 203 generates a mask corresponding to the shape of the defined region, and uses the generated mask to apply style transfer to the region in the image.”
Stannus teaches at Paragraph [0174]: “In the output image after the style transfer is applied, the style of the tower T and the background of the tower T are style-transferred based on the style image, as illustrated in FIG. 21. On the other hand, in the output image, the region corresponding to the front side of the tower T is not subjected to style transfer and remains as the original image of the real world.”
Stannus teaches at FIG. 22 and Paragraph 0184 that the image composite unit 206 blends the object OBJ with the image to obtain a composite image.
Moreover, even if applicant were to argue against Stannus alone in an obviousness-type rejection, applicant failed to recognize that Stannus still teaches at FIGS. 13-14 that the visual elements (ears, nose, eyes) of the dog in the output image are replaced with the visual elements (eyes, nose, and ears) of the dog in the input image.
Stannus teaches at FIG. 14 and Paragraph 0145-0148 that the style transfer unit 203 (the machine learning model) blends the mask M4 (object image), obtained by inverting the values of the mask M3, with an input image serving as a scene image, and that the output image is generated based on the blending of the input image with an object image such as mask M3 (having a mask object) corresponding to style C. Stannus teaches at Paragraph 0065 that the style transfer unit 203 may use a neural network for style transfer, and at Paragraph 0069 that the style transfer unit 203 may perform style transfer by blending a plurality of styles (including the mask object) into the same portion of the input image (scene image).
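For illustration only, a minimal sketch of this two-stage use of a mask and its inversion (the names, mask shapes, and region assignments are assumed, not drawn from Stannus): style C is blended outside the dog region through the inverted mask, and style D is then blended inside the dog region through the original mask, consistent with the reading of Paragraph 0149 discussed below:

    import numpy as np

    def blend(base, stylized, mask):
        # mask == 1 selects the stylized pixels; mask == 0 keeps the base pixels.
        m = mask[..., np.newaxis]
        return m * stylized + (1.0 - m) * base

    h, w = 64, 64
    scene = np.random.rand(h, w, 3)       # stand-in input (scene) image
    style_c = np.random.rand(h, w, 3)     # stand-in style-C rendering
    style_d = np.random.rand(h, w, 3)     # stand-in style-D rendering

    m3 = np.zeros((h, w)); m3[16:48, 16:48] = 1.0   # hypothetical dog region
    m4 = 1.0 - m3                                   # inverted mask (M4 from M3)

    # Stage 1: style C applied outside the dog region; the dog region is unchanged.
    first_output = blend(scene, style_c, m4)
    # Stage 2: style D applied inside the dog region of the first output.
    second_output = blend(first_output, style_d, m3)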
Stannus teaches at FIG. 21 and Paragraph 0174-0175 that the object image OBJ is blended with the scene image of FIG. 20.
Stannus teaches at FIGS. 13-14 an input image serving as a scene image and a mask image. The style image having an object, or the input image having a dog object, serves as an object image.
Stannus teaches at FIGS. 13-14 and Paragraph 0143 that the output image is an image in which style transfer is performed on the central region to style A and on the left edge region and the right edge region to style B and at Paragraph 0081-0082 that the generated image is the output image of the neural network.
Stannus teaches at [0103] that the style transfer unit 203 inputs the image data to the first transformation layer of the trained neural network N2 optimized as described above, for example. As a result, style transfer-applied data in which the n style images are nicely blended is output from the second transformation layer of the neural network N2.
Stannus teaches at Paragraph [0106] that the style transfer unit 203 generates a mask corresponding to the shape of the region to which the style transfer is applied. Next, the style transfer unit 203 inputs the image data and the mask to the neural network for style transfer. This allows the masks to be used to apply style transfers based on one or more style images to the image data);
identifying at least one area of the output image that a) corresponds to an edge of the object image or an area of transition between the object image and the scene image and b) is similar to a corresponding area of the scene image (
Stannus teaches that the region defining unit 202 identifies a region of the output image that corresponds to the boundary of the demarcated region and is similar to the corresponding region of the input image. Stannus further teaches at FIGS. 20-22 and Paragraph 0184 that the style transfer can be further applied to the composite image (output image) after demarcating one or more regions of the composite image (e.g., demarcating a dog region of the composite image similar to the corresponding dog region of the input image).
Stannus still teaches at FIGS. 13-14 that the visual elements (ears, nose, eyes) of the dog in the output image are replaced with the visual elements (eyes, nose, and ears) of the dog in the input image.
Stannus teaches at Paragraph 0148 that “image data in which a dog is reflected” corresponds to an output image wherein the dog region of the output image is similar to the dog region of the input image. Stannus teaches at Paragraph 0053 that the style transfer unit 203 performs a style transfer on the region in the image, and the image referred to there includes a composite image (output image). The region in the composite image (output image) is similar to the region of the input image before the style transfer.
Moreover, FIGS. 7-10 show separately applying style A and style B to an input image to generate two separate output images, and Stannus teaches at Paragraph 0069 that the optimization function is defined based on the plurality of style images. The optimization requires successively generating output images p to be optimized (see Paragraph 0081, Paragraph 0090-0092, and the iterative optimization of Paragraph 0095-0104 for successively generating the output images p), and the similarity of the output image p and the input image in the optimization is formulated by minimizing an L2 loss function (L2 similarity).
The similarity between the output image and the input image is described in formula 2 of Paragraph 0080.
For example, Stannus teaches at Paragraph 0144-0150 that the style application range (the region) of the output image is similar to the corresponding region of the input image. For example, after applying style C to the input image, a first output image is generated in which the non-dog region is style-transferred to style C; the dog region in the first output image thus remains unchanged, and the dog region of the output image is similar to the dog region of the input image after applying style C.
Subsequently, when style D is further applied to the first output image (now serving as an input image), the dog region of the second output image is style-transferred with style D, while the non-dog region of the second output image remains similar to the non-dog region of the first output image. Stannus teaches at Paragraph 0148-0150 that the dog region of an intermediate output image is similar to the dog region of the input image (see also the iterative optimization process at Paragraph 0080 for generating the output images p).
In one embodiment, the claimed at least one area is mapped to Stannus's exclusion region (e.g., the front side of the tower, which may be demarcated by the region defining unit 202). For example, Stannus teaches that the region defining unit 202 demarcates a region on a composite image (the output image). Stannus teaches at FIG. 22 and Paragraph 0053 that the region defining unit 202 demarcates one or more regions from the image based on the estimated distance (St12), and the style transfer unit 203 performs a style transfer on the region in the image (St13). The image referred to there includes a composite image, which is described with reference to FIG. 22.
Accordingly, the region defining unit 202 identifies a region of the output image for style transfer, while the other region, excluding the region for style transfer, is maintained similar to the input scene image. In other words, the exclusion region is also defined by the region defining unit 202 via the demarcation shape: it corresponds to the demarcation shape and is similar to, or exactly the same as, the content of the corresponding region in the input scene image. When the region defining unit 202 demarcates a dog object or a tower object, the exclusion region outside the shape boundary of the dog object or tower object still corresponds to the demarcation boundary of the dog object or tower object and is maintained similar to the content of the input scene image.
It is known from the style transfer of FIGS. 12-13 that the region for style transfer corresponds to the shape boundary of the mask image, such as the shape boundary of the dog object or the shape boundary of the tower object. The background of the tower object and the front of the tower object, including the two person objects, are maintained with the same content as the input image. Accordingly, the front region of the tower object, including the two person objects, corresponds to the shape boundary of the tower object (i.e., the outside exclusion region still corresponds to the shape boundary of the region demarcated by the region defining unit 202, in the sense that the region defining unit 202 defines the region as well as the exclusion region) and to the background of the tower object, and is similar in content to the input image. Moreover, the portion of the first building and/or the second building is maintained the same as the content in the input image and corresponds to the shape boundary of the region demarcated by the region defining unit 202; i.e., the region defining unit 202 defines the region as well as an exclusion region using the shape boundary of the demarcated region (see Paragraph 0062).
In another embodiment, the claimed area is mapped to the region identified by the style transfer based on the content loss minimization of the content optimization function of the output image p and the content image c.
In Stannus, the dog or tower region of the output image for style transfer corresponds to an edge (shape) of an object shape of a mask image or an inverted mask image and is similar to the corresponding dog or tower region of the input image, as shown in FIGS. 12-13 and 20-21. The identification is performed using the content optimization function disclosed in Paragraph 0080-0085, which minimizes an L2 distance between the output image p and the content image c such that the content of the output image from the machine learning model is similar to the content of the input scene image c.
The region of the output image identified for style transfer meets the claimed area of the output image (see FIGS. 13-14, FIGS. 20-21, and Paragraph 0169-0174), and the identification is performed by using a mask object image as an input to the style transfer unit 203 (a machine learning model). The region for style transfer is subject to processing by the machine learning model, as evidenced in Paragraph 0080-0085: the region of the output image p that is similar to the region of the input image c is identified using the content optimization function of the output image p and the content image c for style transfer over the identified region.
Moreover, Stannus teaches at FIGS. 13-14 and 20-21 that the dog object or the tower object in the output image is identified by the optimization function as similar to the dog object or the tower object in the input image, as characterized by the loss function at each activation node of the machine learning model. The content optimization requires the minimization of a loss function of the output image p and the input image c (see Paragraph 0077) by training a machine learning model.
For example, Stannus teaches at Paragraph 0105 that the mask style transfer can be performed only on the selected regions. Stannus teaches at Paragraph 0107 that the style transfer unit 203 (machine learning model) may generate a total of two masks including a mask that prevents style transfer for regions other than the region corresponding to a first building reflected in the image, and a mask that prevents style transfer for regions other than the region corresponding to a second building reflected in the image.
This teaching clearly shows that the output image is generated by the style transfer unit 203 by blending the input image (the scene image) and the object mask image (object image), and that at least one other region of the output image is identified that corresponds to an edge of the mask object (e.g., of a first building or a second building) and is similar to a corresponding region of the input image other than the first building or second building. In response to the identifying, the output image is modified in the identified region other than the first building or second building to replace visual elements of the first building or the second building of the output image with visual elements of the first building or second building of the input image.
Stannus thus teaches at Paragraph 0173-0174 that the output image is generated by the style transfer unit 203 (a machine learning model) by modifying the output image in the identified region to replace visual elements (e.g., the person objects) in the front side of the tower T of the output image with the visual elements (e.g., the person objects) in the front side of the tower T of the input image, while the style of the tower T and the background of the tower T of the output image are modified from the input image (the scene image).
Stannus teaches at FIGS. 20-21 and Paragraph 0169-0174 that the style transfer requires the blending of the input image (scene image) and a mask object image, where the style transfer is applied to the tower T using a mask object image in the same manner as in the style transfer disclosed in FIGS. 13-14 of Stannus. The one or more demarcated regions, including the portion of the tower T and the portion of the background of the tower T, constitute a mask object. Stannus clearly teaches at Paragraph 0173 that the style transfer unit 203 generates a mask corresponding to the shape of the defined region and uses the generated mask (the mask object image) to apply style transfer to the region in the image.
Applicant failed to consult Paragraph 0174, which states that the region corresponding to the front side of the tower T is not subjected to style transfer and remains as the original image of the real world.
Stannus teaches at Paragraph [0169]: “FIG. 20 is a conceptual diagram illustrating an image before style transfer, according to at least one embodiment of the present disclosure. FIG. 21 is a conceptual diagram illustrating a state in which an image after style transfer is output to a user terminal, according to at least one embodiment of the present disclosure.”
Stannus teaches at Paragraph [0173]: “In step St13, the style transfer unit 203 performs a style transfer on the region in the image. More specifically, the style transfer unit 203 generates a mask corresponding to the shape of the defined region, and uses the generated mask to apply style transfer to the region in the image.”
Stannus teaches at Paragraph [0174]: “In the output image after the style transfer is applied, the style of the tower T and the background of the tower T are style-transferred based on the style image, as illustrated in FIG. 21. On the other hand, in the output image, the region corresponding to the front side of the tower T is not subjected to style transfer and remains as the original image of the real world.”
Stannus teaches at FIG. 14 and Paragraph 0145-0150 identifying at least one area (the dog region) of the output image corresponding to the mask object M3, which corresponds to an edge of the mask object and is similar to a corresponding area of the input image.
Stannus teaches at FIG. 22 and Paragraph 0184 that the image composite unit 206 blends the object OBJ with the image to obtain a composite image.
Stannus teaches at FIG. 14 and Paragraph 0145-0148 that the style transfer unit 203 (the machine learning model) blends the mask M4 (object image), obtained by inverting the values of the mask M3, with an input image serving as a scene image, and that the output image is generated based on the blending of the input image with an object image such as mask M3 (having a mask object) corresponding to style C. Stannus teaches at Paragraph 0065 that the style transfer unit 203 may use a neural network for style transfer, and at Paragraph 0069 that the style transfer unit 203 may perform style transfer by blending a plurality of styles (including the mask object) into the same portion of the input image (scene image).
Stannus teaches at FIG. 21 and Paragraph 0174-0175 that the object image OBJ is blended with the scene image of FIG. 20. Stannus teaches at Paragraph 0172-0175 that the region defining unit 202 demarcates one or more regions from the image based on the estimated distance (e.g., the portion of the tower T is identified, the style of the tower T and the background of the tower T are style-transferred based on the style image, and the region corresponding to the front side of the tower T is not subjected to style transfer). At least one area of the tower T of the output image in FIG. 21 is similar to a corresponding area of the scene image in FIG. 20. Stannus teaches at Paragraph 0168 that the region corresponding to the portion inside the window W of the building in the output image is not subjected to style transfer.
Stannus teaches at Paragraph [0106] that the style transfer unit 203 generates a mask corresponding to the shape of the region to which the style transfer is applied. Next, the style transfer unit 203 inputs the image data and the mask to the neural network for style transfer. This allows the masks to be used to apply style transfers based on one or more style images to the image data).
Stannus at least implicitly teaches the claim limitation:
in response to the identifying, modifying the output image in the identified at least one area to replace visual elements of the output image with visual elements from the corresponding area of the scene image (
Stannus further teaches at FIGS. 20-22 and Paragraph 0184 that the style transfer can be further applied to the composite image (output image) after demarcating one or more regions of the composite image (e.g., demarcating a dog region of the composite image that is similar to the corresponding dog region of the input image, such that a dog is reflected in both the output image and the input image; see Paragraph 0146 and Paragraph 0175, wherein the output image reflects the scenery and objects of the real world, including a dog).
Stannus still teaches at FIGS. 13-14 that the visual elements (ears, nose, eyes) of the dog in the output image are replaced with the visual elements (eyes, nose, and ears) of the dog in the input image (see Paragraph 0080). Moreover, the dog region of the output image is modified to reflect the dog region of the input image.
Stannus teaches at Paragraph 0148 that “image data in which a dog is reflected” corresponds to an output image wherein the dog region of the output image is similar to the dog region of the input image. Stannus teaches at Paragraph 0053 that the style transfer unit 203 performs a style transfer on the region in the image, and the image referred to there includes a composite image (output image). The region in the composite image (output image) is similar to the region of the input image before the style transfer.
Moreover, FIGS. 7-10 show separately applying style A and style B to an input image to generate two separate output images, and Stannus teaches at Paragraph 0069 that the optimization function is defined based on the plurality of style images. The optimization requires successively generating output images p to be optimized (see Paragraph 0081, Paragraph 0090-0092, and the iterative optimization of Paragraph 0095-0104 for successively generating the output images p), and the similarity of the output image p and the input image in the optimization is formulated by minimizing an L2 loss function (L2 similarity).
The similarity between the output image and the input image is described in formula 2 of Paragraph 0080.
For example, Stannus teaches at Paragraph 0144-0150 that the style application range (the region) of the output image is similar to the corresponding region of the input image. For example, after applying style C to the input image, a first output image is generated in which the non-dog region is style-transferred to style C; the dog region in the first output image thus remains unchanged, and the dog region of the first output image is similar to the dog region of the input image. When applying style C to the input image, at least one non-dog region of the output image is modified to replace visual elements of the non-dog region of the first output image with visual elements from the corresponding area of the scene image.
Subsequently, when style D is further applied to the first output image (now serving as an input image), the dog region of the second output image is style-transferred with style D, while the dog region of the first output image remains similar to the dog region of the input image and the non-dog region of the second output image remains similar to the non-dog region of the first output image. Stannus teaches at Paragraph 0148-0150 that the dog region of an intermediate output image is similar to the dog region of the input image (see also the iterative optimization process at Paragraph 0080 for generating the output images p).
When applying style D to the first output image, at least the dog region of the second output image is modified to replace visual elements of the dog region of the second output image with visual elements from the corresponding area of the first output image serving as the input image at the second stage (i.e., the dog is reflected in the second output image).
Stannus teaches in response to the region defining unit 202 identifying a region of the output image, modifying the region of the output image to replace visual elements of the output image with visual elements from the corresponding area of the input image.
For example, Stannus teaches at Paragraph 0105 that the mask style transfer can be performed only on the selected regions. Stannus teaches at Paragraph 0107 that the style transfer unit 203 (machine learning model) may generate a total of two masks including a mask that prevents style transfer for regions other than the region corresponding to a first building reflected in the image, and a mask that prevents style transfer for regions other than the region corresponding to a second building reflected in the image.
This teaching clearly shows that the output image is generated by the style transfer unit 203 by blending the input image (the scene image) and the object mask image (object image), and that at least one other region of the output image is identified that corresponds to an edge of the mask object (e.g., of a first building or a second building) and is similar to a corresponding region of the input image other than the first building or second building. In response to the identifying, the output image is modified in the identified region other than the first building or second building to replace visual elements of the first building or the second building of the output image with visual elements of the first building or second building of the input image.
Stannus thus teaches at Paragraph 0173-0174 that the output image is generated by the style transfer unit 203 (a machine learning model) by modifying the output image in the identified region to replace visual elements (e.g., the person objects) in the front side of the tower T of the output image with the visual elements (e.g., the person objects) in the front side of the tower T of the input image, while the style of the tower T and the background of the tower T of the output image are modified from the input image (the scene image).
Stannus teaches at Paragraph [0173]: “In step St13, the style transfer unit 203 performs a style transfer on the region in the image. More specifically, the style transfer unit 203 generates a mask corresponding to the shape of the defined region, and uses the generated mask to apply style transfer to the region in the image.”
Stannus teaches at Paragraph [0174]: “In the output image after the style transfer is applied, the style of the tower T and the background of the tower T are style-transferred based on the style image, as illustrated in FIG. 21. On the other hand, in the output image, the region corresponding to the front side of the tower T is not subjected to style transfer and remains as the original image of the real world.”
Stannus teaches at Paragraph 0149 that “the output data after the style transfer is applied is an output image in which the region corresponding to the portion other than the dog is style-transferred to style C, and the region corresponding to the dog is style-transferred to style D.”
Stannus teaches at FIG. 7 and Paragraph 0113 that the style transfer unit 203 inputs the input image and a mask to the processing layer P1. The output image from P1 is modified according to the mask object, while the visual elements in the other regions remain.
Stannus teaches at FIG. 9 and Paragraph 0125-0127 that, of the original feature amounts, only the portion corresponding to the portion (left half) where the value in the style A hard mask is 1 remains. Stannus thus teaches applying a single hard mask to an input image (e.g., style A is applied to the left half region of the input image), thereby preventing the right region from style transfer, so that the output image is modified in the identified left region while the visual elements of the output image in the right region are replaced with the visual elements of the input image in the right region.
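For illustration only, a minimal sketch of this hard-mask gating of feature amounts under assumed shapes and names (none drawn from Stannus); feature amounts survive only where the mask value is 1:

    import numpy as np

    # Stand-in feature amounts from a processing layer: channels x height x width.
    features = np.random.rand(8, 4, 6)

    # Style A hard mask: 1 over the left half, 0 over the right half.
    hard_mask = np.zeros((4, 6))
    hard_mask[:, :3] = 1.0

    # Element-wise gating: only the feature amounts in the left half
    # (mask value 1) remain; the right half is zeroed and therefore
    # excluded from the style-A transfer.
    masked_features = features * hard_mask   # mask broadcasts over channels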
Stannus teaches at Paragraph 0105 that the mask style transfer can be performed only on the selected regions. Stannus teaches at Paragraph 0107 that the style transfer unit 203 may generate a total of two masks including a mask that prevents style transfer for regions other than the region corresponding to a first building reflected in the image, and a mask that prevents style transfer for regions other than the region corresponding to a second building reflected in the image.
Accordingly, the output image generated by the style transfer is modified in the identified regions to replace visual elements (the first and second buildings) of the output image with visual elements (first and second buildings) of the input image.
Stannus teaches at Paragraph 0128 that the style A soft mask and the style B soft mask correspond to a plurality of masks having different regions for preventing style transfer. Further, the style A hard mask and the style B hard mask correspond to a plurality of masks having different regions for preventing style transfer.
Stannus teaches at FIGS. 13-14 and Paragraph 0139-0150 in response to the identifying of the non-mask region, modifying the output image in the identified non-mask region to replace visual elements of the output image with visual elements from the corresponding area of the input image.
For example, Stannus teaches at Paragraph 0140 that the mask M1 is a mask for preventing style transfer for the left edge region and right edge region of the image data, and at Paragraph 0141 that the mask M2 is generated based on the mask M1, such that the output image is modified in the identified left edge region and right edge region to replace visual elements of the output image with visual elements from the corresponding regions of the input image.
Stannus teaches at FIG. 22 and Paragraph 0184 that the image composite unit 206 blends the object OBJ with the image to obtain a composite image.
Stannus teaches at FIG. 14 and Paragraph 0145-0148 that the style transfer unit 203 (the machine learning model) blends the mask M4 (object image), generated by inverting the values of the mask M3, with an input image as a scene image, and the output image is generated based on the blending of the input image with an object image such as mask M3 (having a mask object) corresponding to the style C. Stannus teaches at Paragraph 0065 that the style transfer unit 203 may use a neural network for style transfer and at Paragraph 0069 that the style transfer unit 203 may perform style transfer by blending a plurality of styles (including the mask object) into the same portion of the input image (scene image).
Stannus teaches at FIG. 21 and Paragraph 0174-0175 that the object image OBJ is blended with the scene image of FIG. 20. Stannus teaches at Paragraph 0172-0175 that the region defining unit 202 demarcates one or more regions from the image based on the estimated distance (e.g., the portion of the tower T is identified, the style of the tower T and the background of the tower T are style-transferred based on the style image, and the region corresponding to the front side of the tower T is not subjected to style transfer). At least one area of the tower T of the output image in FIG. 21 is similar to a corresponding area of the scene image in FIG. 20, and the visual elements of the tower T in the output image are replaced by the visual elements of the scene image of FIG. 20. Stannus teaches at Paragraph 0168 that the region corresponding to the portion inside the window W of the building in the output image is not subjected to style transfer.
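Similarly, the inverted-mask blending mapped from Paragraph 0145-0148 can be sketched as follows; the compositing form and the per-style outputs are assumptions used only to illustrate how inverting mask M3 yields a complementary mask M4.

```python
import numpy as np

def blend_complementary_styles(style_c_output: np.ndarray,
                               style_d_output: np.ndarray,
                               mask_m3: np.ndarray) -> np.ndarray:
    """mask_m3: (H, W) array of {0, 1} selecting the style C region.
    Inverting its values (cf. mask M4) selects the complementary region."""
    mask_m4 = 1.0 - mask_m3                  # inversion of the values of M3
    m3 = mask_m3[..., None]
    m4 = mask_m4[..., None]
    return m3 * style_c_output + m4 * style_d_output
```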
Mishra explicitly teaches the claim limitation that in response to the identifying, modifying the output image in the identified at least one area to replace visual elements of the output image with visual elements from the corresponding area of the scene image (
Mishra teaches at Paragraph [0049] FIG. 3 is a schematic diagram illustrating how the background color is recommended for multiple objects of an input image 302 based on training a machine learning model 324, according to some embodiments. At a first time an input image 302 is received (e.g., uploaded to the consumer application 190). The user image 302 contains the snowman object 304, the Santa Clause object 306, and the reindeer object 308. Some embodiments perform preprocessing functionality, such as remove the background color and/or other features (e.g., trees, landscape, etc.) of the input image 302 (e.g., set a pixel value to a single opaque or clear color), such that only the objects are present in the input image 302).
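The preprocessing quoted above (removing the background so that only the objects remain) can be sketched as below; the mask argument and fill value are illustrative assumptions, not Mishra's API.

```python
import numpy as np

def isolate_objects(image: np.ndarray, object_mask: np.ndarray,
                    fill_value: int = 255) -> np.ndarray:
    """Set every non-object pixel to a single opaque color (here flat white),
    so that only the objects are present, per the quoted Paragraph [0049]."""
    out = image.copy()
    out[object_mask == 0] = fill_value
    return out
```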
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Mishra’s generating the appearance image (the background color image) based on the input image 302 (the scene image) and the object image (304/306/308) into Stannus to have modified Stannus’s style image to be generated based on the scene image (the input image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Joachim/Smith/Kim teaches the claim limitation that in response to the identifying, modifying the output image in the identified at least one area to replace visual elements of the output image with visual elements from the corresponding area of the scene image (
Joachim teaches that the at least one coherence fill is generated by the computer processing system as part of the automatic composite image generation process based on the scene image 706 and the objects 708a-708d.
Joachim teaches at Paragraph [0225] that, an object-aware modification includes an editing operation that targets an identified object in a digital image. In particular, in some embodiments, an object-aware modification includes an editing operation that targets an object that has been previously segmented. For instance, as discussed, the scene-based image editing system 106 generates a mask for an object portrayed in a digital image before receiving user input for modifying the object in some implementations. Accordingly, upon user selection of the object (e.g., a user selection of at least some of the pixels portraying the object), the scene-based image editing system 106 determines to target modifications to the entire object rather than requiring that the user specifically designate each pixel to be edited. Thus, in some cases, an object-aware modification includes a modification that targets an object by managing all the pixels portraying the object as part of a cohesive unit rather than individual elements. For instance, in some implementations an object-aware modification includes, but is not limited to, a move operation or a delete operation.
Joachim teaches at Paragraph [0229] that, as further shown in FIG. 7, the scene-based image editing system 106 utilizes the content-aware hole-filling machine learning model 704 to generate content fills 712 for the objects 708a-708d. In particular, in some embodiments, the scene-based image editing system 106 utilizes the content-aware hole-filling machine learning model 704 to generate a separate content fill for each portrayed object. As illustrated, the scene-based image editing system 106 generates the content fills 712 using the object masks 710. For instance, in one or more embodiments, the scene-based image editing system 106 utilizes the object masks 710 generated via the segmentation neural network 702 as indicators of replacement regions to be replaced using the content fills 712 generated by the content-aware hole-filling machine learning model 704. In some instances, the scene-based image editing system 106 utilizes the object masks 710 to filter out the objects from the digital image 706, which results in remaining holes in the digital image 706 to be filled by the content fills 712.
Smith teaches that the at least one coherence fill is generated by the computer processing system as part of the automatic composite image generation process based on the scene image 706 and the objects 708a-708d.
Smith teaches at Paragraph [0221] In one or more embodiments, an object-aware modification includes an editing operation that targets an identified object in a digital image. In particular, in some embodiments, an object-aware modification includes an editing operation that targets an object that has been previously segmented. For instance, as discussed, the scene-based image editing system 106 generates a mask for an object portrayed in a digital image before receiving user input for modifying the object in some implementations. Accordingly, upon user selection of the object (e.g., a user selection of at least some of the pixels portraying the object), the scene-based image editing system 106 determines to target modifications to the entire object rather than requiring that the user specifically designate each pixel to be edited. Thus, in some cases, an object-aware modification includes a modification that targets an object by managing all the pixels portraying the object as part of a cohesive unit rather than individual elements. For instance, in some implementations an object-aware modification includes, but is not limited to, a move operation or a delete operation.
Smith teaches at Paragraph [0225] that, as further shown in FIG. 7, the scene-based image editing system 106 utilizes the content-aware hole-filling machine learning model 704 to generate content fills 712 for the objects 708a-708d. In particular, in some embodiments, the scene-based image editing system 106 utilizes the content-aware hole-filling machine learning model 704 to generate a separate content fill for each portrayed object. As illustrated, the scene-based image editing system 106 generates the content fills 712 using the object masks 710. For instance, in one or more embodiments, the scene-based image editing system 106 utilizes the object masks 710 generated via the segmentation neural network 702 as indicators of replacement regions to be replaced using the content fills 712 generated by the content-aware hole-filling machine learning model 704. In some instances, the scene-based image editing system 106 utilizes the object masks 710 to filter out the objects from the digital image 706, which results in remaining holes in the digital image 706 to be filled by the content fills 712.
Kim teaches that the at least one coherence fill is generated by the computer processing system as part of the automatic composite image generation process based on the scene image 706 and the objects 708a-708d.
Kim teaches at Paragraph [0201] In one or more embodiments, an object-aware modification includes an editing operation that targets an identified object in a digital image. In particular, in some embodiments, an object-aware modification includes an editing operation that targets an object that has been previously segmented. For instance, as discussed, the scene-based image editing system 106 generates a mask for an object portrayed in a digital image before receiving user input for modifying the object in some implementations. Accordingly, upon user selection of the object (e.g., a user selection of at least some of the pixels portraying the object), the scene-based image editing system 106 determines to target modifications to the entire object rather than requiring that the user specifically designate each pixel to be edited. Thus, in some cases, an object-aware modification includes a modification that targets an object by managing all the pixels portraying the object as part of a cohesive unit rather than individual elements. For instance, in some implementations an object-aware modification includes, but is not limited to, a move operation or a delete operation.
Kim teaches at Paragraph [0205] that, as further shown in FIG. 7, the scene-based image editing system 106 utilizes the content-aware hole-filling machine learning model 704 to generate content fills 712 for the objects 708a-708d. In particular, in some embodiments, the scene-based image editing system 106 utilizes the content-aware hole-filling machine learning model 704 to generate a separate content fill for each portrayed object. As illustrated, the scene-based image editing system 106 generates the content fills 712 using the object masks 710. For instance, in one or more embodiments, the scene-based image editing system 106 utilizes the object masks 710 generated via the segmentation neural network 702 as indicators of replacement regions to be replaced using the content fills 712 generated by the content-aware hole-filling machine learning model 704. In some instances, the scene-based image editing system 106 utilizes the object masks 710 to filter out the objects from the digital image 706, which results in remaining holes in the digital image 706 to be filled by the content fills 712.
).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Smith/Joachim/Kim’s generating the appearance image (the content fills 712) based on the input image 706 (the digital scene image) and the object image (708a-708d) into Stannus to have modified Stannus’s style image to be generated based on the scene image (the digital image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
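For exposition only, the mapped content-fill flow (object masks 710 marking replacement regions, objects filtered out of the digital image 706, and content fills 712 generated for the resulting holes) might be organized as sketched below; inpaint_model is a hypothetical stand-in for the content-aware hole-filling machine learning model 704.

```python
import numpy as np

def cut_holes_and_fill(digital_image: np.ndarray, object_masks: list,
                       inpaint_model) -> np.ndarray:
    """object_masks: per-object (H, W) binary masks (cf. object masks 710).
    inpaint_model(image, holes) is an assumed callable returning the image
    with the hole region filled (cf. content fills 712)."""
    holes = np.zeros(digital_image.shape[:2], dtype=bool)
    for m in object_masks:
        holes |= m.astype(bool)              # union of replacement regions
    holed = digital_image.copy()
    holed[holes] = 0                         # filter the objects out
    return inpaint_model(holed, holes)       # fill the remaining holes
```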
Re Claim 2:
Claim 2 encompasses the same scope of invention as claim 1 except for the additional claim limitation that the identifying of at least one area of the output image that is similar to a corresponding area of the scene image is based on one or more of Euclidean distance, L1 distance, and a color distribution similarity between an area of the output image and a corresponding area of the scene image.
Stannus further teaches the claim limitation that the identifying of at least one area of the output image that is similar to a corresponding area of the scene image is based on one or more of Euclidean distance, L1 distance, and a color distribution similarity between an area of the output image and a corresponding area of the scene image (
Stannus teaches at FIG. 21 and Paragraph 0174-0175 that the object image OBJ is blended with the scene image of FIG. 20. Stannus teaches at Paragraph 0172-0175 that the region defining unit 202 demarcates one or more regions from the image based on the estimated distance (e.g., the portion of the tower T is identified, the style of the tower T and the background of the tower T are style-transferred based on the style image, and the region corresponding to the front side of the tower T is not subjected to style transfer). At least one area of the tower T of the output image in FIG. 21 is similar to a corresponding area of the scene image in FIG. 20, and the visual elements of the tower T in the output image are replaced by the visual elements of the scene image of FIG. 20. Stannus teaches at Paragraph 0168 that the region corresponding to the portion inside the window W of the building in the output image is not subjected to style transfer.
Stannus teaches at Paragraph [0057] that the distance to the target means the distance from the viewpoint from which the image is captured to the target. For example, when an image is captured by a camera, the camera is the viewpoint. Therefore, in this case, the distance to the target means the distance from the camera to the target.
Stannus teaches at Paragraph [0171] that in step St11 in FIG. 3, the distance estimation unit 201 estimates the distance to the target included in the image. The target in this example is the tower T. That is, the distance estimation unit 201 estimates the distance from the camera of the user terminal 20Z to the tower T.
Stannus teaches at Paragraph [0172] that, in step St12, the region defining unit 202 demarcates one or more regions from the image based on the estimated distance. The region in this example may be a portion of the image that is a predetermined distance or more away from the tower T, that is, a portion of the tower T, and a portion of the background when the tower T is the foreground).
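For illustration, the three similarity measures recited in claim 2 can be computed over corresponding areas as in the sketch below; the bin count and the histogram-intersection form of the color distribution similarity are assumptions.

```python
import numpy as np

def area_similarity(area_out: np.ndarray, area_scene: np.ndarray) -> dict:
    """Compare an area of the output image with the corresponding area of
    the scene image; both are (h, w, 3) uint8 arrays."""
    diff = area_out.astype(np.float64) - area_scene.astype(np.float64)
    l2 = float(np.sqrt((diff ** 2).sum()))       # Euclidean (L2) distance
    l1 = float(np.abs(diff).sum())               # L1 distance
    inter = 0.0                                  # color distribution similarity
    for c in range(3):
        h1, _ = np.histogram(area_out[..., c], bins=32, range=(0, 256))
        h2, _ = np.histogram(area_scene[..., c], bins=32, range=(0, 256))
        inter += np.minimum(h1, h2).sum() / max(h1.sum(), 1)
    return {"euclidean": l2, "l1": l1, "color_similarity": inter / 3.0}
```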
Re Claim 3:
Claim 3 encompasses the same scope of invention as claim 1 except for the additional claim limitation of repeating the identifying, determining, and modifying steps for multiple areas of the output image, the multiple areas substantially covering all areas of transition between the object image and the scene image.
Stannus further teaches the claim limitation of repeating the identifying, determining, and modifying steps for multiple areas of the output image, the multiple areas substantially covering all areas of transition between the object image and the scene image (
Stannus teaches at Paragraph [0057] that the distance to the target means the distance from the viewpoint from which the image is captured to the target. For example, when an image is captured by a camera, the camera is the viewpoint. Therefore, in this case, the distance to the target means the distance from the camera to the target.
Stannus teaches at Paragraph [0171] that in step St11 in FIG. 3, the distance estimation unit 201 estimates the distance to the target included in the image. The target in this example is the tower T. That is, the distance estimation unit 201 estimates the distance from the camera of the user terminal 20Z to the tower T.
Stannus teaches at Paragraph [0172] that, in step St12, the region defining unit 202 demarcates one or more regions from the image based on the estimated distance. The region in this example may be a portion of the image that is a predetermined distance or more away from the tower T, that is, a portion of the tower T, and a portion of the background when the tower T is the foreground).
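Purely for illustration, repeating the identifying, determining, and modifying steps over every transition area can be organized by tiling the object/scene boundary, as sketched below; the tile size and the mixed-tile criterion are assumptions, not a mapping of Stannus.

```python
import numpy as np

def transition_tiles(mask: np.ndarray, tile: int = 16):
    """Yield (y, x) corners of tiles straddling the transition between the
    object region (mask == 1) and the scene region (mask == 0), so the
    identify/determine/modify steps can be repeated for each such area."""
    h, w = mask.shape
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            patch = mask[y:y + tile, x:x + tile]
            if 0 < patch.sum() < patch.size:     # mixed tile => transition area
                yield y, x
```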
Re Claim 4:
Claim 4 encompasses the same scope of invention as claim 1 except for the additional claim limitation that each identified area is a single pixel or a group of pixels of the output image.
Stannus further teaches the claim limitation that each identified area is a single pixel or a group of pixels of the output image (
Stannus teaches at Paragraph [0066] that FIG. 4 is a conceptual diagram illustrating an example structure of a neural network N1 used for general style transfer, according to at least one embodiment of the present disclosure. The neural network N1 includes a first transformation layer that transforms a group of pixels based on the input image into latent parameters, one or more layers that perform downsampling by convolution or the like, a plurality of residual block layers, a layer that performs upsampling, and a second transformation layer that transforms the latent parameters into a group of pixels. An output image is obtained based on the group of pixels that are the output of the second transformation layer).
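The quoted structure of the neural network N1 can be sketched, for illustration only, as the PyTorch module below; Paragraph [0066] recites only the layer types and their order, so the channel widths, kernel sizes, and residual block count here are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class StyleTransferN1(nn.Module):
    """Pixel-to-latent transform, downsampling by convolution, residual
    blocks, upsampling, latent-to-pixel transform (cf. Paragraph [0066])."""
    def __init__(self, ch: int = 64, n_res: int = 5):
        super().__init__()
        self.to_latent = nn.Conv2d(3, ch, 9, padding=4)   # first transformation layer
        self.down = nn.Sequential(
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1), nn.ReLU())
        self.res = nn.Sequential(*[ResidualBlock(ch * 4) for _ in range(n_res)])
        self.up = nn.Sequential(
            nn.ConvTranspose2d(ch * 4, ch * 2, 3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch * 2, ch, 3, stride=2,
                               padding=1, output_padding=1), nn.ReLU())
        self.to_pixels = nn.Conv2d(ch, 3, 9, padding=4)   # second transformation layer

    def forward(self, x):                                 # x: (N, 3, H, W)
        return self.to_pixels(self.up(self.res(self.down(self.to_latent(x)))))
```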
Re Claim 5:
Claim 5 encompasses the same scope of invention as claim 1 except for the additional claim limitation that the output image was generated by a machine learning model that combines the object image with the scene image based on a mask image.
Stannus further teaches the claim limitation that the output image was generated by a machine learning model that combines the object image with the scene image based on a mask image (Stannus teaches at FIGS. 13-14 the input image as a scene image and a mask image. The style image having an object, or the input image having a dog object, serves as an object image.
Stannus teaches at FIGS. 13-14 and Paragraph 0143 that the output image is an image in which style transfer is performed on the central region to style A and on the left edge region and the right edge region to style B and at Paragraph 0081-0082 that the generated image is the output image of the neural network.
Stannus teaches at Paragraph [0103] that the style transfer unit 203 inputs the image data to the first transformation layer of the trained neural network N2 optimized as described above, for example. As a result, style transfer-applied data in which the n style images are nicely blended is output from the second transformation layer of the neural network N2.
Stannus teaches at Paragraph [0106] that the style transfer unit 203 generates a mask corresponding to the shape of the region to which the style transfer is applied. Next, the style transfer unit 203 inputs the image data and the mask to the neural network for style transfer. This allows the masks to be used to apply style transfers based on one or more style images to the image data).
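A minimal sketch of the quoted input arrangement (image data and mask both supplied to the style-transfer network, per Paragraph [0106]); channel-wise concatenation is an assumed realization, not a detail Stannus discloses.

```python
import torch

def style_transfer_with_mask(model, image: torch.Tensor,
                             mask: torch.Tensor) -> torch.Tensor:
    """image: (N, 3, H, W); mask: (N, 1, H, W). `model` is assumed to
    accept a 4-channel input (image data plus mask)."""
    return model(torch.cat([image, mask], dim=1))
```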
Re Claim 6:
Claim 6 encompasses the same scope of invention as claim 1 except for the additional claim limitation that the output image was generated by a process comprising, by a computer processing system comprising an image processor:
receiving a scene image, an object image, and a mask image; using at least one content image and at least one appearance image to guide or control an inference process of a controlled image generating machine learning model to generate visual elements for at least one area of a composite image, wherein the at least one area is dependent on the mask image; wherein: the at least one content image is either the scene image or the object image, or is generated based on at least one of the scene image and the object image; the at least one appearance image is either the scene image or the object image, or is generated based on at least one of the scene image and the object image; the method comprises at least one of: generating at least one said content image based on at least one of the scene image and the object image; and generating at least one said appearance image based on at least one of the scene image and the object image.
Stannus further teaches the claim limitation that the output image was generated by a process comprising, by a computer processing system comprising an image processor:
receiving a scene image, an object image, and a mask image (Stannus teaches at FIGS. 13-14 the input image as a scene image and a mask image. The style image having an object, or the input image having a dog object, serves as an object image.
Stannus teaches at FIGS. 13-14 and Paragraph 0143 that the output image is an image in which style transfer is performed on the central region to style A and on the left edge region and the right edge region to style B and at Paragraph 0081-0082 that the generated image is the output image of the neural network.
Stannus teaches at Paragraph [0103] that the style transfer unit 203 inputs the image data to the first transformation layer of the trained neural network N2 optimized as described above, for example. As a result, style transfer-applied data in which the n style images are nicely blended is output from the second transformation layer of the neural network N2.
Stannus teaches at Paragraph [0106] that the style transfer unit 203 generates a mask corresponding to the shape of the region to which the style transfer is applied. Next, the style transfer unit 203 inputs the image data and the mask to the neural network for style transfer. This allows the masks to be used to apply style transfers based on one or more style images to the image data);
using at least one content image and at least one appearance image to guide or control an inference process of a controlled image generating machine learning model to generate visual elements for at least one area of a composite image (
Stannus teaches at FIGS. 13-14 and Paragraph 0143 that the output image is an image in which style transfer is performed on the central region to style A and on the left edge region and the right edge region to style B and at Paragraph 0081-0082 that the generated image is the output image of the neural network.
Stannus teaches at Paragraph [0103] that the style transfer unit 203 inputs the image data to the first transformation layer of the trained neural network N2 optimized as described above, for example. As a result, style transfer-applied data in which the n style images are nicely blended is output from the second transformation layer of the neural network N2.
Stannus teaches at Paragraph [0106] that the style transfer unit 203 generates a mask corresponding to the shape of the region to which the style transfer is applied. Next, the style transfer unit 203 inputs the image data and the mask to the neural network for style transfer. This allows the masks to be used to apply style transfers based on one or more style images to the image data),
wherein the at least one area is dependent on the mask image (Stannus teaches at Paragraph [0106] that the style transfer unit 203 generates a mask corresponding to the shape of the region to which the style transfer is applied. Next, the style transfer unit 203 inputs the image data and the mask to the neural network for style transfer. This allows the masks to be used to apply style transfers based on one or more style images to the image data);
wherein: the at least one content image is either the scene image or the object image, or is generated based on at least one of the scene image and the object image; the at least one appearance image is either the scene image or the object image, or is generated based on at least one of the scene image and the object image (Stannus teaches at Paragraph [0106] that the style transfer unit 203 generates a mask corresponding to the shape of the region to which the style transfer is applied. Next, the style transfer unit 203 inputs the image data and the mask to the neural network for style transfer. This allows the masks to be used to apply style transfers based on one or more style images to the image data); the method comprises at least one of: generating at least one said content image based on at least one of the scene image and the object image; and generating at least one said appearance image based on at least one of the scene image and the object image (Stannus teaches at Paragraph [0106] that the style transfer unit 203 generates a mask corresponding to the shape of the region to which the style transfer is applied. Next, the style transfer unit 203 inputs the image data and the mask to the neural network for style transfer. This allows the masks to be used to apply style transfers based on one or more style images to the image data).
Re Claim 7:
Claim 7 encompasses the same scope of invention as claim 6 except for the additional claim limitation that the at least one content image represents the structure or content of at least one of the scene image and the object image, while omitting at least some style characteristics.
Mishra teaches the claim limitation that the at least one content image represents the structure or content of at least one of the scene image and the object image, while omitting at least some style characteristics (
Mishra teaches at Paragraph 0049-0050 that the at least one content image (304/306/308) represents the structure or content of at least one of the scene image 302 and the object image, while omitting at least the background color characteristics.
Mishra teaches at Paragraph [0049] FIG. 3 is a schematic diagram illustrating how the background color is recommended for multiple objects of an input image 302 based on training a machine learning model 324, according to some embodiments. At a first time an input image 302 is received (e.g., uploaded to the consumer application 190). The user image 302 contains the snowman object 304, the Santa Clause object 306, and the reindeer object 308. Some embodiments perform preprocessing functionality, such as remove the background color and/or other features (e.g., trees, landscape, etc.) of the input image 302 (e.g., set a pixel value to a single opaque or clear color), such that only the objects are present in the input image 302).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Mishra’s generating the appearance image (the background color image) based on the input image 302 (the scene image) and the object image (304/306/308) into Stannus to have modified Stannus’s style image to be generated based on the scene image (the input image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 8:
Claim 8 encompasses the same scope of invention as claim 7 except for the additional claim limitation that the at least one appearance image represents style characteristics of at least one of the scene image and the object image.
Mishra teaches the claim limitation that the at least one appearance image represents style characteristics of at least one of the scene image and the object image (
Mishra teaches at Paragraph 0049-0050 that the at least one content image (304/306/308) represents the structure or content of at least one of the scene image 302 and the object image, while omitting at least the background color characteristics.
Mishra teaches at Paragraph [0049] FIG. 3 is a schematic diagram illustrating how the background color is recommended for multiple objects of an input image 302 based on training a machine learning model 324, according to some embodiments. At a first time an input image 302 is received (e.g., uploaded to the consumer application 190). The user image 302 contains the snowman object 304, the Santa Clause object 306, and the reindeer object 308. Some embodiments perform preprocessing functionality, such as remove the background color and/or other features (e.g., trees, landscape, etc.) of the input image 302 (e.g., set a pixel value to a single opaque or clear color), such that only the objects are present in the input image 302).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Mishra’s generating the appearance image (the background color image) based on the input image 302 (the scene image) and the object image (304/306/308) into Stannus to have modified Stannus’s style image to be generated based on the scene image (the input image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 9:
Claim 9 encompasses the same scope of invention as claim 8 except for the additional claim limitation that the at least one appearance image also represents the structure or content of the at least one of the scene image and the object image.
Mishra teaches the claim limitation that the at least one appearance image also represents the structure or content of the at least one of the scene image and the object image (
Mishra teaches at Paragraph 0049-0050 that the at least one content image (304/306/308) represents the structure or content of at least one of the scene image 302 and the object image, while omitting at least the background color characteristics.
Mishra teaches at Paragraph [0049] FIG. 3 is a schematic diagram illustrating how the background color is recommended for multiple objects of an input image 302 based on training a machine learning model 324, according to some embodiments. At a first time an input image 302 is received (e.g., uploaded to the consumer application 190). The user image 302 contains the snowman object 304, the Santa Clause object 306, and the reindeer object 308. Some embodiments perform preprocessing functionality, such as remove the background color and/or other features (e.g., trees, landscape, etc.) of the input image 302 (e.g., set a pixel value to a single opaque or clear color), such that only the objects are present in the input image 302).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Mishra’s generating the appearance image (the background color image) based on the input image 302 (the scene image) and the object image (304/306/308) into Stannus to have modified Stannus’s style image to be generated based on the scene image (the input image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 10:
Claim 10 encompasses the same scope of invention as claim 6 except for the additional claim limitation that the controlled image generating machine learning model comprises an image generating model and one or more control models that receive the at least one content image and the at least one appearance image as inputs to influence the generation of the visual elements.
Mishra teaches the claim limitation that the controlled image generating machine learning model comprises an image generating model and one or more control models that receive the at least one content image and the at least one appearance image as inputs to influence the generation of the visual elements (
Mishra teaches at Paragraph 0049-0050 that the at least one content image (304/306/308) represents the structure or content of at least one of the scene image 302 and the object image, while omitting at least the background color characteristics.
Mishra teaches at Paragraph [0049] FIG. 3 is a schematic diagram illustrating how the background color is recommended for multiple objects of an input image 302 based on training a machine learning model 324, according to some embodiments. At a first time an input image 302 is received (e.g., uploaded to the consumer application 190). The user image 302 contains the snowman object 304, the Santa Clause object 306, and the reindeer object 308. Some embodiments perform preprocessing functionality, such as remove the background color and/or other features (e.g., trees, landscape, etc.) of the input image 302 (e.g., set a pixel value to a single opaque or clear color), such that only the objects are present in the input image 302.
Mishra teaches at Paragraph [0051] that the image feature context 321 includes the detected objects and other features determined in the selected training images 310, 312, 314, and 316—i.e., the Santa Clause object 314-2, the snowman object 314-1, a Christmas tree, reindeer, the theme of Christmas, and the like. The image feature context 321 further includes the color themes data structures 322 present in the selected training images 310, 312, 314, and 316. In other words, each color theme data structure (e.g., 323) represents the current color theme or combination for a respective training image. For example, color theme data structure 323 represents the colors within the image 316, where the first two colors 323-1 represent the purple background of the image 316 and the three colors 323-2 represent a combination of the colors of the objects within the image 316. FIG. 3 illustrates that the neural network 324 is trained to recommend a color theme output 326 for a given training image, such as the training image 314, where the color set 326-1 (i.e., green, orange, and red) represents recommendations for coloring the objects within the image 314 and the color set 326-2 (i.e., red and white) represents recommendations for coloring the background within the image 314. As described in more detail below, the color theme output 326 may be generated based on the 5 most dominant colors in the image 314).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Mishra’s generating the appearance image (the background color image) based on the input image 302 (the scene image) and the object image (304/306/308) into Stannus to have modified Stannus’s style image to be generated based on the scene image (the input image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
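The dominant-color basis for the quoted color theme output 326 can be illustrated with a toy k-means over pixels; k = 5 follows the quoted "5 most dominant colors," while the subsampling and iteration count are assumptions.

```python
import numpy as np

def dominant_colors(image: np.ndarray, k: int = 5, iters: int = 10,
                    sample: int = 10000) -> np.ndarray:
    """Return k dominant colors of an (H, W, 3) uint8 image via naive k-means."""
    rng = np.random.default_rng(0)
    pixels = image.reshape(-1, 3).astype(np.float64)
    pixels = pixels[rng.choice(len(pixels), min(len(pixels), sample),
                               replace=False)]
    centers = pixels[rng.choice(len(pixels), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers.astype(np.uint8)
```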
Re Claim 11:
Claim 11 encompasses the same scope of invention as claim 10 except for the additional claim limitation that the one or more control models include a multi-controlnet and an image prompt adapter.
Mishra teaches the claim limitation that the one or more control models include a multi-controlnet and an image prompt adapter (
Mishra teaches at Paragraph 0049-0050 that the at least one content image (304/306/308) represents the structure or content of at least one of the scene image 302 and the object image, while omitting at least the background color characteristics.
Mishra teaches at Paragraph [0049] FIG. 3 is a schematic diagram illustrating how the background color is recommended for multiple objects of an input image 302 based on training a machine learning model 324, according to some embodiments. At a first time an input image 302 is received (e.g., uploaded to the consumer application 190). The user image 302 contains the snowman object 304, the Santa Clause object 306, and the reindeer object 308. Some embodiments perform preprocessing functionality, such as remove the background color and/or other features (e.g., trees, landscape, etc.) of the input image 302 (e.g., set a pixel value to a single opaque or clear color), such that only the objects are present in the input image 302.
Mishra teaches at Paragraph [0051] that the image feature context 321 includes the detected objects and other features determined in the selected training images 310, 312, 314, and 316—i.e., the Santa Clause object 314-2, the snowman object 314-1, a Christmas tree, reindeer, the theme of Christmas, and the like. The image feature context 321 further includes the color themes data structures 322 present in the selected training images 310, 312, 314, and 316. In other words, each color theme data structure (e.g., 323) represents the current color theme or combination for a respective training image. For example, color theme data structure 323 represents the colors within the image 316, where the first two colors 323-1 represent the purple background of the image 316 and the three colors 323-2 represent a combination of the colors of the objects within the image 316. FIG. 3 illustrates that the neural network 324 is trained to recommend a color theme output 326 for a given training image, such as the training image 314, where the color set 326-1 (i.e., green, orange, and red) represents recommendations for coloring the objects within the image 314 and the color set 326-2 (i.e., red and white) represents recommendations for coloring the background within the image 314. As described in more detail below, the color theme output 326 may be generated based on the 5 most dominant colors in the image 314).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Mishra’s generating the appearance image (the background color image) based on the input image 302 (the scene image) and the object image (304/306/308) into Stannus to have modified Stannus’s style image to be generated based on the scene image (the input image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 12:
Claim 12 encompasses the same scope of invention as claim 6 except for the additional claim limitation that the generating of the at least one content image and the generating of the at least one appearance image includes processing one or more of the scene image and the object image using techniques selected from the group consisting of: cropping, resizing, and transparency introduction.
Mishra teaches the claim limitation that the generating of the at least one content image and the generating of the at least one appearance image includes processing one or more of the scene image and the object image using techniques selected from the group consisting of: cropping, resizing, and transparency introduction (
Mishra teaches at Paragraph 0049-0050 that the at least one content image (304/306/308) represents the structure or content of at least one of the scene image 302 and the object image, while omitting at least the background color characteristics.
Mishra teaches at Paragraph [0049] FIG. 3 is a schematic diagram illustrating how the background color is recommended for multiple objects of an input image 302 based on training a machine learning model 324, according to some embodiments. At a first time an input image 302 is received (e.g., uploaded to the consumer application 190). The user image 302 contains the snowman object 304, the Santa Clause object 306, and the reindeer object 308. Some embodiments perform preprocessing functionality, such as remove the background color and/or other features (e.g., trees, landscape, etc.) of the input image 302 (e.g., set a pixel value to a single opaque or clear color), such that only the objects are present in the input image 302.
Mishra teaches at Paragraph [0051] that the image feature context 321 includes the detected objects and other features determined in the selected training images 310, 312, 314, and 316—i.e., the Santa Clause object 314-2, the snowman object 314-1, a Christmas tree, reindeer, the theme of Christmas, and the like. The image feature context 321 further includes the color themes data structures 322 present in the selected training images 310, 312, 314, and 316. In other words, each color theme data structure (e.g., 323) represents the current color theme or combination for a respective training image. For example, color theme data structure 323 represents the colors within the image 316, where the first two colors 323-1 represent the purple background of the image 316 and the three colors 323-2 represent a combination of the colors of the objects within the image 316. FIG. 3 illustrates that the neural network 324 is trained to recommend a color theme output 326 for a given training image, such as the training image 314, where the color set 326-1 (i.e., green, orange, and red) represents recommendations for coloring the objects within the image 314 and the color set 326-2 (i.e., red and white) represents recommendations for coloring the background within the image 314. As described in more detail below, the color theme output 326 may be generated based on the 5 most dominant colors in the image 314).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Mishra’s generating the appearance image (the background color image) based on the input image 302 (the scene image) and the object image (304/306/308) into Stannus to have modified Stannus’s style image to be generated based on the scene image (the input image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 13:
Claim 13 encompasses the same scope of invention as claim 6 except for the additional claim limitation of generating the mask image by a process comprising receiving an initial image containing an initial mask and dilating the initial mask to form a mask of the mask image.
Joachim further teaches the claim limitation of generating the mask image by a process comprising receiving an initial image containing an initial mask and dilating the initial mask to form a mask of the mask image (Joachim teaches at Paragraph 0494-0496 that the scene-based image editing system 106 implements smart dilation when removing objects, that conventional systems remove objects from digital images utilizing tight masks, and that the scene-based image editing system 106 dilates the object mask of an object to avoid associated artifacts when removing the object).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Joachim’s generating the appearance image (the content fills 712) based on the input image 706 (the digital scene image) and the object image (708a-708d) into Stannus to have modified Stannus’s style image to be generated based on the scene image (the digital image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
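For illustration, mask dilation of the kind mapped above can be sketched with a standard morphological operation; the iteration count is an assumption, and this is not Joachim's disclosed implementation.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def dilate_mask(initial_mask: np.ndarray, iterations: int = 5) -> np.ndarray:
    """Expand a tight object mask outward so that subsequent removal or
    replacement does not leave boundary artifacts (cf. 'smart dilation')."""
    return binary_dilation(initial_mask.astype(bool), iterations=iterations)
```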
Re Claim 14:
Claim 14 encompasses the same scope of invention as claim 6 except for the additional claim limitation that the inference process of the controlled image generating machine learning model includes a two-pass rendering process, the first pass using a relatively large area of at least one of the at least one content image and the at least one appearance image, and the second pass following the first pass using respectively a relatively small area of at least one of the at least one content image and the at least one appearance image.
Joachim further teaches the claim limitation that the inference process of the controlled image generating machine learning model includes a two-pass rendering process, the first pass using a relatively large area of at least one of the at least one content image and the at least one appearance image, and the second pass following the first pass using respectively a relatively small area of at least one of the at least one content image and the at least one appearance image (Joachim teaches at Paragraph [0192] As mentioned above, in some implementations, full convolutional models suffer from slow growth of effective receptive field, especially at the early stage of the network. Accordingly, utilizing strided convolution within the encoder can generate invalid features inside the hole region, making the feature correction at decoding stage more challenging. Fast Fourier convolution (FFC) can assist early layers to achieve receptive field that covers an entire image. Conventional systems, however, have only utilized FFC at a bottleneck layer, which is computationally demanding. Moreover, the shallow bottleneck layer cannot capture global semantic features effectively. Accordingly, in one or more implementations the scene-based image editing system 106 replaces the convolutional block in the encoder with FFC for the encoder layers. FFC enables the encoder to propagate features at early stage and thus address the issue of generating invalid features inside the hole, which helps improve the results.
Joachim teaches at Paragraph [0510] that, FIG. 33 illustrates an overview of a shadow detection neural network 3300 in accordance with one or more embodiments. Indeed, as shown in FIG. 33, the shadow detection neural network 3300 analyzes an input image 3302 via a first stage 3304 and a second stage 3310. In particular, the first stage 3304 includes an instance segmentation component 3306 and an object awareness component 3308. Further, the second stage 3310 includes a shadow prediction component 3312. In one or more embodiments, the instance segmentation component 3306 includes the segmentation neural network 2604 of the neural network pipeline discussed above with reference to FIG. 26.
Joachim teaches at Paragraph [0514] that, for each detected object, the scene-based image editing system 106 generates input for the second stage of the shadow detection neural network (i.e., the shadow prediction component). FIG. 35 illustrates the object awareness component 3500 generating input 3506 for the object 3504a. Indeed, as shown in FIG. 35, the object awareness component 3500 generates the input 3506 using the input image 3508, the object mask 3510 corresponding to the object 3504a (referred to as the object-aware channel) and a combined object mask 3512 corresponding to the objects 3504b-3504c (referred to as the object-discriminative channel). For instance, in some implementations, the object awareness component 3500 combines (e.g., concatenates) the input image 3508, the object mask 3510, and the combined object mask 3512. The object awareness component 3500 similarly generates second stage input for the other objects 3504b-3504c as well (e.g., utilizing their respective object mask and combined object mask representing the other objects along with the input image 3508).
Joachim teaches at Paragraph [0515] that, the scene-based image editing system 106 (e.g., via the object awareness component 3500 or some other component of the shadow detection neural network) generates the combined object mask 3512 using the union of separate object masks generated for the object 3504b and the object 3504c. In some instances, the object awareness component 3500 does not utilize the object-discriminative channel (e.g., the combined object mask 3512). Rather, the object awareness component 3500 generates the input 3506 using the input image 3508 and the object mask 3510. In some embodiments, however, using the object-discriminative channel provides better shadow prediction in the second stage of the shadow detection neural network.
Joachim teaches at Paragraph [0521] that, the scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to determine shadows associated with objects portrayed in a digital image when the objects masks of the objects have already been generated. Indeed, FIG. 38 illustrates a diagram for using the second stage of the shadow detection neural network for determining shadows associated with objects portrayed in a digital image in accordance with one or more embodiments.
Joachim teaches at Paragraph [0522] that, as shown in FIG. 38, the scene-based image editing system 106 provides an input image 3804 to the second stage of a shadow detection neural network (i.e., a shadow prediction model 3802). Further, the scene-based image editing system 106 provides an object mask 3806 to the second stage. The scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to generate a shadow mask 3808 for the shadow of the object portrayed in the input image 3804, resulting in the association between the object and the shadow cast by the object within the input image 3804 (e.g., as illustrated in the visualization 3810).
Joachim teaches at Paragraph [0523] that, by providing direct access to the second stage of the shadow detection neural network, the scene-based image editing system 106 provides flexibility in the shadow detection process. Indeed, in some cases, an object mask will already have been created for an object portrayed in a digital image. For instance, in some cases, the scene-based image editing system 106 implements a separate segmentation neural network to generate an object mask for a digital image as part of a separate workflow. Accordingly, the object mask for the object already exists, and the scene-based image editing system 106 leverages the previous work in determining the shadow for the object. Thus, the scene-based image editing system 106 further provides efficiency as it avoids duplicating work by accessing the shadow prediction model of the shadow detection neural network directly.
Joachim teaches at Paragraph [0524] FIGS. 39A-39C illustrate a graphical user interface implemented by the scene-based image editing system 106 to identify and remove shadows of objects portrayed in a digital image in accordance with one or more embodiments. Indeed, as shown in FIG. 39A, the scene-based image editing system 106 provides, for display within a graphical user interface 3902 of a client device, a digital image 3906 portraying an object 3908. As further shown, the object 3908 casts a shadow 3910 within the digital image 3906.
Joachim teaches at Paragraph [0526] that, as previously discussed with reference to FIG. 26, in one or more embodiments, the scene-based image editing system 106 identifies shadows cast by objects within a digital image as part of a neural network pipeline for identifying distracting objects within the digital image. For instance, in some cases, the scene-based image editing system 106 utilizes a segmentation neural network to identify objects for a digital image, a distractor detection neural network to classify one or more of the objects as distracting objects, a shadow detection neural network to identify shadows and associate the shadows with their corresponding objects, and an inpainting neural network to generate content fills to replace objects (and their shadows) that are removed. In some cases, the scene-based image editing system 106 implements the neural network pipeline automatically in response to receiving a digital image).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Joachim’s generating the appearance image (the content fills 712) based on the input image 706 (the digital scene image) and the object image (708a-708d) into Stannus to have modified Stannus’s style image to be generated based on the scene image (the digital image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
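Purely as a sketch of the two-pass idea recited in claim 14 (a first pass over a relatively large area, then a second pass over a relatively small area), the cropping scheme below might be used; the model interface and padding values are assumptions, not Joachim's disclosed pipeline.

```python
import numpy as np

def two_pass_render(model, content: np.ndarray, appearance: np.ndarray,
                    mask: np.ndarray):
    """model(content, appearance, mask) is an assumed callable standing in
    for the controlled image generating machine learning model."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    h, w = mask.shape

    def crop(arr: np.ndarray, pad: int) -> np.ndarray:
        return arr[max(y0 - pad, 0):min(y1 + pad, h),
                   max(x0 - pad, 0):min(x1 + pad, w)]

    # First pass: relatively large area of the content/appearance images.
    coarse = model(crop(content, 128), crop(appearance, 128), crop(mask, 128))
    # Second pass: relatively small (tight) area around the masked region.
    fine = model(crop(content, 16), crop(appearance, 16), crop(mask, 16))
    return coarse, fine  # a full pipeline would paste coarse back first
```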
Re Claim 15:
Claim 15 encompasses the same scope of invention as claim 6 except for the additional claim limitation that the inference process of the controlled image generating machine learning model includes a two-pass rendering process, the first pass using a relatively large area of the at least one appearance image, and the second pass following the first pass using a relatively small area of the at least one appearance image.
Joachim further teaches the claim limitation that the inference process of the controlled image generating machine learning model includes a two-pass rendering process, the first pass using a relatively large area of the at least one appearance image, and the second pass following the first pass using a relatively small area of the at least one appearance image (Joachim teaches at Paragraph [0192] As mentioned above, in some implementations, full convolutional models suffer from slow growth of effective receptive field, especially at the early stage of the network. Accordingly, utilizing strided convolution within the encoder can generate invalid features inside the hole region, making the feature correction at decoding stage more challenging. Fast Fourier convolution (FFC) can assist early layers to achieve receptive field that covers an entire image. Conventional systems, however, have only utilized FFC at a bottleneck layer, which is computationally demanding. Moreover, the shallow bottleneck layer cannot capture global semantic features effectively. Accordingly, in one or more implementations the scene-based image editing system 106 replaces the convolutional block in the encoder with FFC for the encoder layers. FFC enables the encoder to propagate features at early stage and thus address the issue of generating invalid features inside the hole, which helps improve the results.
Joachim teaches at Paragraph [0510] that, FIG. 33 illustrates an overview of a shadow detection neural network 3300 in accordance with one or more embodiments. Indeed, as shown in FIG. 33, the shadow detection neural network 3300 analyzes an input image 3302 via a first stage 3304 and a second stage 3310. In particular, the first stage 3304 includes an instance segmentation component 3306 and an object awareness component 3308. Further, the second stage 3310 includes a shadow prediction component 3312. In one or more embodiments, the instance segmentation component 3306 includes the segmentation neural network 2604 of the neural network pipeline discussed above with reference to FIG. 26.
Joachim teaches at Paragraph [0514] that, for each detected object, the scene-based image editing system 106 generates input for the second stage of the shadow detection neural network (i.e., the shadow prediction component). FIG. 35 illustrates the object awareness component 3500 generating input 3506 for the object 3504a. Indeed, as shown in FIG. 35, the object awareness component 3500 generates the input 3506 using the input image 3508, the object mask 3510 corresponding to the object 3504a (referred to as the object-aware channel) and a combined object mask 3512 corresponding to the objects 3504b-3504c (referred to as the object-discriminative channel). For instance, in some implementations, the object awareness component 3500 combines (e.g., concatenates) the input image 3508, the object mask 3510, and the combined object mask 3512. The object awareness component 3500 similarly generates second stage input for the other objects 3504b-3504c as well (e.g., utilizing their respective object mask and combined object mask representing the other objects along with the input image 3508).
Joachim teaches at Paragraph [0515] that, the scene-based image editing system 106 (e.g., via the object awareness component 3500 or some other component of the shadow detection neural network) generates the combined object mask 3512 using the union of separate object masks generated for the object 3504b and the object 3504c. In some instances, the object awareness component 3500 does not utilize the object-discriminative channel (e.g., the combined object mask 3512). Rather, the object awareness component 3500 generates the input 3506 using the input image 3508 and the object mask 3510. In some embodiments, however, using the object-discriminative channel provides better shadow prediction in the second stage of the shadow detection neural network.
Joachim teaches at Paragraph [0521] that, the scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to determine shadows associated with objects portrayed in a digital image when the objects masks of the objects have already been generated. Indeed, FIG. 38 illustrates a diagram for using the second stage of the shadow detection neural network for determining shadows associated with objects portrayed in a digital image in accordance with one or more embodiments.
Joachim teaches at Paragraph [0522] that, as shown in FIG. 38, the scene-based image editing system 106 provides an input image 3804 to the second stage of a shadow detection neural network (i.e., a shadow prediction model 3802). Further, the scene-based image editing system 106 provides an object mask 3806 to the second stage. The scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to generate a shadow mask 3808 for the shadow of the object portrayed in the input image 3804, resulting in the association between the object and the shadow cast by the object within the input image 3804 (e.g., as illustrated in the visualization 3810).
Joachim teaches at Paragraph [0523] that, by providing direct access to the second stage of the shadow detection neural network, the scene-based image editing system 106 provides flexibility in the shadow detection process. Indeed, in some cases, an object mask will already have been created for an object portrayed in a digital image. For instance, in some cases, the scene-based image editing system 106 implements a separate segmentation neural network to generate an object mask for a digital image as part of a separate workflow. Accordingly, the object mask for the object already exists, and the scene-based image editing system 106 leverages the previous work in determining the shadow for the object. Thus, the scene-based image editing system 106 further provides efficiency as it avoids duplicating work by accessing the shadow prediction model of the shadow detection neural network directly.
Joachim teaches at Paragraph [0524] FIGS. 39A-39C illustrate a graphical user interface implemented by the scene-based image editing system 106 to identify and remove shadows of objects portrayed in a digital image in accordance with one or more embodiments. Indeed, as shown in FIG. 39A, the scene-based image editing system 106 provides, for display within a graphical user interface 3902 of a client device, a digital image 3906 portraying an object 3908. As further shown, the object 3908 casts a shadow 3910 within the digital image 3906.
Joachim teaches at Paragraph [0526] that, as previously discussed with reference to FIG. 26, in one or more embodiments, the scene-based image editing system 106 identifies shadows cast by objects within a digital image as part of a neural network pipeline for identifying distracting objects within the digital image. For instance, in some cases, the scene-based image editing system 106 utilizes a segmentation neural network to identify objects for a digital image, a distractor detection neural network to classify one or more of the objects as distracting objects, a shadow detection neural network to identify shadows and associate the shadows with their corresponding objects, and an inpainting neural network to generate content fills to replace objects (and their shadows) that are removed. In some cases, the scene-based image editing system 106 implements the neural network pipeline automatically in response to receiving a digital image).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Joachim’s generation of the appearance image (the content fills 712) based on the input image 706 (the digital scene image) and the object image (708a-708d) into Stannus, such that Stannus’s style image is generated based on the scene image (the digital image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 16:
Claim 16 encompasses the same scope of invention as claim 15 except for the additional claim limitation that in the first pass a relatively large area of the at least one content image is used, and in the second pass a relatively small area of the at least one content image is used.
Joachim further teaches the claim limitation that in the first pass a relatively large area of the at least one content image is used, and in the second pass a relatively small area of the at least one content image is used (
First pass of the machine learning model:
Joachim teaches at Paragraph [0192] As mentioned above, in some implementations, full convolutional models suffer from slow growth of effective receptive field, especially at the early stage of the network. Accordingly, utilizing strided convolution within the encoder can generate invalid features inside the hole region, making the feature correction at decoding stage more challenging. Fast Fourier convolution (FFC) can assist early layers to achieve receptive field that covers an entire image. Conventional systems, however, have only utilized FFC at a bottleneck layer, which is computationally demanding. Moreover, the shallow bottleneck layer cannot capture global semantic features effectively. Accordingly, in one or more implementations the scene-based image editing system 106 replaces the convolutional block in the encoder with FFC for the encoder layers. FFC enables the encoder to propagate features at early stage and thus address the issue of generating invalid features inside the hole, which helps improve the results.
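By way of illustration only, the whole-image receptive field attributed to Fast Fourier convolution above can be sketched as a pointwise convolution applied to the image's Fourier spectrum; mixing in the frequency domain lets a single layer see every spatial position. This is an assumption-laden sketch of the spectral branch only (real FFC blocks also carry a local convolution branch), not Joachim's network:

```python
# Hypothetical sketch of the spectral branch of a Fast Fourier convolution.
import torch
import torch.nn as nn

class SpectralBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution over stacked real/imaginary parts of the spectrum.
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")          # complex spectrum
        spec = torch.cat([spec.real, spec.imag], dim=1)  # (b, 2c, h, w//2+1)
        spec = self.conv(spec)                           # global mixing
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag),
                                s=(h, w), norm="ortho")
```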
Joachim teaches at Paragraph [0510] that, FIG. 33 illustrates an overview of a shadow detection neural network 3300 in accordance with one or more embodiments. Indeed, as shown in FIG. 33, the shadow detection neural network 3300 analyzes an input image 3302 via a first stage 3304 and a second stage 3310. In particular, the first stage 3304 includes an instance segmentation component 3306 and an object awareness component 3308. Further, the second stage 3310 includes a shadow prediction component 3312. In one or more embodiments, the instance segmentation component 3306 includes the segmentation neural network 2604 of the neural network pipeline discussed above with reference to FIG. 26.
Joachim teaches at Paragraph [0173] In one or more embodiments, the scene-based image editing system 106 implements object-aware image editing by generating a content fill for each object portrayed in a digital image (e.g., for each object mask corresponding to portrayed objects) utilizing a hole-filling model. In particular, in some cases, the scene-based image editing system 106 utilizes a machine learning model, such as a content-aware hole-filling machine learning model to generate the content fill(s) for each foreground object. FIGS. 4-6 illustrate a content-aware hole-filling machine learning model utilized by the scene-based image editing system 106 to generate content fills for objects in accordance with one or more embodiments.
Joachim teaches at Paragraph [0174] In one or more embodiments, a content fill includes a set of pixels generated to replace another set of pixels of a digital image. Indeed, in some embodiments, a content fill includes a set of replacement pixels for replacing another set of pixels. For instance, in some embodiments, a content fill includes a set of pixels generated to fill a hole (e.g., a content void) that remains after (or if) a set of pixels (e.g., a set of pixels portraying an object) has been removed from or moved within a digital image. In some cases, a content fill corresponds to a background of a digital image. To illustrate, in some implementations, a content fill includes a set of pixels generated to blend in with a portion of a background proximate to an object that could be moved/removed. In some cases, a content fill includes an inpainting segment, such as an inpainting segment generated from other pixels (e.g., other background pixels) within the digital image. In some cases, a content fill includes other content (e.g., arbitrarily selected content or content selected by a user) to fill in a hole or replace another set of pixels.
Joachim teaches at Paragraph [0175] that, a content-aware hole-filling machine learning model includes a computer-implemented machine learning model that generates content fill. In particular, in some embodiments, a content-aware hole-filling machine learning model includes a computer-implemented machine learning model that generates content fills for replacement regions in a digital image. For instance, in some cases, the scene-based image editing system 106 determines that an object has been moved within or removed from a digital image and utilizes a content-aware hole-filling machine learning model to generate a content fill for the hole that has been exposed as a result of the move/removal in response. As will be discussed in more detail, however, in some implementations, the scene-based image editing system 106 anticipates movement or removal of an object and utilizes a content-aware hole-filling machine learning model to pre-generate a content fill for that object. In some cases, a content-aware hole-filling machine learning model includes a neural network, such as an inpainting neural network (e.g., a neural network that generates a content fill, more specifically an inpainting segment, using other pixels of the digital image). In other words, the scene-based image editing system 106 utilizes a content-aware hole-filling machine learning model in various implementations to provide content at a location of a digital image that does not initially portray such content (e.g., due to the location being occupied by another semantic area, such as an object).
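For illustration only, the compositing step implied by the description above (replacement pixels filling the hole left by a moved or removed object) might be sketched as follows; `inpaint` is a hypothetical stand-in for any content-aware hole-filling model, not an API from Joachim:

```python
# Hypothetical sketch: composite a generated content fill into the
# replacement region of a digital image via a binary hole mask.
import numpy as np

def apply_content_fill(image, hole_mask, inpaint):
    """image: HxWx3 array; hole_mask: HxW, 1 inside the replacement region."""
    fill = inpaint(image, hole_mask)              # HxWx3 replacement pixels
    m = hole_mask[..., None].astype(image.dtype)
    # Keep original pixels outside the hole; use generated pixels inside it.
    return image * (1.0 - m) + fill * m
```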
Joachim teaches at Paragraph [0176] FIG. 4 illustrates the scene-based image editing system 106 utilizing a content-aware machine learning model, such as a cascaded modulation inpainting neural network 420, to generate an inpainted digital image 408 from a digital image 402 with a replacement region 404 in accordance with one or more embodiments.
Joachim teaches at Paragraph [0177] that, the replacement region 404 includes an area corresponding to an object (and a hole that would be present if the object were moved or deleted). In some embodiments, the scene-based image editing system 106 identifies the replacement region 404 based on user selection of pixels (e.g., pixels portraying an object) to move, remove, cover, or replace from a digital image. To illustrate, in some cases, a client device selects an object portrayed in a digital image. Accordingly, the scene-based image editing system 106 deletes or removes the object and generates replacement pixels. In some cases, the scene-based image editing system 106 identifies the replacement region 404 by generating an object mask via a segmentation neural network. For instance, the scene-based image editing system 106 utilizes a segmentation neural network (e.g., the detection-masking neural network 300 discussed above with reference to FIG. 3) to detect objects within a digital image and generate object masks for the objects. Thus, in some implementations, the scene-based image editing system 106 generates content fill for the replacement region 404 before receiving user input to move, remove, cover, or replace the pixels initially occupying the replacement region 404.
Second pass of the machine learning model:
Joachim teaches at Paragraph [0514] that, for each detected object, the scene-based image editing system 106 generates input for the second stage of the shadow detection neural network (i.e., the shadow prediction component). FIG. 35 illustrates the object awareness component 3500 generating input 3506 for the object 3504a. Indeed, as shown in FIG. 35, the object awareness component 3500 generates the input 3506 using the input image 3508, the object mask 3510 corresponding to the object 3504a (referred to as the object-aware channel) and a combined object mask 3512 corresponding to the objects 3504b-3504c (referred to as the object-discriminative channel). For instance, in some implementations, the object awareness component 3500 combines (e.g., concatenates) the input image 3508, the object mask 3510, and the combined object mask 3512. The object awareness component 3500 similarly generates second stage input for the other objects 3504b-3504c as well (e.g., utilizing their respective object mask and combined object mask representing the other objects along with the input image 3508).
Joachim teaches at Paragraph [0515] that, the scene-based image editing system 106 (e.g., via the object awareness component 3500 or some other component of the shadow detection neural network) generates the combined object mask 3512 using the union of separate object masks generated for the object 3504b and the object 3504c. In some instances, the object awareness component 3500 does not utilize the object-discriminative channel (e.g., the combined object mask 3512). Rather, the object awareness component 3500 generates the input 3506 using the input image 3508 and the object mask 3510. In some embodiments, however, using the object-discriminative channel provides better shadow prediction in the second stage of the shadow detection neural network.
Joachim teaches at Paragraph [0521] that, the scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to determine shadows associated with objects portrayed in a digital image when the object masks of the objects have already been generated. Indeed, FIG. 38 illustrates a diagram for using the second stage of the shadow detection neural network for determining shadows associated with objects portrayed in a digital image in accordance with one or more embodiments.
Joachim teaches at Paragraph [0522] that, as shown in FIG. 38, the scene-based image editing system 106 provides an input image 3804 to the second stage of a shadow detection neural network (i.e., a shadow prediction model 3802). Further, the scene-based image editing system 106 provides an object mask 3806 to the second stage. The scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to generate a shadow mask 3808 for the shadow of the object portrayed in the input image 3804, resulting in the association between the object and the shadow cast by the object within the input image 3804 (e.g., as illustrated in the visualization 3810).
Joachim teaches at Paragraph [0523] that, by providing direct access to the second stage of the shadow detection neural network, the scene-based image editing system 106 provides flexibility in the shadow detection process. Indeed, in some cases, an object mask will already have been created for an object portrayed in a digital image. For instance, in some cases, the scene-based image editing system 106 implements a separate segmentation neural network to generate an object mask for a digital image as part of a separate workflow. Accordingly, the object mask for the object already exists, and the scene-based image editing system 106 leverages the previous work in determining the shadow for the object. Thus, the scene-based image editing system 106 further provides efficiency as it avoids duplicating work by accessing the shadow prediction model of the shadow detection neural network directly.
Joachim teaches at Paragraph [0524] FIGS. 39A-39C illustrate a graphical user interface implemented by the scene-based image editing system 106 to identify and remove shadows of objects portrayed in a digital image in accordance with one or more embodiments. Indeed, as shown in FIG. 39A, the scene-based image editing system 106 provides, for display within a graphical user interface 3902 of a client device, a digital image 3906 portraying an object 3908. As further shown, the object 3908 casts a shadow 3910 within the digital image 3906.
Joachim teaches at Paragraph [0526] that, as previously discussed with reference to FIG. 26, in one or more embodiments, the scene-based image editing system 106 identifies shadows cast by objects within a digital image as part of a neural network pipeline for identifying distracting objects within the digital image. For instance, in some cases, the scene-based image editing system 106 utilizes a segmentation neural network to identify objects for a digital image, a distractor detection neural network to classify one or more of the objects as distracting objects, a shadow detection neural network to identify shadows and associate the shadows with their corresponding objects, and an inpainting neural network to generate content fills to replace objects (and their shadows) that are removed. In some cases, the scene-based image editing system 106 implements the neural network pipeline automatically in response to receiving a digital image).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Joachim’s generation of the appearance image (the content fills 712) based on the input image 706 (the digital scene image) and the object image (708a-708d) into Stannus, such that Stannus’s style image is generated based on the scene image (the digital image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 17:
Claim 17 encompasses the same scope of invention as claim 15 except for the additional claim limitation that the first pass incorporates lighting and colour characteristics from the scene image, and the second pass refines the image to a higher resolution.
Joachim further teaches the claim limitation that the first pass incorporates lighting and colour characteristics from the scene image, and the second pass refines the image to a higher resolution (
First pass of the machine learning model:
Joachim teaches at Paragraph [0192] As mentioned above, in some implementations, full convolutional models suffer from slow growth of effective receptive field, especially at the early stage of the network. Accordingly, utilizing strided convolution within the encoder can generate invalid features inside the hole region, making the feature correction at decoding stage more challenging. Fast Fourier convolution (FFC) can assist early layers to achieve receptive field that covers an entire image. Conventional systems, however, have only utilized FFC at a bottleneck layer, which is computationally demanding. Moreover, the shallow bottleneck layer cannot capture global semantic features effectively. Accordingly, in one or more implementations the scene-based image editing system 106 replaces the convolutional block in the encoder with FFC for the encoder layers. FFC enables the encoder to propagate features at early stage and thus address the issue of generating invalid features inside the hole, which helps improve the results.
Joachim teaches at Paragraph [0510] that, FIG. 33 illustrates an overview of a shadow detection neural network 3300 in accordance with one or more embodiments. Indeed, as shown in FIG. 33, the shadow detection neural network 3300 analyzes an input image 3302 via a first stage 3304 and a second stage 3310. In particular, the first stage 3304 includes an instance segmentation component 3306 and an object awareness component 3308. Further, the second stage 3310 includes a shadow prediction component 3312. In one or more embodiments, the instance segmentation component 3306 includes the segmentation neural network 2604 of the neural network pipeline discussed above with reference to FIG. 26.
Joachim teaches at Paragraph [0173] In one or more embodiments, the scene-based image editing system 106 implements object-aware image editing by generating a content fill for each object portrayed in a digital image (e.g., for each object mask corresponding to portrayed objects) utilizing a hole-filling model. In particular, in some cases, the scene-based image editing system 106 utilizes a machine learning model, such as a content-aware hole-filling machine learning model to generate the content fill(s) for each foreground object. FIGS. 4-6 illustrate a content-aware hole-filling machine learning model utilized by the scene-based image editing system 106 to generate content fills for objects in accordance with one or more embodiments.
Joachim teaches at Paragraph [0174] In one or more embodiments, a content fill includes a set of pixels generated to replace another set of pixels of a digital image. Indeed, in some embodiments, a content fill includes a set of replacement pixels for replacing another set of pixels. For instance, in some embodiments, a content fill includes a set of pixels generated to fill a hole (e.g., a content void) that remains after (or if) a set of pixels (e.g., a set of pixels portraying an object) has been removed from or moved within a digital image. In some cases, a content fill corresponds to a background of a digital image. To illustrate, in some implementations, a content fill includes a set of pixels generated to blend in with a portion of a background proximate to an object that could be moved/removed. In some cases, a content fill includes an inpainting segment, such as an inpainting segment generated from other pixels (e.g., other background pixels) within the digital image. In some cases, a content fill includes other content (e.g., arbitrarily selected content or content selected by a user) to fill in a hole or replace another set of pixels.
Joachim teaches at Paragraph [0175] that, a content-aware hole-filling machine learning model includes a computer-implemented machine learning model that generates content fill. In particular, in some embodiments, a content-aware hole-filling machine learning model includes a computer-implemented machine learning model that generates content fills for replacement regions in a digital image. For instance, in some cases, the scene-based image editing system 106 determines that an object has been moved within or removed from a digital image and utilizes a content-aware hole-filling machine learning model to generate a content fill for the hole that has been exposed as a result of the move/removal in response. As will be discussed in more detail, however, in some implementations, the scene-based image editing system 106 anticipates movement or removal of an object and utilizes a content-aware hole-filling machine learning model to pre-generate a content fill for that object. In some cases, a content-aware hole-filling machine learning model includes a neural network, such as an inpainting neural network (e.g., a neural network that generates a content fill, more specifically an inpainting segment, using other pixels of the digital image). In other words, the scene-based image editing system 106 utilizes a content-aware hole-filling machine learning model in various implementations to provide content at a location of a digital image that does not initially portray such content (e.g., due to the location being occupied by another semantic area, such as an object).
Joachim teaches at Paragraph [0176] FIG. 4 illustrates the scene-based image editing system 106 utilizing a content-aware machine learning model, such as a cascaded modulation inpainting neural network 420, to generate an inpainted digital image 408 from a digital image 402 with a replacement region 404 in accordance with one or more embodiments.
Joachim teaches at Paragraph [0177] that, the replacement region 404 includes an area corresponding to an object (and a hole that would be present if the object were moved or deleted). In some embodiments, the scene-based image editing system 106 identifies the replacement region 404 based on user selection of pixels (e.g., pixels portraying an object) to move, remove, cover, or replace from a digital image. To illustrate, in some cases, a client device selects an object portrayed in a digital image. Accordingly, the scene-based image editing system 106 deletes or removes the object and generates replacement pixels. In some cases, the scene-based image editing system 106 identifies the replacement region 404 by generating an object mask via a segmentation neural network. For instance, the scene-based image editing system 106 utilizes a segmentation neural network (e.g., the detection-masking neural network 300 discussed above with reference to FIG. 3) to detect objects within a digital image and generate object masks for the objects. Thus, in some implementations, the scene-based image editing system 106 generates content fill for the replacement region 404 before receiving user input to move, remove, cover, or replace the pixels initially occupying the replacement region 404.
Second pass of the machine learning model:
Joachim teaches at Paragraph [0514] that, for each detected object, the scene-based image editing system 106 generates input for the second stage of the shadow detection neural network (i.e., the shadow prediction component). FIG. 35 illustrates the object awareness component 3500 generating input 3506 for the object 3504a. Indeed, as shown in FIG. 35, the object awareness component 3500 generates the input 3506 using the input image 3508, the object mask 3510 corresponding to the object 3504a (referred to as the object-aware channel) and a combined object mask 3512 corresponding to the objects 3504b-3504c (referred to as the object-discriminative channel). For instance, in some implementations, the object awareness component 3500 combines (e.g., concatenates) the input image 3508, the object mask 3510, and the combined object mask 3512. The object awareness component 3500 similarly generates second stage input for the other objects 3504b-3504c as well (e.g., utilizing their respective object mask and combined object mask representing the other objects along with the input image 3508).
Joachim teaches at Paragraph [0515] that, the scene-based image editing system 106 (e.g., via the object awareness component 3500 or some other component of the shadow detection neural network) generates the combined object mask 3512 using the union of separate object masks generated for the object 3504b and the object 3504c. In some instances, the object awareness component 3500 does not utilize the object-discriminative channel (e.g., the combined object mask 3512). Rather, the object awareness component 3500 generates the input 3506 using the input image 3508 and the object mask 3510. In some embodiments, however, using the object-discriminative channel provides better shadow prediction in the second stage of the shadow detection neural network.
Joachim teaches at Paragraph [0521] that, the scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to determine shadows associated with objects portrayed in a digital image when the object masks of the objects have already been generated. Indeed, FIG. 38 illustrates a diagram for using the second stage of the shadow detection neural network for determining shadows associated with objects portrayed in a digital image in accordance with one or more embodiments.
Joachim teaches at Paragraph [0522] that, as shown in FIG. 38, the scene-based image editing system 106 provides an input image 3804 to the second stage of a shadow detection neural network (i.e., a shadow prediction model 3802). Further, the scene-based image editing system 106 provides an object mask 3806 to the second stage. The scene-based image editing system 106 utilizes the second stage of the shadow detection neural network to generate a shadow mask 3808 for the shadow of the object portrayed in the input image 3804, resulting in the association between the object and the shadow cast by the object within the input image 3804 (e.g., as illustrated in the visualization 3810).
Joachim teaches at Paragraph [0523] that, by providing direct access to the second stage of the shadow detection neural network, the scene-based image editing system 106 provides flexibility in the shadow detection process. Indeed, in some cases, an object mask will already have been created for an object portrayed in a digital image. For instance, in some cases, the scene-based image editing system 106 implements a separate segmentation neural network to generate an object mask for a digital image as part of a separate workflow. Accordingly, the object mask for the object already exists, and the scene-based image editing system 106 leverages the previous work in determining the shadow for the object. Thus, the scene-based image editing system 106 further provides efficiency as it avoids duplicating work by accessing the shadow prediction model of the shadow detection neural network directly.
Joachim teaches at Paragraph [0524] FIGS. 39A-39C illustrate a graphical user interface implemented by the scene-based image editing system 106 to identify and remove shadows of objects portrayed in a digital image in accordance with one or more embodiments. Indeed, as shown in FIG. 39A, the scene-based image editing system 106 provides, for display within a graphical user interface 3902 of a client device, a digital image 3906 portraying an object 3908. As further shown, the object 3908 casts a shadow 3910 within the digital image 3906.
Joachim teaches at Paragraph [0526] that, as previously discussed with reference to FIG. 26, in one or more embodiments, the scene-based image editing system 106 identifies shadows cast by objects within a digital image as part of a neural network pipeline for identifying distracting objects within the digital image. For instance, in some cases, the scene-based image editing system 106 utilizes a segmentation neural network to identify objects for a digital image, a distractor detection neural network to classify one or more of the objects as distracting objects, a shadow detection neural network to identify shadows and associate the shadows with their corresponding objects, and an inpainting neural network to generate content fills to replace objects (and their shadows) that are removed. In some cases, the scene-based image editing system 106 implements the neural network pipeline automatically in response to receiving a digital image).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Joachim’s generation of the appearance image (the content fills 712) based on the input image 706 (the digital scene image) and the object image (708a-708d) into Stannus, such that Stannus’s style image is generated based on the scene image (the digital image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 18:
Claim 18 encompasses the same scope of invention as claim 6 except for the additional claim limitation that the controlled image generating machine learning model comprises a diffusion model.
Smith further teaches the claim limitation that the controlled image generating machine learning model comprises a diffusion model (Smith teaches at Paragraph 0520 that the scene-based image editing system 106 can utilize a diffusion model (or other generative machine learning model) to complete a semantic map to progressively generate an image layout. Then, using the map as guidance, the scene-based image editing system can generate an RGB image with authenticity and at Paragraph 0523 that the scene-based image editing system 106 can iteratively utilize a diffusion model to generate progressively more accurate semantic maps and/or completed digital images. In this manner, the scene-based image editing system iteratively improves the accuracy of the resulting maps and images to generate a realistic/authentic result.
Smith teaches at Paragraph 0541 that the scene-based image editing system 106 conditions a diffusion neural network to more accurately generate infill semantic maps and accurately resulting digital images. Thus, as illustrated in FIG. 40, the scene-based image editing system 106 can accurately expand digital images or perform other modifications, such as infilling or modifying object/scene textures.
[0543] In addition to the above, the scene-based image editing system 106 also improves upon functional flexibility. For example, the scene-based image editing system 106 allows client devices to expand the frame of digital images, remove objects portrayed within a digital image, or modify textures within portions of a digital image. Moreover, the scene-based image editing system 106 provides improved functionality with user interfaces that allow for unique control over implementing models, such as the number of infill segmentation maps to generate, the number of modified images to generate, the number of layers to utilize (e.g., within a diffusion neural network), textures for conditioning the generative model, and semantic editing input to guide generation of infill semantic maps and/or modified digital images.
[0544] As discussed above, the scene-based image editing system 106 utilizes various types of machine learning models. For example, FIG. 41A illustrates the scene-based image editing system 106 utilizing a diffusion neural network (also referred to as “diffusion probabilistic model” or “denoising diffusion probabilistic model”) to generate an infill semantic map in accordance with one or more embodiments. In particular, FIG. 41A illustrates the diffusion neural network generating an infill semantic map 4124 while the subsequent figure (FIG. 42) illustrates the diffusion neural network generating the modified digital image conditioned on the infill semantic map 4124. For example, in one or more embodiments, the scene-based image editing system 106 utilizes a diffusion model (or diffusion neural network) as described by J. Ho, A. Jain, P. Abbeel, Denoising Diffusion Probabilistic Models, arXiv:2006.11239, or by Jiaming Song, et al., Denoising Diffusion Implicit Models, ICLR 2021, which are incorporated by reference in their entirety herein.
[0545] As mentioned above, the scene-based image editing system 106 utilizes a diffusion neural network. In particular, a diffusion neural network receives as input a digital image and adds noise to the digital image through a series of steps. For instance, the scene-based image editing system 106 via the diffusion neural network maps a digital image to a latent space utilizing a fixed Markov chain that adds noise to the data of the digital image. Furthermore, each step of the fixed Markov chain relies upon the previous step. Specifically, at each step, the fixed Markov chain adds Gaussian noise with variance, which produces a diffusion representation (e.g., diffusion latent vector, a diffusion noise map, or a diffusion inversion). Subsequent to adding noise to the digital image at various steps of the diffusion neural network, the scene-based image editing system 106 utilizes a trained denoising neural network to recover the original data from the digital image. Specifically, the scene-based image editing system 106 utilizes a denoising neural network with a length T equal to the length of the fixed Markov chain to reverse the process of the fixed Markov chain.
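For reference, the fixed Markov chain described above is conventionally written as follows in the Ho et al. paper cited at Paragraph [0544]; each step adds Gaussian noise according to a variance schedule beta_t, and the noisy representation at any step t is available in closed form:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right),
\quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).
```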
[0546] As mentioned earlier, in one or more embodiments the scene-based image editing system 106 generates an infill semantic map. FIG. 41A illustrates the scene-based image editing system 106 training a diffusion neural network to generate an infill semantic map 4124. In particular, FIG. 41A illustrates the scene-based image editing system 106 analyzing an input infill semantic map 4102 to generate the infill semantic map 4124 (e.g., a reconstruction of the input infill semantic map 4102). Specifically, the scene-based image editing system 106 utilizes the diffusion process during training to generate various diffusion representations, culminating in a final diffusion representation that is passed to the denoising network. The scene-based image editing system 106, during training, supervises the output of each denoising neural network layer based on the diffusion representations generated during the diffusion process.
[0547] As illustrated, FIG. 41A shows the scene-based image editing system 106 utilizing the encoder 4104 to generate a latent vector 4106 from the input infill semantic map 4102. In one or more embodiments, the encoder 4104 is a neural network (or one or more layers of a neural network) that extracts features relating to the semantic map 4102, e.g., in this instance relating to objects (human sub-portions) depicted within the input infill semantic map 4102. In some cases, the encoder 4104 includes a neural network that encodes features from the input infill semantic map 4102. For example, the encoder 4104 can include a particular number of layers including one or more fully connected and/or partially connected layers that identify and represent characteristics/features of the input infill semantic map 4102 through a latent feature vector. Thus, the latent vector 4106 includes a hidden (e.g., indecipherable to humans) vector representation of the semantic map 4102. Specifically, the latent vector 4106 includes a numerical representation of features of the semantic map 4102.
[0548] Furthermore, FIG. 41A illustrates the diffusion process 4108 of the diffusion neural network. In particular, FIG. 41A shows a diffusion of the latent vector 4106. At each step (based on the fixed Markov chain) of the diffusion process 4108, the scene-based image editing system 106 via the diffusion neural network generates a diffusion representation. For instance, the diffusion process 4108 adds noise to the diffusion representation at each step until the diffusion representation is diffused, destroyed, or replaced. Specifically, the scene-based image editing system 106 via the diffusion process 4108 adds Gaussian noise to the signal of the latent vector utilizing a fixed Markov Chain. The scene-based image editing system 106 can adjust the number of diffusion steps in the diffusion process 4108 (and the number of corresponding denoising layers in the denoising steps). Moreover, although FIG. 41A illustrates performing the diffusion process 4108 with the latent vector 4106, in some embodiments, the scene-based image editing system 106 applies the diffusion process 4108 to pixels of the input infill semantic map 4102 (without generating a latent vector representation of the input infill semantic map 4102).
[0549] As just mentioned, the diffusion process 4108 adds noise at each step of the diffusion process 4108. Indeed, at each diffusion step, the diffusion process 4108 adds noise and generates a diffusion representation. Thus, for a diffusion process 4108 with five diffusion steps, the diffusion process 4108 generates five diffusion representations. As shown in FIG. 41A, the scene-based image editing system 106 generates a final diffusion representation 4110. In particular, in FIG. 41A the final diffusion representation 4110 comprises random Gaussian noise after the completion of the diffusion process. As part of the diffusion neural network, the denoising neural network 4112a denoises the final diffusion representation 4110 (e.g., reverses the process of adding noise to the diffusion representation performed by the diffusion process 4108).
[0550] As shown, FIG. 41A illustrates the denoising neural network 4112a partially denoising the final diffusion representation 4110 to generate a first denoised representation 4114. Furthermore, FIG. 41A also illustrates a denoising neural network 4112b receiving the first denoised representation 4114 for further denoising to generate the second denoised representation 4116. In particular, in one or more embodiments the number of denoising steps corresponds with the number of diffusion steps (e.g., of the fixed Markov chain). Furthermore, FIG. 41A illustrates the scene-based image editing system 106 processing the second denoised representation 4116 with a decoder 4118 to generate the infill semantic map 4124.
[0551] In one or more implementations, the scene-based image editing system 106 trains the denoising neural networks in a supervised manner based on the diffusion representations generated during the diffusion process 4108. For example, the scene-based image editing system 106 compares (utilizing a loss function) a diffusion representation at a first step of the diffusion process 4108 with a final denoised representation generated by the final denoising neural network. Similarly, the scene-based image editing system 106 can compare (utilizing a loss function) a second diffusion representation from a second step of the diffusion process 4108 with a penultimate denoised representation generated by a penultimate denoising neural network. The scene-based image editing system 106 can thus utilize corresponding diffusion representations of the diffusion process 4108 to teach or train the denoising neural networks to denoise random Gaussian noise and generate realistic digital images.
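By way of illustration only, the supervision described above can be sketched as the conventional denoising-diffusion training step from the cited Ho et al. reference: noise a clean input with the closed-form forward process, then penalize the denoiser's error against the known injected noise. This is an assumption-laden sketch, not Smith's code; `denoiser` is a hypothetical model:

```python
# Hypothetical sketch of one supervised training step for the denoising
# neural networks, using the closed-form forward diffusion process.
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, x0, alphas_bar, optimizer):
    """x0: (B, C, H, W) clean inputs; alphas_bar: 1-D tensor of cumulative
    products of (1 - beta_t) over the diffusion steps."""
    t = torch.randint(0, len(alphas_bar), (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward diffusion
    loss = F.mse_loss(denoiser(xt, t), noise)     # supervise the denoiser
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```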
[0560] As mentioned above, FIG. 42 shows the scene-based image editing system 106 utilizing denoising neural networks to generate a modified digital image 4224 conditioned by an infill semantic map 4221 (e.g., a completed semantic map). Similar to the discussion above relating to FIG. 41, the scene-based image editing system 106 utilizes the diffusion neural network for training purposes of generating the modified digital image 4224. In particular, during training the scene-based image editing system 106 utilizes an encoder to analyze an input digital image (instead of the semantic map) and generate a latent vector. Further, the scene-based image editing system 106 utilizes a diffusion process to process the latent vector and generate diffusion representations at each step (depending on the length of the fixed Markov chain). Moreover, the scene-based image editing system 106 generates a final diffusion representation for the input digital image (e.g., the expected output during training). The scene-based image editing system 106 trains by comparing diffusion representations generated by the diffusion process with corresponding denoised representations generated by the denoising neural network layers.
[0561] As shown, FIG. 42 illustrates the scene-based image editing system 106 utilizing a trained diffusion network to generate a complete digital image. In particular, FIG. 42 shows the scene-based image editing system 106 utilizing the infill semantic map 4221 and a digital image (e.g., the digital image 4002 discussed in FIG. 40) as a conditioning input to the denoising neural networks. Further, the scene-based image editing system 106 utilizes a binary mask to indicate to the denoising neural networks a region to infill. The scene-based image editing system 106 can also utilize additional editing inputs to the infill semantic map 4221. In particular, the scene-based image editing system 106 provides an option to a user of a client device to provide color or texture patches as conditioning input. Based on the provided color or texture patches, the scene-based image editing system 106 then utilizes the infill semantic map 4221 and the digital image with the color or texture patches as a conditioning input to condition each layer of the denoising neural networks.
[0562] As further shown, FIG. 42 illustrates the scene-based image editing system 106 utilizing a random noise input 4210 with the denoising neural network 4212a (e.g., a first denoising layer). Similar to the discussion in FIG. 41B, the scene-based image editing system 106 utilizes the denoising neural network 4212a to reverse the diffusion process and generate a modified digital image 4224. FIG. 42 also shows the denoising neural network 4212a generating a denoised representation 4214 and utilizing a denoising neural network 4212b to further denoise the denoised representation (denoising here also corresponds with the number of steps in the diffusion process).
[0563] Moreover, FIG. 42 shows the scene-based image editing system 106 performing an act 4220. For example, the act 4220 includes conditioning the layers of the denoising neural network 4212a and the denoising neural network 4212b. Specifically, FIG. 42 shows the scene-based image editing system 106 performing the act 4220 of conditioning with the infill semantic map 4221. FIG. 42 illustrates an encoder 4222 analyzing the infill semantic map 4221 (generated in FIG. 41B), the digital image, and a binary mask. By conditioning the layers of the denoising neural network 4212a and the denoising neural network 4212b with the infill semantic map 4221 and the digital image, the scene-based image editing system 106 accurately generates the modified digital image 4224.
[0564] As just mentioned, the scene-based image editing system 106 generates the modified digital image 4224 by conditioning layers of the denoising neural networks. In particular, FIG. 42 illustrates the scene-based image editing system 106 via a decoder 4218 receiving the second denoised representation 4216 and generating the modified digital image 4224. Specifically, the modified digital image 4224 accurately depicts the legs of the human from the knees down. Accordingly, FIG. 42 illustrates the scene-based image editing system 106 via the diffusion neural network generating an infilled digital image corresponding to an expanded frame of the digital image 4002 shown in FIG. 40.
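For illustration only, the conditioning described for FIG. 42 (each denoising layer guided by the infill semantic map, the digital image, and the binary mask) can be sketched as a channel-wise concatenation of the conditioning inputs with the noisy representation; the names below are hypothetical, not Smith's API:

```python
# Hypothetical sketch: condition a denoising step on the infill semantic
# map, the digital image, and a binary mask by channel-wise concatenation.
import torch

def conditioned_denoise_step(denoiser, xt, t, semantic_map, image, mask):
    cond = torch.cat([semantic_map, image, mask], dim=1)  # conditioning
    return denoiser(torch.cat([xt, cond], dim=1), t)      # denoised output
```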
[0565] As also mentioned above, in one or more implementations the scene-based image editing system 106 generates a modified digital image utilizing an input texture. For example, FIG. 43 illustrates utilizing an input texture to generate a modified digital image utilizing a diffusion neural network, in accordance with one or more embodiments. Like FIGS. 41A and 42, the scene-based image editing system 106 trains the diffusion neural network. In particular, like the above discussion, during training the scene-based image editing system 106 utilizes the expected output as an input into the diffusion neural networks and utilizes diffusion representations to supervise training of the denoising layers of the neural network.
[0566] The scene-based image editing system 106 receives an indication to replace pixel values within a digital image 4315 and an input texture 4311 utilizing a trained denoising neural network. In particular, the scene-based image editing system 106 replaces the indicated pixel values with the input texture 4311. Specifically, the input texture 4311 includes a sample texture that portrays a pattern selected by a user of a client device. To illustrate, the scene-based image editing system 106 modifies an input digital image 4315 with the specifically selected pattern by localizing the input texture 4311 to the relevant region of the input digital image 4315. For example, the scene-based image editing system 106 receives the input texture 4311 from a client device or from a selection of a pre-defined texture option.
[0567] In one or more embodiments, the scene-based image editing system 106 utilizes a diffusion neural network to generate the modified digital image 4316 that includes the input texture 4311. Similar to the discussion above, the scene-based image editing system 106 also utilizes diffusion neural networks to replace a region in the digital image 4315 with the input texture 4311. In some implementations, the scene-based image editing system 106 isolates texture modifications to certain portions of the digital image within the diffusion neural network. In particular, FIG. 43 illustrates the scene-based image editing system 106 generating a mask for the relevant portion of the digital image 4315 to replace with the input texture 4311. For instance, in FIG. 43 the relevant portion includes the dress of the human portrayed in the digital image 4315. Moreover, the scene-based image editing system 106 generates the mask via a segmentation neural network for the dress based on an input or query selection by a user of a client device. This is discussed in additional detail below (e.g., with regard to FIG. 44). As discussed above, the scene-based image editing system 106 can train the diffusion neural networks within the denoising process 4306 by reconstructing input digital images and supervising the diffusion neural networks with diffusion representations generated from steps of the diffusion process).
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Smith’s generation of the appearance image (the content fills 712) based on the input image 706 (the digital scene image) and the object image (708a-708d) into Stannus, such that Stannus’s style image is generated based on the scene image (the digital image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 19:
Claim 19 encompasses the same scope of invention as claim 18 except for the additional claim limitation that the diffusion model is a text-to-image diffusion model operating without a text prompt.
Kim further teaches the claim limitation that the diffusion model is a text-to-image diffusion model operating without a text prompt (
[0525] As also shown in FIG. 41A, in some embodiments, the scene-based image editing system 106 performs an optional act 4116 of conditioning the first denoising neural network 4108 and the Nth denoising neural network 4112. For example, the act 4116 includes conditioning each layer of the denoising neural networks 4108 and 4112. To illustrate, conditioning layers of a neural network includes providing context to the networks to guide the generation of a text-conditioned image (e.g., a digital image including a synthesized shadow). For instance, conditioning layers of neural networks include at least one of (1) transforming conditioning inputs (e.g., the text prompt) into vectors to combine with the denoising representations; and/or (2) utilizing attention mechanisms which causes the neural networks to focus on specific portions of the input and condition its predictions (e.g., outputs) based on the attention mechanisms. Specifically, for denoising neural networks, conditioning layers of the denoising neural networks includes providing an alternative input to the denoising neural networks (e.g., the text query). In particular, the scene-based image editing system 106 provides alternative inputs to provide a guide in removing noise from the diffusion representation (e.g., the denoising process). Thus, the scene-based image editing system 106 conditioning layers of the denoising neural networks acts as guardrails to allow the denoising neural networks to learn how to remove noise from an input signal and produce a clean output).
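By way of illustration only, one way a text-to-image diffusion model can operate without a text prompt is to condition every denoising step on the embedding of an empty string (a null prompt). This sketch is an assumption about such operation, not Kim's implementation; `text_encoder` and `denoise` are hypothetical:

```python
# Hypothetical sketch: run a text-conditioned denoising loop with a null
# (empty-string) prompt, so no user text guides the generation.
def generate_unprompted(text_encoder, denoise, latents, steps):
    null_embedding = text_encoder("")   # empty prompt -> null conditioning
    for t in reversed(range(steps)):
        latents = denoise(latents, t, condition=null_embedding)
    return latents
```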
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Kim’s generation of the appearance image (the content fills 712) based on the input image 706 (the digital scene image) and the object image (708a-708d) into Stannus, such that Stannus’s style image is generated based on the scene image (the digital image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 20:
Claim 20 encompasses the same scope of invention as claim 6 except for the additional claim limitation that the composite image is generated to blend the object image into the scene image while adapting the appearance of the object image to the lighting and colour characteristics of the scene image.
Mishra teaches the claim limitation that the composite image is generated to blend the object image into the scene image while adapting the appearance of the object image to the lighting and colour characteristics of the scene image (
Mishra teaches at Paragraph 0049-0050 that the at least one content image (304/306/308) represents the structure or content of at least one of the scene image 302 and the object image, while omitting at least the background color characteristics.
Mishra teaches at Paragraph [0049] FIG. 3 is a schematic diagram illustrating how the background color is recommended for multiple objects of an input image 302 based on training a machine learning model 324, according to some embodiments. At a first time an input image 302 is received (e.g., uploaded to the consumer application 190). The input image 302 contains the snowman object 304, the Santa Clause object 306, and the reindeer object 308. Some embodiments perform preprocessing functionality, such as removing the background color and/or other features (e.g., trees, landscape, etc.) of the input image 302 (e.g., set a pixel value to a single opaque or clear color), such that only the objects are present in the input image 302.
Mishra teaches at Paragraph [0051] that the image feature context 321 includes the detected objects and other features determined in the selected training images 310, 312, 314, and 316—i.e., the Santa Clause object 314-2, the snowman object 314-1, a Christmas tree, reindeer, the theme of Christmas, and the like. The image feature context 321 further includes the color themes data structures 322 present in the selected training images 310, 312, 314, and 316. In other words, each color theme data structure (e.g., 323) represents the current color theme or combination for a respective training image. For example, color theme data structure 323 represents the colors within the image 316, where the first two colors 323-1 represent the purple background of the image 316 and the three colors 323-2 represent a combination of the colors of the objects within the image 316. FIG. 3 illustrates that the neural network 324 is trained to recommend a color theme output 326 for a given training image, such as the training image 314, where the color set 326-1 (i.e., green, orange, and red) represents recommendations for coloring the objects within the image 314 and the color set 326-2 (i.e., red and white) represents recommendations for coloring the background within the image 314. As described in more detail below, the color theme output 326 may be generated based on the 5 most dominant colors in the image 314.
Mishra teaches at Paragraph [0052] Continuing with FIG. 3, based on the training indicated in 330, particular embodiments recommend multiple colors for the background of the input image 202, given the features of the objects 304, 306, and 308 within the input image 302. For example, for a combination of the objects 304, 306, and 308, particular embodiments recommend or cause a filling of the color green 342 as the background of the output image 340. Additionally or alternatively, for the snowman object 304, some embodiments recommend or cause a filling of the color red 344 as the background of another output image. Alternatively or additionally, for the reindeer object 208, some embodiments recommend or cause a filling of the color red 246 as the background of another output image. Alternatively or additionally, for the reindeer object 208, some embodiments recommend or cause a filling of the color green 248 as the background of another output image. Alternatively or additionally, for the Santa Clause object 206, some embodiments recommend or cause a filling of the color orange 250 as the background of another output image. Alternatively or additionally, for a combination of the objects, particular embodiments recommend or cause a filling of the color red 252 as the background of yet another output image. In some embodiments, where a pixel-level pattern has been identified, some embodiments recommend or cause a filling of a particular color for the background, as illustrated in 254 and 256).
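For illustration only, the "5 most dominant colors" referenced at Paragraph [0051] can be extracted with a standard k-means clustering of pixel values; this sketch is not Mishra's algorithm:

```python
# Hypothetical sketch: find the k most dominant colors of an image by
# clustering its pixels and ordering cluster centers by pixel count.
import numpy as np
from sklearn.cluster import KMeans

def dominant_colors(image, k=5):
    """image: HxWx3 array; returns k RGB centers, most dominant first."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    counts = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_[np.argsort(counts)[::-1]]
```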
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Mishra’s generation of the appearance image (the background color image) based on the input image 302 (the scene image) and the object image (304/306/308) into Stannus, such that Stannus’s style image is generated based on the scene image (the input image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Re Claim 21:
Claim 21 encompasses the same scope of invention as claim 1 except for the additional claim limitation of identifying at least one further area of the output image that a) corresponds to an edge of the object image or an area of transition between the object image and the scene image and b) is not similar to a corresponding area of the scene image, wherein the at least one further area of the output image is not modified by replacing visual elements of the output image with visual elements from the corresponding area of the scene image.
Stannus further teaches the claim limitation of identifying at least one further area of the output image that a) corresponds to an edge of the object image or an area of transition between the object image and the scene image and b) is not similar to a corresponding area of the scene image, wherein the at least one further area of the output image is not modified by replacing visual elements of the output image with visual elements from the corresponding area of the scene image (
Stannus teaches at FIG. 21 and Paragraph 0174-0175 that the object image OBJ is blended with the scene image of FIG. 20. Stannus teaches at Paragraph 0172-0175 that the region defining unit 202 demarcates one or more regions from the image based on the estimated distance (e.g., the portion of the tower T is identified, the tower T and the background of the tower T are style-transferred based on the style image, and the region corresponding to the front side of the tower T is not subjected to style transfer). At least one area of the tower T of the output image in FIG. 21 is similar to a corresponding area of the scene image in FIG. 20, and the visual elements of the tower T in the output image are replaced by the visual elements of the scene image of FIG. 20. Stannus teaches at Paragraph 0168 that the region corresponding to the portion inside the window W of the building in the output image is not subjected to style transfer.
Stannus teaches at Paragraph [0057] that the distance to the target means the distance from the viewpoint from which the image is captured to the target. For example, when an image is captured by a camera, the camera is the viewpoint. Therefore, in this case, the distance to the target means the distance from the camera to the target.
Stannus teaches at Paragraph [0171] that in step St11 in FIG. 3, the distance estimation unit 201 estimates the distance to the target included in the image. The target in this example is the tower T. That is, the distance estimation unit 201 estimates the distance from the camera of the user terminal 20Z to the tower T.
Stannus teaches at Paragraph [0172] that, in step St12, the region defining unit 202 demarcates one or more regions from the image based on the estimated distance. The region in this example may be a portion of the image that is a predetermined distance or more away from the tower T, that is, a portion of the tower T, and a portion of the background when the tower T is the foreground).
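For illustration only (Stannus describes the distance estimation and region demarcation functionally, not as code; the `stylize` callable below is a hypothetical stand-in for any style-transfer model), the following sketch shows regions being demarcated from an estimated depth map and the near region being excluded from style transfer, so that the excluded region of the output remains similar to the corresponding region of the input.

```python
# Illustrative sketch only; not Stannus's disclosed code.
import numpy as np

def demarcate_regions(depth, threshold_m):
    """Boolean mask that is True where a pixel is at least threshold_m away."""
    return depth >= threshold_m

def selective_style_transfer(image, depth, threshold_m, stylize):
    """Apply `stylize` only to pixels beyond the distance threshold; near
    pixels (e.g., the front side of the tower T) keep their original values."""
    mask = demarcate_regions(depth, threshold_m)  # cf. step St12
    output = image.copy()
    output[mask] = stylize(image)[mask]           # style transfer on far region only
    return output
```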
Re Claim 22:
Claim 22 encompasses the same scope of invention as claim 1 except for the additional claim limitation of outputting data defining the modified output image to computer memory, to a display device or to a communication interface.
Joachim further teaches the claim limitation of outputting data defining the modified output image to computer memory, to a display device or to a communication interface ([0125] In one or more embodiments, the client devices 110a-110n include computing devices that access, view, modify, store, and/or provide, for display, digital images. For example, the client devices 110a-110n include smartphones, tablets, desktop computers, laptop computers, head-mounted-display devices, or other electronic devices. The client devices 110a-110n include one or more applications (e.g., the client application 112) that can access, view, modify, store, and/or provide, for display, digital images. For example, in one or more embodiments, the client application 112 includes a software application installed on the client devices 110a-110n. Additionally, or alternatively, the client application 112 includes a web browser or other application that accesses a software application hosted on the server(s) 102 (and supported by the image editing system 104).
[0130] As shown in FIG. 2, the scene-based image editing system 106 provides a graphical user interface 202 for display on a client device 204. As further shown, the scene-based image editing system 106 provides, for display within the graphical user interface 202, a digital image 206. In one or more embodiments, the scene-based image editing system 106 provides the digital image 206 for display after the digital image 206 is captured via a camera of the client device 204. In some instances, the scene-based image editing system 106 receives the digital image 206 from another computing device or otherwise accesses the digital image 206 at some storage location, whether local or remote.
[0887] Additionally, as shown in FIG. 86, the scene-based image editing system 106 includes data storage 8614. In particular, data storage 8614 includes data associated with modifying two-dimensional images according to three-dimensional representations of the two-dimensional images. For example, the data storage 8614 includes neural networks for generating three-dimensional representations of two-dimensional images. The data storage 8614 also stores the three-dimensional representations. The data storage 8614 also stores information such as depth values, camera parameters, parameters of input elements, objects, or other information that the scene-based image editing system 106 utilizes to modify two-dimensional images according to three-dimensional characteristics of the content of the two-dimensional images.
[0888] Each of the components of the scene-based image editing system 106 of FIG. 86 optionally includes software, hardware, or both. For example, the components include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices, such as a client device or server device. When executed by the one or more processors, the computer-executable instructions of the scene-based image editing system 106 cause the computing device(s) to perform the methods described herein. Alternatively, the components include hardware, such as a special-purpose processing device to perform a certain function or group of functions. Alternatively, the components of the scene-based image editing system 106 include a combination of computer-executable instructions and hardware).
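For illustration only (the outputting limitation of claim 22 covers routine operations; the upload endpoint URL below is a hypothetical assumption), a sketch of outputting a modified image to computer memory, to a display device, and to a communication interface:

```python
# Illustrative sketch only; the upload endpoint is hypothetical.
import io
import urllib.request
from PIL import Image

def output_modified_image(image):
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")   # output to computer memory
    image.show()                       # output to a display device
    request = urllib.request.Request(  # output to a communication interface
        "http://example.com/upload",
        data=buffer.getvalue(),
        headers={"Content-Type": "image/png"},
        method="POST",
    )
    urllib.request.urlopen(request)
```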
It would have been obvious to one of ordinary skill in the art before the filing date of the instant application to have incorporated Joachim's generation of the appearance image (the content fills 712) based on the input image 706 (the digital scene image) and the object image (708a-708d) into Stannus, such that Stannus's style image is generated based on the scene image (the digital image) and the object image. One of ordinary skill in the art would have modified the input image based on the appearance characteristics of a portion of the scene image and/or the object image.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JIN CHENG WANG whose telephone number is (571)272-7665. The examiner can normally be reached Mon-Fri 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, King Poon, can be reached at 571-270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JIN CHENG WANG/ Primary Examiner, Art Unit 2617