DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claims 1, 4, 11-13, 15, 18-20 and 22 are objected to because of the following informalities:
As to claim 1, the phrase “an object of a same type as an object included in the foreground image” in lines 20-21 of the claim should be changed to “the object of the same type as the object included in the foreground image”, since “an object of a same type as an object included in the foreground image” was previously recited in line 18 of the claim. Appropriate correction is required.
As to claim 4, the phrase “an object of a same type as an object included in the foreground image” in line 2 of the claim should be changed to “the object of the same type as the object included in the foreground image”, since “an object of a same type as an object included in the foreground image” was previously recited in claim 1. Appropriate correction is required.
As to claim 11, the phrase “an object of a same type as an object included in the foreground image” in lines 25-26 of the claim should be changed to “the object of the same type as the object included in the foreground image”, since “an object of a same type as an object included in the foreground image” was previously recited in line 23 of claim 11. Appropriate correction is required.
Claim 12 is objected to because it recites “extract a feature vector from the foreground image; and generate, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector” which is a substantial duplicate of “extracting a feature vector from the foreground image; and generating, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector” recited in lines 18-21 of claim 11. Appropriate correction is required.
As to claim 13, the phrase “an object of a same type as an object included in the foreground image” in line 3 of the claim should be changed to “the object of the same type as the object included in the foreground image”, since “an object of a same type as an object included in the foreground image” was previously recited in claim 11. Appropriate correction is required.
Claim 15 is objected to because it recites “wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image” which is a substantial duplicate of “wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image” recited in lines 25-27 of claim 11. Appropriate correction is required.
As to claim 18, the phrase “an object of a same type as an object included in the foreground image” in lines 21-22 of the claim should be changed to “the object of the same type as the object included in the foreground image”, since “an object of a same type as an object included in the foreground image” was previously recited in line 19 of the claim. Appropriate correction is required.
Claim 19 is objected to because it recites “extract a feature vector from the foreground image; and generate, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector” which is a substantial duplicate of “extracting a feature vector from the foreground image; and generating, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector” recited in lines 14-17 of claim 18. Appropriate correction is required.
As to claim 20, the phrase “an object of a same type as an object included in the foreground image” in line 3 of the claim should be changed to “the object of the same type as the object included in the foreground image”, since “an object of a same type as an object included in the foreground image” was previously recited in claim 18. Appropriate correction is required.
Claim 22 is objected to because it recites “wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image” which is a substantial duplicate of “wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image” recited in lines 21-23 of claim 18. Appropriate correction is required.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1, 3, 7, 18-19, 22 and 23 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhou et al. (“SAC-GAN: Structure-Aware Image Composition”, ARXIV.ORG, Cornell University Library, 5 July 2022, cited in IDS (Information Disclosure Statement)).
As to claim 1, Zhou et al. teaches a method, for generating a composite image, executed by one or more processors (Section 1, paragraph 2: In this paper, we advocate the compositional approach to image manipulation and augmentation to serve the above application scenarios. Specifically, we aim to plausibly insert an object (e.g., a pair of glasses, a piece of furniture, a vehicle or pedestrian) represented as a cropped patch from an object image, into a target background scene, as shown in Fig. 1; Fig. 3, processed image), the method comprising:
receiving a foreground image (Section 3, paragraph 1: given an object image y);
receiving a background image (Section 3, paragraph 1: scene image x);
generating, using a first artificial neural network, information on a position and size within the background image, wherein the information on the position and size within the background image is generated from the foreground image and the background image (Section 3, paragraph 1: our goal is to learn a transformation matrix Tgen that successfully places an object patch yp existing in y into the scene image x; Section 3, paragraph 2: isotropic scaling and 2D translation of the object patches for Tgen; Fig. 2: SAC-GAN is used to predict Tgen); and
generating a composite image based on: the foreground image, the background image, and
the information on the position and size within the background image (Section 3, Paragraphs 1-2: Given an object image y and a scene image x, our goal is to learn a transformation matrix Tgen that successfully places an object patch yp existing in y into the scene image x (see Fig. 1). The outcomes are then used for composition which is a cut-and-paste operation, isotropic scaling and 2D translation of the object patches for Tgen; Fig. 2, input scene image, input object image and 2D transform are used to generate a composite image),
wherein the generating of the composite image includes: extracting a feature vector from the foreground image (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks); and
generating, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network, consider isotropic scaling and 2D translation of the object patches for Tgen; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks; Fig. 2),
wherein the composite image is generated without user input regarding a position and size of an object of a same type as an object included in the foreground image to be placed within the background image (Fig. 2: SAC-GAN predicts a 2D transform Tgen, which is utilized by a spatial transformer network (STN), to produce a composed image; note that it is SAC-GAN which predicts the 2D transform Tgen and not the user, and this is therefore interpreted as “without user input”. Furthermore, section 2 of Zhou et al. discloses automatically determining the location and other transformations, such as rotation and scaling, of more general object categories; note that the term “automatically” is interpreted as “without user input”);
wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image (Zhou et al., Section 2: color harmonization, shadow generation and image blending; Section 5: image harmonization and image-based shadow generation used to improve the photorealism of composite images).
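For illustration only, the cut-and-paste composition cited from Section 3 of Zhou et al. (isotropic scaling and 2D translation of an object patch into a scene image) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the reference or of the claimed invention: the function name `paste_patch`, the grayscale list-of-lists image layout, the nearest-neighbour resampling, and the explicit `scale`, `tx`, `ty` parameters (standing in for the values a placement network would predict) are all hypothetical.

```python
from typing import List

Image = List[List[float]]  # grayscale image, row-major (assumed layout)


def paste_patch(scene: Image, patch: Image, mask: Image,
                scale: float, tx: int, ty: int) -> Image:
    """Paste `patch` into a copy of `scene` after isotropic scaling
    by `scale` and translation by (tx, ty).

    `scale`, `tx`, `ty` are hypothetical stand-ins for the position and
    size information; here they are supplied directly by the caller.
    """
    out = [row[:] for row in scene]  # leave the input scene untouched
    ph, pw = len(patch), len(patch[0])
    new_h = max(1, round(ph * scale))
    new_w = max(1, round(pw * scale))
    for r in range(new_h):
        for c in range(new_w):
            # Nearest-neighbour lookup back into the unscaled patch.
            sr = min(ph - 1, int(r / scale))
            sc = min(pw - 1, int(c / scale))
            y, x = ty + r, tx + c
            inside = 0 <= y < len(out) and 0 <= x < len(out[0])
            # Only object pixels (mask > 0.5) overwrite the scene.
            if inside and mask[sr][sc] > 0.5:
                out[y][x] = patch[sr][sc]
    return out
```

In this sketch the same scale factor is applied to both axes (isotropic scaling) and the mask restricts the paste to object pixels; any harmonization of brightness, saturation, hue, or luminance would be a separate step not shown here.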
As to claim 3, Zhou et al. teaches the method according to claim 1, wherein the information on the position and size within the background image, and the feature vector are input to the second artificial neural network as conditions for generating the composite image (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network, consider isotropic scaling and 2D translation of the object patches for Tgen; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks; Fig. 2).
As to claim 7, Zhou et al. teaches the method according to claim 1, wherein
the first artificial neural network is a model trained to estimate information on a position and size of a training foreground image to be placed within a training background image based on training data including a pair of the training foreground image and the training background image (Section 1, transforms default object patch (centered in image with size 64×64) to match its corresponding cropped patch; Fig. 3: the training process uses the same image as both the foreground object and the background image), and
the training foreground image and the training background image are generated based on a same original training image (Fig. 3: the training process uses the same image as both the foreground object and the background image).
As to claim 18, Zhou et al. teaches one or more non-transitory computer-readable media storing instructions that, when executed by one or more processors of an information processing system (Section 3.1: computing transformation; Fig. 3, processed image), cause the information processing system to:
receive a foreground image (Section 3, paragraph 1: given an object image y);
receive a background image (Section 3, paragraph 1: scene image x);
generate, using a first artificial neural network, information on a position and size within the background image, wherein the information on the position and size within the background image is generated from the foreground image and the background image (Section 3, paragraph 1: our goal is to learn a transformation matrix Tgen that successfully places an object patch yp existing in y into the scene image x; Section 3, paragraph 2: isotropic scaling and 2D translation of the object patches for Tgen; Fig. 2: SAC-GAN is used to predict Tgen); and
generate a composite image based on: the foreground image, the background image, and the information on the position and size within the background image (Section 3, Paragraphs 1-2: Given an object image y and a scene image x, our goal is to learn a transformation matrix Tgen that successfully places an object patch yp existing in y into the scene image x (see Fig. 1). The outcomes are then used for composition which is a cut-and-paste operation, isotropic scaling and 2D translation of the object patches for Tgen; Fig. 2, input scene image, input object image and 2D transform are used to generate a composite image),
wherein the generation of the composite image includes: extracting a feature vector from the foreground image (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks); and
generating, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network, consider isotropic scaling and 2D translation of the object patches for Tgen; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks; Fig. 2),
wherein the composite image is generated without user input regarding a position and size of an object of a same type as an object included in the foreground image to be placed within the background image (Fig. 2: SAC-GAN predicts a 2D transform Tgen, which is utilized by a spatial transformer network (STN), to produce a composed image; note that it is SAC-GAN which predicts the 2D transform Tgen and not the user, and this is therefore interpreted as “without user input”. Furthermore, section 2 of Zhou et al. discloses automatically determining the location and other transformations, such as rotation and scaling, of more general object categories; note that the term “automatically” is interpreted as “without user input”);
wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image (Zhou et al., Section 2: color harmonization, shadow generation and image blending; Section 5: image harmonization and image-based shadow generation used to improve the photorealism of composite images).
As to claim 19, Zhou et al. teaches the one or more non-transitory computer-readable media according to claim 18, wherein the instructions, when executed by the one or more processors (Section 3.1: computing transformation; Fig. 3, processed image), cause the information processing system to generate the composite image by causing the information processing system to: extract a feature vector from the foreground image (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks); and
generate, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network, consider isotropic scaling and 2D translation of the object patches for Tgen; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks; Fig. 2).
As to claim 22, Zhou et al. teaches the one or more non-transitory computer-readable media according to claim 18, wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image (Zhou et al., Section 2: color harmonization, shadow generation and image blending; Section 5: image harmonization and image-based shadow generation used to improve the photorealism of composite images).
As to claim 23, Zhou et al. teaches the one or more non-transitory computer-readable media according to claim 18, wherein the first artificial neural network is a model trained to estimate information on a position and size of a training foreground image to be placed within a training background image based on training data including a pair of the training foreground image and the training background image (Section 1, transforms default object patch (centered in image with size 64×64) to match its corresponding cropped patch; Fig. 3: the training process uses the same image as both the foreground object and the background image), and the training foreground image and the training background image are generated based on a same original training image (Fig. 3: the training process uses the same image as both the foreground object and the background image).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 4-5 and 20-21 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (“SAC-GAN: Structure-Aware Image Composition”, ARXIV.ORG, Cornell University Library, 5 July 2022, cited in IDS (Information Disclosure Statement)) in view of Lee et al. (“Context-Aware Synthesis and Placement of Object Instances”, ARXIV.ORG, Cornell University Library, 6 December 2018, cited in IDS (Information Disclosure Statement)).
As to claim 4, Zhou et al. teaches the method as discussed above, but does not explicitly disclose wherein the second artificial neural network is a generative model trained to generate the composite image in which an object of a same type as an object included in the foreground image is synthesized at the position and size within the background image.
However, Lee et al. teaches wherein the second artificial neural network is a generative model trained to generate the composite image in which an object of a same type as an object included in the foreground image is synthesized at the position and size within the background image (Abstract: generative models determine object placement and appearance in the composite).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the device of Zhou et al. such that the second artificial neural network is a generative model trained to generate the composite image in which an object of a same type as an object included in the foreground image is synthesized at the position and size within the background image as taught by Lee et al. in order to generate realistic images.
As to claim 5, Zhou et al. teaches the method as discussed above, but does not explicitly disclose wherein the object included in the foreground image and the object of the same type included in the composite image are different from each other in at least one of appearance and pose.
However, Lee et al. teaches wherein the object included in the foreground image and the object of the same type included in the composite image are different from each other in at least one of appearance and pose (Abstract, page 3: synthesize a new instance of a specific object category; Fig. 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the device of Zhou et al. such that the object included in the foreground image and the object of the same type included in the composite image are different from each other in at least one of appearance and pose as taught by Lee et al. in order to generate realistic images.
As to claim 20, Zhou et al. teaches the one or more non-transitory computer-readable media according to claim 18, but does not explicitly disclose wherein the second artificial neural network is a generative model trained to generate the composite image in which an object of a same type as an object included in the foreground image is synthesized at the position and size within the background image.
However, Lee et al. teaches wherein the second artificial neural network is a generative model trained to generate the composite image in which an object of a same type as an object included in the foreground image is synthesized at the position and size within the background image (Abstract: generative models determine object placement and appearance in the composite).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the device of Zhou et al. such that the second artificial neural network is a generative model trained to generate the composite image in which an object of a same type as an object included in the foreground image is synthesized at the position and size within the background image as taught by Lee et al. in order to generate realistic images.
As to claim 21, Zhou et al. teaches the one or more non-transitory computer-readable media as discussed above, but does not explicitly disclose wherein the object included in the foreground image and the object of the same type included in the composite image are different from each other in at least one of appearance and pose.
However, Lee et al. teaches wherein the object included in the foreground image and the object of the same type included in the composite image are different from each other in at least one of appearance and pose (Abstract, page 3: synthesize a new instance of a specific object category; Fig. 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the device of Zhou et al. such that the object included in the foreground image and the object of the same type included in the composite image are different from each other in at least one of appearance and pose as taught by Lee et al. in order to generate realistic images.
Claim(s) 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (“SAC-GAN: Structure-Aware Image Composition”, ARXIV.ORG, Cornell University Library, 5 July 2022, cited in IDS (Information Disclosure Statement)) in view of Zhu et al. (“TopNet: Transformer-Based Object Placement Network for Image Compositing”, 2023 IEEE/CVF Conference On Computer Vision and Pattern Recognition (CVPR), IEEE, 17 June 2023, pages 1838-1847, cited in IDS (Information Disclosure Statement)).
As to claim 8, Zhou et al. teaches the method according to claim 7, wherein
the training foreground image is generated by extracting an area containing a specific object from the same original training image (Section 1, paragraph 6: self-supervised training, where ground truth transformations are derived from cropped objects and their original images),
the training background image is generated by removing the specific object from the same original training image (Section 1, paragraph 6: self-supervised training, where ground truth transformations are derived from cropped objects and their original images. The use of inpainting to prepare training background images is a common alternative in the field), but does not explicitly disclose information on a position and size of the specific object within the same original training image is used as ground truth for the pair of the training foreground image and the training background image when learning the first artificial neural network.
However, Zhu et al. teaches information on a position and size of the specific object within the same original training image is used as ground truth for the pair of the training foreground image and the training background image when learning the first artificial neural network (pages 1841-1842, Fig. 4, Section 4.1: original bounding box of the object considered as ground-truth, and the bounding box along with foreground and background pairs are used during training to predict the optimal placement of an object within a background image).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the device of Zhou et al. such that information on a position and size of the specific object within the same original training image is used as ground truth for the pair of the training foreground image and the training background image when learning the first artificial neural network as taught by Zhu et al. in order to provide object placement for real-world compositing with diverse challenging scenes and object categories.
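For illustration only, the training-data arrangement discussed for claim 8 (a foreground image extracted from an original training image, a background image with the object removed, and the object's original position and size kept as ground truth) can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce the procedure of Zhou et al. or Zhu et al.: the function name `make_training_pair`, the `(x, y, width, height)` box convention, and the constant `fill` value used in place of inpainting are all hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]            # grayscale image, row-major (assumed)
Box = Tuple[int, int, int, int]      # (x, y, width, height), assumed convention


def make_training_pair(image: Image, box: Box,
                       fill: float = 0.0) -> Tuple[Image, Image, Box]:
    """Build one (foreground, background, ground_truth) training example.

    foreground:   the region of `image` covered by `box`
    background:   a copy of `image` with that region overwritten by `fill`
                  (a crude stand-in for object removal or inpainting)
    ground_truth: the original box, i.e. the object's position and size
    """
    x, y, w, h = box
    foreground = [row[x:x + w] for row in image[y:y + h]]
    background = [row[:] for row in image]  # do not modify the original
    for r in range(y, y + h):
        for c in range(x, x + w):
            background[r][c] = fill
    return foreground, background, box
```

Because both images come from the same original image, the ground-truth position and size are available without manual annotation, which is what makes this kind of pairing usable for supervised training of a placement network.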
Claim(s) 11-12 and 15-16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (“SAC-GAN: Structure-Aware Image Composition”, ARXIV.ORG, Cornell University Library, 5 July 2022, cited in IDS (Information Disclosure Statement)) in view of Shechtman et al. (US 2019/0251401 A1).
As to claim 11, Zhou et al. teaches an information processing system, comprising:
one or more processors configured to execute one or more computer-readable programs (Section 3.1: computing transformation; Fig. 3, processed image), wherein the one or more computer-readable programs store instructions that when executed by the one or more processors, cause the information processing system to:
receive a foreground image (Section 3, paragraph 1: given an object image y);
receive a background image (Section 3, paragraph 1: scene image x);
generate, using a first artificial neural network, information on a position and size within the background image, wherein the information on the position and size within the background image is generated from the foreground image and the background image (Section 3, paragraph 1: our goal is to learn a transformation matrix Tgen that successfully places an object patch yp existing in y into the scene image x; Section 3, paragraph 2: isotropic scaling and 2D translation of the object patches for Tgen; Fig. 2: SAC-GAN is used to predict Tgen); and
generate a composite image based on: the foreground image, the background image, and the information on the position and size within the background image (Section 3, Paragraphs 1-2: Given an object image y and a scene image x, our goal is to learn a transformation matrix Tgen that successfully places an object patch yp existing in y into the scene image x (see Fig. 1). The outcomes are then used for composition which is a cut-and-paste operation, isotropic scaling and 2D translation of the object patches for Tgen; Fig. 2, input scene image, input object image and 2D transform are used to generate a composite image),
wherein the generation of the composite image includes: extracting a feature vector from the foreground image (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks); and
generating, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network, consider isotropic scaling and 2D translation of the object patches for Tgen; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks; Fig. 2),
wherein the composite image is generated without user input regarding a position and size of an object of a same type as an object included in the foreground image to be placed within the background image (Fig. 2: SAC-GAN predicts a 2D transform Tgen, which is utilized by a spatial transformer network (STN), to produce a composed image; note that it is SAC-GAN which predicts the 2D transform Tgen and not the user, and this is therefore interpreted as “without user input”. Furthermore, section 2 of Zhou et al. discloses automatically determining the location and other transformations, such as rotation and scaling, of more general object categories; note that the term “automatically” is interpreted as “without user input”);
wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image (Zhou et al., Section 2: color harmonization, shadow generation and image blending; Section 5: image harmonization and image-based shadow generation used to improve the photorealism of composite images), but does not explicitly disclose a communication device; a memory; and one or more processors coupled to the memory and configured to execute one or more computer-readable programs included in the memory.
However, Shechtman et al. teaches a communication device ([0209]: communication interface); a memory ([0205]: memory); and one or more processors coupled to the memory and configured to execute one or more computer-readable programs included in the memory ([0205]: memory storing data, metadata and programs for execution by processor(s)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the device of Zhou et al. with a communication device, a memory, and one or more processors coupled to the memory and configured to execute one or more computer-readable programs included in the memory as taught by Shechtman et al. in order to generate composite images and communicate with other computing devices or one or more networks.
As to claim 12, Zhou et al. in view of Shechtman et al. teaches the information processing system according to claim 11, wherein the instructions, when executed by the one or more processors (Zhou et al., Section 3.1: computing transformation; Fig. 3, processed image), cause the information processing system to generate the composite image by causing the information processing system to: extract a feature vector from the foreground image (Zhou et al., Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks); and generate, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector (Zhou et al., Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network, consider isotropic scaling and 2D translation of the object patches for Tgen; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks; Fig. 2).
As to claim 15, Zhou et al. in view of Shechtman et al. teaches the information processing system according to claim 11, wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image (Zhou et al., Section 2: color harmonization, shadow generation and image blending; Section 5: image harmonization and image-based shadow generation used to improve the photorealism of composite images).
As to claim 16, Zhou et al. in view of Shechtman et al. teaches the information processing system according to claim 11, wherein the first artificial neural network is a model trained to estimate information on a position and size of a training foreground image to be placed within a training background image based on training data including a pair of the training foreground image and the training background image (Zhou et al., Section 1, transforms default object patch (centered in image with size 64×64) to match its corresponding cropped patch; Fig. 3: the training process uses the same image as both the foreground object and the background image), and the training foreground image and the training background image are generated based on a same original training image (Zhou et al., Fig. 3: the training process uses the same image as both the foreground object and the background image).
Claim(s) 13-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (“SAC-GAN: Structure-Aware Image Composition”, ARXIV.ORG, Cornell University Library, 5 July 2022, cited in IDS (Information Disclosure Statement)) in view of Shechtman et al. (US 2019/0251401 A1) and further in view of Lee et al. (“Context-Aware Synthesis and Placement of Object Instances”, ARXIV.ORG, Cornell University Library, 6 December 2018, cited in IDS (Information Disclosure Statement)).
As to claim 13, Zhou et al. in view of Shechtman et al. teaches the information processing system according to claim 11, but does not explicitly disclose wherein the second artificial neural network is a generative model trained to generate the composite image in which an object of a same type as an object included in the foreground image is synthesized at the position and size within the background image.
However, Lee et al. teaches wherein the second artificial neural network is a generative model trained to generate the composite image in which an object of a same type as an object included in the foreground image is synthesized at the position and size within the background image (Abstract: generative models determine object placement and appearance in the composite).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the device of Zhou et al. in view of Shechtman et al. such that the second artificial neural network is a generative model trained to generate the composite image in which an object of a same type as an object included in the foreground image is synthesized at the position and size within the background image as taught by Lee et al. in order to generate realistic images.
As to claim 14, Zhou et al. in view of Shechtman et al. teaches the information processing system according to claim 13, but does not explicitly disclose wherein the object included in the foreground image and the object of the same type included in the composite image are different from each other in at least one of appearance and pose.
However, Lee et al. teaches wherein the object included in the foreground image and the object of the same type included in the composite image are different from each other in at least one of appearance and pose (Abstract, page 3: synthesize a new instance of a specific object category; Fig. 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the device of Zhou et al. in view of Shechtman et al. such that the object included in the foreground image and the object of the same type included in the composite image are different from each other in at least one of appearance and pose as taught by Lee et al. in order to generate realistic images.
Claim(s) 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (“SAC-GAN: Structure-Aware Image Composition”, ARXIV.ORG, Cornell University Library, 5 July 2022, cited in IDS (Information Disclosure Statement)) in view of Shechtman et al. (US 2019/0251401 A1) and further in view of Zhu et al. (“TopNet: Transformer-Based Object Placement Network for Image Compositing”, 2023 IEEE/CVF Conference On Computer Vision and Pattern Recognition (CVPR), IEEE, 17 June 2023, pages 1838-1847, cited in IDS (Information Disclosure Statement)).
As to claim 17, Zhou et al. in view of Shechtman et al. teaches the information processing system according to claim 16, wherein the training foreground image is generated by extracting an area containing a specific object from the same original training image (Zhou et al., Section 1, paragraph 6: self-supervised training, where ground truth transformations are derived from cropped objects and their original images), the training background image is generated by removing the specific object from the same original training image (Zhou et al., Section 1, paragraph 6: self-supervised training, where ground truth transformations are derived from cropped objects and their original images. The use of inpainting to prepare training background images is a common alternative in the field), but does not explicitly disclose information on a position and size of the specific object within the same original training image is used as ground truth for the pair of the training foreground image and the training background image when learning the first artificial neural network.
However, Zhu et al. teaches information on a position and size of the specific object within the same original training image is used as ground truth for the pair of the training foreground image and the training background image when learning the first artificial neural network (pages 1841-1842, Fig. 4, Section 4.1: original bounding box of the object considered as ground-truth, and the bounding box along with foreground and background pairs are used during training to predict the optimal placement of an object within a background image).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the device of Zhou et al. in view of Shechtman et al. such that information on a position and size of the specific object within the same original training image is used as ground truth for the pair of the training foreground image and the training background image when learning the first artificial neural network as taught by Zhu et al. in order to provide object placement for real-world compositing with diverse challenging scenes and object categories.
Allowable Subject Matter
Claim 24 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Response to Arguments
Applicant's arguments filed 03/16/2026 have been fully considered but they are not persuasive.
Examiner respectfully disagrees with the applicant’s argument that “None of the cited references teach, disclose, or otherwise suggest at least, as recited by claim 1, wherein the generating of the composite image includes: extracting a feature vector from the foreground image; and generating, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector, wherein the composite image is generated without user input regarding a position and size of an object of a same type as an object included in the foreground image to be placed within the background image, wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image.”
Zhou et al. teaches “wherein the generating of the composite image includes: extracting a feature vector from the foreground image (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks); and generating, using a second artificial neural network, the composite image from the background image, the information on the position and size within the background image, and the feature vector (Section 3: using the transformation Tgen, object patch yp, corresponding object mask ym, and scene image x to generate a composite image via a spatial-transformer network, consider isotropic scaling and 2D translation of the object patches for Tgen; Section 3.1: use of an encoder Eobj, which encodes semantics features and structural details of the object in the form of various masks; Fig. 2), wherein the composite image is generated without user input regarding a position and size of an object of a same type as an object included in the foreground image to be placed within the background image (Fig. 2: SAC-GAN predicts a 2D transform Tgen, which is utilized by a spatial transformer network (STN), to produce a composed image; note that it is SAC-GAN which predicts a 2D transform Tgen and not the user, thus is interpreted as “without user input”. Furthermore, section 2 of Zhou et al. discloses automatically determining the location and other transformations such as rotation and scaling of more general object categories; note that the term “automatically” is interpreted as “without user input”); wherein the composite image is an image in which an object of a same type as an object included in the foreground image is synthesized to match at least some of brightness, saturation, hue, and luminance of the background image (Zhou et al., Section 2: color harmonization, shadow generation and image blending; Section 5: image harmonization and image-based shadow generation used to improve the photorealism of composite images),” as claimed.
Furthermore, applicant argues that “Zhou does not provide any reference regarding whether user input is required for this process”. Examiner respectfully disagrees.
Zhou et al. teaches “wherein the composite image is generated without user input regarding a position and size of an object of a same type as an object included in the foreground image to be placed within the background image (Fig. 2: SAC-GAN predicts a 2D transform Tgen, which is utilized by a spatial transformer network (STN), to produce a composed image; note that it is SAC-GAN which predicts a 2D transform Tgen and not the user, thus is interpreted as “without user input”. Furthermore, section 2 of Zhou et al. discloses automatically determining the location and other transformations such as rotation and scaling of more general object categories; note that the term “automatically” is interpreted as “without user input”),” as claimed.
Therefore, the cited prior art teaches applicant’s claimed invention as noted in the office action above.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to STACY KHOO whose telephone number is (571)270-3698. The examiner can normally be reached Mon-Fri 8:00 am-5:00 pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Eason, can be reached at 571-270-7230. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/STACY KHOO/
Primary Examiner, Art Unit 2624