Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 6 and 16 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Applicant’s amended claims 6 and 16 recite that the content map includes “object placement information”. Applicant’s remarks do not indicate where support is found for this amendment. Further, Applicant’s disclosure does not appear to use the term “placement”. Therefore, this amendment corresponds to subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 6 and 16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claims 6 and 16, the claims recite that the content map is global content feature information including “object placement information” and the style map is feature information for determining a detailed structure of the object including “a pose”. As one of ordinary skill in the art would understand, the pose of the object is part of the object’s placement information, and vice versa, i.e. the pose of the object may comprise a position and/or a configuration of articulated or moving elements, which is also object placement information.
For purposes of applying prior art, object placement/pose information will be considered to be both part of the content map and part of the style map.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5, 9-13, 15, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over “Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering” by Yuxuan Zhang, et al. (hereinafter Zhang) in view of “Training Generative Adversarial Networks with Limited Data” by Tero Karras, et al. (hereinafter Karras).
Regarding claim 1, the limitations “A rendering method comprising: receiving training image data including a real image; generating a rendering simulation image using the training image data … acquiring background feature information by separating foreground and background areas on the basis of the rendering simulation image or the training image data” are taught by Zhang (Zhang, e.g. abstract, sections 1, 3, 4, discloses a rendering system which combines a styleGAN rendering neural network with an inverse graphics neural network in order to disentangle styleGAN’s latent code space by training a mapping neural network. Zhang, e.g. section 4, paragraph 2, teaches that three class specific styleGAN models were trained, i.e. car, horse, and bird models, using existing real image sets comprising 5.7 million, 2 million, and 48 thousand images, respectively, corresponding to the claimed receiving training data including a real image. Zhang, e.g. section 3.1, section 4, paragraph 3, further teaches generation of a styleGAN multi-view dataset for each of the styleGAN models, i.e. the claimed generating a rendering simulation image using the training image data. Further, Zhang, sections 3.1, 3.2, further teaches that a segmentation mask is determined for each image in the styleGAN datasets which is used in training the inverse graphics neural network by separating the input images into a foreground object whose shape and texture are predicted using the portion of the input image within the segmentation mask, and a background portion which does not influence the prediction, i.e. the claimed separating foreground and background areas on the basis of the rendering simulation image. Finally, Zhang, section 3.3, describes using the trained inverse graphics network to train the mapping network and fine-tune the styleGAN networks, including obtaining the background component for a given image by masking out the object, i.e. the claimed acquiring background feature information by separating foreground and background areas on the basis of the rendering simulation image, where the background component is one of the disentangled components mapped by the mapping network, and is also used to fine-tune the styleGAN networks.)
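For illustration only, the separation of an image into foreground and background areas using a segmentation mask, with the masked-out content retained as the background portion in the manner described above, can be sketched as follows. This is a minimal sketch, not Zhang’s code; all function and variable names are hypothetical.

    import numpy as np

    def split_foreground_background(image: np.ndarray, mask: np.ndarray):
        # image: (H, W, 3) float array in [0, 1]; mask: (H, W) array, 1 for foreground pixels.
        mask3 = mask[..., None].astype(image.dtype)
        foreground = image * mask3          # object pixels kept, background zeroed
        background = image * (1.0 - mask3)  # object masked out, background content kept
        return foreground, background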
The limitation “generating a rendering simulation image using the training data by converting the real image into an image having degraded image quality or a stylistic transformation, which simulates unrealism of a three-dimensional (3D) graphic rendering image” is partially taught by Zhang (Zhang does not explicitly suggest synthesizing training images which simulate unrealism, per se, although the images generated by Zhang’s styleGAN-R system are simulating three-dimensional graphics rendering, as claimed.) However, this limitation is taught by Karras (Karras, e.g. abstract, sections 1-5, describes a technique for improving training results of generative adversarial networks using data augmentation, which operates by including augmentation nodes in the training pipeline, e.g. figure 2b, in order to train the discriminator and generator using loss functions calculated on augmented generated and real images, instead of the generated and real images directly, which results in improved training results for a given training dataset size as discussed in sections 4 and 5. Depending claim 2 clarifies that the degraded image quality resulting from the conversion is one of color distortion, noise, and image resolution degradation. Further, Karras, e.g. section 2.3, indicates 18 transformations are considered for the augmentation, including geometric transforms and image-space filtering, i.e. the claimed image resolution degradation, color transforms, i.e. the claimed color distortion, and additive noise, i.e. the claimed noise, such that Karras’ data augmentation GAN training technique corresponds to the claimed rendering simulation images which simulate unrealism of the input training images by converting a real image from the training image dataset into an image having degraded image quality.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Zhang’s styleGAN-R system to use Karras’ data augmentation GAN training technique in order to improve the training results of the resulting styleGAN-R system by augmenting styleGAN training datasets to generate training images simulating unrealistic 3D renderings of the corresponding type of object as taught by Karras. In the modified system, the randomly sampled images of the styleGAN training dataset, e.g. Zhang, section 4, paragraph 3, would be augmented according to the probability factor p when being used for training the styleGAN generator, as taught by Karras, section 2.3, such that as claimed, the simulation image rendering further includes the unrealism simulation of the 3D image by converting the real image(s) of the training data into image(s) having degraded image quality, i.e. performing one of the color distortion, noise, or image resolution degradation(s) on the real image(s) as recited in depending claim 2.
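For illustration only, applying one of the exemplary quality-degrading operations (color distortion, additive noise, or resolution degradation) to a real training image with probability p, in the manner of the augmentation discussed above, can be sketched as follows. This is a minimal sketch, not Karras’ or Zhang’s code; the probability, gain ranges, noise magnitude, and downsampling factor are hypothetical values chosen for illustration.

    import numpy as np

    def degrade_with_probability(image: np.ndarray, p: float = 0.8,
                                 rng: np.random.Generator = None) -> np.ndarray:
        # image: (H, W, 3) float array in [0, 1], with H and W divisible by 4.
        rng = rng or np.random.default_rng()
        if rng.random() >= p:                              # pass through with probability 1 - p
            return image
        choice = rng.integers(3)
        if choice == 0:                                    # color distortion: random per-channel gain
            gains = rng.uniform(0.6, 1.4, size=(1, 1, 3))
            return np.clip(image * gains, 0.0, 1.0)
        if choice == 1:                                    # additive Gaussian noise
            noise = rng.normal(0.0, 0.05, size=image.shape)
            return np.clip(image + noise, 0.0, 1.0)
        low = image[::4, ::4, :]                           # resolution degradation: 4x down/up-sample
        return np.repeat(np.repeat(low, 4, axis=0), 4, axis=1)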
The limitations “acquiring latent feature information required for generating a realistic image on the basis of the rendering simulation image; and generating a realistic image on the basis of the latent feature information and the background feature information” are taught by Zhang (Zhang’s trained inverse rendering network can be used to predict the shape S and texture T components of an input image, e.g. section 3.2, which, along with a viewpoint V associated with the input image, and background B component, can be provided as input to the trained mapping network to determine the corresponding latent code for the styleGAN model, e.g. section 3.3, figure 3, where said latent code, when provided as input to the styleGAN model, produces a realistic image having viewpoint V, background B, and a foreground object having shape S and texture T. The S and T components which are predicted by the inverse graphics network, along with the image’s associated viewpoint V component, correspond to the claimed latent feature information acquired based on the rendering simulation image, which is combined with the background B component, corresponding to the claimed background feature information, by the mapping network to determine the latent code which produces the corresponding realistic image. Zhang, sections 4.2, 4.3, describes example results from the trained system, including recreating the original image as in figure 7, as well as swapping components produced by different input images as in figures 10 and 11, i.e. the S, T, and B components can be determined from any input image, and used without alteration to recreate the input image, or separately replaced by the S, T, or B component(s) determined from second input image(s), where recreating the input image corresponds to the claimed acquiring/generating steps, i.e. an input image from the simulated training image dataset, i.e. Zhang’s styleGAN synthetic image datasets, is provided to the inverse graphics network to determine the S, T, B components, which are in turn provided as input to the mapping network to determine a corresponding latent code provided to the styleGAN model to generate a realistic image similar to the input image.)
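For illustration only, the data flow described above can be sketched as follows. The callables inverse_graphics, encode_background, mapping_network, and stylegan stand in for the trained networks and are hypothetical names, not Zhang’s actual interfaces.

    def generate_realistic(sim_image, viewpoint,
                           inverse_graphics, encode_background,
                           mapping_network, stylegan):
        S, T = inverse_graphics(sim_image)       # predicted shape and texture components
        B = encode_background(sim_image)         # background feature from the masked-out background
        w = mapping_network(viewpoint, S, T, B)  # disentangled components -> StyleGAN latent code
        return stylegan(w)                       # latent code -> realistic image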
Regarding claim 2, the limitations “wherein the image that simulates the unrealism of the 3D graphic rendering image is obtained by degrading the image quality of the real image through at least one of color distortion, noise, and image resolution degradation of the training image data, or by converting the real image through a style-transfer neural network” are taught by Zhang in view of Karras (As discussed in the claim 1 modification above, in the modified system, the randomly sampled images of the styleGAN training dataset, e.g. Zhang, section 4, paragraph 3, would be augmented according to the probability factor p when being used for training the styleGAN generator, as taught by Karras, section 2.3, such that as claimed, the simulation image rendering further includes the unrealism simulation of the 3D image by converting the real image(s) of the training data into image(s) having degraded image quality, i.e. performing one of the color distortion, noise, or image resolution degradation(s) on the real image(s).)
Regarding claim 3, the limitation “wherein the acquiring of the background feature information by separating the foreground and background areas comprises: separating the foreground and background areas into the foreground area which is a target object of realism visualization and the background area which is not a target of realism visualization on the basis of the rendering simulation image or the training image data and acquiring the background feature information from the background area” is taught by Zhang (As discussed in the claim 1 rejection above, Zhang, sections 3.1, 3.2, teaches that a segmentation mask is determined for each image in the styleGAN datasets which is used in training the inverse graphics neural network by separating the input images into a foreground object whose shape and texture are predicted using the portion of the input image within the segmentation mask, and a background portion which does not influence the prediction, i.e. the claimed separating foreground and background areas on the basis of the rendering simulation image, where the foreground object is the target of realism visualization, i.e. the inverse graphics prediction is used to determine a consistent multiview representation of the foreground object, where the background content is modeled with less fidelity. Further as noted in the claim 1 rejection, Zhang, section 3.3, describes using the trained inverse graphics network to train the mapping network and fine-tune the styleGAN networks, including obtaining the background component for a given image by masking out the object, i.e. the claimed acquiring background feature information by separating foreground and background areas on the basis of the rendering simulation image, where the background B component is background feature information acquired from the background area.)
Regarding claim 5, the limitation “wherein the generating of the realistic image on the basis of the latent feature information and the background feature information comprises: generating a content map and a style map of the rendering simulation image from the latent feature information; and generating a realistic image on the basis of the content map, the style map, and the background feature information” is taught by Zhang (Depending claim 6 clarifies that the content map is global content feature information including a category of object and object placement information, whereas the style map is detailed structure feature information, including a pose, and a texture feature including color or texture. As discussed in the claim 1 rejection above, Zhang’s trained inverse rendering network can be used to predict the shape S and texture T components of an input image, e.g. section 3.2, which, along with a viewpoint V associated with the input image, and background B component, can be provided as input to the trained mapping network to determine the corresponding latent code for the styleGAN model, e.g. section 3.3, figure 3, where said latent code, when provided as input to the styleGAN model, produces a realistic image having viewpoint V, background B, and a foreground object having shape S and texture T. The S and T components which are predicted by the inverse graphics network, along with the image’s associated viewpoint V component, correspond to the claimed latent feature information acquired based on the rendering simulation image, where S and T components correspond to the claimed style map, i.e. the shape S is a detailed structure of the object and T is the color/texture of the object, and the V component corresponds to the claimed content map and style map, i.e. the specified viewpoint is both object arrangement information, and a pose of the object relative to the viewpoint. Further, it is noted that Zhang’s system also implicitly includes the claimed category of an object to be rendered, i.e. three different styleGAN models are used for three different object classes, implicitly requiring class information to be associated with each image in order to be used with the corresponding network(s). Finally, as discussed in the claim 1 rejection above, the V, S, and T components comprising the claimed content and style map, correspond to the claimed latent feature information acquired based on the rendering simulation image, which is combined with the background B component, corresponding to the claimed background feature information, by the mapping network to determine the latent code which produces the corresponding realistic image, i.e. generating the claimed realistic image based on the content map, style map, and background feature information.)
Regarding claim 9, the limitation “further comprising training a neural network using an error calculated on the basis of the realistic image” is taught by Zhang (Depending claim 10 further clarifies that the error, calculated on the generated realistic image(s) used to train a neural network, is an adversarial generative error calculated using a generator neural network which generates a realistic image on the basis of the latent feature information and the background feature information. Zhang, e.g. section 1, paragraph 5, section 3.3, figure 1, teaches that the final training operation is fine-tuning the styleGAN network using the inverse graphics and mapping neural networks, using equation 6 which measures the error between a sample shape S, texture T, and background B and the S, T, and B components predicted by the inverse graphics network using an image synthesized by the styleGAN network based on the latent code generated by the mapping network using the sample components S, T, and B, where the S and T errors are measured in image space, i.e. between the sampled and predicted images, and the B errors are measured in code space. That is, as claimed, the styleGAN network is fine-tuned, i.e. the claimed neural network being trained, using an error calculated based on the realistic images generated by the styleGAN network which generates realistic image(s) on the basis of the latent feature information V, S, T, and background feature information B, corresponding to the adversarial generative error of a generative adversarial network using the styleGAN network as a generator. With respect to the latent feature and background feature information being acquired from the simulated image(s) as in claim 1, it is noted that Zhang, e.g. section 4, paragraph 3, indicates that the random samples used to fine-tune the styleGAN models correspond to the synthesized dataset.)
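For illustration only, a cycle-consistency style fine-tuning objective of the kind described above can be sketched as follows. This is a schematic sketch, not Zhang’s equation 6 verbatim; the callables and the distance function dist are hypothetical stand-ins.

    def finetune_loss(V, S, T, B, mapping_network, stylegan,
                      inverse_graphics, encode_background, dist):
        w = mapping_network(V, S, T, B)          # sampled components -> latent code
        image = stylegan(w)                      # synthesized realistic image
        S_hat, T_hat = inverse_graphics(image)   # components re-predicted from the synthesis
        B_hat = encode_background(image)
        # shape/texture errors compared in rendered-image space, background error in code space
        return dist(S, S_hat) + dist(T, T_hat) + dist(B, B_hat)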
Regarding claim 10, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 9 above.
Regarding claim 11, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 1 above, where one of ordinary skill in the art would have understood that Zhang’s neural networks are implemented using processors executing program instructions stored in memory, e.g. section 4.1, paragraph 1, indicates training the inverse graphics network in 120 hours using 4 GPUs.
Regarding claim 12, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 2 above.
Regarding claim 13, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 3 above.
Regarding claim 15, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 5 above.
Regarding claim 19, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 9 above.
Regarding claim 20, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 10 above.
Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over “Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering” by Yuxuan Zhang, et al. (hereinafter Zhang) in view of “Training Generative Adversarial Networks with Limited Data” by Tero Karras, et al. (hereinafter Karras) as applied to claims 1 and 11 above, and further in view of “Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer” by Wenzheng Chen, et al. (hereinafter Chen) in view of “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs” by Ting-Chun Wang, et al. (hereinafter Wang).
Regarding claim 4, the limitation “wherein the acquiring of the latent feature information required for generating a realistic image on the basis of the rendering simulation image comprises: acquiring a latent vector and a multi-resolution feature map required for generating a realistic image by encoding the rendering simulation image” is partially taught by Zhang (As discussed in the claim 1 rejection above, Zhang’s trained inverse rendering network can be used to predict the shape S and texture T components of an input image, e.g. section 3.2, which, along with a viewpoint V associated with the input image, and background B component, can be provided as input to the trained mapping network to determine the corresponding latent code for the styleGAN model, e.g. section 3.3, figure 3, where said latent code, when provided as input to the styleGAN model, produces a realistic image having viewpoint V, background B, and a foreground object having shape S and texture T. The S and T components which are predicted by the inverse graphics network, along with the image’s associated viewpoint V component, correspond to the claimed latent feature information acquired based on the rendering simulation image. Zhang, section 3.3, indicates that the latent feature information, i.e. V, S, T, are defined in a high dimensional space, i.e. are the claimed latent vector acquired by encoding the rendering simulation image using the inverse graphics network. Further, while Zhang describes the inverse graphics network based on DIB-R by Chen, e.g. section 3.2, which uses the perceptual loss function Lpercept as part of training the inverse graphics network, Zhang does not explicitly indicate that the loss functions in equation 1 are based on multi-resolution feature maps acquired from the input image being processed by the inverse graphics network. However, Chen, in view of Wang, teaches that the Lpercept function is calculated using multi-resolution feature maps of an input image.) However, this limitation is taught by Chen in view of Wang (Chen, e.g. abstract, sections 1-6, describes DIB-R, the inverse graphics network design upon which Zhang’s inverse graphics network is based, including equation 16, an Lpercept function, which is calculated based on differences between features at different layers i of the VGG network V and discriminator network D. While not explicitly stated by Chen, Wang, i.e. the reference 34 which Chen cites as disclosing a similar Lpercept function in section 4.1, discloses, e.g. section 3.2, that the different layers correspond to different scales/resolutions, i.e. the Lpercept function used by Chen, and by extension, Zhang, is calculated using feature maps of the input image(s) at multiple resolutions, indicating that Chen’s DIB-R network, and by extension Zhang’s inverse graphics network using Chen’s Lpercept function, is an encoder which also generates multi-resolution feature maps of the input image(s) in addition to determining the S, T, and B components of the input image.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement Zhang’s styleGAN-R system, using Karras’ data augmentation GAN training technique, using Chen’s Lpercept function, because Zhang indicates Chen’s DIB-R is the basis of Zhang’s inverse graphics network, where one of ordinary skill in the art would have understood, in view of Chen and Wang as discussed above, the Lpercept function used by Chen, and by extension, Zhang, is calculated using feature maps of the input image(s) at multiple resolutions, such that, by extension, Zhang’s inverse graphics network using Chen’s Lpercept function is an encoder which also generates multi-resolution feature maps of the input image(s) in addition to determining the S, T, and B components of the input image, corresponding to the claimed encoder.
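For illustration only, a multi-resolution perceptual loss of the general kind discussed above can be sketched as follows. This is a generic formulation, not Chen’s equation 16 or Wang’s loss verbatim; the features callable, which returns feature maps from several layers (resolutions) of a fixed feature extractor, is a hypothetical stand-in.

    import numpy as np

    def perceptual_loss(img_a, img_b, features, weights=None):
        feats_a = features(img_a)   # e.g. feature maps at full, 1/2, 1/4, ... resolution
        feats_b = features(img_b)
        weights = weights or [1.0] * len(feats_a)
        # weighted sum of mean absolute feature differences across scales
        return sum(w * np.abs(fa - fb).mean()
                   for w, fa, fb in zip(weights, feats_a, feats_b))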
Regarding claim 14, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 4 above.
Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over “Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering” by Yuxuan Zhang, et al. (hereinafter Zhang) in view of “Training Generative Adversarial Networks with Limited Data” by Tero Karras, et al. (hereinafter Karras) as applied to claims 5 and 15 above, and further in view of “Learning an Animatable Detailed 3D Face Model from In-The-Wild Images” by Yao Feng, et al. (hereinafter Feng).
Regarding claim 6, the limitations “wherein the content map is global content feature information including a category of an object to be rendered, and placement information, and the style map is feature information for determining a detailed structure of the object to be rendered, and a texture feature, wherein the detailed structure includes a pose … and the texture feature includes one of a color and texture” are taught by Zhang (As discussed in the claim 5 rejection above, the content map and style map were interpreted in view of claim 6’s clarification that the content map is global content feature information including a category of object and object placement information, whereas the style map is detailed structure feature information, including a pose, and a texture feature including color or texture. As discussed in the claims 1 and 5 rejections above, Zhang’s trained inverse rendering network can be used to predict the shape S and texture T components of an input image, e.g. section 3.2, which, along with a viewpoint V associated with the input image, and background B component, can be provided as input to the trained mapping network to determine the corresponding latent code for the styleGAN model, e.g. section 3.3, figure 3, where said latent code, when provided as input to the styleGAN model, produces a realistic image having viewpoint V, background B, and a foreground object having shape S and texture T. The S and T components which are predicted by the inverse graphics network, along with the image’s associated viewpoint V component, correspond to the claimed latent feature information acquired based on the rendering simulation image, where S and T components correspond to the claimed style map, i.e. the shape S is a detailed structure of the object and T is the color/texture of the object, and the V component corresponds to the claimed content map and style map, i.e. the specified viewpoint is both object placement information, and a pose of the object relative to the viewpoint. Further, it was noted that Zhang’s system also implicitly includes the claimed category of an object to be rendered, i.e. three different styleGAN models are used for three different object classes, implicitly requiring class information to be associated with each image in order to be used with the corresponding network(s).)
The limitation “the style map is feature information for determining a detailed structure of the object to be rendered, and a texture feature, wherein the detailed structure includes a pose, and a facial expression” is not explicitly taught by Zhang (While Zhang’s V, S, and T components correspond to the claimed detailed structure of the object to be rendered, including the pose, shape, and texture, as discussed in the claim 4 rejection above, Zhang, e.g. section 3.2, indicates that the differentiable renderer DIB-R disclosed by Chen is adopted as the inverse graphics basis for Zhang’s implemented system, where the shape component S represents the shape of the object, but not necessarily a facial expression, per se. That is, while Zhang’s system could be applied to a category of humans or animals with faces, analogous to the exemplary bird and horse examples, Chen’s DIB-R would still model the shape of a face using a deformed sphere mesh as in section 3.2, paragraph 1, such that the facial expression, per se, would not be one of the components being predicted by the trained inverse graphics network. Finally, it is noted that Zhang, section 3.2, indicates that DIB-R’s performance for lighting prediction was weak and therefore omitted, and Zhang, e.g. section 4.4, further indicates that the resulting model fails to predict correct lighting, as well as that predicting faithful shapes for out-of-distribution objects is a challenge, such that one of ordinary skill in the art would have been motivated to substitute other differentiable renderers/inverse graphics networks for Chen’s DIB-R differentiable renderer in order to improve system performance, e.g. to improve the lighting prediction, or shape prediction for other categories of objects.) However, this limitation is taught by Feng (Feng, e.g. abstract, sections 1, 3-8, describes the Detailed Expression Capture and Animation (DECA) system, which is an inverse graphics network specialized for predicting the latent code parameters for rendering an image matching an input image, specifically the parameters of human faces, e.g. Feng, figure 2, left, shows the inverse graphics network analogous to Zhang, figure 1, right, where an input image is passed through a network predicting rendering parameters which are provided as input to a differentiable renderer generating an output image used for computing a loss function. Further, Feng, e.g. sections 3, 4.1, indicates that the latent code components include pose θ and camera parameters c, corresponding to the claimed object placement information and pose, identity β and expression ψ parameters, corresponding to Zhang’s shape component S and the claimed style map facial expression feature information, as well as albedo coefficients α and lighting parameters l which are used to calculate the color in the rendered image using equation 3, corresponding to the claimed texture feature information, e.g. as shown in figure 2, the albedo map is a texture representing the surface of the face. Finally, Feng, e.g. section 6.1, indicates that DECA produces high-quality reconstructions while being robust to variations in head pose, expression, occlusions, image resolution, and lighting conditions.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Zhang’s styleGAN-R system, using Karras’ data augmentation GAN training technique, to use Feng’s DECA inverse graphics network as a substitute for Chen’s DIB-R inverse graphics network in order to include a human-face-specific version of styleGAN-R analogous to the car, bird, and horse versions of styleGAN-R disclosed by Zhang. In the human face styleGAN-R version, Feng’s DECA would be substituted for Chen’s DIB-R, replacing the right side component of Zhang, figure 1, and the StyleGAN component would be trained on a human face image dataset, e.g. Feng, section 5, notes exemplary human face image datasets which could be used. Further, Zhang’s modified system including the human face styleGAN-R version would have the claimed content map and style map feature information, i.e. as discussed above and in the claim 5 rejection, Zhang’s system implicitly includes the claimed category of an object to be rendered, i.e. human face objects would use the human face styleGAN-R version, as well as the claimed latent feature components of object placement and pose, i.e. Feng’s pose θ and camera parameters c, facial expression feature information, i.e. Feng’s expression ψ parameters, and the texture feature information, i.e. Feng’s albedo coefficients α and lighting parameters l, which are used to calculate the color in the rendered image using equation 3.
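For illustration only, the per-image parameters predicted by a DECA-style face encoder, grouped according to the content map/style map interpretation applied above, can be sketched as follows. The field names, dimensions, and grouping comments are hypothetical labels for this sketch, not Feng’s code.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class FaceCode:
        camera: np.ndarray      # c (camera parameters)      - placement information / content map
        pose: np.ndarray        # theta (head/jaw pose)       - pose / style map
        identity: np.ndarray    # beta (identity shape)       - detailed structure / style map
        expression: np.ndarray  # psi (facial expression)     - facial expression / style map
        albedo: np.ndarray      # alpha (albedo coefficients) - texture feature / style map
        lighting: np.ndarray    # l (lighting coefficients)   - used with albedo to compute color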
Regarding claim 16, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 6 above.
Claims 7, 8, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over “Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering” by Yuxuan Zhang, et al. (hereinafter Zhang) in view of “Training Generative Adversarial Networks with Limited Data” by Tero Karras, et al. (hereinafter Karras) as applied to claims 1 and 11 above, and further in view of “Large Scale GAN Training for High Fidelity Natural Image Synthesis” by Andrew Brock, et al. (hereinafter Brock).
Regarding claim 7, the limitation “wherein the generating of the realistic image on the basis of the latent feature information and the background feature information comprises: when noise is input, extracting a content map and a style map corresponding to the noise and generating a realistic image on the basis of the content map and the style map using a neural network model which is pretrained to output a realistic image as a result” is implicitly taught by Zhang (As discussed in the claim 5 rejection above, Zhang’s trained inverse rendering network can be used to predict the shape S and texture T components of an input image, e.g. section 3.2, which, along with a viewpoint V associated with the input image, and background B component, can be provided as input to the trained mapping network to determine the corresponding latent code for the styleGAN model, e.g. section 3.3, figure 3, where said latent code, when provided as input to the styleGAN model, produces a realistic image having viewpoint V, background B, and a foreground object having shape S and texture T. The S and T components which are predicted by the inverse graphics network, along with the image’s associated viewpoint V component, correspond to the claimed latent feature information acquired based on the rendering simulation image, where S and T components correspond to the claimed style map, i.e. the shape S is a detailed structure of the object and T is the color/texture of the object, and the V component corresponds to the claimed content map and style map, i.e. the specified viewpoint is both object arrangement information, and a pose of the object relative to the viewpoint. While Zhang, e.g. section 3.3, paragraph 6, indicates that the mapping network is trained by sampling viewpoint, shape, texture, and background codes, Zhang does not explicitly indicate what would have been implicit to one of ordinary skill in the art, i.e. the training samples are generated randomly based on noise, such that as claimed, training would involve inputting noise used to determine a random sample, extracting content and style map components of the latent code corresponding to the random sample determined based on the input noise, and comparing the mapped code to the styleGAN code using equation 5, where said mapped code can also be input to the styleGAN network, i.e. the claimed content/style map of the random sample extracted from the input noise can be provided to styleGAN to generate the claimed realistic image on the basis of the content/style map. While Zhang does not explicitly indicate the samples are randomly generated based on noise, Brock, describing training GANs for image synthesis, does teach generating random samples using noise.) However, this limitation is taught by Brock (Brock, e.g. section 2, paragraph 1, explains that GANs are trained by mapping random noise to samples, and then discriminating real and generated samples, i.e. a well known basis for generating random samples for training is mapping random noise to samples.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement Zhang’s styleGAN-R system, using Karras’ data augmentation GAN training technique, to generate the sample codes for training Zhang’s mapping network by mapping random noise to samples as taught by Brock, because Zhang does not explicitly indicate whether the samples are generated by random noise, although it would have been implicit to one of ordinary skill in the art, e.g. in view of Brock, that one basis for generating random samples is mapping random noise to samples. As discussed above, when the training samples are generated randomly based on noise, as claimed, training would involve inputting noise used to determine a random sample, extracting content and style map components of the latent code corresponding to the random sample determined based on the input noise, and comparing the mapped code to the styleGAN code using equation 5, where said mapped code can also be input to the styleGAN network, i.e. the claimed content/style map of the random sample extracted from the input noise can be provided to styleGAN to generate the claimed realistic image on the basis of the content/style map.
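For illustration only, sampling training component codes from random noise, in the conventional manner attributed to Brock above, can be sketched as follows. The component dimensions and the standard-normal distribution are assumptions made for this sketch only.

    import numpy as np

    def sample_components(rng: np.random.Generator = None,
                          dims=(3, 512, 512, 256)):
        rng = rng or np.random.default_rng()
        v_dim, s_dim, t_dim, b_dim = dims
        V = rng.standard_normal(v_dim)   # viewpoint sample drawn from noise
        S = rng.standard_normal(s_dim)   # shape code sample
        T = rng.standard_normal(t_dim)   # texture code sample
        B = rng.standard_normal(b_dim)   # background code sample
        return V, S, T, B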
Regarding claim 8, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claims 5 and 7 above, i.e. as discussed in the claim 7 rejection, the mapping network is trained using random samples based on noise, and as discussed in the claim 5 rejection, the content, style, and background components predicted by the inverse graphics network are mapped by the mapping network to a latent code provided to the pre-trained styleGAN network to generate a realistic image.
Regarding claim 17, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 7 above.
Regarding claim 18, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 8 above.
Response to Arguments
Applicant’s arguments, see page 8, paragraph 3, filed 7/21/25, with respect to the 35 U.S.C. 112(b) rejections of claims 2, 6, 12, and 16 have been fully considered and are persuasive. The 35 U.S.C. 112(b) rejections of claims 2, 6, 12, and 16 have been withdrawn, with the exception of the 35 U.S.C. 112(b) rejection based on “arrangement”, now amended to “placement”, and “pose”.
Applicant's arguments filed 7/21/25 have been fully considered but they are not persuasive.
Applicant asserts, with respect to the 35 U.S.C. 112(b) rejection based on “arrangement”, now amended to “placement”, and “pose”, that the object arrangement information includes “at least one of a position, orientation, scale, or relative placement with respect to other objects”, while the pose “may refer to a combination of the position and orientation of an object in three-dimensional space”. Applicant’s remarks do not cite support for these interpretations from the disclosure. Further, in addition to Applicant’s disclosure not mentioning the term “placement” at all, Applicant’s disclosure does not provide any particular definition for the terms “arrangement” or “pose”, i.e. Applicant’s disclosure does not support Applicant’s assertions regarding the scope of the placement/arrangement information and the pose. Therefore, Applicant’s argument cannot be considered persuasive.
Applicant asserts that “the purpose of generating a “rendering simulation image” in the presently claimed embodiment is to simulate the “unrealism” or “unnaturalness” inherent in rendered images from 3D graphic rendering”, e.g. page 10 of the remarks. Applicant’s assertion is not supported by citation to the disclosure describing such a purpose. Indeed, Applicant’s disclosure does not use any variant of the term “natural”, and does not discuss any “unrealism” being “inherent in rendered images from 3D graphic rendering”. Further, Applicant’s disclosure at no point offers any specific definition for the term “unrealism”, and instead only describes exemplary operations which can be performed on real input images to degrade the real image into an image simulating “unrealism”, with the exemplary operations being color distortion, adding Gaussian noise, image resolution degradation, or applying a style-conversion network. That is, the only reasonable interpretation for Applicant’s claimed “rendering simulation image[s] … simulating unrealism” is performing one or more of said exemplary operations, as explicitly recited in depending claims 2 and 12. Therefore, Applicant’s arguments asserting that “mimicking unrealism” requires something other than the specifically recited exemplary operations are not persuasive because Applicant’s disclosure does not describe the simulated “unrealism” beyond using the exemplary operations.
Applicant argues that Karras’ augmentation is not about “elements that compromise realism,” but rather about “quantitative/qualitative expansion of training data”. First, as discussed above, this argument is not persuasive because Karras’ augmentations are performed using three of Applicant’s disclosed exemplary operations for converting a real image into an image simulating unrealism: Karras, e.g. section 2.3, indicates that 18 transformations are considered for the augmentation, including geometric transforms and image-space filtering, i.e. the claimed image resolution degradation, color transforms, i.e. the claimed color distortion, and additive noise, i.e. the claimed noise. Second, the argument is not persuasive because Karras’ transformations are not intended for “preserving ‘realism’”, e.g. remarks, page 12. Applicant cites no portion of Karras suggesting that the transforms are selected to preserve “realism”, and indeed the disclosed transforms, in addition to being the same as Applicant’s exemplary operations, would not preserve realism: geometric transforms, color transforms, image-space filtering, additive noise, and cutouts can all reduce the “realism” of the transformed image. For example, a geometric transform can scale the image to a lower resolution having reduced clarity, a color transform can cause image elements to have unnatural colors, adding noise results in a less defined image, and cutouts literally cut portions of an image out, resulting in unrealistic holes. That is, Applicant’s argument essentially amounts to an opinion that Karras is directed to “preserving realism” without citing any discussion thereof in Karras and without showing any technical distinction between Applicant’s disclosed exemplary operations for simulating unrealism and Karras’ disclosed transforms for augmentation. Therefore this argument cannot be considered persuasive because it does not identify a difference between the claim scope as recited in the claims and the combination of Zhang and Karras’ disclosures as mapped in the rejections.
Applicant notes that Karras “does not directly mention “stylistic transformation” as a method of data augmentation”. Applicant is reminded that the claims recite this as an alternative to the transforms which are disclosed by Karras, and therefore this argument is not persuasive.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT BADER whose telephone number is (571)270-3335. The examiner can normally be reached Monday through Friday, 11-7.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard, can be reached at 571-272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ROBERT BADER/Primary Examiner, Art Unit 2611