Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 17 April 2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5, 17-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu (US 20240153153) in view of Xu (US 11599972).
Regarding claim 1, Liu teaches a method performed by one or more computers, the method comprising:
obtaining an input image (Paragraph 3, provide an initial image);
processing the input image using an image encoder neural network to generate an initial latent representation of the input image (Paragraph 29, image encoder 316 receives the first stage further-processed image 310 as input to generate an image embedding);
updating the initial latent representation over n update iterations to generate an updated latent representation of the input image (Paragraph 17, The iterative process outputs an image at each iteration, and the CLIP model computes the similarity), the updating comprising, at each update iteration:
processing, using a diffusion neural network, a denoising input comprising an intermediate latent representation derived from the initial latent representation to generate a denoising output for the intermediate latent representation (Paragraph 46, Processing of the initial image 500 can include transformation of the image through a denoising process that adjusts the pixel values of the image based on a probabilistic distribution);
determining an update to the intermediate latent representation based on the computed gradient of the latent space objective function (Paragraph 47, A gradient applicator 326 then applies the calculated gradient 504 to the processed image 502 to generate an updated initial image);
determining one or more updates to the input image based on the computed gradient of the image space objective function (Paragraph 30, A gradient applicator 326 then applies the calculated first stage gradient 314 to the first stage processed image 306, which was generated by the diffusion model 304, to generate an updated initial image).
While Liu fails to disclose the following, Xu teaches:
computing a gradient of a latent space objective function that measures a difference between (i) the denoising output for the intermediate latent representation and (ii) a known noise included in the intermediate latent representation (Column 2, Lines 41-43, the denoising loss is evaluated based on a difference between the predicted noise vector and the noise added to the first training image);
processing the updated latent representation that is generated as a result of the n update iterations using an image decoder neural network to generate a target image (Column 2, Lines 65-66, decoding the quantized latent using a denoising model to produce an output image);
computing a gradient of an image space objective function that comprises a decoder-based accumulative score sampling (DASS) loss term that measures a difference between (i) the input image that is obtained prior to the n update iterations and (ii) the target image that is generated as the result of the n update iterations (Column 4, Lines 19-23, distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between input image and the output image in a numerical way).
Xu and Liu are both considered to be analogous to the claimed invention because they are in the same field of denoising. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu with Xu to compute a gradient of the difference between the known noise and the denoising output, as well as a gradient of the difference between the initial input image and the denoised final image. Doing so would allow for using a known way of efficiently analyzing how well the constructed image matches the desired image description.
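The following sketch is offered for illustration only and is not a characterization of Liu or Xu; the encoder, decoder, and denoiser modules, the iteration count, and the gradient step size are hypothetical stand-ins for the claimed arrangement of a latent-space denoising objective and an image-space reconstruction objective.

```python
# Illustrative sketch only; encoder, decoder, and denoiser are hypothetical
# stand-ins, not the networks disclosed by Liu or Xu.
import torch
import torch.nn.functional as F

def refine_latent(encoder, decoder, denoiser, x_input, n_iters=5, step=0.1):
    # obtain the initial latent representation of the input image
    z = encoder(x_input).detach()
    for t in range(n_iters):
        z = z.clone().requires_grad_(True)
        noise = torch.randn_like(z)          # known noise added to the latent
        z_noisy = z + noise                  # intermediate latent representation
        noise_pred = denoiser(z_noisy, t)    # denoising output (noise estimate)
        # latent-space objective: difference between denoising output and known noise
        latent_loss = F.mse_loss(noise_pred, noise)
        (grad_z,) = torch.autograd.grad(latent_loss, z)
        z = (z - step * grad_z).detach()     # update the intermediate latent
    x_target = decoder(z)                    # decode the updated latent into a target image
    # image-space objective: difference between the original input and the target image
    image_loss = F.mse_loss(x_target, x_input)
    return z, x_target, image_loss
```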
System claim 17 and CRM claim 20 correspond to method claim 1. Therefore, claims 17 and 20 are rejected for the same reasons as set forth above for claim 1.
Regarding claim 2, the combination of Liu and Xu teaches the method of claim 1, wherein n is an integer value greater than or equal to two (Liu, Paragraph 32, the second predetermined number of iterations is 10 iterations).
Regarding claim 3, the combination of Liu and Xu teaches the method of claim 2, wherein n is an integer value between three and ten (Liu, Paragraph 32, the second predetermined number of iterations is 10 iterations).
Regarding claim 4, the combination of Liu and Xu teaches the method of claim 1. While the combination as presented previously fails to disclose the following, Xu further teaches:
wherein the denoising output comprises a noise estimate of the intermediate latent representation (Xu, Column 2, Lines 1-8, The denoising process may be an iterative process and may include a denoising function configured to predict a noise vector; wherein the denoising function receives as input an output of the previous iterative step, the data based on the latent representation and parameters describing a noise distribution; and the noise vector is applied to the output of the previous iterative step to obtain the output of the current iterative step).
Xu and Liu are both considered to be analogous to the claimed invention because they are in the same field of denoising. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu with Xu by using a noise estimate of the intermediate latent representation as the denoising output. Doing so would allow for determining the amount of denoising that occurred for comparison with the amount of noise added during the iterative process.
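For illustration only: when the denoising output is a noise estimate, a denoised latent can be recovered under the standard DDPM parameterization from the general diffusion literature (this formula is an assumption, not a quotation of Xu):

```python
import torch

def estimate_clean_latent(denoiser, z_noisy, t, alpha_bar_t):
    # DDPM forward process: z_noisy = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps,
    # so a noise estimate eps_hat yields an estimate of the clean latent z0.
    eps_hat = denoiser(z_noisy, t)  # denoising output: noise estimate of the noisy latent
    z0_hat = (z_noisy - (1.0 - alpha_bar_t) ** 0.5 * eps_hat) / alpha_bar_t ** 0.5
    return z0_hat
```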
Regarding claim 5, the combination of Liu and Xu teaches the method of claim 1, wherein the diffusion neural network is a pre-trained text-to-image diffusion neural network that operates on latent images (Liu, Paragraph 12, generating an output image corresponding to an input text using a multi-algorithm diffusion sampling process).
System claim 18 corresponds to method claim 5. Therefore, claim 18 is rejected for the same reasons as set forth above for claim 5.
Claims 6-8, 11, 13-16 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Xu as applied to claims 1-5, 17-18 and 20 above, and further in view of Zhou (US 12518485).
Regarding claim 6, the combination of Liu and Xu teaches the method of claim 1. While the combination fails to disclose the following, Zhou teaches:
wherein the input image is a 2D rendered image of a target object instance, and wherein obtaining the image comprises:
obtaining an initial image of the target object instance (Column 1, Lines 44-45, view synthesis of a dynamic human body); and
generating, from at least the initial image and by using a differentiable renderer, the 2D rendered image of the target object instance (Column 1, Lines 61-63, rendering, by a differentiable volume renderer, the neural network implicit function into a two-dimensional image).
Zhou and the combination of Liu and Xu are both considered to be analogous to the claimed invention because they are in the same field of image generation. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination of Liu and Xu with Zhou by using a differentiable renderer to generate the 2D image. Doing so would allow for using a known process to generate a 2D image from a 3D representation.
System claim 19 corresponds to method claim 6. Therefore, claim 19 is rejected for the same reasons as set forth above for claim 6.
Regarding claim 7, the combination of Liu, Xu, and Zhou teaches the method of claim 6. While the combination as presented previously fails to disclose the following, Zhou further teaches:
wherein generating the 2D rendered image of the target object instance comprises:
sampling a random camera pose (Column 2, Lines 5-7, the mesh nodes of the deformable human body model are driven by a posture of the human body to change a spatial position of the constructed structured latent variables); and
using the differentiable renderer to generate the 2D rendered image with respect to the sampled random camera pose (Column 2, Lines 29-34, the step of rendering, by a differentiable volume renderer, the neural network implicit function into a two-dimensional image includes: sampling a set of three-dimensional points along light projected to a pixel by a camera, calculating a volume density and a color of the three-dimensional points by using the neural network implicit function, and accumulating the volume density and the color on the light to obtain a pixel color).
Zhou and the combination of Liu and Xu are both considered to be analogous to the claimed invention because they are in the same field of image generation. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination of Liu and Xu with Zhou by using a differentiable renderer to generate the 2D image from a sampled random pose. Doing so would allow for using a known process to generate a 2D image of a 3D representation from a specific pose.
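For illustration only, a minimal sketch of random camera pose sampling and volume-rendering accumulation of density and color along a ray; the nerf callable, the sampling bounds, and the pose distribution are assumptions, not Zhou's disclosed implementation:

```python
import math
import torch

def sample_random_pose(radius=2.0):
    # sample a random camera position on a sphere looking toward the origin
    azimuth = torch.rand(()) * 2.0 * math.pi
    elevation = torch.rand(()) * (math.pi / 3.0)
    return radius * torch.stack([
        torch.cos(elevation) * torch.cos(azimuth),
        torch.cos(elevation) * torch.sin(azimuth),
        torch.sin(elevation),
    ])

def render_pixel(nerf, ray_origin, ray_dir, n_samples=64, near=0.5, far=2.5):
    # sample a set of 3D points along the ray projected to this pixel
    t_vals = torch.linspace(near, far, n_samples)
    pts = ray_origin + t_vals[:, None] * ray_dir
    sigma, rgb = nerf(pts)                       # volume density and color per point
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)      # per-segment opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = trans * alpha                      # transmittance times opacity
    return (weights[:, None] * rgb).sum(dim=0)   # accumulate color along the ray
```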
Regarding claim 8, the combination of Liu, Xu, and Zhou teaches the method of claim 7, wherein determining the one or more updates to the 2D rendered image comprises:
updating the differentiable renderer based on backpropagating the gradient of the image space objective function through the 2D rendered image to the differentiable renderer (Liu, Paragraph 32, second stage processed image 334 outputted by the diffusion model 304 is back-propagated through the text-image match gradient calculator 312 to calculate a second stage gradient 336 against the input text).
Regarding claim 11, the combination of Liu, Xu, and Zhou teaches the method of claim 6. While the combination as presented previously fails to disclose the following, Xu further teaches:
wherein the image space objective function also comprises a texture reconstruction loss term that measures a difference between (i) an enhanced image of the target object instance that has been generated by using the diffusion neural network from the input image and (ii) a rendered RGB image of the target object instance that has been generated based on a shape and texture estimation of the target object instance (Column 4, Lines 19-23 and 27-28, The difference between the input image and the output image may be referred to as distortion or a difference in image quality. The distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between input image and the output image in a numerical way… The distortion function may comprise a trained neural network).
Xu and the combination of Liu and Zhou are both considered to be analogous to the claimed invention because they are in the same field of image generation. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination of Liu and Zhou with Xu by determining a difference between an image generated by a neural network and a rendered image. Doing so would allow for using a known way of efficiently analyzing how well the constructed image matches the desired image description.
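Purely as illustration, and since Xu states that any distortion function may be used, one assumed instantiation of the texture reconstruction loss term is a per-pixel L2 difference:

```python
import torch.nn.functional as F

def texture_reconstruction_loss(enhanced_image, rendered_rgb):
    # per-pixel difference between the diffusion-enhanced image and the
    # rendered RGB image (one possible distortion function among many)
    return F.mse_loss(rendered_rgb, enhanced_image)
```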
Regarding claim 13, the combination of Liu, Xu, and Zhou teaches the method of claim 11, wherein generating the enhanced image by using the diffusion neural network from the input image comprises adding Gaussian noise to background pixels in the input image (Liu, Paragraph 46, the initial image 500 can be a random noise image with pixel values sampled from a Gaussian unit).
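For illustration only, adding Gaussian noise to background pixels can be sketched with a foreground mask; the mask source and the noise scale are assumptions, not Liu's disclosed procedure:

```python
import torch

def add_background_noise(image, foreground_mask, sigma=0.5):
    # add Gaussian noise only where the foreground mask is 0 (background pixels)
    noise = sigma * torch.randn_like(image)
    return image + noise * (1.0 - foreground_mask)
```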
Regarding claim 14, the combination of Liu, Xu, and Zhou teaches the method of claim 6. While the combination as presented previously fails to disclose the following, Zhou teaches:
using the updated differentiable renderer to generate a three-dimensional (3D) representation of the target object instance based on the input image (Column 4, Lines 49-51, the method for three-dimensional reconstruction and view synthesis of a dynamic human body provided by the present application, the neural network implicit function representation of structured latent variables is optimized by differential rendering).
Zhou and the combination of Liu and Xu are both considered to be analogous to the claimed invention because they are in the same field of image generation. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination of Liu and Xu with Zhou by generating a 3D representation based on the input image. Doing so would allow for using a known process to generate a 3D representation from a 2D image.
Regarding claim 15, the combination of Liu, Xu, and Zhou teaches the method of claim 14. While the combination as presented previously fails to disclose the following, Zhou further teaches:
wherein the 3D representation comprises a 3D voxel grid representation or a 3D mesh representation (Column 1, Lines 56-57, mesh nodes of a human model).
Zhou and the combination of Liu and Xu are both considered to be analogous to the claimed invention because they are in the same field of image generation. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination of Liu and Xu with Zhou by generating a 3D mesh representation. Doing so would allow for using a known process to represent a 3D model that is easily displayed and transformed.
Regarding claim 16, the combination of Liu, Xu, and Zhou teaches the method of claim 14, wherein the 3D representation of the target object instance has a user-specified pose, a user-specified motion, a user-specified texture, or a combination thereof (Liu, Paragraph 52, the multi-text guided image cropping module 114 can be implemented to receive different text inputs for different regions of the image to be generated). Note: While Liu teaches generating a 2D image based on a user-specified input, the combination of Liu, Xu, and Zhou teaches generating the 3D representation from the 2D image. Therefore, the combination teaches generating a 3D representation based on a user-specified pose, motion, or texture.
Claims 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Xu and further in view of Zhou as applied to claims 6-8, 11, 13-16 and 19 above, and further in view of Kamio (US 20250139874).
Regarding claim 9, the combination of Liu, Xu, and Zhou teaches the method of claim 7, wherein the differentiable renderer comprises:
a 2D renderer configured to generate the 2D rendered image based on the set of volumetric density values and the set of albedo values (Zhou, Column 2, Lines 29-34, the step of rendering, by a differentiable volume renderer, the neural network implicit function into a two-dimensional image includes: sampling a set of three-dimensional points along light projected to a pixel by a camera, calculating a volume density and a color of the three-dimensional points by using the neural network implicit function, and accumulating the volume density and the color on the light to obtain a pixel color).
Zhou and the combination of Liu and Xu are both considered to be analogous to the claimed invention because they are in the same field of image generation. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination of Liu and Xu with Zhou by using a differentiable volume renderer to generate the 2D rendered image from volumetric density and color values. Doing so would allow for using a known process to generate a 2D image from a 3D representation.
While the combination fails to disclose the following, Kamio teaches:
a neural radiance fields (NeRF) model configured to generate (i) a set of volumetric density values that define a density estimation of the target object instance and (ii) a set of albedo values that define a color estimation of the target object instance (Paragraph 58, NeRF is a technique of learning Radiance Fields (color RGB and its density σ)).
Kamio and the combination of Liu, Xu, and Zhou are both considered to be analogous to the claimed invention because they are in the same field of image generation. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination of Liu, Xu, and Zhou with Kamio by using a NeRF model to generate density and color values. Doing so would allow for using a known process to generate and store information for 2D and 3D model representations.
Regarding claim 10, the combination of Liu, Xu, Zhou, and Kamio teaches the method of claim 9. While the combination as presented above fails to disclose the following, Zhou further teaches:
wherein the NeRF model is a NeRF multi-layer perceptron (MLP) model, and wherein updating the differentiable renderer comprises updating parameter values of the NeRF MLP model (Column 4, Lines 7-8, The volume density field and color field here are represented by a multilayer perceptron network).
Zhou and the combination of Liu, Xu, and Kamio are both considered to be analogous to the claimed invention because they are in the same field of image generation. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination of Liu, Xu, and Kamio with Zhou by implementing the NeRF model as a multilayer perceptron. Doing so would allow for using a known NeRF formulation that allows for efficient 2D and 3D model representations.
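For illustration only, a minimal NeRF-style MLP with density and albedo heads, together with the parameter update recited in claims 8 and 10 (backpropagating an image-space loss through the rendered image into the renderer parameters); the layer sizes and optimizer are assumptions, not Zhou's or Kamio's architecture:

```python
import torch
from torch import nn

class NeRFMLP(nn.Module):
    # minimal NeRF-style MLP: maps a 3D point to a volumetric density value
    # and an albedo (color) estimate
    def __init__(self, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Linear(hidden, 3)

    def forward(self, pts):
        h = self.trunk(pts)
        sigma = torch.relu(self.density_head(h)).squeeze(-1)  # nonnegative density
        rgb = torch.sigmoid(self.color_head(h))               # colors in [0, 1]
        return sigma, rgb

nerf = NeRFMLP()
optimizer = torch.optim.Adam(nerf.parameters(), lr=1e-3)
# One update step (render_image and image_space_objective are hypothetical):
#   rendered = render_image(nerf, sample_random_pose())
#   loss = image_space_objective(rendered)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```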
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Xu and further in view of Zhou as applied to claims 6-8, 11, 13-16 and 19 above, and further in view of Kallenbach (US 20250139874).
Regarding claim 12, the combination of Liu, Xu, and Zhou teaches the method of claim 11. While the combination fails to disclose the following, Kallenbach teaches:
wherein the difference in the texture reconstruction loss term is masked by a foreground silhouette estimation of the target object instance that has been generated by using a trained vision Transformer (ViT) neural network (Paragraph 29, a Vision Transformer (ViT), and/or a spiking neural network (SNN). The NN extracts features from the digital images to classify every pixel into classes or categories, and as discussed below, will predict a segmentation mask).
Kallenbach and the combination of Liu, Xu, and Zhou are both considered to be analogous to the claimed invention because they are in the same field of image generation. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the combination of Liu, Xu, and Zhou with Kallenbach by using a ViT to generate the foreground silhouette mask. Doing so would allow for using a known process to segment the foreground from the background.
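Purely as illustration, masking the texture reconstruction difference by a foreground silhouette might look as follows; producing the silhouette mask (e.g., with a pretrained ViT segmenter) is outside this sketch:

```python
import torch

def masked_texture_loss(enhanced_image, rendered_rgb, silhouette_mask):
    # silhouette_mask: 1.0 for estimated foreground pixels, 0.0 for background
    diff = (rendered_rgb - enhanced_image) ** 2
    # average the squared difference over foreground pixels only
    return (diff * silhouette_mask).sum() / silhouette_mask.sum().clamp(min=1.0)
```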
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SNIGDHA SINHA whose telephone number is (571)272-6618. The examiner can normally be reached Mon-Fri. 12pm-8pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jason Chan, can be reached at 571-272-3022. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SNIGDHA SINHA/Examiner, Art Unit 2619
/JASON CHAN/Supervisory Patent Examiner, Art Unit 2619