Prosecution Insights
Last updated: April 19, 2026
Application No. 18/780,747

MACHINE LEARNING-BASED GENERATION OF THREE-DIMENSIONAL REPRESENTATIONS

Non-Final Office Action — §§ 102, 103

Filed: Jul 23, 2024
Examiner: HAKALA, ALAN GREGORY
Art Unit: 2617
Tech Center: 2600 — Communications
Assignee: DELL PRODUCTS, L.P.
OA Round: 1 (Non-Final)
Grant Probability: Favorable
Estimated OA Rounds: 1-2
Estimated Time to Grant: 2y 9m

Examiner Intelligence

Career Allow Rate: 0% (0 granted / 0 resolved; -62.0% vs Tech Center average)
Interview Lift: +0.0% (minimal lift for resolved cases with interview)
Typical Timeline: 2y 9m average prosecution
Career History: 8 total applications across all art units, 8 currently pending

Statute-Specific Performance

§103: 57.1% (+17.1% vs TC avg)
§102: 42.9% (+2.9% vs TC avg)

Black line = Tech Center average estimate • Based on career data from 0 resolved cases

Office Action

§102 §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

35 U.S.C. 102(a)(1): the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-6, 10-16, 18, and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Poole (DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION).

Regarding claims 1, 15, and 18, Poole teaches:

An apparatus comprising: at least one processing device comprising a processor coupled to a memory; (Poole 3.2 Text-To-3D Synthesis: “3D scenes are optimized on a TPUv4 machine with 4 chips.” Note: The mentioned TPU machine is a type of processor designed for AI and ML training; the machine itself contains a number of connected processors and memory units.)

the at least one processing device being configured: to extract a set of features from a user prompt using a natural language processing model; (Poole 2.1 How Can We Sample in Parameter Space, Not Pixel Space?: “Figure 3: DreamFusion generates 3D objects from a natural language caption”; 3.2 Text-To-3D Synthesis: “Text prompts often describe canonical views of an object that are not good descriptions when sampling different views. We therefore found it beneficial to append view-dependent text to the provided input text based on the location of the randomly sampled camera.
For high elevation angles φcam > 60°, we append “overhead view.” For φcam ≤ 60°, we use a weighted combination of the text embeddings for appending “front view,” “side view,” or “back view” depending on the value of the azimuth angle θcam (see App. A.2 for details). We use the pretrained 64 × 64 base text-to-image model from Saharia et al. (2022). This model was trained on large-scale web-image-text data, and is conditioned on T5-XXL text embeddings.” Note: Poole teaches the use of a model which accepts a natural language caption/text as its input for processing, teaching the use of a natural language processing model for user prompts. Feature extraction from the user’s text is taught as well, and is referred to by the analogous term of text embedding. Text embedding, or word embedding, in natural language processing is used to make a computer-usable representation of words, allowing a model to understand each word’s meaning and context within the sentence. Poole gives a clear example of how a feature of the user’s desired image, namely the angle from which the image is captured, is obtained from the words’ text embeddings, allowing the model to extract the desired feature from the sentence and implement it.)

to initialize a three-dimensional scene reconstruction model utilizing a set of parameters determined based at least in part on the set of features extracted from the user prompt; ([Image: media_image1.png (Figure 3)] Poole 1 Introduction: “output 3D object or scene… Originally, NeRF was found to work well for “classic” 3D reconstruction tasks: many images of a scene are provided as input to a model, and a NeRF is optimized to recover the geometry of that specific scene, which allows for novel views of that scene from unobserved angles to be synthesized.” Note: In the Figure 3 caption above Poole describes how a machine learning model, a Neural Radiance Field (NeRF), generates 3D objects and scenes from a natural language caption, or prompt.
This model, which is described to excel at “3D reconstruction tasks,” is the claim’s “three-dimensional scene reconstruction model.” Poole teaches that the parameters, which include things like volumetric density and color, are initialized in the NeRF and are then tweaked to match the prompt. This can be seen in the Figure 3 image, where a prompt describing a peacock outputs a properly colored peacock in the red box labelled NeRF under the albedo (color) example. This shows Poole teaches a 3D scene reconstruction model that is initialized and has a set of parameters based at least in part on the features from the user prompt.)

to generate, utilizing the three-dimensional scene reconstruction model, a set of two-dimensional images of a given scene from two or more different viewpoint perspectives; (Poole 3 The DreamFusion Algorithm: “To synthesize a scene from text, we initialize a NeRF-like model with random weights, then repeatedly render views of that NeRF from random camera positions and angles”; 3.1 Neural Rendering of a 3D Model: “Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection through the pixel’s location in the image plane and out into the world. Sampled 3D points µ along each ray are then passed through an MLP, which produces 4 scalar values as output: a volumetric density τ (how opaque the scene geometry at that 3D coordinate is) and an RGB color c. These densities and colors are then alpha-composited from the back of the ray towards the camera, producing the final rendered RGB value for the pixel.” [Image: media_image2.png (Figure 4)] Note: In the first provided citation Poole teaches that images of the 3D model the NeRF generates are taken from random angles and positions. In the second citation Poole elaborates on the specific method for producing images from the NeRF: rays are cast from the camera, and color and density values are computed to render individual pixels in the image.
Figure 4 above is provided to clearly show that two or more images from unique camera angles can be produced, in this case four. As the NeRF is the claim’s 3D reconstruction model, Poole clearly teaches how a set of two or more images of a given scene can be generated from different perspectives.)

to apply an image diffusion model to the generated set of two-dimensional images to generate a refined set of two-dimensional images; to modify the three-dimensional scene reconstruction model based at least in part on the refined set of two-dimensional images; ([Image: media_image1.png (Figure 3)] Note: In Figure 3 above and its accompanying description, Poole teaches how 2D images of a 3D model generated by a “3D scene reconstruction model,” which we have established is the NeRF, are used for changing the NeRF. Figure 3 details that once the albedo (color) and density parameters are initialized and tuned by the NeRF to resemble the prompt, images of the NeRF are rendered from random camera positions/angles. To update the NeRF parameters, the renderings/images are input into Imagen, a known diffusion model, and 2D diffused images are produced, as seen below the blue Imagen box. The 3D object/scene is then modified based on the 2D diffused images, as an output of the diffused images is “backpropagated through the rendering process to update the NeRF MLP parameters.” This is seen in Figure 3, where the diffused images and their information on the right side have an arrow showing they are backpropagated to the NeRF so parameters can be updated. This clearly teaches the claim’s language that 2D images are provided to a diffusion model to produce diffused, or refined, images, and that the 3D reconstruction model is then modified based on the 2D refined images.)

and to utilize the modified three-dimensional scene reconstruction model to generate a three-dimensional representation of the given scene.
(Poole 3.2 Text-to-3D Synthesis: “For each text prompt, we train a randomly initialized NeRF from scratch. Each iteration of DreamFusion optimization performs the following: (1) randomly sample a camera and light, (2) render an image of the NeRF from that camera and shade with the light, (3) compute gradients of the SDS loss with respect to the NeRF parameters, (4) update the NeRF parameters using an optimizer … 4. Optimization. Our 3D scenes are optimized on a TPUv4 machine with 4 chips. Each chip renders a separate view and evaluates the diffusion … compute time is split evenly between rendering the NeRF and evaluating the diffusion model.” Note: The citation above gives an overview of Poole’s method, covering much of what has already been discussed. In the final step 4, which occurs after the image diffusion described previously, it is explicitly stated that the NeRF/3D scene reconstruction model renders the final scene. When it renders the 3D scene again, it does so with updated, optimized parameters, teaching the claim’s language that the now-modified 3D scene reconstruction model generates a 3D representation of the scene.)

Regarding claims 2, 16, and 19, Poole teaches:

The apparatus of claim 1 wherein the three-dimensional scene reconstruction model comprises a Neural Radiance Field (NeRF) model configured to take as input a three-dimensional position vector and a two-dimensional viewing direction and output a color and density at each of two or more points of the given scene. (Poole 3.1 Neural Rendering of a 3D Model: “NeRF is a technique for neural inverse rendering that consists of a volumetric raytracer and a multilayer perceptron (MLP). Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection through the pixel’s location in the image plane and out into the world.
Sampled 3D points µ along each ray are then passed through an MLP, which produces 4 scalar values as output: a volumetric density τ (how opaque the scene geometry at that 3D coordinate is) and an RGB color c. These densities and colors are then alpha-composited from the back of the ray towards the camera, producing the final rendered RGB value for the pixel.” Note: Poole teaches that rays are cast from a camera’s center of projection onto individual points on the 3D model, and the information describing each point is obtained from them and used to generate a pixel. The information obtained from each visible point is the color and density information, and as many points are hit with rays in the process of making pixels, multiple points’ colors and densities are obtained from the camera.)

Regarding claim 3, Poole teaches:

The apparatus of claim 2 wherein initializing the three-dimensional scene reconstruction model comprises initializing weights of a neural network that represents a neural radiance field. (Poole 3 The DreamFusion Algorithm: “To synthesize a scene from text, we initialize a NeRF-like model with random weights, then repeatedly render views of that NeRF from random camera positions and angles, using these renderings as the input to our score distillation loss function that wraps around Imagen. As we will demonstrate, simple gradient descent with this approach eventually results in a 3D model (parameterized as a NeRF) that resembles the text.” Note: Here Poole explicitly states that a neural radiance field model, or NeRF, is initialized with weights, which results in the construction of a 3D model or scene.)
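The NeRF described in the citations above is, at its core, an MLP mapping a 3D point to a density τ and a color c, initialized with random weights. The following is a minimal NumPy sketch under assumed layer sizes and activations; it is illustrative only and is not Poole's actual architecture (which also conditions on shading and uses positional encodings):

```python
import numpy as np

class TinyRadianceField:
    """Toy NeRF-style MLP: 3D point -> (density tau, RGB color). Illustrative only."""

    def __init__(self, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialized weights, per Poole's "NeRF-like model with random weights"
        self.w1 = rng.normal(0.0, 0.1, (3, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, 4))  # 4 outputs: tau, r, g, b
        self.b2 = np.zeros(4)

    def query(self, xyz):
        h = np.maximum(xyz @ self.w1 + self.b1, 0.0)   # ReLU hidden layer
        out = h @ self.w2 + self.b2
        tau = np.log1p(np.exp(out[..., 0]))            # softplus keeps density >= 0
        rgb = 1.0 / (1.0 + np.exp(-out[..., 1:]))      # sigmoid keeps colors in (0, 1)
        return tau, rgb

field = TinyRadianceField()
tau, rgb = field.query(np.array([0.1, -0.2, 0.5]))
```

Because the weights start random, the field initially encodes noise; optimization (gradient descent on the SDS loss, per the citations) is what makes the queried densities and colors resemble the prompt.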
Regarding claim 4, Poole teaches:

The apparatus of claim 1 wherein generating the set of two-dimensional images of the given scene from two or more different viewpoint perspectives comprises: selecting the two or more different viewpoint perspectives to capture a range of perspectives of the given scene; (Poole 3.2 Text-to-3D Synthesis and Figure 4, cited previously, detail that two or more viewpoint perspectives are used to capture images from different perspectives of a 3D scene.)

for each of the two or more different viewpoint perspectives, performing ray tracing through the given scene for a plurality of rays, where a color and density of each of the plurality of rays is computed using the three-dimensional scene reconstruction model; and synthesizing the set of two-dimensional images of the given scene using the plurality of rays. (Poole 3.1 Neural Rendering of a 3D Model: “NeRF is a technique for neural inverse rendering that consists of a volumetric raytracer and a multilayer perceptron (MLP). Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection through the pixel’s location in the image plane and out into the world. Sampled 3D points µ along each ray are then passed through an MLP, which produces 4 scalar values as output: a volumetric density τ (how opaque the scene geometry at that 3D coordinate is) and an RGB color c. These densities and colors are then alpha-composited from the back of the ray towards the camera, producing the final rendered RGB value for the pixel.” Note: Poole here explicitly states that color and density are computed for the 3D points along each ray. Poole states the rays are cast from a camera’s center, and the color and density information is used to produce each of the pixels in the image; as it is already taught that multiple cameras are used from multiple angles, this teaches that the set of 2D images Poole produces from different perspectives is generated using a plurality of rays.)
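The alpha compositing Poole describes (densities and colors composited from the back of the ray toward the camera to produce a pixel) follows the standard volume-rendering quadrature. A minimal sketch, assuming uniform sample spacing along one ray:

```python
import numpy as np

def composite_ray(taus, colors, delta):
    """Alpha-composite sampled (density, color) pairs along one ray into a pixel RGB.

    taus:   (N,) volumetric densities at the ray samples
    colors: (N, 3) RGB colors at the ray samples
    delta:  assumed uniform spacing between consecutive samples
    """
    alphas = 1.0 - np.exp(-taus * delta)            # per-segment opacity
    # transmittance: fraction of light surviving to reach each sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                        # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)  # final rendered pixel RGB

# Three samples: empty space, a dense green point, empty space behind it.
taus = np.array([0.0, 5.0, 0.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
pixel = composite_ray(taus, colors, delta=1.0)
# The opaque middle sample dominates, so the pixel comes out green.
```

Repeating this for every pixel of every sampled camera yields the set of 2D images discussed in the claim; the densities and colors come from querying the NeRF MLP at the sample points.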
Regarding claim 5, Poole teaches:

The apparatus of claim 1 wherein the image diffusion model comprises a denoising diffusion probabilistic model (DDPM). (Poole A.1 Pseudocode For Ancestral Sampling And Our Score Distillation Sampling. [Image: media_image3.png (pseudocode)] Note: Here Poole teaches that the output of a DDPM is used as an input to the diffusion model, which has already been established to perform image diffusion.)

Regarding claim 6, Poole teaches:

The apparatus of claim 1 wherein applying the image diffusion model to the generated set of two-dimensional images comprises applying a noise-reduction process to the generated set of two-dimensional images by: inputting the generated set of two-dimensional images to the image diffusion model; predicting noise added at each timestep based at least in part on an output of the image diffusion model; and removing the predicted noise from the generated set of two-dimensional images to generate the refined set of two-dimensional images. (Poole 2 Diffusion Models and Score Distillation Sampling: “diffusion models that learn Eφ(zt;t, y) conditioned on text embeddings … We use Eˆ and pˆ throughout to denote the guided version of the noise prediction and marginal distribution”; 3.2 Text-to-3D Synthesis: “Each iteration of DreamFusion optimization performs the following: (1) randomly sample a camera and light, (2) render an image of the NeRF from that camera and shade with the light, (3) compute gradients of the SDS loss with respect to the NeRF parameters”; 2.1 How Can We Sample In Parameter Space, Not Pixel Space?: “DreamFusion diffuses the rendering and reconstructs it with a (frozen) conditional Imagen model to predict the injected noise Eˆφ(zt|y;t). This contains structure that should improve fidelity, but is high variance.
Subtracting the injected noise produces a low variance update direction stopgrad[E^φ − E] that is backpropagated through the rendering process to update the NeRF MLP parameters … We name our sampling approach Score Distillation Sampling (SDS) as it is related to distillation, but uses score functions instead of densities. We refer to it as a sampler because the noise in the variational family q(zt| . . .) disappears as t → 0”; 2 Diffusion Models and Score Distillation Sampling: “The forward process is typically a Gaussian distribution that transitions from the previous less noisy latent at timestep t to a noisier latent at timestep t + 1.” Note: Here Poole describes that its “renderings,” the 2D images it renders, are diffused with a diffusion model. As part of the specific steps of the diffusion model, the noise that will be created/injected in the image is predicted. Once the injected noise is predicted, it can then be subtracted from the diffused rendering to produce a more refined 2D image. We know that noise is found for each individual timestep, as the noise obtained is noted to increase with each timestep.)

Regarding claim 10, Poole teaches:

The apparatus of claim 1 wherein the user prompt comprises a natural language description of a design of a product, and wherein utilizing the modified three-dimensional scene reconstruction model to generate the three-dimensional representation of the given scene comprises generating a three-dimensional representation of a prototype of the product. (Poole 1.
Introduction: “This work showed that pretrained 2D image-text models may be used for 3D synthesis, though 3D objects produced by this approach tend to lack realism and accuracy … By combining SDS with a NeRF variant tailored to this 3D generation task, DreamFusion generates high-fidelity coherent 3D objects and scenes for a diverse set of user-provided text prompts.” Note: The claim states that its model is capable of generating a 3D representation of a prototype of a product described in a text prompt. As this simply defines a particular type of 3D content to be generated from a text prompt, and Poole’s model is capable of generating high-quality coherent 3D objects and scenes from a diverse set of prompts, the use of Poole to generate prototypes for a design is taught.)

Regarding claim 11, Poole teaches:

The apparatus of claim 1 wherein the user prompt comprises a natural language description of a virtual showroom of one or more products, and wherein utilizing the modified three-dimensional scene reconstruction model to generate the three-dimensional representation of the given scene comprises generating a three-dimensional representation of the one or more products for the virtual showroom. (Poole 1. Introduction states that high-quality coherent objects and scenes can be created from diverse user prompts; as prototypes for described designs in a “showroom” environment are simply an example of objects in a scene, Poole teaches this claim as well.)

Regarding claim 12, Poole teaches:

The apparatus of claim 1 wherein the user prompt comprises a natural language description specifying one or more customizations of a product, and wherein utilizing the modified three-dimensional scene reconstruction model to generate the three-dimensional representation of the given scene comprises generating a three-dimensional representation of a customized version of the product based at least in part on the specified one or more customizations.
([Image: media_image2.png (Figure 4)] Note: In the example above, Poole details just a few customizations to 3D models that can be made using natural language descriptions. This teaches the claim’s description of customizing a product with one or more customizations, as Poole shows a single model (a squirrel) can have multiple objects added separately or concurrently (leather jacket and motorcycle, leather jacket and cello), and can also make more abstract customizations such as changing the material of the 3D object to appear as wood. As the products of the claim are simply 3D models, Poole teaches the ability to apply one or more customizations to the model of a product.)

Regarding claim 13, Poole teaches:

The apparatus of claim 1 wherein the user prompt comprises a natural language description of one or more features of a product, and wherein utilizing the modified three-dimensional scene reconstruction model to generate the three-dimensional representation of the given scene comprises generating a three-dimensional representation of a training simulation for the one or more features of the product. (Figure 4 and Poole 1. Introduction, cited above. Note: As generating a 3D representation of a training simulation is ambiguous, we refer to the specification’s statement that “an enterprise may subscribe to or otherwise utilize the development platform 110 for generation of three-dimensional (3D) models for use in digital content creation (e.g., in product development, marketing and sales, customization and personalization, training and simulation, enterprise solutions, etc.)”. Referring to this definition, making a “3D representation of a training simulation” consists of generating one or more 3D objects that have relevance for training and simulation.
As Poole teaches the ability to make coherent, high-quality models, regardless of what purpose they may serve for the user, the ability to make 3D objects that could be used for a training simulation is taught.)

Regarding claim 14, Poole teaches:

The apparatus of claim 1 wherein the user prompt comprises a natural language description of a configuration of an information technology infrastructure environment, and utilizing the modified three-dimensional scene reconstruction model to generate the three-dimensional representation of the given scene comprises generating a three-dimensional representation of the configuration of the information technology infrastructure environment. (Note: As the configuration of an information technology infrastructure environment is said to be describable with a natural language prompt, and Poole teaches the ability to generate high-quality coherent 3D objects from natural language prompts, the ability to generate a particular type of 3D content that can be described with a natural language prompt, in this case an information technology infrastructure environment, is taught.)

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claims 7, 8, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Poole (DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION) in view of Wang (Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation).
Regarding claims 7, 17, and 20, Poole teaches:

The apparatus of claim 1 wherein modifying the three-dimensional scene reconstruction model based at least in part on the refined set of two-dimensional images comprises: estimating probability density for the refined set of two-dimensional images; and adjusting the set of parameters of the three-dimensional scene reconstruction model based at least in part on the estimated probability densities. (Poole 2.1 How Can We Sample In Parameter Space, Not Pixel Space?: “we instead want to create 3D models that look like good images when rendered from random angles. Such models can be specified as a differentiable image parameterization (DIP, Mordvintsev et al., 2018), where a differentiable generator g transforms parameters θ to create an image x = g(θ) … For 3D, we let θ be parameters of a 3D volume and g a volumetric renderer. To learn these parameters, we require a loss function that can be applied to diffusion models … Our approach leverages the structure of diffusion models to enable tractable sampling via optimization — a loss function that, when minimized, yields a sample. We optimize over parameters θ such that x = g(θ) looks like a sample from the frozen diffusion model. To perform this optimization, we need a differentiable loss function where plausible images have low loss, and implausible images have high loss, … We first investigated reusing the diffusion training loss (Eqn.
1) to find modes of the learned conditional density p(x|y) … In practice, we found that this loss function did not produce realistic samples even when using an identity DIP where x = θ … We found that omitting the U-Net Jacobian term leads to an effective gradient for optimizing DIPs with diffusion models: [Image: media_image4.png (SDS gradient equation)] … While this gradient for learning DIPs with diffusion models may appear ad hoc, in Appendix A.4 we show that it is the gradient of a weighted probability density distillation loss (van den Oord et al., 2018) using the learned score functions from the diffusion model”; A.2 NeRF Details and Training Hyperparameters: “We use a Gaussian PDF to parameterize the added density … Representative settings are λτ = 5 for the scale parameter and στ = 0.2 for the width parameter. This density is added to the τ output of the NeRF MLP.” Note: Poole teaches that a differentiable image parameterization (DIP) is used to transform parameters of the 3D rendering model. The DIP does this by altering the parameters of the model, as described by the claim, and performs these alterations by backpropagating a gradient to the NeRF, which is also described in Figure 4. This gradient, the information that specifies how the parameters will be updated, is obtained from a loss function. Poole teaches that a key component of the function is the probability density, which is obtained using a Gaussian PDF (probability density function). This teaches that Poole uses the probability density in optimizing its 3D scene reconstruction model’s parameters, and that probability density for the diffused/refined images is found.)

While Poole teaches that the probability densities which aid in optimizing the parameters of the 3D reconstruction model are learned from diffusion, which in this context is the diffusion performed to refine the 2D images, it does not explicitly state that the probability density of the individual pixels in the image is taken.
Doing so is taught in Wang, which teaches modifying the three-dimensional scene reconstruction model based at least in part on the refined set of two-dimensional images comprises: estimating probability densities for pixels of the refined set of two-dimensional images; and adjusting the set of parameters of the three-dimensional scene reconstruction model based at least in part on the estimated probability densities. (Wang 4. Score Jacobian Chaining for 3D Generation: “Let θ denote the parameters of a 3D asset, e.g., voxel grid of (RGB, τ) as in Sec. 4.2. Our goal is to model and sample from the distribution p(θ) to generate a 3D scene. In our setting, only a pretrained 2D diffusion model on images p(x) is given… To relate the 2D and 3D distributions p(x) and p(θ), we assume that the probability density of 3D asset θ is proportional to the expected probability densities of its multiview 2D image renderings xπ over camera poses π, i.e., pσ(θ) ∝ Eπ[pσ(xπ(θ))] (6), up to a normalization constant Z = ∫ Eπ[pσ(xπ(θ))] dθ. That is, a 3D asset θ is as likely as its 2D renderings xπ. Next, we establish a lower bound, log p̃σ(θ), on the distribution in Eq. (6) using Jensen’s inequality: log pσ(θ) = log Eπ[pσ(xπ)] − log Z (7) ≥ Eπ[log pσ(xπ)] − log Z ≜ log p̃σ(θ) (8). Recall that the score is the gradient of log probability density of data. … We will next discuss how to compute the 2D score in practice using a pretrained diffusion model.” 4.2 Inverse Rendering on Voxel Radiance Field: “We represent a 3D asset θ as a voxel radiance field … The parameters θ consist of a density voxel grid V(density) ∈ R^(1×Nx×Ny×Nz) and a voxel grid of appearance features V(app) ∈ R^(c×Nx×Ny×Nz). Conventionally the appearance features are simply the RGB colors and c = 3 … Image rendering is performed independently along a camera ray through each pixel.
We cut a camera light ray into equally distanced segments of length d, and at the spatial location corresponding to the beginning of the i-th segment we sample an (RGBi, τi) tuple from the color and density grids using trilinear interpolation.” Note: Here Wang teaches that it tweaks the parameters of a 3D asset using 2D images of the desired asset taken from multiple views. To do this, it associates the parameters of the 3D asset that define how it is rendered with parameters of the 2D images; the translation from 2D to 3D information is made possible by the probability densities of the 2D images. In the process of using the 2D images’ probability densities to define the 3D asset’s parameters, Wang teaches that a diffusion model is applied to the 2D images to obtain a score. A listed example of such 3D parameters is the pair of density and color voxel grids. These two grids are what are directly used to render pixels. That is to say, the core content that defines a pixel (color and density) has probability densities found for it; in other words, probability densities are found for each pixel.)

It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Poole with Wang such that the probability densities found from the 2D diffused images and used to alter the 3D reconstruction model are probability densities for pixels. There are several reasons that would motivate one to do so; one is to obtain a higher level of detail by considering the probability density at the pixel scale. As probability density is already used to better adjust 3D parameters in Poole, it would be obvious to use Wang’s finer, pixel-level probability density to adjust the parameters more precisely by considering more detail.
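The probability-density argument Wang makes in the passage quoted above (Eqs. (6)-(8)) can be restated in clean notation: the density of a 3D asset is defined through the expected density of its multiview 2D renderings, and Jensen's inequality (log is concave) yields a tractable lower bound:

```latex
% Wang Eqs. (6)-(8): 3D density as the expected density of 2D renderings,
% lower-bounded via Jensen's inequality.
p_\sigma(\theta) \;\propto\; \mathbb{E}_{\pi}\!\left[\, p_\sigma\!\big(x_\pi(\theta)\big) \right],
\qquad
Z \;=\; \int \mathbb{E}_{\pi}\!\left[\, p_\sigma\!\big(x_\pi(\theta)\big) \right] \mathrm{d}\theta

\log p_\sigma(\theta)
\;=\; \log \mathbb{E}_{\pi}\!\left[ p_\sigma(x_\pi) \right] \;-\; \log Z
\;\ge\; \mathbb{E}_{\pi}\!\left[ \log p_\sigma(x_\pi) \right] \;-\; \log Z
\;\triangleq\; \log \tilde{p}_\sigma(\theta)
```

Taking the gradient of the lower bound with respect to θ and applying the chain rule through the renderer is what produces the averaged per-view 2D scores that are used to adjust the voxel parameters.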
Regarding claim 8, Poole teaches: The apparatus of claim 7. Wang teaches: wherein estimating the probability densities for the pixels of the refined set of two-dimensional images utilizes a density estimation model that takes the refined set of two-dimensional images and the user prompt as input and computes probability density likelihoods of the pixels of the refined set of two-dimensional images. (Wang Abstract: “A diffusion model learns to predict a vector field of gradients. We propose to apply chain rule on the learned gradients, and back-propagate the score of a diffusion model through the Jacobian of a differentiable renderer, which we instantiate to be a voxel radiance field. This setup aggregates 2D scores at multiple camera viewpoints into a 3D score, and repurposes a pretrained 2D model for 3D data generation.” 1. Introduction: “The key insight is to interpret diffusion models as learned predictors of a gradient field, often referred to as the score function of the data log-likelihood.” Wang 4.2: “Qualitative results of text-prompted generation of 3D models with SJC, purely from the pretrained Stable Diffusion (2D) image model.” Wang 4. Score Jacobian Chaining for 3D Generation: “Recall that the score is the gradient of log probability density of data.” Note: In the previous claim we established that Wang 4 (Score Jacobian Chaining) teaches that probability density is used with a diffusion model to output a score, and that the score is used to determine the 3D parameters. In the last citation provided here we see that the score is referred to as the “gradient of log probability density of data.” This is taught to be analogous to “probability density likelihoods” in 1. Introduction, where the score function is stated to be a “score function of the data log-likelihood”; since the score function accepts probability density, the output score could be called probability density likelihoods.
Another input passed into the diffusion model is the user prompt, which is used to generate the 2D diffused images; from those images a 3D model is generated, teaching that the user prompt is an input to the model.) It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Poole with Wang such that the probability densities found from 2D diffused images and used to alter the 3D reconstruction model are used to find probability density likelihoods. There are several reasons that would motivate one to do so; probability density likelihoods are simply the values of a probability density function evaluated at a specific variable or point. Evaluating the probability density data at this more specific level allows more to be learned from it and potentially better decisions to be made, as more information is considered.

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Poole (DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION) in view of Wang (Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation) and further in view of Redford (US 20230222336 A1). Regarding claim 9, Poole teaches: The apparatus of claim 7 wherein adjusting the set of parameters of the three-dimensional scene reconstruction model comprises utilizing a gradient descent algorithm that utilizes a loss function comprising the estimated probability densities for the refined set of two-dimensional images. (3. The DreamFusion Algorithm “As we will demonstrate, simple gradient descent with this approach eventually results in a 3D model (parameterized as a NeRF) that resembles the text.” 3.2 Text-To-3D Synthesis “3. Diffusion loss with view-dependent conditioning … Given the rendered image and sampled timestep t, we sample noise ε and compute the gradient of the NeRF parameters according to Eqn. 3. 4.
Optimization … Compute time is split evenly between rendering the NeRF and evaluating the diffusion model” Note: Poole teaches that gradient descent is the method used for creating the initial 3D model and for determining the 3D model parameters. Gradient descent is used to further edit the parameters based on the refined 2D images, as it is taught that a gradient of the NeRF parameters is again computed for the images produced. It is also taught that the gradient which alters the parameters arises in the diffusion-loss step, as that step leverages a loss function to find the parameters. This loss function is described in detail in the previous claim, where probability density is taught to be a key part of the loss function. This teaches a gradient descent algorithm which utilizes a loss function comprising the estimated probability densities for the refined set of 2D images to adjust the parameters of the three-dimensional scene.) While Poole finds probability densities, it does not specify that probability densities for individual pixels of refined 2D images are found; doing so is taught in Wang, which teaches estimated probability densities for the pixels of the refined set of two-dimensional images. How Wang teaches this is discussed in the previous claim. Neither Poole nor Wang teaches that the loss function utilized by a gradient descent algorithm comprises a negative log-likelihood of the estimated probability densities. While Wang mentions log-likelihood, it does not mention that it is negative.
Doing so is taught in Redford, which teaches a gradient descent algorithm that utilizes a loss function comprising a negative log-likelihood of the estimated probability densities (Redford ¶448 “The neural network A00 is trained with stochastic gradient descent (maximum likelihood—using negative log pdf of random variables as the position and extent variables” ¶423 “There exist various neural networks architectures that can be trained to predict a conditional distribution of the form p(e|t), given a sufficient set of example {e,t} pairs. For a simple Gaussian distribution (univariate or multivariate), a log normal or (negative) log PDF loss function A08 can be used.” ¶11 “a normal distribution and Lognormal is a log-normal distribution, chosen as it only has support on the positive reals. This defines a normally distributed probability density centred on the camera coordinates of a point in 3D space.” Note: Redford teaches that its gradient descent algorithm uses a likelihood obtained from a negative log probability density function (PDF). This teaches the claim language of a gradient descent algorithm that uses a loss function comprising the negative log-likelihood of probability densities.) It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Poole with Wang, and further with Redford, such that the loss function described uses a probability density likelihood, and that likelihood is obtained from a negative log probability density function. There are several reasons that would motivate one to do so; negative log-likelihood functions are useful because they make maximum-likelihood optimization easy: to maximize the likelihood, one need only minimize the negative log-likelihood function. Having a straightforward method to output a likelihood is just one of the benefits that would motivate someone to combine Poole and Wang with Redford.
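The convenience cited in the combination rationale (maximizing likelihood by minimizing negative log-likelihood with gradient descent) can be illustrated with a toy example. This is a minimal sketch assuming a univariate Gaussian with known variance; the function names are illustrative and this is not the network described in Poole, Wang, or Redford.

```python
import numpy as np

def gaussian_nll(mu, sigma, x):
    """Negative log-likelihood of samples x under N(mu, sigma^2).

    Minimizing this quantity is equivalent to maximizing the likelihood,
    which is the convenience the combination rationale refers to.
    """
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mu)**2 / (2 * sigma**2))

def fit_mu(x, sigma=1.0, lr=0.5, steps=100):
    """Gradient descent on the negative log-likelihood with respect to mu."""
    mu = 0.0
    for _ in range(steps):
        # d/dmu of the NLL, averaged over samples for a stable step size.
        grad = np.mean(mu - x) / sigma**2
        mu -= lr * grad
    return mu
```

For Gaussian data the minimizer of the NLL with respect to the mean is the sample mean, so the descent above converges to the maximum-likelihood estimate without any separate maximization machinery.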
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALAN GREGORY HAKALA whose telephone number is (571)272-7863. The examiner can normally be reached 8:00am-5:00pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Poon, can be reached at (571) 270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KING Y POON/Supervisory Patent Examiner, Art Unit 2617

Prosecution Timeline

Jul 23, 2024
Application Filed
Feb 18, 2026
Non-Final Rejection — §102, §103 (current)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: Favorable
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 0 resolved cases by this examiner. Grant probability derived from career allow rate.
