Last updated: May 29, 2026
Application No. 18/232,279
HIGH RESOLUTION TEXT-TO-3D CONTENT CREATION

Non-Final OA §103
Filed
Aug 09, 2023
Priority
Nov 16, 2022 — provisional 63/425,932
Examiner
SALVUCCI, MATTHEW D
Art Unit
2613
Tech Center
2600 — Communications
Assignee
Nvidia Corporation
OA Round
4 (Non-Final)
Interview Optional

— +28.3% interview lift. Examiner has a relatively high allowance rate (72%); +28.3% interview lift. A written response may suffice.
Based on 487 resolved cases, 2023–2026
Examiner Intelligence

SALVUCCI, MATTHEW D View full profile →
Grants 72% — above average
Career Allowance Rate
350 granted / 487 resolved
+9.9% vs TC avg
Strong +28% interview lift
Without
With
+28.3%
Interview Lift
resolved cases with interview
Typical timeline
2y 11m
Avg Prosecution
17 currently pending
Career history
504
Total Applications
across all art units
Statute-Specific Performance

§101
1.2%
-38.8% vs TC avg
§103
88.4%
+48.4% vs TC avg
§102
7.8%
-32.2% vs TC avg
§112
2.1%
-37.9% vs TC avg
Black line = Tech Center average estimate • Based on career data from 487 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Status of Claims
Applicant's amendments filed on 14 April 2026 have been entered.  Claims 1, 27, and 28 have been amended.  No claims have been canceled.  Claims 29-32 have been added.  Claims 1-12, 14, and 16-32 are still pending in this application, with claims 1, 27, and 28 being independent.

Response to Arguments
Applicant's arguments filed 14 April 2026 have been fully considered but they are not persuasive. 
Applicant argues, with respect to claim 1, that “Langoju involves a training a generalized machine learning model to enhance the resolution of any given 3D image, including update internal parameters of the generalized model, which differs from applicant's claimed "backpropagating gradients into the first 3D polygon mesh" (emphasis added), as claimed. 
Additionally, Langoju discloses computing a loss between an enhanced 3D image generated by the neural network from a training 3D image and the training 3D image's ground-truth enhanced version, which differs from applicant's claimed "loss defined on images rendered from the first 3D polygon mesh at the second resolution" (emphasis added), as claimed. To this end, the combination of references relied on by the Examiner do not teach or suggest applicant's claimed technique to optimize a first 3D polygon mesh rendered from a scene model having a first resolution, to form a 3D mesh model comprised of at least one second 3D polygon mesh and having a second resolution that is greater than the first resolution, where such optimizing is performed by backpropagating gradients into the first 3D polygon mesh via a loss defined on images rendered from the first 3D polygon mesh at the second resolution, as claimed”.
Examiner respectfully disagrees with the interpretation of the Langoju reference and points to the cited portion of Langoju, which discloses: “the preprocessing component can instead apply image denoising via any suitable machine learning technique (e.g., a deep learning neural network can be trained via backpropagation to infer a denoised version of an inputted image). Similarly, in various instances, the preprocessing component can apply image resolution enhancement via any suitable analytical and/or modality-based image resolution enhancement technique (e.g., converting thick slices/projections to thinner slices/projections, implementing non-wobble-to-wobble acquisition, implementing no-comb-to-comb acquisition, and/or implementing computer simulation). In various other instances, the preprocessing component can apply image resolution enhancement via any suitable machine learning technique (e.g., a deep learning neural network can be trained via backpropagation to infer a resolution-enhanced version of an inputted image). In any case, the preprocessing component can electronically generate a denoised and/or resolution enhanced 3D image” (Paragraph [0033]). Examiner notes that the stated “apply image resolution enhancement” via backpropagation clearly corresponds to the claimed “optimizing is performed by backpropagating gradients into the first 3D polygon mesh via a loss defined on images rendered from the first 3D polygon mesh al the second resolution,” wherein the image resolution enhancement corresponds to the claimed second resolution. Examiner further notes that Langoju alone is not cited as teaching all of the limitations of the claim and that instead it is Poole, in view of Zafeiriou, and further in view of Langoju that teaches the claimed limitations. Examiner thus maintains the rejection of claim 1 for the above reasons. 
For the remaining claims, Applicant argues for their allowance for similar reasons as above or for dependence to one of the independent claims. It follows that all remaining rejections are maintained. 

Allowable Subject Matter
Claims 12, 20, and 29-32 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8, 14, 16, 19, 21, 22, 27, and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Poole et al. (NPL: DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION), hereinafter Poole, in view of Zafeiriou et al. (US Pub. 2023/0077187), hereinafter Zafeiriou, and further in view of Langoju et al. (US Pub. 2023/0177747), hereinafter Langoju.
Regarding claim 1, Poole discloses a method comprising: at a device: accessing a scene model generated from an input text prompt describing a 3D content, the scene model having a first resolution (Section 3: we will construct our specific algorithm that allows us to generate 3D assets from text. For the diffusion model, we use the Imagen model from Saharia et al. (2022), which has been trained to synthesize images from text. We only use the 64 × 64 base model (not the super-resolution cascade for generating higher-resolution images), and use this pretrained model as-is with no modifications. To synthesize a scene from text, we initialize a NeRF-like model with random weights, then repeatedly render views of that NeRF from random camera positions and angles, using these renderings as the input to our score distillation loss function that wraps around Imagen. As we will demonstrate, simple gradient descent with this approach eventually results in a 3D model (parameterized as a NeRF) that resembles the text; Section 3.2: Given a pretrained text-to-image diffusion model, a differentiable image parameterization in the form of a NeRF, and a loss function whose minima are good samples, we have all the components needed for text-to-3D synthesis using no 3D data. For each text prompt, we train a randomly initialized NeRF from scratch); extracting a first three-dimensional (3D) mesh from the scene model having the first resolution (Section 3: we will construct our specific algorithm that allows us to generate 3D assets from text. For the diffusion model, we use the Imagen model from Saharia et al. (2022), which has been trained to synthesize images from text. We only use the 64 × 64 base model (not the super-resolution cascade for generating higher-resolution images), and use this pretrained model as-is with no modifications. To synthesize a scene from text, we initialize a NeRF-like model with random weights, then repeatedly render views of that NeRF from random camera positions and angles, using these renderings as the input to our score distillation loss function that wraps around Imagen. As we will demonstrate, simple gradient descent with this approach eventually results in a 3D model (parameterized as a NeRF) that resembles the text); optimizing the first 3D polygon mesh, using a diffusion model (Fig. 3; Section 2: predicts the noise content of the latent zt…The predicted noise can be related to a predicted score function for the smoothed density; Section 3.1: NeRF is a technique for neural inverse rendering that consists of a volumetric raytracer and a multilayer perceptron (MLP). Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection through the pixel’s location in the image plane and out into the world. Sampled 3D points µ along each ray are then passed through an MLP, which produces 4 5 scalar values as output: a volumetric density τ (how opaque the scene geometry at that 3D coordinate is) and an RGB color c. These densities and colors are then alpha-composited from the back of the ray towards the camera, producing the final rendered RGB value for the pixel; Section 4: Our 3D scenes are optimized on a TPUv4 machine with 4 chips. Each chip renders a separate view and evaluates the diffusion U-Net with per-device batch size of 1. We optimize for 15,000 iterations which takes around 1.5 hours. Compute time is split evenly between rendering the NeRF and evaluating the diffusion model).
	While Poole teaches extracting a first three-dimensional (3D) mesh from the scene model (Section 3: we will construct our specific algorithm that allows us to generate 3D assets from text. For the diffusion model, we use the Imagen model from Saharia et al. (2022), which has been trained to synthesize images from text. We only use the 64 × 64 base model (not the super-resolution cascade for generating higher-resolution images), and use this pretrained model as-is with no modifications. To synthesize a scene from text, we initialize a NeRF-like model with random weights, then repeatedly render views of that NeRF from random camera positions and angles, using these renderings as the input to our score distillation loss function that wraps around Imagen. As we will demonstrate, simple gradient descent with this approach eventually results in a 3D model (parameterized as a NeRF) that resembles the text), Poole does not explicitly disclose that the meshes are polygon meshes, extracting a first three-dimensional (3D) polygon mesh from the scene model; and to form a 3D mesh model comprised of at least one second 3D polygon mesh and having a second resolution that is greater than the first resolution, wherein the optimizing is performed by backpropagating gradients into the first 3D polygon mesh via a loss defined on images rendered from the first 3D polygon mesh at the second resolution.
	However, Zafeiriou teaches image synthesis using diffusion models (Abstract), further comprising extracting a first three-dimensional (3D) polygon mesh from the scene model (Fig. 3; Paragraphs [0049]-[0050]: FIG. 3 shows a schematic overview of a further example method 300 of generating a three-dimensional facial rendering from a two dimensional image. The method may be implemented on a computer. The method 300 begins as described in FIG. 1: a 2D image 302 comprising a face is input into one or more fitting neural networks 304, which generate a low resolution 2D texture map 306 of the textures of the face and a 3D model 308 of the geometry of the face. A super resolution model 310 is applied to the low resolution 2D texture map 306 in order to upscale the low resolution 2D texture map 306 into a high resolution 2D texture map 312. A 2D diffuse albedo map 316 is generated from the high resolution 2D texture map 112 using an image-to-image translation neural network 314…3D model 308 of the geometry of the face can be used to generate one or more 2D normal maps 324, 330 of the face. A 2D normal map in object space 324 may be generated directly from the 3D model 308 of the geometry of the face. A high-pass filter may be applied to the 2D normal map in object space 324 to generate a 2D normal map in tangent space 324. Normals may be calculated per-vertex of the 3D model as the perpendicular vectors to two vectors of a ‘face’ (e.g. triangle) of the 3D mesh. The normals may be stored in image format using a UV map parameterisation. Interpolation may be used to create a smooth normal map); and to form a 3D mesh model comprised of at least one second 3D polygon mesh and having a second resolution that is greater than the first resolution (Fig. 1; Fig. 3; Paragraph [0025]: FIG. 1 shows a schematic overview of an example method 100 of generating a three-dimensional facial rendering from a two dimensional image. The method may be implemented on a computer. A 2D image 102 comprising a face is input into one or more fitting neural networks 104, which generate a low resolution 2D texture map 106 of the textures of the face and a 3D model 108 of the geometry of the face. A super resolution model 110 is applied to the low resolution 2D texture map 106 in order to upscale the low resolution 2D texture map 106 into a high resolution 2D texture map 112. A 2D diffuse albedo map 116 is generated from the high resolution 2D texture map 112 using an image-to-image translation neural network 114 (also referred to herein as a “de-lighting image-to-image translation network”). The 2D diffuse albedo map 116 is used to render the 3D model 108 of the geometry of the face to generate a high resolution 3D model 118 of the face in the input image 102; Paragraph [0081]: the high resolution data may be split into patches (for example, of size 512×512 pixels) in order to augment the number of data sample and avoid overfitting. For example, using a stride of a given size (e.g. 128 pixels), partly overlapping patches may be derived by passing through each original 2D map (e.g. UV map) horizontally as well as vertically). Zafeiriou teaches that this allows for realistic modelling (Paragraph [0024]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Poole with the features of above as taught by Zafeiriou so as to allow for realistic modelling as presented by Zafeiriou.
	Further, Langoju teaches image synthesis using diffusion models (Paragraphs [0033]-[0034]), further comprising wherein the optimizing is performed by backpropagating gradients into the first 3D polygon mesh via a loss defined on images rendered from the first 3D polygon mesh at the second resolution (Paragraph [0033]: the preprocessing component can apply image denoising via any suitable analytical denoising technique (e.g., linear smoothing filters, nonlinear smoothing filters, anisotropic diffusion, non-local means, wavelet transforms, and/or statistical methods). In various other aspects, the preprocessing component can instead apply image denoising via any suitable machine learning technique (e.g., a deep learning neural network can be trained via backpropagation to infer a denoised version of an inputted image). Similarly, in various instances, the preprocessing component can apply image resolution enhancement via any suitable analytical and/or modality-based image resolution enhancement technique (e.g., converting thick slices/projections to thinner slices/projections, implementing non-wobble-to-wobble acquisition, implementing no-comb-to-comb acquisition, and/or implementing computer simulation). In various other instances, the preprocessing component can apply image resolution enhancement via any suitable machine learning technique (e.g., a deep learning neural network can be trained via backpropagation to infer a resolution-enhanced version of an inputted image). In any case, the preprocessing component can electronically generate a denoised and/or resolution enhanced 3D image). Langoju teaches that this will allow for improved visual quality (Paragraphs [0033]-[0040]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Poole, in view of Zafeiriou with the features of above as taught by Langoju so as to allow for improved visual quality as presented by Langoju.
Regarding claim 2, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1, Poole discloses wherein the text prompt is input by a user (Section 1/Fig 1: DreamFusion generates high-fidelity coherent 3D objects and scenes for a diverse set of user-provided text prompts).
Regarding claim 3, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 2, Poole discloses wherein the scene model is further generated based on a reference image input by the user together with the text prompt (Section 3.1: In the traditional NeRF use-case we are given a dataset of input images and associated camera positions and the NeRF MLP is trained from random initialization using a mean squared error loss function between each pixel’s rendered color and the corresponding ground-truth color from the input image. This yields a 3D model (parameterized by the weights of the MLP) that can produce realistic renderings from previously-unseen views. Our model is built upon mip-NeRF 360 (Barron et al., 2022), which is an improved version of NeRF that reduces aliasing. Though mip-NeRF 360 was originally designed for 3D reconstruction from images, its improvements are also helpful for our generative text-to-3D task).
Regarding claim 4, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1, Poole discloses wherein the scene model is a neural field representation (Section 1: By combining SDS with a NeRF variant tailored to this 3D generation task, DreamFusion generates high-fidelity coherent 3D objects and scenes for a diverse set of user-provided text prompts; Fig. 3: scene is represented by a Neural Radiance Field that is randomly initialized and trained from scratch for each caption. Our NeRF parameterizes volumetric density and albedo (color) with an MLP. We render the NeRF from a random camera, using normals computed from gradients of the density to shade the scene with a random lighting direction. Shading reveals geometric details that are ambiguous from a single viewpoint. To compute parameter updates, DreamFusion diffuses the rendering and reconstructs it with a (frozen) conditional Imagen model to predict the injected … that is backpropagated through the rendering process to update the NeRF MLP parameters).
Regarding claim 5, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1, Poole discloses wherein the scene model is generated by another diffusion model that back-propagates gradients into the scene model via a loss defined on rendered images at the first resolution (Fig. 3; Section 2.1: o understand the difficulties of this approach, consider the gradient…In practice, the U-Net Jacobian term is expensive to compute (requires backpropagating through the diffusion model U-Net), and poorly conditioned for small noise levels as it is trained to approximate the scaled Hessian of the marginal density).
Regarding claim 6, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 5, Poole discloses wherein the other diffusion model is a pre-trained text-to-image diffusion model (Fig. 1: DreamFusion uses a pretrained text-to-image diffusion model to generate realistic 3D models from text prompts; Section 3: We only use the 64 × 64 base model (not the super-resolution cascade for generating higher-resolution images), and use this pretrained model as-is with no modifications. To synthesize a scene from text, we initialize a NeRF-like model with random weights, then repeatedly render views of that NeRF from random camera positions and angles, using these renderings as the input to our score distillation loss function that wraps around Imagen. As we will demonstrate, simple gradient descent with this approach eventually results in a 3D model (parameterized as a NeRF) that resembles the text. See Fig. 3 for an overview of our approach).
Regarding claim 7, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1, Poole discloses wherein the scene model is a coordinate-based multi-layer perceptron (MLP) (Fig. 3; Section 2.1: we instead want to create 3D models that look like good images when rendered from random angles. Such models can be specified as a differentiable image parameterization (DIP, Mordvintsev et al., 2018), where a differentiable generator g transforms parameters θ to create an image x = g(θ). DIPs allow us to express constraints, optimize in more compact spaces (e.g. arbitrary resolution coordinate-based MLPs), or leverage more powerful optimization algorithms for traversing pixel space. For 3D, we let θ be parameters of a 3D volume and g a volumetric renderer. To learn these parameters, we require a loss function that can be applied to diffusion models; Section 3.1: NeRF is a technique for neural inverse rendering that consists of a volumetric raytracer and a multilayer perceptron (MLP). Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection through the pixel’s location in the image plane and out into the world. Sampled 3D points µ along each ray are then passed through an MLP, which produces 4 5 scalar values as output: a volumetric density τ (how opaque the scene geometry at that 3D coordinate is) and an RGB color c. These densities and colors are then alpha-composited from the back of the ray towards the camera, producing the final rendered RGB value for the pixel).
Regarding claim 8, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 7, Poole discloses wherein the coordinate-based MLP predicts albedo and density (Section 3.1: NeRF is a technique for neural inverse rendering that consists of a volumetric raytracer and a multilayer perceptron (MLP). Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection through the pixel’s location in the image plane and out into the world. Sampled 3D points µ along each ray are then passed through an MLP, which produces 4 5 scalar values as output: a volumetric density τ (how opaque the scene geometry at that 3D coordinate is) and an RGB color c. These densities and colors are then alpha-composited from the back of the ray towards the camera, producing the final rendered RGB value for the pixel).
Regarding claim 14, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1, Poole discloses wherein the diffusion model is a latent diffusion model (Section 2: Diffusion models are latent-variable generative models that learn to gradually transform a sample from a tractable noise distribution towards a data distribution (Sohl-Dickstein et al., 2015; Ho et al., 2020). Diffusion models consist of a forward process q that slowly removes structure from data x by adding noise, and a reverse process or generative model p that slowly adds structure starting from noise zt. The forward process is typically a Gaussian distribution that transitions from the previous less noisy latent at timestep t to a noisier latent at timestep t + 1. We can compute the marginal distribution of the latent variables at timestep t given an initial datapoint x by integrating out intermediate timesteps… Diffusion model training can thereby be viewed as either learning a latent-variable model (Sohl-Dickstein et al., 2015; Ho et al., 2020), or learning a sequence of score functions corresponding to noisier versions of the data).
Regarding claim 16, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1, Poole discloses wherein the diffusion model processes a latent code to predict the 3D mesh model, and wherein a resolution of the latent code is smaller than the second resolution (Fig. 3; Section 2: Diffusion models are latent-variable generative models that learn to gradually transform a sample from a tractable noise distribution towards a data distribution (Sohl-Dickstein et al., 2015; Ho et al., 2020). Diffusion models consist of a forward process q that slowly removes structure from data x by adding noise, and a reverse process or generative model p that slowly adds structure starting from noise zt. The forward process is typically a Gaussian distribution that transitions from the previous less noisy latent at timestep t to a noisier latent at timestep t + 1. We can compute the marginal distribution of the latent variables at timestep t given an initial datapoint x by integrating out intermediate timesteps… Diffusion model training can thereby be viewed as either learning a latent-variable model (Sohl-Dickstein et al., 2015; Ho et al., 2020), or learning a sequence of score functions corresponding to noisier versions of the data…predicts the noise content of the latent zt…The predicted noise can be related to a predicted score function for the smoothed density; Section 3.1: NeRF is a technique for neural inverse rendering that consists of a volumetric raytracer and a multilayer perceptron (MLP). Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection through the pixel’s location in the image plane and out into the world. Sampled 3D points µ along each ray are then passed through an MLP, which produces 4 5 scalar values as output: a volumetric density τ (how opaque the scene geometry at that 3D coordinate is) and an RGB color c. These densities and colors are then alpha-composited from the back of the ray towards the camera, producing the final rendered RGB value for the pixel; Section 4: Our 3D scenes are optimized on a TPUv4 machine with 4 chips. Each chip renders a separate view and evaluates the diffusion U-Net with per-device batch size of 1. We optimize for 15,000 iterations which takes around 1.5 hours. Compute time is split evenly between rendering the NeRF and evaluating the diffusion model; Section 5: DreamFusion uses the 64 × 64 Imagen model, and as such our 3D synthesized models tend to lack fine details. Using a higher-resolution diffusion model and a bigger NeRF would presumably address this, but synthesis would become impractically slow. Hopefully improvements in the efficiency of diffusion and neural rendering will enable tractable 3D synthesis at high resolution in the future). 
Regarding claim 19, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1, Zafeiriou discloses wherein the 3D mesh model is textured (Fig. 3; Paragraph [0029]: 2d texture map 106 may be any 2D map that can represent 3D textures. An example of such a map is a UV map. A UV map is a 2D representation of a 3D surface or mesh. Points in 3D space (for example described by (x, y, z) co-ordinates) are mapped onto a 2D space (described by (u, v) co-ordinates). A UV map may be formed by unwrapping a 3D mesh in a 3D space onto the u-v plane in the 2D UV space, and storing parameters associated with the 3D surface at each point in UV space. A texture UV map 110 may be formed by storing colour values of the vertices of a 3D surface/mesh in the 3D space at corresponding points in the UV space; Paragraph [0049]: FIG. 3 shows a schematic overview of a further example method 300 of generating a three-dimensional facial rendering from a two dimensional image. The method may be implemented on a computer. The method 300 begins as described in FIG. 1: a 2D image 302 comprising a face is input into one or more fitting neural networks 304, which generate a low resolution 2D texture map 306 of the textures of the face and a 3D model 308 of the geometry of the face).
Regarding claim 21, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1, Poole discloses wherein the first resolution is 64×64 (Section 3: t. We only use the 64 × 64 base model (not the super-resolution cascade for generating higher-resolution images), and use this pretrained model as-is with no modifications).
Regarding claim 22, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1, Zafeiriou discloses wherein the second resolution is 512×512 (Paragraph [0081]: the high resolution data may be split into patches (for example, of size 512×512 pixels) in order to augment the number of data sample and avoid overfitting. For example, using a stride of a given size (e.g. 128 pixels), partly overlapping patches may be derived by passing through each original 2D map (e.g. UV map) horizontally as well as vertically).
Regarding claim 27, the limitations of this claim substantially correspond to the limitations of claim 1; thus they are rejected on similar grounds.
Regarding claim 28, the limitations of this claim substantially correspond to the limitations of claim 1; thus they are rejected on similar grounds.

Claims 9-11 are rejected under 35 U.S.C. 103 as being unpatentable over Poole, in view of Zafeiriou, in view of Langoju, and further in view of Wang et al. (US Pub. 2024/0135483), hereinafter Wang.
Regarding claim 9, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1. 
Poole, in view of Zafeiriou, and further in view of Langoju does not explicitly disclose wherein the scene model is an Instant-neural graphics primitive (Instant-NGP).
However, Wang teaches 3D scene generation using NeRFs (Abstract; Paragraph [0246]), further comprising wherein the scene model is an Instant-neural graphics primitive (Instant-NGP) (Paragraph [0254]: Naïve incremental transfer can also be applied to other neural graphics machine learning models and techniques. For example, the instant neural graphics primitives (NGP) technique can be modified to use with naïve incremental transfer to enable real-time per-frame model generation. Instant NGP uses an input encoding that permits the use of a smaller MLP than NeRF. The use of the smaller network significantly reduces the number of floating point and memory access operations. The smaller MLP is augmented with a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. Instant NGP technique enables model training to be performed in seconds, rather than minutes, as with the original NeRF technique. As shown by chart 2320, adapting instant NGP to use naïve incremental transfer enables the generation of novel frame output that stabilizes to approximately 32 db PSNR approximately 0.5 seconds into the video. This technique allows the streaming of multi-view video with nearly immediate high quality output). Wang teaches that this will allow for nearly immediate high quality output (Paragraph [0254]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Poole, in view of Zafeiriou, and further in view of Langoju with the features of above as taught by Wang so as to allow for fast high quality output as presented by Wang.
Regarding claim 10, Poole, in view of Zafeiriou, in view of Langoju, and further in view of Wang teaches the method of claim 9, Poole discloses wherein the Instant-NGP uses a hash grid encoding, and includes a first single-layer neural network that predicts albedo and density and a second single-layer neural network that predicts surface normal (Fig. 3; Section 3.1: Calculating the final shaded output color for the 3D point requires a normal vector indicating the local orientation of the object’s geometry. This surface normal vector can be computed by normalizing the negative gradient of density τ with respect to the 3D coordinate; Section A.2: Our NeRF MLP consists of 5 ResNet blocks (He et al., 2016) with 128 hidden units, Swish/SiLU activation (Hendrycks & Gimpel, 2016), and layer normalization (Ba et al., 2016) between blocks. We use an exp activation to produce density τ and a sigmoid activation to produce RGB albedo ρ).
Regarding claim 11, Poole, in view of Zafeiriou, in view of Langoju, and further in view of Wang teaches the method of claim 10, Poole discloses wherein a spatial data structure is maintained that encodes scene occupancy and utilizes empty space skipping (Section 3.1: mip-NeRF 360 model we build upon contains many other details that we omit for brevity. We include a regularization penalty on the opacity along each ray similar to Jain et al. (2022) to prevent unneccesarily filling in of empty space. To prevent pathologies in the density field where normal vectors face backwards away from the camera we use a modified version of the orientation loss proposed in Ref-NeRF (Verbin et al., 2022). This penalty is important when including textureless shading as the density field will otherwise attempt to orient normals away from the camera so that the shading becomes darker. Full details on these regularizers and additional hyperparameters of NeRF are presented in the Appendix A.2; Section A.2: n NeRF, each 3D input point is mapped to a higher dimensional space using a sinusoidal positional encoding function…where v is the direction of the ray (the viewing direction). We also apply a small regularization to the accumulated alpha value (opacity) along each ray…This discourages optimization from unnecessarily filling in empty space, and improves foreground/background separation).

Claims 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Poole, in view of Zafeiriou, in view of Zafeiriou, in view of Langoju, and further in view of Shen et al. (NPL: Deep Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Synthesis), hereinafter Shen.
Regarding claim 17, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1.
Poole, in view of Zafeiriou, and further in view of Langoju does not explicitly disclose wherein the 3D mesh model is a deformable tetrahedral grid.
However, Shen teaches generation of 3D models (Abstract), further comprising wherein the 3D mesh model is a deformable tetrahedral grid (Section 3.1: We represent a shape using a sign distance field (SDF) encoded with a deformable tetrahedral grid, adopted from DefTet [18, 20]. The grid fully tetrahedralizes a unit cube, where each cell in the volume is a tetahedron with 4 vertices and faces). Shen teaches that this will allow for vertices that can deform to represent the geometry of the shape more efficiently (Section 3.1). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Poole, in view of Zafeiriou, and further in view of Langoju with the features of above as taught by Shen so as to represent the geometry of the shape more efficiently as presented by Shen.
Regarding claim 18, Poole, in view of Zafeiriou, in view of Langoju, and further in view of Shen teaches the method of claim 17, Shen discloses wherein the deformable tetrahedral grid includes vertices in a grid, wherein each vertex contains a signed distance field value and a deformation of the vertex from its initial canonical coordinate (Fig. 1; Section 2: DefTet [18] represents a mesh with a deformable tetrahedral grid where the grid vertex coordinates and the occupancy values are learned. However, similar to voxel-based methods, the computational costf increases cubically with the grid resolution; Section 3.1: We represent a shape using a sign distance field (SDF) encoded with a deformable tetrahedral grid, adopted from DefTet [18, 20]. The grid fully tetrahedralizes a unit cube, where each cell in the volume is a tetahedron with 4 vertices and faces. The key aspect of this representation is that the grid vertices can deform to represent the geometry of the shape more efficiently. While the original DefTet encoded occupancy defined on each tetrahedron, we here encode signed distance values defined on the vertices of the grid and represent the underlying surface implicitly).

Claims 23-26  are rejected under 35 U.S.C. 103 as being unpatentable over Poole, in view of Zafeiriou, in view of Langoju, and further in view of Tambi et al. (US Pub. 2024/0095275), hereinafter Tambi.
Regarding claim 23, Poole, in view of Zafeiriou, and further in view of Langoju teaches the method of claim 1, Poole discloses further comprising, at the device: presenting the 3D content, using the 3D mesh model (Fig. 3; Section 2: predicts the noise content of the latent zt…The predicted noise can be related to a predicted score function for the smoothed density; Section 3.1: NeRF is a technique for neural inverse rendering that consists of a volumetric raytracer and a multilayer perceptron (MLP). Rendering an image from a NeRF is done by casting a ray for each pixel from a camera’s center of projection through the pixel’s location in the image plane and out into the world. Sampled 3D points µ along each ray are then passed through an MLP, which produces 4 5 scalar values as output: a volumetric density τ (how opaque the scene geometry at that 3D coordinate is) and an RGB color c. These densities and colors are then alpha-composited from the back of the ray towards the camera, producing the final rendered RGB value for the pixel; Section 4: Our 3D scenes are optimized on a TPUv4 machine with 4 chips. Each chip renders a separate view and evaluates the diffusion U-Net with per-device batch size of 1. We optimize for 15,000 iterations which takes around 1.5 hours. Compute time is split evenly between rendering the NeRF and evaluating the diffusion model).
Poole, in view of Zafeiriou, and further in view of Langoju does not explicitly disclose presenting the content on a display device.
However, Tambi teaches generation of imagery from text prompts (Paragraph [0068]), further comprising presenting the content on a display device (Fig. 1; Paragraph [0028]: Query processing apparatus 110 provides a set of images in response to the original query. The images are associated with the modified query (e.g., the images depict content of the modified query such as objects and their relations). Query processing apparatus 110 may retrieve the images from database 120 based on the modified query. The images are displayed to user 100, e.g., via cloud 115 and user device 105). Tambi teaches that this will allow for the images to be displayed to the user (Paragraph [0028]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Poole, in view of Zafeiriou, and further in view of Langoju with the features of above as taught by Tambi so as to allow for display to a user as presented by Tambi.
Regarding claim 24, Poole, in view of Zafeiriou, in view of Langoju, and further in view of Tambi teaches the method of claim 23, Tambi discloses further comprising, at the device: receiving a modification to the input text prompt (Paragraph [0068]: a user changes the input prompt to “abstract mosaic pattern background illustration green colors” and chooses Fill in the Blanks Model. The input to masked language model 315 is “abstract mosaic <mask> pattern background illustration green colors”. Masked language model 315 generates the following prompts and corresponding images (e.g., via image generation model 305). The modified queries include “abstract mosaic seamless pattern background illustration green colors”, “abstract mosaic square pattern background illustration green colors”, “abstract mosaic triangle pattern background illustration green colors”, “abstract mosaic tiles pattern background illustration green colors, “abstract mosaic tile pattern background illustration green colors”, “abstract mosaic vector pattern background illustration green colors”, “abstract mosaic circle pattern background illustration green colors”, “abstract mosaic design pattern background illustration green colors”, “abstract mosaic geometric pattern background illustration green colors”, “abstract mosaic dot pattern background illustration green colors”, etc); and optimizing the 3D mesh model based on the modification to the input text prompt (Paragraph [0068]: a user changes the input prompt to “abstract mosaic pattern background illustration green colors” and chooses Fill in the Blanks Model. The input to masked language model 315 is “abstract mosaic <mask> pattern background illustration green colors”. Masked language model 315 generates the following prompts and corresponding images (e.g., via image generation model 305); Paragraph [0108]: the causal language model generates variations of user prompt. The machine learning model as shown in FIG. 2 provides an expanded set of images (i.e., more diverse results) based on the set of expanded queries. The causal language model to expand prompts for broadening user intent can be used jointly with a masked language models to edit prompts for narrowing user intent).
Regarding claim 25, Poole, in view of Zafeiriou, in view of Langoju, and further in view of Tambi teaches the method of claim 24, Tambi discloses wherein the modification is to a texture (Paragraph [0097]: an input prompt is “grunge background with light texture”. At broaden intent phase 900, the machine learning model generates the following expanded queries 905 and corresponding first images 910. Expanded queries 905 include “grunge background with light texture and blue stains”, “grunge background with light texture and black spots”, “grunge background with light texture and old white painted wood”, “grunge background with light texture and red blood”, “grunge background with light texture and dark salmon color”, etc. For example, additional phrase 907 is “and old white painted wood” in the expanded query “grunge background with light texture and old white painted wood).
Regarding claim 26, Poole, in view of Zafeiriou, in view of Langoju, and further in view of Tambi teaches the method of claim 24, Tambi discloses wherein the modification is to a geometry (Paragraph [0090]: an original query or input prompt is “abstract red and white background”. Machine learning model 600 generates a set of expanded queries 605. Expanded queries 605 include “abstract red and white background with place for text”, “abstract red and white background with square shapes”, “abstract red and white background with halftone effect in center”, “abstract red and white background with diagonal lines”, “abstract red and white background with squares”, etc. At least one image 610 is provided for each of the expanded queries 605. In the above examples, machine learning model 600 inserts an additional phrase at the end of the original query).

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW D SALVUCCI whose telephone number is (571)270-5748. The examiner can normally be reached M-F: 7:30-4:00PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, XIAO WU can be reached at (571) 272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MATTHEW SALVUCCI/Primary Examiner, Art Unit 2613
Read full office action
Prosecution Timeline

Show 1 earlier event
Jul 01, 2025
Non-Final Rejection mailed — §103
Sep 24, 2025
Response Filed
Oct 06, 2025
Final Rejection mailed — §103
Dec 16, 2025
Request for Continued Examination
Jan 15, 2026
Response after Non-Final Action
Feb 05, 2026
Non-Final Rejection mailed — §103
Apr 14, 2026
Response Filed
May 15, 2026
Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/644,322
Patent 12639890
SAMPLING LIGHT DIRECTIONS ON NEURAL MATERIALS
2y 1m to grant Granted May 26, 2026
18/466,620
Patent 12632946
IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD, AND IMAGE PROCESSING PROGRAM
2y 8m to grant Granted May 19, 2026
18/493,424
Patent 12633073
EFFICIENT AVATAR CREATION WITH MESH PENETRATION AVOIDANCE
2y 6m to grant Granted May 19, 2026
18/769,972
Patent 12626473
DYNAMIC VIRTUAL OBJECTS
1y 10m to grant Granted May 12, 2026
18/320,600
Patent 12620160
GRAPHICS LIBRARY EXTENSIONS
2y 11m to grant Granted May 05, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

4-5
Expected OA Rounds
72%
Grant Probability
99%
With Interview (+28.3%)
2y 11m (~1m remaining)
Median Time to Grant
High
PTA Risk
Based on 487 resolved cases by this examiner. Grant probability derived from career allowance rate.