DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 6, 10, and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over:
Wu et al. (Wu, Shuang, et al. "Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer." arXiv:2405.14832v1 [cs.CV] (May 23, 2024): 1-15) in view of
Lu et al. (Lu, Zeyu, “Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation”, arXiv:2304.11829 [cs.CV] (April 25, 2023)) and in further view of
Shi et al. (US 2025/0078392 A1)
Regarding claim 10, Wu discloses:
A system (Wu, Abstract; p. 7, Section 4.1, Implementation details: training performed on GPUs, including NVIDIA A100), comprising:
Receiving an input prompt describing a 3-dimensional (3D) object (Wu, p. 7, Section 4.1, ¶1: “Our D3D-VAE takes as input 81,920 point clouds with normal uniformly sampled from the 3D model, along with a learnable latent token of a resolution r = 32 and a channel dimension de = 768.”; p. 9, “Text-to-3D” section, “Our Direct3D can produce 3D assets from text prompts by incorporating text-to-image models like Hunyuan-DiT” and use a generated image as input to the disclosed model)
Generating one or more levels of latent features based on the input prompt using a trained latent diffusion model; (Wu, p. 5, Fig. 2, “We utilize transformer to encode point cloud sampled from 3D model, along with a set of learnable tokens, into an explicit triplane latent space.”, and ¶1: Subsequently, multiple self-attention layers are employed to enhance the representation of these tokens, ultimately yielding the latent representation z ∈ R (3×r×r)×dz , where dz represents the channel dimensional of z; p. 6, Section 3.2, “After training the D3D-VAE, we have access to a continuous and compact latent space, upon which we train the latent diffusion model”; p. 7, “Training. Following LDM [42], our 3D latent diffusion transformer model predicts the noise ϵ of the noisy latent representation zt at time t, conditioned on image C.”. p. 7, Section 4.1, D3D-VAE section, “The encoder network consists of 1 cross-attention layer and 8 self-attention layers, with each attention layer comprising 12 heads of a dimension 64. The channel dimension of the latent representation is dz = 16.”; p. 9, Conclusion, ¶1: “Leveraging a hybrid architecture, our proposed D3D-VAE efficiently encode 3D shapes into a compact latent space, enhancing the fidelity of the generated shapes. Our image-conditioned 3D diffusion transformer (D3D-DiT) further improves the generation quality by integrating image information at both pixel and semantic levels, ensuring high consistency between generated 3D shapes and conditional images”)
Determining a 3D shape representation by decoding the one or more levels of latent features using a trained autoencoder (Wu, p. 7, Section 4.1, ¶1: “The decoder network comprises of 5 ResNet [9] blocks to upsample the latent representation into triplane feature maps with resolution of 256 × 256 and channel dimension of 32”; p. 5, Figure 2 discloses use of the Direct3D variational auto-encoder, including a convolutional decoder fed with noisy data);
and
generating a 3D shape for the 3D object based on the 3D shape representation (Wu, p. 9, Fig. 6: “We employ SyncMVD [27] to generate texture for the meshes produced by our Direct3D”; p. 9, “Text-to-3D” section discloses “We render videos of meshes generated by each method rotating 360 degrees”)
The only element not explicitly disclosed by Wu is that the decoding of the latent features is performed by a trained hierarchical autoencoder.
Lu discloses:
Determining a shape representation by decoding the one or more levels of latent features using a trained hierarchical autoencoder (Lu, p. 2: “we design the Hierarchical Diffusion Autoencoders (HDAE) that exploits the coarse-to-fine and low-level-to-high-level feature hierarchy of the semantic encoder and the diffusion-based decoder for comprehensive and hierarchical representations”; p. 3, Section 3.2, ¶2, “Design space of latent representations and network architectures”: “we propose hierarchical diffusion autoencoders which explore the hierarchical semantic latent space of diffusion autoencoders”; p. 4, 3rd bullet point: “HDAE(U) hierarchical diffusion autoencoders with U-Net semantic encoder. HDAE(U) also adopts the hierarchical latent space design but leverages a semantic encoder with the U-Net structure.”)
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, by incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, using known electronic interfacing and programming techniques. The use of a hierarchical diffusion autoencoder improves details and visual features that are omitted or insufficiently captured by other techniques (see Lu, p. 3, Section 3.2, ¶1 and p. 4, “Advantages and applications of HDAE”, including “while the latent space of DAE lacks low-level details, the hierarchical latent space of HDAE encodes comprehensive fine-grained-to-abstract and low-level-to-high-level features, leading to more accurate and detail-preserving image reconstruction and manipulation results”).
Wu further does not explicitly discuss the computer architecture of the processing device as recited by the claim. Although this is merely application of a well-known and conventional computer architecture, Shi discloses:
A system, (Shi, Fig. 6 and ¶77) comprising: a memory component storing computer-executable instructions; a processing device coupled to the memory component, the processing device configured to execute the computer-executable instructions to perform operations (Shi, Fig. 6 and ¶77: computing device including system memory 604 may include an operating system 605 and one or more program modules 606 suitable for running image processor application 620; ¶¶79-80 further disclose processing device)
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, by using the computer architecture as provided by Shi, using known electronic interfacing and programming techniques. The modification results in an improved system by using common computer components for cheaper and easier implementation.
Regarding claim 1, the system of claim 10 performs the method of claim 1 and as such claim 1 is rejected based on the same rationale as claim 10 set forth above.
Regarding claim 17, the operations perform the method of claim 1 and as such claim 17 is rejected based on the same rationale as claim 1 set forth above. Further regarding claim 17, Shi discloses:
A non-transitory computer-readable medium, storing executable instructions, which when executed by a processing device, cause the processing device to perform operations (Shi, Fig. 6 and ¶77: computing device including system memory 604 may include an operating system 605 and one or more program modules 606 suitable for running image processor application 620; ¶¶79-80 further disclose processing device)
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, by using the computer architecture as provided by Shi, using known electronic interfacing and programming techniques. The modification results in an improved system by using common computer components for cheaper and easier implementation.
Regarding claim 6, Wu discloses use of a Variational Auto-Encoder (VAE) (Wu, p. 4, section 3.1, and Fig. 2 on page 5)
Lu further discloses:
wherein the trained hierarchical autoencoder comprises a hierarchical vector quantized Variational Autoencoders (VQ-VAE) network (Lu, p. 5, Table 1 discloses model including VQ-VAE2).
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, by incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, using known electronic interfacing and programming techniques. The use of a hierarchical diffusion autoencoder improves details and visual features that are omitted or insufficiently captured by other techniques (see Lu, p. 3, Section 3.2, ¶1 and p. 4, “Advantages and applications of HDAE”, including “while the latent space of DAE lacks low-level details, the hierarchical latent space of HDAE encodes comprehensive fine-grained-to-abstract and low-level-to-high-level features, leading to more accurate and detail-preserving image reconstruction and manipulation results”).
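For context on the vector-quantization operation underlying the VQ-VAE2 model cited above, the following is a minimal illustrative sketch (not drawn from Lu's implementation; all names and values are hypothetical):

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (L2 distance),
    the core operation of a VQ-VAE quantization layer."""
    # Squared distances between each of the N latents and the K codebook vectors.
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(axis=1)        # index of the nearest code for each latent
    return codebook[idx], idx     # quantized vectors and their code indices

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K=8 codes of dimension 4
latents = codebook[[2, 5]] + 0.01    # latents slightly perturbed from codes 2 and 5
quantized, idx = vector_quantize(latents, codebook)
# idx recovers the nearest codes: [2, 5]
```

A hierarchical VQ-VAE applies this quantization at multiple resolutions of the latent space rather than once.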
Claim(s) 2-5, 8-9, 11-13, 15-16, 18 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over:
Wu et al. (Wu, Shuang, et al. "Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer." arXiv:2405.14832v1 [cs.CV] (May 23, 2024): 1-15) in view of
Lu et al. (Lu, Zeyu, “Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation”, arXiv:2304.11829 [cs.CV] (April 25, 2023)) and in further view of
Shi et al. (US 2025/0078392 A1) and in further view of
Shim et al. (Jaehyeok Shim, Changwoo Kang, Kyungdon Joo, “Diffusion-Based Signed Distance Fields for 3D Shape Generation”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 20887-20897).
Regarding claim 2, the limitations included from claim 1 are rejected based on the same rationale as claim 1 set forth above. Further regarding claim 2, Wu further discloses:
Receiving a shape occupancy map along with the input prompt (Wu, p. 2, last paragraph: “we employ a transformer model to encode high-resolution point clouds into an explicit triplane latent, which has been widely used in 3D reconstruction methods [2] for its efficiency. While the latent triplane is intentionally set with a low resolution, we introduce a convolutional neural network to upsample the latent resolution and decode it into a high-resolution 3D occupancy grid”; p. 5, Fig. 2: “Subsequently, a CNN-based decoder is employed to upsample these latent representations into high-resolution triplane feature maps. The occupancy values of queried points can be decoded through a geometric mapping network. (b) Then we train the image conditioned latent diffusion transformer in the 3D latent space obtained by VAE. Pixel-level information and semantic-level information from images are extracted using DINO-v2 and CLIP, respectively, and then injected into each DiT block.”; p. 5, “Semi-continuous surface sampling” section: “We employ a Multi-Layer Perceptron (MLP) as the geometric mapping network to predict the occupancy of queried points via features interpolated from the triplane. The MLP contains multiple linear layers with ReLU activation. Typical occupancy is represented by a discrete binary value of 0 and 1 to indicate whether a point is inside an object”; p. 9, Text-to-3D section discloses producing input from a text prompt used for generating meshes – note that the claim does not specify the ordering of the input but merely that at some time input is provided and used to generate the latent codes, i.e., at any time in the pipeline)
Furthermore, Shim discloses:
Receiving a low-resolution shape occupancy map along with the input prompt (Shim, Abstract: generates a low-resolution SDF of 3D shapes; using the estimated low-resolution SDF as a condition, the second-stage diffusion model performs super-resolution to generate a high-resolution SDF; p. 20891, Section 3.3: We aim to train a conditional diffusion-based super-resolution model that generates realistic high-resolution SDF voxel x0^HR with low-resolution condition x0^LR)
Generating one or more levels of latent features based on the low-resolution shape occupancy map and the input prompt using the trained latent diffusion model (Shim, p. 20891, Section 3.3: We aim to train a conditional diffusion-based super-resolution model that generates realistic high-resolution SDF voxel x0^HR with low-resolution condition x0^LR; p. 20892, “Implementation details” ¶1: For point cloud input, we use the point cloud encoder from the convolutional occupancy network [53] as our point cloud encoder and combine the resulting features using AdaGN; specifically, we employ the PointNet layer [55] from the encoder of the convolutional occupancy network [53], which transforms the point cloud feature into a voxel and gives this feature to each layer of SDF-Diffusion along with the AdaGN; see also p. 20891, Algorithms 1 and 2, disclosing a 2-stage algorithm based on a low-resolution condition to obtain a super-resolution shape in latent space)
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, and the computer architecture as provided by Shi, by further using the diffusion modeling technique as provided by Shim, using known electronic interfacing and programming techniques. The modification merely substitutes one known type of denoising diffusion model for another, yielding the predictable result of utilizing a denoising model in place of a different type of denoising model for image diffusion modeling. The modification also provides an improved diffusion model by better preserving finer details and structural information while suppressing various other types of noise for improved image generation results.
Regarding claim 11, the limitations included from claim 10 are rejected based on the same rationale as claim 10 set forth above. Further regarding claim 11, Wu further discloses:
Receiving a shape occupancy map along with the input prompt (Wu, p. 2, last paragraph: “we employ a transformer model to encode high-resolution point clouds into an explicit triplane latent, which has been widely used in 3D reconstruction methods [2] for its efficiency. While the latent triplane is intentionally set with a low resolution, we introduce a convolutional neural network to upsample the latent resolution and decode it into a high-resolution 3D occupancy grid”; p. 5, Fig. 2: “Subsequently, a CNN-based decoder is employed to upsample these latent representations into high-resolution triplane feature maps. The occupancy values of queried points can be decoded through a geometric mapping network. (b) Then we train the image conditioned latent diffusion transformer in the 3D latent space obtained by VAE. Pixel-level information and semantic-level information from images are extracted using DINO-v2 and CLIP, respectively, and then injected into each DiT block.”; p. 5, “Semi-continuous surface sampling” section: “We employ a Multi-Layer Perceptron (MLP) as the geometric mapping network to predict the occupancy of queried points via features interpolated from the triplane. The MLP contains multiple linear layers with ReLU activation. Typical occupancy is represented by a discrete binary value of 0 and 1 to indicate whether a point is inside an object”; p. 9, Text-to-3D section discloses producing input from a text prompt used for generating meshes – note that the claim does not specify the ordering of the input but merely that at some time input is provided and used to generate the latent codes, i.e., at any time in the pipeline)
Furthermore, Shim discloses:
Receiving a low-resolution shape occupancy map along with the input prompt (Shim, Abstract: generates a low-resolution SDF of 3D shapes; using the estimated low-resolution SDF as a condition, the second-stage diffusion model performs super-resolution to generate a high-resolution SDF; p. 20891, Section 3.3: We aim to train a conditional diffusion-based super-resolution model that generates realistic high-resolution SDF voxel x0^HR with low-resolution condition x0^LR)
Determining an initial set of latent codes for the 3D shape to be generated based on the low-resolution shape occupancy map and the input prompt (Shim, p. 20891, Section 3.3: We aim to train a conditional diffusion-based super-resolution model that generates realistic high-resolution SDF voxel x0^HR with low-resolution condition x0^LR; p. 20892, “Implementation details” ¶1: For point cloud input, we use the point cloud encoder from the convolutional occupancy network [53] as our point cloud encoder and combine the resulting features using AdaGN; specifically, we employ the PointNet layer [55] from the encoder of the convolutional occupancy network [53], which transforms the point cloud feature into a voxel and gives this feature to each layer of SDF-Diffusion along with the AdaGN; see also p. 20891, Algorithms 1 and 2, disclosing a 2-stage algorithm based on a low-resolution condition to obtain a super-resolution shape in latent space)
Adding Gaussian noises to the initial set of latent codes to obtain a noised set of latent codes (Shim, pp. 20889-20890, Section 3.1, Background on Denoising Diffusion Models, ¶¶1-2: given a data distribution, a forward (or diffusion) process gradually adds Gaussian noise to the clean data; p. 20890, Section 3.2, Diffusion SDF Generation, ¶¶1-2: a diffusion-based generative model for a low-resolution 3D shape represented as an SDF voxel, where, given a clean data sample, the sample is corrupted in the forward process with voxel noise sampled from the standard Gaussian distribution); and
Denoising the noised set of latent codes using the trained latent diffusion model for a predetermined number of time steps to obtain the one or more levels of latent features (Shim, p. 20890, left column, “Reverse Process” section: “The reverse process (or generative process) aims to generate samples by gradually inverting infinitesimal noise”, defined by a Markov chain with the learned Gaussian transitions, using a time-step-dependent variance; the “Training objective” section simplifies the Gaussian distribution by modifying it with a neural network predicting the noise to obtain objective function (11); Section 3.2, “Diffusion Based SDF voxel generation”, discloses using the MSE objective function (11) to predict the noise-free data, with a new SDF voxel generated by progressively predicting less noisy samples as a function of t; p. 20890, last section, “Network architecture”, discloses “timestep t” using a multi-layered network)
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, and the computer architecture as provided by Shi, by further using the diffusion modeling technique as provided by Shim, using known electronic interfacing and programming techniques. The modification merely substitutes one known type of denoising diffusion model for another, yielding the predictable result of utilizing a denoising model in place of a different type of denoising model for image diffusion modeling. The modification also provides an improved diffusion model by better preserving finer details and structural information while suppressing various other types of noise for improved image generation results.
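For context, the forward (noising) process of the denoising diffusion models cited from Shim, Section 3.1, can be sketched as follows (an illustrative numpy approximation with a hypothetical linear beta schedule; this is not code from any cited reference):

```python
import numpy as np

# Linear beta schedule and cumulative products alpha_bar_t = prod_s (1 - beta_s).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def forward_noise(z0, t):
    """DDPM forward process: z_t = sqrt(a_bar_t)*z0 + sqrt(1 - a_bar_t)*eps."""
    eps = np.random.randn(*z0.shape)          # standard Gaussian noise
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

z0 = np.zeros((16, 32, 32))       # a toy "clean" latent code
zt, eps = forward_noise(z0, T - 1)
# At the final timestep alpha_bar is near zero, so z_t is nearly pure noise;
# the trained model learns to predict eps and invert this process step by step.
```

The reverse (denoising) process runs this chain backwards for the predetermined number of time steps, each step removing the noise predicted by the network.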
Regarding claim 3, the limitations included from claim 2 are rejected based on the same rationale as claim 2 set forth above. Further regarding claim 3, the system of claim 11 performs substantially the same method steps as further recited by claim 3 and as such claim 3 is further rejected based on the same rationale as claim 11 set forth above.
Regarding claim 18, the limitations included from claim 17 are rejected based on the same rationale as claim 17 set forth above. Further regarding claim 18, the additional operations are the same additional operations performed by the system of claim 11 and as such claim 18 is further rejected based on the same rationale as claim 11 set forth above.
Regarding claim 12, the limitations included from claim 10 are rejected based on the same rationale as claim 10 set forth above. Further regarding claim 12, Shim discloses:
Wherein the one or more levels of latent features comprises a top level of latent features and a bottom level of latent features, wherein the top level of latent features corresponds to rough geometry features, and wherein the bottom level of latent features corresponds to detailed shape features (Shim, p. 20888, right column, last bullet point: “We represent a memory-efficient two-stage framework composed of low-resolution SDF generation and SDF super-resolution conditioned on the low-resolution SDF”; p. 20891, Algorithm 1 training including stage 1 diffusion SDF generation and stage 2 Patch-based Diffusion SDF Super-resolution and further Algorithm 2, including stage 1 diffusion-based 3D shape Generation, and stage 2 patch based diffusion SDF super-resolution)
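For context, the recited split between a top level of latent features (rough geometry) and a bottom level (detailed shape features) can be sketched as a downsample-plus-residual decomposition (illustrative only; this is not the specific scheme of any cited reference):

```python
import numpy as np

def two_level_latents(feature_map):
    """Split a feature map into a coarse 'top' level (2x-downsampled rough
    structure) and a 'bottom' level holding the residual fine detail."""
    top = feature_map[::2, ::2]                        # rough, low-resolution level
    upsampled = np.repeat(np.repeat(top, 2, 0), 2, 1)  # nearest-neighbor upsample
    bottom = feature_map - upsampled                   # fine-detail residual
    return top, bottom

f = np.arange(16.0).reshape(4, 4)
top, bottom = two_level_latents(f)
recon = np.repeat(np.repeat(top, 2, 0), 2, 1) + bottom
# The two levels together reconstruct the original feature map exactly.
```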
Regarding claim 4, the system of claim 12 performs the method of claim 4 and as such claim 4 is rejected based on the same rationale as claim 12 set forth above.
Regarding claim 13, the limitations included from claim 10 are rejected based on the same rationale as claim 10 set forth above. Further regarding claim 13, Wu modified by Lu further discloses:
wherein the trained hierarchical autoencoder comprises a hierarchical vector quantized Variational Autoencoders (VQ-VAE) network (Lu, p. 5, Table 1 discloses model including VQ-VAE2)
Wu and Lu are combinable for the same reasons as set forth above for claim 10.
Further regarding claim 13, Wu modified by Lu does not disclose the 3D U-Net.
Shim discloses:
Wherein the trained latent diffusion model is a denoising diffusion probabilistic model, comprising a 3D U-Net, (Shim, p. 20890, Section 3.2 Diffusion SDF Generation section disclosing generating new SDF voxel through denoising, including “Network Architecture. Our model is built upon a U-shaped network in SR3 [59], which is a DDM-based super-resolution method designed for 2D image domain. Thus, we modify and improve this structure for 3D domain. We convert 2D-based U-shaped network of SR3 into 3D-based model by replacing 2D-based convolutional layers with 3D ones.”).
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, and the computer architecture as provided by Shi, by further using the 3D U-Net as provided by Shim, using known electronic interfacing and programming techniques. The modification merely substitutes one known type of denoising model for another, yielding the predictable result of utilizing a 3D U-Net denoising model in place of a different type of denoising model for image diffusion modeling. The modification also provides an improved diffusion model by better preserving finer details and structural information while suppressing various other types of noise for improved image generation results.
Regarding claim 5, the limitations included from claim 1 are rejected based on the same rationale as claim 1 set forth above. Further regarding claim 5, Shim discloses:
Wherein the trained latent diffusion model is a denoising diffusion probabilistic model, comprising a 3D U-Net, (Shim, p. 20890, Section 3.2 Diffusion SDF Generation section disclosing generating new SDF voxel through denoising, including “Network Architecture. Our model is built upon a U-shaped network in SR3 [59], which is a DDM-based super-resolution method designed for 2D image domain. Thus, we modify and improve this structure for 3D domain. We convert 2D-based U-shaped network of SR3 into 3D-based model by replacing 2D-based convolutional layers with 3D ones.”).
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, and the computer architecture as provided by Shi, by further using the 3D U-Net as provided by Shim, using known electronic interfacing and programming techniques. The modification merely substitutes one known type of denoising model for another, yielding the predictable result of utilizing a 3D U-Net denoising model in place of a different type of denoising model for image diffusion modeling. The modification also provides an improved diffusion model by better preserving finer details and structural information while suppressing various other types of noise for improved image generation results.
Regarding claim 15, the limitations included from claim 10 are rejected based on the same rationale as claim 10 set forth above. Further regarding claim 15, Shim discloses:
wherein the 3D shape representation comprises a set of volumetric Truncated-Signed Distance Field (T-SDF) values (Shim, p. 20888, 2nd paragraph: “We can view SDF as a function that takes an arbitrary location as input and returns a signed distance value from the input location to the nearest surface of the mesh, and the sign of the value means whether inside or outside of the shape. We sample SDF values uniformly from a 3D shape to form a voxel-shaped SDF. This form has several advantages over point clouds. It can directly reconstruct mesh through the marching cube algorithm [38] and can utilize convolutional neural network (CNN) because of its dense and fixed structure”; p. 20892, right col., lines 3-4: “we utilize SDF in the form of truncated signed distance fields (T-SDF) to decrease redundant information”)
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, the computer architecture as provided by Shi, by further using the data representation provided by Shim, using known electronic interfacing and programming techniques. The modification results in an improved image generation model by more efficiently utilizing memory resources and processing resources (see Shim, p. 20892, right col., lines 3-4 disclosing the reduction of redundancy).
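For context, the T-SDF voxel representation cited from Shim can be illustrated with a toy sphere (an illustrative numpy sketch; the grid resolution, radius, and truncation threshold are hypothetical):

```python
import numpy as np

def tsdf_from_sphere(resolution=32, radius=0.5, trunc=0.1):
    """Sample the signed distance field of a sphere on a voxel grid and
    truncate it: negative inside, positive outside, saturated beyond +/-trunc."""
    lin = np.linspace(-1.0, 1.0, resolution)
    x, y, z = np.meshgrid(lin, lin, lin, indexing="ij")
    sdf = np.sqrt(x**2 + y**2 + z**2) - radius  # signed distance to the surface
    return np.clip(sdf, -trunc, trunc)          # truncation drops far-field detail

tsdf = tsdf_from_sphere()
# Voxels near the surface keep graded distance values (where the mesh can later
# be extracted, e.g. by a marching cubes algorithm); far voxels saturate.
```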
Regarding claim 8, the system of claim 15 performs the method of claim 8 and as such claim 8 is rejected based on the same rationale as claim 15 set forth above.
Regarding claim 16, Wu modified by Lu, Shi and Shim further discloses:
wherein generating a 3D shape for the 3D object based on the 3D shape representation comprising transforming the set of volumetric T-SDF values into a 3D mesh using a marching cube algorithm (Shim, p. 20888, 2nd paragraph: “We can view SDF as a function that takes an arbitrary location as input and returns a signed distance value from the input location to the nearest surface of the mesh, and the sign of the value means whether inside or outside of the shape. We sample SDF values uniformly from a 3D shape to form a voxel-shaped SDF. This form has several advantages over point clouds. It can directly reconstruct mesh through the marching cube algorithm [38] and can utilize convolutional neural network (CNN) because of its dense and fixed structure”; p. 20892, right col., lines 3-4: “we utilize SDF in the form of truncated signed distance fields (T-SDF) to decrease redundant information”)
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, the computer architecture as provided by Shi, by further using the data representation provided by Shim, using known electronic interfacing and programming techniques. The modification results in an improved image generation model by more efficiently utilizing memory resources and processing resources (see Shim, p. 20892, right col., lines 3-4 disclosing the reduction of redundancy).
Regarding claim 9, the system of claim 16 performs the method of claim 9 and as such claim 9 is rejected based on the same rationale as claim 16 set forth above.
Regarding claim 20, the limitations included from claim 17 are rejected based on the same rationale as claim 17 set forth above. Further regarding claim 20, Wu discloses:
use of a Variational Auto-Encoder (VAE) (Wu, p. 4, section 3.1, and Fig. 2 on page 5)
and wherein the 3D shape for the 3D object comprises a 3D mesh (Wu, p. 5, Figure 2 discloses output as 3D mesh)
Lu further discloses:
wherein the trained hierarchical autoencoder comprises a hierarchical vector quantized Variational Autoencoders (VQ-VAE) network (Lu, p. 5, Table 1 discloses model including VQ-VAE2).
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, by incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, using known electronic interfacing and programming techniques. The use of a hierarchical diffusion autoencoder improves details and visual features that are omitted or insufficiently captured by other techniques (see Lu, p. 3, Section 3.2, ¶1 and p. 4, “Advantages and applications of HDAE”, including “while the latent space of DAE lacks low-level details, the hierarchical latent space of HDAE encodes comprehensive fine-grained-to-abstract and low-level-to-high-level features, leading to more accurate and detail-preserving image reconstruction and manipulation results”).
Further regarding claim 20, Shim discloses:
Wherein the trained latent diffusion model is a denoising diffusion probabilistic model, comprising a 3D U-Net, (Shim, p. 20890, Section 3.2 Diffusion SDF Generation section disclosing generating new SDF voxel through denoising, including “Network Architecture. Our model is built upon a U-shaped network in SR3 [59], which is a DDM-based super-resolution method designed for 2D image domain. Thus, we modify and improve this structure for 3D domain. We convert 2D-based U-shaped network of SR3 into 3D-based model by replacing 2D-based convolutional layers with 3D ones.”).
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu and the computer architecture as provided by Shi, by further using the 3D U-Net as provided by Shim, using known electronic interfacing and programming techniques. The modification merely substitutes one known type of diffusion denoising model for another, yielding the predictable result of utilizing a 3D U-Net denoising model in place of a different type of denoising model for image diffusion modeling. The modification also provides an improved diffusion model by better preserving finer details and structural information while suppressing various other types of noise, for improved image generation results.
Shim further discloses:
wherein the 3D shape representation comprises a set of volumetric Truncated-Signed Distance Field (T-SDF) values (Shim, p. 20888, 2nd paragraph: “We can view SDF as a function that takes an arbitrary location as input and returns a signed distance value from the input location to the nearest surface of the mesh, and the sign of the value means whether inside or outside of the shape. We sample SDF values uniformly from a 3D shape to form a voxel-shaped SDF. This form has several advantages over point clouds. It can directly reconstruct mesh through the marching cube algorithm [38] and can utilize convolutional neural network (CNN) because of its dense and fixed structure”; p. 20892, right col., lines 3-4: “we utilize SDF in the form of truncated signed distance fields (T-SDF) to decrease redundant information”)
wherein the 3D shape for the 3D object comprises a 3D mesh (Shim, p. 20894, Figs. 6-7)
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu and the computer architecture as provided by Shi, by further using the data representation provided by Shim, using known electronic interfacing and programming techniques. The modification results in an improved image generation model that more efficiently utilizes memory and processing resources (see Shim, p. 20892, right col., lines 3-4, disclosing the reduction of redundancy).
Claim(s) 7, 14, and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over:
Wu et al. (Wu, Shuang, et al. "Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer." arXiv:2405.14832v1 [cs.CV] (May 23, 2024): 1-15) in view of
Lu et al. (Lu, Zeyu, “Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation”, arXiv:2304.11829 [cs.CV] (April 25, 2023)) and
Shi et al. (US 2025/0078392 A1) in further view of
Sharma (US 2025/0322179 A1)
Regarding claim 14, the limitations included from claim 10 are rejected based on the same rationale as claim 10 set forth above. Further regarding claim 14, Wu further discloses:
training a hierarchical autoencoder using a set of training 3D shape models to obtain a trained (Wu, p. 4, section 3, ¶2: two step training process includes training D3D-VAE as shown in figure 2);
obtaining a set of training latent features using the trained (Wu, p. 5, Figure 2, “We utilize transformer to encode point cloud sampled from 3D model, along with a set of learnable tokens, into an explicit triplane latent space. Subsequently, a CNN-based decoder is employed to upsample these latent representations into high-resolution triplane feature maps.”);
a set of training input prompts corresponding to the set of training 3D shape models (Wu, p. 4, section 3, ¶2: “Figure 2 illustrates the overall framework of our proposed method, which comprises a two-step training process: 1) the D3D-VAE is first trained to convert 3D shapes into 3D latents, which is described in Sec. 3.1; 2) the image-conditioned D3D-DiT is then trained to generate high-quality 3D assets, which is detailed in Sec. 3.2.”; fig. 2 on page 5 discloses the training process of Direct3D; p. 9, “Text-to-3D” section: “Our Direct3D can produce 3D assets from text prompts by incorporating text-to-image models like Hunyuan-DiT”)
training a latent diffusion model at least using the set of training latent features and the set of training input prompts to obtain the trained latent diffusion model (Wu, p. 7, Section 4.1, ¶1: “Our D3D-VAE takes as input 81,920 point clouds with normal uniformly sampled from the 3D model, along with a learnable latent token of a resolution r = 32 and a channel dimension de = 768.”; p. 9, “Text-to-3D” section, “Our Direct3D can produce 3D assets from text prompts by incorporating text-to-image models like Hunyuan-DiT” and use a generated image as input to the disclosed model)
Wu modified by Lu further discloses:
obtaining a set of training latent features using the trained (Lu, p. 2: “we design the Hierarchical Diffusion Autoencoders (HDAE) that exploits the coarse-to-fine and low-level-to-high-level feature hierarchy of the semantic encoder and the diffusion-based decoder for comprehensive and hierarchical representations”; p. 3, section 3.2, ¶2: Design space of latent representations and network architectures. “we propose hierarchical diffusion autoencoders which explore the hierarchical semantic latent space of diffusion autoencoders”; p. 4, 3rd bullet point: “HDAE(U) hierarchical diffusion autoencoders with U-Net semantic encoder. HDAE(U) also adopts the hierarchical latent space design but leverages a semantic encoder with the U-Net structure.”)
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, by incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu, using known electronic interfacing and programming techniques. The modification provides the use of a hierarchical diffusion autoencoder to improve details and visual features that are omitted or insufficient with other techniques (see Lu, p. 3, section 3.2, ¶1 and p. 4, “Advantages and applications of HDAE”, including “while the latent space of DAE lacks low-level details, the hierarchical latent space of HDAE encodes comprehensive fine-grained-to-abstract and low-level-to-high-level features, leading to more accurate and detail-preserving image reconstruction and manipulation results”).
The only limitation missing is the generation of prompts using a captioning model.
Sharma discloses:
generating a set of input prompts corresponding to the set of training shape models using a captioning model (Sharma, ¶41: method 700 begins with receiving an input image from a client application (block 702). An optical character recognition process is then performed on the input image to extract original text from the image (block 704). Translated text is then generated which corresponds to a translation of the original text from a first language to a second language (block 706). In addition, a portion of the input image that includes the original text is extracted (block 708). A natural language description of the extracted portion of the input image is then generated using an image captioning model, the natural language description including a description of visual characteristics of the extracted portion of the input image (block 710). A text-to-image model is then used to generate a translated text image that corresponds to the extracted portion of the input image and that depicts the translated text in place of the original text, using the natural language description as a prompt for the model (block 712).)
It is the combination of the teachings of Sharma, which teach the use of a captioning model for producing text prompts for use with image generation modeling, with the teachings of Wu, which teach the training of a diffusion model using inputs to generate 3D images, where text can be used as an input to the model, that teaches and renders obvious the use of a captioning model to generate inputs for training a text-to-image modeling system as claimed.
It would have been obvious to one of ordinary skill in the art to modify the system and method for generating 3D objects using a prompt-based diffusion model generating 3D modeling data in latent space as provided by Wu, incorporating the use of an image autoencoder for decoding latent space data for image generation as provided by Lu and the computer architecture as provided by Shi, by further including the captioning model for generating prompts in a text-to-image AI system as provided by Sharma, using known electronic interfacing and programming techniques. The modification merely applies a known technique of generating text prompts to a system that uses text prompts as input to a generative model, ready for improvement and yielding predictable results. The base device performs the same process for training a prompt-to-3D-object generative modeling system as it would without the captioning model; the captioning model merely provides a technique for automatically generating training data for application to the text-to-3D object system. The modification would further provide an improved system by enabling faster and more robust generation of training data, allowing a captioning model to produce the training prompt data instead of requiring more expensive and time-consuming manual labeling.
Regarding claim 7, the system of claim 14 performs the method of claim 7. As such, claim 7 is rejected based on the same rationale as claim 14.
Regarding claim 19, the limitations included from claim 17 are rejected based on the same rationale as claim 17 set forth above. Further regarding claim 19, the additional operations are performed by claim 7 and as such the claim is further rejected based on the same rationale as claim 7.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM A BEUTEL whose telephone number is (571)272-3132. The examiner can normally be reached Monday-Friday 9:00 AM - 5:00 PM (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DANIEL HAJNIK can be reached at 571-272-7642. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/WILLIAM A BEUTEL/Primary Examiner, Art Unit 2616