Prosecution Insights
Last updated: April 19, 2026
Application No. 18/537,496

TEXT-TO-IMAGE DIFFUSION MODELS FOR GENERALIZABLE MESH GENERATION

Status: Non-Final OA (§103)
Filed: Dec 12, 2023
Examiner: MAZUMDER, SAPTARSHI
Art Unit: 2612
Tech Center: 2600 — Communications
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)
Grant Probability: 64% (Moderate)
Expected OA Rounds: 1-2
Estimated Time to Grant: 2y 8m
Grant Probability With Interview: 76%

Examiner Intelligence

Career Allow Rate: 64% (grants 64% of resolved cases; 241 granted / 375 resolved; +2.3% vs TC avg)
Interview Lift: +11.8% (moderate lift; resolved cases with interview)
Typical Timeline: 2y 8m average prosecution; 27 currently pending
Career History: 402 total applications across all art units
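The headline figures above follow from two simple formulas, assumed here from how such dashboards are typically computed: allow rate = granted / resolved, and interview-adjusted probability = base rate + interview lift. A quick arithmetic check:

```python
# Hedged check of the dashboard's derived figures; the formulas are
# assumptions about how the tool computes them, not documented facts.
granted, resolved = 241, 375
interview_lift = 0.118

allow_rate = granted / resolved            # career allow rate
with_interview = allow_rate + interview_lift

print(round(allow_rate * 100))             # -> 64
print(round(with_interview * 100))         # -> 76
```

Both rounded values match the displayed 64% and 76%, which supports the additive-lift reading of "+11.8%".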

Statute-Specific Performance

§101: 10.2% (-29.8% vs TC avg)
§103: 50.6% (+10.6% vs TC avg)
§102: 6.8% (-33.2% vs TC avg)
§112: 19.5% (-20.5% vs TC avg)
Tech Center averages are estimates. Based on career data from 375 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 11-14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kreis et al. (US Patent Publication 2024/0171788, "Kreis") in view of Li et al., "Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model," arXiv:2311.06214v2 [cs.CV], 23 Nov 2023 ("Li").

Regarding claim 1, Kreis teaches a processing system (Fig. 7) comprising: one or more memories (Fig. 7, element 704) comprising processor-executable instructions; and one or more processors (Fig. 7, element 706) configured to execute the processor-executable instructions ("[0102] The computer-storage media may include both volatile and nonvolatile media and/or removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, and/or other data types. For example, the memory 704 may store computer-readable instructions (e.g., that represent a program(s) and/or a program element(s), such as an operating system.") and cause the processing system to: generate a latent tensor based on processing a textual input using a diffusion machine learning model ("[0037] In some examples, different conditioning signals can be applied. The video diffusion model can accept text prompts that describe the desired video content. The video diffusion model can be updated (e.g., video fine-tuned) to text-to-image diffusion models. ..."; "[0005] The neural network model is modified from an image diffusion model by adding the at least one first temporal attention layer into the image diffusion model. In one or more embodiments, the image diffusion model may be implemented as an LDM that includes an encoder to map an input from an image space to a latent space.").

Kreis does not, however, teach generating a multiview latent tensor, wherein the multiview latent tensor corresponds to a plurality of orthographic projections corresponding to the textual input. Li teaches generating a multiview latent tensor based on processing a textual input using a diffusion machine learning model, wherein the multiview latent tensor corresponds to a plurality of orthographic projections corresponding to the textual input ("Figure 2: Overview of our method. Given a text prompt ('a car made out of sushi'), we perform multi-view generation with Gaussian blobs as initialization using fine-tuned 2D diffusion model, producing a 4-view image in the form of a 2 × 2 grid"). Li's Figure 3 then feeds these multi-view images to an image encoder to create 2D image tokens, which are latent-space vectors corresponding to the multiview projections ("Figure 3: Architecture of our sparse-view reconstructor. The model applies a pretrained ViT to encode multi-view images into pose-aware image tokens").

Kreis and Li are analogous art, as both are from the field of image generation using diffusion models. It therefore would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Kreis to generate a multiview latent tensor based on processing a textual input using a diffusion machine learning model, wherein the multiview latent tensor corresponds to a plurality of orthographic projections corresponding to the textual input, as taught by Li. The motivation for the modification is to enhance Kreis with a latent vector for each surface/plane for improved generation of 3D objects.

Kreis as modified by Li teaches generating a triplane latent tensor based on the multiview latent tensor using a conversion machine learning model (Li, page 6, Figure 3: "The model applies a pretrained ViT to encode multi-view images into pose-aware image tokens, from which we decode a triplane representation of the scene using a transformer-based decoder." The transformer-based decoder is the claimed conversion machine learning model), and generating a three-dimensional mesh based on processing the triplane latent tensor using a decoder machine learning model (Li, page 6, Fig. 3: "Finally we decode per-point triplane features to its density and color and perform volume rendering to render novel views." A NeRF is generated from the triplane representation; Figures 1, 6, 8, 10, and 12 show novel view renderings (left) and extracted meshes (right) using the corresponding generated NeRF density fields).
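The claim 1 pipeline the rejection maps onto Kreis and Li has four stages: text prompt, diffusion model producing a multiview latent, conversion model producing a triplane latent, and decoder producing a mesh. A shape-level sketch of that data flow, where every model is a hypothetical stand-in and every tensor shape is an assumption for illustration (none come from the references or the claims):

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_model(text: str, views: int = 4) -> np.ndarray:
    """Stand-in for the text-conditioned diffusion model: one latent per
    orthographic view (Li's 2x2 grid would correspond to views=4)."""
    return rng.standard_normal((views, 8, 32, 32))

def conversion_model(multiview_latent: np.ndarray) -> np.ndarray:
    """Stand-in for the transformer-based decoder that maps pose-aware
    view latents to a triplane (XY, XZ, YZ) representation."""
    pooled = multiview_latent.mean(axis=0)        # fuse the view axis
    return np.stack([pooled, pooled, pooled])     # three planes

def decoder_model(triplane: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Stand-in mesh decoder: returns (vertices, faces) of a placeholder mesh."""
    verts = rng.standard_normal((100, 3))
    faces = rng.integers(0, 100, size=(196, 3))
    return verts, faces

mv = diffusion_model("a car made out of sushi")   # prompt borrowed from Li's Fig. 2
tri = conversion_model(mv)
verts, faces = decoder_model(tri)
print(mv.shape, tri.shape, verts.shape)           # (4, 8, 32, 32) (3, 8, 32, 32) (100, 3)
```

The point is only the claimed data flow (multiview latent → triplane latent → mesh), not any reference's actual architecture.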
Claim 12 is directed to a method whose steps are similar in scope and function to the elements of device claim 1; claim 12 is therefore rejected with the same rationale as claim 1. Claim 20 is directed to one or more non-transitory computer-readable media (Kreis, "[0102] The computer-storage media may include both volatile and nonvolatile media and/or removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, and/or other data types. For example, the memory 704 may store computer-readable instructions (e.g., that represent a program(s) and/or a program element(s), such as an operating system."), and its elements are similar in scope and function to the elements of device claim 1; claim 20 is therefore rejected with the same rationale as claim 1.

Regarding claims 2 and 13, Kreis as modified by Li teaches wherein: the textual input specifies an object (Li, Fig. 2 provides text that refers to a car, i.e., an object), and the plurality of orthographic projections corresponds to four orthographic views of the object (see the discussion in Section 3.1, subsection "Multi-view generation with image grid," and Fig. 2), but does not teach that the plurality of orthographic projections corresponds to six orthographic views of the object. Six orthographic views, however, is an obvious variation of the four orthographic views taught by Li. It therefore would have been obvious to a person of ordinary skill in the art to have modified Kreis as modified by Li so that the plurality of orthographic projections corresponds to six orthographic views of the object, for better 3D image generation, as more views provide better coverage of an object.
Regarding claims 3 and 14, Kreis as modified by Li teaches wherein, to generate the triplane latent tensor, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to process the multiview latent tensor, along with a three-dimensional conditioning token, using the diffusion machine learning model to generate an intermediate tensor (Li, Figure 3 and page 6, third paragraph: "We use triplane as the scene representation. The triplane is flattened to a sequence of learnable tokens, and the image-to-triplane decoder connects these triplane tokens with the pose-aware image tokens fI using cross-attention layers, followed by self-attention and MLP layers. The final output tokens are reshaped and upsampled using a de-convolution layer to the final triplane representation." The intermediate tensor is the summation of the 3D image tokens and the triplane tokens in Fig. 3).

Regarding claim 11, Kreis as modified by Li teaches wherein the one or more processors are configured to further execute the processor-executable instructions and cause the processing system to: render an image depicting the three-dimensional mesh (Li, Fig. 3: "Finally we decode per-point triplane features to its density and color and perform volume rendering to render novel views"); and output the rendered image via a display (Kreis: "[0111] The presentation component(s) 718 may include a display (e.g., a monitor, a touch screen, a television screen, a heads-up-display (HUD), other display types, or a combination thereof), speakers, and/or other presentation components. The presentation component(s) 718 may receive data from other components (e.g., the GPU(s) 708, the CPU(s) 706, DPUs, etc.), and output the data (e.g., as an image, video, sound, etc.).").

Claims 4 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Kreis as modified by Li, and further in view of Balaji et al., "eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers," arXiv preprint arXiv:2211.01324 (2022) ("Balaji").

Regarding claims 4 and 15, Kreis as modified by Li teaches wherein, to generate the multiview latent tensor, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to process data without the three-dimensional conditioning token using the diffusion machine learning model (Li, see Fig. 3: the 2D image tokens are generated without the three-dimensional conditioning token), but does not teach doing so using a plurality of iterations. Balaji, however, teaches processing data in a plurality of iterations using a diffusion machine learning model (Abstract: "Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis, demonstrating complex text comprehension and outstanding zero-shot generalization. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts."). Kreis as modified by Li and Balaji are analogous art, as both are from the field of text-to-image creation. It therefore would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Kreis as modified by Li to process data without the three-dimensional conditioning token in a plurality of iterations using the diffusion machine learning model, based on Balaji's teaching. The motivation for the modification is to reduce noise in text-to-image creation.
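Balaji is cited only for the iterative character of diffusion sampling: starting from random noise, the model refines the latent over many steps. A toy sketch of that loop, where `denoise_step` is a hypothetical stand-in (a simple shrink-toward-target rule, not Balaji's sampler) and the step count is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(latent: np.ndarray, t: int, T: int) -> np.ndarray:
    """Toy step: move the latent a fraction of the way toward a fixed
    target, the way a real sampler progressively removes noise."""
    target = np.zeros_like(latent)
    alpha = 1.0 / (T - t + 1)          # step sizes grow as t -> T
    return latent + alpha * (target - latent)

T = 50
latent = rng.standard_normal((8, 32, 32))   # start from pure noise
for t in range(T):                          # the "plurality of iterations"
    latent = denoise_step(latent, t, T)

# After T telescoping steps the latent has shrunk to initial/(T+1),
# i.e. far closer to the target than the starting noise.
print(float(np.abs(latent).mean()) < 0.1)   # -> True
```

The single-iteration generation of the intermediate tensor discussed next is the contrast case: one pass rather than a loop like this.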
Kreis as modified by Li and Balaji further teaches wherein, to generate the intermediate tensor, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to perform a single iteration of processing data using the diffusion machine learning model (Li, Fig. 3: the intermediate tensor is the addition of the 3D image tokens and the triplane tokens, which is done in a single iteration).

Claims 6 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Kreis as modified by Li, and further in view of Shi et al. (US Patent Publication 2025/0184581, "Shi") and Park et al. (US Patent Publication 2023/0237741, "Park").

Regarding claims 6 and 17, Kreis as modified by Li does not expressly teach: generate a multiview image based on the multiview latent tensor; generate a two-dimensional texture based on the multiview image; and texture the three-dimensional mesh based on the two-dimensional texture. Shi, however, teaches generating a multiview image based on the multiview latent tensor ("[0018] The text 101 may be input into the second machine learning model 104. The set of multi-view images or latent representations of the set of multi-view images 103 may be input into the second machine learning model 104. The second machine learning model 104 may generate a plurality of sets of multi-view images based on the text 101 and the set of multi-view images or latent representations of the set of multi-view images 103."), and Park teaches generating a two-dimensional texture based on the multiview image ("[0116] In operation 1320, the 3D model generator 110 may generate a multi-texture and a 3D mesh based on the multi-view image obtained in operation 1310.") and texturing the three-dimensional mesh based on the two-dimensional texture (Park, "[0120] In operation 1360, the AR device 101 may output the rendered image. In this case, the rendered image may be a multi-texture 3D mesh Nreal glass image").
Kreis as modified by Li, Shi, and Park are analogous art, as all are from the field of image processing. It therefore would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Kreis as modified by Li to generate a multiview image based on the multiview latent tensor as taught by Shi, and to generate a two-dimensional texture based on the multiview image and texture the three-dimensional mesh based on the two-dimensional texture as taught by Park. The motivation for the modification is to properly texture each side of a 3D object for rendering.

Allowable Subject Matter

Claims 5, 7-10, 16, 18, and 19 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Claims 5 and 16 are objected to because the combination of prior art fails to expressly teach wherein, to generate the triplane latent tensor, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to: for each respective pair of parallel orthographic projections, from the plurality of orthographic projections, the respective pair corresponding to opposite views of an object specified in the textual input: realign at least one orthographic projection of the respective pair of parallel orthographic projections to match orientations; and concatenate the realigned respective pair of parallel orthographic projections; and process the concatenated orthographic projections using the conversion machine learning model.
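The claims 5/16 limitation the examiner deemed allowable pairs each orthographic view with its opposite, realigns one of the pair so orientations match, and concatenates the pair before the conversion model. A minimal sketch of that operation, with the view names, latent shapes, and the horizontal-flip realignment all assumed for illustration (the claim does not specify them):

```python
import numpy as np

# Hypothetical per-view latents, one (channels, H, W) array per orthographic view.
views = {name: np.random.default_rng(i).standard_normal((8, 32, 32))
         for i, name in enumerate(["front", "back", "left", "right", "top", "bottom"])}

# Pairs of parallel projections corresponding to opposite views of the object.
opposite_pairs = [("front", "back"), ("left", "right"), ("top", "bottom")]

def realign(view: np.ndarray) -> np.ndarray:
    """Assumed realignment: mirror horizontally so the opposite view
    shares its partner's orientation."""
    return view[:, :, ::-1]

# Concatenate each realigned pair channel-wise, ready for the conversion model.
concatenated = [np.concatenate([views[a], realign(views[b])], axis=0)
                for a, b in opposite_pairs]
print([c.shape for c in concatenated])   # [(16, 32, 32), (16, 32, 32), (16, 32, 32)]
```

Channel-wise concatenation is one plausible reading of "concatenate"; the claim language would also cover other axes.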
Claims 7 and 18 are objected to because the combination of the best available prior art fails to expressly teach: generate a UV mapping based on the three-dimensional mesh; initialize the two-dimensional texture based on mapping each texel in the two-dimensional texture to a corresponding point on the three-dimensional mesh, based on the UV mapping; and project the multiview image into the initialized two-dimensional texture.

Claim 8 is objected to because the combination of the best available prior art fails to expressly teach wherein the one or more processors are configured to further execute the processor-executable instructions and cause the processing system to: generate a plurality of textures using the diffusion machine learning model, each of the plurality of textures corresponding to a respective texture modality; and texture the three-dimensional mesh based further on the plurality of textures.

Claims 9 and 19 are objected to because the combination of prior art fails to expressly teach: generate an auxiliary multiview latent tensor using the diffusion machine learning model; and generate the triplane latent tensor based on both the multiview latent tensor and the auxiliary multiview latent tensor.

Claim 10 is objected to by virtue of its dependency.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SAPTARSHI MAZUMDER, whose telephone number is (571) 270-3454. The examiner can normally be reached 8 am-4 pm PST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Said Broome, can be reached at (571) 272-2931.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /SAPTARSHI MAZUMDER/Primary Examiner, Art Unit 2612

Prosecution Timeline

Dec 12, 2023 — Application Filed
Feb 21, 2026 — Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597211: GENERATING VARIANTS OF VIRTUAL OBJECTS BASED ON ADJUSTABLE EXTERNAL FACTORS (granted Apr 07, 2026; 2y 5m to grant)
Patent 12586316: METHOD FOR MIRRORING 3D OBJECTS TO LIGHT FIELD DISPLAYS (granted Mar 24, 2026; 2y 5m to grant)
Patent 12582488: USER INTERFACE FOR CONNECTING MODEL STRUCTURES AND ASSOCIATED SYSTEMS AND METHODS (granted Mar 24, 2026; 2y 5m to grant)
Patent 12579745: Curvature-Guided Inter-Patch 3D Inpainting for Dynamic Mesh Coding (granted Mar 17, 2026; 2y 5m to grant)
Patent 12567210: Multipath Artifact Avoidance in Mobile Dimensioning (granted Mar 03, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 64%
With Interview: 76% (+11.8%)
Median Time to Grant: 2y 8m
PTA Risk: Low
Based on 375 resolved cases by this examiner. Grant probability derived from career allow rate.
