Prosecution Insights
Last updated: April 19, 2026
Application No. 18/748,080

METHOD AND APPARATUS FOR GENERATING 3D SCENE BASED ON LARGE LANGUAGE MODEL, ELECTRONIC DEVICE, AND STORAGE MEDIUM

Non-Final OA §103
Filed: Jun 19, 2024
Examiner: PATEL, SHIVANG I
Art Unit: 2615
Tech Center: 2600 — Communications
Assignee: BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD.
OA Round: 1 (Non-Final)
Grant Probability: 74% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 4m
With Interview: 93%

Examiner Intelligence

Career Allow Rate: 74% (above average; 309 granted / 415 resolved; +12.5% vs TC avg)
Interview Lift: +18.5% (strong; resolved cases with vs. without interview)
Typical Timeline: 2y 4m avg prosecution; 22 currently pending
Career History: 437 total applications across all art units
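The headline allow-rate figures above are simple ratios over the examiner's resolved cases. As a sanity check, here is a minimal Python sketch; the numbers are taken from the card above, and the back-calculated Tech Center average is an inference from the reported +12.5% delta, not a figure shown on the page:

```python
# Figures from the Examiner Intelligence card above.
granted = 309
resolved = 415
lift_vs_tc = 0.125  # the card reports the allow rate as +12.5% vs the TC average

allow_rate = granted / resolved               # career allow rate
implied_tc_avg = allow_rate - lift_vs_tc      # inferred baseline, ~62%

print(f"Career allow rate: {allow_rate:.0%}")  # 74%, matching the card
```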

Statute-Specific Performance

§101: 10.3% (-29.7% vs TC avg)
§103: 57.8% (+17.8% vs TC avg)
§102: 16.7% (-23.3% vs TC avg)
§112: 13.5% (-26.5% vs TC avg)
Deltas are measured against a Tech Center average estimate • Based on career data from 415 resolved cases
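Each per-statute figure above is reported as a raw rate plus a delta against the Tech Center average. Subtracting the delta from the rate recovers the implied baseline, which comes out the same (40.0) for every statute. A small, purely illustrative Python check with the data copied from the figures above:

```python
# Per-statute rates and their reported deltas vs the TC average, in percent.
stats = {
    "§101": (10.3, -29.7),
    "§103": (57.8, +17.8),
    "§102": (16.7, -23.3),
    "§112": (13.5, -26.5),
}

# The implied baseline (rate - delta) should match across statutes.
baselines = {k: round(rate - delta, 1) for k, (rate, delta) in stats.items()}
print(baselines)  # every statute implies the same 40.0 baseline
```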

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Saharia et al. (US 20230377226 A1) in view of Singh et al. (US 20240095077 A1).
Regarding claim 1, Saharia discloses a method for generating 3D scene based on large language model (LLM) ([0052] image generation system that combines the power of text encoder neural networks (e.g., large language models (LLMs)) with a sequence of generative neural networks (e.g., diffusion-based models) to deliver text-to-image generation with a high degree of photorealism, fidelity, and deep language understanding), comprising: processing description information of a target three-dimensional scene to obtain label information in the description information ([0064] text prompt can be a text sequence that includes multiple text tokens in a natural language); generating query operation prompt of the LLM based on the label information ([0065] text encoder is configured to process the text prompt to generate a set of contextual embeddings (u) of the text prompt), and acquiring a target asset set matched with the label information by the LLM based on the query operation prompt, the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset; and generating the target three-dimensional scene based on the target asset set ([0070] final image depicts the scene described by the text prompt and is output by the system with a final resolution).

Singh discloses acquiring a target asset set matched with the label information by the LLM based on the query operation prompt ([0038] generate an image data structure 110 as a representation of the scene, object(s), and/or environment. For example, the image data structure 110 can be a data structure that can be queried to retrieve one or more 2D or 3D portions of the representation), the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset ([0038] scene representation can be generated and/or updated using at least one of real or synthetic image data, such as image data captured using image capture devices in a physical/real-world environment, or synthetic image data generated to represent virtual or simulated environments).

Saharia and Singh are combinable because they are from the same field of invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the image generation system of Saharia to include acquiring a target asset set matched with the label information by the LLM based on the query operation prompt, the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset, as described by Singh. The motivation for doing so would have been to support synthetic image generation, including diffusion models that can be used to update or supplement (e.g., inpaint) neural radiance field (NeRF) representations of 3D environments (Singh, [0003]). Therefore, it would have been obvious to combine Saharia and Singh to obtain the invention as specified in claim 1.
Regarding claim 2, Saharia discloses wherein the processing description information of a target three-dimensional scene to obtain label information in the description information comprises: generating extraction operation prompt of the LLM based on the description information, and processing the description information by the LLM based on the extraction operation prompt to obtain the label information ([0074] system receives an input text prompt including a sequence of text tokens in a natural language).

Regarding claim 3, Saharia discloses wherein the acquiring a target asset set matched with the label information by the LLM based on the query operation prompt comprises: matching the label information with a plurality of pieces of pre-recorded candidate information by the LLM based on the query operation prompt to obtain target information of the target asset set ([0082] A GNN 120 facilitates a likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of images); and acquiring the target asset set based on the target information ([0082] latent spaces can also provide the GNNs a means of combining, mixing, and compressing information from different images such that the sequence can generate new instances of images that are ostensibly unlike anything appearing in the training sets).

Regarding claim 4, Saharia discloses wherein the acquiring the target asset set based on the target information comprises: acquiring the target asset set in a user-customized local asset library based on the target information ([0064] the system 100 can generate various different types of images such as three-dimensional (3D) images, photorealistic images, cartoon images, abstract visualizations, point cloud images, medical images of different modalities, among others).
Regarding claim 5, Saharia discloses wherein the generating the target three-dimensional scene based on the target asset set comprises: generating an initial three-dimensional scene based on the target asset set ([0066] initial GNN may receive a set of contextual embeddings associated with the text prompt “photograph of cat”); and adjusting the initial three-dimensional scene based on scene function information in the label information to generate the target three-dimensional scene ([0066] one or more of the subsequent GNNs may receive a set of contextual embeddings associated with the text prompt “oil painting of cat”).

Regarding claim 6, Saharia discloses displaying the label information to a user ([0194] a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client); acquiring the label information modified by the user ([0082] likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of images); and acquiring a modified target asset set based on the modified label information ([0068] the GNNs 120 can learn these transformations and associate them with respective text modifiers included in text prompts), and generating a modified target three-dimensional scene based on the modified target asset set ([0069] post-processor 130 may perform analysis on the output image 106 such as image classification and/or image quality analysis).

Regarding claim 7, Saharia discloses displaying the target three-dimensional scene to the user ([0070] final image 108 depicts the scene described by the text prompt 102 and is output by the system 100 with a final resolution); and generating the modified target three-dimensional scene based on a modification instruction of the user ([0164] the image generation system 101 shown in FIG. 6A can generate images from noise which amounts to changing the conditioning input into the image generation system 100 of FIG. 1A, e.g., replacing text prompts 102 with noise inputs).

Regarding claim 8, Saharia discloses an electronic device ([0052] image generation system that combines the power of text encoder neural networks (e.g., large language models (LLMs)) with a sequence of generative neural networks (e.g., diffusion-based models) to deliver text-to-image generation with a high degree of photorealism, fidelity, and deep language understanding), comprising: at least one processor ([0181] a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions); and a memory connected with the at least one processor communicatively ([0181] computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them); wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method for generating 3D scene based on large language model (LLM) ([0181] the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus), the method for generating 3D scene based on large language model comprising processing description information of a target three-dimensional scene to obtain label information in the description information ([0064] text prompt can be a text sequence that includes multiple text tokens in a natural language); generating query operation prompt of the LLM based on the label information ([0065] text encoder is configured to process the text prompt to generate a set of contextual embeddings (u) of the text prompt), and acquiring a target asset set matched with the label information by the LLM based on the query operation prompt, the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset; and generating the target three-dimensional scene based on the target asset set ([0070] final image depicts the scene described by the text prompt and is output by the system with a final resolution).

Singh discloses acquiring a target asset set matched with the label information by the LLM based on the query operation prompt ([0038] generate an image data structure 110 as a representation of the scene, object(s), and/or environment. For example, the image data structure 110 can be a data structure that can be queried to retrieve one or more 2D or 3D portions of the representation), the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset ([0038] scene representation can be generated and/or updated using at least one of real or synthetic image data, such as image data captured using image capture devices in a physical/real-world environment, or synthetic image data generated to represent virtual or simulated environments).

Saharia and Singh are combinable because they are from the same field of invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the image generation system of Saharia to include acquiring a target asset set matched with the label information by the LLM based on the query operation prompt, the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset, as described by Singh. The motivation for doing so would have been to support synthetic image generation, including diffusion models that can be used to update or supplement (e.g., inpaint) neural radiance field (NeRF) representations of 3D environments (Singh, [0003]). Therefore, it would have been obvious to combine Saharia and Singh to obtain the invention as specified in claim 8.

Regarding claim 9, Saharia discloses wherein the processing description information of a target three-dimensional scene to obtain label information in the description information comprises: generating extraction operation prompt of the LLM based on the description information, and processing the description information by the LLM based on the extraction operation prompt to obtain the label information ([0074] system receives an input text prompt including a sequence of text tokens in a natural language).
Regarding claim 10, Saharia discloses wherein the acquiring a target asset set matched with the label information by the LLM based on the query operation prompt comprises: matching the label information with a plurality of pieces of pre-recorded candidate information by the LLM based on the query operation prompt to obtain target information of the target asset set ([0082] A GNN 120 facilitates a likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of images); and acquiring the target asset set based on the target information ([0082] latent spaces can also provide the GNNs a means of combining, mixing, and compressing information from different images such that the sequence can generate new instances of images that are ostensibly unlike anything appearing in the training sets).

Regarding claim 11, Saharia discloses wherein the acquiring the target asset set based on the target information comprises: acquiring the target asset set in a user-customized local asset library based on the target information ([0064] the system 100 can generate various different types of images such as three-dimensional (3D) images, photorealistic images, cartoon images, abstract visualizations, point cloud images, medical images of different modalities, among others).

Regarding claim 12, Saharia discloses wherein the generating the target three-dimensional scene based on the target asset set comprises: generating an initial three-dimensional scene based on the target asset set ([0066] initial GNN may receive a set of contextual embeddings associated with the text prompt “photograph of cat”); and adjusting the initial three-dimensional scene based on scene function information in the label information to generate the target three-dimensional scene ([0066] one or more of the subsequent GNNs may receive a set of contextual embeddings associated with the text prompt “oil painting of cat”).
Regarding claim 13, Saharia discloses displaying the label information to a user ([0194] a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client); acquiring the label information modified by the user ([0082] likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of images); and acquiring a modified target asset set based on the modified label information ([0068] the GNNs 120 can learn these transformations and associate them with respective text modifiers included in text prompts), and generating a modified target three-dimensional scene based on the modified target asset set ([0069] post-processor 130 may perform analysis on the output image 106 such as image classification and/or image quality analysis).

Regarding claim 14, Saharia discloses displaying the target three-dimensional scene to the user ([0070] final image 108 depicts the scene described by the text prompt 102 and is output by the system 100 with a final resolution); and generating the modified target three-dimensional scene based on a modification instruction of the user ([0164] the image generation system 101 shown in FIG. 6A can generate images from noise which amounts to changing the conditioning input into the image generation system 100 of FIG. 1A, e.g., replacing text prompts 102 with noise inputs).
Regarding claim 15, Saharia discloses a non-transitory computer readable storage medium with computer instructions stored thereon ([0181] computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them), wherein the computer instructions are used for causing a computer to perform a method for generating 3D scene based on large language model (LLM), the method for generating 3D scene based on large language model ([0052] image generation system that combines the power of text encoder neural networks (e.g., large language models (LLMs)) with a sequence of generative neural networks (e.g., diffusion-based models) to deliver text-to-image generation with a high degree of photorealism, fidelity, and deep language understanding), comprising: processing description information of a target three-dimensional scene to obtain label information in the description information ([0064] text prompt can be a text sequence that includes multiple text tokens in a natural language); generating query operation prompt of the LLM based on the label information ([0065] text encoder is configured to process the text prompt to generate a set of contextual embeddings (u) of the text prompt), and acquiring a target asset set matched with the label information by the LLM based on the query operation prompt, the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset; and generating the target three-dimensional scene based on the target asset set ([0070] final image depicts the scene described by the text prompt and is output by the system with a final resolution).
Singh discloses acquiring a target asset set matched with the label information by the LLM based on the query operation prompt ([0038] generate an image data structure 110 as a representation of the scene, object(s), and/or environment. For example, the image data structure 110 can be a data structure that can be queried to retrieve one or more 2D or 3D portions of the representation), the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset ([0038] scene representation can be generated and/or updated using at least one of real or synthetic image data, such as image data captured using image capture devices in a physical/real-world environment, or synthetic image data generated to represent virtual or simulated environments).

Saharia and Singh are combinable because they are from the same field of invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the image generation system of Saharia to include acquiring a target asset set matched with the label information by the LLM based on the query operation prompt, the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset, as described by Singh. The motivation for doing so would have been to support synthetic image generation, including diffusion models that can be used to update or supplement (e.g., inpaint) neural radiance field (NeRF) representations of 3D environments (Singh, [0003]). Therefore, it would have been obvious to combine Saharia and Singh to obtain the invention as specified in claim 15.
Regarding claim 16, Saharia discloses wherein the processing description information of a target three-dimensional scene to obtain label information in the description information comprises: generating extraction operation prompt of the LLM based on the description information, and processing the description information by the LLM based on the extraction operation prompt to obtain the label information ([0074] system receives an input text prompt including a sequence of text tokens in a natural language).

Regarding claim 17, Saharia discloses wherein the acquiring a target asset set matched with the label information by the LLM based on the query operation prompt comprises: matching the label information with a plurality of pieces of pre-recorded candidate information by the LLM based on the query operation prompt to obtain target information of the target asset set ([0082] A GNN 120 facilitates a likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of images); and acquiring the target asset set based on the target information ([0082] latent spaces can also provide the GNNs a means of combining, mixing, and compressing information from different images such that the sequence can generate new instances of images that are ostensibly unlike anything appearing in the training sets).

Regarding claim 18, Saharia discloses wherein the acquiring the target asset set based on the target information comprises: acquiring the target asset set in a user-customized local asset library based on the target information ([0064] the system 100 can generate various different types of images such as three-dimensional (3D) images, photorealistic images, cartoon images, abstract visualizations, point cloud images, medical images of different modalities, among others).
Regarding claim 19, Saharia discloses wherein the generating the target three-dimensional scene based on the target asset set comprises: generating an initial three-dimensional scene based on the target asset set ([0066] initial GNN may receive a set of contextual embeddings associated with the text prompt “photograph of cat”); and adjusting the initial three-dimensional scene based on scene function information in the label information to generate the target three-dimensional scene ([0066] one or more of the subsequent GNNs may receive a set of contextual embeddings associated with the text prompt “oil painting of cat”).

Regarding claim 20, Saharia discloses displaying the label information to a user ([0194] a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client); acquiring the label information modified by the user ([0082] likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of images); and acquiring a modified target asset set based on the modified label information ([0068] the GNNs 120 can learn these transformations and associate them with respective text modifiers included in text prompts), and generating a modified target three-dimensional scene based on the modified target asset set ([0069] post-processor 130 may perform analysis on the output image 106 such as image classification and/or image quality analysis).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHIVANG I PATEL whose telephone number is (571)272-8964. The examiner can normally be reached on M-F 9-5am. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alicia Harrington, can be reached on (571) 272-2330. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHIVANG I PATEL/
Primary Examiner, Art Unit 2615

Prosecution Timeline

Jun 19, 2024: Application Filed
Mar 05, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602847: SYSTEMS AND METHODS FOR LAYERED IMAGE GENERATION (2y 5m to grant; granted Apr 14, 2026)
Patent 12599838: APPARATUS AND METHODS FOR RECORDING AND REPORTING ABUSIVE ONLINE INTERACTIONS (2y 5m to grant; granted Apr 14, 2026)
Patent 12592004: IMAGE PROCESSING DEVICE AND IMAGE PROCESSING METHOD (2y 5m to grant; granted Mar 31, 2026)
Patent 12591947: DISTORTION-BASED IMAGE RENDERING (2y 5m to grant; granted Mar 31, 2026)
Patent 12584296: Work Machine Display Control System, Work Machine Display System, Work Machine, Work Machine Display Control Method, And Work Machine Display Control Program (2y 5m to grant; granted Mar 24, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 74%
With Interview: 93% (+18.5%)
Median Time to Grant: 2y 4m
PTA Risk: Low
Based on 415 resolved cases by this examiner. Grant probability derived from career allow rate.
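The with-interview projection appears to be simple addition: the base grant probability plus the examiner's interview lift, capped at 100. A sketch of that arithmetic (the rounding of 92.5 up to the displayed 93% is an inference about how the card formats the value):

```python
base_grant = 74.0       # grant probability, in percent (from the card above)
interview_lift = 18.5   # interview lift, in percentage points

with_interview = min(base_grant + interview_lift, 100.0)
print(with_interview)  # 92.5, which the card displays rounded to 93%
```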
