DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Saharia et al. (US 20230377226 A1) in view of Singh et al. (US 20240095077 A1).
Regarding claim 1, Saharia discloses a method for generating 3D scene based on large language model (LLM) ([0052] image generation system that combines the power of text encoder neural networks (e.g., large language models (LLMs)) with a sequence of generative neural networks (e.g., diffusion-based models) to deliver text-to-image generation with a high degree of photorealism, fidelity, and deep language understanding), comprising:
processing description information of a target three-dimensional scene to obtain label information in the description information ([0064] text prompt can be a text sequence that includes multiple text tokens in a natural language);
generating query operation prompt of the LLM based on the label information ([0065] text encoder is configured to process the text prompt to generate a set of contextual embeddings (u) of the text prompt), and
acquiring a target asset set matched with the label information by the LLM based on the query operation prompt,
the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset; and
generating the target three-dimensional scene based on the target asset set ([0070] final image depicts the scene described by the text prompt and is output by the system with a final resolution).
Saharia does not explicitly disclose acquiring the target asset set, with the recited material and scene attribute information, by the LLM based on the query operation prompt.
However, Singh discloses acquiring a target asset set matched with the label information by the LLM based on the query operation prompt ([0038] generate an image data structure 110 as a representation of the scene, object(s), and/or environment. For example, the image data structure 110 can be a data structure that can be queried to retrieve one or more 2D or 3D portions of the representation),
the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset ([0038] scene representation can be generated and/or updated using at least one of real or synthetic image data, such as image data captured using image capture devices in a physical/real-world environment, or synthetic image data generated to represent virtual or simulated environments.).
Saharia and Singh are combinable because they are from the same field of endeavor.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the image generation system of Saharia to include acquiring a target asset set matched with the label information by the LLM based on the query operation prompt, the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset, and target scene attribute information of the target asset, as described by Singh.
The motivation for doing so would have been to improve synthetic image generation, including by using diffusion models to update or supplement (e.g., inpaint) neural radiance field (NeRF) representations of 3D environments (Singh, [0003]).
Therefore, it would have been obvious to combine Saharia and Singh to obtain the invention as specified in claim 1.
Regarding claim 2, Saharia discloses wherein the processing description information of a target three-dimensional scene to obtain label information in the description information comprises:
generating extraction operation prompt of the LLM based on the description information, and processing the description information by the LLM based on the extraction operation prompt to obtain the label information ([0074] system receives an input text prompt including a sequence of text tokens in a natural language).
Regarding claim 3, Saharia discloses wherein the acquiring a target asset set matched with the label information by the LLM based on the query operation prompt comprises:
matching the label information with a plurality of pieces of pre-recorded candidate information by the LLM based on the query operation prompt to obtain target information of the target asset set ([0082] A GNN 120 facilitates a likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of image); and
acquiring the target asset set based on the target information ([0082] latent spaces can also provide the GNNs a means of combining, mixing, and compressing information from different images such that the sequence can generate new instances of images that are ostensibly unlike anything appearing in the training sets).
Regarding claim 4, Saharia discloses wherein the acquiring the target asset set based on the target information comprises:
acquiring the target asset set in a user-customized local asset library based on the target information ([0064] the system 100 can generate various different types of images such as three-dimensional (3D) images, photorealistic images, cartoon images, abstract visualizations, point cloud images, medical images of different modalities, among others).
Regarding claim 5, Saharia discloses wherein the generating the target three-dimensional scene based on the target asset set comprises:
generating an initial three-dimensional scene based on the target asset set ([0066] initial GNN may receive a set of contextual embeddings associated with the text prompt “photograph of cat”); and
adjusting the initial three-dimensional scene based on scene function information in the label information to generate the target three-dimensional scene ([0066] one or more of the subsequent GNNs may receive a set of contextual embeddings associated with the text prompt “oil painting of cat”).
Regarding claim 6, Saharia discloses displaying the label information to a user ([0194] a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client.);
acquiring the label information modified by the user ([0082] likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of images); and
acquiring a modified target asset set based on the modified label information ([0068] the GNNs 120 can learn these transformations and associate them with respective text modifiers included in text prompts), and
generating a modified target three-dimensional scene based on the modified target asset set ([0069] post-processor 130 may perform analysis on the output image 106 such as image classification and/or image quality analysis).
Regarding claim 7, Saharia discloses displaying the target three-dimensional scene to the user ([0070] final image 108 depicts the scene described by the text prompt 102 and is output by the system 100 with a final resolution.); and
generating the modified target three-dimensional scene based on a modification instruction of the user ([0164] the image generation system 101 shown in FIG. 6A can generate images from noise which amounts to changing the conditioning input into the image generation system 100 of FIG. 1A, e.g., replacing text prompts 102 with noise inputs).
Regarding claim 8, Saharia discloses an electronic device ([0052] image generation system that combines the power of text encoder neural networks (e.g., large language models (LLMs)) with a sequence of generative neural networks (e.g., diffusion-based models) to deliver text-to-image generation with a high degree of photorealism, fidelity, and deep language understanding), comprising:
at least one processor ([0181] a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions.); and
a memory connected with the at least one processor communicatively ([0181] computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them);
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method for generating 3D scene based on large language model (LLM) ([0181] the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.), the method for generating 3D scene based on large language model comprising:
processing description information of a target three-dimensional scene to obtain label information in the description information ([0064] text prompt can be a text sequence that includes multiple text tokens in a natural language);
generating query operation prompt of the LLM based on the label information ([0065] text encoder is configured to process the text prompt to generate a set of contextual embeddings (u) of the text prompt), and
acquiring a target asset set matched with the label information by the LLM based on the query operation prompt,
the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset; and
generating the target three-dimensional scene based on the target asset set ([0070] final image depicts the scene described by the text prompt and is output by the system with a final resolution).
Saharia does not explicitly disclose acquiring the target asset set, with the recited material and scene attribute information, by the LLM based on the query operation prompt.
However, Singh discloses acquiring a target asset set matched with the label information by the LLM based on the query operation prompt ([0038] generate an image data structure 110 as a representation of the scene, object(s), and/or environment. For example, the image data structure 110 can be a data structure that can be queried to retrieve one or more 2D or 3D portions of the representation),
the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset ([0038] scene representation can be generated and/or updated using at least one of real or synthetic image data, such as image data captured using image capture devices in a physical/real-world environment, or synthetic image data generated to represent virtual or simulated environments.).
Saharia and Singh are combinable because they are from the same field of endeavor.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the image generation system of Saharia to include acquiring a target asset set matched with the label information by the LLM based on the query operation prompt, the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset, and target scene attribute information of the target asset, as described by Singh.
The motivation for doing so would have been to improve synthetic image generation, including by using diffusion models to update or supplement (e.g., inpaint) neural radiance field (NeRF) representations of 3D environments (Singh, [0003]).
Therefore, it would have been obvious to combine Saharia and Singh to obtain the invention as specified in claim 8.
Regarding claim 9, Saharia discloses wherein the processing description information of a target three-dimensional scene to obtain label information in the description information comprises:
generating extraction operation prompt of the LLM based on the description information, and processing the description information by the LLM based on the extraction operation prompt to obtain the label information ([0074] system receives an input text prompt including a sequence of text tokens in a natural language).
Regarding claim 10, Saharia discloses wherein the acquiring a target asset set matched with the label information by the LLM based on the query operation prompt comprises:
matching the label information with a plurality of pieces of pre-recorded candidate information by the LLM based on the query operation prompt to obtain target information of the target asset set ([0082] A GNN 120 facilitates a likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of image); and
acquiring the target asset set based on the target information ([0082] latent spaces can also provide the GNNs a means of combining, mixing, and compressing information from different images such that the sequence can generate new instances of images that are ostensibly unlike anything appearing in the training sets).
Regarding claim 11, Saharia discloses wherein the acquiring the target asset set based on the target information comprises:
acquiring the target asset set in a user-customized local asset library based on the target information ([0064] the system 100 can generate various different types of images such as three-dimensional (3D) images, photorealistic images, cartoon images, abstract visualizations, point cloud images, medical images of different modalities, among others).
Regarding claim 12, Saharia discloses wherein the generating the target three-dimensional scene based on the target asset set comprises:
generating an initial three-dimensional scene based on the target asset set ([0066] initial GNN may receive a set of contextual embeddings associated with the text prompt “photograph of cat”); and
adjusting the initial three-dimensional scene based on scene function information in the label information to generate the target three-dimensional scene ([0066] one or more of the subsequent GNNs may receive a set of contextual embeddings associated with the text prompt “oil painting of cat”).
Regarding claim 13, Saharia discloses displaying the label information to a user ([0194] a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client.);
acquiring the label information modified by the user ([0082] likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of images); and
acquiring a modified target asset set based on the modified label information ([0068] the GNNs 120 can learn these transformations and associate them with respective text modifiers included in text prompts), and
generating a modified target three-dimensional scene based on the modified target asset set ([0069] post-processor 130 may perform analysis on the output image 106 such as image classification and/or image quality analysis).
Regarding claim 14, Saharia discloses displaying the target three-dimensional scene to the user ([0070] final image 108 depicts the scene described by the text prompt 102 and is output by the system 100 with a final resolution.); and
generating the modified target three-dimensional scene based on a modification instruction of the user ([0164] the image generation system 101 shown in FIG. 6A can generate images from noise which amounts to changing the conditioning input into the image generation system 100 of FIG. 1A, e.g., replacing text prompts 102 with noise inputs).
Regarding claim 15, Saharia discloses a non-transitory computer readable storage medium with computer instructions stored thereon ([0181] computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them), wherein the computer instructions are used for causing a computer to perform a method for generating 3D scene based on large language model (LLM), the method for generating 3D scene based on large language model ([0052] image generation system that combines the power of text encoder neural networks (e.g., large language models (LLMs)) with a sequence of generative neural networks (e.g., diffusion-based models) to deliver text-to-image generation with a high degree of photorealism, fidelity, and deep language understanding), comprising:
processing description information of a target three-dimensional scene to obtain label information in the description information ([0064] text prompt can be a text sequence that includes multiple text tokens in a natural language);
generating query operation prompt of the LLM based on the label information ([0065] text encoder is configured to process the text prompt to generate a set of contextual embeddings (u) of the text prompt), and
acquiring a target asset set matched with the label information by the LLM based on the query operation prompt,
the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset; and
generating the target three-dimensional scene based on the target asset set ([0070] final image depicts the scene described by the text prompt and is output by the system with a final resolution).
Saharia does not explicitly disclose acquiring the target asset set, with the recited material and scene attribute information, by the LLM based on the query operation prompt.
However, Singh discloses acquiring a target asset set matched with the label information by the LLM based on the query operation prompt ([0038] generate an image data structure 110 as a representation of the scene, object(s), and/or environment. For example, the image data structure 110 can be a data structure that can be queried to retrieve one or more 2D or 3D portions of the representation),
the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset and target scene attribute information of the target asset ([0038] scene representation can be generated and/or updated using at least one of real or synthetic image data, such as image data captured using image capture devices in a physical/real-world environment, or synthetic image data generated to represent virtual or simulated environments.).
Saharia and Singh are combinable because they are from the same field of endeavor.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the image generation system of Saharia to include acquiring a target asset set matched with the label information by the LLM based on the query operation prompt, the target asset set comprising a target asset in the target three-dimensional scene, target material information of the target asset, and target scene attribute information of the target asset, as described by Singh.
The motivation for doing so would have been to improve synthetic image generation, including by using diffusion models to update or supplement (e.g., inpaint) neural radiance field (NeRF) representations of 3D environments (Singh, [0003]).
Therefore, it would have been obvious to combine Saharia and Singh to obtain the invention as specified in claim 15.
Regarding claim 16, Saharia discloses wherein the processing description information of a target three-dimensional scene to obtain label information in the description information comprises:
generating extraction operation prompt of the LLM based on the description information, and processing the description information by the LLM based on the extraction operation prompt to obtain the label information ([0074] system receives an input text prompt including a sequence of text tokens in a natural language).
Regarding claim 17, Saharia discloses wherein the acquiring a target asset set matched with the label information by the LLM based on the query operation prompt comprises:
matching the label information with a plurality of pieces of pre-recorded candidate information by the LLM based on the query operation prompt to obtain target information of the target asset set ([0082] A GNN 120 facilitates a likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of image); and
acquiring the target asset set based on the target information ([0082] latent spaces can also provide the GNNs a means of combining, mixing, and compressing information from different images such that the sequence can generate new instances of images that are ostensibly unlike anything appearing in the training sets).
Regarding claim 18, Saharia discloses wherein the acquiring the target asset set based on the target information comprises:
acquiring the target asset set in a user-customized local asset library based on the target information ([0064] the system 100 can generate various different types of images such as three-dimensional (3D) images, photorealistic images, cartoon images, abstract visualizations, point cloud images, medical images of different modalities, among others).
Regarding claim 19, Saharia discloses wherein the generating the target three-dimensional scene based on the target asset set comprises:
generating an initial three-dimensional scene based on the target asset set ([0066] initial GNN may receive a set of contextual embeddings associated with the text prompt “photograph of cat”); and
adjusting the initial three-dimensional scene based on scene function information in the label information to generate the target three-dimensional scene ([0066] one or more of the subsequent GNNs may receive a set of contextual embeddings associated with the text prompt “oil painting of cat”).
Regarding claim 20, Saharia discloses displaying the label information to a user ([0194] a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client.);
acquiring the label information modified by the user ([0082] likelihood parametrization by modelling intermediate distributions over latent representations z of images x, a.k.a., embeddings, encodings, or “labels” of images); and
acquiring a modified target asset set based on the modified label information ([0068] the GNNs 120 can learn these transformations and associate them with respective text modifiers included in text prompts), and
generating a modified target three-dimensional scene based on the modified target asset set ([0069] post-processor 130 may perform analysis on the output image 106 such as image classification and/or image quality analysis).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHIVANG I PATEL whose telephone number is (571)272-8964. The examiner can normally be reached Monday through Friday, 9:00 am to 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alicia Harrington, can be reached at (571) 272-2330. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SHIVANG I PATEL/Primary Examiner, Art Unit 2615