Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 2/27/26 has been entered.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-25 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Applicant’s amended independent claims recite “wherein the three-dimensional mesh for objects and the associated textures for the objects are divided into three categories of decreasing fidelity: foreground objects having high-resolution textures and meshes, mid-ground objects having lower-quality textures and meshes, and background objects having a skybox-like mesh and a low quality texture relative to foreground and [mid-ground] objects”. The three categories of “fidelity” include a first category defined as having high “resolution” textures and meshes, whereas the second and third categories are defined as having relatively lower “quality”, and the relation between “fidelity”, “resolution”, and “quality” is undefined. That is, it is not clear whether the terms are intended to be used synonymously, such that the level of fidelity is a level of resolution which is also a level of quality, or whether the terms are to be interpreted more broadly, such that the categories having different quality levels do not necessarily vary in resolution of the textures and meshes. Therefore the relation between the three categories is indefinite. Dependent claims are rejected under the same rationale.
For purposes of applying prior art, the claim will be interpreted as using the same term, i.e. “three categories of decreasing resolution: foreground objects having high-resolution textures and meshes, mid-ground objects having lower resolution textures and meshes relative to foreground objects, and background objects having a skybox-like mesh and a low resolution relative to foreground and [mid-ground] objects”.
Applicant’s amended independent claims recite “background objects having a skybox-like mesh”. Applicant’s disclosure does not use the term “skybox-like mesh” or otherwise suggest the scope of a “skybox-like mesh”. Applicant’s disclosure, e.g. paragraph 114, suggests that a skybox may be a two-dimensional image positioned within a three-dimensional environment, or “simplified three-dimensional models may be used”, but neither of these examples clarifies the scope of “skybox-like mesh”, a term that is both relative and subjective even before attempting to identify a definite definition of a skybox or of what is “skybox-like”. Therefore the scope of “skybox-like mesh” is indefinite. Dependent claims are rejected under the same rationale.
For purposes of applying prior art, Applicant’s claims will be interpreted as reciting a textured skybox surface, i.e. “background objects represented as a skybox having a low resolution texture relative to foreground and [mid-ground] objects”.
Applicant’s amended independent claims recite “background objects having a skybox-like mesh and a low quality texture relative to foreground and background objects” (emphasis added). Though it is apparent Applicant’s intent was to recite relative to mid-ground objects, the limitation as recited is indefinite because it is self-referential, i.e. it is indefinite to define the background objects as having a low quality/resolution texture relative to themselves. Dependent claims are rejected under the same rationale.
For purposes of applying prior art, the claims will be interpreted as reciting “relative to foreground and mid-ground objects”.
Regarding claims 7, 13, and 20, the limitation “receiving each of the many options and selecting a result from each of the options generated by each of the swarm agents that best matches the text-based prompt and the one or more images” is indefinite. The outputs are produced by separate processes on each of the three computers, yet the claim requires selecting a single result that best matches the first text-based prompt and one or more images as well as the second text-based prompt. Applicant’s disclosure only describes examples of selecting separately from the results of each computer, i.e. selecting a best result from the first computer’s 2D image results, separately selecting a best result from the second computer’s 3D environment results, and separately selecting a best result from the third computer’s video results. As there is no disclosure clarifying the selection requirement, an alternative limitation that is both definite and supported by the disclosure is not apparent to the examiner, and therefore no prior art rejection is mapped at this time.
Claim 25 recites that “the background objects have some elements of depth differentiating from scaling of the foreground objects and the mid-ground objects by including at least one of a sky, a building, a tree or a landmark in the background”. The phrase “elements of depth differentiating from the scaling of the foreground objects and the mid-ground objects” is indefinite, i.e. depth and scale are not directly comparable attributes, depth being a measure of distance from the viewpoint, and scale being a measure of a relative size of an object compared to a reference measurement or object. Furthermore, the apparent corresponding support from Applicant’s disclosure does not clarify this issue, i.e. Applicant’s disclosure, paragraph 69 indicates that the background objects “may be simply a skybox or may also have some elements of depth differentiating from, for example, sky, building or trees, landmarks, or the like in the background” (emphasis added), and Applicant’s independent claim 1 recites that the background objects are the skybox representation, i.e. “background objects having a skybox-like mesh”, which is an alternative to the background objects represented as “elements of depth differentiating from” other objects in the background. Therefore the scope of the claimed background objects as further defined in claim 25 is indefinite.
For purposes of applying prior art, the claim will be interpreted in correspondence with the paragraph 69 disclosure, i.e. “the background objects include a plurality of elements at different depths from each other, and including at least one of a sky, a building, a tree or a landmark in the background”.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 6, 8-10, 12, 14-17, 19, and 21-24 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent Application Publication 2024/0112394 A1 (hereinafter Neal) in view of “Versatile Diffusion: Text, Images and Variations All in One Diffusion Model” by Xingqian Xu, et al. (hereinafter Xu) in view of “Towards Text-guided 3D Scene Composition” by Qihang Zhang, et al. (hereinafter Zhang) in view of “Text-To-4D Dynamic Scene Generation” by Uriel Singer, et al. (hereinafter Singer) in view of "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding" by Thomas Muller, et al. (hereinafter Muller) in view of "Deep Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Synthesis" by Tianchang Shen, et al. (hereinafter Shen) in view of "Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes" by Towaki Takikawa, et al. (hereinafter Takikawa) in view of "NeRF++: Analyzing and Improving Neural Radiance Fields" by Kai Zhang, Gernot Riegler, et al. (hereinafter Riegler) in view of "Real-Time Neural Rasterization for Large Scenes" by Jeffrey Yunfan Liu, et al. (hereinafter Liu) in view of “Cinematographic Camera Diffusion Model” by Hongda Jiang, et al. (hereinafter Jiang).
Regarding claim 1, the limitations “A system comprising: a first computer for receiving a text-based prompt … of a desired virtual location; and outputting a two-dimensional image representative of the desired virtual location, the two-dimensional image incorporating fixed optical parameters and … at least two objects within the two dimensional image” are taught by Neal (Neal, e.g. abstract, paragraphs 23-64, discloses a system for receiving an input text prompt from a user for generating an immersive volumetric photo or video, including steps 102-104, e.g. paragraphs 27, 29, 30, 52-61, wherein the user provides a text prompt describing a desired visual scene and objects within the scene, and a 2D RGB image is generated depicting the desired visual scene, e.g. paragraph 30, which comprises the indicated object. Further, Neal’s 2D RGB images are not described as having variable optical parameters, i.e. as in paragraph 38, the field of view may be chosen arbitrarily but is not described as varying, which corresponds to a fixed FOV, i.e. fixed optical parameters.)
The limitation “receiving a text-based prompt from a first user along with one or more images of a desired virtual location; and outputting a two-dimensional image representative of the desired virtual location” is not explicitly taught by Neal (Neal, e.g. paragraphs 29, 52-61, describes receiving text prompts describing virtual scenes for generating 2D images of the described scenes, and, as in paragraph 61, suggests that this could include combining a plurality of text prompt generated images into a composited image. While it is noted that this could be considered to teach the claimed limitation, i.e. when the LLM is used to generate a plurality of prompts, images, and composited image(s), a text-based prompt is received from a user and one or more images are received, although not specifically from the user. In the interest of compact prosecution, Xu is cited for suggesting a multi-context blender that uses two types of input prompts to generate images with a versatile-diffusion model as an improvement over a text-only (or image-only) stable-diffusion model.) However, this limitation is taught by Xu (Xu, e.g. abstract, sections 3, 4, describes Versatile Diffusion, an improved diffusion model which handles diffusion tasks in multiple directions, e.g. as in figure 1, text-to-image, image-to-image, image-to-text, text-and-image-to-image, and text-and-multiple-images-to-image tasks are demonstrated using the same VD framework. Xu, e.g. section 4.5, teaches that this is advantageous in that multi-context generation tasks may achieve better results than single-context generation tasks.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Neal’s text-to-3D immersive video system to use Xu’s multi-context VD diffusion framework for generating the non-immersive images in order to allow a user to provide text prompts, image prompts, or both text and image prompts. In Neal’s modified system, Xu’s VD framework would be used in place of the exemplary frameworks in Neal paragraph 30, i.e. Neal contemplates a stable diffusion, i.e. SD, framework, which is the framework improved upon by Xu’s VD framework.
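By way of illustration only, and not as a characterization of any cited reference or of the claims, the text-and-image conditioned generation discussed above may be sketched using the publicly available diffusers implementation of Versatile Diffusion; the model identifier, file names, prompt, and parameter names below are assumptions that may differ across library versions:

    # Illustrative sketch: dual-guided (text + image) generation with the
    # Versatile Diffusion pipeline from the diffusers library (assumptions noted above).
    import torch
    from PIL import Image
    from diffusers import VersatileDiffusionDualGuidedPipeline

    pipe = VersatileDiffusionDualGuidedPipeline.from_pretrained(
        "shi-labs/versatile-diffusion", torch_dtype=torch.float16
    )
    pipe.remove_unused_weights()
    pipe = pipe.to("cuda")

    reference_image = Image.open("desired_location.jpg").convert("RGB")  # hypothetical input image
    prompt = "a coastal village at sunset, cinematic wide shot"          # hypothetical text prompt

    # text_to_image_strength balances the two conditioning contexts (text vs. image).
    result = pipe(prompt=prompt, image=reference_image, text_to_image_strength=0.75).images[0]
    result.save("virtual_location_2d.png")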
The limitations “a second computer for receiving the two-dimensional image … and outputting a three-dimensional environment based on the two-dimensional image … the three-dimensional environment including a three-dimensional mesh for objects within the three-dimensional environment” are taught by Neal (Neal, e.g. paragraphs 33-37, 48, teaches that the 3D immersive images are generated from the 2D non-immersive images by generating a depth map for each non-immersive image, where the depth map corresponds to a mesh of the elements of the immersive image, i.e. as in the example of paragraph 48, the depth map defines a triangle mesh which can be used to render display images from the 3D immersive image according to a user point of view.)
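As a minimal sketch of the depth-map-to-mesh relationship discussed above (illustrative only, not code from Neal; the camera intrinsics and names are hypothetical), a per-pixel depth map can be back-projected into a grid of vertices and triangulated:

    import numpy as np

    def depth_map_to_mesh(depth, fx, fy, cx, cy):
        """Back-project a depth map (H x W, meters) into vertices and a triangle index list."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - cx) / fx * depth
        y = (v - cy) / fy * depth
        vertices = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

        # Two triangles per pixel quad, indexing the row-major vertex grid.
        idx = np.arange(h * w).reshape(h, w)
        a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
        c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
        triangles = np.concatenate([np.stack([a, b, c], 1), np.stack([b, d, c], 1)], axis=0)
        return vertices, triangles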
The limitations “a third computer for receiving the three-dimensional environment and a second text-based prompt providing instructions for a particular type of video to be created, wherein the second text-based prompt is different than the first text-based prompt … and outputting a video consistent with the desired virtual location and captured within the three-dimensional environment” are taught by Neal (Neal, e.g. paragraphs 29, 30, 32, 33, teaches that in addition to generating 3D immersive photos from text prompts, the system may generate a series of 3D immersive frames making up a video, by performing the same process performed with respect to the single 3D immersive photo for a series of 2D RGB frames, i.e. the text-based prompt, and/or a later/second text-based prompt as in paragraph 63, may be received for specifying the content and actions performed by objects within a video captured within the described environment, corresponding to the claimed video.)
The limitations “outputting a two-dimensional image … the two-dimensional image incorporating fixed optical parameters and associated metadata identifying at least two objects within the two dimensional image; a second computer for receiving the two-dimensional image and the associated metadata; and outputting a three-dimensional environment based on the two-dimensional image and the associated metadata, the three-dimensional environment including a three-dimensional mesh for objects within the three-dimensional environment, associated textures for the objects and an associated lighting map for the three-dimensional environment … a third computer for receiving the three-dimensional environment and a second text-based prompt providing instructions for a particular type of video to be created … and outputting a video consistent with the desired virtual location and captured within the three-dimensional environment” are partially taught by Neal (As discussed above, Neal’s system receives text prompt(s) from the user, which are used to generate one or more 2D RGB images of the desired virtual scene/video, which are in turn used to generate one or more 3D immersive images corresponding to the claimed 3D environment/mesh, which are used to render the frames of the video corresponding to the user’s text prompt(s). While Neal’s 2D RGB images comprise the objects, and are arguably associated with metadata, i.e. the input prompts, identifying the objects, and Neal’s 3D immersive images have depth maps defining a triangle mesh which has texture content for the objects based on the immersive image, Neal’s 2D RGB images do not actually include object metadata, and Neal’s 3D immersive images do not include separate textures for the objects, or a lighting map for the environment.) However, these limitations are taught by Zhang in view of Singer (Zhang, e.g. abstract, sections 3, 4, figures 1, 2, describes the SceneWiz3D system for text prompt guided 3D scene composition, which operates by receiving a text prompt from the user for generating a hybrid representation of the scene comprising a NeRF representation of the background environment and DMTet mesh objects of interest positioned therein, e.g. section 3.3. The 3D objects and NeRF environment are generated based on a text-based prompt from a user describing a desired virtual scene and objects within the scene, e.g. sections 3, 3.2, 3.3, 3.3.1, 3.3.2, with an optimization process that alternately optimizes the object and environment representations, and their relative configuration, resulting in scenes corresponding to the input text prompt that have plausible configurations, as shown in the examples of figure 3. Further, Zhang’s 3D objects are represented using a textured mesh, e.g. section 3.2, paragraphs 2, 3, describing extracting the textured mesh representation, which is a triangular mesh having an occupancy mask with the color of each point, which corresponds to a textured mesh, e.g. as shown in the examples of paragraph 6 where the objects have consistent textures. Finally, Zhang’s NeRF representation of the environment corresponds to the claimed lighting map, e.g. section 3.2, paragraph 4, indicating that the NeRF representation is a volumetric radiance field which can be used to render complex lighting effects.
It is additionally noted that Zhang’s hybrid representation is analogous to Neal’s suggested foreground/background separate representation as in paragraphs 47, 50, where the foreground and background portions of the scene are separated, supporting an augmented reality display mode. With respect to Neal’s disclosure of generating an immersive video using a plurality of frames as an alternative to generating an immersive photo, although Zhang does not address extension of the SceneWiz3D system to a 4th/temporal dimension, Singer teaches a technique for extending text-to-3D systems to the 4th/temporal dimension, e.g. abstract, sections 1, 3, 4. Singer, e.g. section 1, paragraphs 4-6, teaches that the system uses a dynamic NeRF for scene representation, and that the system has a first stage using a text prompt input to fit a static 3D scene, analogous to Zhang’s SceneWiz3D, and subsequently adds dynamics by extending the 3D scene optimization to 4D, e.g. sections 3, 3.1, 3.2. Further, Zhang uses rendered scene images for guiding the optimization process, and Singer, e.g. section 3.1, paragraphs 4, 5, section 3.2, paragraphs 4-6, indicates that the static optimization phase uses only frames from the initial time instant, using rendered scene images from 3 orthogonal directions for the static optimization, followed by the dynamic optimization to extend the model to 4D, such that Singer’s extension is applicable to Zhang’s SceneWiz3D system. It is additionally noted that Singer indicates that the system is built on a similar architecture to DreamFusion, e.g. section 3.1, and Zhang’s system is built on an architecture based on ProlificDreamer, e.g. section 3.3, which is an extension of DreamFusion, indicating compatibility of the techniques.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Neal’s text-to-3D immersive video system, using Xu’s multi-context VD diffusion framework for generating the non-immersive images, to include Zhang’s text-to-3D SceneWiz3D technique as a substitute for Neal’s 3D depth image generation step in order to generate more plausible scene configurations as taught by Zhang, and further to use Singer’s static-to-dynamic 4D extension technique to extend Zhang’s text-to-3D SceneWiz3D technique to a text-to-4D SceneWiz4D technique in order to generate 3D immersive videos for Neal’s system. In the modified system, Neal’s 3D immersive images would be represented using Zhang’s hybrid representation format, where Zhang’s hybrid representation comprises the textured meshes for the objects in Neal’s 2D RGB images, i.e. Neal’s 2D RGB images would be associated with metadata identifying the objects therein as shown in Zhang, figures 1, 2, and a background/environment NeRF representation corresponding to the claimed lighting map. Further, with respect to the second prompt and generating a video using the 3D environment, when Neal’s modified system is used to generate videos with the above noted SceneWiz4D technique, the hybrid representation format would be used to generate dynamic effects over the duration of the video in response to the first and/or second text-based prompt(s), i.e. the 3D object(s) would change location, scaling, and/or rotation over time, as in Singer’s examples in figure 2.
The limitations “wherein the three-dimensional mesh for objects and the associated textures for the objects are divided into three categories of decreasing fidelity: foreground objects having high-resolution textures and meshes, mid-ground objects having lower resolution textures and meshes relative to foreground objects,” are partially taught by Neal in view of Zhang (Neal, e.g. paragraphs 29, 47, 59, teaches that the user can specify relative locations in the scene for the objects, i.e. both foreground and background are mentioned explicitly, and with relative positioning of a sufficient number of objects in the scene spread sufficiently along a depth direction, at least some of these objects would correspond to the claimed mid-ground between foreground object(s) and background object(s). Zhang, section 3.2, paragraph 3, teaches that the objects of interest, i.e. the claimed foreground and mid-ground objects in Neal’s modified system, are modeled using a DMTet for each object containing two networks which use hashing-based encoding to model the geometry and color separately, where marching tetrahedra is used to extract a triangular mesh from the DMTet prediction of SDF. Zhang cites Muller, i.e. reference 31, for disclosing the hashing-based encoding technique used in Zhang’s system, and cites Shen, i.e. reference 45, for disclosing the DMTet system, short for Deep Marching Tetrahedra. That is, one of ordinary skill in the art implementing the modification of Neal’s system to substitute Zhang’s text-to-3D SceneWiz3D technique for Neal’s 3D depth image generation step, and using Singer’s static-to-dynamic 4D extension technique to extend Zhang’s text-to-3D SceneWiz3D technique to a text-to-4D SceneWiz4D technique, would look to the disclosures of Muller and Shen for details of implementing Zhang’s per-object hashing-based encoded DMTet. Muller, e.g. abstract, sections 1-6, describes multiresolution hash encoding for different neural graphics applications, including an application corresponding to Zhang’s predicted SDF based modeling, i.e. Muller, section 5.2, implements multiresolution hash encoding for Takikawa’s Neural Geometric Level of Detail (NGLOD) system, which generates geometric representations of a modeled object at different levels of detail using a learned/predicted SDF representing the object shape. Shen, e.g. abstract, sections 1, 3, describes the details of the DMTet generative model, which predicts an object’s shape using an SDF, i.e. analogous to Muller’s example of section 5.2. Finally, one of ordinary skill in the art would understand that one of the advantages of Muller’s multiresolution hash encoding is the ability to generate results having a desired level of detail between the coarsest and finest resolution representations encoded in the network, e.g. Muller, section 3, indicates that the multiresolution hash encoding networks have L levels between the coarsest and finest resolutions, allowing the same network to generate results at different resolutions, and would further be aware that it is common to select an object’s level of detail for rendering according to the object’s distance to the camera, i.e. the claimed high resolution and relatively lower resolution textures/meshes for objects in the foreground and mid-ground. However, Zhang does not explicitly address using the distance from the camera to control the level of detail at which each per-object hashing-based encoded DMTet generates the geometry and texture for each object.
Further, while Muller, e.g. section 5.2, indicates that Takikawa’s NGLOD system was implemented by Muller using the multiresolution hash encoding technique, Muller does not explicitly indicate how the LOD is selected for generating the geometry using an NGLOD neural network, i.e. as with Zhang, Muller does not explicitly address using the distance from the camera to control the level of detail at which the multiresolution hash encoding neural network generates the geometry and texture for each object.) However, this limitation is taught by Takikawa (Takikawa, e.g. abstract, sections 1, 3-5, describes the NGLOD system, which encodes an implicit 3D shape representation using an SDF, analogous to Shen’s encoded SDF representation as noted above. Takikawa, section 3.4, subsection LOD selection, teaches that the LOD value L can be selected using a depth heuristic having user defined thresholds based on the distance to the object from the camera, i.e. the above noted LOD selection technique which one of ordinary skill in the art would know is common but is not explicitly taught by Zhang or Muller.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Neal’s text-to-3D immersive video system, using Xu’s multi-context VD diffusion framework for generating the non-immersive images, including Zhang’s text-to-3D SceneWiz3D technique as a substitute for Neal’s 3D depth image generation step, using Singer’s static-to-dynamic 4D extension technique to extend Zhang’s text-to-3D SceneWiz3D technique to a text-to-4D SceneWiz4D technique, to use Takikawa’s user defined distance thresholds for LOD selection for Zhang’s per-object hashing-based encoded DMTet to control the level of detail at which each per-object hashing-based encoded DMTet generates the geometry and texture for each object. As noted above, one of ordinary skill in the art implementing the modification of Neal’s system in view of Zhang and Singer would look to the disclosures of Muller and Shen for details of implementing Zhang’s per-object hashing-based encoded DMTets. Further, Zhang, e.g. figure 2, section 3.1, indicates that the optimization relies on perspective rendered images from a given camera viewpoint, such that each object has a distance to the camera which can be used to select a LOD using Takikawa’s user defined distance thresholds. That is, in the modified system, for first object(s) having a distance within a first distance threshold, Zhang’s per-object hashing-based encoded DMTets would use the highest LOD value L, corresponding to the claimed foreground objects generated with high resolution textures and meshes, and for second object(s) having a distance greater than the first distance threshold and less than a second distance threshold, Zhang’s per-object hashing-based encoded DMTets would use the second highest LOD value L, corresponding to the claimed mid-ground objects having lower resolution textures and meshes relative to the first/foreground object(s).
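By way of illustration only, and not as a characterization of Takikawa, Zhang, or the claims, the distance-threshold LOD selection discussed above can be sketched as follows; the threshold values, level count, and names are hypothetical:

    def select_lod(distance_to_camera, thresholds=(5.0, 20.0), num_levels=4):
        """Map camera distance to a level of detail: nearer objects get finer levels."""
        near, far = thresholds
        if distance_to_camera <= near:    # foreground: finest meshes/textures
            return num_levels - 1
        if distance_to_camera <= far:     # mid-ground: one level coarser
            return num_levels - 2
        return 0                          # beyond the far threshold: coarsest level

    print(select_lod(2.0), select_lod(12.0), select_lod(80.0))   # prints: 3 2 0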
The limitations “wherein the three-dimensional mesh for objects and the associated textures for the objects are divided into three categories of decreasing fidelity: foreground objects having high-resolution textures and meshes, mid-ground objects having lower resolution textures and meshes relative to foreground objects, and background objects represented as a [background] having a low resolution texture relative to foreground and mid-ground objects” are partially taught by Neal in view of Zhang, Muller, Shen, and Takikawa (As discussed above, in the modified system, for first object(s) having a distance within a first distance threshold, Zhang’s per-object hashing-based encoded DMTets would use the highest LOD value L, corresponding to the claimed foreground objects generated with high resolution textures and meshes, and for second object(s) having a distance greater than the first distance threshold and less than a second distance threshold, Zhang’s per-object hashing-based encoded DMTets would use the second highest LOD value L, corresponding to the claimed mid-ground objects having lower resolution textures and meshes relative to the first/foreground object(s). Further, as discussed above, Zhang, section 3.2, paragraph 4, teaches that the background/environment is modeled using a NeRF comprising the claimed lighting map for the environment, and represents background objects in the scene, e.g. in the examples of figure 7, the football and basketball courts include a plurality of background objects such as hoops, seating, lights. Zhang does not explicitly teach that the background objects represented in the NeRF model are represented with lower resolution texture relative to the foreground and mid-ground objects modeled using the per-object hashing-based encoded DMTets, but Zhang does indicate that the NeRF representation can accommodate both bounded and unbounded scenes, citing Riegler, i.e. reference 55, as an example. Further, as discussed above, Neal, e.g. paragraphs 29, 47, 59, teaches that the user can specify relative locations in the scene for the objects, wherein paragraph 47 discusses generating a background layer comprising background objects, i.e. the scene may include foreground objects, midground objects, and background objects corresponding to a background layer. While Zhang does not address using a NeRF model having a foreground and a background modeled with different resolutions, Riegler discloses an NeRF model for unbounded scene representations including a foreground model and a background model having a lower resolution representation relative to the foreground model.) However, this limitation is taught by Riegler (Riegler, e.g. abstract, sections 1-6, describes NeRF++, an improved NeRF model for representing unbounded scenes. Riegler, e.g. section 1, paragraphs 3-6, section 4, uses two separate NeRFs to model an inner unit sphere corresponding to a foreground volume, and an outer volume covering the complement to the inner volume corresponding to a background volume, and rendering is performed by compositing foreground samples and background samples along the ray. Riegler, e.g. section 4, paragraph 4, indicates that the objects represented in the background volume have a lower resolution relative to the foreground volume, and, e.g. 
section 1, paragraph 5, that this is advantageous for representing unbounded scenes in comparison to unseparated models, where unseparated models would require limiting the volume to a small portion of the scene for high detail sampling or fitting the full scene into the volume, requiring a lower detail sampling. Additionally, as noted above, Zhang, section 3.2, paragraph 4, cites Riegler as an exemplary NeRF model, such that one of ordinary skill in the art would also be motivated to use Riegler’s NeRF model for implementing Zhang’s background/environment NeRF modeling for unbounded scenes, as in the football and basketball court examples of Zhang, figure 7.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Neal’s text-to-3D immersive video system, using Xu’s multi-context VD diffusion framework for generating the non-immersive images, including Zhang’s text-to-3D SceneWiz3D technique as a substitute for Neal’s 3D depth image generation step, using Singer’s static-to-dynamic 4D extension technique to extend Zhang’s text-to-3D SceneWiz3D technique to a text-to-4D SceneWiz4D technique, using Takikawa’s user defined distance thresholds for LOD selection for Zhang’s per-object hashing-based encoded DMTet, to use Riegler’s NeRF++ model for implementing Zhang’s background/environment NeRF modeling for unbounded scenes because Zhang cites Riegler as an exemplary NeRF model which can be used for modeling unbounded scenes, and because Riegler, e.g. section 1, paragraph 5, section 4, paragraph 4, indicates representing the background volume in lower resolution is advantageous for representing unbounded scenes in comparison to unseparated models. In Neal’s modified system including Zhang’s text-to-3D SceneWiz3D technique, extended in view of Singer to the text-to-4D SceneWiz4D technique, using Riegler’s NeRF++ model for implementing Zhang’s background/environment NeRF modeling for unbounded scenes, the background NeRF model component would represent objects in the background volume of the scene with a lower resolution relative to the foreground volume, and foreground objects, corresponding to the claimed background objects represented as a [background] having a low resolution texture relative to foreground and mid-ground objects.
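As a minimal sketch of the foreground/background compositing described for the two-model arrangement above (illustrative only; the routine below is generic front-to-back volume-rendering alpha compositing, not code from Riegler, and the sample values are hypothetical):

    import numpy as np

    def composite_fg_bg(fg_colors, fg_alphas, bg_colors, bg_alphas):
        """Alpha-composite foreground ray samples, then background samples, front to back."""
        color = np.zeros(3)
        transmittance = 1.0
        for c, a in list(zip(fg_colors, fg_alphas)) + list(zip(bg_colors, bg_alphas)):
            color += transmittance * a * np.asarray(c, dtype=float)
            transmittance *= (1.0 - a)
        return color  # background contributes only where the foreground volume is not opaque

    # Toy example: a mostly transparent red foreground sample over an opaque blue background sample.
    print(composite_fg_bg([[1, 0, 0]], [0.2], [[0, 0, 1]], [1.0]))   # -> [0.2 0.  0.8]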
The limitation “wherein the three-dimensional mesh for objects and the associated textures for the objects are divided into three categories of decreasing fidelity: … background objects represented as a skybox having a low resolution texture relative to foreground and mid-ground objects” is partially taught by Neal in view of Zhang and Riegler (As discussed in the above modification in view of Riegler, in Neal’s modified system including Zhang’s text-to-3D SceneWiz3D technique, extended in view of Singer to the text-to-4D SceneWiz4D technique, using Riegler’s NeRF++ model for implementing Zhang’s background/environment NeRF modeling for unbounded scenes, the background NeRF model component would represent objects in the background volume of the scene with a lower resolution relative to the foreground volume and foreground objects, corresponding to the claimed background objects represented as a [background] having a low resolution texture relative to foreground and mid-ground objects. It is noted that Riegler’s background NeRF model is related to the concept of multi-sphere images, wherein scene representation consists of nested concentric spheres sampled according to inverse depth, e.g. section 4, final paragraph. While Riegler’s background NeRF model models background objects with a lower resolution than the foreground NeRF model, Riegler’s background NeRF model is not a skybox model, as claimed.) However, this limitation is taught by Liu (Liu, e.g. abstract, sections 1, 3-5, describes a system for neural rasterization for large scale scenes, where Liu, e.g. section 3.1, uses separate models for the foreground region and the background regions. Liu, section 3.1, paragraph 3, indicates that the background representation is inspired by the multi-sphere image representation cited by Riegler, i.e. Liu’s reference 4 is Riegler’s Attal, et al., and that the neural skyboxes can be integrated into existing pipelines to enable efficient rendering. Further, Liu, section 3.2, paragraph 4, teaches that the neural skyboxes are composited with the foreground content, analogous to Riegler, section 4, paragraph 5, indicating the sampling ray is a compositing of foreground samples and background samples, and Liu, section 3.3, indicates that the skybox neural networks are optimized using perceptual and photometric losses based on images rendered using the hybrid representation including the neural skyboxes, analogous to Zhang, e.g. section 3.3.2, optimizing the hybrid representation including the background/environment NeRF using perceptual losses based on images rendered using the hybrid representation. Finally, Liu, sections 4.2, 4.3, teaches that NeuRas can increase performance for popular NeRF approaches, i.e. one of ordinary skill in the art would be motivated to use Liu’s neural skyboxes to improve performance of prior art NeRF models.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Neal’s text-to-3D immersive video system, using Xu’s multi-context VD diffusion framework for generating the non-immersive images, including Zhang’s text-to-3D SceneWiz3D technique as a substitute for Neal’s 3D depth image generation step, using Singer’s static-to-dynamic 4D extension technique to extend Zhang’s text-to-3D SceneWiz3D technique to a text-to-4D SceneWiz4D technique, using Takikawa’s user defined distance thresholds for LOD selection for Zhang’s per-object hashing-based encoded DMTet, using Riegler’s NeRF++ model for implementing Zhang’s background/environment NeRF modeling for unbounded scenes, to substitute Liu’s neural skyboxes for Riegler’s background NeRF model because Liu teaches that neural skyboxes can be integrated into existing pipelines to enable efficient rendering and can increase performance for popular NeRF approaches, and because Riegler and Liu teach that the background NeRF model and neural skyboxes are related concepts for scene representation according to the multi-sphere image technique. In Neal’s modified system including Zhang’s text-to-3D SceneWiz3D technique, extended in view of Singer to the text-to-4D SceneWiz4D technique, using Riegler’s NeRF++ model for implementing Zhang’s background/environment NeRF modeling for unbounded scenes, substituting Liu’s neural skyboxes for Riegler’s background NeRF model, Liu’s neural skyboxes would represent objects in the background volume of the scene with a lower resolution relative to the foreground volume, corresponding to the claimed background objects represented as a skybox having a low resolution texture relative to foreground and mid-ground objects.
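Purely as an illustration of the skybox concept discussed above (not code from Liu or Riegler; the equirectangular lookup below is a generic skybox technique, and the 64x128 texture resolution is a hypothetical value standing in for a background far coarser than the foreground object textures):

    import numpy as np

    def sample_skybox(direction, skybox_texture):
        """Look up a low-resolution equirectangular skybox texture by view-ray direction."""
        d = np.asarray(direction, dtype=float)
        d = d / np.linalg.norm(d)
        h, w, _ = skybox_texture.shape
        u = (np.arctan2(d[0], -d[2]) / (2 * np.pi) + 0.5) * (w - 1)   # azimuth -> column
        v = (np.arccos(np.clip(d[1], -1.0, 1.0)) / np.pi) * (h - 1)   # elevation -> row
        return skybox_texture[int(round(v)), int(round(u))]

    skybox = np.random.rand(64, 128, 3)              # hypothetical low-resolution background texture
    print(sample_skybox([0.0, 0.3, -1.0], skybox))   # color seen along a slightly upward view ray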
The limitations “wherein the second text-based prompt … describes a path through which a virtual camera should fly within the three-dimensional environment; and outputting a video consistent with the desired virtual location and captured within the three-dimensional environment with a virtual camera moving along the path according to the second text-based prompt” are partially taught by Neal in view of Zhang and Singer (As discussed above, when Neal’s modified system is used to generate videos with the above noted SceneWiz4D technique, the hybrid representation format would be used to generate dynamic effects over the duration of the video in response to the first and/or second text-based prompt(s), i.e. the 3D object(s) would change location, scaling, and/or rotation over time, as in Singer’s examples in figure 2. Neal, e.g. paragraphs 28, 49, indicates that the videos may be generated by rendering video with 6DOF, by simulating movement of a virtual camera through a time series of different locations and orientations, and similarly Singer, e.g. section 3, paragraph 2, section 3.2, paragraph 2, indicates that the SceneWiz4D technique uses a camera trajectory C to control the path of the virtual camera. One of ordinary skill in the art would understand that the virtual camera trajectory could be defined using positions within the 3D environment, e.g. Zhang, figures 6, 7, show images generated using virtual cameras placed inside the 3D environment. Further, while one of ordinary skill in the art would have found it implicit that the virtual camera trajectory could be defined using the first and/or second text-based prompts, i.e. Neal, e.g. paragraphs 60-63, indicates that the first and/or second text-based prompts are used to generate the immersive video without requiring more than a vague description from the user, implicitly teaching that the time series of locations and orientations for the virtual camera are defined based on the prompt(s), Neal does not explicitly state that the virtual camera trajectory is generated based on a virtual camera path description received in the prompt(s). Singer indicates that the camera trajectory C is given as input, but uses random trajectories in training, e.g. section A.3.) However, this limitation is taught by Jiang (Jiang, e.g. abstract, sections 1, 3-6, describes a cinematographic camera diffusion model used for generating virtual camera trajectories based on descriptions received from text prompts. Jiang, e.g. section 4.1, teaches that the output is a sequence of camera poses having different positions and orientations defined relative to a character in the scene, i.e. analogous to Neal and Singer, the virtual camera trajectory is a time series of camera locations/orientations in the 3D scene. Further, Jiang, e.g. section 5.3.1, figure 4, teaches that the initial trajectory from a text prompt such as a script can be iteratively improved by adding descriptive information, analogous to Neal, e.g. paragraph 63, describing the use of second text prompt(s) following the initial text prompt for iteratively updating the immersive video.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Neal’s text-to-3D immersive video system, using Xu’s multi-context VD diffusion framework for generating the non-immersive images, including Zhang’s text-to-3D SceneWiz3D technique as a substitute for Neal’s 3D depth image generation step, using Singer’s static-to-dynamic 4D extension technique to extend Zhang’s text-to-3D SceneWiz3D technique to a text-to-4D SceneWiz4D technique, using Takikawa’s user defined distance thresholds for LOD selection for Zhang’s per-object hashing-based encoded DMTet, using Riegler’s NeRF++ model for implementing Zhang’s background/environment NeRF modeling for unbounded scenes, substituting Liu’s neural skyboxes for Riegler’s background NeRF model, to include Jiang’s cinematographic camera diffusion model for generating virtual camera trajectories using the first and/or second text prompts because, as noted above, one of ordinary skill in the art would have found it implicit that the virtual camera trajectory could be defined using the first and/or second text-based prompts as in Neal, paragraphs 61-63, and because Jiang’s model allows for generating quality camera trajectories with reduced user effort and/or expertise, e.g. Jiang, sections 5, 6. It is noted that Zhang, e.g. figures 6, 7, shows exemplary 3D scenes in which the 3D object(s) are human characters. In the modified system, when the 3D object(s) in the generated 3D scene include character(s), and the first and/or second text prompts include description of a virtual camera trajectory relative to the character(s), Jiang’s model would be used to generate the virtual camera trajectory as in section 4.1, which would be used for controlling the camera viewpoint trajectory of the resulting 4D scene generated by the SceneWiz4D technique, as in Singer section 3.2, paragraph 2, and analogous to Neal, paragraph 49. Finally, as noted above, one of ordinary skill in the art would understand that the virtual camera trajectory can include positions within the 3D environment, as in the examples of Zhang, figures 6, 7.
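As an illustrative sketch of a camera trajectory expressed as a time series of positions and orientations, as discussed above (the waypoint values, target, and function names are hypothetical and are not drawn from Neal, Singer, or Jiang):

    import numpy as np

    def fly_through_path(waypoints, look_at, num_frames):
        """Interpolate camera positions along waypoints; aim each pose at a fixed target."""
        waypoints = np.asarray(waypoints, dtype=float)
        t = np.linspace(0.0, len(waypoints) - 1, num_frames)
        poses = []
        for ti in t:
            i = min(int(ti), len(waypoints) - 2)
            frac = ti - i
            position = (1 - frac) * waypoints[i] + frac * waypoints[i + 1]   # linear interpolation
            forward = look_at - position
            forward /= np.linalg.norm(forward)
            poses.append((position, forward))   # one (location, orientation) sample per frame
        return poses

    # Example: a short path that pushes in toward a character standing near the origin.
    path = fly_through_path([[0, 1.7, 8], [2, 1.7, 4], [0, 1.7, 2]], np.array([0.0, 1.5, 0.0]), 5)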
Regarding claim 2, the limitation “wherein at least two of the first, second, and third computers are one computer” is taught by Neal (Neal, e.g. paragraphs 65-74, describes the computing device(s) used to implement the system, which could include a single computing device, i.e. wherein all three of the computers are the same one computer.)
Regarding claim 3, the limitations “wherein the second computer is further for: generating certain objects as foreground objects at an appropriate foreground scale that is scaled to an average human size within the three-dimensional environment; generating other objects as mid-ground objects at an appropriate mid-ground scale relative to the foreground scale of the foreground objects; and generating still other objects as background objects at an appropriate background scale from a perspective on the three-dimensional environment sufficient to provide realistic depth to a scene shown in the three-dimensional environment” are taught by Neal in view of Zhang, Takikawa, Riegler, and Liu (As discussed in the claim 1 rejection above, Neal, e.g. paragraphs 29, 47, 59, teaches that the user can specify relative locations in the scene for the objects, i.e. both foreground and background are mentioned explicitly, and with relative positioning of a sufficient number of objects in the scene spread sufficiently along a depth direction, at least some of these objects would correspond to the claimed mid-ground between foreground object(s) and background object(s). Further, as discussed in the claim 1 rejection above, in the modified system, first object(s) having a distance within a first distance threshold correspond to the claimed foreground objects represented with high resolution textures and meshes generated using the highest LOD value with Zhang’s per-object hashing-based encoded DMTets, and second object(s) having a distance greater than the first distance threshold and less than a second distance threshold represented with lower resolution textures and meshes using the second highest LOD value with Zhang’s per-object hashing-based encoded DMTets, and the background objects in the background volume would be represented using Liu’s neural skyboxes substituted for Riegler’s background NeRF model at a lower resolution than the foreground and mid-ground objects. Zhang’s SceneWiz3D (and by extension, the SceneWiz4D technique discussed in the claim 1 modification) composes the objects in the scene at appropriate scale, e.g. section 3.3.1, figure 3, including human objects, i.e. the objects in the scene are all scaled to an appropriate size relative to an average human size as claimed. When the scene is rendered from the perspective viewpoint, the foreground objects will be scaled to the average human size, as with the exemplary scenes in Zhang including humans in the foreground, the mid-ground objects will be at an appropriate mid-ground scale relative to the foreground objects, and the neural skyboxes comprising the background objects will be at the appropriate background scale, providing a realistic depth to the scene, as in Zhang’s examples including large scale scenes, such as the football and basketball courts of figure 7.)
Regarding claim 4, the limitation “wherein the second computer is further for generating a perspective on the three-dimensional environment including at least a one-hundred-and-eighty-degree view of the three-dimensional environment” is taught by Neal in view of Zhang (Neal, e.g. paragraphs 26, 38, teaches that the 3D immersive image may be a 180 degree field of view image, as well as that the resulting 3D immersive image or video may be stored for later playback, e.g. paragraphs 48-51. Further, Zhang teaches that the hybrid format may be rendered using a panoramic field of view, e.g. figure 2, section 3.3.2, although Zhang renders the RGBD panoramic images without objects for guiding the environment modeling rather than for display.)
Regarding claim 6, the limitation “wherein each of the first computer, the second computer, and the third computer utilize swarm agents to repeatedly perform each output process a plurality of times to generate many options” is taught by Neal in view of Zhang and Singer (Neal, e.g. paragraph 63, indicates that receiving the text prompt and performing the text-to-image operations may be performed repeatedly to update and modify the scene description used to generate the image, which corresponds to repeatedly performing the first computer’s output process to generate a plurality of, i.e. many, options. Further, Zhang’s SceneWiz3D, and by extension, the SceneWiz4D technique discussed in the claim 1 modification, corresponding to the output steps of the second and third computer(s), utilizes a particle swarm optimization technique, e.g. Zhang, section 3.3.1, figure 2, i.e. both the static optimization phase corresponding to the second computer’s output, and the dynamic optimization phase corresponding to the third computer’s output, utilize particle swarm optimization, corresponding to the claimed swarm agents repeatedly performing the output processes to generate a plurality of, i.e. many, options used to guide the optimization.)
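A generic particle swarm optimization loop is sketched below solely to illustrate the swarm-based search discussed above; it is not Zhang’s implementation, and the objective function, bounds, and hyperparameters are hypothetical:

    import numpy as np

    def particle_swarm(objective, dim, num_particles=16, iters=50, w=0.7, c1=1.5, c2=1.5):
        """Minimize objective(x) by iteratively updating a swarm of candidate solutions."""
        rng = np.random.default_rng(0)
        pos = rng.uniform(-1.0, 1.0, (num_particles, dim))
        vel = np.zeros_like(pos)
        best_pos = pos.copy()
        best_val = np.array([objective(p) for p in pos])
        g_best = best_pos[best_val.argmin()].copy()
        for _ in range(iters):
            r1, r2 = rng.random((2, num_particles, dim))
            vel = w * vel + c1 * r1 * (best_pos - pos) + c2 * r2 * (g_best - pos)
            pos = pos + vel
            vals = np.array([objective(p) for p in pos])
            improved = vals < best_val
            best_pos[improved], best_val[improved] = pos[improved], vals[improved]
            g_best = best_pos[best_val.argmin()].copy()
        return g_best, best_val.min()

    # Example: each particle is a candidate configuration scored by an objective function.
    print(particle_swarm(lambda x: float(np.sum(x ** 2)), dim=3))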
Regarding claims 8, 14, and 15, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 1 above.
Regarding claims 9 and 16, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 3 above.
Regarding claims 10 and 17, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 4 above.
Regarding claims 12 and 19, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 6 above.
Regarding claim 21, the limitation “further comprising an external generative AI for outputting the two-dimensional image representative of the desired virtual location, and for outputting the three-dimensional environment” is taught by Neal in view of Xu, Zhang, and Singer (Neal, e.g. paragraphs 52-64, teaches that the system may use an additional generative AI, such as the LLM 300, for receiving the user’s input(s) and based thereon generating input elements provided to the other prompt-based AI elements, e.g. as in paragraph 61 the user input is used to generate prompts for the text-to-image AI model(s), or paragraph 64 where the LLM generates prompts for a text-to-speech AI model. Neal, paragraph 61, indicates that the LLM is further capable of receiving the output of the AI model(s) and combining the results to generate the resulting output video, i.e. the LLM 300 is the claimed external generative AI for outputting the two-dimensional image(s) and outputting the resulting video for the user to view in Neal’s unmodified system. As discussed in the claim 1 rejection above, Neal’s text-to-3D immersive video system is modified using Xu’s multi-context VD diffusion framework for generating the non-immersive images, i.e. instead of the LLM 300 receiving only a text-based input prompt, the user may provide text and images as part of the input prompt. Further, as discussed in the claim 1 rejection above, Neal’s text-to-3D immersive video system is modified to substitute Neal’s 3D depth image generation step with the text-to-4D SceneWiz4D technique by extending Zhang’s text-to-3D SceneWiz3D technique using Singer’s static-to-dynamic 4D extension technique. That is, in Neal’s modified system, the LLM 300, analogous to the description of paragraph 61, would receive the text prompt and images input by the first user, which would be provided to the multi-context VD diffusion framework for generating the output images of the desired virtual environment, which the LLM 300 would provide as input to the text-to-4D SceneWiz4D technique, which outputs the three-dimensional environment based on the two-dimensional images and metadata using Zhang’s text-to-3D SceneWiz3D technique, the result of which serves as input to Singer’s static-to-dynamic optimization used to generate the output video captured within the three-dimensional environment. More succinctly, Neal’s LLM 300 is an external generative AI which receives as input the user input, and the results generated by the generative AI models as discussed in the claim 1 rejection, and provides said user input and generated results as input for generating the next component, as in paragraph 61, as well as the final resulting output video, e.g. paragraph 62.)
Regarding claim 22, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 1 and 21 above, i.e. as discussed in the claim 21 rejection, Neal’s LLM 300 is the claimed external AI which receives the input text prompt and images from the user, and as discussed in the claim 1 rejection above, Neal teaches that the first user input including the text-based prompt is also used to specify the content and actions performed by objects within a video captured within the described environment, i.e. the second text-based prompt can be generated by the LLM 300 using the first text-based prompt. It is additionally noted that, as in paragraph 63, Neal indicates that second/additional inputs for refining the results are optional, i.e. the user may or may not choose to make modifications by providing additional inputs.
Regarding claims 23 and 24, the limitations “wherein the path is an AI-selected path, including changes in position and viewing angles over the course of the path, the path being either a tracking shot moving through the scene within the three-dimensional environment for a period of time following at least one character as they move through the three-dimensional environment or a fly-through shot moving through the three-dimensional environment along the AI-selected path” and “wherein the path follows and AI generative timing and placement of the virtual camera along the path selected by a generative AI” are taught by Neal in view of Jiang (As discussed in the claim 1 rejection above, in the modified system, when the 3D object(s) in the generated 3D scene include character(s), and the first and/or second text prompts include description of a virtual camera trajectory relative to the character(s), Jiang’s model would be used to generate the virtual camera trajectory as in section 4.1, which would be used for controlling the camera viewpoint trajectory of the resulting 4D scene generated by the SceneWiz4D technique, as in Singer section 3.2, paragraph 2, and analogous to Neal, paragraph 49. That is, the trajectory is AI-selected, includes AI generated timing and placement at changing camera positions and angles over the course of the trajectory/path, and is a tracking shot moving through the scene/3D environment for a period of time relative to at least one character. It is additionally noted that Jiang’s generated virtual camera trajectories fly through the 3D scene, e.g. figure 4, showing different trajectories for a camera pushing in toward the character, such that the trajectories are used to generate both types of claimed shots, i.e. the tracking shot and the fly-through shot.)
Regarding claim 25, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 3 above, with the exception of the limitations requiring that “the background objects include a plurality of elements at different depths from each other, and including at least one of a sky, a building, a tree or a landmark in the background”, which are also taught by Neal in view of Zhang and Liu, i.e. as discussed in the claim 1 rejection, Neal, e.g. paragraph 47, indicates the background layer may include objects such as buildings or plants, and Liu, e.g. section 3.1, paragraph 3, section 3.2, paragraph 3, indicates that the neural skyboxes are separated into L layers of different depths from near to far, i.e. the neural skyboxes represent the background objects using a plurality of skybox layers at different depths from each other, as claimed.
Claims 5, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent Application Publication 2024/0112394 A1 (hereinafter Neal) in view of “Versatile Diffusion: Text, Images and Variations All in One Diffusion Model” by Xingqian Xu, et al. (hereinafter Xu) in view of “Towards Text-guided 3D Scene Composition” by Qihang Zhang, et al. (hereinafter Zhang) in view of “Text-To-4D Dynamic Scene Generation” by Uriel Singer, et al. (hereinafter Singer) in view of "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding" by Thomas Muller, et al. (hereinafter Muller) in view of "Deep Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Synthesis" by Tianchang Shen, et al. (hereinafter Shen) in view of "Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes" by Towaki Takikawa, et al. (hereinafter Takikawa) in view of "NeRF++: Analyzing and Improving Neural Radiance Fields" by Kai Zhang, Gernot Riegler, et al. (hereinafter Riegler) in view of "Real-Time Neural Rasterization for Large Scenes" by Jeffrey Yunfan Liu, et al. (hereinafter Liu) in view of “Cinematographic Camera Diffusion Model” by Hongda Jiang, et al. (hereinafter Jiang) as applied to claims 3, 9, and 16 above, and further in view of “GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning” by Jiaxi Lv, et al. (hereinafter Lv).
Regarding claim 5, the limitation “wherein the second computer further generates a selected one of particle effects, fluid effects, weather, or non-rigid environmental effects for the three-dimensional environment” is not explicitly taught by Neal in view of Zhang and Singer (Neal’s modified system, using Zhang’s SceneWiz3D and the extended SceneWiz4D technique discussed in the claim 1 modification, introduces dynamics to the 3D scene composition to generate a video. While this could result in non-rigid deformation effects, e.g. Singer, figure 2, neither reference addresses adding effects to the environment, per se, and therefore in the interest of compact prosecution, Lv is cited for teaching addition of environmental effects.) However, this limitation is taught by Lv (Lv, e.g. abstract, sections 3, 4, describes the GPT4Motion system for adding simulated physical effects to videos generated using Stable Diffusion-based image-to-video techniques using Blender physics simulation and ControlNet(s), e.g. figure 2. Lv teaches that the received text prompts are used, in part, to generate a 3D environment and add physics effect functions to the 3D environment, e.g. section 3.2, where rendered images of the simulated scenes are used as inputs to the ControlNet(s), e.g. section 3.3, resulting in output videos having the effect described in the text prompt, e.g. figures 4-7, showing simulation-based videos with different types of effects corresponding to the claimed effects, i.e. the bouncing basketballs are particle effects, the water pouring is a fluid effect, and the flag and shirt are deformed by the simulated wind, corresponding to both weather and non-rigid environmental effects.)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Neal’s text-to-3D immersive video system, using Xu’s multi-context VD diffusion framework for generating the non-immersive images, including Zhang’s text-to-3D SceneWiz3D technique as a substitute for Neal’s 3D depth image generation step, using Singer’s static-to-dynamic 4D extension technique to extend Zhang’s text-to-3D SceneWiz3D technique to a text-to-4D SceneWiz4D technique, using Takikawa’s user defined distance thresholds for LOD selection for Zhang’s per-object hashing-based encoded DMTet, using Riegler’s NeRF++ model for implementing Zhang’s background/environment NeRF modeling for unbounded scenes, substituting Liu’s neural skyboxes for Riegler’s background NeRF model, using Jiang’s cinematographic camera diffusion model for generating virtual camera trajectories, to include Lv’s GPT4Motion technique to add simulated effects to the 3D/4D scene using physical simulation to produce rendered images used by ControlNet(s) to guide the diffusion based image/video generation. In the modified system, Zhang’s hybrid scene representation corresponds to the 3D scene elements in Lv’s Blender scene as in section 3.2, such that Zhang’s hybrid scene representation would include a third element in addition to the object meshes and environment NeRF, i.e. Lv’s environmental effects, used to control the physical simulation of the objects within the environment, analogous to Lv’s exemplary objects within the Blender scene environment.
Regarding claims 11 and 18, the limitations are similar to those treated in the above rejection(s) and are met by the references as discussed in claim 5 above.
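For illustration of the pipeline structure discussed in the claim 5 rejection above, the following minimal sketch (hypothetical function and variable names throughout; not code from Lv or any other cited reference) shows the general flow in which a text prompt selects an environmental effect, a physics simulation of that effect is rendered, and the rendered frames are used as conditioning inputs to a diffusion-based video generator.

```python
# Illustrative sketch only: prompt -> simulated effect -> conditioned video.
from typing import List

EFFECT_KEYWORDS = {
    "rain": "weather",
    "wind": "non-rigid environmental",
    "water": "fluid",
    "sparks": "particle",
}

def select_effect(prompt: str) -> str:
    """Pick an effect category mentioned in the text prompt (a crude keyword
    match standing in for an LLM-based planning step)."""
    for keyword, category in EFFECT_KEYWORDS.items():
        if keyword in prompt.lower():
            return category
    return "none"

def simulate_effect(category: str, num_frames: int) -> List[str]:
    """Stand-in for a physics simulation and render step; returns identifiers
    for the rendered conditioning frames."""
    return [f"{category}_frame_{i:03d}.png" for i in range(num_frames)]

def generate_video(prompt: str, conditioning_frames: List[str]) -> List[str]:
    """Stand-in for a ControlNet-conditioned diffusion video generator."""
    return [f"output_from_{frame}" for frame in conditioning_frames]

prompt = "a flag waving in strong wind over a city street"
conditioning = simulate_effect(select_effect(prompt), num_frames=8)
video = generate_video(prompt, conditioning)
```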
Response to Arguments
Applicant's arguments filed 2/27/26 have been fully considered but they are not persuasive.
Applicant asserts that claims 7, 13, and 20 have been amended for clearer antecedent basis; however, as discussed in the above 112(b) rejections, the scope of the claims is still indefinite.
In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Applicant’s remarks, e.g. pages 13-14, only consider each reference separately, rather than the combination thereof. Further, as discussed in the above prior art rejections, in view of Muller, Shen, and Riegler as cited by Zhang, the combination of references does include foreground, midground, and background objects represented with decreasing levels of resolution, i.e. the foreground and midground objects, being represented using a combination of Muller’s and Shen’s techniques in Zhang’s system, are innately capable of producing different resolution representations, and the background objects, being represented in the background/environment NeRF, would be represented using a relatively low resolution, as discussed by Riegler, section 1, paragraph 5. Therefore Applicant’s assertions are not persuasive.
In response to applicant's argument that the examiner has combined an excessive number of references, reliance on a large number of references in a rejection does not, without more, weigh against the obviousness of the claimed invention. See In re Gorman, 933 F.2d 982, 18 USPQ2d 1885 (Fed. Cir. 1991). Applicant’s remarks emphasize the number of references, but do not actually show any particular combination in the rejection to be unreasonable.
In response to applicant's argument that the examiner's conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning. But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant's disclosure, such a reconstruction is proper. See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971). Applicant asserts that the rejection of claim 5 relies on improper hindsight because there is “no teaching, suggestion or motivation to combine all 6 references”. Applicant’s remarks do not acknowledge or address the specific teachings, suggestions, and motivations cited in the rejection, and therefore Applicant’s remarks fail to show that these specifically cited teachings, suggestions, and motivations are based on Applicant’s disclosure. As Applicant’s assertions are not supported by any rationale contradicting the analysis of the rejection, Applicant’s argument that the rejection relies on improper hindsight cannot be considered persuasive.
Applicant asserts that there is “no reasonable expectation of success” in combining the references before the effective filing date of the claimed invention. Applicant’s support for this assertion is the statement that “Applicant does not believe there is a reasonable expectation of success,” together with the further assertion that there is no indication of a reasonable expectation of success or some degree of predictability. Applicant’s arguments offer no credible reasons or evidence contradicting the finding that one of ordinary skill in the art could combine the references as proposed in the rejection; Applicant has not identified any single aspect of any one of the proposed modifications for which the expectation of success would be in doubt, much less proposed any technical reason why one of ordinary skill in the art could not make the specific modifications discussed in the rejection. Furthermore, Applicant’s argument appears to be based on the absence of an explicitly stated expectation of success in the rejection. As explained in MPEP 2143.02 I, a reasonable expectation of success can be shown implicitly through the prior art teachings or as part of the obviousness analysis; there is no requirement to explicitly lay out a separate, detailed expectation-of-success analysis in an obviousness rejection. Rather, Applicant’s burden is, in addition to identifying an aspect of a particular combination proposed by the rejection for which Applicant believes there is not a reasonable expectation of success, to present evidence showing there is no reasonable expectation of success, as required by MPEP 2143.02 II. Therefore, Applicant’s argument cannot be considered persuasive because it fails to identify any particular aspect of any particular combination which lacks a reasonable expectation of success, fails to articulate any reason why such an aspect would lack a reasonable expectation of success, and is not supported by evidence showing a lack of a reasonable expectation of success.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT BADER whose telephone number is (571) 270-3335. The examiner can normally be reached Monday through Friday, 11:00 am to 7:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard, can be reached at 571-272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ROBERT BADER/Primary Examiner, Art Unit 2611