DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings are objected to because in FIG. 2 the blocks’ labels (204A … 204N, 210A … 210N, 220A … 220N, 215A … 215N, 225A … 225N, 230A … 230N, 255A … 255N, 260A … 260N) are not consistent with the labels (204a … 204n, 210a … 210n, 220a … 220n, 215a … 215n, 225a … 225n, 230a … 230n, 255a … 255n, 260a … 260n) in pages 13-28 of the specification of this application. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The disclosure is objected to because of the following informalities:
In page 18, line 6, “spare structures” should read “sparse structures”.
In page 18, line 8, “The feature point cloud 240” should read “The sparse feature point cloud 240”.
In page 18, line 18, “The sparse grid 225c” should read “The sparse grid 225a”.
Appropriate correction is required.
Claim Objections
Claim 9 is objected to because of the following informalities:
In claim 9, line 12, “correspond to” should read “corresponding to”.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 10 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 10 recites the limitation “the modeling” (line 2). There is insufficient antecedent basis for this limitation in the claim or in claims 1 and 9. For examination purposes, “the modeling” will be read as “a modeling”.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Philion et al. (Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D, arXiv.org, arXiv:2008.05711v1 [cs.CV] 13 Aug 2020, hereinafter “Philion”) in view of Kim et al. (NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models, arXiv.org, arXiv:2304.09787v1 [cs.CV] 19 Apr 2023, pp 1-37, hereinafter “Kim”).
[Examiner’s Note: the Kim reference was published at arXiv.org on April 19, 2023, with authors Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, and Sanja Fidler, which is within one year of the filing date of the current application. However, because four of the authors, Bradley Brown, Katja Schwarz, Daiqing Li, and Robin Rombach, are not named inventors of the current application, according to MPEP 2153.01(a), it would not be readily apparent from the publication that the Kim reference is an inventor-originated disclosure, and it would be treated as prior art under AIA 35 U.S.C. 102(a)(1) unless there is evidence of record that an exception under AIA 35 U.S.C. 102(b)(1) applies.]
Regarding claim 1, Philion discloses A system comprising at least one processor, the at least one processor comprising one or more circuits to: (page 8, para. 3, “With these hyperparameters and architectural design choices, the forward pass of the model runs at 35 hz on a Titan V GPU”). Note that: the processing system with a Titan V GPU is a system with at least one processor comprising one or more circuits.
construct at least one initial feature map of a plurality of initial feature maps based on a respective input image of an input dataset, wherein each of the at least one initial feature maps incorporates depth data of the respective input image and corresponds to a plurality of pixels of the respective input image; (page 1, Abstract, “We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to “lift” each image individually into a frustum of features for each camera”; page 4 / para. 4 – page 5 / para. 5, “Formally, we are given n images {X_k ∈ R^(3×H×W)}_n each with an extrinsic E_k ∈ R^(3×4) and an intrinsic matrix I_k ∈ R^(3×3) … Let X ∈ R^(3×H×W) be an image with extrinsics E and intrinsics I, and let p be a pixel in the image with image coordinates (h, w). We associate |D| points {(h, w, d) ∈ R^3 | d ∈ D} to each pixel where D is a set of discrete depths, for instance D = {d_0 + Δ, …, d_0 + |D|Δ}. Note that there are no learnable parameters in this transformation. We simply create a large point cloud for a given image of size D.H.W. This structure is equivalent to what the multi-view synthesis community [38,32] has called a multi-plane image except in our case the features in each plane are abstract vectors instead of (r, g, b) values. The context vector for each point in the point cloud is parameterized to match a notion of attention and discrete depth inference. At pixel p, the network predicts a context c ∈ R^C and a distribution over depth α ∈ Δ^(|D|−1) for every pixel. The feature c_d ∈ R^C associated to point p_d is then defined as the context vector for pixel p scaled by α_d: c_d = α_d c”; Fig. 3: “feature c” is the contextual feature vector at pixel p). Note that: (1) the image data from an arbitrary number of cameras formulate a plurality of images {X_k ∈ R^(3×H×W)}_n as an input dataset; (2) “feature c” is the contextual feature vector at pixel p, and the contextual feature vectors c of all pixels formulate an initial feature map for the image X ∈ R^(3×H×W) of the n images; and (3) the “feature c” of pixel p is associated with a corresponding depth d in D. Therefore, the initial feature map can be a feature map in shape of R^((D+C)×H×W) incorporating depth data of the respective input image X.
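For illustration, the following is a minimal sketch (assumed names and shapes; not code from Philion) of the lift step relied on above: a per-pixel context vector is scaled by a per-pixel depth distribution, yielding a D×C feature per pixel, i.e., an initial feature map that incorporates depth data.

    # Illustrative sketch only (assumed shapes/names, not Philion's code):
    # lift a per-image feature map with a depth distribution into frustum features.
    import torch

    def lift_image_features(context, depth_logits):
        # context:      (C, H, W)  per-pixel context vectors c
        # depth_logits: (D, H, W)  per-pixel scores over |D| discrete depths
        alpha = depth_logits.softmax(dim=0)               # depth distribution alpha
        # outer product per pixel: feature c_d = alpha_d * c  -> (D, C, H, W)
        frustum_feat = alpha.unsqueeze(1) * context.unsqueeze(0)
        return frustum_feat

    C, D, H, W = 64, 41, 8, 22
    frustum = lift_image_features(torch.randn(C, H, W), torch.randn(D, H, W))
    print(frustum.shape)  # torch.Size([41, 64, 8, 22])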
construct a sparse feature point cloud comprising a plurality of features determined using the plurality of initial feature maps; (page 1, Abstract, “We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to “lift” each image individually into a frustum of features for each camera”; page 5, paras. 4-5, “Let X ∈ R^(3×H×W) be an image with extrinsics E and intrinsics I, and let p be a pixel in the image with image coordinates (h, w). We associate |D| points {(h, w, d) ∈ R^3 | d ∈ D} to each pixel where D is a set of discrete depths, for instance D = {d_0 + Δ, …, d_0 + |D|Δ}. Note that there are no learnable parameters in this transformation. We simply create a large point cloud for a given image of size D.H.W. This structure is equivalent to what the multi-view synthesis community [38,32] has called a multi-plane image except in our case the features in each plane are abstract vectors instead of (r, g, b) values. The context vector for each point in the point cloud is parameterized to match a notion of attention and discrete depth inference. At pixel p, the network predicts a context c ∈ R^C and a distribution over depth α ∈ Δ^(|D|−1) for every pixel. The feature c_d ∈ R^C associated to point p_d is then defined as the context vector for pixel p scaled by α_d: c_d = α_d c”; Fig. 3: “feature c” is the contextual feature vector at pixel p). Note that: (1) a large point cloud of size D.H.W is created or constructed for a given image, similar to the structure of a multi-plane image, and each point in the point cloud has a contextual feature vector; and (2) the point cloud is sparse because the feature vectors within it are lifted into a frustum of features for each camera and are constructed from a limited number of feature image planes corresponding to the cameras.
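For illustration, the following is a minimal sketch (assumed names; not code from Philion) of how the |D| points per pixel could be generated and unprojected into a camera-frame point cloud of size D·H·W.

    # Illustrative sketch only (assumed names/shapes): build the D*H*W frustum
    # point cloud for one image by unprojecting each pixel at |D| discrete depths.
    import numpy as np

    def frustum_points(H, W, depths, K):
        # depths: (D,) discrete depth values d_0 + k*delta
        # K: (3, 3) camera intrinsic matrix
        ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
        pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # (3, H*W)
        rays = np.linalg.inv(K) @ pix                                         # unproject pixels
        pts = rays[None, :, :] * depths[:, None, None]                        # (D, 3, H*W)
        return pts.transpose(0, 2, 1).reshape(-1, 3)                          # (D*H*W, 3)

    K = np.array([[500.0, 0, 64], [0, 500.0, 32], [0, 0, 1]])
    cloud = frustum_points(64, 128, np.arange(4.0, 45.0, 1.0), K)
    print(cloud.shape)  # (D*H*W, 3) camera-frame points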
However, Philion fails to disclose, but in the same art of computer graphics, Kim discloses
transform the sparse feature point cloud into multi-resolution sparse grids, at least one of the multi-resolution sparse grids comprising a plurality of voxels; (Kim, page 2, col. left, para. 2, “Specifically, a latent-autoencoder decomposes the scene voxels into a 3D coarse, 2D fine and 1D global latent.”; page 3, col. right, para. 2, “After constructing the frustum for each view, we transform the frustums to world coordinates and fuse them into a shared 3D neural field, represented as density and feature voxel grids. Let VDensity and VFeat denote the density and feature grid, respectively. This formulation of representing a scene with density and feature grids has been explored before [74] for optimization-based scene reconstruction and we utilize it as an intermediate representation for our scene auto-encoder. VDensity,Feat have the same spatial size, and each voxel in V represents a region in the world coordinate system. For each voxel indexed by (x, y, z), we pool all densities and features of the corresponding frustum entries”; page 4, col. left, para. 3, “We concatenate VDensity and VFeat along the channel dimension and use separate CNN encoders to encode the voxel grid V into a hierarchy of three latents: 1D global latent g, 3D coarse latent c, and 2D fine latent f, as shown in Fig. 12”; page 15, Table 6: ”Encoder for the coarse latent c” and “Quantization (3D)” as “8x32x32”). Note that: (1) the sparse feature point cloud V has both density grid VDensity and feature grid VFeat; (2) concatenated VDensity and VFeat are encoded into a hierarchy of three latents with fine to coarse resolutions as multi-resolution sparse grids; and (3) one of three sparse grids, 3D coarse c, comprises a plurality of voxels (8x32x32).
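For illustration, the following is a minimal sketch (assumed names and shapes; not Kim's code) of the pooling relied on above: world-frame frustum points and their features are binned into a voxel grid, giving a feature grid of the kind the latents are later encoded from.

    # Illustrative sketch only (assumed names/shapes): pool frustum features into
    # a voxel grid by averaging the features of the points falling in each voxel.
    import numpy as np

    def pool_to_voxels(points, feats, grid_shape, extent):
        # points: (N, 3) world coordinates; feats: (N, C); extent: (xmin..zmax)
        gx, gy, gz = grid_shape
        xmin, xmax, ymin, ymax, zmin, zmax = extent
        idx = np.floor((points - [xmin, ymin, zmin]) /
                       [xmax - xmin, ymax - ymin, zmax - zmin] *
                       [gx, gy, gz]).astype(int)
        keep = np.all((idx >= 0) & (idx < [gx, gy, gz]), axis=1)
        idx, feats = idx[keep], feats[keep]
        V = np.zeros((gx, gy, gz, feats.shape[1]))
        count = np.zeros((gx, gy, gz, 1))
        np.add.at(V, (idx[:, 0], idx[:, 1], idx[:, 2]), feats)     # sum features per voxel
        np.add.at(count, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)   # count points per voxel
        return V / np.maximum(count, 1.0)                           # averaged feature grid

    V = pool_to_voxels(np.random.rand(1000, 3), np.random.rand(1000, 8),
                       (32, 32, 8), (0, 1, 0, 1, 0, 1))
    print(V.shape)  # (32, 32, 8, 8)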
model, using a plurality of neural networks and according to a hierarchal architecture, the multi-resolution sparse grids to construct a hierarchical volume representation; and (Kim, page 2, col. left, para. 2, “Hierarchical diffusion models are then trained on the tri-latent representation to generate novel 3D scenes”; page 4 / col. right / para. 2 – page 5 / col. left / para. 2, “Given the latent variables g, c, f that represent a voxel based scene representation V, we define our generative model as p(V, g, c, f) = p(V |g, c, f)p(f|g, c)p(c|g)p(g) with Denoising Diffusion Models (DDMs) [24] … We train our hierarchical LDM with the following losses … ψ denotes the learnable denoising networks for g, c, f … Each ψ can be trained in parallel and, once trained, can be sampled one after another following the hierarchy”). Note that: (1) After the generative model as p(V, g, c, f) with Denoising Diffusion Models (DDMs) has been trained, the trained 3 neural networks for g, c, f in p(V, g, c, f) model can be regarded as a plurality of neural networks; and (2) the trained p(V, g, c, f) model can be regarded as a hierarchical volume representation that models the multi-resolution sparse grids according to a hierarchal architecture.
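For illustration, the following is a minimal sketch (assumed interfaces; not Kim's code) of sampling that follows the cited factorization p(V, g, c, f) = p(V|g, c, f)p(f|g, c)p(c|g)p(g): each latent is sampled in turn, conditioned on the coarser levels, and the last stage decodes the voxel grid.

    # Illustrative sketch only (assumed interfaces): ancestral sampling through the
    # hierarchy g -> c -> f -> V, one trained network per latent level.
    def sample_scene(sample_g, sample_c_given_g, sample_f_given_gc, decode_voxels):
        g = sample_g()                    # 1D global latent  ~ p(g)
        c = sample_c_given_g(g)           # 3D coarse latent  ~ p(c | g)
        f = sample_f_given_gc(g, c)       # 2D fine latent    ~ p(f | g, c)
        return decode_voxels(g, c, f)     # voxel grid V      ~ p(V | g, c, f)

    # toy stand-ins for the trained samplers/decoder, just to show the call order
    V = sample_scene(lambda: 0.0,
                     lambda g: [g] * 8,
                     lambda g, c: [g] * 16,
                     lambda g, c, f: {"global": g, "coarse": c, "fine": f})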
generate constructed content based on the hierarchical volume representation. (Kim, page 5, para. 2, “Once g, c, f are sampled, we can use the latent decoder from Sec. 3.2 to construct the voxel V which represents the neural field for the sampled scene. Following the volume rendering and decoding step in Sec. A.1, the sampled scene can be visualized from desired viewpoints”; page 20, Figure 15: “Renderings from the scene autoencoder. Top row: without explicit density & feature grids, Bottom row: the full model”). Note that: (1) after the hierarchical volume representation has been constructed by the trained Denoising Diffusion Models (DDMs), one can obtain a V for a sampled scene from the sampled g, c, f latents by using the latent decoder; and (2) the sampled scene or constructed content can be generated or visualized from desired or target viewpoints.
Philion and Kim are in the same field of endeavor, namely computer graphics. Before the effective filing date of the claimed invention, it would have been obvious to apply transforming the sparse feature point cloud into multi-resolution sparse grids, constructing a hierarchical volume representation, and generating constructed content, as taught by Kim, to Philion. The motivation would have been “We achieve a substantial improvement over existing state-of-the-art scene generation models” (Kim, Abstract). The suggestion for doing so is that it would improve scene generation models. Therefore, it would have been obvious to combine Philion and Kim.
Regarding claim 2, Philion in view of Kim discloses The system of claim 1, wherein the input dataset comprises a plurality of input images of a 3D scene, and wherein features of the plurality of initial feature maps corresponding to depths within at least one range of depths are used to fill entries in a respective one of a plurality of frustums, the depths of the features being indicated by incorporating the depth data. (Philion, page 1, Abstract, “We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to “lift” each image individually into a frustum of features for each camera”; page 4 / para. 4 – page 5 / para. 5, “Formally, we are given n images {X_k ∈ R^(3×H×W)}_n each with an extrinsic E_k ∈ R^(3×4) and an intrinsic matrix I_k ∈ R^(3×3) … Let X ∈ R^(3×H×W) be an image with extrinsics E and intrinsics I, and let p be a pixel in the image with image coordinates (h, w). We associate |D| points {(h, w, d) ∈ R^3 | d ∈ D} to each pixel where D is a set of discrete depths, for instance D = {d_0 + Δ, …, d_0 + |D|Δ}. Note that there are no learnable parameters in this transformation. We simply create a large point cloud for a given image of size D.H.W. This structure is equivalent to what the multi-view synthesis community [38,32] has called a multi-plane image except in our case the features in each plane are abstract vectors instead of (r, g, b) values. The context vector for each point in the point cloud is parameterized to match a notion of attention and discrete depth inference. At pixel p, the network predicts a context c ∈ R^C and a distribution over depth α ∈ Δ^(|D|−1) for every pixel. The feature c_d ∈ R^C associated to point p_d is then defined as the context vector for pixel p scaled by α_d: c_d = α_d c”; Fig. 3: “feature c” is the contextual feature vector at pixel p). Note that: (1) the image dataset from different cameras for a 3D scene comprises a plurality of input images of the 3D scene; (2) the features of the plurality of initial feature maps are lifted individually into a frustum of features for each camera, respectively; (3) for each frustum, |D| points {(h, w, d) ∈ R^3 | d ∈ D} are associated or corresponding to each pixel, where D is a set of discrete depths, for instance D = {d_0 + Δ, …, d_0 + |D|Δ}, as one range of depths; and (4) with the “lift”, the features incorporate the depth data and are filled into the entries with depth data in the frustum.
Regarding claim 3, Philion in view of Kim discloses The system of claim 1, wherein a first multi-resolution sparse grid of the multi-resolution sparse grids comprises a first voxel size, and a second multi-resolution sparse grid of the multi-resolution sparse grids comprises a second voxel size. (Kim, page 2, col. left, para. 2, “Specifically, a latent-autoencoder decomposes the scene voxels into a 3D coarse, 2D fine and 1D global latent.”; page 3, col. right, para. 2, “After constructing the frustum for each view, we transform the frustums to world coordinates and fuse them into a shared 3D neural field, represented as density and feature voxel grids. Let VDensity and VFeat denote the density and feature grid, respectively. This formulation of representing a scene with density and feature grids has been explored before [74] for optimization-based scene reconstruction and we utilize it as an intermediate representation for our scene auto-encoder. VDensity,Feat have the same spatial size, and each voxel in V represents a region in the world coordinate system. For each voxel indexed by (x, y, z), we pool all densities and features of the corresponding frustum entries”; page 4, col. left, para. 3, “We concatenate VDensity and VFeat along the channel dimension and use separate CNN encoders to encode the voxel grid V into a hierarchy of three latents: 1D global latent g, 3D coarse latent c, and 2D fine latent f, as shown in Fig. 12”; page 15, Table 6: ”Encoder for the coarse latent c” and “Quantization (3D)” as “8x32x32”); page 15, Table 7: “Encoder for the fine latent f” and “Quantization (2D)” as 4x128x128”). Note that: (1) the coarse latent c can be regarded as a first multi-resolution sparse grid with a first voxel size corresponding to the “8x32x32” dimensions for a volume; and (2) the fine latent f can be regarded as a second multi-resolution sparse grid with a second voxel size (pixel size) corresponding to the 1x128x128 dimensions for a volume while a pixel can be regarded as a voxel that collapses in one of three dimensions.
The motivation to combine Philion and Kim given in claim 1 is incorporated here.
Regarding claim 4, Philion in view of Kim discloses The system of claim 3, wherein the hierarchal architecture comprises: a first neural network of the plurality of neural networks processes the first multi-resolution sparse grid at the first voxel size; and a second neural network of the plurality of neural networks processes the second multi-resolution sparse grid at the second voxel size. (Kim, page 13, Figure 12: three neural networks corresponding to “1D Global”, “3D Voxel”, “2D BEV”; page 4 / col. right / para. 2 – page 5 / col. left / para. 2, “Given the latent variables g, c, f that represent a voxel based scene representation V, we define our generative model as p(V, g, c, f) = p(V |g, c, f)p(f|g, c)p(c|g)p(g) with Denoising Diffusion Models (DDMs) [24] … We train our hierarchical LDM with the following losses … ψ denotes the learnable denoising networks for g, c, f … Each ψ can be trained in parallel and, once trained, can be sampled one after another following the hierarchy”). Note that: (1) the LDM has three neural networks with Denoising Diffusion Models (DDMs) that process multi-resolution sparse grids g, c, f, respectively; (2) the neural network (model) processing sparse coarse grid c at the first voxel size can be regarded as a first neural network of the plurality of neural networks; and (3) the neural network (model) processing sparse fine grid f at the second voxel size can be regarded as a second neural network of the plurality of neural networks.
The motivation to combine Philion and Kim given in claim 1 is incorporated here.
Regarding claim 5, Philion in view of Kim discloses The system of claim 1, wherein generating the constructed content further comprises: determining a new feature map via volume rendering of the hierarchical volume representation, wherein the new feature map comprises a two-dimensional (2D) projection of the hierarchical volume representation corresponding with a target capture device. (Kim, Abstract, “We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene”; page 3, col. right, para. 3, “Finally, we perform volume rendering using the camera poses κ to project V onto a 2D feature map. We trilinearly interpolate the values on each voxel to get the feature and density for each sampling point along the camera rays. 2D features are then fed into a CNN decoder that produces the output image i”). Note that: (1) the camera pose is used to project V (the hierarchical volume representation) onto a 2D feature map that is regarded as a new feature map in performing volume rendering; and (2) the camera pose κ is the specific pose of the target capture device (camera).
The motivation to combine Philion and Kim given in claim 1 is incorporated here.
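For illustration, the following is a minimal sketch (assumed names and shapes; not Kim's code) of projecting a density/feature volume onto a 2D feature map by compositing samples along each camera ray, as in the volume-rendering step relied on above.

    # Illustrative sketch only (assumed shapes): composite per-ray samples of a
    # density/feature volume into one 2D feature map (one ray per output pixel).
    import torch

    def render_feature_map(density, feats, step=1.0):
        # density: (R, S) sampled densities along S points of R rays
        # feats:   (R, S, C) sampled features at the same points
        alpha = 1.0 - torch.exp(-density * step)               # opacity per sample
        trans = torch.cumprod(torch.cat(
            [torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1), dim=1)
        weights = (alpha * trans).unsqueeze(-1)                # (R, S, 1) ray weights
        return (weights * feats).sum(dim=1)                    # (R, C) 2D feature map

    R, S, C = 128 * 128, 32, 16
    fmap = render_feature_map(torch.rand(R, S), torch.rand(R, S, C))
    print(fmap.shape)  # torch.Size([16384, 16]) -> reshape to (128, 128, 16)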
Regarding claim 6, Philion in view of Kim discloses The system of claim 5, wherein the new feature map comprises:
a first component corresponding to a first level of the hierarchal architecture;
a second component corresponding to a second level of the hierarchal architecture; and (Kim, page 5, para 2, “Once g, c, f are sampled, we can use the latent decoder from Sec. 3.2 to construct the voxel V which represents the neural field for the sampled scene. Following the volume rendering and decoding step in Sec. A.1, the sampled scene can be visualized from desired viewpoints”; page 3, col. right, para. 3, “Finally, we perform volume rendering using the camera poses κ to project V onto a 2D feature map”). Note that: (1) one can query or sample g, c, f from volume of the hierarchical volume representation with the camera poses κ; (2) sampled c can be regarded as a first component corresponding to a first level of the hierarchal architecture (coarse level of the hierarchical volume representation); and (3) sampled f can be regarded as a second component corresponding to a second level of the hierarchal architecture (fine level of the hierarchical volume representation).
the method further comprises combining vectors for a plurality of features constructed by the plurality of neural networks to construct the hierarchical volume representation. (Kim, page 13, Figure 12: “Sampled latents can then be decoded into a neural field that can be rendered into a given viewpoint”, the two “Dec” blocks as decoders combining the sampled latents as vectors for the three features g, c, f (a plurality of features) produced by the three neural networks corresponding to “1D Global”, “3D Voxel”, “2D BEV” to construct the hierarchical volume representation).
The motivation to combine Philion and Kim given in claim 1 is incorporated here.
Regarding claim 7, Philion in view of Kim discloses The system of claim 5, wherein generating the constructed content based on the hierarchical volume representation comprises decoding the new feature map using a decoder neural network. (Kim, page 13, Figure 12: “Sampled latents can then be decoded into a neural field that can be rendered into a given viewpoint”, the two “Dec” blocks as decoders combining the sampled latents as vectors for the three features g, c, f (a plurality of features) produced by the three neural networks corresponding to “1D Global”, “3D Voxel”, “2D BEV” to construct the hierarchical volume representation). Note that: the larger “Dec” block in Figure 12 decodes the new feature map using a decoder neural network.
The motivation to combine Philion and Kim given in claim 1 is incorporated here.
Regarding claim 8, Philion in view of Kim discloses The system of claim 1, further comprising:
determining, using a depth encoder with the respective input image as input, a depth map of the respective input image; (Philion, page 3, para. 3, “Monocular object detectors are defined by how they model the transformation from the image plane to a given 3-dimensional reference frame … The current state-of-the-art 3D object detector on the nuScenes benchmark [31] uses an architecture that trains a standard 2d detector to also predict depth using a loss that seeks to disentangle error due to incorrect depth from error due to incorrect bounding boxes”). Note that: the depth is predicted or encoded by a standard 2d detector as an encoder taking the respective input image as input.
lifting each initial feature map using the depth map into a frustum. (Philion, page 1, Abstract, “We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to “lift” each image individually into a frustum of features for each camera”; page 4 / para. 4 – page 5 / para. 5, “Formally, we are given n images {X_k ∈ R^(3×H×W)}_n each with an extrinsic E_k ∈ R^(3×4) and an intrinsic matrix I_k ∈ R^(3×3) … Let X ∈ R^(3×H×W) be an image with extrinsics E and intrinsics I, and let p be a pixel in the image with image coordinates (h, w). We associate |D| points {(h, w, d) ∈ R^3 | d ∈ D} to each pixel where D is a set of discrete depths, for instance D = {d_0 + Δ, …, d_0 + |D|Δ}. Note that there are no learnable parameters in this transformation. We simply create a large point cloud for a given image of size D.H.W. This structure is equivalent to what the multi-view synthesis community [38,32] has called a multi-plane image except in our case the features in each plane are abstract vectors instead of (r, g, b) values. The context vector for each point in the point cloud is parameterized to match a notion of attention and discrete depth inference. At pixel p, the network predicts a context c ∈ R^C and a distribution over depth α ∈ Δ^(|D|−1) for every pixel. The feature c_d ∈ R^C associated to point p_d is then defined as the context vector for pixel p scaled by α_d: c_d = α_d c”; Fig. 3: “feature c” is the contextual feature vector at pixel p). Note that: (1) the image data from an arbitrary number of cameras formulate a plurality of images {X_k ∈ R^(3×H×W)}_n as an input dataset; (2) “feature c” is the contextual feature vector at pixel p, and the contextual feature vectors c of all pixels formulate an initial feature map for the image X ∈ R^(3×H×W) of the n images; and (3) the “feature c” of pixel p is associated with a corresponding depth d in D. Therefore, the initial feature map can be a feature map in shape of R^((D+C)×H×W) incorporating depth data of the respective input image X.
determining, using a feature encoder with the respective input image as input, each initial feature map; and (Kim, Figure 3: “Each input image is processed with a 2D CNN”, and the “2D CNN Encoder” block shown in Figure 3). Note that: the “2D CNN Encoder” is a feature encoder with the respective input image as input to generate or determine “Features” as each initial feature map.
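For illustration, the following is a minimal sketch (assumed architecture; not Kim's or Philion's network) of a 2D CNN feature encoder that maps an input image to a per-pixel feature map.

    # Illustrative sketch only (assumed architecture): a small 2D CNN that turns
    # an H x W image into a downsampled per-pixel feature map.
    import torch
    import torch.nn as nn

    feature_encoder = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    )

    image = torch.randn(1, 3, 128, 352)        # one RGB input image
    features = feature_encoder(image)          # (1, 64, 32, 88) initial feature map
    print(features.shape)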
The motivation to combine Philion and Kim given in claim 1 is incorporated here.
Regarding claim 9, Philion in view of Kim discloses The system of claim 1, the at least one processor further to:
construct and update a hierarchical encoder to reduce dimensionality of at least one voxel hierarchical level of a hierarchical voxel representation and output the hierarchical voxel representation into compressed latent variables; (Kim, page 13, Figure 12: the “Enc C” block is a hierarchical encoder to reduce dimensionality at one voxel hierarchical level (a coarse level (3x128x128)); page 15, Table 6: “Encoder for the coarse latent c” and “Quantization (3D)” as “8x32x32”; page 4 / col. right / para. 2 – page 5 / col. left / para. 2, “Given the latent variables g, c, f that represent a voxel based scene representation V, we define our generative model as p(V, g, c, f) = p(V |g, c, f)p(f|g, c)p(c|g)p(g) with Denoising Diffusion Models (DDMs) [24] … We train our hierarchical LDM with the following losses … ψ denotes the learnable denoising networks for g, c, f … Each ψ can be trained in parallel and, once trained, can be sampled one after another following the hierarchy”). Note that: (1) “Encoder for the coarse latent c” is designed or constructed as a hierarchical encoder; (2) when the coarse encoder as a neural network is trained, the hierarchical encoder (neural network) corresponding to the coarse latent c is updated; and (3) the generative model as p(V, g, c, f) = p(V |g, c, f)p(f|g, c)p(c|g)p(g) with Denoising Diffusion Models (DDMs) is trained to output or formulate a hierarchical voxel representation into compressed latent variables g, c, f as a volume representation.
construct and update a multi-layer neural network by querying a subset of the plurality of voxels using coordinates, wherein updating comprises matching the plurality of features in the hierarchical volume representation and outputting a compressed representation of the hierarchical volume representation; and (Kim, page 13, Figure 12: three neural networks corresponding to “1D Global”, “3D Voxel”, “2D BEV”; page 4 / col. right / para. 2 – page 5 / col. left / para. 2, “Given the latent variables g, c, f that represent a voxel based scene representation V, we define our generative model as p(V, g, c, f) = p(V |g, c, f)p(f|g, c)p(c|g)p(g) with Denoising Diffusion Models (DDMs) [24] … We train our hierarchical LDM with the following losses … ψ denotes the learnable denoising networks for g, c, f … Each ψ can be trained in parallel and, once trained, can be sampled one after another following the hierarchy”; page 5, col. left, para. 1: the three denoising loss equations for g, c, f). Note that: (1) the generative model as p(V, g, c, f) = p(V |g, c, f)p(f|g, c)p(c|g)p(g) with Denoising Diffusion Models (DDMs) is designed or constructed as a multi-layer neural network (LDM); (2) the LDM is trained using the three loss functions cited above; (3) during the training or updating of the neural network, the neural network can be queried to obtain a set of latent variables g, c, f for a subset of voxels indicated by the voxels’ coordinates (x, y, z) in the volume of the hierarchical volume representation to match g0, c0, and f0 in the cited losses; and (4) after the LDM has been trained, the hierarchical volume representation corresponding to the LDM model is outputted or formulated.
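For illustration, the following is a minimal sketch (assumed formulation following standard denoising-diffusion practice, not Kim's exact losses) of training one denoising network per latent level, here the network for c conditioned on g.

    # Illustrative sketch only (assumed formulation): a standard denoising loss
    # for the coarse-latent network psi_c, conditioned on the global latent g.
    import torch

    def ddm_loss(psi_c, c0, g):
        t = torch.rand(c0.shape[0], 1)                 # random diffusion time per sample
        noise = torch.randn_like(c0)
        alpha = torch.cos(t * torch.pi / 2) ** 2       # simple noise schedule
        c_t = alpha.sqrt() * c0 + (1 - alpha).sqrt() * noise
        pred = psi_c(c_t, t, g)                        # predict the added noise
        return ((pred - noise) ** 2).mean()

    # toy denoiser standing in for the learnable network psi_c
    psi_c = lambda c_t, t, g: c_t - g
    print(ddm_loss(psi_c, torch.randn(4, 8), torch.randn(4, 8)))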
wherein the determining of the hierarchical encoder and the multi-layer neural network comprises:
a first stage corresponding to compression of each voxel hierarchical level; and (Kim, page 3, Figure 2: three hierarchical sub-encoders for compressing each voxel hierarchical level). Note that: (1) the three hierarchical sub-encoders are regarded as the hierarchical encoder; and (2) determining the hierarchical encoder can be regarded as a first stage corresponding to compression of each voxel hierarchical level (g, c, f).
a second stage correspond to compression of the hierarchical voxel representation into a final latent representation. (Kim, page 13, Figure 12: the three neural networks corresponding to “1D Global”, “3D Voxel”, “2D BEV” are for compression of the hierarchical voxel representation into a final latent representation). Note that: (1) the three neural networks corresponding to “1D Global”, “3D Voxel”, “2D BEV” are regarded as the multi-layer neural network; and (2) determining the multi-layer neural network can be regarded as a second stage corresponding to compression of the hierarchical voxel representation into a final latent representation. The trained LDM can be taken as a final latent representation.
The motivation to combine Philion and Kim given in claim 1 is incorporated here.
Regarding claim 10, Philion in view of Kim discloses The system of claim 9, wherein the plurality of neural networks comprise a plurality of diffusion models, and wherein the modeling comprises using the plurality of diffusion models to model the plurality of voxels to construct the hierarchical volume representation. (Kim, page 13, Figure 12: three neural networks as “Hierarchical Latent Diffusion Model” corresponding to “1D Global”, “3D Voxel”, “2D BEV”; page 2, col. left, para. 2, “Hierarchical diffusion models are then trained on the tri-latent representation to generate novel 3D scenes”; page 4 / col. right / para. 2 – page 5 / col. left / para. 2, “Given the latent variables g, c, f that represent a voxel based scene representation V, we define our generative model as p(V, g, c, f) = p(V |g, c, f)p(f|g, c)p(c|g)p(g) with Denoising Diffusion Models (DDMs) [24] … We train our hierarchical LDM with the following losses … ψ denotes the learnable denoising networks for g, c, f … Each ψ can be trained in parallel and, once trained, can be sampled one after another following the hierarchy”). Note that: (1) the three neural networks of the “Hierarchical Latent Diffusion Model” corresponding to “1D Global”, “3D Voxel”, “2D BEV” can be regarded as a plurality of diffusion models; and (2) the three neural networks (diffusion models) model the multi-resolution sparse grids to construct or formulate the hierarchical volume representation.
The motivation to combine Philion and Kim given in claim 1 is incorporated here.
Regarding claim 11, Philion in view of Kim discloses The system of claim 1, wherein the at least one processor is comprised in at least one of:
a control system for an autonomous or semi-autonomous machine;
a perception system for an autonomous or semi-autonomous machine;
a system implemented using a robot; an aerial system; a medical system;
a boating system, a smart area monitoring system;
a system for performing deep learning operations;
a system for performing simulation operations;
a system for generating or presenting virtual reality (VR) content, augmented reality (AR) content, or mixed reality (MR) content;
a system for performing digital twin operations;
a system implemented using an edge device; a system incorporating one or more virtual machines (VMs);
a system for generating synthetic data;
a system implemented at least partially in a data center;
a system for performing conversational artificial intelligence (AI) operations;
a system for performing generative AI operations;
a system implementing language models; a system implementing large language models (LLMs); a system implementing vision language models (VLMs);
a system for hosting one or more real-time streaming applications;
a system for performing light transport simulation;
a system for performing collaborative content creation for 3D assets; or
a system implemented at least partially using cloud computing resources. (Philion, page 1, Abstract, “The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single bird's-eye-view" coordinate frame for consumption by motion planning”; page 8, para. 3, “With these hyperparameters and architectural design choices, the forward pass of the model runs at 35 hz on a Titan V GPU”). Note that: (1) the processing system with a Titan V GPU is a system with at least one processor comprising one or more circuits; and (2) the GPU processor is comprised in a perception system to extract semantic representations for an autonomous vehicle (autonomous or semi-autonomous machines).
Regarding claim 12, Philion in view of Kim discloses A system comprising at least one processor, the at least one processor comprises one or more circuits to: (Philion, page 8, para. 3, “With these hyperparameters and architectural design choices, the forward pass of the model runs at 35 hz on a Titan V GPU”). Note that: the processing system with a Titan V GPU is a system with at least one processor comprising one or more circuits.
determine an initial feature map based on an input dataset, wherein the initial feature map, incorporating depth data, corresponds with a plurality of pixels of the input dataset; (Philion, page 1, Abstract, “We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to “lift” each image individually into a frustum of features for each camera”; page 4 / para. 4 – page 5 / para. 5, “Formally, we are given n images {X_k ∈ R^(3×H×W)}_n each with an extrinsic E_k ∈ R^(3×4) and an intrinsic matrix I_k ∈ R^(3×3) … Let X ∈ R^(3×H×W) be an image with extrinsics E and intrinsics I, and let p be a pixel in the image with image coordinates (h, w). We associate |D| points {(h, w, d) ∈ R^3 | d ∈ D} to each pixel where D is a set of discrete depths, for instance D = {d_0 + Δ, …, d_0 + |D|Δ}. Note that there are no learnable parameters in this transformation. We simply create a large point cloud for a given image of size D.H.W. This structure is equivalent to what the multi-view synthesis community [38,32] has called a multi-plane image except in our case the features in each plane are abstract vectors instead of (r, g, b) values. The context vector for each point in the point cloud is parameterized to match a notion of attention and discrete depth inference. At pixel p, the network predicts a context c ∈ R^C and a distribution over depth α ∈ Δ^(|D|−1) for every pixel. The feature c_d ∈ R^C associated to point p_d is then defined as the context vector for pixel p scaled by α_d: c_d = α_d c”; Fig. 3: “feature c” is the contextual feature vector at pixel p). Note that: (1) the image data from an arbitrary number of cameras formulate a plurality of images {X_k ∈ R^(3×H×W)}_n as an input dataset; (2) “feature c” is the contextual feature vector at pixel p, and the contextual feature vectors c of all pixels formulate an initial feature map for the image X ∈ R^(3×H×W) of the n images; and (3) the “feature c” of pixel p is associated with a corresponding depth d in D. Therefore, the initial feature map can be a feature map in shape of R^((D+C)×H×W) incorporating depth data of the respective input image X.
… sparse feature point cloud; (Philion, page 1, Abstract, “We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to “lift” each image individually into a frustum of features for each camera”; page 5, paras. 4-5, “Let X ∈ R^(3×H×W) be an image with extrinsics E and intrinsics I, and let p be a pixel in the image with image coordinates (h, w). We associate |D| points {(h, w, d) ∈ R^3 | d ∈ D} to each pixel where D is a set of discrete depths, for instance D = {d_0 + Δ, …, d_0 + |D|Δ}. Note that there are no learnable parameters in this transformation. We simply create a large point cloud for a given image of size D.H.W. This structure is equivalent to what the multi-view synthesis community [38,32] has called a multi-plane image except in our case the features in each plane are abstract vectors instead of (r, g, b) values. The context vector for each point in the point cloud is parameterized to match a notion of attention and discrete depth inference. At pixel p, the network predicts a context c ∈ R^C and a distribution over depth α ∈ Δ^(|D|−1) for every pixel. The feature c_d ∈ R^C associated to point p_d is then defined as the context vector for pixel p scaled by α_d: c_d = α_d c”; Fig. 3: “feature c” is the contextual feature vector at pixel p). Note that: (1) a large point cloud of size D.H.W is created or constructed for a given image, similar to the structure of a multi-plane image, and each point in the point cloud has a contextual feature vector; and (2) the point cloud is sparse because the feature vectors within it are lifted into a frustum of features for each camera and are constructed from a limited number of feature image planes corresponding to the cameras.
determine a hierarchical volume representation based on multi-resolution sparse grids comprising a plurality of voxels corresponding to a transformed sparse feature point cloud; and (Kim, page 2, col. left, para. 2, “Specifically, a latent-autoencoder decomposes the scene voxels into a 3D coarse, 2D fine and 1D global latent.”; page 3, col. right, para. 2, “After constructing the frustum for each view, we transform the frustums to world coordinates and fuse them into a shared 3D neural field, represented as density and feature voxel grids. Let VDensity and VFeat denote the density and feature grid, respectively. This formulation of representing a scene with density and feature grids has been explored before [74] for optimization-based scene reconstruction and we utilize it as an intermediate representation for our scene auto-encoder. VDensity,Feat have the same spatial size, and each voxel in V represents a region in the world coordinate system. For each voxel indexed by (x, y, z), we pool all densities and features of the corresponding frustum entries”; page 4, col. left, para. 3, “We concatenate VDensity and VFeat along the channel dimension and use separate CNN encoders to encode the voxel grid V into a hierarchy of three latents: 1D global latent g, 3D coarse latent c, and 2D fine latent f, as shown in Fig. 12”; page 15, Table 6: ”Encoder for the coarse latent c” and “Quantization (3D)” as “8x32x32”; page 2, col. left, para. 2, “Hierarchical diffusion models are then trained on the tri-latent representation to generate novel 3D scenes”; page 4 / col. right / para. 2 – page 5 / col. left / para. 2, “Given the latent variable