DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of the Claims
Claims 1-20, as originally filed, are currently pending and have been considered below.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 5, 6, 11-14, and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Yu, Alex, et al., "pixelNeRF: Neural Radiance Fields from One or Few Images," arXiv preprint arXiv:2012.02190 (2020), hereinafter “Yu”, in view of Liu, Andrew, et al., "Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image," arXiv preprint arXiv:2012.09855 (2020), hereinafter “Liu”.
As per claim 1, Yu discloses a method of training a neural radiance field-based (NeRF-based) machine learning model for object recognition (Yu, Abstract, We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images ... introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior), the method comprising:
obtaining a set of content items to train the NeRF-based machine learning model (Yu, Abstract, We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixelNeRF by demonstrating it on multi-object ShapeNet scenes and real scenes from the DTU dataset);
determining depth of objects depicted in the set of content items (Yu, pages 11-14, B.1. Implementation Details, Encoder E ... we use a ResNet34 backbone and extract a feature pyramid by taking the feature maps prior to the first pooling operation and after the first ResNet 3 layers ... Hierarchical volume sampling To improve the sampling efficiency, in practice, we also use coarse and fine NeRF networks fc, ff ... we use 64 stratified uniform and 16 importance samples, and additionally take 16 fine samples with a normal distribution around the expected ray termination (i.e. depth) from the coarse model, to further promote denser sampling near the surface);
generating, based on the depth, a first set of training data comprising reconstructed content items depicting only the objects (Yu, page 8, 5.2. Pushing the Boundaries of ShapeNet, we use the off-the-shelf PointRend segmentation model to remove the background before passing through our model; Yu, pages 11-14, B.1. Implementation Details, NeRF rendering hyperparameters We use positional encoding from NeRF for the spatial coordinates ... We use a white background color in NeRF to match the ShapeNet renderings, except in the DTU setup where a black background is used; Yu, page 15, B.2.1 Single-category ShapeNet, We train for 400000 iterations ... on a single Titan RTX. For efficiency, we sample rays from within a tight bounding box around the object; Yu, page 15, B.2.2 Category-agnostic ShapeNet, We train our model for 800000 iterations on the entire training set, where rays are sampled from within a tight bounding box; Yu, pages 17-19, B.2.3 Generalization to Novel Categories, We train our model for 680000 iterations across all instances of 3 categories: airplane, car, and chair. Rays are sampled from within a tight bounding box for the first 400000 iterations; Yu, pages 19-20, B.2.5 Sim2Real on Real Car Images, We use car images from the Stanford Cars dataset. PointRend is applied to the images to obtain foreground masks and bounding boxes ... For evaluation, we set the camera pose to identity and use the same sampling strategy and bounds as at train time for the single-category cars model);
generating, based on the depth maps, a second set of training data comprising one or more optimal training paths associated with the set of content items (Yu, page 8, 5.2. Pushing the Boundaries of ShapeNet, we use the off-the-shelf PointRend segmentation model to remove the background before passing through our model; Yu, page 4, 4.1. Single-Image pixelNeRF, Given a input image I of a scene, we first extract a feature volume W = E(I). Then, for a point on a camera ray x, we retrieve the corresponding image feature by projecting x onto the image plane to the image coordinates π(x) using known intrinsics, then bilinearly interpolating between the pixelwise features to extract the feature vector W(π (x)). The image features are then passed into the NeRF network, along with the position and view direction ... In the few-shot view synthesis task, the query view direction is a useful signal for determining the importance of a particular image feature in the NeRF network. If the query view direction is similar to the input view orientation, the model can rely more directly on the input; if it is dissimilar, the model must leverage the learned prior); and
training the NeRF-based machine learning model based on the first set of training data and the second set of training data (Yu, page 8, 5.2. Pushing the Boundaries of ShapeNet, we use the off-the-shelf PointRend segmentation model to remove the background before passing through our model; Yu, pages 1-2, 1. Introduction, pixelNeRF, a learning framework that enables predicting NeRFs from one or several images in a feed-forward manner ... pixelNeRF takes spatial image features aligned to each pixel as an input. This image conditioning allows the framework to be trained on a set of multi-view images, where it can learn scene priors to perform view synthesis from one or few input views ... we condition NeRF on input images by first computing a fully convolutional image feature grid from the input image. Then for each query spatial point x and viewing direction d of interest in the view coordinate frame, we sample the corresponding image feature via projection and bilinear interpolation. The query specification is sent along with the image features to the NeRF network that outputs density and color, where the spatial image features are fed to each layer as a residual. When more than one image is available, the inputs are first encoded into a latent representation in each camera’s coordinate frame, which are then pooled in an intermediate layer prior to predicting the color and density).
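For illustration only, the pixel-aligned conditioning quoted from Yu above, projecting a query point x to image coordinates π(x) with known intrinsics and bilinearly interpolating the convolutional feature grid to obtain W(π(x)), may be sketched in numpy as follows. This is a minimal sketch for the reader's orientation, not pixelNeRF's actual code; the intrinsic matrix, feature-grid shapes, and query point are all illustrative values.

```python
# Illustrative sketch: project a 3D query point into the input image with
# known intrinsics, then bilinearly interpolate the feature grid there.
import numpy as np

def project(x_cam, K):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    uvw = K @ x_cam
    return uvw[:2] / uvw[2]

def bilinear_sample(feat, u, v):
    """Bilinearly interpolate an (H, W, C) feature grid at continuous (u, v)."""
    h, w = feat.shape[:2]
    u = np.clip(u, 0, w - 1.0001)
    v = np.clip(v, 0, h - 1.0001)
    u0, v0 = int(u), int(v)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * feat[v0, u0]
            + du * (1 - dv) * feat[v0, u0 + 1]
            + (1 - du) * dv * feat[v0 + 1, u0]
            + du * dv * feat[v0 + 1, u0 + 1])

# Toy example: a 4x4 feature grid with 2 channels, one query point.
K = np.array([[100.0, 0, 2.0], [0, 100.0, 2.0], [0, 0, 1.0]])  # illustrative
feat = np.arange(32, dtype=float).reshape(4, 4, 2)
x = np.array([0.001, 0.001, 1.0])        # a point on a camera ray
u, v = project(x, K)                     # image coordinates pi(x)
w_pi_x = bilinear_sample(feat, u, v)     # feature vector W(pi(x))
```

In pixelNeRF this sampled feature vector is fed to the NeRF network alongside the point's position and view direction; the sketch stops at the sampling step.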
Yu does not explicitly disclose the following limitations as further recited; however, Liu discloses:
determining depth maps of objects depicted in the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer … The renderer also outputs a depth map as seen from the new camera);
generating, based on the depth maps, a first set of training data comprising reconstructed content items depicting only the objects (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer … we compute a per-pixel binary mask … The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask to give ˆIt+1);
generating, based on the depth maps, a second set of training data comprising one or more optimal training paths associated with the set of content items, wherein the one or more optimal training paths are generated based at least in part on a dissimilarity matrix associated with the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors ... we compute a per-pixel binary mask ... by thresholding the gradient of the disparity image ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask to give ˆIt+1. The renderer also outputs a depth map as seen from the new camera, which we invert and multiply by the rendered mask to obtain ˆDt+1 ... Refinement and Synthesis. Given the rendered image ˆIt+1, its disparity ˆDt+1 and its mask Mt+1, our next task is to refine this image ... in our work the input is the rendered image, disparity, and mask. The generator output is a 4-channel image comprising RGB and disparity channels. We also train a single encoder that encodes the initial input image I0 to compute the latent noise ... Rinse and Repeat. A crucial part of our approach is to not just refine the RGB pixels, but also the disparity as well. Together the geometry (represented by a disparity) and RGB texture provide the necessary information for our renderer to produce the next view).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Yu and Liu because they are in the same field of endeavor. One of ordinary skill in the art would have been motivated to include the disparity and depth maps taught by Liu in the system of Yu as an alternate means of performing novel view synthesis (Yu, Abstract; Liu, Abstract).
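For illustration only, the hierarchical volume sampling cited from Yu's appendix in the claim 1 mapping above (stratified uniform samples along the ray, plus importance samples concentrated where the coarse network places weight) may be sketched as follows. This is a minimal sketch under assumed values; the coarse weight vector here is a synthetic stand-in for the coarse NeRF's output, and the sample counts mirror those quoted from Yu.

```python
# Illustrative sketch: stratified coarse samples, then inverse-CDF
# importance samples drawn from the coarse weights.
import numpy as np

rng = np.random.default_rng(0)

def stratified_samples(near, far, n):
    """One uniform sample per evenly spaced bin in [near, far]."""
    edges = np.linspace(near, far, n + 1)
    return edges[:-1] + rng.random(n) * (edges[1:] - edges[:-1])

def importance_samples(bins, weights, n):
    """Inverse-CDF sampling: draw depths where coarse weights are large."""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    idx = np.searchsorted(cdf, rng.random(n), side="right") - 1
    idx = np.clip(idx, 0, len(weights) - 1)
    # Place each drawn sample uniformly inside its selected bin.
    return bins[idx] + rng.random(n) * (bins[idx + 1] - bins[idx])

t_coarse = stratified_samples(near=2.0, far=6.0, n=64)    # 64 stratified
coarse_w = np.exp(-0.5 * ((t_coarse - 4.0) / 0.3) ** 2)   # synthetic weights
bins = np.concatenate([t_coarse, [6.0]])
t_fine = importance_samples(bins, coarse_w, n=16)         # 16 importance
```

The fine samples cluster near depth 4.0, where the synthetic weights peak, which is the "denser sampling near the surface" effect the quoted passage describes.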
As per claim 2, Yu and Liu disclose the method of claim 1. Liu discloses wherein determining the depth maps of the objects depicted in the set of content items comprises:
calculating, based on the set of content items, internal and external parameters of cameras from which the set of content items was captured (Liu, page 3, 3. Perpetual View Generation, We introduce perpetual view generation, the task of continually generating novel views of a scene ... Specifically, at test time, given an RGB image I0 and a camera trajectory ... the task is to output a new image sequence ... The trajectory is a series of 3D camera poses ... where R and t are 3D rotations and translations, respectively. In addition, each camera has an intrinsic matrix K. At test time the camera trajectory may be pre-specified ... At training time camera data is obtained from video clips via structure-from-motion);
determining, based on the internal and external parameters, coarse point clouds associated with the objects depicted in the set of content items (Liu, page 3, 3. Perpetual View Generation, each camera has an intrinsic matrix K. At test time the camera trajectory may be pre-specified ... At training time camera data is obtained from video clips via structure-from-motion; Liu, page 5, 4. Aerial Coastline Imagery Dataset (ACID), We collected 765 videos using keywords such as ‘coastal’ and ‘aerial footage’, and processed these videos with SLAM and structure-from-motion ... Disparity We use the off-the-shelf MiDaS single-view depth prediction method [22] to obtain disparity maps for every frame ... we use the sparse point-cloud computed for each scene during structure from motion ... We apply this scale and shift to the MiDaS output to obtain disparity maps; Liu, page 12, 1.2. Inference without Disparity Scaling, Scaling and shifting the disparity as described above requires a sparse point cloud, which is generated from SfM);
determining, based on the coarse point clouds, meshes of the objects depicted in the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors; Liu, page 5, 4. Aerial Coastline Imagery Dataset (ACID), we use the sparse point-cloud computed for each scene during structure from motion ... We apply this scale and shift to the MiDaS output to obtain disparity maps); and
determining, based on the meshes of the objects, the depth maps of the objects depicted in the content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors … we compute a per-pixel binary mask ... by thresholding the gradient of the disparity image ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask to give ˆIt+1. The renderer also outputs a depth map). The motivation would be the same as above in claim 1.
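For illustration only, the pipeline mapped to claim 2 above (camera parameters → point cloud → mesh → depth map) turns on an unprojection step like the one Liu quotes: each pixel and its disparity become a 3D point via the camera intrinsics, with the extrinsics moving the point into world coordinates. A minimal sketch, assuming a standard pinhole model with world-to-camera extrinsics [R | t]; the intrinsic values and pixels are illustrative.

```python
# Illustrative sketch: lift pixels with disparity (inverse depth) into a
# 3D point cloud using camera intrinsics K and extrinsics (R, t).
import numpy as np

def unproject(u, v, disparity, K, R, t):
    """Pixel (u, v) with disparity -> 3D point in world coordinates."""
    z = 1.0 / disparity                              # depth from disparity
    x_cam = np.linalg.inv(K) @ np.array([u * z, v * z, z])
    return R.T @ (x_cam - t)                         # camera -> world

K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
R, t = np.eye(3), np.zeros(3)                        # identity pose for the sketch
cloud = np.array([unproject(u, v, 0.25, K, R, t)
                  for u, v in [(320, 240), (330, 240), (320, 250)]])
# With identity pose, the principal-point pixel lands at (0, 0, depth).
```

Connecting neighboring points into triangles then yields the mesh Liu describes, from which a depth map can be rendered from any new camera pose.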
As per claim 5, Yu and Liu disclose the method of claim 1. Liu discloses wherein generating the first set of training data comprising the reconstructed content items comprises:
determining, based on the depth maps, pixels in each content item of the set of content items to be filtered out (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors ... we compute a per-pixel binary mask ... by thresholding the gradient of the disparity image);
filtering out the pixels in each content item of the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering, This use of the mask ensures that any regions in ˆIt+1 and ˆDt+1 that were occluded in It are masked out and set to zero (along with regions that were outside the field of view of the previous camera)); and
sampling remaining pixels in each content item of the set of content items to generate the reconstructed content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering, To avoid stretched triangle artefacts at depth discontinuities, and to aid our refinement network by identifying regions to be completed, we compute a per-pixel binary mask by thresholding the gradient of the disparity image computed with a Sobel filter ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask). The motivation would be the same as above in claim 1.
As per claim 6, Yu and Liu disclose the method of claim 5. Liu discloses wherein determining the pixels in each content item of the set of content items to be filtered out comprises: determining pixels in each content item of the set of content items that are outside a threshold depth range indicated by a corresponding depth map of each content item, wherein the threshold depth range indicates a depth range of at least one object depicted in each content item (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering, To avoid stretched triangle artefacts at depth discontinuities, and to aid our refinement network by identifying regions to be completed, we compute a per-pixel binary mask by thresholding the gradient of the disparity image computed with a Sobel filter ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask). The motivation would be the same as above in claim 1.
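For illustration only, the filtering mapped to claims 5 and 6 above combines two tests Liu describes: a per-pixel binary mask from thresholding the Sobel gradient of the disparity image, and the removal of pixels outside a depth range around the object. A minimal numpy sketch; the gradient threshold, depth range, and toy disparity image are all illustrative values, and the hand-rolled 3x3 filter stands in for a library Sobel call.

```python
# Illustrative sketch: keep pixels that are both away from disparity
# discontinuities (small Sobel gradient) and inside a depth range.
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def filter3x3(img, k):
    """3x3 'same' cross-correlation with zero padding (enough for a sketch)."""
    p = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def keep_mask(disparity, grad_thresh, depth_range):
    gx = filter3x3(disparity, SOBEL_X)
    gy = filter3x3(disparity, SOBEL_X.T)
    smooth = np.hypot(gx, gy) < grad_thresh          # no depth discontinuity
    depth = 1.0 / disparity
    in_range = (depth >= depth_range[0]) & (depth <= depth_range[1])
    return smooth & in_range

disp = np.full((6, 6), 0.5)          # depth 2 everywhere ...
disp[:, 3:] = 0.1                    # ... except a far region at depth 10
mask = keep_mask(disp, grad_thresh=0.5, depth_range=(1.0, 5.0))
# Pixels at the discontinuity and in the far region are filtered out;
# the remaining pixels are the ones available for sampling.
```

Note that zero padding also marks image-border pixels as discontinuities here; a real implementation would pad by edge replication or crop the border.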
As per claim 11, Yu discloses a system comprising:
at least one processor (Yu, page 15, B.2.1 Single-Category ShapeNet, Titan RTX); and
a memory storing instructions that, when executed by the at least one processor, cause the system to perform a method of training a neural radiance field-based (NeRF-based) machine learning model for object recognition (Yu, Abstract, We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images ... introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior), the method comprising:
obtaining a set of content items to train the NeRF-based machine learning model (Yu, Abstract, We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixelNeRF by demonstrating it on multi-object ShapeNet scenes and real scenes from the DTU dataset);
determining depth of objects depicted in the set of content items (Yu, pages 11-14, B.1. Implementation Details, Encoder E ... we use a ResNet34 backbone and extract a feature pyramid by taking the feature maps prior to the first pooling operation and after the first ResNet 3 layers ... Hierarchical volume sampling To improve the sampling efficiency, in practice, we also use coarse and fine NeRF networks fc, ff ... we use 64 stratified uniform and 16 importance samples, and additionally take 16 fine samples with a normal distribution around the expected ray termination (i.e. depth) from the coarse model, to further promote denser sampling near the surface);
generating a first set of training data comprising reconstructed content items depicting only the objects (Yu, page 8, 5.2. Pushing the Boundaries of ShapeNet, we use the off-the-shelf PointRend segmentation model to remove the background before passing through our model; Yu, pages 11-14, B.1. Implementation Details, NeRF rendering hyperparameters We use positional encoding from NeRF for the spatial coordinates ... We use a white background color in NeRF to match the ShapeNet renderings, except in the DTU setup where a black background is used; Yu, page 15, B.2.1 Single-category ShapeNet, We train for 400000 iterations ... on a single Titan RTX. For efficiency, we sample rays from within a tight bounding box around the object; Yu, page 15, B.2.2 Category-agnostic ShapeNet, We train our model for 800000 iterations on the entire training set, where rays are sampled from within a tight bounding box; Yu, pages 17-19, B.2.3 Generalization to Novel Categories, We train our model for 680000 iterations across all instances of 3 categories: airplane, car, and chair. Rays are sampled from within a tight bounding box for the first 400000 iterations; Yu, pages 19-20, B.2.5 Sim2Real on Real Car Images, We use car images from the Stanford Cars dataset. PointRend is applied to the images to obtain foreground masks and bounding boxes ... For evaluation, we set the camera pose to identity and use the same sampling strategy and bounds as at train time for the single-category cars model);
generating a second set of training data comprising one or more optimal training paths associated with the set of content items (Yu, page 8, 5.2. Pushing the Boundaries of ShapeNet, we use the off-the-shelf PointRend segmentation model to remove the background before passing through our model; Yu, page 4, 4.1. Single-Image pixelNeRF, Given a input image I of a scene, we first extract a feature volume W = E(I). Then, for a point on a camera ray x, we retrieve the corresponding image feature by projecting x onto the image plane to the image coordinates π(x) using known intrinsics, then bilinearly interpolating between the pixelwise features to extract the feature vector W(π (x)). The image features are then passed into the NeRF network, along with the position and view direction ... In the few-shot view synthesis task, the query view direction is a useful signal for determining the importance of a particular image feature in the NeRF network. If the query view direction is similar to the input view orientation, the model can rely more directly on the input; if it is dissimilar, the model must leverage the learned prior); and
training the NeRF-based machine learning model based on the first set of training data and the second set of training data (Yu, page 8, 5.2. Pushing the Boundaries of ShapeNet, we use the off-the-shelf PointRend segmentation model to remove the background before passing through our model; Yu, pages 1-2, 1. Introduction, pixelNeRF, a learning framework that enables predicting NeRFs from one or several images in a feed-forward manner ... pixelNeRF takes spatial image features aligned to each pixel as an input. This image conditioning allows the framework to be trained on a set of multi-view images, where it can learn scene priors to perform view synthesis from one or few input views ... we condition NeRF on input images by first computing a fully convolutional image feature grid from the input image. Then for each query spatial point x and viewing direction d of interest in the view coordinate frame, we sample the corresponding image feature via projection and bilinear interpolation. The query specification is sent along with the image features to the NeRF network that outputs density and color, where the spatial image features are fed to each layer as a residual. When more than one image is available, the inputs are first encoded into a latent representation in each camera’s coordinate frame, which are then pooled in an intermediate layer prior to predicting the color and density).
Yu does not explicitly disclose the following limitations as further recited; however, Liu discloses:
determining depth maps of objects depicted in the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer … The renderer also outputs a depth map as seen from the new camera);
generating, based on the depth maps, a first set of training data comprising reconstructed content items depicting only the objects (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer … we compute a per-pixel binary mask … The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask to give ˆIt+1);
generating, based on the depth maps, a second set of training data comprising one or more optimal training paths associated with the set of content items, wherein the one or more optimal training paths are generated based at least in part on a dissimilarity matrix associated with the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors ... we compute a per-pixel binary mask ... by thresholding the gradient of the disparity image ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask to give ˆIt+1. The renderer also outputs a depth map as seen from the new camera, which we invert and multiply by the rendered mask to obtain ˆDt+1 ... Refinement and Synthesis. Given the rendered image ˆIt+1, its disparity ˆDt+1 and its mask Mt+1, our next task is to refine this image ... in our work the input is the rendered image, disparity, and mask. The generator output is a 4-channel image comprising RGB and disparity channels. We also train a single encoder that encodes the initial input image I0 to compute the latent noise ... Rinse and Repeat. A crucial part of our approach is to not just refine the RGB pixels, but also the disparity as well. Together the geometry (represented by a disparity) and RGB texture provide the necessary information for our renderer to produce the next view).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Yu and Liu because they are in the same field of endeavor. One of ordinary skill in the art would have been motivated to include the disparity and depth maps taught by Liu in the system of Yu as an alternate means of performing novel view synthesis (Yu, Abstract; Liu, Abstract).
As per claim 12, Yu and Liu disclose the system of claim 11. Liu discloses wherein determining the depth maps of the objects depicted in the set of content items comprises:
calculating, based on the set of content items, internal and external parameters of cameras from which the set of content items was captured (Liu, page 3, 3. Perpetual View Generation, We introduce perpetual view generation, the task of continually generating novel views of a scene ... Specifically, at test time, given an RGB image I0 and a camera trajectory ... the task is to output a new image sequence ... The trajectory is a series of 3D camera poses ... where R and t are 3D rotations and translations, respectively. In addition, each camera has an intrinsic matrix K. At test time the camera trajectory may be pre-specified ... At training time camera data is obtained from video clips via structure-from-motion);
determining, based on the internal and external parameters, coarse point clouds associated with the objects depicted in the set of content items (Liu, page 3, 3. Perpetual View Generation, each camera has an intrinsic matrix K. At test time the camera trajectory may be pre-specified ... At training time camera data is obtained from video clips via structure-from-motion; Liu, page 5, 4. Aerial Coastline Imagery Dataset (ACID), We collected 765 videos using keywords such as ‘coastal’ and ‘aerial footage’, and processed these videos with SLAM and structure-from-motion ... Disparity We use the off-the-shelf MiDaS single-view depth prediction method [22] to obtain disparity maps for every frame ... we use the sparse point-cloud computed for each scene during structure from motion ... We apply this scale and shift to the MiDaS output to obtain disparity maps; Liu, page 12, 1.2. Inference without Disparity Scaling, Scaling and shifting the disparity as described above requires a sparse point cloud, which is generated from SfM);
determining, based on the coarse point clouds, meshes of the objects depicted in the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors; Liu, page 5, 4. Aerial Coastline Imagery Dataset (ACID), we use the sparse point-cloud computed for each scene during structure from motion ... We apply this scale and shift to the MiDaS output to obtain disparity maps); and
determining, based on the meshes of the objects, the depth maps of the objects depicted in the content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors … we compute a per-pixel binary mask ... by thresholding the gradient of the disparity image ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask to give ˆIt+1. The renderer also outputs a depth map). The motivation would be the same as above in claim 11.
As per claim 13, Yu and Liu disclose the system of claim 11. Liu discloses wherein generating the first set of training data comprising the reconstructed content items comprises:
determining, based on the depth maps, pixels in each content item of the set of content items to be filtered out (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors ... we compute a per-pixel binary mask ... by thresholding the gradient of the disparity image);
filtering out the pixels in each content item of the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering, This use of the mask ensures that any regions in ˆIt+1 and ˆDt+1 that were occluded in It are masked out and set to zero (along with regions that were outside the field of view of the previous camera)); and
sampling remaining pixels in each content item of the set of content items to generate the reconstructed content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering, To avoid stretched triangle artefacts at depth discontinuities, and to aid our refinement network by identifying regions to be completed, we compute a per-pixel binary mask by thresholding the gradient of the disparity image computed with a Sobel filter ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask). The motivation would be the same as above in claim 11.
As per claim 14, Yu and Liu disclose the system of claim 13. Liu discloses wherein determining the pixels in each content item of the set of content items to be filtered out comprises:
determining pixels in each content item of the set of content items that are outside a threshold depth range indicated by a corresponding depth map of each content item, wherein the threshold depth range indicates a depth range of at least one object depicted in each content item (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering, To avoid stretched triangle artefacts at depth discontinuities, and to aid our refinement network by identifying regions to be completed, we compute a per-pixel binary mask by thresholding the gradient of the disparity image computed with a Sobel filter ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask). The motivation would be the same as above in claim 11.
As per claim 16, Yu discloses a non-transitory memory of a computing system storing instructions that, when executed by at least one processor of the computing system, cause the computing system to perform a method of training a neural radiance field-based (NeRF-based) machine learning model for object recognition (Yu, Abstract, We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images ... introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior), the method comprising:
obtaining a set of content items to train the NeRF-based machine learning model (Yu, Abstract, We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixel-NeRF by demonstrating it on multi-object ShapeNet scenes and real scenes from the DTU dataset);
determining depth of objects depicted in the set of content items (Yu, pages 11-14, B.1. Implementation Details, Encoder E ... we use a ResNet34 backbone and extract a feature pyramid by taking the feature maps prior to the first pooling operation and after the first ResNet 3 layers ... Hierarchical volume sampling To improve the sampling efficiency, in practice, we also use coarse and fine NeRF networks fc, ff ... we use 64 stratified uniform and 16 importance samples, and additionally take 16 fine samples with a normal distribution around the expected ray termination (i.e. depth) from the coarse model, to further promote denser sampling near the surface);
generating a first set of training data comprising reconstructed content items depicting only the objects (Yu, page 8, 5.2. Pushing the Boundaries of ShapeNet, we use the off-the-shelf PointRend segmentation model to remove the background before passing through our model; Yu, pages 11-14, B.1. Implementation Details, NeRF rendering hyperparameters We use positional encoding from NeRF for the spatial coordinates ... We use a white background color in NeRF to match the ShapeNet renderings, except in the DTU setup where a black background is used; Yu, page 15, B.2.1 Single-category ShapeNet, We train for 400000 iterations ... on a single Titan RTX. For efficiency, we sample rays from within a tight bounding box around the object; Yu, page 15, B.2.2 Category-agnostic ShapeNet, We train our model for 800000 iterations on the entire training set, where rays are sampled from within a tight bounding box; Yu, pages 17-19, B.2.3 Generalization to Novel Categories, We train our model for 680000 iterations across all instances of 3 categories: airplane, car, and chair. Rays are sampled from within a tight bounding box for the first 400000 iterations; Yu, pages 19-20, B.2.5 Sim2Real on Real Car Images, We use car images from the Stanford Cars dataset. PointRend is applied to the images to obtain foreground masks and bounding boxes ... For evaluation, we set the camera pose to identity and use the same sampling strategy and bounds as at train time for the single-category cars model);
generating a second set of training data comprising one or more optimal training paths associated with the set of content items (Yu, page 8, 5.2. Pushing the Boundaries of ShapeNet, we use the off-the-shelf PointRend segmentation model to remove the background before passing through our model; Yu, page 4, 4.1. Single-Image pixelNeRF, Given a input image I of a scene, we first extract a feature volume W = E(I). Then, for a point on a camera ray x, we retrieve the corresponding image feature by projecting x onto the image plane to the image coordinates π(x) using known intrinsics, then bilinearly interpolating between the pixelwise features to extract the feature vector W(π (x)). The image features are then passed into the NeRF network, along with the position and view direction ... In the few-shot view synthesis task, the query view direction is a useful signal for determining the importance of a particular image feature in the NeRF network. If the query view direction is similar to the input view orientation, the model can rely more directly on the input; if it is dissimilar, the model must leverage the learned prior); and
training the NeRF-based machine learning model based on the first set of training data and the second set of training data (Yu, page 8, 5.2. Pushing the Boundaries of ShapeNet, we use the off-the-shelf PointRend segmentation model to remove the background before passing through our model; Yu, pages 1-2, 1. Introduction, pixelNeRF, a learning framework that enables predicting NeRFs from one or several images in a feed-forward manner ... pixelNeRF takes spatial image features aligned to each pixel as an input. This image conditioning allows the framework to be trained on a set of multi-view images, where it can learn scene priors to perform view synthesis from one or few input views ... we condition NeRF on input images by first computing a fully convolutional image feature grid from the input image. Then for each query spatial point x and viewing direction d of interest in the view coordinate frame, we sample the corresponding image feature via projection and bilinear interpolation. The query specification is sent along with the image features to the NeRF network that outputs density and color, where the spatial image features are fed to each layer as a residual. When more than one image is available, the inputs are first encoded into a latent representation in each camera’s coordinate frame, which are then pooled in an intermediate layer prior to predicting the color and density).
Yu does not explicitly disclose the following limitations as further recited; however, Liu discloses:
determining depth maps of objects depicted in the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer … The renderer also outputs a depth map as seen from the new camera);
generating, based on the depth maps, a first set of training data comprising reconstructed content items depicting only the objects (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer … we compute a per-pixel binary mask … The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask to give ˆIt+1);
generating, based on the depth maps, a second set of training data comprising one or more optimal training paths associated with the set of content items, wherein the one or more optimal training paths are generated based at least in part on a dissimilarity matrix associated with the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors ... we compute a per-pixel binary mask ... by thresholding the gradient of the disparity image ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask to give ˆIt+1. The renderer also outputs a depth map as seen from the new camera, which we invert and multiply by the rendered mask to obtain ˆDt+1 ... Refinement and Synthesis. Given the rendered image ˆIt+1, its disparity ˆDt+1 and its mask Mt+1, our next task is to refine this image ... in our work the input is the rendered image, disparity, and mask. The generator output is a 4-channel image comprising RGB and disparity channels. We also train a single encoder that encodes the initial input image I0 to compute the latent noise ... Rinse and Repeat. A crucial part of our approach is to not just refine the RGB pixels, but also the disparity as well. Together the geometry (represented by a disparity) and RGB texture provide the necessary information for our renderer to produce the next view).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine the teachings of Liu and Yu because they are in the same field of endeavor. One skilled in the art would have been motivated to include the disparity map and depth map as taught by Liu in the system of Yu as an alternate means to perform novel view synthesis (Yu, Abstract; Liu, Abstract).
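For illustration only, the pixel-aligned feature lookup quoted above from Yu (projecting a query point to image coordinates π(x) and "bilinearly interpolating between the pixelwise features") can be sketched as follows. This is not pixelNeRF's code: the grid holds scalar values rather than feature vectors, and the coordinate convention is an assumption of this sketch.

```python
# Illustrative sketch of the W(pi(x)) lookup described in Yu: bilinearly
# interpolate a pixel-aligned grid at a continuous image coordinate (u, v).

def bilinear_sample(grid, u, v):
    """Sample a 2D grid (list of rows) at continuous coordinates (u, v)."""
    x0, y0 = int(u), int(v)
    x1 = min(x0 + 1, len(grid[0]) - 1)   # clamp at the right/bottom edge
    y1 = min(y0 + 1, len(grid) - 1)
    du, dv = u - x0, v - y0
    top = grid[y0][x0] * (1 - du) + grid[y0][x1] * du
    bot = grid[y1][x0] * (1 - du) + grid[y1][x1] * du
    return top * (1 - dv) + bot * dv
```

In pixelNeRF the interpolated feature is then fed, together with position and view direction, into the NeRF network, as the quoted passages state.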
As per claim 17, Yu and Liu disclose the non-transitory memory of claim 16. Liu discloses wherein determining the depth maps of the objects depicted in the set of images comprises:
calculating, based on the set of content items, internal and external parameters of cameras from which the set of content items was captured (Liu, page 3, 3. Perpetual View Generation, We introduce perpetual view generation, the task of continually generating novel views of a scene ... Specifically, at test time, given an RGB image I0 and a camera trajectory ... the task is to output a new image sequence ... The trajectory is a series of 3D camera poses ... where R and t are 3D rotations and translations, respectively. In addition, each camera has an intrinsic matrix K. At test time the camera trajectory may be pre-specified ... At training time camera data is obtained from video clips via structure-from-motion);
determining, based on the internal and external parameters, coarse point clouds associated with the objects depicted in the set of content items (Liu, page 3, 3. Perpetual View Generation, each camera has an intrinsic matrix K. At test time the camera trajectory may be pre-specified ... At training time camera data is obtained from video clips via structure-from-motion; Liu, page 5, 4. Aerial Coastline Imagery Dataset (ACID), We collected 765 videos using keywords such as ‘coastal’ and ‘aerial footage’, and processed these videos with SLAM and structure-from-motion ... Disparity We use the off-the-shelf MiDaS single-view depth prediction method [22] to obtain disparity maps for every frame ... we use the sparse point-cloud computed for each scene during structure from motion ... We apply this scale and shift to the MiDaS output to obtain disparity maps; Liu, page 12, 1.2. Inference without Disparity Scaling, Scaling and shifting the disparity as described above requires a sparse point cloud, which is generated from SfM);
determining, based on the coarse point clouds, meshes of the objects depicted in the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors; Liu, page 5, 4. Aerial Coastline Imagery Dataset (ACID), we use the sparse point-cloud computed for each scene during structure from motion ... We apply this scale and shift to the MiDaS output to obtain disparity maps); and
determining, based on the meshes of the objects, the depth maps of the objects depicted in the content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors … we compute a per-pixel binary mask ... by thresholding the gradient of the disparity image ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask to give ˆIt+1. The renderer also outputs a depth map). The motivation would be the same as above in claim 16.
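For illustration of the geometry pipeline quoted above (converting each pixel coordinate and its disparity into a 3D point in the camera coordinate system), the unprojection step can be sketched as follows. This is a pinhole-model approximation, not Liu's implementation; the intrinsic values and the depth-equals-inverse-disparity convention are assumptions of this sketch.

```python
# Illustrative sketch: lift each pixel (u, v) with disparity d into a 3D
# camera-space point using pinhole intrinsics (fx, fy, cx, cy). Depth is
# taken as 1/d. All parameter values here are assumed for illustration.

def unproject(u, v, d, fx, fy, cx, cy):
    """Back-project pixel (u, v) with disparity d to a 3D camera-space point."""
    z = 1.0 / d                    # depth as inverse disparity
    x = (u - cx) * z / fx          # pinhole model: u = fx * x / z + cx
    y = (v - cy) * z / fy
    return (x, y, z)

def image_to_points(disparity, fx=100.0, fy=100.0, cx=2.0, cy=2.0):
    """Lift every pixel of a small disparity map to a 3D point cloud."""
    return [unproject(u, v, disparity[v][u], fx, fy, cx, cy)
            for v in range(len(disparity))
            for u in range(len(disparity[0]))]
```

Connecting each lifted pixel to its neighbors, as the quoted passage describes, then yields the triangular mesh that is rendered from the new camera pose.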
As per claim 18, Yu and Liu disclose the non-transitory memory of claim 16. Liu discloses wherein generating the first set of training data comprising the reconstructed content items comprises:
determining, based on the depth maps, pixels in each content item of the set of content items to be filtered out (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering. Our render step R uses a differentiable mesh renderer. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system ... We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors ... we compute a per-pixel binary mask ... by thresholding the gradient of the disparity image);
filtering out the pixels in each content item of the set of content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering, This use of the mask ensures that any regions in ˆIt+1 and ˆDt+1 that were occluded in It are masked out and set to zero (along with regions that were outside the field of view of the previous camera)); and
sampling remaining pixels in each content item of the set of content items to generate the reconstructed content items (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering, To avoid stretched triangle artefacts at depth discontinuities, and to aid our refinement network by identifying regions to be completed, we compute a per-pixel binary mask by thresholding the gradient of the disparity image computed with a Sobel filter ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask). The motivation would be the same as above in claim 16.
As per claim 19, Yu and Liu disclose the non-transitory memory of claim 18. Liu discloses wherein determining the pixels in each content item of the set of content items to be filtered out comprises:
determining pixels in each content item of the set of content items that are outside a threshold depth range indicated by a corresponding depth map of each content item, wherein the threshold depth range indicates a depth range of at least one object depicted in each content item (Liu, pages 3-4, 3.1. Approach: Render, Refine, Repeat, Geometry and Rendering, To avoid stretched triangle artefacts at depth discontinuities, and to aid our refinement network by identifying regions to be completed, we compute a per-pixel binary mask by thresholding the gradient of the disparity image computed with a Sobel filter ... The 3D mesh, textured with the image It and mask Mt, is then rendered from the new view Pt+1, and the rendered image is multiplied element-wise by the rendered mask). The motivation would be the same as above in claim 16.
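For illustration of the claimed filtering step only (the rejection maps this limitation to Liu's gradient-based mask, as quoted above), retaining pixels whose depth falls within a threshold depth range covering an object can be sketched as follows. The function name, zero-fill convention, and range values are assumptions of this sketch.

```python
# Minimal sketch (assumed structure, not from the cited art): keep only
# pixels whose depth lies inside a threshold depth range [near, far]
# covering the depicted object; filter out (zero) everything else.

def filter_by_depth_range(image, depth_map, near, far):
    """Zero out pixels whose corresponding depth lies outside [near, far]."""
    h, w = len(image), len(image[0])
    return [[image[y][x] if near <= depth_map[y][x] <= far else 0
             for x in range(w)] for y in range(h)]
```

Sampling the remaining non-zero pixels would then yield a reconstructed content item depicting only the object, in the sense recited by the claims.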
Claim(s) 3 and 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yu, Alex, et al. "pixelNeRF: Neural Radiance Fields from One or Few Images." arXiv preprint arXiv:2012.02190 (2020), hereinafter, “Yu”, in view of Liu, Andrew, et al. "Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image." arXiv preprint arXiv:2012.09855 (2020), hereinafter, “Liu” as applied to claim 2 above, and further in view of Yariv, Lior, et al. "Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance." arXiv preprint arXiv:2003.09852 (2020), hereinafter, “Yariv”.
As per claim 3, Yu and Liu disclose the method of claim 2, wherein the internal and external parameters of the cameras are determined using a Structure from Motion (SfM) technique (Liu, page 5, 4. Aerial Coastline Imagery Dataset (ACID), We collected 765 videos … and processed these videos with SLAM and structure-from-motion ... We make the list of videos and the SfM camera trajectories available. See Fig. 4 for an illustrative example of our SfM pipeline … Disparity We use the off-the-shelf MiDaS single-view depth prediction method [22] to obtain disparity maps for every frame ... Because MiDaS disparity is only predicted up to scale and shift, it must first be rescaled to match our data. To achieve this, we use the sparse point-cloud computed for each scene during structure from motion. For each frame we consider only the points that were tracked in that frame, and apply least-squares to compute the optimal scale and shift which minimize the disparity error on these points. We apply this scale and shift to the MiDaS output to obtain disparity maps {Di} which are scale-consistent with the SfM camera trajectories {Pi} for each sequence; Liu, page 11, 1. Implementation Details, 1.1. ACID Collection and Processing, We take the top 10 video ids for each query as the candidate videos for our dataset. We process all the videos through a SLAM and SfM pipeline ... This returns the camera poses of the input video trajectory and 3D keypoints ... From the remaining set of sequences, we run the MiDaS system on every frame to get dense disparity ... we use the 3D keypoints produced by running SfM to compute scale and shift parameters for each frame that best fit the MiDaS disparity values to the 3D keypoints visible in that frame, so that the disparity images align with the SfM camera trajectories during training).
Yu and Liu do not explicitly disclose the following limitation as further recited; however, Yariv discloses:
and the meshes of the objects are determined using a Poisson reconstruction technique (Yariv, page 13, A.5 Baselines methods running details, Colmap. We used the official Colmap implementation ... For unknown cameras, only the intrinsic camera parameters are given, and we used the "mapper" module to extract camera poses. For fixed known cameras the GT poses are given as inputs. For both setups we run their "feature_extractor", "exhaustive_matcher", "point_triangulator", "patch_match_stereo" and "stereo_fusion" modules to generate point clouds. We also used their screened Poisson Surface Reconstruction (sPSR) for 3D mesh generation after cleaning the point clouds ... For rendering, we used their generated 3D meshes and cameras, and rendered images for each view using the "Pyrender" package).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine the teachings of Yariv with Yu and Liu because they are in the same field of endeavor. One skilled in the art would have been motivated to include the reconstruction algorithm as taught by Yariv in order to provide an alternate means for multi-view 3D surface reconstruction (Yariv, Abstract).
As per claim 4, Yu and Liu disclose the method of claim 2 (Yu, page 1, 1. Introduction, pixelNeRF takes spatial image features aligned to each pixel as an input. This image conditioning allows the framework to be trained on a set of multi-view images, where it can learn scene priors to perform view synthesis from one or few input views ... we condition NeRF on input images by first computing a fully convolutional image feature grid from the input image. Then for each query spatial point x and viewing direction d of interest in the view coordinate frame, we sample the corresponding image feature via projection and bilinear interpolation. The query specification is sent along with the image features to the NeRF network that outputs density and color, where the spatial image features are fed to each layer as a residual), but do not disclose the following limitation as further recited; however, Yariv discloses:
wherein the internal and external parameters of the cameras and the meshes of the objects are determined using a multiview depth fusion technique (Yariv, page 13, A.5 Baselines methods running details, Colmap. We used the official Colmap implementation ... For unknown cameras, only the intrinsic camera parameters are given, and we used the "mapper" module to extract camera poses. For fixed known cameras the GT poses are given as inputs. For both setups we run their "feature_extractor", "exhaustive_matcher", "point_triangulator", "patch_match_stereo" and "stereo_fusion" modules to generate point clouds. We also used their screened Poisson Surface Reconstruction (sPSR) for 3D mesh generation after cleaning the point clouds ... For rendering, we used their generated 3D meshes and cameras, and rendered images for each view using the "Pyrender" package).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine the teachings of Yariv with Yu and Liu because they are in the same field of endeavor. One skilled in the art would have been motivated to include the reconstruction algorithm as taught by Yariv in order to provide an alternate means for multi-view 3D surface reconstruction (Yariv, Abstract).
Allowable Subject Matter
Claims 7-10, 15 and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter: while the prior art discloses various means to synthesize and render novel views from a single input image or a few input images, it does not disclose the limitations, “wherein generating the second set of training data comprising the one or more optimal training paths comprises: determining depth maps matching metrics of the set of content items; determining silhouette matching metrics of the set of content items; generating, based on the depth maps matching metrics and the silhouette matching metrics, the dissimilarity matrix associated with the set of content items; generating, based on the dissimilarity matrix, a connected graph associated with the set of content items; and generating the one or more optimal training paths associated with the set of content items by applying a minimum spanning tree technique to the connected graph, wherein the minimum spanning tree technique rearranges the connected graph into multiple subtrees and each path of the multiple subtrees is an optimal training path” as recited in dependent claims 7, 15 and 20.
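For illustration of the allowable combination only (treating the dissimilarity matrix as a complete weighted graph and applying a minimum spanning tree technique), a minimal sketch follows. Prim's algorithm is one standard MST technique; the matrix values, function names, and the interpretation of MST edges as training-path segments are assumptions of this sketch, not disclosures of any cited reference.

```python
# Illustrative sketch: interpret a symmetric dissimilarity matrix as a
# complete weighted graph and extract a minimum spanning tree with Prim's
# algorithm. Tree edges can then be chained into training paths.

def minimum_spanning_tree(dissimilarity):
    """Return MST edges (i, j) for a symmetric dissimilarity matrix."""
    n = len(dissimilarity)
    in_tree = {0}               # grow the tree from node 0
    edges = []
    while len(in_tree) < n:
        # pick the cheapest edge leaving the current tree
        i, j = min(((a, b) for a in in_tree
                    for b in range(n) if b not in in_tree),
                   key=lambda e: dissimilarity[e[0]][e[1]])
        edges.append((i, j))
        in_tree.add(j)
    return edges
```

Each root-to-leaf path in the resulting tree is a candidate "optimal training path" in the sense recited by claims 7, 15 and 20, in that consecutive items along it are minimally dissimilar.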
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TRACY MANGIALASCHI whose telephone number is (571) 270-5189. The examiner can normally be reached M-F, 9:30 AM to 6:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le can be reached at (571) 272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TRACY MANGIALASCHI/Primary Examiner, Art Unit 2668