DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Priority
Acknowledgment is made of Applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d) based on Japanese application No. JP2022-038305, filed March 11, 2022. Copies of the certified papers required by 37 CFR 1.55 have been received.
Information Disclosure Statement
The information disclosure statement (IDS) dated August 1, 2024 has been considered and placed in the application file.
Drawings
The drawings are objected to because Figs. 7 and 8 are blurry. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The disclosure is objected to because of the following informalities:
In paragraph [0029], the “3D Map Data” should be rewritten, in accordance with Fig. 2, diagram A, as “3D Map Data 30”.
In paragraph [0033], “step Si” should be written as “step S1”.
In paragraph [0043], “…the neural network Fe…” should be written as “…the neural network Fo…”.
In paragraph [0065], “step 521” should be rewritten, in accordance with Fig. 9, as step “S21”.
In paragraph [0067], “step 523” should be rewritten, in accordance with Fig. 9, as step “S23”.
In paragraph [0068], “step 524” should be rewritten, in accordance with Fig. 9, as step “S24”.
Appropriate correction is required.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 14 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.
Claim 14 recites a computer-readable recording medium. The broadest reasonable interpretation of a claim drawn to a computer-readable recording medium typically covers both non-transitory tangible media and transitory propagating signals per se, in view of the ordinary and customary meaning of computer-readable recording media, particularly when the specification is silent. See MPEP 2111.01. When the broadest reasonable interpretation of a claim covers a signal per se, the claim must be rejected under 35 U.S.C. 101 as covering non-statutory subject matter. The USPTO recognizes that applicants may have claims directed to computer-readable media that cover signals per se, which the USPTO must reject under 35 U.S.C. 101 as covering non-statutory subject matter. A claim drawn to such a computer-readable medium that covers both transitory and non-transitory embodiments may be amended to narrow the claim to cover only statutory embodiments, and thereby avoid a rejection under 35 U.S.C. 101, by adding the limitation "non-transitory" to the claim. Such an amendment would typically not raise the issue of new matter, even when the specification is silent, because the broadest reasonable interpretation relies on the ordinary and customary meaning that includes signals per se.
Applicant’s specification in paragraph [0080] recites “The drive 210 drives a removable medium 211 such as a magnetic disk, an optical disc, a magneto-optical disk, or semiconductor memory…” and in paragraph [0082] recites “…the removable medium 211 serving as a package medium for supply. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.” Since Applicant’s disclosure does not limit the definition of “a computer-readable recording medium”, it could be a signal. As an additional note, a non-transitory computer-readable medium having executable programming instructions stored thereon is considered statutory, as a non-transitory computer-readable medium excludes transitory data signals.
Therefore, claim 14 does not fall within at least one of the four categories of patent eligible subject matter because the claim is directed to signals per se, and a transitory signal, while physical and real, does not possess concrete structure that would qualify as a device or part under the definition of a machine, is not a tangible article or commodity under the definition of a manufacture (even though it is man-made and physical in that it exists in the real world and has tangible causes and effects), and is not composed of matter such that it would qualify as a composition of matter. In re Nuijten, 500 F.3d 1346, 1356-1357, 84 USPQ2d 1495, 1501-03 (Fed. Cir. 2007). As such, a transitory, propagating signal does not fall within any statutory category. Mentor Graphics Corp. v. EVE-USA, Inc., 851 F.3d 1275, 1294, 112 USPQ2d 1120, 1133 (Fed. Cir. 2017); Nuijten, 500 F.3d at 1356-1357, 84 USPQ2d at 1501-03.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
Claims 1-14 are rejected under 35 U.S.C. 103 as being unpatentable over Mildenhall, Ben, et al. “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.” ArXiv.org, 2020, arxiv.org/abs/2003.08934v2 (hereinafter Mildenhall) in view of Barron, Jonathan T., et al. “Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields.” ArXiv.org, 2021, arxiv.org/abs/2111.12077v2 (hereinafter Barron).
Regarding claim 1, Mildenhall teaches a learning method performed by an information processing device (Mildenhall pg. 17, “We implement our model in Tensorflow”), the method comprising: based on a plurality of different viewpoints (Mildenhall Fig. 1, “We present a method that optimizes a continuous 5D neural radiance field representation (volume density and view-dependent color at any continuous location) of a scene from a set of input images. We use techniques from volume rendering to accumulate samples of this scene representation along rays to render the scene from any viewpoint. Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation.”) from low-precision three-dimensional data; (Mildenhall pg. 10, “We first show experimental results on two datasets of synthetic renderings of objects (Table 1, “Diffuse Synthetic 360◦” and “Realistic Synthetic 360◦”). The DeepVoxels [41] dataset contains four Lambertian objects with simple geometry. Each object is rendered at 512 × 512 pixels from viewpoints sampled on the upper hemisphere (479 as input and 1000 for testing).”) and performing learning processing of a neural network (Mildenhall Figs. 1-2, pg. 1, “Our method optimizes a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron or MLP) to represent this function by regressing from a single 5D coordinate (x,y,z,θ,φ) to a single volume density and view-dependent RGB color.”)
However, Mildenhall is silent about rendering a plurality of depth images and generating high-precision three-dimensional data from a two-dimensional image, based on the plurality of depth images.
Barron teaches rendering a plurality of depth images (Barron Fig. 1, pg. 8, “…our model produces extremely detailed depth maps while SVS and Deep Blending do not (the “SVS depths” we show were produced by COLMAP [42] and are used as input to the model). Figure 7 shows model outputs, though we urge the reader to view our supplemental video.”) that generates high-precision three-dimensional data from a two-dimensional image, based on the plurality of depth images (Barron Abstract, pg. 1, “Neural Radiance Fields (NeRF) synthesize highly realistic renderings of scenes by encoding the volumetric density and color of a scene within the weights of a coordinate based multi-layer perceptron (MLP). This approach has enabled significant progress towards photorealistic view synthesis [30]…’mip-NeRF 360’ that is capable of producing realistic renderings of these unbounded scenes (Figure 1)…Background regions of unbounded 360 scenes are observed by significantly sparser rays than the central region. This exacerbates the inherent ambiguity of reconstructing 3D content from 2D images.”; pg. 12, “We captured between 100 and 330 images in each scene.”; pg. 8, “…our model produces extremely detailed depth maps while SVS and Deep Blending do not (the “SVS depths” we show were produced by COLMAP [42] and are used as input to the model). Figure 7 shows model outputs, though we urge the reader to view our supplemental video.”)
Mildenhall and Barron are analogous art as both of them are related to view synthesis and scene representation.
Therefore, it would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified Mildenhall by rendering a plurality of depth images and generating high-precision three-dimensional data from a two-dimensional image, based on the plurality of depth images, as taught by Barron, and to use that teaching within Mildenhall’s scene representation.
The motivation for the above is to reconstruct 3D data with an accurate representation of arbitrary local spaces and real-world scenes.
Claim 13 is directed to an information processing device (Mildenhall pg. 17, “We implement our model in Tensorflow…”) and its scope and functions are substantially similar to the steps performed by the method of claim 1; therefore, claim 13 is also rejected with the same rationale as specified in the rejection of claim 1. The Examiner notes that, while not explicitly mentioned by Mildenhall, it would have been obvious to a person having ordinary skill in the art to use a computer or other device to run Tensorflow.
Claim 14 is directed to a computer-readable recording medium, having recorded thereon a program for executing processing of: (Mildenhall pg. 17, “We implement our model in Tensorflow…”) and its scope and functions are substantially similar to the steps performed by the method of claim 1; therefore, claim 14 is also rejected with the same rationale as specified in the rejection of claim 1. The Examiner notes that, while not explicitly mentioned by Mildenhall, it would have been obvious to a person having ordinary skill in the art to use a computer or other device to run Tensorflow, which is software capable of executing the processing of neural networks. The computer may have a physical storage component that acts as a non-transitory computer-readable recording medium.
Regarding claim 2, Mildenhall teaches wherein the learning processing includes learning a three-dimensional representation by the neural network (Mildenhall Figs. 1-2, pg. 1, “Our method optimizes a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron or MLP) to represent this function by regressing from a single 5D coordinate (x,y,z,θ,φ) to a single volume density and view-dependent RGB color.”)
Regarding claim 3, Mildenhall teaches wherein the three-dimensional representation by the neural network includes implicit function representation (Mildenhall Figs. 1-2, pg. 1, “Our method optimizes a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron or MLP) to represent this function by regressing from a single 5D coordinate (x,y,z,θ,φ) to a single volume density and view-dependent RGB color.”; pg. 2, “We find that the basic implementation of optimizing a neural radiance field representation for a complex scene does not converge to a sufficiently high resolution representation and is inefficient in the required number of samples per camera ray. We address these issues by transforming input 5D coordinates with a positional encoding that enables the MLP to represent higher frequency functions, and we propose a hierarchical sampling procedure to reduce the number of queries required to adequately sample this high-frequency scene representation…”; pg. 3, “A positional encoding to map each input 5D coordinate into a higher dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content.”)
[Image reproduced from Mildenhall.]
The Examiner notes that, unlike explicit representations such as point clouds or meshes, NeRF’s MLP is an implicit representation that encodes a scene as a continuous volumetric function rather than as a discrete, fixed set of 3D points. This positional encoding is a specific case of Fourier feature mapping, which maps low-dimensional inputs into a higher-dimensional space using sine and cosine functions.
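For clarity, and as reported in Mildenhall, the positional encoding discussed above has the general form γ(p) = (sin(2^0 π p), cos(2^0 π p), …, sin(2^(L−1) π p), cos(2^(L−1) π p)), where p is a normalized input coordinate and L is the number of frequency bands, so that each scalar input is mapped to a 2L-dimensional vector of sine and cosine features.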
Regarding claim 4, Mildenhall teaches wherein the learning processing includes learning Radiance Fields (Mildenhall Fig. 2, pg. 5, “An overview of our neural radiance field scene representation and differentiable rendering procedure. We synthesize images by sampling 5D coordinates (location and viewing direction) along camera rays (a), feeding those locations into an MLP to produce a color and volume density (b), and using volume rendering techniques to composite these values into an image (c). This rendering function is differentiable, so we can optimize our scene representation by minimizing the residual between synthesized and ground truth observed images (d).”)
[Figure reproduced from Mildenhall (Fig. 2): overview of the neural radiance field scene representation and differentiable rendering procedure.]
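For clarity, the volume rendering step referenced above is approximated in Mildenhall by the quadrature Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i, with T_i = exp(−Σ_{j=1}^{i−1} σ_j δ_j), where σ_i and c_i are the density and color predicted by the MLP at the i-th sample along the ray and δ_i = t_{i+1} − t_i is the distance between adjacent samples.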
Regarding claim 5, Mildenhall teaches further rendering a plurality of the two-dimensional images based on the plurality of viewpoints and the plurality of two-dimensional images (Mildenhall Fig. 1, “We present a method that optimizes a continuous 5D neural radiance field representation (volume density and view-dependent color at any continuous location) of a scene from a set of input images. We use techniques from volume rendering to accumulate samples of this scene representation along rays to render the scene from any viewpoint. Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation.”) from the low-precision three-dimensional data; (Mildenhall pg. 10, “We first show experimental results on two datasets of synthetic renderings of objects (Table 1, “Diffuse Synthetic 360◦” and “Realistic Synthetic 360◦”). The DeepVoxels [41] dataset contains four Lambertian objects with simple geometry. Each object is rendered at 512 × 512 pixels from viewpoints sampled on the upper hemisphere (479 as input and 1000 for testing).”)
[Figure reproduced: Fig. 1 (Mildenhall).]
However, Mildenhall is silent about performing the learning processing based on the plurality of depth images.
Barron teaches and performing the learning processing based on the plurality of depth images (Barron Fig. 1, pg. 8, “…our model produces extremely detailed depth maps while SVS and Deep Blending do not (the “SVS depths” we show were produced by COLMAP [42] and are used as input to the model). Figure 7 shows model outputs, though we urge the reader to view our supplemental video.”)
Mildenhall and Barron are analogous art as both of them are related to scene representation.
Therefore, it would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified Mildenhall by performing the learning processing based on the plurality of depth images as taught by Barron.
The motivation for the above is to reconstruct 3D data with an accurate representation of arbitrary local spaces and real-world scenes.
Regarding claim 6, Mildenhall teaches learning an implicit function so as to minimize an error, in the Radiance Fields, (Mildenhall Figs. 1-2, pg. 1, “Our method optimizes a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron or MLP) to represent this function by regressing from a single 5D coordinate (x,y,z,θ,φ) to a single volume density and view-dependent RGB color.”; pg. 2, “We find that the basic implementation of optimizing a neural radiance field representation for a complex scene does not converge to a sufficiently high resolution representation and is inefficient in the required number of samples per camera ray. We address these issues by transforming input 5D coordinates with a positional encoding that enables the MLP to represent higher frequency functions, and we propose a hierarchical sampling procedure to reduce the number of queries required to adequately sample this high-frequency scene representation…”; pg. 3, “A positional encoding to map each input 5D coordinate into a higher dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content.”) between an integral value of density of an object (Mildenhall pgs. 5-6, “Our 5D neural radiance field represents a scene as the volume density and directional emitted radiance at any point in space. We render the color of any ray passing through the scene using principles from classical volume rendering [16]. The volume density σ(x) can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at location x. The expected color C(r) of camera ray r(t) = o+td with near and far bounds tn and tf is:…”)
C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt, where T(t) = exp(−∫_{t_n}^{t} σ(r(s)) ds) (equation reproduced from Mildenhall)
corresponding to the plurality of viewpoints (Mildenhall Fig. 1, “We present a method that optimizes a continuous 5D neural radiance field representation (volume density and view-dependent color at any continuous location) of a scene from a set of input images. We use techniques from volume rendering to accumulate samples of this scene representation along rays to render the scene from any viewpoint. Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation.”)
However, Mildenhall is silent about the plurality of depth images.
Barron teaches and the plurality of depth images (Barron Fig. 1, pg. 8, “…our model produces extremely detailed depth maps while SVS and Deep Blending do not (the “SVS depths” we show were produced by COLMAP [42] and are used as input to the model). Figure 7 shows model outputs, though we urge the reader to view our supplemental video.”)
Mildenhall and Barron are analogous art as both of them are related to scene representation.
Therefore, it would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified Mildenhall to incorporate the plurality of depth images as taught by Barron.
The motivation for the above is to reconstruct 3D data with an accurate representation of arbitrary local spaces and real-world scenes.
Regarding claim 7, Mildenhall teaches learning the implicit function so as to further minimize an error (Mildenhall Figs. 1-2, pg. 1, “Our method optimizes a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron or MLP) to represent this function by regressing from a single 5D coordinate (x,y,z,θ,φ) to a single volume density and view-dependent RGB color.”; pg. 2, “We find that the basic implementation of optimizing a neural radiance field representation for a complex scene does not converge to a sufficiently high resolution representation and is inefficient in the required number of samples per camera ray. We address these issues by transforming input 5D coordinates with a positional encoding that enables the MLP to represent higher frequency functions, and we propose a hierarchical sampling procedure to reduce the number of queries required to adequately sample this high-frequency scene representation…”; pg. 3, “A positional encoding to map each input 5D coordinate into a higher dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content.”) between rendering images corresponding to the plurality of viewpoints (Mildenhall Fig. 1, “We present a method that optimizes a continuous 5D neural radiance field representation (volume density and view-dependent color at any continuous location) of a scene from a set of input images. We use techniques from volume rendering to accumulate samples of this scene representation along rays to render the scene from any viewpoint. Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation.”) obtained by volume rendering using the Radiance Fields and the plurality of two-dimensional images (Mildenhall pg. 5, “Our 5D neural radiance field represents a scene as the volume density and directional emitted radiance at any point in space. We render the color of any ray passing through the scene using principles from classical volume rendering [16]. The volume density σ(x) can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at location x. The expected color C(r) of camera ray r(t) = o+td with near and far bounds tn and tf is:…”)
Regarding claim 8, Mildenhall is silent about rendering the plurality of depth images.
Barron teaches rendering the plurality of depth images (Barron Fig. 1, pg. 8, “…our model produces extremely detailed depth maps while SVS and Deep Blending do not (the “SVS depths” we show were produced by COLMAP [42] and are used as input to the model). Figure 7 shows model outputs, though we urge the reader to view our supplemental video.”)
Mildenhall and Barron are analogous art as both of them are related to view synthesis and scene representation.
Therefore, it would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified Mildenhall by rendering a plurality of depth images as taught by Barron.
The motivation for the above is to reconstruct 3D data with an accurate representation of arbitrary local spaces and real-world scenes.
Regarding claim 9, Mildenhall is silent about fine-tuning the neural network by using an object image obtained by capturing a real object corresponding to the high-precision three-dimensional data.
Barron teaches fine-tuning the neural network by using an object image obtained by capturing a real object (Barron pg. 12, “We captured our dataset using two different mirrorless digital cameras.”) corresponding to the high-precision three-dimensional data (Barron Abstract, pg. 1, “Neural Radiance Fields (NeRF) synthesize highly realistic renderings of scenes by encoding the volumetric density and color of a scene within the weights of a coordinate based multi-layer perceptron (MLP). This approach has enabled significant progress towards photorealistic view synthesis [30]…’mip-NeRF 360’ that is capable of producing realistic renderings of these unbounded scenes (Figure 1)…Background regions of unbounded 360 scenes are observed by significantly sparser rays than the central region. This exacerbates the inherent ambiguity of reconstructing 3D content from 2D images.”; pg. 12, “We captured between 100 and 330 images in each scene.”; pg. 8, “…our model produces extremely detailed depth maps while SVS and Deep Blending do not (the “SVS depths” we show were produced by COLMAP [42] and are used as input to the model). Figure 7 shows model outputs, though we urge the reader to view our supplemental video.”)
Mildenhall and Barron are analogous art as both of them are related to view synthesis and scene representation.
Therefore, it would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified Mildenhall by fine-tuning the neural network by using an object image obtained by capturing a real object corresponding to the high-precision three-dimensional data, as taught by Barron, and to use that teaching within Mildenhall’s scene representation.
The motivation for the above is to reconstruct 3D data with an accurate representation of arbitrary local spaces and real-world scenes.
Regarding claim 10, Mildenhall teaches fine-tuning the neural network based on an error between a viewpoint image for any viewpoint obtained by inference using the neural network and the object image corresponding to the viewpoint (Mildenhall Fig. 2, pg. 9, “Our loss is simply the total squared error between the rendered and true pixel colors for both the coarse and fine renderings:
L = Σ_{r∈R} [ ‖Ĉ_c(r) − C(r)‖₂² + ‖Ĉ_f(r) − C(r)‖₂² ], where R is the set of rays in each batch, and C(r), Ĉ_c(r), and Ĉ_f(r) are the ground truth, coarse volume predicted, and fine volume predicted RGB colors for ray r respectively.
”) The Examiner notes that the multi-layer perceptron (MLP), which is the neural network, is fine-tuned based on the error (loss) between the rendered image (i.e., the any-viewpoint image) and the ground truth image (i.e., the object image).
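For illustration, minimizing the loss quoted above amounts to gradient-based updating of the MLP weights, e.g. θ ← θ − η ∇_θ L(θ), where θ denotes the network weights and η a learning rate (notation added here for illustration only); Mildenhall reports optimizing this objective with the Adam optimizer, so that the weights, and thus the rendered images, are iteratively adjusted to reduce the error relative to the ground truth images.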
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Mildenhall and Barron as applied to claims 1-14 above, and further in view of Vision & Graphics Seminar at MIT. “Jon Barron - Understanding and Extending Neural Radiance Fields.” YouTube, 26 Feb. 2021, www.youtube.com/watch?v=HfJpQCBTqZs (hereinafter Vision & Graphics Seminar at MIT).
Regarding claim 11, Mildenhall teaches wherein the viewpoint image is the two-dimensional image (Mildenhall Fig. 1, “We present a method that optimizes a continuous 5D neural radiance field representation (volume density and view-dependent color at any continuous location) of a scene from a set of input images. We use techniques from volume rendering to accumulate samples of this scene representation along rays to render the scene from any viewpoint. Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation.”)
However, Mildenhall and Barron are silent about for a viewpoint specified by a user.
Vision & Graphics Seminar at MIT teaches for a viewpoint specified by a user:
[Screenshot reproduced from the Vision & Graphics Seminar at MIT video, in which Jon Barron explains that viewing direction can be appended into the neural network by a user.]
The Examiner notes that the Applicant does not specify how or in what way a viewpoint should be “specified by a user”, whether verbally, in writing, or otherwise. Therefore, under the broadest reasonable interpretation, a viewpoint can be specified by a user based on the plurality of input images that the user chooses to use for the neural network. Mildenhall uses a “set of 100 input views” (viewpoints) and then chooses to “show two novel views”. A user can “specify” which viewpoint input images to use for the multi-layer perceptron in Mildenhall. Furthermore, the video by Vision & Graphics Seminar at MIT mentions “…we just append the viewing direction…it’s injected slightly later…”, wherein the viewing direction may be injected at a later time by a user into the computer, as is commonly known in the art.
Mildenhall, Barron, and Vision & Graphics Seminar at MIT are analogous art as all of them are related to scene representation.
Therefore, it would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified Mildenhall, as modified by Barron, to include a viewpoint specified by a user, as taught by Vision & Graphics Seminar at MIT.
The motivation for the above is to reconstruct 3D data with an accurate representation of arbitrary local spaces and real-world scenes.
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Mildenhall and Barron as applied to claims 1-14 above, and further in view of Baruch, Gilad, et al. “ARKitScenes -- a Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data.” ArXiv.org, 2021, arxiv.org/abs/2111.08897v2 (hereinafter Baruch).
Regarding claim 12, Mildenhall and Barron are silent about including three-dimensional map data.
Baruch teaches including three-dimensional map data (Baruch Abstract, pg. 2, “This work provides the first large-scale dataset that is captured with Apple’s LiDAR scanner using handheld devices…to the raw and processed data above, we provide high quality ground truth and demonstrate its usability in two downstream supervised learning tasks: 3D object detection and color-guided depth upsampling. To our best knowledge, this is the first dataset that provides high quality ground truth depth data registered to frames from a widely available depth sensor…ARKitScenes is the largest indoor 3D dataset consisting of 5,048 captures of 1,661 unique scenes.”)
Mildenhall, Barron, and Baruch are analogous art as all of them are related to scene representation.
Therefore, it would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified Mildenhall, as modified by Barron, to include three-dimensional map data as taught by Baruch.
The motivation for the above is to improve the continuous scene representation with real-world map data.
Pertinent Art
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Soltani, Arsalan, et al. “Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes with Deep Generative Networks.” Thecvf.com, 2017, pp. 1511–1519, openaccess.thecvf.com/content_cvpr_2017/html/Soltani_Synthesizing_3D_Shapes_CVPR_2017_paper.html, discloses multi-view depth maps.
P. Nguyen, A. Karnewar, L. Huynh, E. Rahtu, J. Matas and J. Heikkila, "RGBD-Net: Predicting Color and Depth Images for Novel Views Synthesis," 2021 International Conference on 3D Vision (3DV), London, United Kingdom, 2021, pp. 1095-1105, doi: 10.1109/3DV53792.2021.00117., discloses depth networks for generating depth maps on various 3D scenes and 3D point clouds that are more accurate than multi-view stereo methods and achieve faster rendering than NeRF and its variants.
Q. Wang et al., "IBRNet: Learning Multi-View Image-Based Rendering," 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 4688-4697, doi: 10.1109/CVPR46437.2021.00466., discloses view synthesis of complex scenes by interpolating a sparse set of nearby views using a network architecture that includes a multilayer perceptron and a ray transformer at continuous 5D locations.
Li, Jiaxin, et al. “MINE: Towards Continuous Depth MPI with NeRF for Novel View Synthesis.” ArXiv.org, 2021, arxiv.org/abs/2103.14910., discloses view synthesis and depth estimation of Multiplane Images given a single image as input; however, it does not take viewing direction as input. It uses various real-world datasets.
Niemeyer, Michael, et al. “RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs.” ArXiv.org, 2021, arxiv.org/abs/2112.00724., discloses regularizing the geometry and appearance of patches rendered from unobserved viewpoints and annealing the ray sampling space during training.
Neff, T., et al. “DONeRF: Towards Real‐Time Rendering of Compact Neural Radiance Fields Using Depth Oracle Networks.” Computer Graphics Forum, vol. 40, no. 4, July 2021, pp. 45–59, https://doi.org/10.1111/cgf.14340., discloses a depth oracle network that predicts ray sample locations for each view ray with a single network evaluation. It shows that using a classification network around logarithmically discretized and spherically warped depth values is essential to encode surface locations rather than directly estimating depth.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMELIA VELAZQUEZ VALENCIA whose telephone number is (571) 272-7418. The examiner can normally be reached M-F, 8:30AM-5:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Said A. Broome, can be reached at (571) 272-2931. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/A.V.V/Examiner, Art Unit 2612
/Said Broome/Supervisory Patent Examiner, Art Unit 2612
Date: 3/9/2026