DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This is an initial Office action in response to communication(s) filed on February 27, 2024.
Claims 1-20 are pending.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on March 7, 2024 and June 9, 2025 were filed in compliance with the provisions of 37 CFR 1.97 and 1.98. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-5, 9-11, 13-14, and 16-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wofk (U.S. Pub. No. 2022/0343521 A1, hereinafter "Wofk").
With regard to claim 1, the claim is drawn to an apparatus (see Wofk, e.g., abstract and para. 33, disclosing "[0033] Methods and apparatus for metric depth estimation using a monocular visual-inertial system are disclosed herein…"), comprising:
one or more memories configured to store an input image (see Wofk, e.g., para. 54, disclosing that "…The example database 235 of the illustrated example of FIG. 2 is implemented by any memory, storage device and/or storage disc for storing data such as, for example, flash memory, magnetic media, optical media, etc. Furthermore, the data stored in the example database 235 may be in any data format such as, for example, binary data, comma delimited data, tab delimited data, structured query language (SQL) structures, image data, etc. While the illustrated example database 235 is illustrated as a single element, the database 235 and/or any other data storage elements described herein may be implemented by any number and/or type(s) of memories."); and
one or more processors (see Wofk, e.g., fig. 2 and para. 55, disclosing the neural network processor 255), coupled to the one or more memories, configured to:
generate, by an encoder, an encoded feature representation of the input image (see Wofk, e.g., fig. 9 and para. 103, disclosing that "… an example encoder 910 incorporates an EfficientNet-Lite3 backbone, with skip connections propagating out features at four levels. An example decoder 915 includes four FeatureFusion blocks that progressively upsample and merge features from the encoder 910 and the skip connections. An example output convolution block 920 can include Rectified Linear Units (RELUs) for generating an example final depth map 930. The second network architecture 932 includes diagrams of the FeatureFusion block structure 935 and the ResidualConvUnit 940.");
generate, by a plurality of depth map prediction pathways, a plurality of outputs corresponding to a plurality of predicted depth maps based on the encoded feature representation (see Wofk, e.g., fig. 9 and para. 103, disclosing the same encoder and decoder arrangement quoted above, in which the decoder 915 and the output convolution block 920 generate the final depth map 930); and
generate an uncertainty metric indicating an uncertainty of the plurality of predicted depth maps based on one or more variances between the plurality of outputs (see Wofk, e.g., fig. 10, figs. 15A-C, and para. 104, disclosing the metric depth error map, and further disclosing that "…For example, performance is qualitatively evaluated by comparing metric depth error maps computed for globally aligned depth (GA error 1025) to those computed for densely scaled depth (SML error 1030). In depth maps, brighter is closer and darker is farther. In error maps, positive inverse depth error is farther than ground truth and negative inverse depth error is closer than ground truth. Dense scale alignment with the SML network 165 improves metric depth accuracy over global alignment alone, as seen by the whiter regions in the error maps. The bottom two samples are particularly challenging cases due to low light conditions. For example, a whiter region in the error map indicates that the SML network 165 improved metric depth accuracy in that region. The first sample depicts a neighborhood scene where the building towards center-right is pushed further back under dense scale alignment, as confirmed by a reduction in negative error in inverse depth. A tree shown in the RGB image 1005 behind the pool is brought closer, as shown by the reduction in positive error. The latter two samples depict significantly more challenging scenes due to low light as well as proximity to walls and the ground. In both, the SML network 165 still realigns surfaces towards correct metric depth…").
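For purposes of illustrating the claimed arrangement only (a minimal sketch under stated assumptions, not Wofk's implementation and not the applicant's), a shared encoder may feed a plurality of decoder pathways, with the per-pixel variance across the resulting depth maps serving as the uncertainty metric; all module names and layer sizes below are hypothetical:

```python
# Minimal sketch (hypothetical): shared encoder, K decoder pathways,
# per-pixel variance across the K predicted depth maps as uncertainty.
import torch
import torch.nn as nn

class MultiHeadDepthNet(nn.Module):
    def __init__(self, num_heads: int = 4):
        super().__init__()
        # Encoder: two stride-2 conv stages (stand-in for a real backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # K independent decoder pathways, each upsampling back to input size.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            )
            for _ in range(num_heads)
        )

    def forward(self, image: torch.Tensor):
        feats = self.encoder(image)  # encoded feature representation
        depths = torch.stack([h(feats) for h in self.heads], dim=0)  # (K, N, 1, H, W)
        uncertainty = depths.var(dim=0, unbiased=False)  # variance across pathways
        return depths.mean(dim=0), uncertainty

net = MultiHeadDepthNet()
depth, unc = net(torch.randn(1, 3, 64, 64))
print(depth.shape, unc.shape)  # torch.Size([1, 1, 64, 64]) twice
```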
With regard to claim 2, the claim is drawn to the apparatus of claim 1, wherein each of the plurality of depth map prediction pathways comprises a respective decoder configured to:
receive as input the encoded feature representation (see Wofk, e.g., fig. 9 and para. 103, disclosing that "an example decoder 915 includes four FeatureFusion blocks that progressively upsample and merge features from the encoder 910 and the skip connections…"); and
generate as output a respective predicted depth map of the plurality of predicted depth maps based on the encoded feature representation (see Wofk, e.g., fig. 9 and para. 103, disclosing that "[0103] FIG. 9 illustrates an example network architecture(s) 900, 932, and/or 965 used in conjunction with the visual-inertial depth estimation pipeline 100 of FIG. 1. For example, the ScaleMapLearner (SML) network 165 that performs dense scale alignment on globally aligned metric depth maps can be based on the MiDaS-small architecture (e.g., a mobile friendly version in the robust and generalizable MiDaS family of monocular depth estimation models). In the example of FIG. 9, the network architecture(s) include a first network architecture 900 for MiDaS-small, a second network architecture 932 that shows a FeatureFusion block structure 935 and a ResidualConvUnit 960 as part of the MiDaS-small network architecture, and a third network architecture 965 that illustrates the SML network 165 using the MiDaS-small architecture blocks. For example, the first network architecture 900 for MiDaS-small is designed for monocular depth estimation based on an input RGB image 905. An example encoder 910 incorporates an EfficientNet-Lite3 backbone, with skip connections propagating out features at four levels. An example decoder 915 includes four FeatureFusion blocks that progressively upsample and merge features from the encoder 910 and the skip connections. An example output convolution block 920 can include Rectified Linear Units (RELUs) for generating an example final depth map 930. The second network architecture 932 includes diagrams of the FeatureFusion block structure 935 and the ResidualConvUnit 940. These blocks are parametrized by the number of input (IF) and output (OF) features. In the example of the third network architecture 965, the ScaleMapLearner (SML) network 165 uses the MiDaS-small architecture blocks shown in connection with the first network architecture 900 and the second network architecture 932. For example, the encoder 910 receives the globally-aligned depth map 155 and/or the scale map scaffolding 160. Whereas MiDaS-small outputs affine-invariant depth maps, the SML network 165 outputs metrically accurate depth maps (e.g., output depth map 175). By default, the SML network 165 regresses scale residuals with a single OutputConv head 920. For ablation experiments where regression of dense shift is performed in addition to scale residuals, a second identical OutputConv head 922 can be used in parallel, and the encoder 910 and feature fusion blocks remain common to both regression tasks.").
With regard to claim 3, the claim is drawn to the apparatus of claim 2, wherein for each of the plurality of depth map prediction pathways, the respective decoder comprises one or more convolutional layers and one or more upsampling layers (see Wofk, e.g., paras. 9 and 103, para. [0103] being quoted in full in the rejection of claim 2 above; further, para. 49 discloses that "Additionally, hyperparameters are used as part of the training process to control how the learning is performed (e.g., a learning rate, a number of layers to be used in the machine learning model, etc.). Hyperparameters are defined to be training parameters that are determined prior to initiating the training process."; see also the architecture illustrated in fig. 9).
With regard to claim 4, the claim is drawn to the apparatus of claim 3, wherein the one or more convolutional layers and the one or more upsampling layers correspond to symmetric counterparts of convolutional layers and downsampling layers in the encoder (see Wofk, e.g., the illustration of the architecture 900 in fig. 9, in which the encoder 910 symmetrically feeds the decoder 915 with four FeatureFusion blocks).
With regard to claim 5, the claim is drawn to the apparatus of claim 2, wherein for each of the plurality of depth map prediction pathways, the respective decoder comprises a respective output convolutional head (see Wofk, e.g., fig. 9, in which the architecture 900 illustrates the decoder 915 and the output convolution block 920; further, para. 103 discloses that "An example output convolution block 920 can include Rectified Linear Units (RELUs) for generating an example final depth map 930").
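As a further illustrative sketch of the decoder structure addressed in claims 2-5 (hypothetical sizes; this is not the MiDaS-small/FeatureFusion code quoted above), one decoder pathway may interleave convolutional layers with upsampling layers that mirror the encoder's stride-2 stages and terminate in an output convolutional head:

```python
# Sketch of a single decoder pathway (hypothetical sizes): interleaved
# convolution and upsampling layers mirror two stride-2 encoder stages,
# and a final output convolutional head with ReLU produces the depth map.
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),   # mirrors encoder stage 2
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),   # mirrors encoder stage 1
    nn.Conv2d(32, 1, 3, padding=1), nn.ReLU(),    # output convolutional head
)

feats = torch.randn(1, 64, 16, 16)  # encoded feature representation
depth = decoder(feats)
print(depth.shape)  # torch.Size([1, 1, 64, 64])
```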
With regard to claim 9, the claim is drawn to the apparatus of claim 1, wherein the encoder comprises a neural network architecture including convolutional blocks between one or more encoding stages (see Wofk, e.g., paras. 119 and 129, disclosing that "[0119] FIG. 18 is a block diagram of an example processing platform structured to execute the instructions of FIG. 2 to implement the example global alignment generator circuitry 120 of FIGS. 1. The processor platform 1800 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad™), a personal digital assistant (PDA), an Internet appliance, a headset or other wearable device, or any other type of computing device…").
With regard to claim 10, the claim is drawn to the apparatus of claim 9, wherein one or more of the convolutional blocks feed a decoding stage of one or more decoding stages of the plurality of depth map prediction pathways (see Wofk, e.g., fig. 9, in which the encoder 910 feeds the decoder 915 as illustrated).
With regard to claim 11, the claim is drawn to the apparatus of claim 1, wherein the plurality of outputs comprise the plurality of predicted depth maps (see Wofk, e.g., fig. 9 and para. 103, disclosing that "… For example, the encoder 910 receives the globally-aligned depth map 155 and/or the scale map scaffolding 160. Whereas MiDaS-small outputs affine-invariant depth maps, the SML network 165 outputs metrically accurate depth maps (e.g., output depth map 175). By default, the SML network 165 regresses scale residuals with a single OutputConv head 920. For ablation experiments where regression of dense shift is performed in addition to scale residuals, a second identical OutputConv head 922 can be used in parallel, and the encoder 910 and feature fusion blocks remain common to both regression tasks").
With regard to claim 13, the claim is drawn to the apparatus of claim 1, wherein the plurality of outputs comprise features output from one or more respective intermediate layers of each of the plurality of depth map prediction pathways (see Wofk, e.g., paras. 49-51 and 79, disclosing that "[0049] Many different types of machine learning models and/or machine learning architectures exist. In examples disclosed herein, deep neural network models are used. In general, machine learning models/architectures that are suitable to use in the example approaches disclosed herein can be based on supervised learning and/or semi-supervised learning. In general, implementing a ML/AI system involves two phases, a learning/training phase and an inference phase. In the learning/training phase, a training algorithm is used to train a model to operate in accordance with patterns and/or associations based on, for example, training data. In general, the model includes internal parameters that guide how input data is transformed into output data, such as through a series of nodes and connections within the model to transform input data into output data. Additionally, hyperparameters are used as part of the training process to control how the learning is performed (e.g., a learning rate, a number of layers to be used in the machine learning model, etc.). Hyperparameters are defined to be training parameters that are determined prior to initiating the training process.", and "[0079] In examples disclosed herein, the SML network 165 (e.g., based on the scale map learner circuitry 315) can be constructed using a MiDaS family of monocular depth estimation models (e.g., MiDaS-Small). For example, FIG. 9 provides an architecture diagram for MiDaS-Small and shows the SML network 165 using MiDaS-small blocks. For example, an encoder backbone can be initialized with pretrained ImageNet weights, while other layers are initialized randomly.").
With regard to claim 14, the claim is drawn to the apparatus of claim 13, wherein the one or more respective intermediate layers comprise one or more respective convolutional kernels (see Wofk, e.g., paras. 40, 103 (fig. 9, convolution block 920), and 111, disclosing that "[0111] FIG. 13 illustrates example tabulated data 1300 associated with input and regressed modalities in ScaleMapLearner (SML) on TartanAir and with zero-shot testing on VOID. For example, a number of input and regressed data modalities can be tested when designing the SML network 165. As previously described, the SML network 165 receives two input channels (e.g., globally aligned inverse depth {tilde over (z)} and a scale map scaffolding). However, four additional inputs can be tested: (1) a confidence map derived from a binary map pinpointing known sparse depth locations (e.g., first dilated with a 7×7 circular kernel and then blurred with a 5×5 Gaussian kernel to mimic confidence spread around a fixed known point), (2) a gradient map (e.g., computed using the Scharr operator), (3) a grayscale conversion of the original RGB image, and (4) the RGB image. Inputs are concatenated in the channel dimension and fed into the SML network 165 as a single tensor. Tabulated data 1300 shows the impact of different input combinations on the metric accuracy of depth output by the SML network 165 after retraining. In the example of FIG. 13, the results include input modality combination(s) 1305, regressing scale shift indication 1310, TartanAir-based results 1315, and VOID-based results 1320…").
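The confidence-map recipe quoted from para. [0111] can be illustrated with a short OpenCV sketch; the image size and sparse-depth locations below are assumed for illustration and are not taken from Wofk:

```python
# Sketch of the quoted recipe: dilate a binary map of known sparse-depth
# locations with a 7x7 circular kernel, then blur with a 5x5 Gaussian to
# mimic confidence spread around each known point.
import cv2
import numpy as np

sparse_mask = np.zeros((64, 64), dtype=np.float32)
sparse_mask[[10, 30, 50], [12, 40, 20]] = 1.0   # toy sparse-depth locations

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))  # circular kernel
dilated = cv2.dilate(sparse_mask, kernel)
confidence = cv2.GaussianBlur(dilated, (5, 5), 0)              # 5x5 Gaussian

print(confidence.max(), (confidence > 0).sum())
```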
With regard to claim 16, the claim is drawn to the apparatus of claim 1, further comprising at least one image sensor configured to acquire the input image (see Wofk, e.g., para. 38, disclosing that "[0038] FIG. 1 illustrates an example visual-inertial depth estimation pipeline 100 disclosed herein, including an input processing stage, a global scale and shift alignment stage, and a learning-based dense scale alignment stage associated with an example global alignment generator circuitry 120 and an example scale aligner circuitry 150. In the example of FIG. 1, a single RGB image 105 derived from an RGB image sequence 110 (e.g., sequence of individual RBG images) can be provided to a monocular depth estimator circuitry 125 as part of the global alignment generator circuitry 120, as described in connection with FIG. 2. In parallel, the RBG image sequence 110 and corresponding synchronized inertial measurement unit (IMU) data 115 can be provided to an example visual-inertial odometry sensor circuitry 130 associated with the global alignment generator circuitry 120, as described in connection with FIG. 2. As such, input data processing can be performed by the monocular depth estimator circuitry 125 and/or the visual-inertial odometry sensor circuitry 130…").
With regard to claim 17, the claim is drawn to the apparatus of claim 1, further comprising a modem, coupled to one or more antennas, and coupled to the one or more processors, wherein the modem and the one or more antennas are configured to receive the input image (see Wofk, e.g., para. 125, disclosing "[0125] The interface circuit 1820 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 1826. The communication can be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-site wireless system, a cellular telephone system, etc.").
With regard to claim 18, the claim is drawn to the apparatus of claim 17, wherein the modem and the one or more antennas are integrated into one of a vehicle, an extra-reality device, or a mobile device (see Wofk, e.g., paras. 125 and 129, disclosing that "[0129] FIG. 19 is a block diagram of an example processing platform 1900 structured to execute the instructions of FIG. 5 to implement the example first computing system 230 of FIG. 2. The processor platform 1900 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad™), a personal digital assistant (PDA), an Internet appliance, or any other type of computing device.").
With regard to claim 19, the claim is drawn to a method for generating an uncertainty metric (see Wofk, e.g., abstract and para. 33, disclosing "[0033] Methods and apparatus for metric depth estimation using a monocular visual-inertial system are disclosed herein…"), comprising:
generating, by an encoder, an encoded feature representation of an input image (see Wofk, e.g., fig. 9 and para. 103, as quoted in the rejection of claim 1 above);
generating, by a plurality of depth map prediction pathways, a plurality of outputs corresponding to a plurality of predicted depth maps based on the encoded feature representation (see Wofk, e.g., fig. 9 and para. 103, as quoted in the rejection of claim 1 above); and
generating an uncertainty metric indicating an uncertainty of the plurality of predicted depth maps based on one or more variances between the plurality of outputs (see Wofk, e.g., fig. 10, figs. 15A-C, and para. 104, disclosing the metric depth error map as quoted in the rejection of claim 1 above).
With regard to claim 20, the claim is drawn to a non-transitory computer-readable medium comprising instructions (see Wofk, e.g., para. 137, disclosing that "[0137] The machine executable instructions 420 of FIG. 5 may be stored in the mass storage device 1928, in the volatile memory 1914, in the non-volatile memory 1916, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD"), which when executed by one or more processors, cause the one or more processors to perform operations comprising:
generating, by an encoder, an encoded feature representation of an input image (see Wofk, e.g., fig. 9 and para. 103, as quoted in the rejection of claim 1 above);
generating, by a plurality of depth map prediction pathways, a plurality of outputs corresponding to a plurality of predicted depth maps based on the encoded feature representation (see Wofk, e.g., fig. 9 and para. 103, as quoted in the rejection of claim 1 above); and
generating an uncertainty metric indicating an uncertainty of the plurality of predicted depth maps based on one or more variances between the plurality of outputs (see Wofk, e.g., fig. 10, figs. 15A-C, and para. 104, disclosing the metric depth error map as quoted in the rejection of claim 1 above).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 12 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Wofk as applied to claim 1 above, and further in view of Sinha et al. (Samarth Sinha et al., "DIBS: Diversity inducing Information Bottleneck in Model Ensembles," arXiv.org, Cornell University Library, March 10, 2020, XP081618258, hereinafter "Sinha"; a copy of Sinha has been placed in the record).
With regard to claim 12, the claim is drawn to the apparatus of claim 11, wherein the one or more variances comprise at least one of: block-level statistical variance between one or more portions of the plurality of predicted depth maps; or pixel-level statistical variance between the plurality of predicted depth maps.
The teachings of Wofk do not explicitly disclose the aspect relating to "wherein the one or more variances comprise at least one of: block-level statistical variance between one or more portions of the plurality of predicted depth maps; or pixel-level statistical variance between the plurality of predicted depth maps". However, Sinha discloses an analogous invention relating to a principled scheme of ensemble learning that jointly maximizes data likelihood, constrains information flow through a bottleneck to ensure the ensembles capture only relevant statistics of the input, and maximizes a diversity inducing objective to ensure that the multiple plausible hypotheses learned are diverse. Instead of K different neural nets, Sinha uses K different stochastic decoder heads, as shown in Sinha's Fig. 1, and explicitly maximizes diversity among the ensembles by an adversarial loss. More specifically, Sinha, in section 3.3 (predictive uncertainty estimation), discloses that a prediction is carried out based on the variance, stating that the "…proposed method is able to meaningfully capture both epistemic and aleatoric uncertainty. Aleatoric uncertainty is typically modeled as the variance of the output distribution, which can be obtained by outputting a distribution…".
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wofk to incorporate the limitation(s) discussed above and taught by Sinha, as the cited references are at least analogous art, if not in the same field of endeavor, relating to image processing. One of ordinary skill would have been motivated to make the combination in order "… to obtain more reliable aleatoric uncertainty estimate and hence better predictive uncertainty overall…" (see Sinha, section 3.3).
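For clarity regarding the two variance notions recited in claim 12, the following NumPy sketch (shapes and block size are assumptions, not taken from either reference) computes a pixel-level statistical variance across a plurality of predicted depth maps and a block-level statistic over non-overlapping portions:

```python
# Sketch (hypothetical shapes): pixel-level variance across K predicted
# depth maps, and block-level variance over non-overlapping BxB portions.
import numpy as np

K, H, W, B = 4, 64, 64, 8
depth_maps = np.random.rand(K, H, W)                 # K predicted depth maps

pixel_var = depth_maps.var(axis=0)                   # (H, W) per-pixel variance

blocks = depth_maps.reshape(K, H // B, B, W // B, B) # split maps into BxB blocks
block_var = blocks.var(axis=0).mean(axis=(1, 3))     # (H//B, W//B) per-block statistic

print(pixel_var.shape, block_var.shape)
```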
With regard to claim 15, the claim is drawn to the apparatus of claim 13, wherein the plurality of outputs comprise the plurality of predicted depth maps, and wherein to generate the uncertainty metric, the one or more processors are configured to use an error prediction machine learning model with the one or more variances as input to the error prediction machine learning model (see Wofk, e.g., fig. 10, figs. 15A-C, and para. 104, disclosing the metric depth error map as quoted in the rejection of claim 1 above; see also Sinha, section 3.3 (predictive uncertainty estimation), disclosing that a prediction is carried out based on the variance: "…proposed method is able to meaningfully capture both epistemic and aleatoric uncertainty. Aleatoric uncertainty is typically modeled as the variance of the output distribution, which can be obtained by outputting a distribution…").
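As a hedged sketch of the arrangement recited in claim 15 (every component below is hypothetical and not taught verbatim by Wofk or Sinha), a small error prediction machine learning model may take the inter-pathway variance map as input and regress a per-pixel uncertainty metric:

```python
# Sketch: a small error-prediction network takes the variance map
# computed between predicted depth maps and outputs a non-negative
# per-pixel uncertainty metric.
import torch
import torch.nn as nn

error_predictor = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Softplus(),  # non-negative uncertainty
)

variance_map = torch.rand(1, 1, 64, 64)   # variance between predicted depth maps
uncertainty = error_predictor(variance_map)
print(uncertainty.shape)  # torch.Size([1, 1, 64, 64])
```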
Allowable Subject Matter
Claims 6-8 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims and overcoming the corresponding rejections and/or objections (if any) set forth in the Office action above.
The following is a statement of reasons for the indication of allowable subject matter:
With regard to claim 6, the closest prior art of record, Wofk and Sinha, does not disclose or suggest, among the other limitations, the additional required limitation of "the apparatus of claim 1, wherein the plurality of depth map prediction pathways share at least one decoder component". These additional features, in combination with all the other features required by the claimed invention, are neither taught nor suggested by the prior art of record.
With regard to claims 7-8, the claims depend directly or indirectly from claim 6, and each encompasses the required limitations recited in claim 6 discussed above.
Therefore, claims 6-8 are objected to.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Kang et al. (U.S. Pat/Pub No. 200) disclose an invention relating to an apparatus and a method for coding machine vision data using prediction.
The Art Unit (or Workgroup) location of your application in the USPTO has changed. To aid in correlating any papers for this application, all further correspondence regarding this application should be directed to Art Unit 2681.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jacky X. Zheng, whose telephone number is (571) 270-1122. The examiner can normally be reached Monday - Friday, 9:00 am - 5:00 pm, alternate Fridays off.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Akwasi Sarpong can be reached on (571) 272-3438. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JACKY X ZHENG/Primary Examiner, Art Unit 2681