DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1 and 16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wofk et al. (US 2022/0343521).
Regarding claim 1, Wofk et al. discloses a method of estimating a depth of an object within an image, comprising:
obtaining single-frame image data (“a single RGB image 105” at paragraph 0038, line 7);
obtaining scaling factor data based on the single-frame image data (“The scale aligner circuitry 150 can be used to perform dense (local) scale alignment given that global alignment may not adequately resolve metric scale in all regions of a depth map. As such, a learning-based approach can be used for determining dense (per-pixel) scale factors that are applied to globally aligned depth estimates. In the example of FIG. 1, a ScaleMapLearner (SML) network 165 can be trained (e.g., using an open-source machine learning framework such as MiDaS-small, etc.) to realign individual values in an input depth map to improve metric accuracy” at paragraph 0043, line 1);
generating scale-invariant depth data through inputting the single-frame image data into a depth estimation network (“The monocular depth estimator circuitry 125 performs monocular depth estimation, as described in connection with FIG. 1. For example, the monocular depth estimator circuitry 125 predicts depth from a monocular image (e.g., single RGB image 105). In the example of FIG. 2, the monocular depth estimator circuitry 125 includes a pretrained model that produces a dense depth map up to a specified scale (e.g., depth estimation model 265). In some examples, a depth estimator can be selected (e.g., DPT-Hybrid, etc.) and a transformer-based model trained on a large meta-dataset using scale- and shift-invariant losses” at paragraph 0046, line 1); and
generating metric depth data based on the scaling factor data and the scale-invariant depth data (“As such, the SML network 165 regresses the dense scale residual map 170 (e.g., dense scale map image 190 regressed by the SML network 165) and a resulting scale map can be generated and applied to the input depth, yielding an example final metric dense depth output 175 (e.g., shown visually using the final depth map output 192)” at paragraph 0043, last sentence).
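For clarity of the record, the combination of scale-invariant depth data with scaling factor data to yield metric depth data, as mapped above, may be illustrated by the following sketch. The sketch is illustrative only; the function and variable names (e.g., depth_net, scale_net) are hypothetical and are not taken from Wofk et al.

```python
import numpy as np

def estimate_metric_depth(rgb_image, depth_net, scale_net):
    """Illustrative pipeline: scale-invariant depth rescaled by per-pixel factors.

    depth_net and scale_net are placeholders for a pretrained monocular depth
    estimator and a learned dense scale-map network of the kind discussed above.
    """
    # Scale-invariant (relative) depth predicted from the single RGB frame.
    relative_depth = depth_net(rgb_image)             # H x W, arbitrary scale

    # Dense (per-pixel) scale factors predicted from the same frame.
    scale_map = scale_net(rgb_image, relative_depth)  # H x W

    # Metric depth obtained by applying the scale factors to the relative depth.
    metric_depth = scale_map * relative_depth
    return metric_depth
```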
Regarding claim 16, Wofk et al. discloses an image-based depth estimation system, comprising:
an image sensor configured to capture images (“real-world data can be collected using an Intel RealSense D435i camera” at paragraph 0078, third to last sentence);
at least one processor (“example processor circuitry 1812 of FIG. 18, the example processor circuitry 2200 of FIG. 22, and/or the example Field Programmable Gate Array (FPGA) circuitry 2300 of FIG. 23” at paragraph 0047, line 10); and
memory storing computer instructions that, when executed by the at least one processor (“the monocular depth estimator circuitry 125 may be implemented by machine executable instructions such as that implemented by at least blocks 415, 420, 425 of FIG. 4 executed by processor circuitry” at paragraph 0047, line 5), cause the depth estimation system to:
obtain single-frame image data (“a single RGB image 105” at paragraph 0038, line 7);
obtain scaling factor data based on the single-frame image data (“The scale aligner circuitry 150 can be used to perform dense (local) scale alignment given that global alignment may not adequately resolve metric scale in all regions of a depth map. As such, a learning-based approach can be used for determining dense (per-pixel) scale factors that are applied to globally aligned depth estimates. In the example of FIG. 1, a ScaleMapLearner (SML) network 165 can be trained (e.g., using an open-source machine learning framework such as MiDaS-small, etc.) to realign individual values in an input depth map to improve metric accuracy” at paragraph 0043, line 1);
generate scale-invariant depth data through inputting the single-frame image data into a depth estimation network (“The monocular depth estimator circuitry 125 performs monocular depth estimation, as described in connection with FIG. 1. For example, the monocular depth estimator circuitry 125 predicts depth from a monocular image (e.g., single RGB image 105). In the example of FIG. 2, the monocular depth estimator circuitry 125 includes a pretrained model that produces a dense depth map up to a specified scale (e.g., depth estimation model 265). In some examples, a depth estimator can be selected (e.g., DPT-Hybrid, etc.) and a transformer-based model trained on a large meta-dataset using scale- and shift-invariant losses” at paragraph 0046, line 1); and
generate metric depth data based on the scaling factor data and the scale-invariant depth data (“As such, the SML network 165 regresses the dense scale residual map 170 (e.g., dense scale map image 190 regressed by the SML network 165) and a resulting scale map can be generated and applied to the input depth, yielding an example final metric dense depth output 175 (e.g., shown visually using the final depth map output 192)” at paragraph 0043, last sentence).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 2-4 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wofk et al. and Petrovai et al. (“MonoDVPS: A Self-Supervised Monocular Depth Estimation Approach to Depth-aware Video Panoptic Segmentation”), which incorporates elements of Cheng et al. (“Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation”).
Regarding claim 2, Wofk et al. discloses the elements of claim 1 as described above.
Wofk et al. does not explicitly disclose generating panoptic segmentation data using panoptic segmentation of the single-frame image data, wherein the panoptic segmentation data is used for generating the metric depth data.
Petrovai et al. teaches a method in the same field of endeavor of monocular depth estimation, comprising generating panoptic segmentation data using panoptic segmentation of the single-frame image data (“We propose MonoDVPS, a novel depth-aware video panoptic segmentation network that performs panoptic segmentation, instance tracking and monocular depth estimation” at section 3, line 1), wherein the panoptic segmentation data is used for generating the depth data (“We extend the semantic decoder with a depth prediction head that has a [5×5; 64] depthwise separable convolution, followed by bilinear interpolation, concatenation with low-level features and [5×5; 32] and [1×1; 1] convolutions. We adopt multi-scale depth prediction and image reconstruction at four scales with output stride 2, 4, 8 and 16 relative to the original image resolution” at section 3.2, line 1; “We propose two main mechanisms to improve the performance of the depth estimation by panoptic guidance. First, we start from the observation that the panoptic segmentation has a strong correlation with the depth map and introduce three panoptic guided losses. Second, we generate motion masks using consecutive panoptic labels that are applied to the photometric loss” at section 3.3, line 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to utilize the panoptic guidance as taught by Petrovai et al. in the depth estimation of Wofk et al. to reduce depth error, as the approach “offers a good trade-off between inference speed and accuracy” (Petrovai et al. at section 5, last sentence).
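By way of illustration only, panoptic guidance of the kind quoted above can be expressed as loss terms that encourage depth consistency within panoptic segments and mask moving objects out of the photometric loss. The sketch below is not drawn from Petrovai et al.; all names and the specific loss forms are hypothetical.

```python
import torch

def panoptic_guided_losses(depth, panoptic_ids, photometric_error, motion_mask):
    """Illustrative panoptic guidance: smooth depth within segments and
    exclude pixels flagged as moving from the photometric loss.

    depth, panoptic_ids, photometric_error, motion_mask: tensors of shape (B, H, W).
    """
    # Encourage neighboring pixels with the same panoptic id to have similar depth.
    same_segment_x = (panoptic_ids[:, :, 1:] == panoptic_ids[:, :, :-1]).float()
    same_segment_y = (panoptic_ids[:, 1:, :] == panoptic_ids[:, :-1, :]).float()
    smooth_x = (depth[:, :, 1:] - depth[:, :, :-1]).abs() * same_segment_x
    smooth_y = (depth[:, 1:, :] - depth[:, :-1, :]).abs() * same_segment_y
    panoptic_smoothness = smooth_x.mean() + smooth_y.mean()

    # Apply the motion mask so moving objects do not corrupt the photometric loss.
    masked_photometric = (photometric_error * motion_mask).sum() / motion_mask.sum().clamp(min=1)

    return panoptic_smoothness, masked_photometric
```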
Regarding claim 3, Petrovai et al. discloses a method wherein the panoptic segmentation is performed using a panoptic decoder that takes, as input, feature data generated by a feature encoder (“Panoptic DeepLab has a shared backbone and dual decoders for semantic and instance segmentation” at section 3.1, line 5).
Regarding claim 4, Cheng et al., as incorporated by Petrovai et al., discloses a method wherein the feature data is multi-feature fusion data that is or is derived from feature data from two different layers within the feature encoder (“The encoder backbone is adapted from an ImageNet-pretrained neural network paired with atrous convolution for extracting denser feature maps in its last block” Cheng et al. at section 3.1, paragraph 2, line 1).
Claim(s) 5-12 and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wofk et al. and Pirvu et al. (“Depth distillation: unsupervised metric depth estimation for UAVs by finding consensus between kinematics, optical flow and deep learning”).
Regarding claims 5 and 17, Wofk et al. discloses the elements of claims 1 and 16 above.
Wofk et al. does not explicitly disclose that the scaling factor data is generated using a scaling factor network that is trained as a part of a student network that further includes the depth estimation network.
Pirvu et al. teaches a method and system in the same field of endeavor of metric depth determination, wherein the scaling factor data is generated using a scaling factor network (“DUnsup is scaled according to DOdoFlow by estimating a single scaling parameter per frame” at section 3.1, last paragraph, line 4) that is trained as a part of a student network that further includes the depth estimation network (“During training, the student learns to mimic the ‘teacher’, which is the ensemble formed by averaging the analytical depth DOdoFlow with the data-driven, deep learning depth DUnsup” at section 2, paragraph 8, line 1; “For the DUnsup → DOdoFlow scaling procedure (which makes DUnsup metric) we use a range of depth of [50, 150] meters and further refine r by masking all pixels that are more than 5 meters apart, for a maximum of 10 iterations.” at section 4.1, line 1; the scale factor generation, while not explicitly part of the student network, is a refinement conducted as a precursor to the student network learning, and could therefore be incorporated into the student network as a pre-processing step).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to utilize a teacher-student configuration as taught by Pirvu et al. in the depth estimation of Wofk et al., as “the student can outperform the teacher and generalizes well to new scenes” (Pirvu et al. at page 3216, right column, Comments on the numerical results, line 5) and “can be deployed on an embedded GPU and it has, besides strong performance, a near real-time speed” (Pirvu et al. at page 3216, right column, Comments on the numerical results, last sentence).
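The DUnsup → DOdoFlow scaling procedure quoted above may be illustrated by the following sketch of per-frame scale estimation with iterative outlier masking. The sketch is for illustration only; the names and the specific estimation steps are hypothetical and are not taken from Pirvu et al.

```python
import numpy as np

def estimate_frame_scale(d_unsup, d_odoflow, max_iters=10, gate_m=5.0,
                         depth_range=(50.0, 150.0)):
    """Illustrative per-frame scale estimation with iterative outlier masking,
    in the spirit of the quoted DUnsup -> DOdoFlow alignment."""
    lo, hi = depth_range
    valid = (d_odoflow > lo) & (d_odoflow < hi)
    scale = 1.0
    for _ in range(max_iters):
        if not np.any(valid):
            break
        # Single scale factor aligning the relative depth to the metric reference.
        scale = np.median(d_odoflow[valid] / d_unsup[valid])
        # Mask out pixels that still disagree by more than the gate (in meters).
        valid = valid & (np.abs(scale * d_unsup - d_odoflow) < gate_m)
    return scale
```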
Regarding claims 6 and 18, the Wofk et al. and Pirvu et al. combination discloses a method and system wherein the scaling factor network is trained by a teacher network based on loss calculated using metric depth data generated by the teacher network (“For the DUnsup → DOdoFlow scaling procedure (which makes DUnsup metric) we use a range of depth of [50, 150] meters and further refine r by masking all pixels that are more than 5 meters apart, for a maximum of 10 iterations.” Pirvu et al. at section 4.1, line 1; see also Figure 3, which shows the loss term feeding into the student network) and scale-invariant depth data generated by the depth estimation network (“The monocular depth estimator circuitry 125 performs monocular depth estimation, as described in connection with FIG. 1. For example, the monocular depth estimator circuitry 125 predicts depth from a monocular image (e.g., single RGB image 105). In the example of FIG. 2, the monocular depth estimator circuitry 125 includes a pretrained model that produces a dense depth map up to a specified scale (e.g., depth estimation model 265). In some examples, a depth estimator can be selected (e.g., DPT-Hybrid, etc.) and a transformer-based model trained on a large meta-dataset using scale- and shift-invariant losses” Wofk et al. at paragraph 0046, line 1).
Regarding claims 7 and 19, the Wofk et al. and Pirvu et al. combination discloses a method and system wherein the scale-invariant depth data of the student network is combined with data output by the scaling factor network in order to generate metric depth data for the student network (“For the DUnsup → DOdoFlow scaling procedure (which makes DUnsup metric) we use a range of depth of [50, 150] meters and further refine r by masking all pixels that are more than 5 meters apart, for a maximum of 10 iterations.” Pirvu et al. at section 4.1, line 1), and wherein the loss is calculated based on the metric depth data for the student network and the metric depth information of the teacher network (“The student (S) has to mimic the teacher, instead of learning the actual process fT , by minimizing: minL(fS(A), fT (A))” Pirvu et al. at section 2, “Distillation of multiple depths”, line 6).
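The minimization minL(fS(A), fT(A)) quoted from Pirvu et al. may be illustrated, purely as an example, by an L1 objective between the student's metric depth (scale-invariant depth combined with the scaling-factor output) and the teacher's metric depth. The choice of L1 and the names below are hypothetical and are not asserted to be the loss used by either reference.

```python
import torch

def student_teacher_loss(relative_depth, scale_map, teacher_metric_depth):
    """Illustrative loss for the claimed arrangement: the student's
    scale-invariant depth is combined with the scaling-factor output to give
    student metric depth, which is compared against the teacher's metric depth."""
    student_metric_depth = scale_map * relative_depth
    return torch.nn.functional.l1_loss(student_metric_depth, teacher_metric_depth)
```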
Regarding claim 8, Wofk et al. discloses a method of training a depth estimation network, comprising:
inputting image data into a ML model in order to generate scale-invariant depth data (“The monocular depth estimator circuitry 125 performs monocular depth estimation, as described in connection with FIG. 1. For example, the monocular depth estimator circuitry 125 predicts depth from a monocular image (e.g., single RGB image 105). In the example of FIG. 2, the monocular depth estimator circuitry 125 includes a pretrained model that produces a dense depth map up to a specified scale (e.g., depth estimation model 265). In some examples, a depth estimator can be selected (e.g., DPT-Hybrid, etc.) and a transformer-based model trained on a large meta-dataset using scale- and shift-invariant losses” at paragraph 0046, line 1); and
training a network based on loss calculated using the metric depth data and the scale-invariant depth data (“As such, the SML network 165 regresses the dense scale residual map 170 (e.g., dense scale map image 190 regressed by the SML network 165) and a resulting scale map can be generated and applied to the input depth, yielding an example final metric dense depth output 175 (e.g., shown visually using the final depth map output 192)” at paragraph 0043, last sentence).
Wofk et al. does not explicitly disclose inputting image data into a teacher machine learning (ML) model in order to generate metric depth data, or that the ML model generating the scale-invariant depth data is a student ML model.
Pirvu et al. teaches a method in the same field of endeavor of metric depth determination comprising:
inputting image data into a teacher machine learning (ML) model in order to generate metric depth data (“DUnsup by using a deep neural network specifically fine-tuned, unsupervised, on the training flight video, to maximize performance on this particular scene. DUnsup becomes metric by scaling it according to DOdoFlow (find a single scale parameter per frame, by matching the depths per pixel). Both DOdoFlow and DUnsup form the unsupervised teacher” at section 2, Distillation of multiple depths, third bullet); and
training a student network based on loss calculated using the metric depth data and the scale-invariant depth data (“During training, the student learns to mimic the ‘teacher’, which is the ensemble formed by averaging the analytical depth DOdoFlow with the data-driven, deep learning depth DUnsup” at section 2, paragraph 8, line 1; “For the DUnsup → DOdoFlow scaling procedure (which makes DUnsup metric) we use a range of depth of [50, 150] meters and further refine r by masking all pixels that are more than 5 meters apart, for a maximum of 10 iterations.” Pirvu et al. at section 4.1, line 1; see also Figure 3, which shows the loss term feeding into the student network).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to utilize a teacher-student configuration as taught by Pirvu et al. in the depth estimation of Wofk et al., as “the student can outperform the teacher and generalizes well to new scenes” (Pirvu et al. at page 3216, right column, Comments on the numerical results, line 5) and “can be deployed on an embedded GPU and it has, besides strong performance, a near real-time speed” (Pirvu et al. at page 3216, right column, Comments on the numerical results, last sentence).
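For illustration only, the claimed teacher-student arrangement may be sketched as a training step in which the teacher produces metric depth and the student is trained against it from a single frame. The names and structure below are hypothetical and are not asserted to be the implementation of either reference.

```python
import torch

def train_student_step(rgb_frames, rgb_single, teacher, student, optimizer):
    """Illustrative training step: the teacher produces metric depth, and the
    student is trained to match it from a single RGB frame."""
    with torch.no_grad():
        teacher_metric_depth = teacher(rgb_frames)      # metric depth (teacher)

    relative_depth, scale_map = student(rgb_single)     # student outputs
    student_metric_depth = scale_map * relative_depth   # metric depth (student)

    # Loss between the student's metric depth and the teacher's metric depth.
    loss = torch.nn.functional.l1_loss(student_metric_depth, teacher_metric_depth)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```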
Regarding claim 9, the Wofk et al. and Pirvu et al. combination discloses a method wherein the student network includes a depth decoder that is used to generate the scale-invariant depth data (“The monocular depth estimator circuitry 125 performs monocular depth estimation, as described in connection with FIG. 1. For example, the monocular depth estimator circuitry 125 predicts depth from a monocular image (e.g., single RGB image 105). In the example of FIG. 2, the monocular depth estimator circuitry 125 includes a pretrained model that produces a dense depth map up to a specified scale (e.g., depth estimation model 265). In some examples, a depth estimator can be selected (e.g., DPT-Hybrid, etc.) and a transformer-based model trained on a large meta-dataset using scale- and shift-invariant losses” Wofk et al. at paragraph 0046, line 1) and a scaling factor network that generates scaling factor data that (“As previously described, the SML network 165 learns per-pixel scale factors by which to multiply input depth estimates {tilde over (z)}, such that the output depth {circumflex over (z)} achieves higher metric accuracy” Wofk et al. at paragraph 0113, line 1), when combined with the scale-invariant depth data, results in metric depth data of the student network (“As such, the SML network 165 regresses the dense scale residual map 170 (e.g., dense scale map image 190 regressed by the SML network 165) and a resulting scale map can be generated and applied to the input depth, yielding an example final metric dense depth output 175 (e.g., shown visually using the final depth map output 192)” Wofk et al. at paragraph 0043, last sentence).
Regarding claim 10, the Wofk et al. and Pirvu et al. combination discloses a method wherein the metric depth data of the student network is compared with the scale-aware depth data of the teacher network in order to determine the loss (“The student (S) has to mimic the teacher, instead of learning the actual process fT , by minimizing: minL(fS(A), fT (A))” Pirvu et al. at section 2, “Distillation of multiple depths”, line 6).
Regarding claim 11, the Wofk et al. and Pirvu et al. combination discloses a method wherein the image data input into the teacher ML model is multi-frame image data (“The visual-inertial odometry sensor circuitry 130 uses IMU data 115 together with visual data (e.g., RGB sequence 110) to determine metric scale” Wofk et al. at paragraph 0041, line 1), and wherein the image data input into the student model is single-frame image data (“a single RGB image 105” Wofk et al. at paragraph 0038, line 7; “The student becomes a compact representation of both pathways (Fig. 1): fS : RGB −→ Depth, and it is used for real-time metric depth prediction from single RGB frames, with no sensor information required” Pirvu et al. at section 2, “Distillation of multiple depths”, last paragraph, last sentence).
Regarding claim 12, the Wofk et al. and Pirvu et al. combination discloses a method wherein the multi-frame image data includes the single-frame image data such that a frame of the multi-frame image data is a frame represented by the single-frame image data (“a single RGB image 105 derived from an RGB image sequence 110” Wofk et al. at paragraph 0038, line 7).
Regarding claim 20, Wofk et al. discloses the elements of claim 16 as described above.
While Wofk et al. briefly mentions that the disclosure is applicable to vehicle data, Wofk et al. does not explicitly disclose an onboard vehicle computer system having the image-based depth estimation system of claim 16.
However, Pirvu et al. demonstrates the use of monocular depth estimation in the field of endeavor of UAV navigation (see section 1, paragraph 3). It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate the depth estimation system of Wofk et al. into an onboard vehicle computer system because “depth perception is fundamental to visual navigation, where correctly estimating distances to objects in view is critical for proper motion planning and/or obstacle avoidance” (Wofk et al. at paragraph 0033, last sentence).
Claim(s) 13 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wofk et al. and Pirvu et al. as applied to claim 8 above, and further in view of Petrovai et al.
The Wofk et al. and Pirvu et al. combination discloses the elements of claim 8 as described above.
The Wofk et al. and Pirvu et al. combination does not explicitly disclose that the teacher network is trained using a training process that includes determining pose information and/or determining panoptic segmentation data for the multi-frame image data.
Petrovai et al. teaches a method in the same field of endeavor of monocular depth estimation, wherein the teacher network is trained using a training process that includes determining pose information and/or determining panoptic segmentation data for the multi-frame image data (“We propose MonoDVPS, a novel depth-aware video panoptic segmentation network that performs panoptic segmentation, instance tracking and monocular depth estimation” at section 3, line 1; “We propose two main mechanisms to improve the performance of the depth estimation by panoptic guidance. First, we start from the observation that the panoptic segmentation has a strong correlation with the depth map and introduce three panoptic guided losses. Second, we generate motion masks using consecutive panoptic labels that are applied to the photometric loss” at section 3.3, line 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to utilize the panoptic guidance as taught by Petrovai et al. in the depth estimation of the Wofk et al. and Pirvu et al. combination to reduce depth error, as the approach “offers a good trade-off between inference speed and accuracy” (Petrovai et al. at section 5, last sentence).
Claim(s) 14 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Wofk et al. and Pirvu et al. as applied to claim 8 above, and further in view of Guizilini et al. (US 2024/0087151).
The Wofk et al. and Pirvu et al. combination discloses the elements of claim 8 as described above.
The Wofk et al. and Pirvu et al. combination does not explicitly disclose that the teacher network determines a cost volume between two frames of the multi-frame image data in order to generate the metric depth data.
Guizilini et al. teaches a method in the same field of endeavor of monocular depth estimation, wherein the teacher network determines a cost volume between two frames of the multi-frame image data in order to generate the metric depth data (“In some examples, cost volumes may be used by a multi-frame depth estimation network (e.g., a multi-frame monocular depth estimation network) to estimate a depth of an environment. In some examples, the cost volume is generated by combining information from multiple images onto a single 3D structure and evaluating a similarity metric between all pixel pairs given a series of possible depth ranges” at paragraph 0025, line 8).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to utilize the cost volume analysis as taught by Guizilini et al. in the depth estimation of the Wofk et al. and Pirvu et al. combination as “Multi-frame depth estimation may be considered an improvement of over single frame depth estimation because multi-frame depth estimation may leverage geometric relationships between images via feature matching, in addition to learning appearance-based features” (Guizilini et al. at paragraph 0025, line 2).
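Purely as illustration of the cost-volume concept quoted above, a plane-sweep style cost volume between two frames may be sketched as follows. The names are hypothetical and the sketch is not asserted to be the implementation of Guizilini et al.

```python
import torch

def plane_sweep_cost_volume(feat_cur, feat_prev_warped_per_depth):
    """Illustrative cost volume: for each candidate depth plane, compare features
    of the current frame with previous-frame features warped to that depth.

    feat_cur:                   (B, C, H, W) features of the current frame
    feat_prev_warped_per_depth: (B, D, C, H, W) previous-frame features warped
                                onto D candidate depth planes
    """
    # Dissimilarity (absolute difference) for every pixel and depth plane.
    diff = (feat_prev_warped_per_depth - feat_cur.unsqueeze(1)).abs()
    cost_volume = diff.mean(dim=2)          # (B, D, H, W), one cost per depth plane
    # A depth estimate can be read out by taking the lowest-cost plane per pixel.
    best_plane = cost_volume.argmin(dim=1)  # (B, H, W) index into depth candidates
    return cost_volume, best_plane
```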
Regarding claim 15, Guizilini et al. discloses a method wherein the two frames of the multi-frame image data are temporally-adjacent (“a cross-attention cost volume based on a current image of the environment and a previous image of the environment in a sequence of images” at paragraph 0056, line 13).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KATRINA R FUJITA whose telephone number is (571)270-1574. The examiner can normally be reached Monday - Friday 9:30-5:30 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz, can be reached at 571-272-3638. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KATRINA R FUJITA/Primary Examiner, Art Unit 2672