DETAILED ACTION
Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2. This is in response to the applicant’s response filed on 12/30/2025. In the applicant’s response, claim 8 was amended, and claim 32 was newly added. Accordingly, claims 1 and 3-32 are pending and being examined. Claims 1, 18, 24, and 30 are in independent form.
Claim Rejections - 35 USC § 103
3. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
4. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
5. Claims 1, 3-4, 8-9, 13-14, 18-21, 24-27, 30, and 32 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al (“Semi-supervised Depth Estimation from Sparse Depth and a Single Image for Dense Map Construction”, IEEE, 2019, hereinafter “Huang”) in view of Ma et al (“Sparse-to-Dense: Depth Prediction from
Sparse Depth Samples and a Single Image”, IEEE, 2018, hereinafter “Ma”).
Regarding claim 1, Huang discloses a processor-implemented method for depth estimation (the proposed depth estimation model/method for learning dense depth from sparse depth and its corresponding (single) image; see figs.1-2 and Sec. III), comprising:
generating a depth output from a monocular depth model based on an input image frame (the proposed method generates the predicted depth image (e.g., fig.2(c)) from the single image (e.g., fig.2(a)) and the sparse depth image (e.g., fig.2(b)); see fig.2 (a), (b), and (c); see sec. III, para.1; wherein the single image may be captured by a “monocular camera”; see Sec. II, para.3, “Depth prediction from a single image and sparse depth”; see the predicted depth D^k_{i,j} in Eq(3));
estimating depths for a subset of pixels in the input image frame (see Abstract, lines 6-9: “[t]his paper estimates pixel-wise depth from a single image and a few depth points which are constructed from registered LiDAR or acquired from visual SLAM systems to construct a dense map.”) to generate a partial estimated ground truth (see the sparse depth image fig.2(b) for the input image frame (see the single image fig.2(a))), wherein the partial estimated ground truth comprises the estimated depths for a subset of pixels in the input image frame and lacks estimated depth for at least one pixel, in the input image frame, not included in the subset of pixels (wherein the 500 sparse depth points in the sparse depth image shown in fig.2(b) are a subset of the 912x228 points in the single image fig.2(a); in other words, the single image shown in fig.2(a) includes many more points than the sparse depth points shown in fig.2(b), that is, 500 < 912x228);
determining a depth loss for the monocular depth model based on the depth output and on the partial estimated ground truth (the proposed method minimizes the total loss Ls defined by Eq.(2), which includes the depth loss Ed defined by Eq(3) that further includes the difference between the depth output D and the ground truth GT; see Eq(3) and Eq(2), and sec. III, B, 1);
updating the monocular depth model based at least in part on the depth loss; and
outputting a new depth output generated using the updated monocular depth model (the proposed method trains the network constructed in fig.1 to minimize the total loss comprising the four loss terms defined by Eq(2) and outputs the predicted depth image shown by fig.2(c); see fig.1, fig.2, sec. III, and sec. IV, B, para.1: “The proposed network is trained”).
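For illustration only, the depth-loss computation over a partial estimated ground truth mapped above can be sketched as follows. This is a hypothetical helper written by the editor, not code from Huang or Ma; the loss is averaged only over pixels that carry a ground-truth depth:

```python
def masked_depth_loss(pred, gt, valid):
    """Mean squared depth error over only the pixels with ground truth.

    pred, gt: 2D lists of floats (predicted and ground-truth depth).
    valid:    2D list of 0/1 flags marking where ground truth exists.
    """
    total, count = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred[0])):
            if valid[i][j]:
                # Only pixels in the partial ground truth contribute.
                total += (pred[i][j] - gt[i][j]) ** 2
                count += 1
    return total / max(count, 1)
```

The point of the sketch is that the same loss expression works whether the ground truth is dense (all flags set) or partial (only a subset set), which mirrors the examiner's reasoning about Eq(3).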
As can be seen, the mere difference between the claimed invention and the method of Huang is that Huang does not explicitly disclose that the depth loss defined by Eq(3) is based on “the partial estimated ground truth”, wherein the number of partial estimated ground truth points is less than the number of the input image’s pixels. In fact, as shown in Eq(3), the predicted depth D^k_{i,j} of Huang is obtained by minimizing the loss between the predicted depth D^k_{i,j} and the ground truth GT^k_{i,j} rather than the partial estimated ground truth. However, one of ordinary skill in the art would have understood that how sparse the ground-truth depth samples are is completely case-by-case dependent. For some cases (i.e., some datasets or some practical applications), the number of depth ground truth samples may be equal to the number of the input image’s pixels; for other cases, the number of depth ground truth values may be sparse, partial, and less than the number of the input image’s pixels. No matter whether the ground truth GT^k_{i,j} shown in Eq(3) and the input image shown by fig.2(a) have the same dimensions (i.e., no matter whether the number of ground-truth depth samples is equal to or less than the number of the input image’s pixels), Eq(3) works well. In other words, the method of Huang encompasses the invention claimed by claim 1, and can be trained by datasets where the number of obtained depth ground truth samples is less than the number of the input image’s pixels. As evidence, in the same field of endeavor, that is, in the field of depth prediction from sparse depth samples and a single image (see the title of Ma), Ma teaches that the depth prediction CNN can be trained by the KITTI dataset, which only has partial ground-truth depth samples compared to the RGB image pixels. Ma, see Sec. IV-B, in the left col., on page 4799, states:
[Image: quoted passage from Ma, Sec. IV-B, left col., p. 4799, reproduced as media_image1.png (306x400, greyscale).]
(The emphases added by the examiner.)
In other words, Ma teaches that the depth prediction method can be trained by a dataset like the KITTI dataset, which only includes partial ground truth samples, with “only 18k projected [depth] measurements out of the 208k image pixels”; namely, the number of actual ground truth samples/measurements is less than the number of the input image’s pixels. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Ma into the teachings of Huang and train a depth prediction model based on partial depth samples and a single image. Suggestion or motivation for doing so would have been to predict depth based on sparse depth samples as taught by Ma and Huang; see the respective Abstracts. Therefore, the combination of Huang and Ma suggests or teaches the claimed invention, and the claim is unpatentable over Huang in view of Ma.
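Ma's KITTI setup, in which valid depth exists for only a fraction of the image pixels, can be mimicked by sampling a sparse subset from a denser depth map. The following is an editor's illustrative sketch under that assumption, not code from Ma; the helper name and the convention that 0.0 marks a missing depth are hypothetical:

```python
import random

def sample_sparse_depth(dense_depth, m, seed=0):
    """Keep m randomly chosen valid depth points; zero out the rest.

    dense_depth: 2D list of floats, 0.0 meaning "no measurement".
    Returns a sparse depth map of the same shape.
    """
    rng = random.Random(seed)
    # Collect coordinates of pixels that carry a valid (positive) depth.
    valid = [(i, j) for i, row in enumerate(dense_depth)
                    for j, d in enumerate(row) if d > 0]
    keep = set(rng.sample(valid, min(m, len(valid))))
    return [[d if (i, j) in keep else 0.0 for j, d in enumerate(row)]
            for i, row in enumerate(dense_depth)]
```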
Regarding claims 3, 19, and 25, the combination of Huang and Ma discloses, further comprising determining the partial estimated ground truth for the input image frame using one or more sensors (Huang, wherein the ground truth image is a raw LiDAR scan; see fig.2(c)).
Regarding claim 4, the combination of Huang and Ma discloses the processor-implemented method of Claim 3, wherein the one or more sensors comprise one or more of: a camera sensor, a LiDAR sensor, or a radar sensor (Huang, ibid.).
Regarding claim 8, the combination of Huang and Ma discloses the processor-implemented method of Claim 1, further comprising determining that the monocular depth model lacks a generated depth output for one or more pixels in the input image frame; and based on the determination that the monocular depth model lacks a generated depth output for the one or more pixels in the input image frame, applying a mask to the depth loss to scale the depth loss (Huang, see masks Mk in Eq(3), k=t, t+1, masks Ml,t in Eq(7), and mask Mg in Eq(9). It should be noticed that “Mt and Mt+1 are the binary matrix for the t and t +1 images where the ground truth is available, respectively.” In other words, the binary mask matrix is

M^k_{i,j} = { 1, when the ground truth is available for pixel (i, j); 0, otherwise.

Stated differently, all the pixels which have no corresponding ground truth in the input image are masked out; or, the predicted depth D^k_{i,j} output by Eq(3) does not include (lacks) those pixels which have no corresponding ground truth in the input image.)
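The construction of such a binary validity mask can be sketched as follows. This is an editor's illustrative helper, not Huang's code; the choice of 0.0 as the "no ground truth" marker is an assumption:

```python
def ground_truth_mask(gt, missing=0.0):
    """Build M with M[i][j] = 1 where a ground-truth depth exists for
    pixel (i, j), and 0 elsewhere ("missing" encodes absent ground truth)."""
    return [[0 if d == missing else 1 for d in row] for row in gt]
```

Multiplying a per-pixel loss by this mask zeroes the contribution of pixels without ground truth, which is the scaling effect the rejection attributes to Huang's masks.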
Regarding claims 9, 21, and 27, the combination of Huang and Ma discloses, further comprising: determining a depth gradient loss for the depth model based on the depth output (Huang, see sec. III, B, 4) “Smoothness Loss Es”; see Eq(11)), wherein the total loss is determined using a multi-component loss function comprising the depth loss and the depth gradient loss (Huang, see Eq(2)).
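A depth-gradient (smoothness) penalty of the kind cited can be sketched as an L1 finite-difference term. This is an editor's illustration of the general technique, not a reproduction of Huang's Eq(11):

```python
def depth_gradient_loss(depth):
    """Sum of absolute horizontal and vertical depth differences,
    penalizing non-smooth depth maps."""
    h, w = len(depth), len(depth[0])
    loss = 0.0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:  # gradient in x
                loss += abs(depth[i][j + 1] - depth[i][j])
            if i + 1 < h:  # gradient in y
                loss += abs(depth[i + 1][j] - depth[i][j])
    return loss
```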
Regarding claim 13, the combination of Huang and Ma discloses the processor-implemented method of Claim 1, wherein the depth output comprises predicted depths for a plurality of pixels of the input image frame (Huang, see fig.2 (c)).
Regarding claim 14, the combination of Huang and Ma discloses the processor-implemented method of Claim 1, wherein the depth output comprises predicted disparities for a plurality of pixels of the input image frame (Huang, wherein the predicted depths shown in fig.2(c) are mathematically and inherently linked to the corresponding predicted disparities via the camera’s intrinsic parameters, such as the focal length).
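The inherent depth-to-disparity relationship noted in the mapping above can be sketched for a calibrated stereo rig, where disparity = f * b / z. This is an editor's illustration; the focal length f and baseline b are assumed parameters, not values from Huang:

```python
def depth_to_disparity(depth, focal_length, baseline):
    """Convert a depth map (z, in meters) to a disparity map (pixels)
    via disparity = f * b / z; the mapping is invertible, so predicted
    depths determine predicted disparities and vice versa."""
    return [[focal_length * baseline / z if z > 0 else 0.0 for z in row]
            for row in depth]
```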
Regarding claims 18, 24, and 30, each is an inherent variation of claim 1 and is thus interpreted and rejected for the reasons set forth in the rejection of claim 1.
Regarding claims 20 and 26, the combination of Huang and Ma discloses, the operations further comprising: determining the estimated depths for the set of pixels of the input image frame based on a model of an object in the input image frame, wherein the partial estimated ground truth comprises different depths for different pixels of the set of pixels of the input image frame (Huang, see Di,j in Eq(3), wherein i and j correspond to all the pixels of frame t and frame t+1 rather than only the partial points shown in fig.2(b)).
Regarding claim 32, the combination of Huang and Ma discloses the processor-implemented method of Claim 8, wherein determining that the monocular depth model lacks a generated depth output for the one or more pixels in the input image frame comprises determining that a reprojection error for the one or more pixels is greater than the depth loss for the one or more pixels (see the examiner’s explanation in the rejection of claim 8. Therefore, the predicted depth D^k_{i,j} obtained by Eq(3) in Huang does not include (lacks) those pixels which have no corresponding ground truth in the input image, because those pixels have no corresponding ground truth and would cause large errors.).
6. Claims 5-7 and 31 are rejected under 35 U.S.C. 103 as being unpatentable over Huang in view of Ma and further in view of Cai et al (“Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation”, AAAI, 2020, hereinafter “Cai”).
Regarding claim 5, the combination of Huang and Ma does not disclose, wherein the partial estimated ground truth for the input image frame is defined by a bounding polygon defining the set of pixels in the input image frame. However, in the same field of endeavor, that is, in the field of 3D object detection in images, Cai teaches a structured polygon bounding box on an image, shown in fig.3, for bounding the ground-truth vehicle object in the image. See fig.3 and section “Our Approach”, “Decoupled Tasks”. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Cai into the teachings of the combination of Huang and Ma and use a structured polygon bounding box taught by Cai to indicate the ground-truth object on the training image in the method of Huang. Suggestion or motivation for doing so would have been to estimate depths and detect objects in images as taught by Cai, cf., Abstract. Therefore, the claim is unpatentable.
Regarding claim 31, the combination of Huang, Ma, and Cai discloses the processor-implemented method of Claim 5, wherein a respective estimated depth for each respective pixel of the subset of pixels is defined as a function of a respective distance of the respective pixel from a central pixel of the bounding polygon (Cai, see pg.10481, the right col., 2nd para., lines 2-9: “the location of an object is defined as the position of the bottom center of its 3D bounding box, so we use the average of the midpoints of the diagonal P2P7 and P3P6 to estimate the location (x, y, z).” See 3D bounding box fig.3. Huang, see the proposed method minimizes the total loss Ls defined by Eq.(2) which includes the loss defined by Eq(3) that further includes the distance between the depth output D and the truth GT; see Eq(3) and Eq(2), and sec. III, B, 1)).
Regarding claim 6, the combination of Huang, Ma, and Cai discloses the processor-implemented method of Claim 5, wherein the partial estimated ground truth comprises a same estimated depth for each pixel of the set of pixels of the input image frame and wherein the same estimated depth is based on a central pixel of the bounding polygon (Cai, see section “Height-guided depth estimation”, para.3: “Specifically, in KITTI dataset the location of an object is defined as the position of the bottom center of its 3D bounding box,”).
Regarding claim 7, the combination of Huang, Ma, and Cai discloses the processor-implemented method of Claim 5, further comprising: determining the estimated depths for the set of pixels of the input image frame based on a model of an object in the input image frame within the bounding polygon (Cai, see fig.5; wherein the “structured polygon” shown in the 2D image is determined by the “3D bounding box” of the 3D object), wherein the partial estimated ground truth comprises different depths for different pixels of the set of pixels of the input image frame (Cai, see fig.5 and section “Height-guided depth estimation”, para.2: “The height values of the four vertical edges in 3D space are the same, while the projected height of the four vertical edges are different due to their different locations in the 3D space.”).
7. Claims 10-12, 22, and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Huang in view of Ma and further in view of Ma et al (“Self-Supervised Sparse-to-Dense: Self-Supervised Depth Completion from LiDAR and Monocular Camera”, ICRA, 2019, hereinafter “Ma2”). Ma2 is cited by applicant in IDS filed on 1/6/2023.
Regarding claims 10, 22, and 28, the combination of Huang and Ma does not explicitly disclose the recited pose estimation. However, Ma2 teaches estimating the relative transformation T1-2 between the current frame and the nearby frame by utilizing both frames and the depth (see “Pose2,1” in Fig.3 and sec. IV, b)). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Ma2 into the teachings of the combination of Huang and Ma and use the model-based pose estimation taught by Ma2 to obtain the relative transformation between the two frames for estimating a photometric loss of Huang. Suggestion or motivation for doing so would have been to estimate a dense depth image from sparse depth measurements as taught by Ma2, cf., Abstract. Therefore, the combination of Huang, Ma, and Ma2 suggests or teaches the claimed invention, and the claim is unpatentable.
Regarding claim 11, the combination of Huang, Ma, and Ma2 discloses the processor-implemented method of Claim 10, wherein generating the estimated image frame comprises interpolating the estimated image frame based on the one or more context frames and wherein the interpolation comprises bilinear interpolation (Ma2, see sec. IV, c): “using bilinear interpolation”).
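Bilinear interpolation of the kind Ma2 cites blends the four pixels surrounding a fractional coordinate, as is done when warping one frame into the viewpoint of another. The following is an editor's minimal sketch, not Ma2's implementation:

```python
import math

def bilinear_sample(img, x, y):
    """Sample a 2D image at fractional coordinates (x, y) by linearly
    blending the four surrounding pixels (img indexed as img[row][col])."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1 = min(x0 + 1, len(img[0]) - 1)  # clamp to image bounds
    y1 = min(y0 + 1, len(img) - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * img[y0][x0] + wx * img[y0][x1]
    bot = (1 - wx) * img[y1][x0] + wx * img[y1][x1]
    return (1 - wy) * top + wy * bot
```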
Regarding claim 12, the combination of Huang, Ma, and Ma2 discloses the processor-implemented method of Claim 10, further comprising generating the pose estimate with a pose model, separate from the depth model (Ma2, see fig.3, wherein the two processes are separated from each other).
8. Claims 15-17, 23, and 29 are rejected under 35 U.S.C. 103 as being unpatentable over Huang in view of Ma and further in view of Hambarde et al (“S2DNet: Depth Estimation from Single Image and Sparse Samples”, IEEE, 2020, hereinafter “Hambarde”).
Regarding claim 15, the combination of Huang and Ma does not explicitly disclose wherein updating the depth model based on the total loss comprises performing gradient descent on one or more parameters of the depth model. However, the technique of performing gradient descent on the parameters of a model to be trained is well known to one of ordinary skill in the art. As evidence, in the same field of endeavor, Hambarde, see sec. IV, A, 2) “Training Details”, teaches: “The proposed S2DNet is trained on the NYU-Depth-V2, KITTI odometry and KITTI depth completion datasets utilizing stochastic gradient descent (SGD) back propagation algorithm.” It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Hambarde into the teachings of the combination of Huang and Ma and utilize the stochastic gradient descent (SGD) back propagation algorithm taught by Hambarde to learn dense depth from sparse depth and a single image in the method of Huang. Suggestion or motivation for doing so would have been to learn dense depth from sparse depth and a single image as taught by Hambarde, cf., Abstract. Therefore, the claim is unpatentable.
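The gradient-descent update itself reduces to the familiar rule theta <- theta - lr * grad, sketched here for illustration only (this is an editor's sketch of the general technique, not S2DNet's training code):

```python
def sgd_step(params, grads, lr=0.01):
    """One stochastic-gradient-descent update on a flat parameter list:
    each parameter moves opposite its gradient, scaled by the learning rate."""
    return [p - lr * g for p, g in zip(params, grads)]
```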
Regarding claims 16, 23, and 29, the combination of Huang, Ma, and Hambarde discloses, further comprising: generating a runtime depth output by processing a runtime input image frame using the depth model; outputting the runtime depth output (Hambarde, “Training hyperparameters are updated” until “Predicted pixel accuracy” δi<Thr; see sec. IV, A, 2) and 3)); and in response to determining that one or more triggering criteria are satisfied, refining the depth model, comprising: determining a runtime depth loss for the depth model based on the runtime depth output and a runtime estimated ground truth for the runtime input image frame, the runtime estimated ground truth comprising estimated depths for a set of pixels of the runtime input image frame; determining a runtime total loss for the depth model based at least in part on the runtime depth loss; and updating the depth model based on the runtime total loss (Hambarde, “Training hyperparameters are updated” until “Predicted pixel accuracy” δi<Thr; see sec. IV, A, 2) and 3)).
Regarding claim 17, the combination of Huang, Ma, and Hambarde discloses the processor-implemented method of Claim 16, wherein the one or more triggering criteria comprise at least one of: a predetermined schedule for retraining (Hambarde, “Training hyperparameters are updated” until “Predicted pixel accuracy” δi<Thr; see sec. IV, A, 2) and 3); wherein Thr is predetermined.); performance deterioration of the depth model; or availability of computing resources.
Response to Arguments
9. Applicant’s arguments, with respect to claim 1, filed on 12/30/2025, have been fully considered but they are not persuasive.
9-1. On page 12, regarding claim 1, Applicant states:
The Examiner argues that Huang teaches "estimating depths for a subset of pixels in the input image frame to generate a partial estimated ground truth." Office Action, p. 3 (citing Fig. 2(b) of Huang). First, Applicant maintains that the "sparse depth" shown in Figure 2(b) of Huang is not the ground truth (tleft,groundtruth in Figure 1 of Huang), but rather the input to the neural network (tleft,sparse depth in Figure 1 of Huang). Furthermore, Huang is silent with respect to estimating depths for a subset of pixels in an input image frame, the same input image frame based on which a depth output was generated from a monocular depth model. Thus, Huang fails to teach "estimating depths for a subset of pixels in the input image frame to generate a partial estimated ground truth for the input image frame."
(The emphases added by the examiner.)
The examiner respectfully disagrees with the arguments, because Huang clearly states: “See Figure 2 for an example. Practically, the errors in the sparse depth calculated by SLAM systems will affect the accuracy of depth estimation inescapably. Therefore, in the experiment, not only the sparse depth from the ground truth was sampled but also the sparse depth from stereo ORB-SLAM2 system was used as input, to evaluate the results of the proposed framework.” See Sec. IV, paragraph 1. Therefore, the “sparse depth image” shown in Figure 2(b) of Huang is a sparse depth estimation that is not only an input to the system but is also sampled from the ground truth. In the instant case, there is no conflict between serving as a ground truth image and serving as an input image. The arguments therefore are not persuasive.
9-2. On page 13, regarding claim 1, Applicant states:
However, Ma says nothing about a partial estimated ground truth or estimating depths for a subset of pixels in the input image frame to generate a partial estimated ground truth. At best, Ma merely uses the KITTI dataset for conventional training and testing.
The examiner respectfully points out that the reference of Ma cited by the examiner is to provide evidence to show that the ground truth GT^k_{i,j} defined by Eq(3) in Huang could work well either for “a partial ground truth image” as argued by the applicant or for a full ground truth image. In addition, the examiner respectfully points out that either a partial ground truth image or a full ground truth image, in either Huang or Ma, is a depth estimation and is recognized to be true. The arguments therefore are not persuasive.
9-3. The other arguments have been considered as well but they are not persuasive in view of the examiner’s explanation and interpretation set forth above. Therefore, examiner maintains rejections.
Conclusion
10. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
11. Any inquiry concerning this communication or earlier communications from the examiner should be directed to RUIPING LI whose telephone number is (571)270-3376. The examiner can normally be reached 8:30am--5:30pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, HENOK SHIFERAW can be reached on (571)272-4637. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov; https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center, and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RUIPING LI/Primary Examiner, Ph.D., Art Unit 2676