Last updated: May 29, 2026
Application No. 18/331,174
ESTIMATING DEPTH FOR IMAGE AND RELATIVE CAMERA POSES BETWEEN IMAGES

Non-Final OA: §102, §103, §112
Filed: Jun 08, 2023
Priority: Dec 08, 2020 (continuation of PCT/EP2020/085061)
Examiner: JONES, ANDREW B
Art Unit: 2667
Tech Center: 2600 (Communications)
Assignee: Huawei Technologies Co., Ltd.
OA Round: 3 (Non-Final)
Interview Optional

Examiner has a relatively high allowance rate (70%) and a +21.4% interview lift. A written response may suffice.
Based on 78 resolved cases, 2023–2026
Examiner Intelligence

JONES, ANDREW B
Career Allowance Rate: 70% (above average); 55 granted / 78 resolved; +8.5% vs TC avg
Interview Lift: +21.4% (strong; resolved cases with interview vs. without)
Typical timeline: 3y 0m average prosecution; 18 currently pending
Career history: 100 total applications across all art units
Statute-Specific Performance

§101: 1.1% (-38.9% vs TC avg)
§103: 88.2% (+48.2% vs TC avg)
§102: 1.7% (-38.3% vs TC avg)
§112: 9.0% (-31.0% vs TC avg)
Based on career data from 78 resolved cases
Office Action

§102 §103 §112
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 10 March, 2026 has been entered.
 
Response to Amendment
The amendment filed 10 March, 2026 has been entered.
The amendment of claims 1, 7, 10, and 16 has been acknowledged.
The cancellation of claims 6 and 15 has been acknowledged.

Response to Arguments
Applicant’s arguments, see page 7, section “Rejections under 35 U.S.C. 102”, filed 10 March, 2026 with respect to the rejections of claims 1 – 9 have been fully considered but they are not persuasive.
Applicant states on page 9 of the reply filed 10 March, 2026 that Gordon et al (A. Gordon, H. Li, R. Jonschkowski and A. Angelova, "Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 8976-8985, hereinafter “Gordon”) does not disclose generating the first reconstructed target image based on the mapped pixel values, wherein if a transformed target pixel location does not fall into an integer pixel location in the source image during the mapping, determining a target pixel value using bilinear sampling of pixel values from adjacent pixel locations of the source image. 
Additionally, the applicant states Gordon does not disclose backwards warping the source image by: projecting a plurality of target pixel locations of the target image into a three-dimensional (3D) space based on the target depth map and a set of camera intrinsic parameters, transforming positions of the projected pixel locations to the source image based on the pose transformation, mapping pixel values of the source image onto the corresponding target pixel locations to generate the first reconstructed target image based on the mapped pixel values, and generating the first reconstructed target image based on the mapped pixel values, wherein if a transformed target pixel location does not fall into an integer pixel location in the source image during the mapping, determining a target pixel value using bilinear sampling of pixel values from adjacent pixel locations of the source image. The examiner respectfully disagrees. The applicant has not provided a rationale or reasoning as to why Gordon fails to teach these limitations as previously indicated by the examiner in the prior action filed 10 December, 2025. As such, the previous claim mapping is upheld by the examiner.
Regarding the claim limitation “backwards warping the source image by: projecting a plurality of target pixel locations of the target image into a three-dimensional (3D) space based on the target depth map and a set of camera intrinsic parameters”, Gordon teaches the following on page 8977, Section 3: “the backbone of our method is the equation that ties together two adjacent video frames using a depth map (z) and the camera matrix K. Eq. 1 describes the shift in a pixel position p due to a rotation matrix R and a translation vector t: 
    [Eq. 1 of Gordon; image media_image1.png not reproduced]
 p’ and z’ are the new homogenous coordinates of the pixel and new depth.”. 
Regarding the claim limitation “transforming positions of the projected pixel locations to the source image based on the pose transformation”, Gordon teaches Equation 1, which “describes the shift in a pixel position p due to a rotation matrix R and a translation vector t: 
    [Eq. 1 of Gordon; image media_image1.png not reproduced]
 p’ and z’ are the new homogenous coordinates of the pixel and new depth.”.
Regarding the claim limitation “mapping pixel values of the source image onto the corresponding target pixel locations to generate the first reconstructed target image based on the mapped pixel values”, Gordon teaches the following on page 8978, Section 4.1: Losses “Based on the estimated depth map, camera intrinsics, rotation, and the translation field, we warp the first frame to match the second one and compare those using two losses: 1) a structural similarity (SSIM) loss and 2) the sum of L1 distances for the color channels… Additionally, we impose a cycle consistency loss on the motion field by estimating both forward and backward motion, which we obtain by applying the networks on the frames in normal and in reversed order.”.
Regarding the claim limitation “generating the first reconstructed target image based on the mapped pixel values”, Gordon teaches the following on page 8978, Section 4.1: Losses: “Based on the estimated depth map, camera intrinsics, rotation, and the translation field, we warp the first frame to match the second one and compare those using two losses: 1) a structural similarity (SSIM) loss and 2) the sum of L1 distances for the color channels… Additionally, we impose a cycle consistency loss on the motion field by estimating both forward and backward motion, which we obtain by applying the networks on the frames in normal and in reversed order.”.
Finally, regarding the claim limitation “wherein if a transformed target pixel location does not fall into an integer pixel location in the source image during the mapping, determining a target pixel value using bilinear sampling of pixel values from adjacent pixel locations of the source image”, this appears to be a rewording of the dependent claim 7 limitation “wherein mapping the pixel values of the source image onto the corresponding target pixel locations includes determining a pixel value using bilinear sampling of pixel values from adjacent pixel locations of the source image if a transformed target pixel location does not fall into an integer pixel location in the source image”. As such, the examiner relies on the mapping of dependent claim 7 from the previous Office action: Gordon, page 8978, Section 4.2, teaches “The point is moved in space according to the predicted motion field. In particular, the depth changes to z’. The new spatial location is reprojected back onto the camera frame, and falls at some generally-different location (i’; j’) on the target frame. i’ and j’ are generally non-integer. Therefore obtaining the depth on the target frame at (i’, j’), z^t_{i’,j’}, requires interpolation.” (Emphasis added.) Examiner’s note: Examiner understands interpolation to be a method of using adjacent pixel values to determine the value of the target pixel.
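For illustration only, the bilinear sampling recited in the limitation above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Gordon or from the application: the function name `bilinear_sample`, the (column, row) argument order, the single-channel image, and the clamping at the image border are all hypothetical choices.

```python
import numpy as np


def bilinear_sample(image, x, y):
    """Sample a single-channel image at a possibly non-integer location.

    x is the column coordinate and y is the row coordinate (assumed order).
    """
    h, w = image.shape
    # Clamp the location so the four neighbours stay inside the image
    # (border handling is an assumption, not stated in the source).
    x = min(max(float(x), 0.0), w - 1.0)
    y = min(max(float(y), 0.0), h - 1.0)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0  # fractional offsets act as blend weights
    top = (1.0 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1.0 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1.0 - wy) * top + wy * bottom


source = np.array([[0.0, 10.0],
                   [20.0, 30.0]])
center_value = bilinear_sample(source, 0.5, 0.5)   # blend of all four pixels
integer_value = bilinear_sample(source, 1.0, 0.0)  # falls on an integer pixel
```

At an integer location the fractional weights are zero and the function returns the stored pixel value unchanged; only non-integer locations blend adjacent pixel values.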

Applicant’s arguments, see page 11, section “Rejections under 35 U.S.C. 103”, filed 10 March, 2026 with respect to the rejections of claims 10 - 18 have been fully considered but they are not persuasive. 
Applicant states on page 11 of the remarks filed 10 March, 2026 that Tang et al (U.S. Patent Publication No. 2021/0319577 A1, hereinafter “Tang”) fails to cure the defects of Gordon. Therefore, according to the applicant, the combination of Gordon and Tang fails to teach each of the claim limitations of independent claims 1 and 10.
Again, the examiner respectfully disagrees. For the same reasons as listed above in response to the arguments made for the rejection of claims under “35 U.S.C. 102”, the examiner understands Gordon to fully teach the limitations of claims 1 – 9, and each limitation of claims 10 – 18, excluding the claim limitation of “a non-transitory computer-readable media storing computer instructions that configure at least one processor, upon execution of the instructions, to perform the following steps” as claimed in independent claim 10. The applicant has provided no rationale for why these references would not teach the independent claim limitations of claims 1 and 10, or the dependent claim limitations of claims 2 – 9 and 11 – 18.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA  35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA  35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claims 7 and 16 are rejected under 35 U.S.C. 112(d) or pre-AIA  35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends.
The newly amended claim 1 states the following limitation: “wherein if a transformed target pixel location does not fall into an integer pixel location in the source image during the mapping, determining a target pixel value using bilinear sampling of pixel values from adjacent pixel locations of the source image”, whereas both claims 7 and 16 state the following “wherein mapping the pixel values of the source image onto the corresponding target pixel locations includes determining a pixel value using bilinear sampling of pixel values from adjacent pixel locations of the source image if a transformed target pixel location does not fall into an integer pixel location in the source image” (emphasis added). These limitations appear to be the same, the only difference being the order of the wording of the limitation.
Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1 – 9 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Gordon et al (A. Gordon, H. Li, R. Jonschkowski and A. Angelova, "Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 8976-8985).

Regarding claim 1, Gordon teaches a computer implemented method of estimating depth for an image and relative camera poses between images in a video sequence, comprising: 
estimating a target depth map for a target image in a time series of two or more images (Page 8977, Section 3, ¶ 1: the backbone of our method is the equation that ties together two adjacent video frames using a depth map (z) and the camera matrix K.; Page 8978, Section 4.1, ¶ 1 - 2: Depth is predicted from a single image by Unet [28] encoder-decoder network with a ResNet 18 base and a softplus activation. A second network (shown in Fig. 2) predicts camera motion, a dense residual translation representing motion of objects relative to the scene, as well as the camera intrinsics, from two consecutive images.);
estimating a pose transformation from the target image to a source image, adjacent to the target image in the time series (Page 8978, Section 4.1, ¶ 2: A second network (shown in Fig. 2) predicts camera motion, a dense residual translation representing motion of objects relative to the scene, as well as the camera intrinsics, from two consecutive images.);
backwards warping the source image (Page 8977, Section 3, ¶ 1: the backbone of our method is the equation that ties together two adjacent video frames using a depth map (z) and the camera matrix K. Eq. 1 describes the shift in a pixel position p due to a rotation matrix R and a translation vector t: 
    [Eq. 1 of Gordon; image media_image1.png not reproduced]
 p’ and z’ are the new homogenous coordinates of the pixel and new depth.; Page 8978, Section 4.1, ¶ 3: Based on the estimated depth map, camera intrinsics, rotation, and the translation field, we warp the first frame to match the second one and compare those using two losses: 1) a structural similarity (SSIM) loss and 2) the sum of L1 distances for the color channels…) by:
projecting a plurality of target pixel locations of the target image into a three-dimensional (3D) space, based on the target depth map and a set of camera intrinsic parameters (Page 8977, Section 3, ¶ 1: the backbone of our method is the equation that ties together two adjacent video frames using a depth map (z) and the camera matrix K. Eq. 1 describes the shift in a pixel position p due to a rotation matrix R and a translation vector t: 
    [Eq. 1 of Gordon; image media_image1.png not reproduced]
 p’ and z’ are the new homogenous coordinates of the pixel and new depth.);
transforming positions of the projected pixel locations to the source image, based on the pose transformation (Eq. 1 describes the shift in a pixel position p due to a rotation matrix R and a translation vector t: 
    [Eq. 1 of Gordon; image media_image1.png not reproduced]
 p’ and z’ are the new homogenous coordinates of the pixel and new depth.);
mapping pixel values of the source image onto the corresponding target pixel locations to generate the first reconstructed target image based on the mapped pixel values (Page 8978, Section 4.1: Losses: Based on the estimated depth map, camera intrinsics, rotation, and the translation field, we warp the first frame to match the second one and compare those using two losses: 1) a structural similarity (SSIM) loss and 2) the sum of L1 distances for the color channels… Additionally, we impose a cycle consistency loss on the motion field by estimating both forward and backward motion, which we obtain by applying the networks on the frames in normal and in reversed order.); and
generating the first reconstructed target image based on the mapped pixel values (Page 8978, Section 4.1: Losses: Based on the estimated depth map, camera intrinsics, rotation, and the translation field, we warp the first frame to match the second one and compare those using two losses: 1) a structural similarity (SSIM) loss and 2) the sum of L1 distances for the color channels… Additionally, we impose a cycle consistency loss on the motion field by estimating both forward and backward motion, which we obtain by applying the networks on the frames in normal and in reversed order.), wherein if a transformed target pixel location does not fall into an integer pixel location in the source image during the mapping, determining a target pixel value using bilinear sampling of pixel values from adjacent pixel locations of the source image (Page 8978, Section 4.2, ¶ 2: The point is moved in space according to the predicted motion field. In particular, the depth changes to z’. The new spatial location is reprojected back onto the camera frame, and falls at some generally-different location (i’; j’) on the target frame. i’ and j’ are generally non-integer. Therefore obtaining the depth on the target frame at (i’, j’), z^t_{i’,j’}, requires interpolation. (Emphasis added); Examiner’s note: Examiner understands interpolation to be a method of using adjacent pixel values to determine the value of the target pixel.);
calculating an initial image reconstruction loss, based on the target image and the first reconstructed target image (Page 8978, Section 4.1, ¶ 3: Based on the estimated depth map, camera intrinsics, rotation, and the translation field, we warp the first frame to match the second one and compare those using two losses: 1) a structural similarity (SSIM) loss and 2) the sum of L1 distances for the color channels…);
estimating a source depth map for the source image (Page 8978, Section 4.2, ¶ 2: For each pixel (i, j) in the source frame, the predicted depth z_ij and the camera intrinsic matrix are used to obtain the respective point in space, (x_ij, y_ij, z_ij).);
forward warping the source depth map to generate a second reconstructed target depth map, based on the pose transformation and the source depth map (Page 8978, Section 4.2, ¶ 1: Given a depth map and a motion field in one frame, one could actually detect where occlusion is about to occur, and exclude the occluded areas from the consistency loss.; Page 8978, Section 4.2, ¶ 2: For each pixel (i, j) in the source frame, the predicted depth z_ij and the camera intrinsic matrix are used to obtain the respective point in space, (x_ij, y_ij, z_ij). The point is moved in space according to the predicted motion field. In particular, the depth changes to z’. The new spatial location is reprojected back onto the camera frame, and falls at some generally-different location (i’; j’) on the target frame. i’ and j’ are generally non-integer.);
generating an occlusion mask by the second reconstructed target depth map, the occlusion map indicating one or more occluded areas of the target image (Page 8979, Section 4.2, ¶ 3: Occlusions happen at (i’, j’) where z’ becomes multivalued. At such points, color and depth consistency should be applied only to the visible branch of z’, that is, the branch where z’ is smaller. If the source and target frames are nearly consistent, the visible branch will be close to target depth at (i’, j’), z^t_{i’,j’}. The way we propose to pick the visible branch is to include in the losses only points (i’, j’) where z’_{i’,j’} ≤ z^t_{i’,j’}. In other words, only if a transformed pixel on the source frame lands in front of the depth map in the target frame, do we include that pixel in the loss.; Examiner’s note: While the term “occlusion mask” is not explicitly mentioned, ¶ 3 of section 4.2 describes a process of masking pixels based on a location within depth maps.); and
regularising the initial image reconstruction loss based on the generated occlusion mask to generate a regularised image reconstruction loss (Page 8979, Section 4.2, ¶ 4: The losses described in Sec. 4.1 are invoked in an “occlusion-aware” manner, as described in this section, except for SSIM. For the latter, we handle occlusions by replacing all averaging operations by a weighted averaging, where the weight of a pixel is a decreasing function of the depth error in that pixel. The exact expression is given in the SM.).
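
For illustration only, the visibility test quoted from Gordon (include a point only where z’ ≤ z^t) and its use to restrict a reconstruction loss can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `occlusion_mask` and `masked_l1_loss` are hypothetical, a plain masked L1 mean stands in for the losses of Section 4.1, and the claimed occlusion mask and regularisation may be defined differently in the application.

```python
import numpy as np


def occlusion_mask(warped_depth, target_depth):
    """True where the warped source depth lands at or in front of the target depth."""
    return warped_depth <= target_depth


def masked_l1_loss(target, reconstructed, mask):
    """Mean absolute error over the pixels that the mask keeps (assumed loss form)."""
    diff = np.abs(target - reconstructed)
    kept = mask.sum()
    if kept == 0:
        return 0.0  # assumption: an empty mask contributes no loss
    return float((diff * mask).sum() / kept)


target = np.array([[1.0, 2.0], [3.0, 4.0]])
reconstructed = np.array([[1.0, 3.0], [3.0, 8.0]])
warped_depth = np.array([[1.0, 2.0], [3.0, 9.0]])  # last pixel lands behind the target
target_depth = np.array([[2.0, 2.0], [3.0, 5.0]])

mask = occlusion_mask(warped_depth, target_depth)
initial_loss = float(np.abs(target - reconstructed).mean())
regularised_loss = masked_l1_loss(target, reconstructed, mask)
```

In this toy input the occluded pixel carries the largest reconstruction error, so excluding it lowers the loss relative to the unmasked mean.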

Regarding claim 2, Gordon teaches the method of claim 1.
Additionally, Gordon teaches wherein the estimating the target depth map and the source depth map uses a first neural network (Page 8978, Section 4.1, ¶ 1 - 2: Depth is predicted from a single image by Unet [28] encoder-decoder network with a ResNet 18 base and a softplus activation.).

Regarding claim 3, Gordon teaches the method of claim 1.
Additionally, Gordon teaches further comprising training the first neural network based on the regularised image reconstruction loss (Page 8978, Section 4, ¶ 1: By imposing consistency across neighboring frames as a loss, the networks simultaneously learn to predict depth maps, motion fields and the camera intrinsics.; Page 8978, Section 4.1, Losses: Additionally, we impose a cycle consistency loss on the motion field by estimating both forward and backward motion, which we obtain by applying the networks on the frames in normal and in reversed order.).

Regarding claim 4, Gordon teaches the method of claim 1.
Additionally, Gordon teaches wherein the estimating the pose transformation uses a second neural network (Page 8978, Section 4.1, ¶ 2: A second network (shown in Fig. 2) predicts camera motion, a dense residual translation representing motion of objects relative to the scene, as well as the camera intrinsics, from two consecutive images.).

Regarding claim 5, Gordon teaches the method of claim 4.
Additionally, Gordon teaches further comprising training the second neural network based on the regularised image reconstruction loss (Page 8978, Section 4, ¶ 1: By imposing consistency across neighboring frames as a loss, the networks simultaneously learn to predict depth maps, motion fields and the camera intrinsics.; Page 8978, Section 4.1, Losses: Additionally, we impose a cycle consistency loss on the motion field by estimating both forward and backward motion, which we obtain by applying the networks on the frames in normal and in reversed order.).

Regarding claim 7, Gordon teaches the method of claim 1.
Additionally, Gordon teaches wherein mapping the pixel values of the source image onto the corresponding target pixel locations includes determining a pixel value using bilinear sampling of pixel values from adjacent pixel locations of the source image if a transformed target pixel location does not fall into an integer pixel location in the source image (Page 8978, Section 4.2, ¶ 2: The point is moved in space according to the predicted motion field. In particular, the depth changes to z’. The new spatial location is reprojected back onto the camera frame, and falls at some generally-different location (i’; j’) on the target frame. i’ and j’ are generally non-integer. Therefore obtaining the depth on the target frame at (i’, j’), z^t_{i’,j’}, requires interpolation. (Emphasis added); Examiner’s note: Examiner understands interpolation to be a method of using adjacent pixel values to determine the value of the target pixel.).

Regarding claim 8, Gordon teaches the method of claim 1.
Additionally, Gordon teaches wherein the forward warping comprises:
projecting a plurality of depth values from the source image into a 3D space based on the source depth map and a set of camera intrinsic parameters (Page 8978, Section 4.2, ¶ 2: For each pixel (i, j) in the source frame, the predicted depth z_ij and the camera intrinsic matrix are used to obtain the respective point in space, (x_ij, y_ij, z_ij).);
generating a pose transformation from the source image to the target image by reversing the pose transformation from the target image to the source image (Page 8978, Section 4.2, ¶ 3: This scheme is not symmetrical with respect to interchanging the source and target frames, which is why we always apply it in a symmetrized way: We transform the source onto the target, calculate the losses, and then switch the roles of source and target. Fig. 3 illustrates the method.);
transforming positions of the projected depth values, based on the pose transformation from the source image to the target image (Page 8979, Figure 3: dashed line shows the depth map obtained from warping one view onto the other (z’_R and z’_L). The warped depth map can become a multivalued function, which indicates occlusions (see the green-shaded rectangle).); and
mapping the transformed depth values onto the second reconstructed target depth map based on the set of camera intrinsic parameters (Page 8978, Section 4.2, ¶ 2: The point is moved in space according to the predicted motion field. In particular, the depth changes to z’. The new spatial location is reprojected back onto the camera frame, and falls at some generally-different location (i’; j’) on the target frame. i’ and j’ are generally non-integer.).
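
For illustration only, the four forward-warping steps recited in claim 8 (back-projection with the intrinsics, reversal of the pose, rigid transformation, and reprojection) can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `backproject`, `invert_pose`, and `transform_and_project` are hypothetical, the camera is a pinhole model with a 3x3 intrinsic matrix `K`, and the pose is a rotation matrix `R` with translation `t`. It is not Gordon's implementation, which additionally predicts a dense residual translation field.

```python
import numpy as np


def backproject(depth, K):
    """Lift every pixel (column, row) with its depth into 3D camera space (3 x N)."""
    h, w = depth.shape
    cols, rows = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([cols, rows, np.ones_like(cols)], axis=0)
    pixels = pixels.reshape(3, -1).astype(float)  # homogeneous pixel coordinates
    rays = np.linalg.inv(K) @ pixels
    return rays * depth.reshape(1, -1)


def invert_pose(R, t):
    """Reverse a rigid pose x -> R x + t, giving x -> R^T x - R^T t."""
    return R.T, -(R.T @ t)


def transform_and_project(points, R, t, K):
    """Apply the pose to 3D points and reproject them with the intrinsics."""
    moved = R @ points + t.reshape(3, 1)
    projected = K @ moved
    u = projected[0] / projected[2]  # generally non-integer column
    v = projected[1] / projected[2]  # generally non-integer row
    return u, v, moved[2]            # moved[2] is the transformed depth z'


K = np.array([[100.0, 0.0, 1.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((2, 2), 2.0)
points = backproject(depth, K)

# Pose from target to source (hypothetical values), reversed for forward warping.
R_ts = np.array([[0.0, -1.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]])
t_ts = np.array([1.0, 2.0, 3.0])
R_st, t_st = invert_pose(R_ts, t_ts)

u_id, v_id, z_id = transform_and_project(points, np.eye(3), np.zeros(3), K)
```

With the identity pose, every pixel reprojects onto itself with unchanged depth, which is a convenient sanity check before applying a non-trivial pose.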

Regarding claim 9, Gordon teaches the method of claim 8.
Additionally, Gordon teaches wherein the mapping the transformed depth values onto the second reconstructed target depth map includes determining a minimum depth value from an occluded set of depth values and discarding other depth values in the occluded set if the occluded set of depth values are mapped onto a single pixel location of the second reconstructed target depth map (Page 8979, Section 4.2, ¶ 3: Occlusions happen at (i’, j’) where z’ becomes multivalued. At such points, color and depth consistency should be applied only to the visible branch of z’, that is, the branch where z’ is smaller. If the source and target frames are nearly consistent, the visible branch will be close to target depth at (i’, j’), z^t_{i’,j’}. The way we propose to pick the visible branch is to include in the losses only points (i’, j’) where z’_{i’,j’} ≤ z^t_{i’,j’}. In other words, only if a transformed pixel on the source frame lands in front of the depth map in the target frame, do we include that pixel in the loss.).
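
For illustration only, the minimum-depth selection recited in claim 9 can be sketched as a scatter operation that keeps the smallest depth whenever several transformed depth values map onto one pixel. This is a minimal sketch under stated assumptions: the function name `splat_min_depth` is hypothetical, the pixel locations are assumed to be already rounded to integers and inside the image, and empty pixels are marked with infinity. Gordon describes selecting the visible branch through a comparison in the loss rather than through this explicit buffer.

```python
import numpy as np


def splat_min_depth(rows, cols, depths, height, width):
    """Write depths into a depth map, keeping the minimum per pixel."""
    depth_map = np.full((height, width), np.inf)  # inf marks pixels with no value
    # ufunc.at applies the minimum unbuffered, so repeated (row, col) pairs
    # are all taken into account and the smallest depth survives.
    np.minimum.at(depth_map, (rows, cols), depths)
    return depth_map


rows = np.array([0, 0, 1])
cols = np.array([1, 1, 0])
depths = np.array([5.0, 2.0, 3.0])  # the first two values collide at pixel (0, 1)

warped = splat_min_depth(rows, cols, depths, height=2, width=2)
```

The two colliding values form the occluded set for pixel (0, 1); the nearer value 2.0 is kept and the farther value 5.0 is discarded.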

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 10 - 18 are rejected under 35 U.S.C. 103 as being unpatentable over Gordon et al (A. Gordon, H. Li, R. Jonschkowski and A. Angelova, "Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 8976-8985) in view of Tang et al (U.S. Patent Publication No. 2021/0319577 A1, hereinafter “Tang”).

Regarding claim 10, Gordon teaches 
estimating a target depth map for a target image in a time series of two or more images (Page 8977, Section 3, ¶ 1: the backbone of our method is the equation that ties together two adjacent video frames using a depth map (z) and the camera matrix K.; Page 8978, Section 4.1, ¶ 1 - 2: Depth is predicted from a single image by Unet [28] encoder-decoder network with a ResNet 18 base and a softplus activation. A second network (shown in Fig. 2) predicts camera motion, a dense residual translation representing motion of objects relative to the scene, as well as the camera intrinsics, from two consecutive images.);
estimating a pose transformation from the target image to a source image, adjacent to the target image in the time series (Page 8978, Section 4.1, ¶ 2: A second network (shown in Fig. 2) predicts camera motion, a dense residual translation representing motion of objects relative to the scene, as well as the camera intrinsics, from two consecutive images.);
backwards warping the source image to generate a first reconstructed target image, based on the pose transformation and the target depth map (Page 8977, Section 3, ¶ 1: the backbone of our method is the equation that ties together two adjacent video frames using a depth map (z) and the camera matrix K. Eq. 1 describes the shift in a pixel position p due to a rotation matrix R and a translation vector t: 
    [Eq. 1 of Gordon; image media_image1.png not reproduced]
 p’ and z’ are the new homogenous coordinates of the pixel and new depth.; Page 8978, Section 4.1, ¶ 3: Based on the estimated depth map, camera intrinsics, rotation, and the translation field, we warp the first frame to match the second one and compare those using two losses: 1) a structural similarity (SSIM) loss and 2) the sum of L1 distances for the color channels…);
calculating an initial image reconstruction loss, based on the target image and the first reconstructed target image (Page 8978, Section 4.1, ¶ 3: Based on the estimated depth map, camera intrinsics, rotation, and the translation field, we warp the first frame to match the second one and compare those using two losses: 1) a structural similarity (SSIM) loss and 2) the sum of L1 distances for the color channels…);
estimating a source depth map for the source image (Page 8978, Section 4.2, ¶ 2: For each pixel (i, j) in the source frame, the predicted depth z_ij and the camera intrinsic matrix are used to obtain the respective point in space, (x_ij, y_ij, z_ij).);
forward warping the source depth map to generate a second reconstructed target depth map, based on the pose transformation and the source depth map (Page 8978, Section 4.2, ¶ 1: Given a depth map and a motion field in one frame, one could actually detect where occlusion is about to occur, and exclude the occluded areas from the consistency loss.; Page 8978, Section 4.2, ¶ 2: For each pixel (i, j) in the source frame, the predicted depth z_ij and the camera intrinsic matrix are used to obtain the respective point in space, (x_ij, y_ij, z_ij). The point is moved in space according to the predicted motion field. In particular, the depth changes to z’. The new spatial location is reprojected back onto the camera frame, and falls at some generally-different location (i’; j’) on the target frame. i’ and j’ are generally non-integer.);
generating an occlusion mask by the second reconstructed target depth map, the occlusion map indicating one or more occluded areas of the target image (Page 8979, Section 4.2, ¶ 3: Occlusions happen at (i’, j’) where z’ becomes multivalued. At such points, color and depth consistency should be applied only to the visible branch of z’, that is, the branch where z’ is smaller. If the source and target frames are nearly consistent, the visible branch will be close to target depth at (i’, j’), z^t_{i’,j’}. The way we propose to pick the visible branch is to include in the losses only points (i’, j’) where z’_{i’,j’} ≤ z^t_{i’,j’}. In other words, only if a transformed pixel on the source frame lands in front of the depth map in the target frame, do we include that pixel in the loss.; Examiner’s note: While the term “occlusion mask” is not explicitly mentioned, ¶ 3 of section 4.2 describes a process of masking pixels based on a location within depth maps.); and
regularising the initial image reconstruction loss based on the generated occlusion mask to generate a regularised image reconstruction loss (Page 8979, Section 4.2, ¶ 4: The losses described in Sec. 4.1 are invoked in an “occlusion-aware” manner, as described in this section, except for SSIM. For the latter, we handle occlusions by replacing all averaging operations by a weighted averaging, where the weight of a pixel is a decreasing function of the depth error in that pixel. The exact expression is given in the SM.).
Gordon does not explicitly teach a non-transitory computer-readable media storing computer instructions that configure at least one processor, upon execution of the instructions, to perform the following steps.
However, Tang does teach a non-transitory computer-readable media storing computer instructions that configure at least one processor, upon execution of the instructions, to perform the following steps (¶ 0006: In another aspect of the present disclosure, a non-transitory computer-readable medium with non-transitory program code recorded thereon for depth estimation is disclosed. The program code is executed by a processor and includes program code).
Gordon and Tang are considered to be analogous art as both pertain to depth estimation. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the unsupervised monocular depth learning from unknown cameras (as taught by Gordon) and the depth estimation based on ego-motion estimation and residual flow estimation (as taught by Tang). The motivation for this combination of references would be that the method of Tang improves monocular depth estimates generated from a depth network (See ¶ 0034).
This motivation for the combination of Gordon and Tang is supported by KSR exemplary rationale (G): some teaching, suggestion, or motivation in the prior art that would have led one of ordinary skill to modify the prior art reference or to combine prior art reference teachings to arrive at the claimed invention. See MPEP 2141 (III).

Regarding claim 11, claim 11 has been analyzed with regard to respective claim 2 and is rejected for the same reasons of obviousness as used above.

Regarding claim 12, claim 12 has been analyzed with regard to respective claim 3 and is rejected for the same reasons of obviousness as used above.

Regarding claim 13, claim 13 has been analyzed with regard to respective claim 4 and is rejected for the same reasons of obviousness as used above.

Regarding claim 14, claim 14 has been analyzed with regard to respective claim 5 and is rejected for the same reasons of obviousness as used above.

Regarding claim 15, claim 15 has been analyzed with regard to respective claim 6 and is rejected for the same reasons of obviousness as used above.

Regarding claim 16, claim 16 has been analyzed with regard to respective claim 7 and is rejected for the same reasons of obviousness as used above.

Regarding claim 17, claim 17 has been analyzed with regard to respective claim 8 and is rejected for the same reasons of obviousness as used above.

Regarding claim 18, claim 18 has been analyzed with regard to respective claim 9 and is rejected for the same reasons of obviousness as used above.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANDREW JONES, whose telephone number is (703) 756-4573. The examiner can normally be reached Monday - Friday, 8:00-5:00 EST, and is off every other Friday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella, can be reached at (571) 272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ANDREW B. JONES/Examiner, Art Unit 2667

/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667
Prosecution Timeline

Jun 08, 2023: Application Filed
Aug 14, 2025: Non-Final Rejection mailed (§102, §103, §112)
Nov 14, 2025: Response Filed
Dec 10, 2025: Final Rejection mailed (§102, §103, §112)
Mar 10, 2026: Response after Non-Final Action
Apr 10, 2026: Request for Continued Examination
Apr 13, 2026: Response after Non-Final Action
Apr 22, 2026: Non-Final Rejection mailed (§102, §103, §112) (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/928,992, Patent 12626475: SAMPLE OBSERVATION DEVICE AND SAMPLE OBSERVATION METHOD (3y 5m to grant, granted May 12, 2026)
18/346,846, Patent 12607515: SYSTEMS AND METHODS FOR TEMPERATURE DETERMINATION (2y 9m to grant, granted Apr 21, 2026)
18/206,868, Patent 12599285: ANALYSIS OF IN-VIVO IMAGES USING CONNECTED GRAPH COMPONENTS (2y 10m to grant, granted Apr 14, 2026)
17/999,149, Patent 12587607: CORRECTION OF COLOR TINTED PIXELS CAPTURED IN LOW-LIGHT CONDITIONS (3y 4m to grant, granted Mar 24, 2026)
18/016,719, Patent 12586201: ORAL IMAGE PROCESSING DEVICE AND ORAL IMAGE PROCESSING METHOD (3y 2m to grant, granted Mar 24, 2026)
Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 70%
With Interview (+21.4%): 92%
Median Time to Grant: 3y 0m (~0m remaining)
PTA Risk: High
Based on 78 resolved cases by this examiner. Grant probability derived from career allowance rate.