Prosecution Insights
Last updated: April 19, 2026
Application No. 18/440,243

TRAINING OF MODELS FOR MONOCULAR DEPTH AND VISUAL ODOMETRY

Non-Final OA §103
Filed: Feb 13, 2024
Examiner: FATIMA, UROOJ
Art Unit: 2676
Tech Center: 2600 — Communications
Assignee: Naver Labs Corporation
OA Round: 1 (Non-Final)
Grant Probability: 100% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 9m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 100% (1 granted / 1 resolved), +38.0% vs TC avg — grants above average
Interview Lift: +100.0% (strong), comparing resolved cases with and without an interview
Typical Timeline: 2y 9m average prosecution
Career History: 17 total applications across all art units; 16 currently pending

Statute-Specific Performance

§101: 24.6% (-15.4% vs TC avg)
§103: 41.5% (+1.5% vs TC avg)
§102: 12.3% (-27.7% vs TC avg)
§112: 20.0% (-20.0% vs TC avg)
Tech Center averages are estimates. Based on career data from 1 resolved case.

Office Action

§103
DETAILED ACTION Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Information Disclosure Statement The information disclosure statement (IDS) submitted on 02/13/2024 has been considered by the examiner. Status of Claims Currently pending Claim(s): 1-21 Claim Objections Claims 12 and 16 are objected to because of the following informalities: Claim 12 line 2 recites “a warping module configured to generated…”. This should read “a warping module configured to generate…”. Claim 16 line 2 recites “a warping module configured to generated…”. This should read “a warping module configured to generate…”. Appropriate correction is required. Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph: An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 
112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: “encoder module” in claims 1, 6, and 21. “first decoder module” in claims 1, 6, and 21. “second decoder module” in claim 1, 6, and 21. “a training module” in claims 6-11, 15, and 18. “a warping module” in claims 12, 13, and 16. “loss module” in claims 16, 17 and 19. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. Claims 1, 6, and 21: “encoder module” corresponds to figure 12, element 230 “A pair of images (denoted I and I′ in FIG. 12) are input to the model 130. The model 130 includes first and second encoders (modules) 1204 and 1208 corresponding to and function as described above regarding the encoder modules 230” (Application Pub, paragraph [0176]) Claims 1, 6, and 21: “first decoder module” corresponds to figure 12, element 1212 “The model 130 includes first and second decoders (modules) 1212 and 1216 corresponding to the decoder module 270. The first and second decoders 1212 and 1216 generate depth maps (D and D′) based on the representations output by the first and second encoders 1204 and 1208, respectively. The depth map D′ is for the image I′, and the depth map D is for the image I.” (Application Pub, paragraph [0177]). Claims 1, 6, and 21: “second decoder module” corresponds to figure 12, element 1216 “The model 130 includes first and second decoders (modules) 1212 and 1216 corresponding to the decoder module 270. The first and second decoders 1212 and 1216 generate depth maps (D and D′) based on the representations output by the first and second encoders 1204 and 1208, respectively. The depth map D′ is for the image I′, and the depth map D is for the image I.” (Application Pub, paragraph [0177]). Claims 6-11, 15, and 18: “a training module” corresponds to figure 1, element 50 “FIG. 1 includes a functional block diagram of an example implementation of the herein described two-phase training process performed by a training module 50 for training a machine learning model. 
The training starts with the training module 50 performing an unsupervised pre-training 110 of a pretext learning model 130 on the pretext task.” (Application Pub, paragraph [0058]). Claims 12, 13, and 16: “a warping module” corresponds to figure 18, element 1224 “At 1820, the warping module 1224 performs the warping described above, and the loss module 1312 determines the losses as described above including Lself. In various implementations, the warping module 112 and the loss module 1312 may be implemented in the training module 50.” (Application Pub, paragraph [0211]). Claims 16, 17 and 19: “loss module” corresponds to figure 18, element 1312 “At 1820, the warping module 1224 performs the warping described above, and the loss module 1312 determines the losses as described above including Lself. In various implementations, the warping module 112 and the loss module 1312 may be implemented in the training module 50.” (Application Pub, paragraph [0211]). If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Positive Statement regarding 35 U.S.C. 101: Claims 1-21 are determined to be eligible under 35 U.S.C. 101. The claim 1, for example, at lines 6-8 recites “a first decoder module configured to decode the first and second representations and generate first and second depth maps for the images based on the first and second representations”. It is given the weight of the description in the specification page 16 that decoding the first and second representations and generating a first and second depth map is not simply transmitting and receiving data. Page 10, paragraph 168 “applies the trained task specific machine learning model to the first image thereby extracting a depth map from the first image, where the depth map corresponds to the surroundings of the autonomous apparatus…Based on the extracted depth map, the autonomous apparatus determines a distance to an object in the surroundings of the autonomous apparatus. Based on the determined distance to the object the autonomous apparatus adjusts its velocity and/or its direction of motion.”. Therefore, it seems that the “a first decoder module configured to decode the first and second representations and generate first and second depth maps for the images based on the first and second representations” in combination with other limitation/features in the claim, claim as a whole, makes it eligible under 35 USC 101. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 1, 3, 4, 6-8, 10-16 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. ("Undeepvo: Monocular visual odometry through unsupervised deep learning." 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018) (hereinafter, “Li”) in view of Tang et al. (US 11,727,588 B2) (hereinafter, “Tang”). Regarding claim 1, Li discloses an encoder module configured to encode first and second images (the two consecutive monocular images equate to the first and second image) into first and second representations, respectively (Section (II) – System Overview, page 7287 left column, paragraph 3: “The depth estimator is mainly based on an encoder-decoder architecture to generate dense depth maps.” Page 7288 Figure 3: “… The total loss including spatial losses and temporal losses can then be calculated based on raw RGB images, estimated depth maps and poses.” Page 7288, [left column paragraph 2] “The spatial image losses drive the network to recover scaled depth maps by using stereo image pairs, while the temporal image losses are designed to minimize the errors on camera motion by using two consecutive monocular images."), [Figures 2 and 3 of Li reproduced] the first and second images being from consecutive frames from video (Section (III) – Objective Losses, page 7287 [right column paragraph 1 continuing to page 7288 left column] “UnDeepVO is trained with losses through backpropagation. Since the losses are built on geometric constraints rather than labeled data, UnDeepVo is trained in an unsupervised manner. Its total loss includes spatial image losses and temporal image losses, as shown in Fig. 3. The spatial image losses drive the network to recover scaled depth maps by using stereo image pairs, while the temporal image losses are designed to minimize the errors on camera motion by using two consecutive monocular images.”); a first decoder module (the depth estimator equates to the first decoder module) configured to decode the first and second representations and generate first and second depth maps (the dense depth maps equate to the first and second depth maps) for the images based on the first and second representations, respectively (Section (II) – System Overview, page 7287 [left column paragraph 3] “The depth estimator is mainly based on an encoder-decoder architecture to generate dense depth maps. Different from other depth estimation methods [17], [18] which produce disparity images (inverse of the depth) from the network, the depth estimator of UnDeepVO is designed to directly predict depth maps. 
“; Page 7289 [left column paragraph 4] “UnDeepVO is the first to recover both scaled camera poses and depth maps by benefiting all these losses together with the 3D geometric registration and pose consistency losses.”); and a second decoder module (the pose estimator equates to the second decoder module) configured to determine a six degree of freedom pose translation of a camera that captured the video based on the first and second representations (“Our system is composed of a pose estimator and a depth estimator, as shown in Fig. 2. Both estimators take consecutive monocular images as inputs, and produce scaled 6-DoF pose and depth as outputs, respectively. For the pose estimator, it is a VGG-based [19] CNN architecture. It takes two consecutive monocular images as input and predicts the 6-DoF transformation between them.). However, Li fails to teach a non-transitory computer readable medium storing a computer model. Tang teaches a non-transitory computer readable medium storing a computer model (Column 2 [lines 41-42] “a method, apparatus, system, computer program product, non-transitory computer-readable medium”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li’s reference to include a non-transitory computer readable medium storing a computer model taught by Tang’s reference. The motivation for doing so would have been to store the executable software as suggested by Tang (see Tang, column 13 [lines 19-20]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Tang with Li to obtain the invention specified in claim 1. Regarding claim 3, which claim 1 is incorporated, Li discloses wherein the encoder module includes adapter modules trained based on minimizing a geometric consistency loss (Page 7287, left column paragraph 2 “utilize both spatial and temporal geometric consistencies of a stereo image sequence to formulate the loss function. The red points in one image all have the corresponding ones in another… temporal geometric consistency represents the geometric projective constraint between the corresponding points in two consecutive monocular images… By using these constraints to construct the loss function and minimizing them all together, the UnDeepVO learns to estimate scaled 6-DoF poses and depth maps in an end-to-end unsupervised manner”). Regarding claim 4, which claim 1 is incorporated, Li discloses wherein the encoder module includes adapter modules trained based on minimizing a photometric loss (Page 7289, left column, paragraph 1 “Photometric Consistency Loss: The photometric loss is computed from two consecutive monocular images. Similar to DTAM [4], in order to estimate 6-DoF transformation, the projective photometric error is employed as the loss to minimize.”). 
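As context for the loss terms the rejection relies on for claims 3 and 4 (and again for claims 11-12 and 15-16 below), the following is a minimal sketch of a photometric loss and a geometric (depth) consistency loss of the kind described in the cited art. It is an illustrative reconstruction, not code from Li, Tang, or the application; the tensor shapes and function names are assumptions.

```python
# Minimal sketch (assumed names/shapes) of the two self-supervised loss terms
# discussed in the rejection: a photometric loss between a target image and a
# view-synthesized (warped) image, and a geometric consistency loss between a
# depth map and a depth map warped into the same view.
import torch

def photometric_loss(target_img: torch.Tensor, warped_img: torch.Tensor) -> torch.Tensor:
    """Mean absolute per-pixel error between images in the same view.
    target_img, warped_img: (B, 3, H, W)."""
    return (target_img - warped_img).abs().mean()

def geometric_consistency_loss(depth: torch.Tensor, warped_depth: torch.Tensor) -> torch.Tensor:
    """Normalized absolute difference between a depth map and a depth map
    warped from the other frame into the same view (claims 3, 11-12 theme).
    depth, warped_depth: (B, 1, H, W)."""
    diff = (depth - warped_depth).abs()
    return (diff / (depth + warped_depth).clamp(min=1e-6)).mean()
```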
Regarding claim 6, Li discloses A system comprising: a model including: [Figure 1 of Li reproduced] an encoder module configured to encode first and second images (the two consecutive monocular images equate to the first and second image) into first and second representations, respectively (Section (II) – System Overview, page 7287 [left column, paragraph 3]: “The depth estimator is mainly based on an encoder-decoder architecture to generate dense depth maps.”; Page 7288, Figure 3 “… The total loss including spatial losses and temporal losses can then be calculated based on raw RGB images, estimated depth maps and poses.”; Page 7288 [left column paragraph 2] “The spatial image losses drive the network to recover scaled depth maps by using stereo image pairs, while the temporal image losses are designed to minimize the errors on camera motion by using two consecutive monocular images."); [Figures 2 and 3 of Li reproduced] a first decoder module (the depth estimator equates to the first decoder module) configured to decode the first and second representations and generate first and second depth maps (the dense depth maps equate to the first and second depth maps) for the images based on the first and second representations (Figure 1; Section (II) – System Overview, page 7287 [left column paragraph 3] “The depth estimator is mainly based on an encoder-decoder architecture to generate dense depth maps. Different from other depth estimation methods [17], [18] which produce disparity images (inverse of the depth) from the network, the depth estimator of UnDeepVO is designed to directly predict depth maps. “; Page 7289 [left column paragraph 4] “UnDeepVO is the first to recover both scaled camera poses and depth maps by benefiting all these losses together with the 3D geometric registration and pose consistency losses.”), respectively; and a second decoder module (the pose estimator equates to the second decoder module) configured to determine a six degree of freedom pose translation of a camera that captured the video based on the first and second representations (System Overview page 7287 [left column paragraph 1] “Our system is composed of a pose estimator and a depth estimator, as shown in Fig. 2. Both estimators take consecutive monocular images as inputs, and produce scaled 6-DoF pose and depth as outputs, respectively. For the pose estimator, it is a VGG-based [19] CNN architecture. It takes two consecutive monocular images as input and predicts the 6-DoF transformation between them.”); and a training module configured to (Figure 1 and Figure 2): train the model using pairs of images, each pair of images including at least part of a same scene and captured at different times (Section (II) – System Overview, page 7287 [left column paragraphs 1-2] “Our system is composed of a pose estimator and a depth estimator, as shown in Fig. 2. Both estimators take consecutive monocular images as inputs, and produce scaled 6-DoF pose and depth as outputs, respectively. For the pose estimator, it is a VGG-based [19] CNN architecture. 
It takes two consecutive monocular images as input and predicts the 6-DoF transformation between them.”); and train parameters of adapter modules of the encoder module using consecutive frames of monocular video based on depth maps and pose translations determined by the model based on the consecutive frames of [monocular video] (Section (III) – Objective Losses, page 7287 [right column paragraph 1 continuing to page 7288 left column] “UnDeepVO is trained with losses through backpropagation. Since the losses are built on geometric constraints rather than labeled data, UnDeepVo is trained in an unsupervised manner. Its total loss includes spatial image losses and temporal image losses, as shown in Fig. 3. The spatial image losses drive the network to recover scaled depth maps by using stereo image pairs, while the temporal image losses are designed to minimize the errors on camera motion by using two consecutive monocular images.”). However, Li fails to teach frames from a monocular video. Tang teaches frames from a monocular video (Column 7 [lines 53-55] “target image 200 may be captured by a monocular camera. The monocular camera may capture a forward-facing view of an agent (e.g., a vehicle).”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li’s reference to include frames from a monocular video taught by Tang’s reference. The motivation for doing so would have been to improve the self-supervised monocular depth estimates as suggested by Tang (see Tang, column 5 [lines 30-32]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Tang with Li to obtain the invention specified in claim 6. Regarding claim 7, which claim 6 is incorporated, Li discloses wherein the training module is configured to train the parameters of the adapter modules after training the model using the pairs of images (Figure 2: two consecutive monocular images) [Figure 2 of Li reproduced]. Regarding claim 8, which claim 6 is incorporated, Li discloses wherein the training module is configured to train the parameters of the adapter modules without annotations for the frames of the monocular video (Figure 1; Section (I) – Introduction page 7286 [right column paragraph 1] “These limitations suggest us to look for various unsupervised learning VO schemes, and consequently we can train them with easily collected unlabeled datasets and apply them to localization scenarios.”). Regarding claim 10, which claim 6 is incorporated, Li discloses all other parameters of the model are fixed (Page 7287 [left column paragraph 1] “…estimated pose but also estimated dense depth map are generated with absolute scales thanks to the use of stereo image pairs during training.”; Section (IV) – Experimental Evaluation, page 7289 last paragraph “Adam optimizer was employed to train the network for up to 20-30 epochs with parameter β1 = 0.9 and β2 = 0.99. The learning rate started from 0.001 and decreased by half for every 1/5 of total iterations. The sequence length of images feeding to the pose estimator was 2. The size of image input to the networks was 416×128.”). However, Li fails to teach wherein the training module is configured to train the parameters of the adapter modules. 
Tang teaches wherein the training module is configured to train the parameters of the adapter modules (Column 2 [lines 30-35] “where s( ) determines a structure similarity, c( ) determines a contrast similarity, and l( ) determines a luminance similarity. α, β, and γ are parameters for adjusting a relative importance of each component, and each parameter is greater than zero.”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li’s reference to include teaches wherein the training module is configured to train the parameters of the adapter modules taught by Tang’s reference. The motivation for doing so would have been to adjust the relative importance of each component as suggested by Tang (see Tang, column 2 [lines 30-35]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Tang with Li to obtain the invention specified in claim 10. Regarding claim 11, which claim 6 is incorporated, Li discloses wherein the training module is configured to train the parameters of the adapter modules based on minimizing a geometric consistency loss (Page 7287 [left column paragraph 2] “utilize both spatial and temporal geometric consistencies of a stereo image sequence to formulate the loss function. The red points in one image all have the corresponding ones in another… temporal geometric consistency represents the geometric projective constraint between the corresponding points in two consecutive monocular images… By using these constraints to construct the loss function and minimizing them all together, the UnDeepVO learns to estimate scaled 6-DoF poses and depth maps in an end-to-end unsupervised manner”). Regarding claim 12, which claim 11 is incorporated, Li fails to teach a warping module configured to generated a warped depth map based on the first depth map; and a loss module configured to determine the geometric consistency loss based on differences between the warped depth map and the first depth map. Tang teaches a warping module configured to generated a warped depth map based on the first depth map (Column 11 [lines 7-15] “a warped depth estimate 614 of the target depth estimate 612 may be generated based on the pose estimate 606. The warped depth estimate 614 may be computed by transforming the target depth estimate 612 according to the pose estimate 602. The transformation module (not shown) may warp the target depth estimate 612.”); and a loss module configured to determine the geometric consistency loss based on differences between the warped depth map and the first depth map (Column 11 [lines 15-19] “The warped depth estimate 614 and the source depth estimate 610 may be input to the residual pose network 604 which may minimize a geometric difference between the warped depth estimate 614 and the source depth estimate 610.”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li’s reference to include a warping module configured to generated a warped depth map based on the first depth map; and a loss module configured to determine the geometric consistency loss based on differences between the warped depth map and the first depth map taught by Tang’s reference. 
The motivation for doing so would have been to improve the accuracy of the warped image as suggested by Tang (see Tang, column 11 [lines 23-28]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Tang with Li to obtain the invention specified in claim 12. Regarding claim 13, which claim 12 is incorporated, Li fails to teach wherein the warping module is configured to generate the warped depth map further based on the pose translation. Tang teaches wherein the warping module is configured to generate the warped depth map further based on the pose translation (Column 11 [lines 7-15] “a warped depth estimate 614 of the target depth estimate 612 may be generated based on the pose estimate 606. The warped depth estimate 614 may be computed by transforming the target depth estimate 612 according to the pose estimate 602. The transformation module (not shown) may warp the target depth estimate 612.”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li’s reference to include wherein the warping module is configured to generate the warped depth map further based on the pose translation taught by Tang’s reference. The motivation for doing so would have been to improve the accuracy of the warped image as suggested by Tang (see Tang, column 11 [lines 23-28]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Tang with Li to obtain the invention specified in claim 13. Regarding claim 14, which claim 13 is incorporated, Li fails to teach wherein the warping module is configured to generate the warped depth map based on transforming the first depth map to a three dimensional space and projecting to the second image using the pose translation. Tang teaches wherein the warping module is configured to generate the warped depth map based on transforming the first depth map to a three dimensional space and projecting to the second image using the pose translation (Column 9 [lines 19-22] “the depth map 508 may be a per-pixel depth map. A view estimation module 510 receives the per-pixel depth map 508 and the six DoF transformation (e.g., relative pose) between the target image 504 and the source image 506.”; Column 11 [lines6-19] “the pose estimate 602 may be used to estimate a residual pose estimate 606…a warped depth estimate 614 of the target depth estimate 612 may be generated based on the pose estimate 606. The warped depth estimate 614 may be computed by transforming the target depth estimate 612 according to the pose estimate 602. The transformation module (not shown) may warp the target depth estimate 612. The warped depth estimate 614 and the source depth estimate 610 may be input to the residual pose network 604 which may minimize a geometric difference between the warped depth estimate 614 and the source depth estimate 610.”; Column 12 [lines 14-17] “the determined residual pose accounts for geometric information and improves a consistency of depth estimates between the target and source views.”). 
Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li’s reference to include wherein the warping module is configured to generate the warped depth map based on transforming the first depth map to a three dimensional space and projecting to the second image using the pose translation taught by Tang’s reference. The motivation for doing so would have been to improve the accuracy of the warped image as suggested by Tang (see Tang, column 11 [lines 23-28]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Tang with Li to obtain the invention specified in claim 14. Regarding claim 15, which claim 6 is incorporated, Li discloses wherein the training module is configured to train the parameters of the adapter modules based on minimizing a photometric loss (Page 7289 [left column, paragraph 1] “Photometric Consistency Loss: The photometric loss is computed from two consecutive monocular images. Similar to DTAM [4], in order to estimate 6-DoF transformation, the projective photometric error is employed as the loss to minimize.”). Regarding claim 16, which claim 15 is incorporated, Li fails to teach a warping module configured to generated a warped image based on the first image; and a loss module configured to determine the photometric consistency loss based on differences between the warped image and the first image. Tang teaches a warping module configured to generated a warped image based on the first image (Column 6 [lines 58-62] “each individual pixel for the target image is warped according to its own depth and pose estimates to generate a reconstructed image (e.g., warped source image)”); and a loss module configured to determine the photometric consistency loss based on differences between the warped image and the first image (Column 10 [lines 8-11] “A photometric loss is calculated based on the difference between the target image 504 and the warped image 512 (e.g., the warped source image that approximates the target image).”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li’s reference to include a warping module configured to generated a warped image based on the first image; and a loss module configured to determine the photometric consistency loss based on differences between the warped image and the first image taught by Tang’s reference. The motivation for doing so would have been to generate a 3D representation of the target image and use the loss to update the network as suggested by Tang (see Tang, column 3 [lines 43-45] and column 10 [lines 13-15]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Tang with Li to obtain the invention specified in claim 16. 
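The warping operation recited in claims 13-14 and 16 (lifting pixels to 3D with a depth map, applying the 6-DoF pose, and re-projecting into the other view) can be illustrated with a short sketch. This is a generic reconstruction under assumed inputs (camera intrinsics K and a 4x4 pose matrix T); it is not the application's implementation nor code from Li or Tang.

```python
# Generic sketch of depth-based inverse warping: back-project target pixels to
# 3D using the predicted depth, move them with the predicted 6-DoF pose,
# project into the source camera, and bilinearly sample the source image.
import torch
import torch.nn.functional as F

def warp_source_to_target(src_img, tgt_depth, T, K):
    """src_img: (B,3,H,W), tgt_depth: (B,1,H,W), T: (B,4,4) target-to-source
    pose, K: (B,3,3) intrinsics. Returns the source image resampled into the
    target view and a mask of pixels that project inside the source image."""
    B, _, H, W = src_img.shape
    device = src_img.device

    # Pixel grid in homogeneous coordinates, shape (3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(3, -1)

    # Back-project to 3D points in the target camera frame, scaled by depth
    cam = torch.linalg.inv(K) @ pix.unsqueeze(0)          # (B, 3, H*W)
    cam = cam * tgt_depth.view(B, 1, -1)

    # Rigid transform into the source camera frame
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_cam = (T @ cam_h)[:, :3]                           # (B, 3, H*W)

    # Project with the intrinsics and normalize to [-1, 1] for grid_sample
    proj = K @ src_cam
    px = proj[:, 0] / proj[:, 2].clamp(min=1e-6)
    py = proj[:, 1] / proj[:, 2].clamp(min=1e-6)
    gx = 2.0 * px / (W - 1) - 1.0
    gy = 2.0 * py / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)

    warped = F.grid_sample(src_img, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    valid = ((gx.abs() <= 1) & (gy.abs() <= 1)).view(B, 1, H, W)
    return warped, valid
```

A photometric loss such as the one sketched after the claim 4 discussion can then be evaluated between the warped image and the target image over valid pixels, and the same sampling grid can be used to warp a depth map for the geometric consistency loss.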
Regarding claim 21, Li discloses A method, comprising: train a model using pairs of images, each pair of images including at least part of a same scene and captured at different times (Section (III) – Objective Losses, page 7287 [right column paragraph 1 continuing to page 7288 left column] “The spatial image losses drive the network to recover scaled depth maps by using stereo image pairs, while the temporal image losses are designed to minimize the errors on camera motion by using two consecutive monocular images.”), the model including (Figure 1): an encoder module configured to encode first and second images (the two consecutive monocular images equate to the first and second image) into first and second representations (Section (II) – System Overview, page 7287 [left column, paragraph 3]: “The depth estimator is mainly based on an encoder-decoder architecture to generate dense depth maps.”; Page 7288, Figure 3 “… The total loss including spatial losses and temporal losses can then be calculated based on raw RGB images, estimated depth maps and poses.”; Page 7288 [left column paragraph 2] “The spatial image losses drive the network to recover scaled depth maps by using stereo image pairs, while the temporal image losses are designed to minimize the errors on camera motion by using two consecutive monocular images."; Figure 2; and Figure 3), respectively; a first decoder module (the depth estimator equates to the first decoder module) configured to decode the first and second representations and generate first and second depth maps (the dense depth maps equate to the first and second depth maps) for the images based on the first and second representations, respectively (Figure 1; Section (II) – System Overview, page 7287 [left column paragraph 3] “The depth estimator is mainly based on an encoder-decoder architecture to generate dense depth maps. Different from other depth estimation methods [17], [18] which produce disparity images (inverse of the depth) from the network, the depth estimator of UnDeepVO is designed to directly predict depth maps. “; Page 7289 [left column paragraph 4] “UnDeepVO is the first to recover both scaled camera poses and depth maps by benefiting all these losses together with the 3D geometric registration and pose consistency losses.”); and a second decoder module (the pose estimator equates to the second decoder module) configured to determine a six degree of freedom pose translation of a camera that captured the video based on the first and second representations (System Overview page 7287 [left column paragraph 1] “Our system is composed of a pose estimator and a depth estimator, as shown in Fig. 2. Both estimators take consecutive monocular images as inputs, and produce scaled 6-DoF pose and depth as outputs, respectively. For the pose estimator, it is a VGG-based [19] CNN architecture. It takes two consecutive monocular images as input and predicts the 6-DoF transformation between them.”); and training parameters of adapter modules of the encoder module using consecutive frames of monocular video based on depth maps and pose translations determined by the model based on the consecutive frames of [monocular video] (Section (III) – Objective Losses, page 7287 [right column paragraph 1 continuing to page 7288 left column] “UnDeepVO is trained with losses through backpropagation. Since the losses are built on geometric constraints rather than labeled data, UnDeepVo is trained in an unsupervised manner. 
Its total loss includes spatial image losses and temporal image losses, as shown in Fig. 3. The spatial image losses drive the network to recover scaled depth maps by using stereo image pairs, while the temporal image losses are designed to minimize the errors on camera motion by using two consecutive monocular images.”). However, Li fails to teach frames from a monocular video. Tang teaches frames from a monocular video (Column 7 [lines 53-55] “target image 200 may be captured by a monocular camera. The monocular camera may capture a forward-facing view of an agent (e.g., a vehicle).”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li’s reference to include frames from a monocular video taught by Tang’s reference. The motivation for doing so would have been to improve the self-supervised monocular depth estimates as suggested by Tang (see Tang, column 5 [lines 30-32]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Tang with Li to obtain the invention specified in claim 21. Claims 2, 9 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. ("Undeepvo: Monocular visual odometry through unsupervised deep learning." 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018) (hereinafter, “Li”) in view of Tang et al. (US 11,727,588 B2) (hereinafter, “Tang”) further in view of Wofk et al (US 2022/0343521 A1) (hereinafter, “Wofk”). Regarding claim 2, which claim 1 is incorporated, Li and Tang fail to teach wherein the first and second decoder modules include dense prediction transformer (DPT) decoders. Wofk teaches wherein the first and second decoder modules include dense prediction transformer (DPT) decoders (Paragraph [0046] “in addition to using a DPT-Hybrid depth estimator, a DPT-Large depth estimator can be used for higher depth estimation accuracy while a machine-learning framework such as MiDaS-small can be selected for computational efficiency,”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li in view of Tang to include wherein the first and second decoder modules include dense prediction transformer (DPT) decoders taught by Wofk’s reference. The motivation for doing so would have been to obtain higher depth estimation accuracy as suggested by Wofk (see Wofk, paragraph [0046]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Wofk with Li and Tang to obtain the invention specified in claim 2. Regarding claim 9, which claim 6 is incorporated, Li fails to teach wherein the adapter modules include an up projection module, a rectified linear unit (ReLU), and a down projection module, and wherein the training module is configured to train parameters of at least one of the up projection module, the ReLU, and the down projection module based on the depth maps and pose translations determined by the model based on the consecutive frames of monocular video. 
Tang teaches wherein the adapter modules include an up projection module, and a down projection module (Column 7 [lines 14-19] “some aspects of the present disclosure implement an encoder-decoder architecture with 3D packing and unpacking blocks. In some such aspects, the packing and unpacking blocks downsample and upsample feature maps during the encoding and decoding stages.”), and wherein the training module is configured to train parameters of at least one of the up projection module, [the ReLU], and the down projection module based on the depth maps and pose translations determined by the model based on the consecutive frames of monocular video (Column 8 [lines 22-38] “The encoder 302 includes multiple encoder layers 302a-d. Each encoder layer 302a-d may be a packing layer for downsampling features during the encoding process. The decoder 304 includes multiple decoder layers 304a-d. In FIG. 3, each decoder layer 304a-d may be an unpacking layer for upsampling features during the decoding process. That is, each decoder layer 304a-d may unpack a received feature map. Skip connections 306 transmit activations and gradients between encoder layers 302a-d and decoder layers 304a-d. The skip connections 306 facilitate resolving higher resolution details… the skip connections 306 directly transmit image details (e.g., features) from convolutional layers to deconvolutional layers, thereby improving image recovery at higher resolutions.”; Columns 6 [lines 65-67] and Column 7 [lines 1-5] “The network may be trained to account for external pixel-wise motion, such that the network correctly warps pixels with external motion. In such implementations, correctly warped pixels are aligned between frames even considering the presence of dynamic objects.”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li’s to include wherein the adapter modules include an up projection module, and a down projection module, and wherein the training module is configured to train parameters of at least one of the up projection module, the ReLU, and the down projection module based on the depth maps and pose translations determined by the model based on the consecutive frames of monocular video taught by Tang’s reference. The motivation for doing so would have been to preserve and recover spatial information for depth estimation as suggested by Tang (see Tang, Column 8 [lines 49-51]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. However, Li and Tang fail to teach a rectified linear unit (ReLU). Wofk teaches a rectified linear unit (ReLU) (Paragraph [0053] “the scale map learner circuitry 315 regresses a dense scale residual map r where values can be negative. In examples disclosed herein, the resulting scale map can be represented as ReLU (1+r) and applied to the input depth {tilde over (z)} to produce the output depth {circumflex over (z)}=ReLU (1+r){tilde over (z)}.”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li in view of Tang to include a rectified linear unit (ReLU) taught by Wofk’s reference. The motivation for doing so would have been to account for negative values as suggested by Wofk (see Wofk, paragraph [0073]). 
Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Wofk with Li and Tang to obtain the invention specified in claim 9. Regarding claim 20 (drawn to a system), claim 20 is rejected the same as claim 2 and the arguments similar to that presented above for claim 2 are equally applicable to the claim 20, and all the other limitations similar to claim 2 are not repeated herein, but incorporated by reference. Claims 5 and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. ("Undeepvo: Monocular visual odometry through unsupervised deep learning." 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018) (hereinafter, “Li”) in view of Tang et al. (US 11,727,588 B2) (hereinafter, “Tang”) further in view of Guizilini et al. (US 11,386,567 B2) (hereinafter, “Guizilini”) Regarding claim 5, which claim 1 is incorporated, Li and Tang fail to teach wherein the encoder module includes adapter modules trained based on minimizing an edge smoothness loss. Guizilini teaches wherein the encoder module includes adapter modules trained based on minimizing an edge smoothness loss (Paragraph [0046] “in addition to using a DPT-Hybrid depth estimator, a DPT-Large depth estimator can be used for higher depth estimation accuracy while a machine-learning framework such as MiDaS-small can be selected for computational efficiency,”). Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li in view of Tang to include wherein the encoder module includes adapter modules trained based on minimizing an edge smoothness loss taught by Guizilini’s reference. The motivation for doing so would have been to modify the parameters of the models to perform the training as suggested by Guizilini (see Guizilini, Column 5 [lines 8-11]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Guizilini with Li and Tang to obtain the invention specified in claim 5. Regarding claim 17, which claim 16 is incorporated, Li and Tang fail to teach wherein the loss module is configured to determine the photometric consistency loss based on downweighting regions of the first image including moving objects. Guizilini teaches wherein the loss module is configured to determine the photometric consistency loss based on downweighting regions of the first image including moving objects (Column 13 [lines 1-10] “the training module 230 masks out static pixels by removing pixels that have a warped photometric loss higher than a corresponding unwarped photometric loss, which the training module 230 calculates using the original source image (e.g., 620) without synthesizing the target. The mask (Mp) removes pixels that have appearance loss that does not change between frames, which includes static scenes and dynamic objects moving at a similar speed as the camera.”). 
Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li in view of Tang to include wherein the loss module is configured to determine the photometric consistency loss based on downweighting regions of the first image including moving objects taught by Guizilini’s reference. The motivation for doing so would have been to remove pixels that have appearance loss as suggested by Guizilini (see Guizilini, Column 13 [lines 6-8]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Guizilini with Li and Tang to obtain the invention specified in claim 17. Regarding claim 18, which claim 6 is incorporated, Li and Tang fail to teach wherein the training module is configured to train the parameters of the adapter modules based on minimizing an edge smoothness loss. Guizilini teaches wherein the training module is configured to train the parameters of the adapter modules based on minimizing an edge smoothness loss (Column 13 [lines 15-20] “Ls represents depth smoothness loss and is implemented to regularize the depth in textureless low-image gradient regions, as shown in equation (5). The smoothness loss is an edge-aware term that is weighted for separate pyramid levels starting from 1 and decaying by a factor of two for the separate scales.)”. [Equation 5 of Guizilini reproduced] Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li in view of Tang to include wherein the training module is configured to train the parameters of the adapter modules based on minimizing an edge smoothness loss taught by Guizilini’s reference. The motivation for doing so would have been to modify the parameters of the models to perform the training as suggested by Guizilini (see Guizilini, Column 5 [lines 8-11]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Guizilini with Li and Tang to obtain the invention specified in claim 18. Regarding claim 19, which claim 18 is incorporated, Li and Tang fail to teach a loss module configured to determine the edge smoothness loss based on first derivatives of pixel values of the first and second depth maps. Guizilini teaches a loss module configured to determine the edge smoothness loss based on first derivatives of pixel values of the first and second depth maps (Column 16 [lines 25-30] “the training module 230 uses this appearance-based loss as both the first stage loss to account for pixel-level similarities and irregularities along edge regions between a synthesized image derived from depth predictions of the depth model and a target image that is the original input into the depth model 260.”.) Therefore, it would have been obvious to one of ordinary skill of the art before the effective filing date to modify Li in view of Tang to include a loss module configured to determine the edge smoothness loss based on first derivatives of pixel values of the first and second depth maps taught by Guizilini’s reference. 
The motivation for doing so would have been to modify the parameters of the models to perform the training as suggested by Guizilini (see Guizilini, Column 5 [lines 8-11]). Further, one skilled in the art could have combined the elements described above by known methods with no change to the respective functions, and the combination would have yielded nothing more that predictable results. Therefore, it would have been obvious to combine Guizilini with Li and Tang to obtain the invention specified in claim 19. Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Chidlovoskii et al. (US 2022/0011778 A1) discloses a system comprising a depth module including an encoder and a decoder and configured to receive a first image from a from a camera; and based on the first image, generate a depth map including depths between the camera and objects in the image. Ji et al. (CN 117,321,631 A) discloses a systems and method for improving a depth estimation model. The system and method receive image frames collected by the camera; the first module is the used for generating a map point of the environment and a depth map of each image frame; and the second module is used for calculating one or more loss parameters. Venkataraman et al. (US 12,340,538 B2) discloses a system that comprises a process which receives plurality of images from the imaging system; estimate a pose of an object depicted in the images; render a shape estimate from a 3-D model of the object in accordance with the pose of the object; and generate a data point of a training dataset. Any inquiry concerning this communication or earlier communications from the examiner should be directed to UROOJ FATIMA whose telephone number is (571)272-2096. The examiner can normally be reached M-F 8:00-5:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Henok Shiferaw can be reached at (571) 272-4637. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /UROOJ FATIMA/Examiner, Art Unit 2676 /Henok Shiferaw/Supervisory Patent Examiner, Art Unit 2676
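Two further components recited in the rejected claims lend themselves to short illustrations: the projection/ReLU/projection adapter of claim 9 and the edge smoothness loss of claims 5 and 18-19, built from first derivatives of the depth map. The sketch below shows a common formulation of each, offered only as a reading aid; the bottleneck width, ordering, and edge weighting are assumptions, not the claimed or cited implementations.

```python
# Hedged sketches of two components discussed in the rejections above: a
# bottleneck adapter inserted into an otherwise frozen encoder (claim 9 theme)
# and an edge-aware depth smoothness loss based on first derivatives of the
# depth map, down-weighted at strong image gradients (claims 5, 18-19 theme).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-projection, ReLU, up-projection with a residual add. Only these
    few parameters would be trained while the rest of the encoder stays fixed,
    matching the adapter-tuning idea discussed for claims 9-10. Assumes
    token-style features of shape (B, N, dim)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.relu = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.relu(self.down(x)))

def edge_aware_smoothness(depth: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """First derivatives of the depth map, weighted down where the image has
    strong gradients so depth discontinuities at object edges are not
    penalized. depth: (B,1,H,W), image: (B,3,H,W)."""
    # Normalize depth to reduce sensitivity to its absolute scale
    d = depth / (depth.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx_d = (d[:, :, :, 1:] - d[:, :, :, :-1]).abs()
    dy_d = (d[:, :, 1:, :] - d[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```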

Prosecution Timeline

Feb 13, 2024
Application Filed
Jan 23, 2026
Non-Final Rejection — §103 (current)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 100%
With Interview: 99% (+100.0%)
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 1 resolved case by this examiner. Grant probability derived from career allow rate.
