Last updated: May 29, 2026
Application No. 18/493,517
ADAPTIVE DEPTH COMPLETION

Final Rejection §103
Filed
Oct 24, 2023
Priority
Jun 29, 2023 — provisional 63/523,939
Examiner
MENDEZ MUNIZ, DYLAN JOHN
Art Unit
2675
Tech Center
2600 — Communications
Assignee
Carnegie Mellon University
OA Round
2 (Final)
Interview Optional

Interview lift: +26.7%. An interview has already been conducted in this application's prosecution history. This examiner has an 81% grant rate with a +26.7% interview lift. Since an interview has already been tried, recommend a written response with narrowed claims based on precedent claim evolution patterns.
Based on 21 resolved cases, 2023–2026
Examiner Intelligence

MENDEZ MUNIZ, DYLAN JOHN
Grants 81% — above average
Career Allowance Rate
17 granted / 21 resolved
+19.0% vs TC avg
Strong +27% interview lift
Interview lift: +26.7% (with versus without an interview)
Typical timeline
2y 10m
Avg Prosecution
8 currently pending
Statute-Specific Performance

§103
89.1%
+49.1% vs TC avg
§102
1.8%
-38.2% vs TC avg
§112
9.1%
-30.9% vs TC avg
Based on career data from 21 resolved cases
Office Action

§103
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) was filed on 10/24/2023. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments
Applicant first argues that the cited reference Xu does not teach the independent claim 1 limitation “decode the features into a depth map using a decoder of the depth model according to an affinity-based shift correction embedded with the decoder”. More specifically, Applicant addresses the underlined part, arguing that Xu is a post-processing step and that “embedded with the decoder” is the same as “embedded within the decoder”. Examiner disagrees; under the BRI (broadest reasonable interpretation), these two limitations are distinct. The BRI of “embedded”, according to the claim language, signifies an attachment to the decoder. A person of ordinary skill in the art would not interpret “embedded with the decoder” as “embedded within the decoder”. As seen in fig. 2 on page 5 of Xu, the pipeline shows the main components of the invention, and they are all connected (everything in a pipeline is connected/attached); therefore they are all embedded with each other. In addition, the combination is a clear improvement under the KSR rationales: it shows better performance and better analysis, as seen on page 3 column 1 paragraphs 1-2 and page 7 column 2 paragraphs 2-3. Therefore the limitation is still taught under a BRI. Examiner recommends changing the specified limitation to “embedded within the decoder” for a concrete interpretation in accordance with the applicant's desire.

Applicant also argues in remark B that “affinity” in the context of the claim refers to semantic-based similarity. Examiner disagrees: the BRI of the independent claim 1 limitation does not make any reference to a semantic-based similarity. The BRI of “affinity-based” is any reference to an alignment/connection/relation/similarity, and a person of ordinary skill in the art would not interpret “affinity-based” as just “semantically similar”. The BRI of “semantically similar” is much narrower, being limited to semantics, as opposed to “affinity-based”, which covers any type of connection. Xu shows an alignment between pixels and depth; even if it includes a distance similarity, it is still within the BRI of something that is “affinity-based”. Examiner recommends specifying in the claim language that affinity signifies only a semantic similarity. Applicant further argues that Xu relies on geometric proximity (distance) whereas affinity is “visual or semantic similarity”. It would be obvious to a person of ordinary skill in the art that a distance is something that can be seen (one can see how close or how far one object is from another); therefore “distance” is also “visual”.

Applicant also argues that prior art Xu does not teach the limitation “...apply a correction confidence prediction to selectively integrate information” from claims 6 and 19. Examiner disagrees. Page 3 column 1 paragraph 2 of Xu shows that the affinity-based shift correction allows a correction (recovery) of a confidence (error) prediction (depth), more specifically for the coarse misalignment error. The coarse misalignment error (confidence of a prediction) is alleviated (recovered/correction) by using the model presented in section 2.2 Metric depth recovery (affinity-based shift correction), as further explained on page 7 column 2, section Decoupling of monocular depth error: “the local recovery strategy is also performed to decouple monocular depth error between ground truth and globally aligned prediction into coarse misalignment error and detail-missing error.” Since the misalignment error (correction confidence) includes globally aligned predictions, it is within the BRI of applying a correction confidence prediction. In addition, it is used for prediction (estimation) of the depth, as seen in table 4 on page 8 and page 8 section 3.3. The combination of Guizilini and Xu also provides an obvious improvement under the KSR rationales, as it clearly shows better performance, such as metric accuracy, and better analysis of the weaknesses of depth estimation (prediction). See page 3 column 1 paragraphs 1-2: “our local recovery strategy can retrieve location-related scale map and shift map to adjust the distribution of prediction. Experiments show that our method can significantly improve metric accuracy… our method can further boost their performance with our local recovery strategy. In addition to boosting performance, the second benefit of our local strategy is to better analyze the weakness of all existing depth estimation methods and guide the design and choice of loss functions…”. See also page 7 column 2 paragraphs 2-3. Xu.

Applicant also amended claims 4 and 6 with new claim language that changes the scope. Examiner has therefore added a new reference, Bristow, that teaches the newly added semantic similarity of claim 4, and rejects claim 6 with a previously cited reference, Warburg, which also includes several iterations in order to align intermediate depth maps to sparse data and alignment with decoder features. Please see the rejection below and, most importantly, fig. 1 in Warburg. Therefore all of the claims stay rejected.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 2, 7, 8, 9, 10, 14, 15, 19, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Guizilini et al. (US Publication No. 20220148204 A1) in view of Xu et al. (Xu, G. K., and F. Zhao. Toward 3D scene reconstruction from locally scale-aligned monocular video depth. JUSTC, 2024, 54 (4): 0402. DOI: 10.52396. Vol. 44. JUSTC-2023-0061 Page 2, 2022.).

As per claim 1, Guizilini teaches “a depth system, comprising: one or more processors; a memory communicably coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the one or more processors to: (See paragraph 60, shows processors and memory. See also paragraphs 28 and 87-88. Guizilini)
acquire sensor data including at least an image of a surrounding environment; (See paragraph 9 and fig. 6 unit 610. See paragraph 51 “[0051] At 610, the network module 220 acquires the sensor data 240. In general, the sensor data 240 is comprised of at least a monocular image (e.g., image 450), but may also include, as previously outlined, sparse depth data. The sparse depth data can be LiDAR data or depth data from another source, such as a radar. In any case, the sparse depth data provides a mechanism by which the depth system 170 can integrate further information about the surrounding environment in order to guide the generation of the depth map 260 and improve the quality of the depth estimates included therein.” Guizilini)
encode the sensor data into features using an encoder of a depth model; (See paragraphs 22 and 23 “[0022] In any case, the depth model implements an additional encoder, which is referred to herein as the sparse auxiliary network (SAN), to process the sparse depth data. The SAN is a machine learning algorithm, such as a convolutional neural network (CNN). The SAN accepts range information in the form of the sparse depth data from a range sensor and outputs depth features.” As seen in paragraph 51, the sparse depth data is from the sensor data. Guizilini)
decode the features into a depth map using a decoder of the depth model… with the decoder; and (See paragraph 47 “[0047] With continued reference to FIG. 5, the depth model 250 further includes the image decoder 420. One example of how the image decoder 420 functions includes unfolding (i.e., adapting dimensions of the tensor to extract the features) the previously encoded spatial information in order to derive the depth map 260 according to learned correlations associated with the encoded features.” Guizilini)
provide the depth map that indicates depths within the surrounding environment.” (See paragraphs 48, 56 and 57 “[0056]… In this way, the network module 220 generates the depth map 260 as a dense representation of depths for a depicted scene of the surrounding environment.” Guizilini); however, Guizilini does not teach “an affinity-based shift correction embedded”.

Xu teaches “an affinity based shift correction embedded” (See page 5 fig. 2 and see all of page 5-6 section 2.2 Metric depth recovery along with equations 3 and 5, it presents a recovery (correction) of the shift. Page 5 column 2 paragraph 1 “Motivated by this observation, we propose leveraging a locally recovering method i.e., locally weighted linear regression (LWLR), to recover a scale and a shift map. Guided by very sparse ground-truth points, we can fix and quantify these low-rank spatial errors which are common in depth estimation tasks.”. See also page 4 all of section Metric depth recover module along with column 2 paragraph 2 , which includes “Our proposed local recovery strategy leverages the sparse anchor points obtained from the SLAM system, SFM algorithms, or some low-quality sensors, and can not only boost performance of depth estimation but also be an analytical tool to decouple the prediction depth errors into the coarse misalignment error and the detail missing error. ” The sparse anchor points (also presented as sparse guided points) are used for local alignment, this alignment falls within the BRI (broadest reasonable interpretation) of “affinity based”. See page 4 section RGB-D fusion module and page 5 column 1 paragraph 1. “The sparse guided points used for local alignment can be obtained from various SLAM systems, SFM algorithms, and some low-quality sensors such as Tof sensors of mobile phones…”. It is well known in the art that Time of flight sensors (ToF) utilize pixels for depth. See also page 3 column 1 paragraphs 1-4 along with fig. 1, it is explained that the anchor points with alignment are used to recover the metric depth along with the use of RGB-D datasets, and it includes that those depth points are used to adjust the distribution of prediction. “With the robust monocular depth and local recovery strategy, the metric depth can be recovered by locally aligning with some sparse points. 
The last challenge is how to obtain accurate sparse anchor points as metric guidance…”. See it also with fig. 1, which shows the per-pixel error maps of the predicted depth aligned (which includes the BRI of affinity based), see page 6 column 1 last paragraph for more information regarding figure 1. Therefore the shift recovery (correction) is affinity based since it shows a relation between pixels and depth points. Xu)
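For illustration only, the locally weighted linear regression (LWLR) recovery of a scale map and a shift map cited from Xu can be sketched as follows. This is a minimal sketch under stated assumptions, not Xu's implementation: the Gaussian distance kernel, the `bandwidth` parameter, and the function name `lwlr_scale_shift` are hypothetical and do not appear in the cited reference.

```python
import numpy as np

def lwlr_scale_shift(pred, anchors_rc, anchors_depth, bandwidth=4.0):
    """Per-pixel scale and shift maps from sparse metric anchors.

    pred          : (H, W) relative depth prediction
    anchors_rc    : (N, 2) integer (row, col) positions of sparse anchor points
    anchors_depth : (N,) metric depth at those anchors
    bandwidth     : Gaussian kernel width in pixels (assumed, not from Xu)
    """
    h, w = pred.shape
    rows, cols = np.mgrid[0:h, 0:w]
    pix = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)  # (P, 2)
    rc = np.asarray(anchors_rc, dtype=float)                            # (N, 2)
    g = np.asarray(anchors_depth, dtype=float)                          # (N,)
    d = pred[rc[:, 0].astype(int), rc[:, 1].astype(int)]                # (N,)

    # Each pixel weights every anchor by spatial closeness.
    dist2 = ((pix[:, None, :] - rc[None, :, :]) ** 2).sum(axis=-1)      # (P, N)
    wgt = np.exp(-dist2 / (2.0 * bandwidth ** 2))

    # Closed-form weighted least squares for g ~ scale * d + shift, per pixel.
    sw = wgt.sum(axis=1)
    swd = wgt @ d
    swg = wgt @ g
    swdd = wgt @ (d * d)
    swdg = wgt @ (d * g)
    denom = sw * swdd - swd ** 2
    scale = (sw * swdg - swd * swg) / denom
    shift = (swg - scale * swd) / sw
    return scale.reshape(h, w), shift.reshape(h, w)
```

The corrected depth under this sketch would be `scale * pred + shift`. The fit needs at least two anchors with different predicted depths; otherwise the denominator is zero.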

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Guizilini with the teachings of Xu to embed an affinity-based shift correction with the decoder. The modification would have been motivated by the desire for better performance, such as metric accuracy, and for the ability to better analyze the weaknesses of depth estimation; it is therefore an improvement, as suggested by Xu (See page 3 column 1 paragraphs 1-2. “our local recovery strategy can retrieve location-related scale map and shift map to adjust the distribution of prediction. Experiments show that our method can significantly improve metric accuracy… our method can further boost their performance with our local recovery strategy. In addition to boosting performance, the second benefit of our local strategy is to better analyze the weakness of all existing depth estimation methods and guide the design and choice of loss functions…”. See also page 7 column 2 paragraphs 2-3. Xu)

Claim 9 is rejected under the same analysis as claim 1. (See paragraph 87 for non-transitory.)
Claim 14 is rejected under the same analysis as claim 1. 

As per claim 2, Guizilini in view of Xu already teaches “the depth system of claim 1,” and Guizilini teaches “wherein the sensor data includes the image and sparse depth data corresponding to the image, wherein the instructions to acquire the sensor data include instructions to derive the image from one of a camera and a LiDAR,” (See paragraph 49 “[0049] It should be appreciated that, in one embodiment, the network module 220 generally includes instructions that function to control the processor 110 to execute various actions to control the depth model 250 to produce the depth map 260. The network module 220, in one or more approaches, acquires the sensor data 240 including the sparse depth data 440 and the monocular image 450 by controlling the camera 126 and a LiDAR 124 to capture the sensor data 240 from a data bus, or electronic memory, or another available communication pathway…” Guizilini) “and wherein the instructions to decode the features include instructions to integrate the sparse depth data into the decoder” (See paragraph 55, the depth features integrated into the decoder include sparse depth data. “[0055] At 640, the network module 220 injects the depth features into the depth model 250. For example, the network module 220 injects the depth features by concatenating the depth features with the image features and providing concatenated features into the image decoder 420 of the depth model 250…In general, injecting the sparse depth data via the depth features using the skip connections is an optimal approach to integrating this information…”. See also paragraphs 8, 9, 10, 36 and 49. Guizilini) and Xu already teaches “through the affinity-based shift correction” as seen in claim 1 and includes the same motivation since the affinity-based shift correction is already included in the decoder in claim 1.

Claim 10 is rejected under the same analysis as claim 2. 
Claim 15 is rejected under the same analysis as claim 2. 

As per claim 6, Guizilini in view of Xu already teaches “the depth system of claim 1, wherein the instructions to decode the features using the affinity- based shift correction”, however Guizilini also teaches “include instructions to… selectively integrate information from sparse depth data into decoding the depth map from the features.” (Paragraphs 10, 23 and 39 show that depth features include sparse depth data. Paragraph 9 shows “[0009]… The network module includes instructions to selectively inject the depth features into a depth model. The network module includes instructions to generate a depth map from at least a monocular image using the depth model that is guided by the depth features when injected. ” Paragraph 8 then shows that those features are fed into the decoder “ [0008]… The depth model concatenates the image features and the depth features and provides the concatenated features into the decoder of the encoder/decoder structure, which then produces the depth estimates.” Paragraph 47-48 then shows that the decoder generates the depth map from features. “[0047] With continued reference to FIG. 5, the depth model 250 further includes the image decoder 420. One example of how the image decoder 420 functions includes unfolding (i.e., adapting dimensions of the tensor to extract the features) the previously encoded spatial information in order to derive the depth map 260 according to learned correlations associated with the encoded features. ” Guizilini)  And Xu teaches “include instructions to apply a correction confidence prediction…” (See page 3 column 1 paragraph 2, it shows that the affinity based shift correction allows a correction (recovery) of a confidence (error) prediction (depth). “the second benefit of our local recovery strategy is to better analyze the weakness of all existing depth estimation methods and guide the design and choice of loss functions. 
The depth error can be decoupled into two parts: the coarse misalignment error and the detail missing error. Compared with the error of global fitting, the alleviated error achieved by the local recovery is the coarse misalignment error, while the remaining one is the detail missing error.” The coarse misalignment error (confidence of a prediction) is alleviated (recovered/correction), by using the model presented in section 2.2 Metric depth recovery (affinity-based shift correction), is further explained in page 7 column 2 section Decoupling of monocular depth error “the local recovery strategy is also performed to decouple monocular depth error between ground truth and globally aligned prediction into coarse misalignment error and detail-missing error.” Therefore it is used for prediction (estimation) of the depth as seen in table 4 on page 8 and page 8 section 3.3. Therefore, within a broadest reasonable interpretation (BRI), the interpretation of correction confidence prediction is covered under the system presented by the reference which also includes the affinity-based shift correction.) It is motivated by the same improvements presented on claim 1 since the correction confidence prediction falls under (is a part of) the affinity-based shift correction already rejected.

Claim 19 is rejected under the same analysis as claim 6. 

As per claim 7, “the depth system of claim 1, wherein the instructions to provide the depth map include instructions to communicate the depth map to one or more systems within a vehicle to facilitate control of the vehicle,” (Paragraphs 9 and 11 show the use of instructions including a depth map with the network module. This network module can communicate with a system of the vehicle for autonomous planning and navigation (therefore it facilitates) as shown in paragraph 49 and paragraph 57. See also fig. 1. “[0057] At 660, the network module 220 provides the depth map 260 as depth estimates of objects represented in the monocular image. In one arrangement, the network module 220 provides the depth map 260 to control a device (e.g., the vehicle 100) to navigate through a surrounding environment. As should be appreciated, in one arrangement, the network module 220 electronically provides the map 260 to other systems of the vehicle 100 in support of, for example, autonomous planning and navigation of the vehicle 100. ” Guizilini) “and wherein the depth model performs monocular depth estimation.” (See paragraph 6 “[0006]… For example, the novel network architecture includes a depth model that can use monocular images alone to provide depth estimates or that can also use sparse depth data provided via an integrated sparse auxiliary network (SAN) to derive improved depth estimates. ” Guizilini)

Claim 20 is rejected under the same analysis as claim 7. 

As per claim 8, Guizilini in view of Xu teaches “The depth system of claim 1, wherein the depth system is integrated within a vehicle,” (See fig. 1 and paragraphs 24-27. Guizilini) “and wherein the depth model selectively accepts depth data in addition to the image.” (See paragraphs 7 and 9-11. “[0007] As such, the depth model is more robust than a model utilizing a single input stream since the depth model can selectively integrate the sparse depth data into the depth estimates as the sparse depth data is available. That is, for example, the sensors of the device (e.g., a vehicle) may encounter difficulties, such as hardware failures during operation. As such, when the sparse depth data is unavailable, the depth model is still capable of producing depth estimates according to the monocular image.”. See also paragraphs 20-21. Guizilini)

Claims 3, 6, 11, 12, 13, 16, 17, 18 are rejected under 35 U.S.C. 103 as being unpatentable over Guizilini in view of Xu and further in view of Warburg et al. (Warburg, Frederik, Michael Ramamonjisoa, and Manuel López-Antequera. "Sparseformer: Attention-based depth completion network." arXiv preprint arXiv:2206.04557 (2022).).

As per claim 3, Guizilini in view of Xu already teaches “the depth system of claim 1, wherein the instructions to decode the features using the affinity-based shift correction”; however, Guizilini in view of Xu does not teach “include instructions to iteratively align depth predictions to sparse depth data from the sensor data according to predicted affinities between the sparse depth data and pixels of the features.” Warburg teaches “include instructions to iteratively align depth predictions to sparse depth data from the sensor data according to predicted affinities between the sparse depth data and pixels of the features.” (See page 1 column 2 paragraph 2 “This attention volume describes the similarity between each region in an image and each 3D landmark. This can be interpreted as an affinity matrix that can be used to interpolate depth to the entire scene in a single step. The global nature of this affinity allows the model to generalize well to different landmark distributions and sparsity levels.” This means that attention volume is interchangeable with affinity. See pages 2-3 section SparseFormer “The goal of each SparseFormer is to diffuse depth and confidence from a few 3D points to enrich deep convolutional features. E.g. if a sparse point is observed on a plane, we aim to diffuse the known depth to the entire plane.”. It also shows how the attention volume works by using sparse features (See page 3 column 1 paragraphs 2-4 and column 2 paragraphs 1-4.) Therefore it uses sparse depth data and pixels of the features, since it also utilizes the Structure from Motion (SfM) landmarks (which contain pixels and depth points as seen in page 5 column 1 and fig. 3 – fig. 5). It is interpreted as “sensor data” as it includes visual images as those seen in fig. 3-5 (camera was used). See page 3 column 2 section Refinement Module. See also section 1. Introduction, columns 1-2, as it shows that SfM (structure from motion) already contains the means for an affinity since it is using the pixels containing depth. “The differentiation is important as SfM landmarks are less uniformly distributed, more error prone and much sparser (typically < 0.1% of pixels have depth)…”. On fig. 1, along with the accompanying paragraph, it shows all of the limitations together by utilizing SparseFormer; examiner interprets iterations in this case as 4, since it is cycled through SparseFormer 4 times in order to align depth predictions by diffusing depth and confidence with sparse depth data. See also pages 3-4 section Handling varying levels of depth sparsity along with 4. Experimental Results. Warburg)
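For illustration only, the quoted idea that an attention volume “can be interpreted as an affinity matrix that can be used to interpolate depth to the entire scene in a single step” can be sketched with generic scaled dot-product attention. This is a sketch under stated assumptions and is not Warburg's SparseFormer: the function name `affinity_interpolate`, the feature shapes, and the softmax normalization over landmarks are hypothetical choices made here.

```python
import numpy as np

def affinity_interpolate(pixel_feats, landmark_feats, landmark_depth):
    """Interpolate dense depth from sparse landmarks via an affinity matrix.

    pixel_feats    : (P, C) one feature vector per pixel or image region
    landmark_feats : (N, C) one feature vector per sparse 3D landmark
    landmark_depth : (N,) known depth of each landmark
    Returns the (P, N) affinity matrix and the (P,) interpolated depth.
    """
    c = pixel_feats.shape[1]
    # Similarity between every pixel and every landmark.
    logits = pixel_feats @ landmark_feats.T / np.sqrt(c)
    # Softmax over landmarks (shifted by the row maximum for stability).
    logits = logits - logits.max(axis=1, keepdims=True)
    affinity = np.exp(logits)
    affinity = affinity / affinity.sum(axis=1, keepdims=True)
    # Each pixel's depth is an affinity-weighted average of landmark depths.
    dense_depth = affinity @ landmark_depth
    return affinity, dense_depth
```

Because each row of the affinity matrix sums to one, every interpolated depth in this sketch lies between the smallest and largest landmark depth.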

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Guizilini and Xu with the teachings of Warburg to iteratively align depth predictions according to predicted affinities. The modification would have been motivated by the desire for better depth completion performance and fewer depth errors; it is therefore an improvement, as suggested by Warburg (See page 1 column 2 paragraph 3 “We find that the SparseFormer especially improves performance even for low densities of points. Then we show on a large outdoor dataset that our approach is effective for very sparse depth completion with 3D points obtained from a standard SfM pipeline.” See also page 3 column 2 section Refinement Module “The transformer can also filter 3D points with erroneous depth, which are common in SfM pipelines due to e.g. wrong 2d-2d matches. It can achieve this by learning a mapping that sends these features far away from the rest of the features in the embedding space. This is important in order to not propagate erroneous depth to the interpolation stage.”. See also page 5 Conclusion. Warburg)

Claim 11 is rejected under the same analysis as claim 3. 
Claim 16 is rejected under the same analysis as claim 3. 

As per claim 12, Guizilini in view of Xu already teaches “the depth system of claim 9, wherein the instructions to decode the features using the affinity- based shift correction include instructions”, however Guizilini in view of Xu  does not teach “to determine an affinity for depth points from the sensor data in relation to an intermediate depth map by computing the affinity between pairs of the depth points and pixels of the intermediate depth map.” 
Warburg teaches “to determine an affinity for depth points from the sensor data in relation to an intermediate depth map by computing the affinity between pairs of the depth points and pixels of the intermediate depth map.”( See page 1 column 2 paragraph 2 “This attention volume describes the similarity between each region in an image and each 3D landmark. This can be interpreted as an affinity matrix that can be used to interpolate depth to the entire scene in a single step. The global nature of this affinity allows the model to generalize well to different landmark distributions and sparsity levels.”. The attention volume in this case is interchangeable with the affinity. See fig. 1 and page 3 column 1 paragraphs 2-4, it shows that the attention volume is calculated using deep features/sparse features for depth points (see the refinement module and sparseformer) (equation 1-3) and various intermediate depth maps are obtained as seen in fig. 1 by using sparseformer. This is performed by computing the affinity between depth points and pixels as it uses the SfM landmarks by the refinement module (which contain pixels and depth points as seen in page 5 column 1 and fig. 3 – fig. 5). See page 3 column 2 section refinement  module. See also page section 1. Introduction columns 1-2 as it shows that Sfm (structure from motion) already contains the means for an affinity since it is using the pixels containing depth. “The differentiation is important as SfM landmarks are less uniformly distributed, more error prone and much sparser (typically < 0.1% of pixels have depth)…”. It is interpreted as “sensor data” as it includes visual images as those seen in fig. 3-5. See also page 2 section Iterative Fusion and 3. Method. Warburg )

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Guizilini and Xu with the teachings of Warburg to decode the features by determining an affinity utilizing intermediate depth maps. The modification would have been motivated by the desire for better depth completion performance and fewer depth errors; it is therefore an improvement, as suggested by Warburg (See page 1 column 2 paragraph 3 “We find that the SparseFormer especially improves performance even for low densities of points. Then we show on a large outdoor dataset that our approach is effective for very sparse depth completion with 3D points obtained from a standard SfM pipeline.” See also page 3 column 2 section Refinement Module “The transformer can also filter 3D points with erroneous depth, which are common in SfM pipelines due to e.g. wrong 2d-2d matches. It can achieve this by learning a mapping that sends these features far away from the rest of the features in the embedding space. This is important in order to not propagate erroneous depth to the interpolation stage.”. See also page 5 Conclusion. Warburg)

Claim 17 is rejected under the same analysis as claim 12. 

As per claim 13, Guizilini in view of Xu and further in view of Warburg already teaches “the depth system of claim 12, wherein the instructions to decode the features using the affinity-based shift correction”, and Guizilini in view of Xu and Warburg also teaches “include instructions to generate the depth map using the affinities to correlate the depth points (See fig. 1 on page 3, sparseformer outputs a depth map utilizing affinities to correlate depth points. See also page 3 columns 1-2 along with equations 1-3. The attention volume (affinity) correlates depth points. See also page 4 with fig. 2. Warburg) and determine depth errors to correct the depth map.” (It also corrects the depth map by filtering depth errors, see page 3 Refinement Module “The transformer can also filter 3D points with erroneous depth, which are common in SfM pipelines due to e.g. wrong 2d-2d matches. It can achieve this by learning a mapping that sends these features far away from the rest of the features in the embedding space. This is important in order to not propagate erroneous depth to the interpolation stage.”; by learning there is a determination of depth errors, and by filtering it is correcting the depth map as shown in figure 1. It utilizes the same motivation presented in claim 4 since it falls under the information presented in the rejection of claim 4. Warburg) (Xu also teaches “and determine depth errors to correct the depth map” on page 7 column 2 section Decoupling of monocular depth error “the local recovery strategy is also performed to decouple monocular depth error between ground truth and globally aligned prediction into coarse misalignment error and detail-missing error. Compared with the error of global recovery, the alleviated error brought by local recovery represents the coarse misalignment error…” See also table 3 and table 4 for more information and page 4 column 2 paragraph 1 and fig. 2 “Compared to the global least-squares fitting, our local recovery strategy generates a scale and a shift map for each depth map, which can not only recover metric depth but also correct overall depth maps and ensure the accuracy and consistency of 3D reconstruction.” Xu)
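For illustration only, the error decoupling quoted from Xu (the error alleviated by local recovery is the coarse misalignment error, and what remains is the detail-missing error) reduces to a subtraction of two error values. This is a minimal sketch under stated assumptions: the use of mean absolute relative error as the metric and the names `abs_rel` and `decouple_error` are hypothetical and are not taken from the cited reference.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error between a prediction and ground truth."""
    return float(np.mean(np.abs(pred - gt) / gt))

def decouple_error(pred_global, pred_local, gt):
    """Split the error of a globally aligned prediction into two parts.

    pred_global : depth aligned with one global scale and shift
    pred_local  : depth aligned with local scale and shift maps
    gt          : ground-truth metric depth (strictly positive)
    """
    err_global = abs_rel(pred_global, gt)
    err_local = abs_rel(pred_local, gt)
    return {
        # Portion of the error removed by switching to local recovery.
        "coarse_misalignment": err_global - err_local,
        # Portion of the error that local recovery does not remove.
        "detail_missing": err_local,
    }
```

By construction, the two parts in this sketch sum to the error of the globally aligned prediction.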

Claim 18 is rejected under the same analysis as claim 13. 

As per claim 6, Guizilini in view of Xu already teaches “the depth system of claim 1, wherein the instructions to decode the features using the affinity-based shift correction”; however, Guizilini also teaches “include instructions to… selectively integrate information from sparse depth data into decoding the depth map from the features.” (Paragraphs 10, 23 and 39 show that depth features include sparse depth data. Paragraph 9 shows “[0009]… The network module includes instructions to selectively inject the depth features into a depth model. The network module includes instructions to generate a depth map from at least a monocular image using the depth model that is guided by the depth features when injected.” Paragraph 8 then shows that those features are fed into the decoder: “[0008]… The depth model concatenates the image features and the depth features and provides the concatenated features into the decoder of the encoder/decoder structure, which then produces the depth estimates.” Paragraphs 47-48 then show that the decoder generates the depth map from features. “[0047] With continued reference to FIG. 5, the depth model 250 further includes the image decoder 420. One example of how the image decoder 420 functions includes unfolding (i.e., adapting dimensions of the tensor to extract the features) the previously encoded spatial information in order to derive the depth map 260 according to learned correlations associated with the encoded features.” Guizilini) And Xu teaches “include instructions to apply a correction confidence prediction…” (See page 3 column 1 paragraph 2, which shows that the affinity-based shift correction allows a correction (recovery) of a confidence (error) prediction (depth). “the second benefit of our local recovery strategy is to better analyze the weakness of all existing depth estimation methods and guide the design and choice of loss functions. The depth error can be decoupled into two parts: the coarse misalignment error and the detail missing error. Compared with the error of global fitting, the alleviated error achieved by the local recovery is the coarse misalignment error, while the remaining one is the detail missing error.” The coarse misalignment error (confidence of a prediction) is alleviated (recovered/correction) by using the model presented in section 2.2 Metric depth recovery (affinity-based shift correction), as is further explained in page 7 column 2 section Decoupling of monocular depth error: “the local recovery strategy is also performed to decouple monocular depth error between ground truth and globally aligned prediction into coarse misalignment error and detail-missing error.” Therefore it is used for prediction (estimation) of the depth as seen in table 4 on page 8 and page 8 section 3.3. Therefore, within a broadest reasonable interpretation (BRI), the interpretation of correction confidence prediction is covered under the system presented by the reference, which also includes the affinity-based shift correction.) It is motivated by the same improvements presented on claim 1, since the correction confidence prediction falls under (is a part of) the affinity-based shift correction already rejected. However, Guizilini in view of Xu does not teach “including iteratively aligning an intermediate depth map to the sparse depth data and fusing a result of the alignment with decoder features for a subsequent decoder stage.”

Warburg teaches “including iteratively aligning an intermediate depth map to the sparse depth data and fusing a result of the alignment with decoder features for a subsequent decoder stage.” (See pages 2-3 section Sparseformer and fig. 1. Fig. 1 shows that the SparseFormer performs iterations (4 in this case) as decoder stages, takes and produces the depth maps (din) and (dout), and iteratively aligns the intermediate depth map to the sparse depth data, along with the refinement module, which fuses features. “Figure 1: (a) Architecture overview. We insert 4 SparseFormers into a depth prediction encoder-decoder architecture [10]. (b) SparseFormer. The SparseFormer takes a convolutional feature map f, and N 3D sparse points as input and outputs an interpolated depth map dout and a confidence map m. (c) Refinement module. The refinement module fuses deep features of SfM points and their associated depth, allowing the features to communicate via attention.” See also page 3 section Refinement Module.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Guizilini and Xu with the teachings of Warburg to decode the features by determining an affinity utilizing intermediate depth maps, and to iteratively align the intermediate depth maps to sparse data and fuse features as decoder stages. The modification would have been motivated by the desire to have better performance for depth completion and fewer errors in depths; in addition, it also filters outliers, improves the depth and improves the features of each sparse point. Therefore it is an improvement, as suggested by Warburg (See page 1 column 2 paragraph 3 “We find that the SparseFormer especially improves performance even for low densities of points. Then we show on a large outdoor dataset that our approach is effective for very sparse depth completion with 3D points obtained from a standard SfM pipeline.” See also page 3 column 2 section Refinement Module “The transformer can also filter 3D points with erroneous depth, which are common in SfM pipelines due to e.g. wrong 2d-2d matches. It can achieve this by learning a mapping that sends these features far away from the rest of the features in the embedding space. This is important in order to not propagate erroneous depth to the interpolation stage.” See also page 5 Conclusion. See also page 3 refinement module “The refinement module has three objectives: (1) add depth information to the deep features, and allow the features to communicate to (2) filter outliers and (3) improve the depth. Fig. 1 (c) gives an overview of the refinement module. First, deep features for each of the 3D points are concatenated with their associated depth to add depth information. Then, these concatenated features are fed through a standard transformer with self-attention. The transformer improves the feature of each 3D point by allowing it to share information between the other land marks.” Warburg)


Claims 4-5 are rejected under 35 U.S.C. 103 as being unpatentable over Guizilini in view of Xu and further in view of Warburg and Bristow et al., hereafter Bristow (Bristow, Hilton, Jack Valmadre, and Simon Lucey. "Dense semantic correspondence where every pixel is a classifier." Proceedings of the IEEE International Conference on Computer Vision. 2015.).

As per claim 4, Guizilini in view of Xu already teaches “the depth system of claim 1, wherein the instructions to decode the features using the affinity-based shift correction include instructions” and “using depth errors…” (Xu also teaches the use of depth errors in page 7 column 2 section Decoupling of monocular depth error “the local recovery strategy is also performed to decouple monocular depth error between ground truth and globally aligned prediction into coarse misalignment error and detail-missing error. Compared with the error of global recovery, the alleviated error brought by local recovery represents the coarse misalignment error…” See also table 3 and table 4 for more information and page 4 column 2 paragraph 1 and fig. 2 “Compared to the global least-squares fitting, our local recovery strategy generates a scale and a shift map for each depth map, which can not only recover metric depth but also correct overall depth maps and ensure the accuracy and consistency of 3D reconstruction.” Xu); however, Guizilini in view of Xu does not teach “to determine an affinity for depth points from the sensor data in relation to an intermediate depth map by computing the affinity between pairs of the depth points and pixels of the intermediate depth map.” and “semantically similar pixels where the semantic similarity defines a similarity between of the pixels according to classes of associated objects depicted by the pixels.”
Warburg teaches “to determine an affinity for depth points from the sensor data in relation to an intermediate depth map by computing the affinity between pairs of the depth points and pixels of the intermediate depth map.” (See page 1 column 2 paragraph 2 “This attention volume describes the similarity between each region in an image and each 3D landmark. This can be interpreted as an affinity matrix that can be used to interpolate depth to the entire scene in a single step. The global nature of this affinity allows the model to generalize well to different landmark distributions and sparsity levels.” The attention volume in this case is interchangeable with the affinity. See fig. 1 and page 3 column 1 paragraphs 2-4, which show that the attention volume is calculated using deep features/sparse features for depth points (see the refinement module and SparseFormer) (equations 1-3) and that various intermediate depth maps are obtained, as seen in fig. 1, by using the SparseFormer. This is performed by computing the affinity between depth points and pixels, as the refinement module uses the SfM landmarks (which contain pixels and depth points as seen in page 5 column 1 and fig. 3 – fig. 5). See page 3 column 2 section refinement module. See also section 1. Introduction columns 1-2, as it shows that SfM (structure from motion) already contains the means for an affinity since it is using the pixels containing depth. “The differentiation is important as SfM landmarks are less uniformly distributed, more error prone and much sparser (typically < 0.1% of pixels have depth)…”. It is interpreted as “sensor data” as it includes visual images as those seen in fig. 3-5. See also page 2 section Iterative Fusion and 3. Method. Warburg)
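By way of illustration only, the "affinity matrix that can be used to interpolate depth" described in the quoted Warburg passage can be sketched as a softmax attention between per-pixel features and per-landmark features. This is a generic sketch and not Warburg's equations 1-3, which are not reproduced in this action; the function names, the dot-product similarity, and the toy feature vectors are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_depth(pixel_features, landmark_features, landmark_depths):
    """Spread sparse landmark depths to every pixel via an affinity matrix.

    Assumption: affinity is a softmax over dot-product similarity between
    a pixel feature and each landmark feature (hypothetical choice).
    """
    dense_depth = []
    for pixel in pixel_features:
        # One row of the affinity matrix: similarity of this pixel to each landmark.
        scores = [sum(a * b for a, b in zip(pixel, landmark))
                  for landmark in landmark_features]
        weights = softmax(scores)
        # Interpolated depth is the affinity-weighted average of landmark depths.
        dense_depth.append(sum(w * d for w, d in zip(weights, landmark_depths)))
    return dense_depth


# Toy data: two landmarks with known depth, three pixels to fill in.
landmarks = [[1.0, 0.0], [0.0, 1.0]]
depths = [2.0, 8.0]
pixels = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
dense = interpolate_depth(pixels, landmarks, depths)
```

In this toy case the first two pixels resemble one landmark each and take nearly that landmark's depth, while the third pixel is equally similar to both and receives their average.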

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Guizilini and Xu with the teachings of Warburg to decode the features by determining an affinity utilizing intermediate depth maps. The modification would have been motivated by the desire to have better performance for depth completion and fewer errors in depths; therefore it is an improvement, as suggested by Warburg (See page 1 column 2 paragraph 3 “We find that the SparseFormer especially improves performance even for low densities of points. Then we show on a large outdoor dataset that our approach is effective for very sparse depth completion with 3D points obtained from a standard SfM pipeline.” See also page 3 column 2 section Refinement Module “The transformer can also filter 3D points with erroneous depth, which are common in SfM pipelines due to e.g. wrong 2d-2d matches. It can achieve this by learning a mapping that sends these features far away from the rest of the features in the embedding space. This is important in order to not propagate erroneous depth to the interpolation stage.” See also page 5 Conclusion. Warburg)

Bristow teaches “semantically similar pixels where the semantic similarity defines a similarity between of the pixels according to classes of associated objects depicted by the pixels.” (See page 3 section 3. Dense Semantic correspondence column 1 paragraphs 1-5 and column 2. Column 1 shows “Given two images, I_A ∈ R^(M×N) and I_B ∈ R^(P×Q), and a discrete set of points x = [x_1, x_2, ..., x_MN], dense semantic correspondence involves minimizing the inverse fitting problem… where f is the unary function that evaluates the likelihood of a particular assignment for each x_i based on the image content, and g is a regularizer which enforces constraints on the joint configuration of the points. In semantic correspondence, the unary function must be a good indicator of semantic similarity, and so must be robust to significant intra-class variation. In the framework we adopt, there are no constraints on its complexity or properties… For our LDA classifiers, we extract features from a window of pixels around x_i and return a multi-channel patch centered at x_i… This implies a feature space capable of producing similar outputs for semantically similar inputs. Finding such a feature embedding is a difficult task in general, and as a result significant object detection literature has focused on learning classifiers to distinguish classes instead.” Column 2 paragraph 4 also shows “This is traditional binary classification, where the positive class contains the reference pixel, and its true correspondence in the target image, and the negative class contains all other pixels…” See the invention being applied to fig. 1 on page 1; the object in this case is the elephant, therefore it also covers the BRI (broadest reasonable interpretation) “of pixels according to classes of associated objects” since an object is associated with classifiers of pixels. See also equations 1-5 on page 3, which show that the unary function (which is used in several equations) acts as a semantic similarity between pixels used for classifiers. See also pages 3-5 sections 3.1 and 3.2. Pages 3-4 section 3.1 show the use of objects based on classes. Fig. 3 on page 6 shows the structure of the objects used for classifying. See also the abstract, which shows that it is used for object recognition. Bristow)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Guizilini, Xu and Warburg with the teachings of Bristow to use depth errors according to a semantic similarity between pixels. The modification would have been motivated by the desire to have more robustness to intra-class variation, as well as higher precision and faster training; therefore it is an improvement, as suggested by Bristow (See abstract “LDA classifiers have two distinct benefits: (i) they exhibit higher average precision than similarity metrics typically used in correspondence problems, and (ii) unlike exemplar SVM, can output globally interpretable posterior probabilities without calibration, whilst also being significantly faster to train.” See also page 3 section 3 “In semantic correspondence, the unary function must be a good indicator of semantic similarity, and so must be robust to significant intra-class variation. In the framework we adopt, there are no constraints on its complexity or properties.” Bristow)

As per claim 5, Guizilini in view of Xu and further in view of Warburg and Bristow already teaches “the depth system of claim 4, wherein the instructions to decode the features using the affinity-based shift correction”; however, Guizilini in view of Xu, Warburg and Bristow also teaches “include instructions to generate the depth map using the affinities to correlate the depth points” (See fig. 1 on page 3: the SparseFormer outputs a depth map utilizing affinities to correlate depth points. See also page 3 columns 1-2 along with equations 1-3. The attention volume (affinity) correlates depth points. See also page 4 with fig. 2. Warburg) “and determine depth errors to correct the depth map.” (It also corrects the depth map by filtering depth errors, see page 3 Refinement Module “The transformer can also filter 3D points with erroneous depth, which are common in SfM pipelines due to e.g. wrong 2d-2d matches. It can achieve this by learning a mapping that sends these features far away from the rest of the features in the embedding space. This is important in order to not propagate erroneous depth to the interpolation stage.” By learning, there is a determination of depth errors, and by filtering, it is correcting the depth map as shown in figure 1. It utilizes the same motivation presented in claim 4 since it falls under the information presented in the rejection of claim 4. Warburg) (Xu also teaches “and determine depth errors to correct the depth map” in page 7 column 2 section Decoupling of monocular depth error “the local recovery strategy is also performed to decouple monocular depth error between ground truth and globally aligned prediction into coarse misalignment error and detail-missing error. Compared with the error of global recovery, the alleviated error brought by local recovery represents the coarse misalignment error…” See also table 3 and table 4 for more information and page 4 column 2 paragraph 1 and fig. 2 “Compared to the global least-squares fitting, our local recovery strategy generates a scale and a shift map for each depth map, which can not only recover metric depth but also correct overall depth maps and ensure the accuracy and consistency of 3D reconstruction.” Xu)
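For context on the "global least-squares fitting" that the quoted Xu passage contrasts with its local recovery strategy, the following is a minimal sketch of fitting one scale and one shift for a whole depth map. It is not taken from Xu; the function name and the sample values are hypothetical, and the closed-form solution is the ordinary least-squares line fit of predicted depth to reference depth at the sparse points.

```python
def fit_scale_shift(predicted, reference):
    """Return (scale, shift) minimizing sum((scale * p + shift - r) ** 2).

    Assumption: `predicted` holds relative depths at the sparse points and
    `reference` holds the metric depths at those same points.
    """
    n = len(predicted)
    if n != len(reference) or n < 2:
        raise ValueError("need at least two matched depth samples")
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    variance = sum((p - mean_p) ** 2 for p in predicted)
    if variance == 0:
        raise ValueError("predicted depths are constant; scale is undefined")
    covariance = sum((p - mean_p) * (r - mean_r)
                     for p, r in zip(predicted, reference))
    scale = covariance / variance
    shift = mean_r - scale * mean_p
    return scale, shift


# Hypothetical samples: the reference depths are exactly 2 * predicted + 1.
predicted = [1.0, 2.0, 3.0, 4.0]
reference = [3.0, 5.0, 7.0, 9.0]
scale, shift = fit_scale_shift(predicted, reference)
aligned = [scale * p + shift for p in predicted]
```

A global fit of this kind yields a single scale and shift per image, whereas the quoted passage describes generating a scale map and a shift map, that is, values that vary across the depth map.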

Pertinent Prior Art
Cheng et al. discloses depth estimation utilizing affinity (Cheng, Xinjing, Peng Wang, and Ruigang Yang. "Depth estimation via affinity learned with convolutional spatial propagation network." Proceedings of the European conference on computer vision (ECCV). 2018.).

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
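The reply windows stated above can be sketched as simple date arithmetic. This is an illustration only, assuming the Apr 09, 2026 mailing date shown in the prosecution timeline for this final rejection; the helper `add_months` is hypothetical, and the sketch ignores weekend and holiday rollover as well as the advisory-action scenario described above, so it is not a substitute for docketing the actual due dates.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the same day of the month `months` later, clamped to month end."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


# Assumption: mailing date taken from the prosecution timeline entry.
mailing_date = date(2026, 4, 9)

two_month_mark = add_months(mailing_date, 2)        # first-reply window for the advisory-action rule
shortened_period_end = add_months(mailing_date, 3)  # end of the THREE-MONTH shortened period
statutory_limit = add_months(mailing_date, 6)       # SIX-MONTH outer limit

print(two_month_mark, shortened_period_end, statutory_limit)
```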

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DYLAN J MENDEZ MUNIZ, whose telephone number is (703) 756-5672. The examiner can normally be reached M-F, 8AM - 5PM ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Moyer, can be reached at (571) 272-9523. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/DYLAN JOHN MENDEZ MUNIZ/ Examiner, Art Unit 2675
/ANDREW M MOYER/ Supervisory Patent Examiner, Art Unit 2675
Prosecution Timeline

Oct 24, 2023: Application Filed
Nov 19, 2025: Non-Final Rejection mailed (§103)
Jan 27, 2026: Interview Requested
Feb 03, 2026: Applicant Interview (Telephonic)
Feb 04, 2026: Response Filed
Feb 07, 2026: Examiner Interview Summary
Apr 09, 2026: Final Rejection mailed (§103, current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/255,360, Patent 12633129: METHOD AND SYSTEM FOR AUTOMATIC DRIVING DATA COLLECTION AND CLOSED-LOOP MANAGEMENT. Granted May 19, 2026 (2y 11m to grant)
18/254,236, Patent 12620263: GESTURE RECOGNIZING METHOD, INTERACTIVE METHOD, GESTURE INTERACTIVE SYSTEM, ELECTRONIC DEVICE, AND STORAGE MEDIUM. Granted May 05, 2026 (2y 11m to grant)
18/006,925, Patent 12597231: INFORMATION PROCESSING DEVICE AND INFORMATION PROCESSING METHOD. Granted Apr 07, 2026 (3y 2m to grant)
18/043,797, Patent 12573053: Image Shadow Detection Method and System, and Image Segmentation Device and Readable Storage Medium. Granted Mar 10, 2026 (3y 0m to grant)
18/476,399, Patent 12573040: IMAGE PROCESSING APPARATUS AND IMAGE PROCESSING METHOD. Granted Mar 10, 2026 (2y 5m to grant)
Based on the 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 81%
With Interview (+26.7%): 99%
Median Time to Grant: 2y 10m (~3m remaining)
PTA Risk: Moderate
Based on 21 resolved cases by this examiner. Grant probability derived from career allowance rate.