Prosecution Insights
Last updated: April 19, 2026
Application No. 18/165,163

DEPTH COMPLETION USING IMAGE AND SPARSE DEPTH INPUTS

Status: Non-Final OA (§103)
Filed: Feb 06, 2023
Examiner: ZAK, JACQUELINE ROSE
Art Unit: 2666
Tech Center: 2600 — Communications
Assignee: Qualcomm Incorporated
OA Round: 3 (Non-Final)
Grant Probability: 67% (Favorable)
OA Rounds: 3-4
To Grant: 2y 10m
With Interview: 55%

Examiner Intelligence

Career Allow Rate: 67%, above average (8 granted / 12 resolved; +4.7% vs TC avg)
Interview Lift: -11.4% for resolved cases with interview (minimal)
Avg Prosecution: 2y 10m typical timeline; 46 applications currently pending
Total Applications: 58 across all art units

Statute-Specific Performance

§101: 5.7% (-34.3% vs TC avg)
§102: 21.1% (-18.9% vs TC avg)
§103: 56.3% (+16.3% vs TC avg)
§112: 13.8% (-26.2% vs TC avg)

Tech Center averages are estimates. Based on career data from 12 resolved cases.
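The headline figures above can be sanity-checked directly from the reported counts. This is a minimal sketch; variable names are illustrative, and the Tech Center average is back-solved from the reported +4.7% delta rather than taken from any published number.

```python
# Sanity-check of the dashboard figures. All inputs come from the report
# itself; tc_avg_allow_rate is an estimate implied by the +4.7% delta.

granted, resolved = 8, 12
career_allow_rate = granted / resolved            # reported as 67%

tc_avg_allow_rate = 0.62                          # implied by "+4.7% vs TC avg"
delta_vs_tc = career_allow_rate - tc_avg_allow_rate

grant_prob_with_interview = 0.55                  # reported "With Interview"
interview_lift = grant_prob_with_interview - career_allow_rate

print(f"career allow rate: {career_allow_rate:.1%}")
print(f"delta vs TC avg:   {delta_vs_tc:+.1%}")
print(f"interview lift:    {interview_lift:+.1%}")
```

Note the lift computed this way is about -11.7 points rather than the reported -11.4%; the tool presumably works from unrounded internal values or a slightly different baseline.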

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/11/2025 has been entered.

Claim Status

Claims 1-30 are pending for examination in the application filed 11/21/2025. Claims 1 and 24 have been amended.

Response to Arguments and Amendments

Applicant's arguments regarding claims 1 and 24 have been considered but are moot because the new ground of rejection does not rely on the reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 18-19, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Xiong (US20230092248A1) in view of Pugh (US20210142497A1).
Regarding claim 1, Xiong teaches an apparatus for generating depth information from one or more images ([0007] an apparatus for obtaining scene segmentation includes an image sensor, a depth sensor, and at least one processing device. The at least one processing device is configured to obtain, from the image sensor, image data of a real-world scene; obtain, from the depth sensor, sparse depth data of the real-world scene; and pass the image data to a first neural network to obtain one or more object ROIs and one or more feature map ROIs, where each object ROI includes at least one detected object. The at least one processing device is also configured to pass the image data and the sparse depth data to a second neural network to obtain one or more dense depth map ROIs), the apparatus comprising: at least one memory (Fig. 1, memory 160); and at least one processor coupled to the at least one memory (Fig. 1, main processor 140), the at least one processor being configured to: obtain segmentation information associated with an image of a scene, the image including a plurality of pixels having a resolution ([0040] In some embodiments, the one or more detected objects 240 include one or more bounding boxes or other demarcation of at least one region in the image data 205 in which at least one object has been recognized (for which the detection and extraction network 215 has been trained to recognize). The specific pixels within the image data 205 containing each object may have not yet been isolated. As shown in visualization provided at FIG. 8A, one or more detected objects 240 may, in some embodiments, include or be associated with one or more bounding boxes defining the region(s) in the image data 205 containing the one or more detected instances of at least one trained object. In some cases, the feature maps 250 include feature maps of features detected at a plurality of patch sizes in each of the ROIs covered by the detected object(s) 240. 
[0062] for a given ROI, the object ROI, feature map ROI, and dense depth map ROI are resized to a common size, where the common size includes an equivalent array of points (such as m×n pixels or m×n depth points)…resizing may be performed using an image-guided super-resolution algorithm (such as the method 500 in FIG. 5) to increase the resolution of the ROI to the input patch size specified by the segmentation network), obtain depth information associated with one or more objects in the scene ([0006] obtaining, from a depth sensor, sparse depth data of the real-world scene. [0039] The sparse depth map data 210 include a collection of depth measurements obtained from a field of view of a depth sensor that overlaps the field of view of the camera used to obtain the image data 205); generate a plurality of features corresponding to the plurality of pixels, wherein each feature of the plurality of features corresponds to a particular pixel of the plurality of pixels, and wherein each feature includes respective segmentation information of the particular pixel and respective depth information of the particular pixel ([0065] According to some embodiments, the aligned input patch data includes image data 701 of the ROI, where the image data includes pixels of data from the ROI at the input patch size (such as m×n pixels) of the fully convolutional network 700. As shown in FIG. 7, the aligned input patch data also includes feature map data 703 of the same ROI as the image data 701. Note that the feature map data 703 may include one or more maps of the detected features in the ROI, such as constituent maps of a feature map pyramid, presented at the resolution specified by the input patch size of the fully convolutional network 700. In some embodiments, the aligned input patch data further includes depth map data 705, where the depth map data 705 includes depth points for a coordinate grid corresponding to the ROI covered by the image data 701 and the feature map data 703. 
The depth map data 705, like the image data 701 and the feature map data 703, is aligned and presented at a resolution and patch size specified by the fully convolutional network 700); and process the plurality of features to generate a dense depth output corresponding to the image ([0056] According to various embodiments, at an operation 515, a depth value for each pixel or coordinate location of a dense depth map is computed based on a function of a weighted average of depth values of neighboring depth points of the sparse depth map. [0008] align the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs; and pass the aligned one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs to a fully convolutional network).

Xiong does not teach wherein the segmentation information includes a segmentation map of the image, and wherein the segmentation map is indicative of a respective segmentation classification of a plurality of segmentation classifications for each pixel of the plurality of pixels; determine, based on the segmentation map, a boundary map indicative of a set of boundaries between different segmentation classifications of the plurality of segmentation classifications, wherein each boundary of the set of boundaries comprises an edge between a respective first segmentation classification and a respective second segmentation classification different from the respective first segmentation classification; and process the plurality of features to generate a dense depth output corresponding to the image using a predicted dense depth value uncertainty for each feature of the plurality of features, wherein the predicted dense depth value uncertainty for each feature of the plurality of features corresponds to a distance from a corresponding boundary of the set of boundaries, wherein the respective segmentation information included in each feature is the same as the respective first or second
segmentation classification of the corresponding boundary.

Pugh, in the same field of endeavor of image depth analysis, teaches wherein the segmentation information includes a segmentation map of the image, and wherein the segmentation map is indicative of a respective segmentation classification of a plurality of segmentation classifications for each pixel of the plurality of pixels; determine, based on the segmentation map, a boundary map indicative of a set of boundaries between different segmentation classifications of the plurality of segmentation classifications, wherein each boundary of the set of boundaries comprises an edge between a respective first segmentation classification and a respective second segmentation classification different from the respective first segmentation classification ([0028] In examples, improving the object boundary depths is accomplished by: identifying the edges within a dense (reasonably accurate) depth map (e.g., based on depth gradients, based on an edge map extracted from the same input image(s), based on a semantic segmentation map determined from the same input image(s), etc.); determining the object that the edges belong to (e.g., based on the semantic segmentation map); and correcting the edge depths based on the depth of the object that the edges belong to. [0078] Segmenting the scene S420 preferably functions to determine semantic probabilities for each of a set of pixels, and can optionally determine edges for each identified object. Segmenting the scene (S420) can include generating a semantic segmentation map that can be used to: refine edge depth in the dense, scaled, fused depth map; generate object masks; and/or be otherwise used. [0079] In a first variation, S420 includes segmenting objects appearing in the scene captured by the raw image set (obtained at S100), and determining a class (and/or a class probability) for each pixel forming the segments.
[0081] In a third variation, the per pixel object-classes can be combined to obtain object masks (e.g., each pixel associated with an object class can be combined to form a single object mask); and process the plurality of features to generate a dense depth output corresponding to the image using a predicted dense depth value uncertainty for each feature of the plurality of features, wherein the predicted dense depth value uncertainty for each feature of the plurality of features corresponds to a distance from a corresponding boundary of the set of boundaries, wherein the respective segmentation information included in each feature is the same as the respective first or second segmentation classification of the corresponding boundary ([0084] In variants, estimating dense pixelwise geometry S430 functions to determine a dense depth map for the image (e.g., obtained at S100, composited at S300, etc.)…In some implementations, the determined depth estimates can be inaccurate and/or unscaled. S470: [0115] In one example, the refinement algorithm can include three passes. The first phase (e.g., FIG. 8) can include: determining high probability points p (e.g., probability(p)>0.9) using a point-to-plane distance, given a 3D estimated floor plane and a probability map of segmented floor points. A threshold over Euclidean distance can be used. Depth of confirmed points can be refined using the computed point-to-plane distance difference. [0122] The data assets generated at S500 can include one or more of: photorealistic imagery 501; an object mask with clean depth edges 502; a dense depthmap that includes depth estimates for each pixel of the image being processed 503; and information identifying pixels that correspond to key geometric surfaces (e.g., walls, floors, horizontal surfaces, etc.) 504, as shown in FIG. 3).
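The disputed limitation pairs a segmentation-derived boundary map with a per-pixel depth uncertainty that depends on distance from the nearest class boundary. The following is a hypothetical sketch of that idea, not code from either reference; it uses Chebyshev distance and an arbitrary exponential falloff with scale `tau` for illustration.

```python
import numpy as np

def boundary_map(seg):
    """True where a pixel's right or bottom neighbour has a different class."""
    b = np.zeros(seg.shape, dtype=bool)
    b[:, :-1] |= seg[:, :-1] != seg[:, 1:]   # edges between horizontal neighbours
    b[:-1, :] |= seg[:-1, :] != seg[1:, :]   # edges between vertical neighbours
    return b

def distance_to_boundary(b):
    """Chebyshev distance to the nearest boundary pixel, by iterative relaxation."""
    dist = np.where(b, 0.0, np.inf)
    for _ in range(b.size):
        padded = np.pad(dist, 1, constant_values=np.inf)
        neigh = np.minimum.reduce([
            padded[:-2, 1:-1], padded[2:, 1:-1],
            padded[1:-1, :-2], padded[1:-1, 2:],
            padded[:-2, :-2], padded[:-2, 2:],
            padded[2:, :-2], padded[2:, 2:],
        ])
        new = np.minimum(dist, neigh + 1)
        if np.array_equal(new, dist):
            break
        dist = new
    return dist

def depth_uncertainty(seg, tau=2.0):
    """High on class boundaries, decaying with distance (tau is a made-up scale)."""
    return np.exp(-distance_to_boundary(boundary_map(seg)) / tau)

# Toy two-class segmentation map: left half class 0, right half class 1.
seg = np.zeros((6, 6), dtype=int)
seg[:, 3:] = 1
u = depth_uncertainty(seg)
# Uncertainty peaks on the boundary column and falls off away from it.
```

A depth-completion network could then weight its sparse-depth supervision by `1 - u`, trusting depth values far from segmentation boundaries more than those near them.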
Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the apparatus of Xiong with the teachings of Pugh to determine a boundary map indicative of boundaries between different segmentation classifications and generate a dense depth output "for generating an interactive, photorealistic model of a real-world scene with existing objects modeled in a manner to enable occlusions, to better provide mixed-reality interactive experiences, as compared to conventional systems and methods. In particular, the interactive platform renders virtual objects within a photographic scene, while providing believable mixed-reality depth occlusions using improved and smoothed 3D depth estimates and improved 3D edge boundaries (which are both noisy in practice). Improved object boundary depths can dramatically improve user experience, as humans are particularly sensitive to errant boundary pixels" [Pugh 0028]. Regarding claim 18, Xiong and Pugh teach the apparatus of claim 1. Xiong further teaches wherein each feature of the plurality of features further includes pixel coordinate information of a position of the particular pixel in the image ([0046] In this example, when an image patch contains an object that the models of the residual blocks 320A-320C have been trained to recognize, the residual block detecting the object within the image patch outputs a coordinate range (such as a bounding box) defining a superset of pixels of image data that includes pixels of the identified object and a classification of the detected object). Regarding claim 19, Xiong and Pugh teach the apparatus of claim 1. 
Xiong further teaches wherein each feature of the plurality of features further includes positional information indicative of a position of the particular pixel in the image ([0046] In this example, when an image patch contains an object that the models of the residual blocks 320A-320C have been trained to recognize, the residual block detecting the object within the image patch outputs a coordinate range (such as a bounding box) defining a superset of pixels of image data that includes pixels of the identified object and a classification of the detected object). Regarding claim 24, Xiong teaches a method for generating depth information from one or more images ([0006] a method for obtaining scene segmentation includes obtaining, from an image sensor, image data of a real-world scene; obtaining, from a depth sensor, sparse depth data of the real-world scene; and passing the image data to a first neural network to obtain one or more object regions of interest (ROIs) and one or more feature map ROIs, where each object ROI includes at least one detected object. The method also includes passing the image data and the sparse depth data to a second neural network to obtain one or more dense depth map ROIs), the method comprising: obtaining segmentation information associated with an image of a scene, the image including a plurality of pixels having a resolution ([0040] In some embodiments, the one or more detected objects 240 include one or more bounding boxes or other demarcation of at least one region in the image data 205 in which at least one object has been recognized (for which the detection and extraction network 215 has been trained to recognize). The specific pixels within the image data 205 containing each object may have not yet been isolated. As shown in visualization provided at FIG. 
8A, one or more detected objects 240 may, in some embodiments, include or be associated with one or more bounding boxes defining the region(s) in the image data 205 containing the one or more detected instances of at least one trained object. In some cases, the feature maps 250 include feature maps of features detected at a plurality of patch sizes in each of the ROIs covered by the detected object(s) 240. [0062] for a given ROI, the object ROI, feature map ROI, and dense depth map ROI are resized to a common size, where the common size includes an equivalent array of points (such as m×n pixels or m×n depth points)…resizing may be performed using an image-guided super-resolution algorithm (such as the method 500 in FIG. 5) to increase the resolution of the ROI to the input patch size specified by the segmentation network), obtaining depth information associated with one or more objects in the scene ([0006] obtaining, from a depth sensor, sparse depth data of the real-world scene. [0039] The sparse depth map data 210 include a collection of depth measurements obtained from a field of view of a depth sensor that overlaps the field of view of the camera used to obtain the image data 205); generating a plurality of features corresponding to the plurality of pixels, wherein each feature of the plurality of features corresponds to a particular pixel of the plurality of pixels, and wherein each feature includes respective segmentation information of the particular pixel and respective depth information of the particular pixel ([0065] According to some embodiments, the aligned input patch data includes image data 701 of the ROI, where the image data includes pixels of data from the ROI at the input patch size (such as m×n pixels) of the fully convolutional network 700. As shown in FIG. 7, the aligned input patch data also includes feature map data 703 of the same ROI as the image data 701. 
Note that the feature map data 703 may include one or more maps of the detected features in the ROI, such as constituent maps of a feature map pyramid, presented at the resolution specified by the input patch size of the fully convolutional network 700. In some embodiments, the aligned input patch data further includes depth map data 705, where the depth map data 705 includes depth points for a coordinate grid corresponding to the ROI covered by the image data 701 and the feature map data 703. The depth map data 705, like the image data 701 and the feature map data 703, is aligned and presented at a resolution and patch size specified by the fully convolutional network 700); and processing the plurality of features to generate a dense depth output corresponding to the image ([0056] According to various embodiments, at an operation 515, a depth value for each pixel or coordinate location of a dense depth map is computed based on a function of a weighted average of depth values of neighboring depth points of the sparse depth map. [0008] align the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs; and pass the aligned one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs to a fully convolutional network).
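Xiong's paragraph [0056], quoted above, computes each dense depth value as a weighted average over neighboring sparse depth points. One common concrete form is inverse-distance weighting; the sketch below assumes that form, which Xiong does not specify, and all names in it are illustrative.

```python
import numpy as np

def densify(sparse_yx, sparse_depth, height, width, eps=1e-6):
    """Inverse-distance-weighted average over all sparse depth points.

    A naive O(H*W*N) sketch; a real implementation would restrict the
    average to a local neighbourhood of each pixel, as [0056] suggests.
    """
    dense = np.zeros((height, width))
    for y in range(height):
        for x in range(width):
            d2 = (sparse_yx[:, 0] - y) ** 2 + (sparse_yx[:, 1] - x) ** 2
            w = 1.0 / (d2 + eps)                 # nearby points dominate
            dense[y, x] = np.sum(w * sparse_depth) / np.sum(w)
    return dense

# Four sparse measurements on an 8x8 grid: near plane on top, far on bottom.
pts = np.array([[0, 0], [0, 7], [7, 0], [7, 7]])
depths = np.array([1.0, 1.0, 3.0, 3.0])
dense = densify(pts, depths, 8, 8)
# dense matches the measurements at the sampled pixels and interpolates between.
```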
Xiong does not teach wherein the segmentation information includes a segmentation map of the image, and wherein the segmentation map is indicative of a respective segmentation classification of a plurality of segmentation classifications for each pixel of the plurality of pixels; determining, based on the segmentation map, a boundary map indicative of a set of boundaries between different segmentation classifications of the plurality of segmentation classifications, wherein each boundary of the set of boundaries comprises an edge between a respective first segmentation classification and a respective second segmentation classification different from the respective first segmentation classification; and processing the plurality of features to generate a dense depth output corresponding to the image using a predicted dense depth value uncertainty for each feature of the plurality of features, wherein the predicted dense depth value uncertainty for each feature of the plurality of features corresponds to a distance from a corresponding boundary of the set of boundaries, wherein the respective segmentation information included in each feature is the same as the respective first or second segmentation classification of the corresponding boundary. 
Pugh, in the same field of endeavor of image depth analysis, teaches wherein the segmentation information includes a segmentation map of the image, and wherein the segmentation map is indicative of a respective segmentation classification of a plurality of segmentation classifications for each pixel of the plurality of pixels; determining, based on the segmentation map, a boundary map indicative of a set of boundaries between different segmentation classifications of the plurality of segmentation classifications, wherein each boundary of the set of boundaries comprises an edge between a respective first segmentation classification and a respective second segmentation classification different from the respective first segmentation classification ([0028] In examples, improving the object boundary depths is accomplished by: identifying the edges within a dense (reasonably accurate) depth map (e.g., based on depth gradients, based on an edge map extracted from the same input image(s), based on a semantic segmentation map determined from the same input image(s), etc.); determining the object that the edges belong to (e.g., based on the semantic segmentation map); and correcting the edge depths based on the depth of the object that the edges belong to. [0078] Segmenting the scene S420 preferably functions to determine semantic probabilities for each of a set of pixels, and can optionally determine edges for each identified object. Segmenting the scene (S420) can include generating a semantic segmentation map that can be used to: refine edge depth in the dense, scaled, fused depth map; generate object masks; and/or be otherwise used. [0079] In a first variation, S420 includes segmenting objects appearing in the scene captured by the raw image set (obtained at S100), and determining a class (and/or a class probability) for each pixel forming the segments.
[0081] In a third variation, the per pixel object-classes can be combined to obtain object masks (e.g., each pixel associated with an object class can be combined to form a single object mask); and processing the plurality of features to generate a dense depth output corresponding to the image using a predicted dense depth value uncertainty for each feature of the plurality of features, wherein the predicted dense depth value uncertainty for each feature of the plurality of features corresponds to a distance from a corresponding boundary of the set of boundaries, wherein the respective segmentation information included in each feature is the same as the respective first or second segmentation classification of the corresponding boundary ([0084] In variants, estimating dense pixelwise geometry S430 functions to determine a dense depth map for the image (e.g., obtained at S100, composited at S300, etc.)…In some implementations, the determined depth estimates can be inaccurate and/or unscaled. S470: [0115] In one example, the refinement algorithm can include three passes. The first phase (e.g., FIG. 8) can include: determining high probability points p (e.g., probability(p)>0.9) using a point-to-plane distance, given a 3D estimated floor plane and a probability map of segmented floor points. A threshold over Euclidean distance can be used. Depth of confirmed points can be refined using the computed point-to-plane distance difference. [0122] The data assets generated at S500 can include one or more of: photorealistic imagery 501; an object mask with clean depth edges 502; a dense depthmap that includes depth estimates for each pixel of the image being processed 503; and information identifying pixels that correspond to key geometric surfaces (e.g., walls, floors, horizontal surfaces, etc.) 504, as shown in FIG. 3).
Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the method of Xiong with the teachings of Pugh to determine a boundary map indicative of boundaries between different segmentation classifications and generate a dense depth output "for generating an interactive, photorealistic model of a real-world scene with existing objects modeled in a manner to enable occlusions, to better provide mixed-reality interactive experiences, as compared to conventional systems and methods. In particular, the interactive platform renders virtual objects within a photographic scene, while providing believable mixed-reality depth occlusions using improved and smoothed 3D depth estimates and improved 3D edge boundaries (which are both noisy in practice). Improved object boundary depths can dramatically improve user experience, as humans are particularly sensitive to errant boundary pixels" [Pugh 0028]. Claims 20-23 are rejected under 35 U.S.C. 103 as being unpatentable over Xiong in view of Pugh and Sadeghi (US20220101549A1). Regarding claim 20, Xiong and Pugh teach the apparatus of claim 19. Sadeghi teaches wherein the positional information is a positional encoding indicative of pixel coordinate information of the position of the particular pixel in the image ([0135] In particular, the second implementation can accommodate “feature-based” UMs that take into account image features (i.e. the actual content of the images) in estimating depth uncertainty. [0155] At a second stage, an uncertainty model (UM) receives the point estimates as input and predicts the uncertainties—encoded as variances-per pixel. At run-time the UM is implemented as a look-up table detailing the variance. The lookup can be dependent on pixel location and disparity value. The UM can be stored as an array, meaning the result can be looked up in O(1) time). 
Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the apparatus of Xiong with the teachings of Sadeghi to have positional encoding indicative of pixel coordinate information for use in "a lookup table [which] has the benefit of being fast, operating with O(1) time complexity" [Sadeghi 0025]. Regarding claim 21, Xiong and Pugh teach the apparatus of claim 19. Xiong further teaches wherein: each pixel of the plurality of pixels is associated with a respective patch of a plurality of patches, the respective patch including a subset of the plurality of pixels ([0046] In this example, when an image patch contains an object that the models of the residual blocks 320A-320C have been trained to recognize, the residual block detecting the object within the image patch outputs a coordinate range (such as a bounding box) defining a superset of pixels of image data that includes pixels of the identified object and a classification of the detected object); and the positional information is indicative of the respective patch associated with the particular pixel and a position of the particular pixel in the respective patch ([0046] In this example, when an image patch contains an object that the models of the residual blocks 320A-320C have been trained to recognize, the residual block detecting the object within the image patch outputs a coordinate range (such as a bounding box) defining a superset of pixels of image data that includes pixels of the identified object and a classification of the detected object). Xiong does not teach the positional information is a positional encoding. Sadeghi teaches the positional information is a positional encoding ([0135] In particular, the second implementation can accommodate “feature-based” UMs that take into account image features (i.e. the actual content of the images) in estimating depth uncertainty. 
[0155] At a second stage, an uncertainty model (UM) receives the point estimates as input and predicts the uncertainties—encoded as variances-per pixel. At run-time the UM is implemented as a look-up table detailing the variance. The lookup can be dependent on pixel location and disparity value. The UM can be stored as an array, meaning the result can be looked up in O(1) time). Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the apparatus of Xiong with the teachings of Sadeghi to have a positional encoding for use in "a lookup table [which] has the benefit of being fast, operating with O(1) time complexity" [Sadeghi 0025]. Regarding claim 22, Xiong, Pugh, and Sadeghi teach the apparatus of claim 21. Xiong further teaches wherein: the respective patch includes pixels in a same column of the image; and the position of the particular pixel in the respective patch comprises a row index ([0045] Referring to the illustrative example of FIG. 3, at an operation 305, image data from the initially-obtained frame(s) of image data (such as the image data 205) is input into the neural network 300 as image patches of size m×m×x, where the height and width of the patches are of dimension m and where x represents the number of channels of the color space used to generate the image data (typically three or four channels are used). Regarding claim 23, Xiong and Pugh teach the apparatus of claim 1. Sadeghi teaches wherein each feature of the plurality of features further includes an uncertainty value associated with the respective depth information of the particular pixel ([0135] In particular, the second implementation can accommodate “feature-based” UMs that take into account image features (i.e. the actual content of the images) in estimating depth uncertainty. 
[0155] At a second stage, an uncertainty model (UM) receives the point estimates as input and predicts the uncertainties—encoded as variances-per pixel).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the apparatus of Xiong with the teachings of Sadeghi to calculate an uncertainty value associated with depth because "In the context of autonomous vehicles, providing a robust assessment of the level of uncertainty associated with an observed depth estimate—which translates to the level of uncertainty the AV has about its 3D surroundings—allows critical driving decisions to be made in a way that properly accounts for the level of uncertainty associated with the observations on which those decisions are based" [Sadeghi 0011].

Claims 2-6, 10-15 and 25-27 are rejected under 35 U.S.C. 103 as being unpatentable over Xiong in view of Pugh, and Masoumian (Masoumian, Armin, et al. "GCNDepth: Self-supervised monocular depth estimation based on graph convolutional network." Neurocomputing 517 (2023): 81-92.).

Regarding claim 2, Xiong and Pugh teach the apparatus of claim 1. Xiong teaches wherein: the at least one processor is configured to process the plurality of features to generate the dense depth output corresponding to the image ([0075] According to some embodiments, at an operation 920, the image data and the sparse depth data are passed to a second neural network (such as the first neural network 401 in FIG. 4) to obtain one or more dense depth maps, where each dense depth map is associated with an ROI for which an object ROI and a feature map ROI were generated. In some embodiments, the operation 920 includes passing the image and sparse depth data to a neural network to obtain an initial set of predicted depth values and performing image-guided super resolution (such as by applying the method 500 of FIG.
5) to increase the resolution (densify) the sparse depth map); and the depth information includes a coarse depth map comprising a plurality of locations having the resolution ([0050] Referring to the illustrative example of FIG. 4, the processing pipeline 400 takes, as inputs, the image data 205 and the sparse depth map data 210. Here, this data may include image data and sparse depth points covering one or more ROIs identified by the detection and extraction network 215. According to various embodiments, the image data 205 and the sparse depth map data 210 are passed to a first neural network 401, which outputs a low-resolution depth map of predicted depth values for the scene). Xiong does not teach using a graph neural network. Masoumian, in the same field of endeavor of depth estimation, teaches using a graph neural network (Fig. 1. An illustration of the proposed GCN [graph convolutional network] module containing two hidden layers). Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the apparatus of Xiong with the teachings of Masoumian to use a graph neural network because "CNN is limited, since it does not consider the characteristics of the geometric depth information and object location, as well as contextual features in the scene…The effectiveness of the graph convolution network (GCN) has been proved in processing graph data on the tasks of classification and segmentation" [Masoumian pg. 82 para. 2]. Regarding claim 3, Xiong, Pugh, and Masoumian teach the apparatus of claim 2. 
Xiong further teaches wherein the at least one processor is further configured to: determine contextual information associated with at least one location of the plurality of locations, the contextual information indicative of an absolute pixel location of the at least one location or indicative of a relative pixel location of the at least one location ([0056] According to various embodiments, at an operation 515, a depth value for each pixel or coordinate location of a dense depth map is computed based on a function of a weighted average of depth values of neighboring depth points of the sparse depth map). Regarding claim 4, Xiong, Pugh, and Masoumian teach the apparatus of claim 3. Xiong further teaches wherein the contextual information is further indicative of an object classification associated with a sparse depth value obtained for the at least one location ([0038] Referring to the illustrative example of FIG. 2, the architecture 200 receives, as inputs, one or more frames of image data 205 and sparse depth map data 210 and outputs segmentation results 299. According to various embodiments, the segmentation results 299 may include one or more masks defining one or more regions of an image frame as corresponding to one or more specific instances of one or more classified objects). Regarding claim 5, Xiong, Pugh, and Masoumian teach the apparatus of claim 2. 
Xiong further teaches wherein, to obtain the segmentation information, the at least one processor is configured to: obtain the image of the scene ([0006] In a first embodiment, a method for obtaining scene segmentation includes obtaining, from an image sensor, image data of a real-world scene…and passing the image data to a first neural network to obtain one or more object regions of interest (ROIs) and one or more feature map ROIs, where each object ROI includes at least one detected object); and generate, using a segmentation machine learning network, the segmentation map based on the image of the scene ([0006] and passing the aligned one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs to a fully convolutional network to obtain a segmentation of the real-world scene, where the segmentation contains one or more pixelwise predictions of one or more objects in the real-world scene). Regarding claim 6, Xiong, Pugh, and Masoumian teach the apparatus of claim 2. Xiong further teaches wherein each location of the plurality of locations in the coarse depth map includes a value representing a respective measured depth or a respective predicted depth of a pixel having a corresponding location in the image ([0050] Referring to the illustrative example of FIG. 4, the processing pipeline 400 takes, as inputs, the image data 205 and the sparse depth map data 210. Here, this data may include image data and sparse depth points covering one or more ROIs identified by the detection and extraction network 215. According to various embodiments, the image data 205 and the sparse depth map data 210 are passed to a first neural network 401, which outputs a low-resolution depth map of predicted depth values for the scene. 
[0056] According to various embodiments, at an operation 515, a depth value for each pixel or coordinate location of a dense depth map is computed based on a function of a weighted average of depth values of neighboring depth points of the sparse depth map. According to some embodiments, computing the depths at the operation 515 includes passing a Gaussian filter over the low-resolution depth map 505 to determine the weighted average value depths for points of the high-resolution depth map 260). Regarding claim 10, Xiong, Pugh, and Masoumian teach the apparatus of claim 2. Xiong further teaches wherein, to obtain the depth information, the at least one processor is configured to: obtain a sparse depth map associated with one or more objects in the scene, the sparse depth map comprising a plurality of locations having the resolution ([0039] The sparse depth map data 210 include a collection of depth measurements obtained from a field of view of a depth sensor that overlaps the field of view of the camera used to obtain the image data 205. [0054] Referring to the illustrative example of FIG. 5, the method 500 takes, as inputs, the one or more frames of image data 205 and a low-resolution depth map 505); and generate, using a coarse depth prediction machine learning network, the coarse depth map based on the sparse depth map and the image of the scene ([0050] Referring to the illustrative example of FIG. 4, the processing pipeline 400 takes, as inputs, the image data 205 and the sparse depth map data 210. Here, this data may include image data and sparse depth points covering one or more ROIs identified by the detection and extraction network 215. According to various embodiments, the image data 205 and the sparse depth map data 210 are passed to a first neural network 401, which outputs a low-resolution depth map of predicted depth values for the scene). Regarding claim 11, Xiong, Pugh, and Masoumian teach the apparatus of claim 10. 
Xiong further teaches wherein: each location of a first subset of locations of the plurality of locations in the sparse depth map includes a value representing a respective depth of a respective pixel having a corresponding location in the image ([0050] Referring to the illustrative example of FIG. 4, the processing pipeline 400 takes, as inputs, the image data 205 and the sparse depth map data 210. Here, this data may include image data and sparse depth points covering one or more ROIs identified by the detection and extraction network 215. According to various embodiments, the image data 205 and the sparse depth map data 210 are passed to a first neural network 401, which outputs a low-resolution depth map of predicted depth values for the scene. [0056] According to various embodiments, at an operation 515, a depth value for each pixel or coordinate location of a dense depth map is computed based on a function of a weighted average of depth values of neighboring depth points of the sparse depth map. According to some embodiments, computing the depths at the operation 515 includes passing a Gaussian filter over the low-resolution depth map 505 to determine the weighted average value depths for points of the high-resolution depth map 260); and each location of a second subset of locations of the plurality of locations in the sparse depth map includes a zero-value corresponding to a lack of depth information for a respective pixel having a corresponding location in the image ([0062] Where resizing a sparse ROI (such as one having comparatively fewer data points) requires upsampling, resizing may be performed using an image-guided super-resolution algorithm (such as the method 500 in FIG. 5) to increase the resolution of the ROI to the input patch size specified by the segmentation network. [0055] Referring to the illustrative example of FIG. 
5, at an operation 510, the processing platform computes, for each existing point p of the sparse depth map, weightings of adjacent depth points in the neighborhood of the point p based on localized pose differences, color differences, and spatial differences. In this way, where the color, pose, and spatial data points indicate abrupt differences…in the neighborhood of the depth point p). Regarding claim 12, Xiong and Pugh teach the apparatus of claim 1. Masoumian teaches to generate a plurality of graph nodes using the plurality of features, and wherein each graph node of the plurality of graph nodes includes a respective feature of the plurality of features corresponding to a particular pixel of the plurality of pixels ([pg. 82 para. 3] We propose a novel autoencoder (CNN-GCN) for monocular depth estimation, which its encoder network is based on ResNet as a backbone to extract key features of the input frame. A decoder network then utilizes the structure of the GCN through the whole decoding process to improve the accuracy of depth maps by learning the nodes (i.e., pixels) representation via constructing the depth maps via iteratively propagating neighbor’s information until reaching a stable point). Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the apparatus of Xiong with the teachings of Masoumian to generate graph nodes with features corresponding to pixels because "CNN is limited, since it does not consider the characteristics of the geometric depth information and object location, as well as contextual features in the scene…The effectiveness of the graph convolution network (GCN) has been proved in processing graph data on the tasks of classification and segmentation" [Masoumian pg. 82 para. 2]. Regarding claim 13, Xiong, Pugh, and Masoumian teach the apparatus of claim 12. 
Xiong further teaches wherein, to generate the dense depth output, the at least one processor is configured to: determine a predicted depth value for each respective pixel of the plurality of pixels ([0075] In some embodiments, the operation 920 includes passing the image and sparse depth data to a neural network to obtain an initial set of predicted depth values and performing image-guided super resolution (such as by applying the method 500 of FIG. 5) to increase the resolution (densify) the sparse depth map); and generate the dense depth output using the predicted depth value for each respective pixel of the plurality of pixels ([0051] Referring to the illustrative example of FIG. 4, the output of the first neural network (such as a depth map) is provided to an image-guided super-resolution stage 405, which refines and increases the data density of the depth map output by the first neural network 401 based at least in part on the image data 205. In contrast to interpolative approaches to recovering dense depth map data, the image-guided super-resolution stage 405 produces depth maps with fewer artifacts and that better represent edges and structural features of three-dimensional objects in the depth maps). Regarding claim 14, Xiong, Pugh, and Masoumian teach the apparatus of claim 13. Xiong further teaches wherein, to determine the predicted depth value for a pixel of the plurality of pixels, the at least one processor is configured ([0007] an apparatus for obtaining scene segmentation includes an image sensor, a depth sensor, and at least one processing device. The at least one processing device is configured to obtain, from the image sensor, image data of a real-world scene; obtain, from the depth sensor, sparse depth data of the real-world scene; and pass the image data to a first neural network to obtain one or more object ROIs and one or more feature map ROIs, where each object ROI includes at least one detected object. 
The at least one processing device is also configured to pass the image data and the sparse depth data to a second neural network to obtain one or more dense depth map ROIs). Masoumian teaches to obtain a respective subset of graph nodes included in the plurality of graph nodes, the respective subset of graph nodes corresponding to the pixel ([pg. 82 para. 3] We propose a novel autoencoder (CNN-GCN) for monocular depth estimation, which its encoder network is based on ResNet as a backbone to extract key features of the input frame. A decoder network then utilizes the structure of the GCN through the whole decoding process to improve the accuracy of depth maps by learning the nodes (i.e., pixels) representation via constructing the depth maps via iteratively propagating neighbor’s information until reaching a stable point); and process, using an adjacency matrix and one or more graph convolution layers of a graph neural network, the respective subset of graph nodes to generate the predicted depth value for the pixel ([pg. 84 para. 5] Fig. 1 illustrates the architecture of our GCN model. We randomly initialized the first adjacency matrix of the first graph in the decoder network with the exact size of the nodes in the first layer of the depth decoder. We fine-tuned a parameter, P, which represents the probability for edge creation and the percentage similarity of each node (i.e., vertices) or pixel with their neighbor nodes in the graph. Using P = 0.7 with the first random adjacency matrix yields the best estimated depth maps). Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the apparatus of Xiong with the teachings of Masoumian to evaluate a graph node and adjacent graph nodes corresponding to pixels to predict depth values because "the GNN performs better than CNN because the model learns the features by inspecting neighbouring nodes" [Masoumian pg. 83 para. 3]. 
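For context on the adjacency-matrix graph convolution described in the Masoumian citations above, the following is a minimal illustrative sketch of one normalized propagation step over pixel nodes. It is not the applicant's claimed implementation or Masoumian's actual network; the 4-pixel toy graph, feature dimensions, and weights are assumptions chosen only to show the mechanism.

```python
import numpy as np

def gcn_layer(features, adjacency, weights):
    """One graph-convolution step: normalized neighbor aggregation, then a linear map with ReLU."""
    a_hat = adjacency + np.eye(adjacency.shape[0])          # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))  # D^{-1/2} normalization
    propagated = d_inv_sqrt @ a_hat @ d_inv_sqrt @ features # each node averages its neighborhood
    return np.maximum(propagated @ weights, 0.0)            # ReLU activation

# Toy example: a 2x2 image flattened to 4 pixel nodes on a 4-connected grid.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
feats = np.random.default_rng(0).normal(size=(4, 3))  # 3 assumed features per pixel node
w = np.random.default_rng(1).normal(size=(3, 1))      # assumed readout to one depth value per node
depth = gcn_layer(feats, adj, w)
print(depth.shape)  # (4, 1): one non-negative predicted depth per pixel node
```

Each node's output depends on its own feature and those of adjacent nodes, which is the "inspecting neighbouring nodes" behavior the examiner's motivation quote relies on.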
Regarding claim 15, Xiong, Pugh, and Masoumian teach the apparatus of claim 14. Masoumian teaches wherein the respective subset of graph nodes includes: a graph node corresponding to the pixel ([pg. 82 para. 3] We propose a novel autoencoder (CNN-GCN) for monocular depth estimation, which its encoder network is based on ResNet as a backbone to extract key features of the input frame. A decoder network then utilizes the structure of the GCN through the whole decoding process to improve the accuracy of depth maps by learning the nodes (i.e., pixels) representation via constructing the depth maps via iteratively propagating neighbor’s information until reaching a stable point); and one or more adjacent graph nodes, each adjacent graph node corresponding to a pixel of the plurality of pixels that is adjacent to the pixel ([pg. 84 para. 5] Fig. 1 illustrates the architecture of our GCN model. We randomly initialized the first adjacency matrix of the first graph in the decoder network with the exact size of the nodes in the first layer of the depth decoder. We fine-tuned a parameter, P, which represents the probability for edge creation and the percentage similarity of each node (i.e., vertices) or pixel with their neighbor nodes in the graph. Using P = 0.7 with the first random adjacency matrix yields the best estimated depth maps). Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the apparatus of Xiong with the teachings of Masoumian to evaluate a graph node and adjacent graph nodes corresponding to pixels because "the GNN performs better than CNN because the model learns the features by inspecting neighbouring nodes" [Masoumian pg. 83 para. 3]. Regarding claim 25, Xiong and Pugh teach the method of claim 24. 
Xiong teaches wherein: the plurality of features are processed to generate the dense depth output corresponding to the image ([0075] According to some embodiments, at an operation 920, the image data and the sparse depth data are passed to a second neural network (such as the first neural network 401 in FIG. 4) to obtain one or more dense depth maps, where each dense depth map is associated with an ROI for which an object ROI and a feature map ROI were generated. In some embodiments, the operation 920 includes passing the image and sparse depth data to a neural network to obtain an initial set of predicted depth values and performing image-guided super resolution (such as by applying the method 500 of FIG. 5) to increase the resolution (densify) the sparse depth map); and the depth information includes a coarse depth map comprising a plurality of locations having the resolution ([0050] Referring to the illustrative example of FIG. 4, the processing pipeline 400 takes, as inputs, the image data 205 and the sparse depth map data 210. Here, this data may include image data and sparse depth points covering one or more ROIs identified by the detection and extraction network 215. According to various embodiments, the image data 205 and the sparse depth map data 210 are passed to a first neural network 401, which outputs a low-resolution depth map of predicted depth values for the scene). Xiong does not teach using a graph neural network. Masoumian, in the same field of endeavor of depth estimation, teaches using a graph neural network (Fig. 1. An illustration of the proposed GCN [graph convolutional network] module containing two hidden layers). 
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Xiong with the teachings of Masoumian to use a graph neural network because "CNN is limited, since it does not consider the characteristics of the geometric depth information and object location, as well as contextual features in the scene…The effectiveness of the graph convolution network (GCN) has been proved in processing graph data on the tasks of classification and segmentation" [Masoumian pg. 82 para. 2]. Regarding claim 26, Xiong, Pugh, and Masoumian teach the method of claim 25. Xiong further teaches wherein obtaining the segmentation information comprises: obtaining the image of the scene ([0006] In a first embodiment, a method for obtaining scene segmentation includes obtaining, from an image sensor, image data of a real-world scene…and passing the image data to a first neural network to obtain one or more object regions of interest (ROIs) and one or more feature map ROIs, where each object ROI includes at least one detected object); and generating, using a segmentation machine learning network, the segmentation map based on the image of the scene ([0006] and passing the aligned one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs to a fully convolutional network to obtain a segmentation of the real-world scene, where the segmentation contains one or more pixelwise predictions of one or more objects in the real-world scene). Regarding claim 27, Xiong, Pugh, and Masoumian teach the method of claim 25. Xiong further teaches wherein each location of the plurality of locations in the coarse depth map includes a value representing a respective measured depth or a respective predicted depth of a pixel having a corresponding location in the image ([0050] Referring to the illustrative example of FIG. 
4, the processing pipeline 400 takes, as inputs, the image data 205 and the sparse depth map data 210. Here, this data may include image data and sparse depth points covering one or more ROIs identified by the detection and extraction network 215. According to various embodiments, the image data 205 and the sparse depth map data 210 are passed to a first neural network 401, which outputs a low-resolution depth map of predicted depth values for the scene. [0056] According to various embodiments, at an operation 515, a depth value for each pixel or coordinate location of a dense depth map is computed based on a function of a weighted average of depth values of neighboring depth points of the sparse depth map. According to some embodiments, computing the depths at the operation 515 includes passing a Gaussian filter over the low-resolution depth map 505 to determine the weighted average value depths for points of the high-resolution depth map 260). Claims 7-9 and 28-30 are rejected under 35 U.S.C. 103 as being unpatentable over Xiong in view of Pugh, Masoumian, and Sadeghi. Regarding claim 7, Xiong, Pugh, and Masoumian teach the apparatus of claim 6. Sadeghi, in the same field of endeavor of depth estimation, teaches wherein each feature of the plurality of features further includes a depth uncertainty value ([0135] In particular, the second implementation can accommodate “feature-based” UMs that take into account image features (i.e. the actual content of the images) in estimating depth uncertainty) associated with the respective depth information of the particular pixel, and wherein: the depth uncertainty value is zero based on the respective depth information comprising a measured depth from the coarse depth map (See Fig. 14 uncertainty map zero values. [0345] An example of the uncertainty map for models 2 and 3 is shown in FIG. 14, as applied to an urban scene shown in FIG. 10. 
Note that both models encode roughly the same information at a coarse level—that higher uncertainty should be assigned to larger disparities. However, the UM-Superpixel possesses the ability to assign higher uncertainty for localized distortions, e.g. the unusually large variance in the lower right hand corner, which is due to the search space in this area of the image being artificially restricted, as pixels with low depths and consequently high disparities in the right image will fall outside of the field of view for the left image); and the depth uncertainty value is a calculated uncertainty based on the respective depth information comprising a predicted depth from the coarse depth map ([0165] The depth estimator 2 receives rectified stereo image pairs and processes each stereo image pair 6 to compute a depth estimate therefrom, in the form of a depth image. Each stereo image pair 6 consists of a left and right image represented in mathematical notation as I_Li and I_Ri respectively. The depth estimate extracted from that pair is in the form of a disparity map represented by I_di. [0169] The uncertainty estimator 4 determines an uncertainty estimate U_i for the depth estimate I_di. That is, an estimate of the uncertainty associated with the depth estimate I_di, in the form of an uncertainty image). 
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the apparatus of Xiong with the teachings of Sadeghi to calculate a depth uncertainty value because "In the context of autonomous vehicles, providing a robust assessment of the level of uncertainty associated with an observed depth estimate—which translates to the level of uncertainty the AV has about its 3D surroundings—allows critical driving decisions to be made in a way that properly accounts for the level of uncertainty associated with the observations on which those decisions are based" [Sadeghi 0011]. Regarding claim 8, Xiong, Pugh, Masoumian, and Sadeghi teach the apparatus of claim 7. Sadeghi teaches wherein, to determine the calculated uncertainty, the at least one processor (processors 24) is configured to: determine a variance of a plurality of predicted depth values from the coarse depth map ([0155] FIG. 15 is a schematic showing how an uncertainty model is trained and used in practice, in accordance with the described embodiments. At a first stage, a standard disparity estimation technology, such as SGM or CNN, is applied to produce point estimates. At a second stage, an uncertainty model (UM) receives the point estimates as input and predicts the uncertainties—encoded as variances per pixel. At run-time the UM is implemented as a look-up table detailing the variance. The lookup can be dependent on pixel location and disparity value). 
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the apparatus of Xiong with the teachings of Sadeghi to determine the variance of predicted depth values because "In the context of autonomous vehicles, providing a robust assessment of the level of uncertainty associated with an observed depth estimate—which translates to the level of uncertainty the AV has about its 3D surroundings—allows critical driving decisions to be made in a way that properly accounts for the level of uncertainty associated with the observations on which those decisions are based" [Sadeghi 0011]. Regarding claim 9, Xiong, Pugh, Masoumian, and Sadeghi teach the apparatus of claim 7. Sadeghi teaches wherein, to determine the calculated uncertainty, the at least one processor (processors 24) is configured to: determine a variance of class predictions from the segmentation map of the image ([0155] FIG. 15 is a schematic showing how an uncertainty model is trained and used in practice, in accordance with the described embodiments. At a first stage, a standard disparity estimation technology, such as SGM or CNN, is applied to produce point estimates. At a second stage, an uncertainty model (UM) receives the point estimates as input and predicts the uncertainties—encoded as variances per pixel. [0289] It is noted that other forms of perturbation function dependency may be incorporated to account for other sources of uncertainty. For example, the perturbation function may alternatively or additionally depend on one or more of the following factors: [0290] image content; for example, the value of the perturbation function at pixel (j,k) may depend on a type of class of structure detected at that pixel, or some other object/structure recognition output associated with that pixel. 
[0198] The projected road structure has an uncertainty associated with it which comes from a combination of the uncertainty associated with the image segmentation process that is applied to detect the road structure but also the uncertainty associated with the depth data that is used to perform the geometric projection into the plane of the road map). Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the apparatus of Xiong with the teachings of Sadeghi to determine the variance of class predictions because "sensors measure a physical system and thus are inherently imperfect, they have uncertainty. As a consequence, the robotic system's perception is uncertain" [Sadeghi 0131]. Regarding claim 28, Xiong, Pugh, and Masoumian teach the method of claim 27. Sadeghi, in the same field of endeavor of depth estimation, teaches wherein each feature of the plurality of features further includes a depth uncertainty value ([0135] In particular, the second implementation can accommodate “feature-based” UMs that take into account image features (i.e. the actual content of the images) in estimating depth uncertainty) associated with the respective depth information of the particular pixel; the depth uncertainty value is zero based on the respective depth information comprising a measured depth from the coarse depth map (See Fig. 14 uncertainty map zero values. [0345] An example of the uncertainty map for models 2 and 3 is shown in FIG. 14, as applied to an urban scene shown in FIG. 10. Note that both models encode roughly the same information at a coarse level—that higher uncertainty should be assigned to larger disparities. However, the UM-Superpixel possesses the ability to assign higher uncertainty for localized distortions, e.g. 
the unusually large variance in the lower right hand corner, which is due to the search space in this area of the image being artificially restricted, as pixels with low depths and consequently high disparities in the right image will fall outside of the field of view for the left image); and the depth uncertainty value is a calculated uncertainty based on the respective depth information comprising a predicted depth from the coarse depth map ([0165] The depth estimator 2 receives rectified stereo image pairs and processes each stereo image pair 6 to compute a depth estimate therefrom, in the form of a depth image. Each stereo image pair 6 consists of a left and right image represented in mathematical notation as I_Li and I_Ri respectively. The depth estimate extracted from that pair is in the form of a disparity map represented by I_di. [0169] The uncertainty estimator 4 determines an uncertainty estimate U_i for the depth estimate I_di. That is, an estimate of the uncertainty associated with the depth estimate I_di, in the form of an uncertainty image). Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Xiong with the teachings of Sadeghi to calculate a depth uncertainty value because "In the context of autonomous vehicles, providing a robust assessment of the level of uncertainty associated with an observed depth estimate—which translates to the level of uncertainty the AV has about its 3D surroundings—allows critical driving decisions to be made in a way that properly accounts for the level of uncertainty associated with the observations on which those decisions are based" [Sadeghi 0011]. Regarding claim 29, Xiong, Pugh, Masoumian, and Sadeghi teach the method of claim 28. 
Sadeghi teaches wherein determining the calculated uncertainty comprises: determining a variance of a plurality of predicted depth values from the coarse depth map ([0155] FIG. 15 is a schematic showing how an uncertainty model is trained and used in practice, in accordance with the described embodiments. At a first stage, a standard disparity estimation technology, such as SGM or CNN, is applied to produce point estimates. At a second stage, an uncertainty model (UM) receives the point estimates as input and predicts the uncertainties—encoded as variances per pixel. At run-time the UM is implemented as a look-up table detailing the variance. The lookup can be dependent on pixel location and disparity value). Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Xiong with the teachings of Sadeghi to determine the variance of predicted depth values because "In the context of autonomous vehicles, providing a robust assessment of the level of uncertainty associated with an observed depth estimate—which translates to the level of uncertainty the AV has about its 3D surroundings—allows critical driving decisions to be made in a way that properly accounts for the level of uncertainty associated with the observations on which those decisions are based" [Sadeghi 0011]. Regarding claim 30, Xiong, Pugh, Masoumian, and Sadeghi teach the method of claim 28. Sadeghi teaches wherein determining the calculated uncertainty comprises: determining a variance of class predictions from the segmentation map of the image ([0155] FIG. 15 is a schematic showing how an uncertainty model is trained and used in practice, in accordance with the described embodiments. At a first stage, a standard disparity estimation technology, such as SGM or CNN, is applied to produce point estimates. 
At a second stage, an uncertainty model (UM) receives the point estimates as input and predicts the uncertainties—encoded as variances per pixel. [0289] It is noted that other forms of perturbation function dependency may be incorporated to account for other sources of uncertainty. For example, the perturbation function may alternatively or additionally depend on one or more of the following factors: [0290] image content; for example, the value of the perturbation function at pixel (j,k) may depend on a type of class of structure detected at that pixel, or some other object/structure recognition output associated with that pixel. [0198] The projected road structure has an uncertainty associated with it which comes from a combination of the uncertainty associated with the image segmentation process that is applied to detect the road structure but also the uncertainty associated with the depth data that is used to perform the geometric projection into the plane of the road map).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the method of Xiong with the teachings of Sadeghi to determine the variance of class predictions because "sensors measure a physical system and thus are inherently imperfect, they have uncertainty. As a consequence, the robotic system's perception is uncertain" [Sadeghi 0131].

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Xiong in view of Pugh and Svekolkin (US20210279866A1).

Regarding claim 17, Xiong and Pugh teach the apparatus of claim 1. Xiong further teaches wherein each feature of the plurality of features includes a coarse depth estimation for a pixel of the plurality of pixels ([0050] Referring to the illustrative example of FIG. 4, the processing pipeline 400 takes, as inputs, the image data 205 and the sparse depth map data 210.
Here, this data may include image data and sparse depth points covering one or more ROIs identified by the detection and extraction network 215. According to various embodiments, the image data 205 and the sparse depth map data 210 are passed to a first neural network 401, which outputs a low-resolution depth map of predicted depth values for the scene). Xiong does not teach one or more segmentation class probabilities associated with the pixel.

Svekolkin, in the same field of endeavor of image segmentation, teaches one or more segmentation class probabilities associated with the pixel ([0227] In some embodiments, a loss function can be maximized, such as loss=log(P.sub.i)+log(1−P′.sub.i), where P.sub.i is the probability that the node (a cell obtained through cell segmentation) is similar to all graph nodes, and P′.sub.i is the probability that a permutated node is similar to all graph nodes. [0267] The network output is an array of mask proposals for each separate cell. In this implementation, the outputs were not binary images, rather the output was a set of images with values in the range of 0-1 that represent the probability of a given pixel being part of the cell's mask).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the apparatus of Xiong with the teachings of Svekolkin to use segmentation class probabilities because "the output of the neural network can be at least a value indicating whether the window is associated with damaged tissue (e.g., a binary value and/or a probability of whether the window is associated with damaged tissue)" [Svekolkin 0155].

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Xiong in view of Pugh, Masoumian, and Svekolkin.

Regarding claim 16, Xiong, Pugh, and Masoumian teach the apparatus of claim 14.
Svekolkin, in the same field of endeavor of image segmentation, teaches wherein: the graph neural network is a windowed graph neural network ([0027] FIG. 7B is a flow chart showing an exemplary computerized process for processing a first set of cell features (local cell features) using a graph neural network to identify one or more communities of cells, according to some embodiments of the technology described herein. [0154] the set of immunofluorescence marker images can be processed using a sliding window across the immunofluorescence images to process the images in smaller portions. The window can be of a certain width (e.g., 128 pixels, 256 pixels, etc.), height (e.g., 128 pixels, 256 pixels, etc.) and number of channels that represents the number of markers (e.g., 2, 3, 4, etc.)); and the respective subset of graph nodes is obtained using a sliding window over the plurality of pixels ([0154] The sliding window can move across the immunofluorescence images in a preconfigured pattern to process the full content of the immunofluorescence images. [0221] In some embodiments, the graph 776 includes a number of nodes that is equal to the number of detected cells).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the apparatus of Xiong with the teachings of Svekolkin to use a windowed graph neural network because "the output of the neural network can be at least a value indicating whether the window is associated with damaged tissue" [Svekolkin 0155].

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jacqueline R Zak whose telephone number is (571)272-4077. The examiner can normally be reached M-F 9-5. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emily Terrell, can be reached at (571) 270-3717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JACQUELINE R ZAK/
Examiner, Art Unit 2666

/EMILY C TERRELL/
Supervisory Patent Examiner, Art Unit 2666
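The Sadeghi passage the examiner cites for claims 28-29 ([0155]) describes an uncertainty model implemented at run time as a look-up table of per-pixel variances, keyed on pixel location and disparity value. A minimal sketch of that kind of mechanism is below; the bin counts, the placeholder table values, and the function name `pixel_variance` are all illustrative assumptions, not code from Sadeghi or from the application under examination:

```python
import numpy as np

# Bin counts and the maximum disparity are assumed values.
N_ROW_BINS, N_COL_BINS, N_DISP_BINS = 8, 8, 16
MAX_DISPARITY = 64.0

# In the cited scheme the table would be learned offline; here it is
# filled with a placeholder ramp so that higher-disparity (nearer)
# pixels receive larger variances.
variance_lut = np.broadcast_to(
    np.linspace(0.5, 4.0, N_DISP_BINS),
    (N_ROW_BINS, N_COL_BINS, N_DISP_BINS),
)

def pixel_variance(row, col, disparity, height, width):
    """Look up the variance for one pixel of a disparity map."""
    r = min(int(row / height * N_ROW_BINS), N_ROW_BINS - 1)
    c = min(int(col / width * N_COL_BINS), N_COL_BINS - 1)
    d = min(int(disparity / MAX_DISPARITY * N_DISP_BINS), N_DISP_BINS - 1)
    return float(variance_lut[r, c, d])

# Under this placeholder table, a nearby pixel (high disparity)
# carries more uncertainty than a distant one (low disparity).
near = pixel_variance(100, 200, 60.0, 480, 640)
far = pixel_variance(100, 200, 4.0, 480, 640)
```

A caller would first compute a disparity map with SGM or a CNN (the "point estimates" of the quoted text), then query the table per pixel to assemble the uncertainty image.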
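The Svekolkin passage cited for claim 16 ([0154]) describes a sliding window moving across the image in a preconfigured pattern, with each window position yielding the subset of graph nodes a windowed graph neural network processes together. A hedged sketch of that selection step follows; the 128-pixel window comes from the quoted text, while the stride, the row-major node numbering, and the name `window_node_subsets` are assumptions for illustration:

```python
import numpy as np

def window_node_subsets(height, width, win=128, stride=128):
    """Return one array of node indices per sliding-window position.

    Pixels are treated as graph nodes numbered row-major; each
    window over the image grid selects the subset of nodes that
    would be processed together.
    """
    subsets = []
    for r0 in range(0, height, stride):
        for c0 in range(0, width, stride):
            rr = np.arange(r0, min(r0 + win, height))
            cc = np.arange(c0, min(c0 + win, width))
            grid_r, grid_c = np.meshgrid(rr, cc, indexing="ij")
            subsets.append(grid_r * width + grid_c)  # row-major node ids
    return subsets

# A 256x256 image with 128-pixel windows yields 4 node subsets
# that together cover every pixel exactly once.
subsets = window_node_subsets(256, 256)
```

With a stride smaller than the window width, the same code would produce overlapping subsets, another "preconfigured pattern" the quoted paragraph allows.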

Prosecution Timeline

Feb 06, 2023
Application Filed
May 23, 2025
Non-Final Rejection — §103
Aug 13, 2025
Response Filed
Sep 17, 2025
Final Rejection — §103
Nov 21, 2025
Response after Non-Final Action
Dec 11, 2025
Request for Continued Examination
Jan 12, 2026
Response after Non-Final Action
Feb 25, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586340
PIXEL PERSPECTIVE ESTIMATION AND REFINEMENT IN AN IMAGE
2y 5m to grant Granted Mar 24, 2026
Patent 12462343
MEDICAL DIAGNOSTIC APPARATUS AND METHOD FOR EVALUATION OF PATHOLOGICAL CONDITIONS USING 3D OPTICAL COHERENCE TOMOGRAPHY DATA AND IMAGES
2y 5m to grant Granted Nov 04, 2025
Patent 12373946
ASSAY READING METHOD
2y 5m to grant Granted Jul 29, 2025
Study what changed to get past this examiner. Based on 3 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
67%
Grant Probability
55%
With Interview (-11.4%)
2y 10m
Median Time to Grant
High
PTA Risk
Based on 12 resolved cases by this examiner. Grant probability derived from career allow rate.
