DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This is a Non-final Office Action on the merits. Claims 1-20 are currently pending and are addressed below.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 12/06/2024, 09/12/2025, and 01/21/2026 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Objections
Claims 9-11 and 15 are objected to because of the following informalities:
Claim 9 recites “…processing circuitry to: generate…”, which appears to be grammatically incorrect. This also applies to the other steps recited in the claim: “generate a second height map” and “control the vehicle”.
Claims 10-11 recite “…wherein the processing circuitry is further to generate…”, which appears to be grammatically incorrect.
Claim 15 recites “…the one or more servers to: receive…”, which appears to be grammatically incorrect. This also applies to the other steps recited in the claim: “generate a second top-down representation” and “provide the second top-down representation”.
Appropriate correction is required.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting, provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-20 of U.S. Patent No. 12172667 B2 (hereinafter “Patent ‘667”).
Although the claims at issue are not identical, they are not patentably distinct from each other because representative claims 1-8 and 20 of the instant application (and claims 9-19, which recite analogous limitations) encompass the subject matter of representative claims 1-8, 14, and 20 of Patent ‘667, as illustrated in the tables below, where differences between the claim sets are bolded.
Claims 9-14 of the instant application recite that the first height map “encodes” estimated height values, whereas claims 1-8 of the instant application recite that the first height map “represents” estimated height values. However, the terms appear to be analogous since the claims do not suggest any distinction between them or how they are used. Claims 15-20 recite obtaining a “top-down representation of estimated height values”, whereas claims 1-8 recite obtaining a “first height map representing estimated height values”. However, the terms appear to be analogous since the claims do not suggest any distinction between them or how they are used besides being a “top-down” representation. Claims 15-20 also recite removing and reintroducing a “bias”, whereas claims 1-8 recite removing and reintroducing an “aggregate height characteristic”. However, the terms appear to be analogous since the claims do not suggest any distinction between them or how they are used. Thus, the omission of a claim from the tables below does not mean that it is not subject to the nonstatutory double patenting rejection noted above.
Present Application 18/971,085 | U.S. Patent No. 12172667 B2
1, 9, 15 | 1, 9, 14-15
2, 10, 16 | 2, 10, 16
3, 11, 17 | 3, 11, 17
4, 12, 18 | 4, 12, 18
5, 13, 19 | 5, 13, 19
6 | 7
7 | 8
8, 14 | 20
20 | 6, 14
Present Application 18/971,085 Claim 1
U.S. Patent No. 12172667 B2 Claims 1,14
1. A method comprising:
generating, based at least on image data generated using one or more cameras of an ego-object in an environment, a first height map representing estimated height values of a three-dimensional (3D) surface structure of the environment…
1. A method comprising:
generating a first height map representing estimated height values of a three-dimensional (3D) surface structure of a component of an environment based at least on back-projecting one or more three-dimensional (3D) points generated using image data into a region of a segmentation mask predicted to represent the component, the image data generated during a capture session using one or more cameras of an ego-object in the environment…
…the first height map being computed by normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove an aggregate height characteristic…
14. The processor of claim 9, wherein the one or more circuits are further to:
generate the first height map based at least on subtracting a mean height from at least one of estimated height value of an initial height map…
…generating a second height map representing second estimated height values of the 3D surface structure based at least on applying a representation of the first height map to one or more neural networks (NNs) and…
1. A method comprising…
…generating a second height map representing second estimated height values of the 3D surface structure based at least on applying a representation of the first height map to one or more neural networks (NNs); and…
reintroducing the removed aggregate height characteristic to one or more corresponding predicted height values; and
14. The processor of claim 9, wherein the one or more circuits are further to…
…reintroduce the mean height to corresponding predicted heights of the second height map.
controlling one or more operations of the ego-object based at least on the second height map.
1. A method comprising…
…controlling one or more operations of the ego-object during the capture session based at least on the second height map.
A difference is that claim 1 of the instant application computes the first height map by “normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove an aggregate height characteristic”, whereas claim 14 of Patent ‘667 generates the first height map by “subtracting a mean height from at least one of estimated height value of an initial height map”. However, there is no patentable distinction between normalizing initial height values to remove an aggregate height characteristic and subtracting a mean height from an initial height, since an aggregate height characteristic encompasses a mean height, and “removing” and “subtracting” are synonymous.
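For illustration only, the following sketch (hypothetical values and a stand-in for the claimed NNs, not Applicant’s or the patentee’s actual implementation) shows how subtracting a mean height from initial height estimates and later reintroducing it to predicted values operates as the normalization and reintroduction recited in both claim sets:

```python
import numpy as np

# Hypothetical initial estimated heights (meters) for a small grid of the 3D surface structure.
initial_heights = np.array([[10.2, 10.4, 10.1],
                            [10.3, 10.5, 10.2],
                            [10.4, 10.6, 10.3]])

# "Normalizing ... to remove an aggregate height characteristic": here the aggregate
# characteristic is the mean height, which is subtracted out.
mean_height = initial_heights.mean()
first_height_map = initial_heights - mean_height  # zero-mean residual heights

# Stand-in for "applying ... to one or more neural networks (NNs)"; an identity
# mapping is used here purely as a placeholder for the predicted residuals.
predicted_residuals = first_height_map

# "Reintroducing the removed aggregate height characteristic" to the predicted values.
second_height_map = predicted_residuals + mean_height
print(np.round(second_height_map - initial_heights, 6))  # ~0 with the identity placeholder
```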
Present Application 18/971,085 Claim 2
U.S. Patent No. 12172667 B2 Claim 2
2. The method of claim 1, further comprising generating the first height map based at least on:
generating, based at least on applying 3D structure estimation to the image data, a first estimated 3D representation of the environment; and
identifying one or more 3D points of the first estimated 3D representation of the environment that belong to the 3D surface structure of the environment based at least on back-projecting the one or more 3D points into a region of a segmentation mask predicted to represent the 3D surface structure of the environment.
2. The method of claim 1, further comprising generating the first height map based at least on:
generating, based at least on applying 3D structure estimation to the image data, a first estimated 3D representation of the environment; and
identifying the one or more 3D points of the first estimated 3D representation of the environment that belong to the component of the environment based at least on the back-projecting of the one or more 3D points into the region of the segmentation mask predicted to represent the component of the environment.
Another difference is that claim 2 of the instant application recites identifying 3D points that belong to a 3D surface structure of the environment, whereas claim 2 of Patent ‘667 recites identifying 3D points that belong to a component of the environment. However, there is no patentable distinction between the “3D surface structure” and the “component” since both are recited with respect to the environment and are used in the same manner in their respective claims; the two terms are therefore synonymous in this context.
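For illustration only, the following sketch (hypothetical pinhole intrinsics, mask, and points, not the disclosure of either claim set) shows the back-projection step recited in both versions of claim 2, in which 3D points are retained only where they fall within the region of a segmentation mask predicted to represent the surface structure or component:

```python
import numpy as np

# Hypothetical pinhole intrinsics and a 4x3 segmentation mask in which 1 marks the
# region predicted to represent the surface structure / component.
fx, fy, cx, cy = 100.0, 100.0, 1.5, 2.0
mask = np.array([[0, 1, 1],
                 [0, 1, 1],
                 [0, 0, 1],
                 [0, 0, 0]])

# Hypothetical 3D points (x, y, z) in the camera frame, with z > 0.
points = np.array([[0.00, 0.00, 10.0],
                   [-0.10, 0.10, 8.0],
                   [0.05, -0.02, 12.0]])

def belongs_to_region(x, y, z):
    """Back-project a 3D point into pixel coordinates and test the mask region."""
    u = int(round(fx * x / z + cx))  # column index
    v = int(round(fy * y / z + cy))  # row index
    inside = 0 <= v < mask.shape[0] and 0 <= u < mask.shape[1]
    return inside and mask[v, u] == 1

surface_points = [(x, y, z) for x, y, z in points.tolist() if belongs_to_region(x, y, z)]
print(surface_points)  # only points landing in the masked region are kept
```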
Present Application 18/971,085 Claims 3-8 and 20
U.S. Patent No. 12172667 B2 Claims 3-8 and 20
3. The method of claim 1, further comprising:
generating, based at least on applying 3D structure estimation to the image data, a point cloud representation of the environment;
projecting at least a portion of the point cloud representation to generate the representation of the first height map representing the estimated height values of the 3D surface structure of the environment; and
applying the representation of the first height map to the one or more NNs to predict the second height map representing the estimated height values of the 3D surface structure of the environment.
3. The method of claim 1, further comprising:
generating, based at least on applying 3D structure estimation to the image data, a point cloud representation of the environment;
projecting at least a portion of the point cloud representation to generate the representation of the first height map representing the estimated height values of the 3D surface structure of the component of the environment; and
applying the representation of the first height map to the one or more NNs to predict the second height map representing the estimated height values of the component of the environment.
4. The method of claim 1,
wherein the one or more NNs include a first input channel for the first height map representing the estimated height values of the 3D surface structure of the environment and a second input channel for a perspective image representing one or more color values of the 3D surface structure of the environment.
4. The method of claim 1,
wherein the one or more NNs include a first input channel for the first height map representing the estimated height values of the component of the environment and a second input channel for a perspective image representing one or more color values of the component of the environment.
5. The method of claim 1,
wherein the one or more NNs include a first output channel that regresses one or more height values of the 3D surface structure of the environment and a second output channel that regresses one or more confidence values corresponding to the one or more height values.
5. The method of claim 1,
wherein the one or more NNs include a first output channel that regresses one or more height values of the component of the environment and a second output channel that regresses one or more confidence values corresponding to the one or more height values.
6. The method of claim 1,
further comprising repetitively executing the method on successive instances of the image data generated in successive time slices to generate successive instances of the second height map.
7. The method of claim 1,
further comprising repetitively executing the method on successive instances of the image data generated in successive time slices during the capture session to generate successive instances of the second height map.
7. The method of claim 1,
wherein the one or more operations of the ego-object comprise performing at least one of: adapting a suspension system of the ego-object based at least on the second height map representing the estimated height values of the 3D surface structure, navigating the ego-object to avoid a protuberance detected in the second height map representing the estimated height values of the 3D surface structure, or applying an acceleration or deceleration to the ego-object based at least on a surface slope detected in the second height map representing the second estimated height values of the 3D surface structure.
8. The method of claim 1,
wherein the one or more operations of the ego-object comprise performing, during the capture session, at least one of: adapting a suspension system of the ego-object based at least on the second height map representing the estimated height values of the 3D surface structure, navigating the ego-object to avoid a protuberance detected in the second height map representing the estimated height values of the 3D surface structure, or applying an acceleration or deceleration to the ego-object based at least on a surface slope detected in the second height map representing the second estimated height values of the 3D surface structure.
8. The method of claim 1,
wherein the method is performed by at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
20. The system of claim 15,
wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
20. The system of claim 15,
wherein normalizing the one or more values of the 3D surface structure of the environment comprises subtracting a mean height from the one or more values, and
reintroducing the bias comprises reintroducing the mean height to the one or more corresponding predicted height values.
6. The method of claim 1,
further comprising: generating the representation of the first height map based at least on subtracting a mean height from at least one of the estimated height values of the first height map; and
reintroducing the mean height to corresponding predicted heights of the second height map.
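For illustration only, the following sketch (hypothetical points and grid parameters, not Applicant’s or the patentee’s actual implementation) shows the projection of a point cloud into a top-down height map of the kind recited in both versions of claim 3 in the chart above:

```python
import numpy as np

# Hypothetical point cloud (x, y, z) in a ground-aligned frame, where z is height (meters).
cloud = np.array([[0.2, 0.1, 0.05],
                  [0.8, 0.4, 0.10],
                  [1.6, 1.2, 0.30],
                  [1.7, 1.3, 0.25]])

# Top-down grid covering a 2 m x 2 m area at 1 m resolution (2 x 2 cells).
resolution, rows, cols = 1.0, 2, 2
height_map = np.full((rows, cols), np.nan)

for x, y, z in cloud:
    r, c = int(y // resolution), int(x // resolution)
    if 0 <= r < rows and 0 <= c < cols:
        # One common convention: keep the maximum observed height per cell.
        height_map[r, c] = z if np.isnan(height_map[r, c]) else max(height_map[r, c], z)

print(height_map)  # estimated height values arranged in a top-down representation
```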
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 3, 7, and 15-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 3 recites “the second height map representing the estimated height values of the 3D surface structure”. It is unclear if this is referring to the same height map recited in parent claim 1: “a second height map representing second estimated height values of the 3D surface structure”.
Claim 7 recites “the second height map representing the estimated height values of the 3D surface structure” and “the second height map representing the second estimated height values of the 3D surface structure”. It is unclear if the two instances of “second height map” are referring to the same height map, or if there is intended to be more than one “second height map”, each representing different estimated height values.
Claim 15 recites “a component of an environment” in line 5 and “an ego-object in an environment” in line 8. It is unclear if the two instances of “an environment” are referring to the same environment.
Claim 17 recites “the second top-down representation of the estimated height values”. It is unclear if this is referring to the same top-down representation recited in parent claim 15: “a second top-down representation of second estimated height values”. Claims 16 and 18-20 are also rejected by virtue of their dependency, directly or indirectly, from indefinite claim 15.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 15-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding Independent Claim 15:
Step 1: Claim 15 is directed to a system (i.e., a machine). Therefore, claim 15 is within at least one of the four statutory categories.
Step 2A Prong 1: Regarding Prong 1 of the Step 2A analysis in the 2019 PEG, the claims are to be analyzed to determine whether they recite subject matter that falls within one of the following groups of abstract ideas: a) mathematical concepts, b) certain methods of organizing human activity and/or c) mental processes.
Independent claim 15 includes limitations that recite an abstract idea (emphasized below) and will be used as a representative claim for the remainder of the 101 rejection. Claim 15 recites:
A system comprising:
one or more graphics processing units (GPUs) of one or more servers of one or more data centers, the one or more servers to:
receive a first top-down representation of estimated height values of a three-dimensional (3D) surface structure of a component of an environment, the first top-down representation of the estimated height values generated by normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove a bias, the one or more values generated using image data from one or more cameras of an ego-object in an environment;
generate a second top-down representation of second estimated height values of the 3D surface structure based at least on using the one or more GPUs to process the first top-down representation with one or more neural networks (NNs) and reintroducing the removed bias to one or more corresponding predicted height values; and
provide the second top-down representation to a control component of the ego-object.
The examiner submits that the foregoing bolded limitations constitute a “mental process” because under its broadest reasonable interpretation, the claim covers performance of the limitations in the human mind. For example, the limitation “receive a first top-down representation of estimated height values of a three-dimensional (3D) surface structure of a component of an environment, the first top-down representation of the estimated height values generated by normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove a bias, the one or more values generated using image data from one or more cameras of an ego-object in an environment” in the context of this claim encompasses mentally generating a top-down representation of estimated height values and mentally normalizing/removing a bias using image data. The limitation “generate a second top-down representation of second estimated height values of the 3D surface structure…and reintroducing the removed bias to one or more corresponding predicted height values” in the context of this claim encompasses mentally generating a top-down representation of estimated height values and mentally reintroducing the removed bias.
Step 2A Prong 2: Regarding Prong 2 of the Step 2A analysis in the 2019 PEG, the claims are to be analyzed to determine whether the claim, as a whole, integrates the abstract idea into a practical application. As noted in the 2019 PEG, it must be determined whether any additional elements in the claim beyond the abstract idea integrate the exception into a practical application in a manner that imposes a meaningful limit on the judicial exception. The courts have indicated that additional elements merely using a computer to implement an abstract idea, adding insignificant extra-solution activity, or generally linking use of a judicial exception to a particular technological environment or field of use do not integrate a judicial exception into a “practical application.”
In the present case, the additional limitations beyond the above-noted abstract idea are as follows (where the underlined portions are the “additional limitations” while the bolded portions continue to represent the “abstract idea”):
A system comprising:
one or more graphics processing units (GPUs) of one or more servers of one or more data centers, the one or more servers to:
receive a first top-down representation of estimated height values of a three-dimensional (3D) surface structure of a component of an environment, the first top-down representation of the estimated height values generated by normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove a bias, the one or more values generated using image data from one or more cameras of an ego-object in an environment;
generate a second top-down representation of second estimated height values of the 3D surface structure based at least on using the one or more GPUs to process the first top-down representation with one or more neural networks (NNs) and reintroducing the removed bias to one or more corresponding predicted height values; and
provide the second top-down representation to a control component of the ego-object.
For the following reason(s), the examiner submits that the above identified additional limitations do not integrate the above-noted abstract idea into a practical application.
The additional limitations “one or more graphics processing units (GPUs) of one or more servers of one or more data centers” and “based at least on using the one or more GPUs to process the first top-down representation with one or more neural networks (NNs)” are recited at a high level of generality and merely describe how to generally “apply” the otherwise mental process in a technological environment using generic computer components. The additional limitation “provide the second top-down representation to a control component of the ego-object” is also recited at a high level of generality and is considered insignificant extra-solution activity (i.e., post-solution activity/data output and transmission).
Thus, taken alone, the additional elements do not integrate the abstract idea into a practical application. Further, looking at the additional limitation(s) as an ordered combination or as a whole, the limitation(s) add nothing that is not already present when looking at the elements taken individually. For instance, there is no indication that the additional elements, when considered as a whole, reflect an improvement in the functioning of a computer or an improvement to another technology or technical field, apply or use the above-noted judicial exception to effect a particular treatment or prophylaxis for a disease or medical condition, implement/use the above-noted judicial exception with a particular machine or manufacture that is integral to the claim, effect a transformation or reduction of a particular article to a different state or thing, or apply or use the judicial exception in some other meaningful way beyond generally linking the use of the judicial exception to a particular technological environment, such that the claim as a whole is not more than a drafting effort designed to monopolize the exception (MPEP § 2106.05). Accordingly, the additional limitation(s) do/does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea.
Step 2B: Regarding Step 2B of the 2019 PEG, representative independent claim 15 does not include additional elements (considered both individually and as an ordered combination) that are sufficient to amount to significantly more than the judicial exception, for the same reasons as those discussed above with respect to determining that the claim does not integrate the abstract idea into a practical application. As discussed above with respect to the integration of the abstract idea into a practical application, the additional elements of “one or more graphics processing units (GPUs) of one or more servers of one or more data centers” and “based at least on using the one or more GPUs to process the first top-down representation with one or more neural networks (NNs)” are recited at a high level of generality and amount to nothing more than applying the exception to a technological environment. The examiner also submits that the additional limitation “provide the second top-down representation to a control component of the ego-object” is insignificant extra-solution activity.
Further, a conclusion that an additional element is insignificant extra-solution activity in Step 2A should be re-evaluated in Step 2B to determine if it is more than what is well-understood, routine, conventional activity in the field. The additional limitation “provide the second top-down representation to a control component of the ego-object” is well-understood, routine, and conventional activity in light of MPEP 2106.05(g) and the cases cited therein, including OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93, which indicate that “presenting offers” is a well-understood, routine, and conventional function when it is claimed at a high level of generality.
Therefore, claim 15 is ineligible under 35 U.S.C. § 101.
Dependent Claims
Dependent claims 16-20 do not recite any further limitations that cause the claims to be patent eligible. Rather, the limitations of the dependent claims are directed toward additional aspects of the judicial exception. Dependent claims 16-17 are further directed to the abstract ideas of generating a 3D representation of the environment, identifying 3D points of the 3D representation of the environment, and applying a top-down representation to a neural network to predict a second top-down representation; the additional steps of generating a point cloud representation of the environment and projecting a portion of the point cloud representation are insignificant extra-solution activity (i.e., data output or display). Dependent claims 18-19 further describe the input and output channels of the neural network, and dependent claim 20 is further directed to the abstract ideas of subtracting a mean height and reintroducing the mean height. Therefore, dependent claims 16-20 are not patent eligible under the same rationale as provided in the rejection of claim 15.
Therefore, claims 15-20 are ineligible under 35 U.S.C. § 101.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 4, 6, 8-9, 12, 14-15, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Smolyanskiy (US 20210150230 A1, filed 06/29/2020), hereinafter “Smolyanskiy”, in view of Guizilini (US 20210281814 A1, filed 06/12/2020), hereinafter “Guizilini”.
Regarding claim 1, Smolyanskiy teaches:
A method comprising: generating, based at least on image data generated using one or more cameras of an ego-object in an environment, a first height map representing estimated height values of a three-dimensional (3D) surface structure of the environment, (See at least [0067]: “In another example, the geometry data 640 of objects in the environment may be generated 635 from image data (e.g., an RGB image) generated by a sensor (e.g., a camera). For example, a known orientation and location of the sensor that captured the image data may be used to un-project the image data into a 3D representation of the environment (e.g., a 3D map or some other world space) and identify 3D locations of objects in the world space corresponding to each pixel. In this case, one or more slices of the identified 3D locations may be taken to generate 635 the geometry data 640 (one or more height maps).”)
generating a second height map representing second estimated height values of the 3D surface structure based at least on applying a representation of the first height map to one or more neural networks (NNs) and (See at least [0080]: “…In FIG. 8, a LiDAR range image 810 is input into a first stage of a neural network (e.g., the encoder/decoder 605 of FIG. 6), which segments the LiDAR range image to generate a segmented LiDAR range image 820. The segmented LiDAR range image 820 is transformed to a top-down view 830, stacked with height data, and fed through a second stage of the neural network (e.g., the encoder/decoder trunk 650, class confidence head 655, and instance regression head 660 of FIG. 6). Note that the classified regions of the segmented LiDAR range image 820 (e.g., the drivable space 825) has been transformed to a corresponding region in the top-down view 830 (e.g., the transformed drivable space 835). The second stage of the neural network extracts classification data and object instance data, which is post-processed to generate bounding boxes for detected objects” & [0034]: “…The transformed classification data and geometry data may be stacked and fed into a second stage of the DNN, which may extract classification data (e.g., class confidence data such as confidence maps for any number of classes) and/or regress various types of information about the detected objects, such as location, geometry, and/or orientation…”)
Smolyanskiy does not explicitly teach:
the first height map being computed by normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove an aggregate height characteristic;
reintroducing the removed aggregate height characteristic to one or more corresponding predicted height values; and
controlling one or more operations of the ego-object based at least on the second height map.
Guizilini teaches:
the first height map being computed by normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove an aggregate height characteristic; (See at least [0071]: “At 850, the training module 230 creates the image 360. In one or more approaches, the training module 230 creates the image 360 by synthesizing the image 360 from at least the ray surface (e.g., 350) and a depth map (e.g., 330) associated with the monocular image (i.e., 310). As implemented by the training module 230, creating the synthesized image generally includes applying the neural camera model to the noted inputs to synthesize the image 360. The neural camera model implements various functions in combination with inputs, such as a lifting operation and a projection operation. The neural camera model functions to lift pixels from the depth map 330 to produce three-dimensional points using the ray surface and a camera offset…” & [0076-0077]: “At 920, the neural camera model scales predicted ray vectors from the ray surface using the depth map. At 930, the neural camera model adjusts the predicted ray vectors according to the camera offset (i.e., camera center). The operations of 920 and 930 combine to form the lifting operation 970…”)
reintroducing the height removed aggregate characteristic to one or more corresponding predicted height values; and (See at least [0071]: “…Further, the neural camera model projects the three-dimensional points onto a context image to create the synthesized image…” & [0078-0079]: “At 940, the neural camera model determines a patch-based data association for searching pixels in the synthesized image. In one approach, the neural camera model determines the associations by defining search grids for target pixels of the synthesized image according to coordinates of respective ones of the target pixels and a defined grid size. Thus, the model determines a grid having dimensions height×width that is a space lesser than the whole image. In one approach the grid may be 100×100 pixels or another suitable grid size. In any case, by using the grid to search the image, the neural camera model reduces the computational complexity of projecting the 3D points into pixels. At 950, the neural camera model applies a softmax approximation with an annealing temperature to search over the respective search grids. Applying a softmax approximation to derive each pixel in the synthesized image generally includes identifying a predicted ray vector from the ray surface that corresponds with a direction associated with each of the three-dimensional points as defined relative to the camera offset. In this way, the neural camera model can identify pixels of the synthesized image. The operations of 940 and 950 combine to form the projecting operation 980.”)
controlling one or more operations of the ego-object based at least on the second height map. (See at least [0080]: “At 960, the neural camera model provides the synthesized image as an output…” & [0106]: “The autonomous driving module(s) 160 either independently or in combination with the depth system 170 can be configured to determine travel path(s), current autonomous driving maneuvers for the vehicle 100, future autonomous driving maneuvers and/or modifications to current autonomous driving maneuvers based on data acquired by the sensor system 120, driving scene models, and/or data from any other suitable source…”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy’s method with the teachings of Guizilini. Doing so would have been obvious in order to “improve navigation of the vehicle through the environment” and to improve “depth estimates for monocular images” (see [0007] & [0010] of Guizilini).
Regarding claim 4, Smolyanskiy and Guizilini in combination teach all the limitations of claim 1 as discussed above.
Smolyanskiy additionally teaches:
wherein the one or more NNs include a first input channel for the first height map representing the estimated height values of the 3D surface structure of the environment and (See at least [0080]: "FIG. 8 is an illustration of an example data flow through an example multi-view perception machine learning model(s), in accordance with some embodiments of the present disclosure. In FIG. 8, a LiDAR range image 810 is input into a first stage of a neural network (e.g., the encoder/decoder 605 of FIG. 6), which segments the LiDAR range image to generate a segmented LiDAR range image 820. The segmented LiDAR range image 820 is transformed to a top-down view 830, stacked with height data, and fed through a second stage of the neural network (e.g., the encoder/decoder trunk 650, class confidence head 655, and instance regression head 660 of FIG. 6).")
a second input channel for a perspective image representing one or more color values of the 3D surface structure of the environment. (See at least [0062]: “In another example, assume the input into the encoder/decoder 605 includes a representation of an RGB image generated by a camera, and the encoder/decoder 605 classifies each pixel of the RGB image by generating one or more classification values for each pixel. The classification values may be associated with 3D locations identified from some other sensor data, such as LiDAR or RADAR detections, or 3D locations from a 3D representation of the environment such as 3D map of the environment.”)
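For illustration only, the following sketch (hypothetical arrays with no actual network, not Smolyanskiy’s disclosure) shows the arrangement of a first input channel for a top-down height map and a second input channel for a perspective color image, as recited in claims 4, 12, and 18:

```python
import numpy as np

# Hypothetical inputs resampled to a common H x W grid purely for illustration.
H, W = 4, 4
height_channel = np.random.rand(1, H, W)   # first input channel: height map values
color_channels = np.random.rand(3, H, W)   # second input channel: perspective RGB planes

# Stacking along the channel axis is the usual way separate input channels are
# presented to a convolutional network; the stacked input has shape (4, H, W).
nn_input = np.concatenate([height_channel, color_channels], axis=0)
print(nn_input.shape)
```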
Regarding claim 6, Smolyanskiy and Guizilini in combination teach all the limitations of claim 1 as discussed above.
Smolyanskiy additionally teaches:
further comprising repetitively executing the method on successive instances of the image data generated in successive time slices to generate successive instances of the second height map. (See at least [0048]: “…Additionally or alternatively, the sensor data 402 may be accumulated 510 over time in order to increase the density of the accumulated sensor data. Sensor detections may be accumulated over any desired window of time (e.g., 0.5 seconds (s), 1 s, 2 s, etc.). The size of the window may be selected based on the sensor and/or application (e.g., smaller windows may be selected for noisy applications such as highway scenarios). As such, each input into the machine learning model(s) 408 may be generated from accumulated detections from each window of time from a rolling window (e.g., from a duration spanning from t-window size to present). Each window to evaluate incremented by any suitable step size, which may but need not correspond to the window size. Thus, each successive input into the machine learning model(s) 408 may be based on successive windows, which may but need not be overlapping.”)
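For illustration only, the following sketch (hypothetical detections and window size, not Smolyanskiy’s actual implementation) shows rolling-window accumulation of sensor data over successive time slices of the kind described in the passage of [0048] quoted above:

```python
from collections import deque

# Rolling window of the most recent time slices; each model input is built from
# the accumulated detections within the window.
window_size = 3
window = deque(maxlen=window_size)

def process_time_slice(detections):
    """Accumulate the newest slice and return the accumulated data for one model input."""
    window.append(detections)
    return [d for time_slice in window for d in time_slice]

for t, detections in enumerate([[1], [2, 3], [4], [5]]):
    print(t, process_time_slice(detections))
# At t=3 only slices 1-3 contribute, because slice 0 has rolled out of the window.
```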
Regarding claim 8, Smolyanskiy and Guizilini in combination teach all the limitations of claim 1 as discussed above.
Smolyanskiy additionally teaches:
wherein the method is performed by at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources. (See at least Abstract: “A deep neural network(s) (DNN) may be used to detect objects from sensor data of a three dimensional (3D) environment. For example, a multi-view perception DNN may include multiple constituent DNNs or stages chained together that sequentially process different views of the 3D environment. An example DNN may include a first stage that performs class segmentation in a first view (e.g., perspective view) and a second stage that performs class segmentation and/or regresses instance geometry in a second view (e.g., top-down). The DNN outputs may be processed to generate 2D and/or 3D bounding boxes and class labels for detected objects in the 3D environment. As such, the techniques described herein may be used to detect and classify animate objects and/or parts of an environment, and these detections and classifications may be provided to an autonomous vehicle drive stack to enable safe planning and control of the autonomous vehicle.”)
Regarding claim 9, Smolyanskiy teaches:
One or more processors comprising processing circuitry to: generate, based at least on image data generated using one or more cameras of a vehicle in an environment, a first height map encoding estimated height values of a three-dimensional (3D) surface structure of a road surface in the environment, (See at least Fig. 6 & [0067]: “In another example, the geometry data 640 of objects in the environment may be generated 635 from image data (e.g., an RGB image) generated by a sensor (e.g., a camera). For example, a known orientation and location of the sensor that captured the image data may be used to un-project the image data into a 3D representation of the environment (e.g., a 3D map or some other world space) and identify 3D locations of objects in the world space corresponding to each pixel. In this case, one or more slices of the identified 3D locations may be taken to generate 635 the geometry data 640 (one or more height maps)”)
generate a second height map encoding second estimated height values of the 3D surface structure of the road surface based at least on applying the first height map to one or more neural networks (NNs) and (See at least [0080]: “…In FIG. 8, a LiDAR range image 810 is input into a first stage of a neural network (e.g., the encoder/decoder 605 of FIG. 6), which segments the LiDAR range image to generate a segmented LiDAR range image 820. The segmented LiDAR range image 820 is transformed to a top-down view 830, stacked with height data, and fed through a second stage of the neural network (e.g., the encoder/decoder trunk 650, class confidence head 655, and instance regression head 660 of FIG. 6). Note that the classified regions of the segmented LiDAR range image 820 (e.g., the drivable space 825) has been transformed to a corresponding region in the top-down view 830 (e.g., the transformed drivable space 835). The second stage of the neural network extracts classification data and object instance data, which is post-processed to generate bounding boxes for detected objects” & [0034]: “…The transformed classification data and geometry data may be stacked and fed into a second stage of the DNN, which may extract classification data (e.g., class confidence data such as confidence maps for any number of classes) and/or regress various types of information about the detected objects, such as location, geometry, and/or orientation…”)
Smolyanskiy does not explicitly teach:
the first height map being computed by normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove an aggregate height characteristic; and
reintroducing the removed aggregate height characteristic to one or more corresponding predicted height values; and
control the vehicle based at least on data representing the second height map.
Guizilini teaches:
the first height map being computed by normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove an aggregate height characteristic; and (See at least [0071]: “At 850, the training module 230 creates the image 360. In one or more approaches, the training module 230 creates the image 360 by synthesizing the image 360 from at least the ray surface (e.g., 350) and a depth map (e.g., 330) associated with the monocular image (i.e., 310). As implemented by the training module 230, creating the synthesized image generally includes applying the neural camera model to the noted inputs to synthesize the image 360. The neural camera model implements various functions in combination with inputs, such as a lifting operation and a projection operation. The neural camera model functions to lift pixels from the depth map 330 to produce three-dimensional points using the ray surface and a camera offset…” & [0076-0077]: “At 920, the neural camera model scales predicted ray vectors from the ray surface using the depth map. At 930, the neural camera model adjusts the predicted ray vectors according to the camera offset (i.e., camera center). The operations of 920 and 930 combine to form the lifting operation 970…”)
reintroducing the removed aggregate height characteristic to one or more corresponding predicted height values; and (See at least [0071]: “…Further, the neural camera model projects the three-dimensional points onto a context image to create the synthesized image…” & [0078-0079]: “At 940, the neural camera model determines a patch-based data association for searching pixels in the synthesized image. In one approach, the neural camera model determines the associations by defining search grids for target pixels of the synthesized image according to coordinates of respective ones of the target pixels and a defined grid size. Thus, the model determines a grid having dimensions height×width that is a space lesser than the whole image. In one approach the grid may be 100×100 pixels or another suitable grid size. In any case, by using the grid to search the image, the neural camera model reduces the computational complexity of projecting the 3D points into pixels. At 950, the neural camera model applies a softmax approximation with an annealing temperature to search over the respective search grids. Applying a softmax approximation to derive each pixel in the synthesized image generally includes identifying a predicted ray vector from the ray surface that corresponds with a direction associated with each of the three-dimensional points as defined relative to the camera offset. In this way, the neural camera model can identify pixels of the synthesized image. The operations of 940 and 950 combine to form the projecting operation 980.”)
control the vehicle based at least on data representing the second height map. (See at least [0080]: “At 960, the neural camera model provides the synthesized image as an output. Accordingly, integrating the neural camera model within a self-supervised monocular depth estimation framework improves the training of the associated models by providing independence from the source camera such that the depth system 170 can operate on arbitrary cameras without a need to perform complex calibrations” & [0063]: “Furthermore, the depth system 170 provides the depth map 330, in one or more approaches, to additional systems/modules in the vehicle 100 in order to control the operation of the modules and/or the vehicle 100 overall. In still further aspects, the training module 230 communicates the depth map 330 to a remote system (e.g., cloud-based system) as, for example, a mechanism for mapping the surrounding environment or for other purposes (e.g., traffic reporting, etc.). As one example, the training module 230 uses the depth map 330 to map locations of obstacles in the surrounding environment and plan a trajectory that safely navigates the obstacles. Thus, the training module 230, in one embodiment, uses the depth map 330, at least in part, to control the vehicle 100 to navigate through the surrounding environment.”)
Regarding claim 12, Smolyanskiy and Guizilini in combination teach all the limitations of claim 9 as discussed above.
Smolyanskiy additionally teaches:
wherein the one or more NNs include a first input channel for the first height map encoding the estimated height values of the road surface and (See at least [0080]: "FIG. 8 is an illustration of an example data flow through an example multi-view perception machine learning model(s), in accordance with some embodiments of the present disclosure. In FIG. 8, a LiDAR range image 810 is input into a first stage of a neural network (e.g., the encoder/decoder 605 of FIG. 6), which segments the LiDAR range image to generate a segmented LiDAR range image 820. The segmented LiDAR range image 820 is transformed to a top-down view 830, stacked with height data, and fed through a second stage of the neural network (e.g., the encoder/decoder trunk 650, class confidence head 655, and instance regression head 660 of FIG. 6).")
a second input channel for a perspective image encoding one or more color values of the road surface. (See at least [0062]: “In another example, assume the input into the encoder/decoder 605 includes a representation of an RGB image generated by a camera, and the encoder/decoder 605 classifies each pixel of the RGB image by generating one or more classification values for each pixel. The classification values may be associated with 3D locations identified from some other sensor data, such as LiDAR or RADAR detections, or 3D locations from a 3D representation of the environment such as 3D map of the environment.”)
Regarding claim 14, Smolyanskiy and Guizilini in combination teach all the limitations of claim 9 as discussed above.
Smolyanskiy additionally teaches:
wherein the one or more processors are comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources. (See at least Abstract: “A deep neural network(s) (DNN) may be used to detect objects from sensor data of a three dimensional (3D) environment. For example, a multi-view perception DNN may include multiple constituent DNNs or stages chained together that sequentially process different views of the 3D environment. An example DNN may include a first stage that performs class segmentation in a first view (e.g., perspective view) and a second stage that performs class segmentation and/or regresses instance geometry in a second view (e.g., top-down). The DNN outputs may be processed to generate 2D and/or 3D bounding boxes and class labels for detected objects in the 3D environment. As such, the techniques described herein may be used to detect and classify animate objects and/or parts of an environment, and these detections and classifications may be provided to an autonomous vehicle drive stack to enable safe planning and control of the autonomous vehicle.”)
Regarding claim 15, Smolyanskiy teaches:
A system comprising: one or more graphics processing units (GPUs) of one or more servers of one or more data centers, the one or more servers to: (See at least Fig. 16D & [0232]: “FIG. 16D is a system diagram for communication between cloud-based server(s) and the example autonomous vehicle 1600 of FIG. 16A, in accordance with some embodiments of the present disclosure. The system 1676 may include server(s) 1678, network(s) 1690, and vehicles, including the vehicle 1600. The server(s) 1678 may include a plurality of GPUs 1684(A)-1684(H)…”)
receive a first top-down representation of estimated height values of a three-dimensional (3D) surface structure of a component of an environment, (See at least [0007]: “For example, the first stage may extract classification data (e.g., confidence maps, segmentations masks, etc.) from a LiDAR range image or an RGB image. The extracted classification data may be transformed to a second view of the environment, for example, by labeling corresponding 3D locations (e.g., identified by corresponding pixels of a LiDAR range image) with the extracted classification data, and projecting the labeled 3D locations to the second view, in some embodiments, geometry data (e.g., height data) of objects in the 3D space may be obtained from sensor data (e.g., by projecting a LiDAR point cloud into one or more height maps in a top-down view) and/or images of the 3D space (e.g., by unprojecting an image into world space and projecting into a top-down view)…”. See also [0032-0034].)
generate a second top-down representation of second estimated height values of the 3D surface structure based at least on using the one or more GPUs to process the first top-down representation with one or more neural networks (NNs) and
Smolyanskiy does not explicitly teach:
the first top-down representation of the estimated height values generated by normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove a bias,
the one or more values generated using image data from one or more cameras of an ego-object in an environment;
reintroducing the removed bias to one or more corresponding predicted height values; and
provide the second top-down representation to a control component of the ego-object.
Guizilini teaches:
the first top-down representation of the estimated height values generated by normalizing one or more values corresponding to an initial estimated height of the 3D surface structure of the environment to remove a bias, (See at least [0071]: “At 850, the training module 230 creates the image 360. In one or more approaches, the training module 230 creates the image 360 by synthesizing the image 360 from at least the ray surface (e.g., 350) and a depth map (e.g., 330) associated with the monocular image (i.e., 310). As implemented by the training module 230, creating the synthesized image generally includes applying the neural camera model to the noted inputs to synthesize the image 360. The neural camera model implements various functions in combination with inputs, such as a lifting operation and a projection operation. The neural camera model functions to lift pixels from the depth map 330 to produce three-dimensional points using the ray surface and a camera offset…” & [0076-0077]: “At 920, the neural camera model scales predicted ray vectors from the ray surface using the depth map. At 930, the neural camera model adjusts the predicted ray vectors according to the camera offset (i.e., camera center). The operations of 920 and 930 combine to form the lifting operation 970…”)
the one or more values generated using image data from one or more cameras of an ego-object in an environment; (See at least [0034]: “…As described herein, monocular images that comprise the training images 250 are, for example, images from the camera 126 or another imaging device that are part of a video, and that encompasses a field-of-view (FOV) about the vehicle 100 of at least a portion of the surrounding environment…the camera 126 is a pinhole camera, a fisheye camera, a catadioptric camera, or another form of camera that acquires images without a specific depth modality.”)
reintroducing the removed bias to one or more corresponding predicted height values; and (See at least [0071]: “…Further, the neural camera model projects the three-dimensional points onto a context image to create the synthesized image…” & [0078-0079]: “At 940, the neural camera model determines a patch-based data association for searching pixels in the synthesized image. In one approach, the neural camera model determines the associations by defining search grids for target pixels of the synthesized image according to coordinates of respective ones of the target pixels and a defined grid size. Thus, the model determines a grid having dimensions height×width that is a space lesser than the whole image. In one approach the grid may be 100×100 pixels or another suitable grid size. In any case, by using the grid to search the image, the neural camera model reduces the computational complexity of projecting the 3D points into pixels. At 950, the neural camera model applies a softmax approximation with an annealing temperature to search over the respective search grids. Applying a softmax approximation to derive each pixel in the synthesized image generally includes identifying a predicted ray vector from the ray surface that corresponds with a direction associated with each of the three-dimensional points as defined relative to the camera offset. In this way, the neural camera model can identify pixels of the synthesized image. The operations of 940 and 950 combine to form the projecting operation 980.”)
provide the second top-down representation to a control component of the ego-object. (See at least [0080]: “At 960, the neural camera model provides the synthesized image as an output…” & [0106]: “The autonomous driving module(s) 160 either independently or in combination with the depth system 170 can be configured to determine travel path(s), current autonomous driving maneuvers for the vehicle 100, future autonomous driving maneuvers and/or modifications to current autonomous driving maneuvers based on data acquired by the sensor system 120, driving scene models, and/or data from any other suitable source…”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy’s system with the teachings of Guizilini. Doing so would be obvious in order to “improve navigation of the vehicle through the environment” and to improve “depth estimates for monocular images” (See [0007] & [0010] of Guizilini).
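For illustration only, and not as a characterization of any reference’s actual implementation, the lift-then-project pattern recited in the cited portions of Guizilini ([0071], [0076]-[0079]) can be sketched as follows. The sketch assumes a standard pinhole intrinsics matrix K and metric depth; the function names and array shapes are illustrative assumptions and do not appear in the reference.

```python
# Minimal sketch (assumptions: pinhole intrinsics K, depth map in meters).
# Illustrates the general lift-then-project pattern; names/shapes are hypothetical.
import numpy as np

def lift_to_3d(depth, K):
    """Back-project every pixel of a depth map into camera-frame 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                      # 3 x N ray directions
    return (rays * depth.reshape(1, -1)).T                             # N x 3 points

def project_to_image(points, K):
    """Project camera-frame 3D points back onto the image plane."""
    pix = (K @ points.T).T
    return pix[:, :2] / pix[:, 2:3]                                    # N x 2 pixel coordinates
```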
Regarding claim 18, Smolyanskiy and Guizilini in combination teach all the limitations of claim 15 as discussed above.
Smolyanskiy additionally teaches:
wherein the one or more NNs include a first input channel for the first top-down representation of the estimated height values of the component of the environment and (See at least [0080]: "FIG. 8 is an illustration of an example data flow through an example multi-view perception machine learning model(s), in accordance with some embodiments of the present disclosure. In FIG. 8, a LiDAR range image 810 is input into a first stage of a neural network (e.g., the encoder/decoder 605 of FIG. 6), which segments the LiDAR range image to generate a segmented LiDAR range image 820. The segmented LiDAR range image 820 is transformed to a top-down view 830, stacked with height data, and fed through a second stage of the neural network (e.g., the encoder/decoder trunk 650, class confidence head 655, and instance regression head 660 of FIG. 6).")
a second input channel for a perspective image representing one or more color values of the component of the environment. (See at least [0062]: “In another example, assume the input into the encoder/decoder 605 includes a representation of an RGB image generated by a camera, and the encoder/decoder 605 classifies each pixel of the RGB image by generating one or more classification values for each pixel. The classification values may be associated with 3D locations identified from some other sensor data, such as LiDAR or RADAR detections, or 3D locations from a 3D representation of the environment such as 3D map of the environment.”)
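As an illustrative aid only, a network with separate input channels for a top-down height representation and a perspective color image might be sketched as below (PyTorch assumed). The branch structure, layer sizes, and module names are hypothetical and are not asserted to be Smolyanskiy’s architecture; both inputs are assumed to share a common grid for the sketch.

```python
# Minimal sketch: one input branch for a top-down height map, one for an RGB
# perspective image, fused before a shared trunk.  Sizes/names are illustrative.
import torch
import torch.nn as nn

class TwoChannelPerception(nn.Module):
    def __init__(self):
        super().__init__()
        self.height_branch = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # first input channel: height map
        self.rgb_branch = nn.Conv2d(3, 16, kernel_size=3, padding=1)     # second input channel: color image
        self.trunk = nn.Conv2d(32, 32, kernel_size=3, padding=1)         # shared trunk after fusion

    def forward(self, height_map, rgb_image):
        # For this sketch both tensors are assumed to be N x C x H x W on the same H x W grid.
        h = torch.relu(self.height_branch(height_map))
        c = torch.relu(self.rgb_branch(rgb_image))
        return self.trunk(torch.cat([h, c], dim=1))
```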
Claim(s) 2, 10, and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Smolyanskiy in view of Guizilini and further in view of Chiu of US 20200184718 A1, filed 07/26/2019, hereinafter “Chiu”, and Xie et al. of “RICE: Refining Instance Masks in Cluttered Environments with Graph Neural Networks”, published 06/29/2021, hereinafter “Xie”.
Regarding claim 2, Smolyanskiy and Guizilini in combination teach all the limitations of claim 1 as discussed above.
Smolyanskiy and Guizilini in combination do not explicitly teach:
further comprising generating the first height map based at least on: generating, based at least on applying 3D structure estimation to the image data, a first estimated 3D representation of the environment; and
Chiu teaches:
further comprising generating the first height map based at least on: generating, based at least on applying 3D structure estimation to the image data, a first estimated 3D representation of the environment; and (See at least [0042]: “…A 3D estimation process 606 analyzes the image to determine a 3D space or point cloud. In some embodiments, the 3D estimation process 606 may include obtaining a two-dimensional image, determining a depth map from the 2D image, converting the depth map to a Euclidean video point cloud, and registering the video point cloud directly onto a 3D point cloud…”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s method with Chiu’s technique of generating a 3D representation of the environment using 3D structure estimation on image data. Doing so would be obvious to “enhance the 3D map” and ensure “updates of the 3D map for moving objects” (See [0037] of Chiu).
Smolyanskiy, Guizilini, and Chiu in combination do not explicitly teach:
identifying one or more 3D points of the first estimated 3D representation of the environment that belong to the 3D surface structure of the environment based at least on back-projecting the one or more 3D points into a region of a segmentation mask predicted to represent the 3D surface structure of the environment.
Xie teaches:
identifying one or more 3D points of the first estimated 3D representation of the environment that belong to the 3D surface structure of the environment based at least on back-projecting the one or more 3D points into a region of a segmentation mask predicted to represent the 3D surface structure of the environment. (See at least pg. 3: “Our method, RICE, is designed to Refine Instance masks of unseen objects in Cluttered Environments. Given an initial segmentation mask S ∈ NH×W of unseen objects, we first encode this as a segmentation graph GS, which is described in Section 3.1…Given a single instance mask Si ∈ {0,1}H× W for instance i, we crop the RGB image I ∈ RH×W×3, an organized point cloud D ∈ RH×W×3 (computed by backprojecting a depth image with camera intrinsics), and the mask Si with some padding for context. We then resize the crops to h × w and feed these into a multistream encoder network which we denote as the Node Encoder. This network applies a separate convolutional neural network (CNN) to each input, and then fuses the flattened outputs to provide a feature vector vi for this node. See Figure 2 for a visual illustration of the network. Note that we also encode the background mask as a node in the graph. This gives the segmentation graph GS = (V,E), where each vi ∈ V corresponds to an individual instance mask, and nodes are connected with undirected edges e = (i,j) ∈ E if their set distance is less than a threshold.”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy, Guizilini, and Chiu’s method with Xie’s technique of identifying one or more 3D points of the first estimated 3D representation of the environment that belong to the 3D surface structure of the environment based at least on back-projecting the one or more 3D points into a region of a segmentation mask predicted to represent the 3D surface structure of the environment. Doing so would be obvious for “more accurate performance with lower variance” (See pg. 3 of Xie).
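For illustration only, the general operation of selecting 3D points whose projections fall within a predicted segmentation-mask region might be sketched as follows (pinhole intrinsics K and a binary mask assumed). This is a minimal sketch of the back-projection test recited in the claim, not Xie’s RICE pipeline.

```python
# Minimal sketch (assumptions: camera-frame 3D points, pinhole intrinsics K,
# binary segmentation mask).  Keeps points whose image projection lands in the mask.
import numpy as np

def points_in_mask(points, K, mask):
    """Return the subset of 3D points that project into True pixels of `mask`."""
    pix = (K @ points.T).T
    uv = (pix[:, :2] / pix[:, 2:3]).round().astype(int)
    h, w = mask.shape
    valid = (pix[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) & \
            (uv[:, 1] >= 0) & (uv[:, 1] < h)
    keep = np.zeros(len(points), dtype=bool)
    keep[valid] = mask[uv[valid, 1], uv[valid, 0]].astype(bool)
    return points[keep]
```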
Regarding claim 10, Smolyanskiy and Guizilini in combination teach all the limitations of claim 9 as discussed above.
Smolyanskiy and Guizilini in combination do not explicitly teach:
wherein the processing circuitry is further to generate the first height map encoding the estimated height values of the 3D surface structure of the road surface based at least on: generating, based at least on applying 3D structure estimation to the image data, a first estimated 3D representation of the environment; and
Chiu teaches:
wherein the processing circuitry is further to generate the first height map encoding the estimated height values of the 3D surface structure of the road surface based at least on: generating, based at least on applying 3D structure estimation to the image data, a first estimated 3D representation of the environment; and (See at least [0042]: “…A 3D estimation process 606 analyzes the image to determine a 3D space or point cloud. In some embodiments, the 3D estimation process 606 may include obtaining a two-dimensional image, determining a depth map from the 2D image, converting the depth map to a Euclidean video point cloud, and registering the video point cloud directly onto a 3D point cloud…”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s processor with Chiu’s technique of generating a 3D representation of the environment using 3D structure estimation on image data. Doing so would be obvious to “enhance the 3D map” and ensure “updates of the 3D map for moving objects” (See [0037] of Chiu).
Smolyanskiy, Guizilini, and Chiu in combination do not explicitly teach:
identifying one or more 3D points of the first estimated 3D representation of the environment that belong to the road surface based at least on back-projecting the one or more 3D points into a region of a segmentation mask predicted to represent the road surface.
Xie teaches:
identifying one or more 3D points of the first estimated 3D representation of the environment that belong to the road surface based at least on back-projecting the one or more 3D points into a region of a segmentation mask predicted to represent the road surface. (See at least pg. 3: “Our method, RICE, is designed to Refine Instance masks of unseen objects in Cluttered Environments. Given an initial segmentation mask S ∈ NH×W of unseen objects, we first encode this as a segmentation graph GS, which is described in Section 3.1…Given a single instance mask Si ∈ {0,1}H× W for instance i, we crop the RGB image I ∈ RH×W×3, an organized point cloud D ∈ RH×W×3 (computed by backprojecting a depth image with camera intrinsics), and the mask Si with some padding for context. We then resize the crops to h × w and feed these into a multistream encoder network which we denote as the Node Encoder. This network applies a separate convolutional neural network (CNN) to each input, and then fuses the flattened outputs to provide a feature vector vi for this node. See Figure 2 for a visual illustration of the network. Note that we also encode the background mask as a node in the graph. This gives the segmentation graph GS = (V,E), where each vi ∈ V corresponds to an individual instance mask, and nodes are connected with undirected edges e = (i,j) ∈ E if their set distance is less than a threshold.”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy, Guizilini, and Chiu’s processor with Xie’s technique of identifying one or more 3D points of the first estimated 3D representation of the environment that belong to the road surface based at least on back-projecting the one or more 3D points into a region of a segmentation mask predicted to represent the road surface. Doing so would be obvious for “more accurate performance with lower variance” (See pg. 3 of Xie).
Regarding claim 16, Smolyanskiy and Guizilini in combination teach all the limitations of claim 15 as discussed above.
Smolyanskiy and Guizilini in combination do not explicitly teach:
the first top-down representation generated based at least on: generating, based at least on applying 3D structure estimation to the image data, a first estimated 3D representation of the environment; and
Chiu teaches:
the first top-down representation generated based at least on: generating, based at least on applying 3D structure estimation to the image data, a first estimated 3D representation of the environment; and (See at least [0042]: “…A 3D estimation process 606 analyzes the image to determine a 3D space or point cloud. In some embodiments, the 3D estimation process 606 may include obtaining a two-dimensional image, determining a depth map from the 2D image, converting the depth map to a Euclidean video point cloud, and registering the video point cloud directly onto a 3D point cloud…”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s method with Chiu’s technique of generating a 3D representation of the environment using 3D structure estimation on image data. Doing so would be obvious to “enhance the 3D map” and ensure “updates of the 3D map for moving objects” (See [0037] of Chiu).
Smolyanskiy, Guizilini, and Chiu in combination do not explicitly teach:
identifying one or more 3D points of the first estimated 3D representation of the environment that belong to the component of the environment based at least on back-projecting the one or more 3D points into a region of a segmentation mask predicted to represent the component of the environment.
Xie teaches:
identifying one or more 3D points of the first estimated 3D representation of the environment that belong to the component of the environment based at least on back-projecting the one or more 3D points into a region of a segmentation mask predicted to represent the component of the environment. (See at least pg. 3: “Our method, RICE, is designed to Refine Instance masks of unseen objects in Cluttered Environments. Given an initial segmentation mask S ∈ NH×W of unseen objects, we first encode this as a segmentation graph GS, which is described in Section 3.1…Given a single instance mask Si ∈ {0,1}H× W for instance i, we crop the RGB image I ∈ RH×W×3, an organized point cloud D ∈ RH×W×3 (computed by backprojecting a depth image with camera intrinsics), and the mask Si with some padding for context. We then resize the crops to h × w and feed these into a multistream encoder network which we denote as the Node Encoder. This network applies a separate convolutional neural network (CNN) to each input, and then fuses the flattened outputs to provide a feature vector vi for this node. See Figure 2 for a visual illustration of the network. Note that we also encode the background mask as a node in the graph. This gives the segmentation graph GS = (V,E), where each vi ∈ V corresponds to an individual instance mask, and nodes are connected with undirected edges e = (i,j) ∈ E if their set distance is less than a threshold.”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy, Guizilini, and Chiu’s method with Xie’s technique of identifying one or more 3D points of the first estimated 3D representation of the environment that belong to the component of the environment based at least on back-projecting the one or more 3D points into a region of a segmentation mask predicted to represent the component of the environment. Doing so would be obvious for “more accurate performance with lower variance” (See pg. 3 of Xie).
Claim(s) 5, 13, and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Smolyanskiy in view of Guizilini and further in view of Li of US 20210287430 A1, filed 04/15/2020, hereinafter “Li”.
Regarding claim 5, Smolyanskiy and Guizilini in combination teach all the limitations of claim 1 as discussed above.
Smolyanskiy additionally teaches:
wherein the one or more NNs include a first output channel that regresses one or more height values of the 3D surface structure of the environment and (See at least [0034]: "In some embodiments, geometry data (e.g., height data) of objects in the 3D space may be obtained from LiDAR data (e.g., by projecting a LiDAR point cloud into one or more height maps in a top-down view) and/or images of the 3D space (e.g., by unprojecting an image into world space and projecting into a top-down view). The transformed classification data and geometry data may be stacked and fed into a second stage of the DNN, which may extract classification data (e.g., class confidence data such as confidence maps for any number of classes) and/or regress various types of information about the detected objects, such as location, geometry, and/or orientation. The DNN outputs may be processed to generate 2D and/or 3D bounding boxes and class labels for detected objects in the 3D environment.")
Smolyanskiy and Guizilini in combination do not explicitly teach:
a second output channel that regresses one or more confidence values corresponding to the one or more height values.
Li teaches:
a second output channel that regresses one or more confidence values corresponding to the one or more height values. (See at least [0235]: "In at least one embodiment, a DLA may be used to run any type of network to enhance control and driving safety, including for example and without limitation, a neural network that outputs a measure of confidence for each object detection…In at least one embodiment, a DLA may run a neural network for regressing confidence value. In at least one embodiment, neural network may take as its input at least some subset of parameters, such as bounding box dimensions, ground plane estimate obtained (e.g., from another subsystem), output from IMU sensor(s) 3366 that correlates with vehicle 3300 orientation, distance, 3D location estimates of object obtained from neural network and/or other sensors (e.g., LIDAR sensor(s) 3364 or RADAR sensor(s) 3360), among others.”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s method with Li’s technique of including a second output channel that regresses one or more confidence values corresponding to the one or more height values. Doing so would be obvious to enable “a system to make further decisions regarding which detections should be considered as true positive detections rather than false positive detections” (See [0235] of Li).
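As an illustrative aid, a network with a first output channel regressing height values and a second output channel regressing corresponding confidence values might be sketched as below (PyTorch assumed); the trunk and head definitions are hypothetical and are not taken from Smolyanskiy or Li.

```python
# Minimal sketch: shared feature trunk feeding two output heads, one regressing
# per-cell height and one regressing a confidence for each height estimate.
import torch.nn as nn

class HeightWithConfidence(nn.Module):
    def __init__(self, in_channels=32):
        super().__init__()
        self.trunk = nn.Conv2d(in_channels, 32, kernel_size=3, padding=1)
        self.height_head = nn.Conv2d(32, 1, kernel_size=1)        # first output channel: height values
        self.confidence_head = nn.Sequential(                     # second output channel: confidence values
            nn.Conv2d(32, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, features):
        f = self.trunk(features).relu()
        return self.height_head(f), self.confidence_head(f)
```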
Regarding claim 13, Smolyanskiy and Guizilini in combination teach all the limitations of claim 9 as discussed above.
Smolyanskiy additionally teaches:
wherein the one or more NNs include a first output channel that regresses one or more height values of the road surface and (See at least [0034]: "In some embodiments, geometry data (e.g., height data) of objects in the 3D space may be obtained from LiDAR data (e.g., by projecting a LiDAR point cloud into one or more height maps in a top-down view) and/or images of the 3D space (e.g., by unprojecting an image into world space and projecting into a top-down view). The transformed classification data and geometry data may be stacked and fed into a second stage of the DNN, which may extract classification data (e.g., class confidence data such as confidence maps for any number of classes) and/or regress various types of information about the detected objects, such as location, geometry, and/or orientation. The DNN outputs may be processed to generate 2D and/or 3D bounding boxes and class labels for detected objects in the 3D environment.")
Smolyanskiy and Guizilini in combination do not explicitly teach:
a second output channel that regresses one or more confidence values corresponding to the one or more height values.
Li teaches:
a second output channel that regresses one or more confidence values corresponding to the one or more height values. (See at least [0235]: "In at least one embodiment, a DLA may be used to run any type of network to enhance control and driving safety, including for example and without limitation, a neural network that outputs a measure of confidence for each object detection…In at least one embodiment, a DLA may run a neural network for regressing confidence value. In at least one embodiment, neural network may take as its input at least some subset of parameters, such as bounding box dimensions, ground plane estimate obtained (e.g., from another subsystem), output from IMU sensor(s) 3366 that correlates with vehicle 3300 orientation, distance, 3D location estimates of object obtained from neural network and/or other sensors (e.g., LIDAR sensor(s) 3364 or RADAR sensor(s) 3360), among others.”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s method with Li’s technique of including a second output channel that regresses one or more confidence values corresponding to the one or more height values. Doing so would be obvious to enable “a system to make further decisions regarding which detections should be considered as true positive detections rather than false positive detections” (See [0235] of Li).
Regarding claim 19, Smolyanskiy and Guizilini in combination teach all the limitations of claim 15 as discussed above.
Smolyanskiy additionally teaches:
wherein the one or more NNs include a first output channel that regresses one or more height values of the component of the environment and (See at least [0034]: "In some embodiments, geometry data (e.g., height data) of objects in the 3D space may be obtained from LiDAR data (e.g., by projecting a LiDAR point cloud into one or more height maps in a top-down view) and/or images of the 3D space (e.g., by unprojecting an image into world space and projecting into a top-down view). The transformed classification data and geometry data may be stacked and fed into a second stage of the DNN, which may extract classification data (e.g., class confidence data such as confidence maps for any number of classes) and/or regress various types of information about the detected objects, such as location, geometry, and/or orientation. The DNN outputs may be processed to generate 2D and/or 3D bounding boxes and class labels for detected objects in the 3D environment.")
Smolyanskiy and Guizilini in combination do not explicitly teach:
a second output channel that regresses one or more confidence values corresponding to the one or more height values.
Li teaches:
a second output channel that regresses one or more confidence values corresponding to the one or more height values. (See at least [0235]: "In at least one embodiment, a DLA may be used to run any type of network to enhance control and driving safety, including for example and without limitation, a neural network that outputs a measure of confidence for each object detection…In at least one embodiment, a DLA may run a neural network for regressing confidence value. In at least one embodiment, neural network may take as its input at least some subset of parameters, such as bounding box dimensions, ground plane estimate obtained (e.g., from another subsystem), output from IMU sensor(s) 3366 that correlates with vehicle 3300 orientation, distance, 3D location estimates of object obtained from neural network and/or other sensors (e.g., LIDAR sensor(s) 3364 or RADAR sensor(s) 3360), among others.”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s method with Li’s technique of including a second output channel that regresses one or more confidence values corresponding to the one or more height values. Doing so would be obvious to enable “a system to make further decisions regarding which detections should be considered as true positive detections rather than false positive detections” (See [0235] of Li).
Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Smolyanskiy in view of Guizilini and further in view of Levinson of US 10486485 B1, filed 04/19/2017, hereinafter “Levinson”.
Regarding claim 7, Smolyanskiy and Guizilini in combination teach all the limitations of claim 1 as discussed above.
Smolyanskiy and Guizilini in combination do not explicitly teach:
wherein the one or more operations of the ego-object comprise performing at least one of: adapting a suspension system of the ego-object based at least on the second height map representing the estimated height values of the 3D surface structure, navigating the ego-object to avoid a protuberance detected in the second height map representing the estimated height values of the 3D surface structure, or applying an acceleration or deceleration to the ego-object based at least on a surface slope detected in the second height map representing the second estimated height values of the 3D surface structure.
Levinson teaches:
wherein the one or more operations of the ego-object comprise performing at least one of: adapting a suspension system of the ego-object based at least on the second height map representing the estimated height values of the 3D surface structure, navigating the ego-object to avoid a protuberance detected in the second height map representing the estimated height values of the 3D surface structure, or applying an acceleration or deceleration to the ego-object based at least on a surface slope detected in the second height map representing the second estimated height values of the 3D surface structure. (See at least col. 28, lines 52-57: “Additionally, or in the alternative, suspension controller 628 may adjust components of the suspensions 608 based on analysis of sections of the height map. As a non-limiting example, the height map may indicate that the vehicle is about to traverse a rough terrain (e.g. cobblestones, or any other repeating pattern of deformations).”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s method with Levinson’s technique of adapting a suspension system of the ego-object based on the second height map. Doing so would be obvious to “provide a smooth ride over such a terrain” (i.e., rough terrain) (See col. 28, lines 52-64 of Levinson).
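For illustration only, one way a control component might consume a predicted height map, along the lines of the alternatives recited in claim 7 and the suspension adjustment Levinson describes, is sketched below; the thresholds, region selection, and action labels are hypothetical.

```python
# Minimal sketch (thresholds and names are hypothetical): coarse decision from a
# top-down height map ahead of the ego-object.
import numpy as np

def plan_from_height_map(height_map, bump_thresh=0.15, slope_thresh=0.05):
    """Return a coarse action based on the height-map region about to be traversed."""
    ahead = height_map[: height_map.shape[0] // 2]        # region ahead of the ego-object
    roughness = float(np.ptp(ahead))                      # peak-to-peak height variation
    slope = float(np.mean(np.diff(ahead, axis=0)))        # average longitudinal slope
    if roughness > bump_thresh:
        return "soften_suspension_or_avoid"
    if abs(slope) > slope_thresh:
        return "adjust_acceleration"
    return "maintain"
```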
Claim(s) 3, 11, and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Smolyanskiy in view of Guizilini and further in view of Chiu.
Regarding claim 3, Smolyanskiy and Guizilini in combination teach all the limitations of claim 1 as discussed above.
Smolyanskiy additionally teaches:
projecting at least a portion of the point cloud representation to generate the representation of the first height map representing the estimated height values of the 3D surface structure of the environment; and (See at least [0007]: "For example, the first stage may extract classification data (e.g., confidence maps, segmentations masks, etc.) from a LiDAR range image or an RGB image. The extracted classification data may be transformed to a second view of the environment, for example, by labeling corresponding 3D locations (e.g., identified by corresponding pixels of a LiDAR range image) with the extracted classification data, and projecting the labeled 3D locations to the second view, in some embodiments, geometry data (e.g., height data) of objects in the 3D space may be obtained from sensor data (e.g., by projecting a LiDAR point cloud into one or more height maps in a top-down view) and/or images of the 3D space (e.g., by unprojecting an image into world space and projecting into a top-down view)." See also [0034] & [0067].)
applying the representation of the first height map to the one or more NNs to predict the second height map representing the estimated height values of the 3D surface structure of the environment. (See at least [0080]: “FIG. 8 is an illustration of an example data flow through an example multi-view perception machine learning model(s), in accordance with some embodiments of the present disclosure. In FIG. 8, a LiDAR range image 810 is input into a first stage of a neural network (e.g., the encoder/decoder 605 of FIG. 6), which segments the LiDAR range image to generate a segmented LiDAR range image 820. The segmented LiDAR range image 820 is transformed to a top-down view 830, stacked with height data, and fed through a second stage of the neural network (e.g., the encoder/decoder trunk 650, class confidence head 655, and instance regression head 660 of FIG. 6). Note that the classified regions of the segmented LiDAR range image 820 (e.g., the drivable space 825) has been transformed to a corresponding region in the top-down view 830 (e.g., the transformed drivable space 835). The second stage of the neural network extracts classification data and object instance data, which is post-processed to generate bounding boxes for detected objects.”)
Smolyanskiy and Guizilini in combination do not explicitly teach:
further comprising: generating, based at least on applying 3D structure estimation to the image data, a point cloud representation of the environment;
Chiu teaches:
further comprising: generating, based at least on applying 3D structure estimation to the image data, a point cloud representation of the environment; (See at least [0042]: “A 3D estimation process 606 analyzes the image to determine a 3D space or point cloud. In some embodiments, the 3D estimation process 606 may include obtaining a two-dimensional image, determining a depth map from the 2D image, converting the depth map to a Euclidean video point cloud, and registering the video point cloud directly onto a 3D point cloud…”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s method with Chiu’s technique of generating a point cloud representation of the environment based on applying 3D structure estimation to image data. Doing so would be obvious to “enhance the 3D map” and ensure “updates of the 3D map for moving objects” (See [0037] of Chiu).
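As an illustrative aid, projecting a point cloud into a top-down height map, as recited in the claim, might be sketched as follows; the grid resolution, extent, and the choice to keep the maximum height per cell are illustrative assumptions rather than a characterization of Smolyanskiy’s projection.

```python
# Minimal sketch (grid size and resolution are assumed values): rasterize (x, y, z)
# points into a top-down grid of per-cell maximum heights.
import numpy as np

def point_cloud_to_height_map(points, cell=0.5, extent=50.0):
    """Project (x, y, z) points into a top-down grid of per-cell max heights."""
    n = int(2 * extent / cell)
    height_map = np.full((n, n), np.nan)
    ix = ((points[:, 0] + extent) / cell).astype(int)
    iy = ((points[:, 1] + extent) / cell).astype(int)
    ok = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)
    for x, y, z in zip(ix[ok], iy[ok], points[ok, 2]):
        if np.isnan(height_map[y, x]) or z > height_map[y, x]:
            height_map[y, x] = z
    return height_map
```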
Regarding claim 11, Smolyanskiy and Guizilini in combination teach all the limitations of claim 9 as discussed above.
Smolyanskiy additionally teaches:
project at least a portion of the point cloud representation to generate the first height map encoding the estimated height values of the 3D surface structure of the road surface; and (See at least [0007]: "For example, the first stage may extract classification data (e.g., confidence maps, segmentations masks, etc.) from a LiDAR range image or an RGB image. The extracted classification data may be transformed to a second view of the environment, for example, by labeling corresponding 3D locations (e.g., identified by corresponding pixels of a LiDAR range image) with the extracted classification data, and projecting the labeled 3D locations to the second view, in some embodiments, geometry data (e.g., height data) of objects in the 3D space may be obtained from sensor data (e.g., by projecting a LiDAR point cloud into one or more height maps in a top-down view) and/or images of the 3D space (e.g., by unprojecting an image into world space and projecting into a top-down view)." See also [0034] & [0067].)
apply the first height map to the one or more NNs to predict the second height map encoding the second estimated height values of the 3D surface structure of the road surface. (See at least [0080]: “FIG. 8 is an illustration of an example data flow through an example multi-view perception machine learning model(s), in accordance with some embodiments of the present disclosure. In FIG. 8, a LiDAR range image 810 is input into a first stage of a neural network (e.g., the encoder/decoder 605 of FIG. 6), which segments the LiDAR range image to generate a segmented LiDAR range image 820. The segmented LiDAR range image 820 is transformed to a top-down view 830, stacked with height data, and fed through a second stage of the neural network (e.g., the encoder/decoder trunk 650, class confidence head 655, and instance regression head 660 of FIG. 6). Note that the classified regions of the segmented LiDAR range image 820 (e.g., the drivable space 825) has been transformed to a corresponding region in the top-down view 830 (e.g., the transformed drivable space 835). The second stage of the neural network extracts classification data and object instance data, which is post-processed to generate bounding boxes for detected objects.”)
Smolyanskiy and Guizilini in combination do not explicitly teach:
wherein the processing circuitry is further to: generate, based at least on applying 3D structure estimation to the image data, a point cloud representation of the environment;
Chiu teaches:
wherein the processing circuitry is further to: generate, based at least on applying 3D structure estimation to the image data, a point cloud representation of the environment; (See at least [0042]: “A 3D estimation process 606 analyzes the image to determine a 3D space or point cloud. In some embodiments, the 3D estimation process 606 may include obtaining a two-dimensional image, determining a depth map from the 2D image, converting the depth map to a Euclidean video point cloud, and registering the video point cloud directly onto a 3D point cloud…”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s method with Chiu’s technique of generating a point cloud representation of the environment based on applying 3D structure estimation to image data. Doing so would be obvious to “enhance the 3D map” and ensure “updates of the 3D map for moving objects” (See [0037] of Chiu).
Regarding claim 17, Smolyanskiy and Guizilini in combination teach all the limitations of claim 15 as discussed above.
Smolyanskiy additionally teaches:
projecting at least a portion of the point cloud representation to generate the first top-down representation of the estimated height values of the 3D surface structure of the component of the environment; and (See at least [0007]: "For example, the first stage may extract classification data (e.g., confidence maps, segmentations masks, etc.) from a LiDAR range image or an RGB image. The extracted classification data may be transformed to a second view of the environment, for example, by labeling corresponding 3D locations (e.g., identified by corresponding pixels of a LiDAR range image) with the extracted classification data, and projecting the labeled 3D locations to the second view, in some embodiments, geometry data (e.g., height data) of objects in the 3D space may be obtained from sensor data (e.g., by projecting a LiDAR point cloud into one or more height maps in a top-down view) and/or images of the 3D space (e.g., by unprojecting an image into world space and projecting into a top-down view)." See also [0034] & [0067].)
applying the first top-down representation to the one or more NNs to predict the second top-down representation of the estimated height values. (See at least [0080]: “FIG. 8 is an illustration of an example data flow through an example multi-view perception machine learning model(s), in accordance with some embodiments of the present disclosure. In FIG. 8, a LiDAR range image 810 is input into a first stage of a neural network (e.g., the encoder/decoder 605 of FIG. 6), which segments the LiDAR range image to generate a segmented LiDAR range image 820. The segmented LiDAR range image 820 is transformed to a top-down view 830, stacked with height data, and fed through a second stage of the neural network (e.g., the encoder/decoder trunk 650, class confidence head 655, and instance regression head 660 of FIG. 6). Note that the classified regions of the segmented LiDAR range image 820 (e.g., the drivable space 825) has been transformed to a corresponding region in the top-down view 830 (e.g., the transformed drivable space 835). The second stage of the neural network extracts classification data and object instance data, which is post-processed to generate bounding boxes for detected objects.”)
Smolyanskiy and Guizilini in combination do not explicitly teach:
the second top-down representation generated based at least on: generating, based at least on applying 3D structure estimation to the image data, a point cloud representation of the environment;
Chiu teaches:
the second top-down representation generated based at least on: generating, based at least on applying 3D structure estimation to the image data, a point cloud representation of the environment; (See at least [0042]: “A 3D estimation process 606 analyzes the image to determine a 3D space or point cloud. In some embodiments, the 3D estimation process 606 may include obtaining a two-dimensional image, determining a depth map from the 2D image, converting the depth map to a Euclidean video point cloud, and registering the video point cloud directly onto a 3D point cloud…”)
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s method with Chiu’s technique of generating a point cloud representation of the environment based on applying 3D structure estimation to image data. Doing so would be obvious to “enhance the 3D map” and ensure “updates of the 3D map for moving objects” (See [0037] of Chiu).
Claim(s) 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Smolyanskiy in view of Guizilini and further in view of Li et al. of “Geometry to the Rescue: 3D Instance Reconstruction from a Cluttered Scene”, published 07/28/2020, hereinafter “Li et al.”.
Regarding claim 20, Smolyanskiy and Guizilini in combination teach all the limitations of claim 15 as discussed above.
Guizilini additionally teaches:
wherein normalizing the one or more values of the 3D surface structure of the environment comprises subtracting (See at least [0071]: “At 850, the training module 230 creates the image 360. In one or more approaches, the training module 230 creates the image 360 by synthesizing the image 360 from at least the ray surface (e.g., 350) and a depth map (e.g., 330) associated with the monocular image (i.e., 310). As implemented by the training module 230, creating the synthesized image generally includes applying the neural camera model to the noted inputs to synthesize the image 360. The neural camera model implements various functions in combination with inputs, such as a lifting operation and a projection operation. The neural camera model functions to lift pixels from the depth map 330 to produce three-dimensional points using the ray surface and a camera offset…” & [0076-0077]: “At 920, the neural camera model scales predicted ray vectors from the ray surface using the depth map. At 930, the neural camera model adjusts the predicted ray vectors according to the camera offset (i.e., camera center). The operations of 920 and 930 combine to form the lifting operation 970…”)
reintroducing the bias comprises reintroducing (See at least [0071]: “…Further, the neural camera model projects the three-dimensional points onto a context image to create the synthesized image…” & [0078-0079]: “At 940, the neural camera model determines a patch-based data association for searching pixels in the synthesized image. In one approach, the neural camera model determines the associations by defining search grids for target pixels of the synthesized image according to coordinates of respective ones of the target pixels and a defined grid size. Thus, the model determines a grid having dimensions height×width that is a space lesser than the whole image. In one approach the grid may be 100×100 pixels or another suitable grid size. In any case, by using the grid to search the image, the neural camera model reduces the computational complexity of projecting the 3D points into pixels. At 950, the neural camera model applies a softmax approximation with an annealing temperature to search over the respective search grids. Applying a softmax approximation to derive each pixel in the synthesized image generally includes identifying a predicted ray vector from the ray surface that corresponds with a direction associated with each of the three-dimensional points as defined relative to the camera offset. In this way, the neural camera model can identify pixels of the synthesized image. The operations of 940 and 950 combine to form the projecting operation 980.”)
Smolyanskiy and Guizilini in combination do not explicitly teach:
…mean height…
Although Guizilini does not explicitly teach that the normalizing includes subtracting and reintroducing a mean height, Li et al. teaches that “each object’s height is normalized by the total height variation of itself in a scene” (See at least pg. 1100 of Li et al.). Therefore, one having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to perform the teachings of Guizilini using the mean height taught by Li et al., since the “relative height per object is a more proper geometric cue instead of the absolute height” and with the benefit of “improv[ing] the 3D instance reconstruction” (See Abstract & pg. 1100 of Li et al.).
One having ordinary skill in the art, before the effective filing date of the claimed invention, would have found it obvious to combine Smolyanskiy and Guizilini’s method with the teachings of Li et al. Doing so would be obvious “to improve the 3D instance reconstruction” (See Abstract of Li et al.).
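For illustration only, the mean-subtraction normalization at issue in claim 20 (removing a bias before prediction and reintroducing it afterward) can be sketched as follows; the prediction step itself is omitted and the numeric values are arbitrary examples.

```python
# Minimal sketch: subtract the mean height as a bias before prediction and
# reintroduce the same bias to the predicted values afterward.
import numpy as np

def normalize_heights(heights):
    bias = float(np.nanmean(heights))          # bias removed from the input representation
    return heights - bias, bias

def denormalize_heights(predicted, bias):
    return predicted + bias                    # reintroduce the removed bias

raw = np.array([101.2, 101.5, 100.9, 101.1])   # example absolute heights (meters)
normalized, bias = normalize_heights(raw)
restored = denormalize_heights(normalized, bias)
assert np.allclose(restored, raw)
```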
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20220327844 A1 is directed to detecting objects such as a disparity of a road surface.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Nikki Molina whose telephone number is (571) 272-5180. The examiner can normally be reached Monday - Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aniss Chad, can be reached on (571) 270-3832. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NIKKI MARIE M MOLINA/Examiner, Art Unit 3662
/ANISS CHAD/Supervisory Patent Examiner, Art Unit 3662