Last updated: May 29, 2026
Application No. 17/944,146
SYSTEM FOR NEURAL ARCHITECTURE SEARCH FOR MONOCULAR DEPTH ESTIMATION AND METHOD OF USING

Non-Final OA §103
Filed: Sep 13, 2022
Priority: Nov 05, 2021 (provisional 63/276,527)
Examiner: ROHD, BENJAMIN MATTHEW
Art Unit: 2147
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Woven By Toyota Inc.
OA Round: 2 (Non-Final)
Interview Optional

+0.0% interview lift. An interview has already been conducted in this application's prosecution history. This examiner has a 0% grant rate with +0.0% interview lift. Because an interview has already been tried, a written response with narrowed claims, based on precedent claim-evolution patterns, is recommended.
Based on 2 resolved cases, 2023–2026
Examiner Intelligence

ROHD, BENJAMIN MATTHEW
Career Allowance Rate: 0% (0 granted / 2 resolved), -55.0% vs TC avg
Interview Lift: +0.0% (minimal; based on resolved cases with interview)
Avg Prosecution (typical timeline): 4y 3m
17 currently pending
Statute-Specific Performance

§103: 100.0% (+60.0% vs TC avg)
Based on career data from 2 resolved cases
Office Action

§103
DETAILED ACTION
This office action is in response to amendments filed on 11/13/2025.
Claims 1, 9, and 17 have been amended. Claims 1-20 are pending.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Claim Objections:
In light of applicant’s amendments to the claims (pg. 2-6), the objection to the claims has been withdrawn.

Prior Art Rejections:
Applicant’s arguments regarding the prior art rejections (pg. 9-10) have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Applicant argues that the amended independent claim limitation which recites “the latency specification is based on components of an in-vehicle object detection system” is not disclosed or suggested by any of the cited references. Examiner notes that the Yang reference has been brought in to teach this limitation.
The prior art rejections have been updated to include the amended limitations and to clarify the reasoning given for the limitations that were not amended.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over
Atapour-Abarghouei et al. (hereinafter Atapour-Abarghouei), “Monocular Segment-Wise Depth: Monocular Depth Estimation Based on a Semantic Segmentation Prior” in view of
Cho et al. (hereinafter Cho), “Deep Monocular Depth Estimation Leveraging a Large-scale Outdoor Stereo Dataset”,
Kentley-Klay, U.S. Patent US 10459444 B1, and
Yang et al. (hereinafter Yang), “NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications”.

	Regarding Claim 1,
	Atapour-Abarghouei teaches An in-vehicle model training system comprising: 
receiving an input image; (Pg. 4295, section 1: “In this work, we propose a model that estimates scene depth based on a single RGB image…” Pg. 4297, section 3.3: “Training data is composed of a large corpus of synthetic images [13] consisting of RGB, depth and pixel-wise class labels.”)
performing object detection, using an encoder, on the received input image to identify at least one object, wherein the encoder includes an in-vehicle neural network (NN) model; (Pg. 4296, section 3: “Based on empirical analysis, we opt for decomposing any scene captured within an urban driving scenario into four object groups… Given an input image, each group is segmented using a separate segmentation network (S), the outputs of which are object groups…” Pg. 4297, section 3.3: “For the sake of consistency, all the segmentation and depth generator networks follow a similar encoder-decoder architecture…” Segmenting an image into object groups is object detection, and the segmentation is performed by an encoder-decoder model, which inherently includes a neural network model.)
determining a distance to each of the at least one object; (Pg. 4296, section 3: “object groups… are subsequently employed to choose sections of the RGB image passed as inputs to depth generators (G), producing depth information for each object group (D1, D2, D3, D4).” Producing depth information for object groups is determining a distance to each object.)
generating a first heatmap based on the determined distance to each of the at least one object; (Pg. 4297, section 3.2: “We consider monocular depth estimation as a supervised image-to-image mapping problem, wherein an input RGB image is translated into a depth image. More formally, a depth generator network (G) approximates a mapping function that takes as its input an RGB image x and outputs a depth image y, G : x → y.” A depth image is a heatmap based on determined distance.)
Atapour-Abarghouei does not appear to explicitly disclose
a non-transitory computer readable medium configured to store instructions thereon; and a processor connected to the non-transitory computer readable medium, wherein the processor is configured to execute the instructions for: 
comparing the first heatmap with a second heatmap generated by a trained neural network (NN);
updating the in-vehicle NN model based on differences between the first heatmap and the second heatmap;
determining whether a latency of the encoder satisfies a latency specification, wherein the latency specification is based on components of an in-vehicle object detection system; and 
outputting the in-vehicle NN model in response to the latency satisfying the latency specification and the difference between the first heatmap and the second heatmap satisfying an accuracy specification.
However, Cho teaches comparing the first heatmap with a second heatmap generated by a trained neural network (NN); (Pg. 2, section 1: “First, the teacher network for stereo matching is trained using a small amount of training data with ground truth depth maps.” Pg. 3, section II.C: “In our method, the pseudo ground truth depth maps computed from the existing stereo matching network, which acts as a deep teacher network, are used to train the student network for monocular depth inference.” Pg. 5, section III.D: “Given a monocular input image and pseudo-ground-truth depth map $\tilde{D}(p)$, we propose to use the stereo confidence guided regression loss $L_c$: $L_c = \frac{1}{\sum_p M_p} \sum_p M_p \cdot \left\| \hat{D}_p - \tilde{D}(p) \right\|_1$ … where $\hat{D}_p$ denotes the depth map predicted by the monocular depth estimation network.” The term $\hat{D}_p - \tilde{D}(p)$ represents the difference between the depth map $\hat{D}_p$ (i.e. first heatmap) predicted by the student monocular depth estimation network, and the depth map $\tilde{D}(p)$ (i.e. second heatmap) generated by the deep teacher network (i.e. trained neural network).)
updating the in-vehicle NN model based on differences between the first heatmap and the second heatmap; (Pg. 4, section III.A: “The pseudo-ground-truth depth maps are used to supervise the monocular student network via stereo confidence guided regression loss.” The student network (i.e. the in-vehicle NN model) is trained (i.e. its parameters are updated) based on supervision by stereo confidence guided regression loss, which is based on the difference between the student’s depth map (i.e. heatmap) and the teacher’s depth map (i.e. heatmap).)
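For illustration only, the comparison the examiner maps onto Cho's stereo confidence guided regression loss can be sketched as a confidence-weighted mean of absolute per-pixel differences between the student's depth map and the teacher's pseudo-ground-truth depth map. The function name, the flattened-list representation, the sample values, and the zero-weight fallback below are all assumptions made for this sketch; they are not taken from Cho or from the application.

```python
from typing import Sequence


def confidence_guided_l1(
    student: Sequence[float],
    teacher: Sequence[float],
    confidence: Sequence[float],
) -> float:
    """Confidence-weighted mean absolute difference between two depth maps.

    Each argument is a flattened list of per-pixel values of equal length:
    student depth, teacher (pseudo-ground-truth) depth, and confidence weight.
    """
    if not (len(student) == len(teacher) == len(confidence)):
        raise ValueError("inputs must have the same number of pixels")
    total_weight = sum(confidence)
    if total_weight == 0:
        # Assumption for this sketch: no confident pixels means no loss signal.
        return 0.0
    weighted_error = sum(
        m * abs(s - t) for s, t, m in zip(student, teacher, confidence)
    )
    return weighted_error / total_weight


# Hypothetical 4-pixel example; the last pixel has zero confidence,
# so its large disagreement does not contribute to the loss.
student_depth = [10.0, 20.0, 30.0, 40.0]
teacher_depth = [12.0, 20.0, 27.0, 80.0]
confidence_map = [1.0, 1.0, 1.0, 0.0]
loss = confidence_guided_l1(student_depth, teacher_depth, confidence_map)
print(loss)
```

In a training loop, a value of this kind would be the quantity minimized when the student's parameters are updated, which is the "updating ... based on differences" step the rejection relies on.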
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Atapour-Abarghouei and Cho. Atapour-Abarghouei teaches monocular depth estimation for autonomous vehicles using an encoder-decoder architecture and object detection via semantic segmentation. Cho teaches monocular depth estimation using an encoder-decoder architecture trained via a student-teacher knowledge distillation strategy. One of ordinary skill would have motivation to combine Atapour-Abarghouei and Cho because Cho’s strategy is applicable to “a number of challenges in robotics and computer vision tasks including… autonomous driving… and scene understanding” (pg. 1, section 1), and “obviates the need to use massive ground truth depth maps” in training (pg. 2, section 1), while still “outperform[ing] state-of-the-art approaches” (pg. 1, Abstract).
Kentley-Klay teaches a non-transitory computer readable medium configured to store instructions thereon; and a processor connected to the non-transitory computer readable medium, (Col. 11, lines 33-62: “In some examples, the vehicle system 202 may include processor(s) 204 and/or memory 206… In some examples, the memory 206 may include a non-transitory computer readable media configured to store executable instructions/modules, data, and/or data items accessible by the processor(s) 204.”)
Kentley-Klay teaches wherein the processor is configured to execute the instructions for: 
determining whether a latency of the encoder satisfies a latency specification, (Col. 4, lines 43-64: “In some examples, an autonomous vehicle and/or a remote computing device may analyze performance of the autonomous vehicle as it operated using a trained target model and/or an experimental ML model to determine whether performance of the autonomous vehicle was satisfactory or improved. Historic drive data of the autonomous vehicle and/or other autonomous vehicles may be used as a control to which to compare… Metrics for experimentally making such a determination may include…speed of achieving a result by an ML model…” Speed of achieving a result by an ML model is latency, and this latency metric associated with the experimental model (i.e. in-vehicle NN, encoder) is compared to data associated with other models (i.e. a specification) to determine whether performance is satisfactory.)
outputting the in-vehicle NN model in response to the latency satisfying the latency specification and the difference between the first heatmap and the second heatmap satisfying an accuracy specification. (Col. 3, line 65 – col. 4, line 13: “For example, a first vehicle may operate using an experimental ML model while a second vehicle monitors the first vehicle. In such examples, the second vehicle may monitor sensor data captured by the first vehicle and/or the second vehicle to analyze performance of the first vehicle. As non-limiting examples, such analysis may include, for example…an accuracy of object segmentation and/or classification… This is discussed further herein as ‘determining performance metrics.’” Col. 14, lines 59-63: “Models having performance metrics outperforming current models may be considered as validated models. Validated models may be transmitted to a remote computing system for dissemination to a fleet of vehicles, or a portion thereof.” Col. 4, lines 55-64: “Metrics for experimentally making such a determination may include…speed of achieving a result by an ML model…” Speed of achieving a result by an ML model is latency. Model performance metrics of the experimental model (i.e. in-vehicle NN), which include latency and accuracy, are compared to metrics of currently deployed models (i.e. specifications), and if performance metrics are satisfactory, the model is validated. The validated experimental model (i.e. in-vehicle NN) is then transmitted (i.e. output).)
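As a rough sketch of the claimed gating step (output the model only when both the latency specification and the accuracy specification are satisfied), the following example measures an average inference time and checks it together with an error value against two thresholds. The `Specification` fields, the helper names, and the toy model are hypothetical; neither the application nor Kentley-Klay is quoted here as describing this particular implementation.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Specification:
    # Hypothetical thresholds standing in for the claimed specifications.
    max_latency_s: float
    max_mean_abs_error: float


def measure_latency(
    model: Callable[[List[float]], List[float]],
    sample: List[float],
    repeats: int = 5,
) -> float:
    """Average wall-clock seconds per call of `model` on `sample`."""
    start = time.perf_counter()
    for _ in range(repeats):
        model(sample)
    return (time.perf_counter() - start) / repeats


def should_output_model(
    latency_s: float, mean_abs_error: float, spec: Specification
) -> bool:
    """True only when both the latency and the accuracy thresholds are met."""
    return (
        latency_s <= spec.max_latency_s
        and mean_abs_error <= spec.max_mean_abs_error
    )


def toy_model(image: List[float]) -> List[float]:
    # Stand-in for an encoder; it simply scales each pixel value.
    return [2.0 * pixel for pixel in image]


spec = Specification(max_latency_s=1.0, max_mean_abs_error=0.5)
latency = measure_latency(toy_model, [0.1, 0.2, 0.3])
decision = should_output_model(latency, mean_abs_error=0.3, spec=spec)
print(decision)
```

The point of contention in the rejection is where the thresholds come from; the sketch deliberately leaves `Specification` as an input so that either reading (metrics of a currently deployed model, or limits derived from in-vehicle hardware components) could populate it.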
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Atapour-Abarghouei, Cho, and Kentley-Klay. Atapour-Abarghouei teaches monocular depth estimation for autonomous vehicles using an encoder-decoder architecture and object detection via semantic segmentation. Cho teaches monocular depth estimation using an encoder-decoder architecture trained via a student-teacher knowledge distillation strategy. Kentley-Klay teaches a system for continuous model training, validation, and deployment for autonomous vehicles. One of ordinary skill would have motivation to combine Atapour-Abarghouei, Cho, and Kentley-Klay because “An autonomous vehicle may require updates and/or additions to its ML models to handle previously unencountered scenarios or to improve one or more driving parameters” (Kentley-Klay, col. 1, lines 58-61), and Kentley-Klay’s system provides a mechanism by which updated models can be trained, validated, and deployed to autonomous vehicles.
Yang teaches wherein the latency specification is based on components of an in-vehicle object detection system; and (Pg. 4, section 3-3.1: "We propose an algorithm, called NetAdapt, that will allow a user to automatically simplify a pretrained network to meet the resource budget of a platform while maximizing the accuracy… The resource can be latency, energy, memory footprint, etc., or a combination of these metrics." A neural network is simplified in order to satisfy a resource budget of a platform on which it is to be deployed (i.e. a specification based on components of the system). The resource budget can be a latency budget (i.e. a latency specification).)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Atapour-Abarghouei, Cho, Kentley-Klay, and Yang. Atapour-Abarghouei teaches monocular depth estimation for autonomous vehicles using an encoder-decoder architecture and object detection via semantic segmentation. Cho teaches monocular depth estimation using an encoder-decoder architecture trained via a student-teacher knowledge distillation strategy. Kentley-Klay teaches a system for continuous model training, validation, and deployment for autonomous vehicles. Yang teaches adapting a neural network for deployment on a resource-constrained edge device based on the device’s latency budget. One of ordinary skill would have motivation to combine Atapour-Abarghouei, Cho, Kentley-Klay, and Yang because “DNN-based AI applications are typically too computationally intensive to be deployed on resource-constrained platforms” (Yang, pg. 1, section 1), such as autonomous vehicles, and “the proposed algorithm can achieve better accuracy versus latency trade-off (by up to 1.7× faster with equal or higher accuracy) compared with other state-of-the-art network simplification algorithms” (Yang, pg. 14, section 5).
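The idea attributed to Yang, simplifying a network until it meets a platform's resource budget while keeping accuracy as high as possible, can be illustrated with a toy greedy loop. This is a simplified sketch in the spirit of that description and not the NetAdapt algorithm itself: the layer widths, the per-unit latency costs, the linear accuracy proxy, and the fixed shrink step are all assumptions invented for the example.

```python
from typing import Callable, List


def simplify_to_budget(
    widths: List[int],
    latency_of: Callable[[List[int]], float],
    accuracy_of: Callable[[List[int]], float],
    budget: float,
    step: int = 1,
) -> List[int]:
    """Greedily shrink one layer per round until latency fits the budget.

    Each round proposes one candidate per layer (that layer reduced by
    `step` units) and keeps the candidate with the highest accuracy score.
    """
    current = list(widths)
    while latency_of(current) > budget:
        candidates = []
        for i in range(len(current)):
            if current[i] - step >= 1:  # keep at least one unit per layer
                candidate = current.copy()
                candidate[i] -= step
                candidates.append(candidate)
        if not candidates:
            raise ValueError("latency budget cannot be met")
        current = max(candidates, key=accuracy_of)
    return current


# Hypothetical per-unit latency cost and importance of each layer.
UNIT_COST = [1.0, 2.0, 3.0]
IMPORTANCE = [3.0, 1.0, 2.0]


def toy_latency(widths: List[int]) -> float:
    return sum(w * c for w, c in zip(widths, UNIT_COST))


def toy_accuracy(widths: List[int]) -> float:
    return sum(w * k for w, k in zip(widths, IMPORTANCE))


result = simplify_to_budget([4, 4, 4], toy_latency, toy_accuracy, budget=18.0)
print(result, toy_latency(result))
```

With these made-up costs the loop repeatedly trims the least important layer. In a real setting the latency function would come from measurements on the target hardware, which is the link the examiner draws to a latency specification "based on components" of the system.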

Regarding Claim 2, Atapour-Abarghouei, Cho, Kentley-Klay, and Yang teach The in-vehicle model training system according to claim 1, as shown above.
Atapour-Abarghouei also teaches wherein the processor is further configured to execute the instructions for performing the object detection using semantic segmentation. (Pg. 4295, section 1: “The major contributions of this work are thus as follows: • Using pixel-level scene segmentation as a prior to enhance the performance of monocular depth estimation. • Utilizing an end-to-end training procedure for an overall model capable of estimating depth for individual groups of scene objects based on a semantic segmentation step jointly trained within the same model.” Identifying groups of scene objects is object detection, and it is performed using semantic segmentation.)

Regarding Claim 3, Atapour-Abarghouei, Cho, Kentley-Klay, and Yang teach The in-vehicle model training system according to claim 1, as shown above.
Atapour-Abarghouei also teaches wherein the processor is further configured to execute the instructions for receiving the input image includes a red-green-blue (RGB) image. (Pg. 4295, section 1: “In this work, we propose a model that estimates scene depth based on a single RGB image…”)

Regarding Claim 4, Atapour-Abarghouei, Cho, Kentley-Klay, and Yang teach The in-vehicle model training system according to claim 1, as shown above.
Kentley-Klay also teaches wherein the processor is further configured to execute the instructions for receiving the latency specification and the accuracy specification from an external device. (Col. 15, lines 60-67: “In some examples, the remote computing device 222 may receive a trained target ML model from the one or more vehicles, performance metrics 236, a machine state 238, and/or recorded sensor data. The remote computing device 222 may use this data to determine whether performance of the vehicle system 202 was improved or degraded by inclusion of the target and/or experimental ML model.” Performance metric data includes accuracy and latency metrics, and comprises the specification to which model performance is compared (see claim 1). These metrics (i.e. specifications) are received from a vehicle (i.e. an external device).)

Regarding Claim 5, Atapour-Abarghouei, Cho, Kentley-Klay, and Yang teach The in-vehicle model training system according to claim 1, as shown above.
Cho also teaches wherein the processor is further configured to execute the instructions for performing the object detection using the in-vehicle NN model having fewer neurons than the trained NN. (Pg. 4, section III.A: “We propose a simple yet effective approach for monocular depth estimation by leveraging the student-teacher strategy. The shallow student network learns from the more informative deep teacher network.” The student network (i.e. the in-vehicle NN) is shallow (i.e. having fewer neurons), while the teacher network (i.e. the trained NN) is deep (i.e. having more neurons).) 

Regarding Claim 6, Atapour-Abarghouei, Cho, Kentley-Klay, and Yang teach The in-vehicle model training system according to claim 1, as shown above.
Atapour-Abarghouei also teaches wherein the processor is further configured to execute the instructions for determining the distance to each of the at least one object using a decoder. (Pg. 4297, section 3.3: “For the sake of consistency, all the segmentation and depth generator networks follow a similar encoder-decoder architecture…” Object distance is determined by the depth generator network, which includes a decoder.)

Regarding Claim 7, Atapour-Abarghouei, Cho, Kentley-Klay, and Yang teach The in-vehicle model training system according to claim 6, as shown above.
Cho also teaches wherein the processor is further configured to execute the instructions for updating the decoder based on differences between the first heatmap and the second heatmap. (Pg. 4, section III.A: “The pseudo-ground-truth depth maps are used to supervise the monocular student network via stereo confidence guided regression loss.” The monocular depth estimation student network is trained (i.e. its parameters are updated) based on supervision by stereo confidence guided regression loss, which is based on the difference between the student’s depth map (i.e. heatmap) and the teacher’s depth map (i.e. heatmap). Figure 5 shows the architecture of the monocular depth estimation network, including a “decoder” (pg. 5).)

Regarding Claim 8, Atapour-Abarghouei, Cho, Kentley-Klay, and Yang teach The in-vehicle model training system according to claim 1, as shown above.
Kentley-Klay also teaches wherein the processor is further configured to execute the instructions for outputting the in-vehicle NN model by causing the in-vehicle model training system to wirelessly transmit the in-vehicle NN model to a vehicle. (Col. 15, line 67 – Col. 16, line 5: “If the remote computing device 222 determines that a model improved performance of the vehicle system 202 (i.e., the remote computing device 222 “validates” the model), the remote computing device 222 may transmit the validated model to one or more vehicles of a fleet of vehicles.” Col. 12, lines 47-58: “The example vehicle system 202 may include a network interface 210 configured to establish a communication link (i.e., ‘network’) between the vehicle system 202 and one or more other devices… For example, the network interface 210 may enable wireless communication between another vehicle 218 and/or the remote computing device 222.” The validated model (i.e. in-vehicle NN) is transmitted to a vehicle via a network interface enabling wireless communication.)

Claims 9-16 are method claims, containing substantially the same elements as system claims 1-8. Atapour-Abarghouei, Cho, Kentley-Klay, and Yang teach the elements of claims 1-8, as shown above.

Claims 17-20 are product claims, containing substantially the same elements as system claims 1, 3, 5, and 8. Atapour-Abarghouei, Cho, Kentley-Klay, and Yang teach the elements of claims 1, 3, 5, and 8, as shown above.

Conclusion
Claims 1-20 are rejected.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN M ROHD whose telephone number is (571)272-6445. The examiner can normally be reached Mon-Thurs 8:00-6:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Viker Lamardo, can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/B.M.R./Examiner, Art Unit 2147
/VIKER A LAMARDO/Supervisory Patent Examiner, Art Unit 2147
Prosecution Timeline

Nov 04, 2025: Applicant Interview (Telephonic)
Nov 04, 2025: Examiner Interview Summary
Nov 13, 2025: Response Filed
Feb 06, 2026: Final Rejection mailed (§103)
Mar 02, 2026: Interview Requested
Mar 18, 2026: Response after Non-Final Action
Apr 13, 2026: Request for Continued Examination
Apr 18, 2026: Response after Non-Final Action
Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability (interview lift: +0.0%)
Median Time to Grant: 4y 3m (~6m remaining)
PTA Risk: Moderate
Based on 2 resolved cases by this examiner. Grant probability derived from career allowance rate.