DETAILED ACTION
Response to Arguments
Applicant’s arguments with respect to claims 11-13 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged by Applicant’s arguments with respect to a detector machine-learning model, a first neural network, and a second neural network, as found on pgs. 10-13 of Applicant’s Remarks submitted on 12/23/2025.
With regard to Applicant’s argument that the prior art of Geng in view of Dimitrievski and Lakshmi does not teach the newly added claim limitations of claims 1 and 14, Examiner respectfully disagrees. See the corresponding rejections in this Office Action for the detailed teaching. Accordingly, the 103 rejection has not been withdrawn.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/23/2025 has been entered.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are:
the sensing system configured to in claim 1
the sensing system configured to in claim 7
the sensing system is further configured to in claim 10
the sensing system configured to in claim 11
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-6, 8-10, 14-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Geng, et al. "Low-observable targets detection for autonomous vehicles based on dual-modal sensor fusion with deep learning approach." Proceedings of the Institution of Mechanical Engineers, Part D: Journal of automobile engineering 233.9 (2019)(“Geng”) in view of Dimitrievski, Martin, et al. "Automatic labeling of vulnerable road users in multi-sensor data." 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021(“Dimitrievski”) and further in view of US 2020/0086879 A1(“Lakshmi”).
Regarding claim 1, Geng teaches a system comprising:
a sensing system of a vehicle(Geng, pg., 2277, see also fig. 10, “To acquire the color–infrared images, the construction of an image capturing system is completed so that the color image camera and the infrared image camera is able to acquire the image data simultaneously, as shown in Figure 10[a sensing system of a vehicle].”), the sensing system configured to:
obtain multi-modal sensing data characterizing an environment of the vehicle(Geng, pg., 2277, see also fig. 10, “To acquire the color…images, the construction of an image capturing system is completed so that the color image camera…is able to acquire the image data simultaneously, as shown in Figure 10[obtain multi-modal sensing data characterizing an environment of the vehicle].”);
and crop the multi-modal sensing data to obtain(Geng, pgs. 2273-2275, As fig. 3 details below:
[Geng, fig. 3 reproduced]
The operation panel allows for cropping of the first image in the environment to produce a cropped label and target of the color image as detailed in fig. 5);
first sensing data that comprises optical range camera sensing data for the identified region(Geng, pgs., 2273, “The color image camera Canon EOS M6 has a resolution of 24 million pixels[first sensing data comprises at least one of an optical range camera sensing data for the identified region]….” &), 1
second sensing data that comprises infrared camera sensing data for the identified region(Geng, pgs., 2273, “[T]he infrared image camera Flir A615 has a resolution of 300,000 pixel[a second sensing data that comprises infrared camera sensing data for the identified region]….”);
and a perception system of the vehicle(Geng, pg., 2277, see also fig. 10, “To acquire the color–infrared images, the construction of an image capturing system is completed so that the color image camera and the infrared image camera is able to acquire the image data simultaneously, as shown in Figure 10[and a perception system of the vehicle].”),
the perception system configured to:
process the first sensing data, the second sensing data, [and the third sensing data] using a classifier (MLM) to obtain a classification of the one or more (VRUs) present in the environment of the vehicle(Geng, pgs., 2275-2278, see also fig. 6 and row 1 of Table 2, “We built a dual-modal deep neural network based on Faster R-CNN and used the VGG16 model the convolutional layer… [t]he color and infrared sub-networks are integrated after the fourth VGG-16 convolutional block via fusion block, including similar feature map concatenation and a 1x1 convolutional layer… [t]hen the merged feature maps are processed by the same operation as in single-modal detection network Faster R-CNN. The dual-modal information is passed layer by layer until to the final classification and bounding box regression[process the first sensing data and the second sensing data using a (MLM) to obtain a classification]… and analyze the target recognition performance on the all testing dataset in Table 2, which shows the targets detection rate of people…in the whole testing dataset[of the one or more (VRUs) present in the environment of the vehicle]….” );2, 3
Geng does not teach: process, by a detector machine learning model (MLM), the multi-modal sensing data, wherein an output of the detector MLM comprises an identification of a region of the environment comprising one or more vulnerable road users (VRUs); third sensing data that comprises lidar sensing data for the identified region; and the third sensing data.
However, Dimitrievski teaches:
process, by a detector machine learning model (MLM), the multi-modal sensing data, wherein an output of the detector MLM comprises an identification of a region of the environment comprising one or more vulnerable road users (VRUs)(Dimitrievski, pgs., 2625-2626, see also fig. 2, “A state-of-the-art object detector is used to find bounding boxes in the RGB image. Simultaneously, 3D lidar information is projected into a corresponding depth image, augmenting the bounding boxes with range data. These 3D object positions are then converted into accurate object tracks by an off-line two-pass tracking algorithm[process, by a detector machine learning model (MLM), the multi-modal sensing data]... [u]sing this methodology, we produce a dataset that includes raw radar data, time-consistent positions and identities in both image plane and 2.5-D ground plane coordinates, targeted specifically at VRU detection and tracking[wherein an output of the detector MLM comprises an identification of a region of the environment comprising one or more vulnerable road users (VRUs)].”);
third sensing data that comprises lidar sensing data for the identified region(Dimitrievski, pg., 2625-2626, see also fig. 2, “In Figure 2 we present a general system diagram of the proposed annotation tool. The tool relies on multi-modal data, more specifically: RGB video sequences for reliable object recognition, and lidar point clouds for the corresponding depth information needed to reconstruct the 3D structure of the scene... [i]n our case, we use a full HD RGB camera module (Intel Realsense D435 in rgb mode) and Ouster OS1-128 scanning lidar as the supervising sensors[third sensing data that comprises lidar sensing data for the identified region].”);
[process the first sensing data, the second sensing data, ] and the third sensing data[ using a classifier (MLM) to obtain a classification of the one or more (VRUs) present in the environment of the vehicle](Dimitrievski, pg., 2625-2626, see also fig. 2, “In Figure 2 we present a general system diagram of the proposed annotation tool. The tool relies on multi-modal data, more specifically: RGB video sequences for reliable object recognition, and lidar point clouds for the corresponding depth information needed to reconstruct the 3D structure of the scene... [i]n our case, we use a full HD RGB camera module (Intel Realsense D435 in rgb mode) and Ouster OS1-128 scanning lidar as the supervising sensors[and third sensing data].”).4
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Geng with the teachings of Dimitrievski; the motivation to do so would be to incorporate lidar sensing data for autonomous vehicles for additional perception capabilities when other sensors may be hampered by external events(Dimitrievski, pg., 2623, “Accurate perception of vulnerable road users (VRUs) is central to the deployment of fully autonomous driving... [r]einforcing camera-based perception with more robust modalities such as lidar and radar, offers sensor redundancy for fail-safe operation, as well as the potential for increased accuracy by sensor fusion.”).
Even though Geng in view of Dimitrievski teaches in view of the obtained classification of the one or more VRUs, Geng in view of Dimitrievski does not teach: and cause a driving path of the vehicle to be modified.
However, Lakshmi teaches:
and cause a driving path of the vehicle to be modified [in view of the obtained classification of the one or more VRUs](Lakshmi, para., 0093, see also fig. 8, “[T]he vehicle system 828 may operate, or perform an action based on the predicted driver behavior associated with the x corresponding prediction frames. For example, the vehicle system 828 may be an advanced driver-assistance systems (ADAS) which may implement an automated steering or deceleration action to mitigate an anticipated collision, alert the driver of a potential collision, provide warnings, automatically engage an autonomous driving mode for the vehicle, automate lighting, provide or engage an adaptive cruise control, engaged in a collision avoidance action, generate a traffic notification, connect a smartphone, contact an emergency contact, engage in a lane departure warning mode or action, provide automatic lane centering, highlight an obstacle on a display or a HUD, etc[and cause a driving path of the vehicle to be modified].”).5
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Geng in view of Dimitrievski with the teachings of Lakshmi; the motivation to do so would be to determine the driving behavior of other vehicles using less supervised data(Lakshmi, paras. 0044-0045, “A unified representation framework is proposed to enable the application of learning driving behavior or driver behavior recognition. This learning or behavior recognition may be based on three-dimensional (3D) semantic scene representations and multimodal data fusion of data from vehicle sensors, such as cameras or other sensors connected to a controller area network (CAN) bus of the vehicle, to detect tactical driver behaviors… [o]ne of the advantages or benefits provided by this unified representation framework or the techniques and systems for driver behavior recognition described herein is that the issues of data scarcity for supervised learning algorithms may be alleviated or mitigated.”).
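For illustration only, the detect-then-classify flow mapped above may be sketched in Python pseudocode. All function names, array shapes, and values below are hypothetical and are not taken from Geng, Dimitrievski, or Lakshmi; the sketch merely shows a detector MLM proposing a region, the multi-modal data being cropped to that region, and a classifier MLM classifying the crop.

import numpy as np

def detector_mlm(rgb):
    # Hypothetical detector MLM: returns candidate regions as (x, y, w, h) boxes.
    return [(10, 10, 64, 128)]

def crop(frame, box):
    # Crop one modality's frame to the identified region.
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

def classifier_mlm(rgb_patch, ir_patch, lidar_patch):
    # Hypothetical classifier MLM: fuses the per-modality crops and assigns a VRU class.
    return "pedestrian"

rgb = np.zeros((480, 640, 3))
ir = np.zeros((480, 640))
lidar = np.zeros((480, 640))
for box in detector_mlm(rgb):
    label = classifier_mlm(crop(rgb, box), crop(ir, box), crop(lidar, box))
    # The resulting classification could then inform a change to the vehicle's driving path.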
Regarding claim 2, Geng in view of Dimitrievski and Lakshmi teaches the system of claim 1, wherein the classifier MLM comprises
a neural network (NN)(Geng, pgs., 2275-2278, see also fig. 6 and table 2, “We built a dual-modal deep neural network based on Faster R-CNN and used the VGG16 model the convolutional layer[a neural network (NN)].”), and wherein the NN comprises:
a first subnetwork configured to process the first sensing data and to output one or more first feature vectors characteristic of the first sensing data(Geng, pgs., 2275-2278, As fig 6 details:
[Geng, fig. 6 reproduced]
A color image is input into the color sub-network, which is a convolutional neural network that outputs a 512-channel feature map);
a second subnetwork configured to process the second sensing data and to output one or more second feature vectors characteristic of the second sensing data(Geng, pgs., 2275-2278, As fig 6 details:
[Geng, fig. 6 reproduced]
An infrared image is input into the infrared sub-network, which is a convolutional neural network that outputs a 512-channel feature map);
and a fusion subnetwork configured to process an aggregated feature vector, wherein the aggregated feature vector comprises the one or more first feature vectors and the one or more second feature vectors(Geng, pgs., 2275-2278, As fig 7 details:
[Geng, fig. 7 reproduced]
The two feature maps from the infrared sub-network and the color sub-network are concatenated together (blue block), then are input into a 1x1 convolutional layer (yellow block) and another convolutional layer with ReLU activation (grey block) to become a fused block/layer),
and wherein the classification of the one or more VRUs is determined using an output of the fusion subnetwork(Geng, pgs., 2275-2278, As fig 6 details:
[Geng, fig. 6 reproduced]
The fused block/layer is input into a ROI pooling layer and a region proposal network, and then into two fully connected layers that are later fed to a final classification head (i.e., SoftMax classification) and a bounding box regression head (i.e., Bbox regression) to predict the one or more VRUs).
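For illustration only, the fusion block described above (channel-wise concatenation of two 512-channel feature maps followed by a 1x1 convolution and a further convolution with ReLU) may be sketched as follows. The module and parameter names are hypothetical; only the channel counts follow the reading of Geng fig. 7 given above, and the sketch is not Geng's actual implementation.

import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        # Concatenated color+infrared maps (2*channels) are reduced by a 1x1 convolution,
        # then passed through a further convolution with ReLU, per the description above.
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, color_feat, ir_feat):
        fused = torch.cat([color_feat, ir_feat], dim=1)  # channel-wise concatenation
        return self.relu(self.conv(self.reduce(fused)))

# Two 512-channel feature maps in, one fused 512-channel map out:
fused_map = FusionBlock()(torch.randn(1, 512, 14, 14), torch.randn(1, 512, 14, 14))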
Regarding claim 3, Geng in view of Dimitrievski and Lakshmi teaches the system of claim 2, wherein the fusion subnetwork is a recurrent NN(Lakshmi, para. 0057, see also fig. 6 and 7, “According to another aspect, the RNN unit 120 may process the fusion feature using a LSTM layer via the LSTM unit 122[wherein the fusion subnetwork is a recurrent NN].”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Geng in view of Dimitrievski with the above teachings of Lakshmi for the same rationale stated at Claim 1.
Regarding claim 4, Geng in view of Dimitrievski and Lakshmi teaches the system of claim 2, wherein the NN further comprises
a plurality of classification heads, and wherein each of the plurality of classification heads is configured to process the output of the fusion subnetwork and to determine, for at least one object in the environment, a probability of the object belonging to a respective one of a plurality of VRU classes(Geng, pgs., 2275-2278, As fig 6 details:
[Geng, fig. 6 reproduced]
The fused block/layer is input into a ROI pooling layer and a region proposal network, and then into two fully connected layers that are later fed to the final classification head (SoftMax classification)[and to determine, for at least one object in the environment, a probability of the object belonging to a respective one of a plurality of VRU classes] and bounding box regression head (Bbox regression)[a plurality of classification heads and wherein each of the plurality of classification heads is configured to process the output of the fusion subnetwork] for the one or more VRUs).
Regarding claim 5, Geng in view of Dimitrievski and Lakshmi teaches the system of claim 4, wherein the plurality of VRU classes comprises at least one of a pedestrian, a bicyclist, or a motorcyclist(Geng, pg., 2278, see also fig. 9 and row 1 of Table 2, “The test results obtained by the three convolutional networks fusion architecture compare and analyze the target recognition performance on the all testing dataset in Table 2, which shows the targets detection rate of people[at least one of a pedestrian]….”).6
Regarding claim 6, Geng in view of Dimitrievski and Lakshmi teaches the system of claim 2, wherein the first subnetwork is trained using a plurality of sets of at least one of radar training data, lidar training data, or optical camera training data, and wherein the second subnetwork is a copy of the first subnetwork(Geng, pgs., 2275-2278, “At present, about 80,000 color–infrared images are collected under various environmental conditions… 70%images of this dataset… are labeled manually to generate the labeling file stored in XML format, preparing for the training process of the deep neural network. The remaining 30% images of the dataset are used for the deep neural network[wherein the first subnetwork is trained using a plurality of sets of at least one of optical camera training data].” & As fig 6 details:
[Geng, fig. 6 reproduced]
The second subnetwork, i.e., the infrared sub-network, is a copy of the first subnetwork, i.e., the color sub-network, in which the second subnetwork has the same convolutional structure as the first subnetwork, i.e., conv_1_64, conv2_128, conv3_256, conv4_512).7
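For illustration only, the structural relationship described above (the infrared sub-network as a copy of the color sub-network with the same convolutional blocks) may be sketched as follows; the block labels mirror those noted above, but all other details (layer parameters, pooling, channel handling) are assumed and are not taken from Geng.

import copy
import torch.nn as nn

# Color sub-network: a stack of convolutional blocks labeled per the figure discussion above
# (conv1_64, conv2_128, conv3_256, conv4_512); pooling and other details omitted.
color_subnet = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),     # conv1_64
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),   # conv2_128
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),  # conv3_256
    nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(),  # conv4_512
)

# The infrared sub-network is instantiated as a structural copy of the color sub-network;
# in practice the first layer's input channels would be adapted to the infrared modality.
infrared_subnet = copy.deepcopy(color_subnet)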
Regarding claim 8, Geng in view of Dimitrievski and Lakshmi teaches the system of claim 1, wherein each of the first sensing data and the second sensing data comprises
a plurality of images, each of the plurality of images being associated with a respective sensing frame of a plurality of sensing frames, wherein each of the plurality of sensing frames corresponds to a respective one of a plurality of times, and wherein the perception system performs pipeline processing of the plurality of sensing frames using the classifier MLM(Lakshmi, para. 0108, see also fig. 14, “In FIG. 14, the first image capture sensor 806 may capture the first image sequence 1004 and pass this first image sequence 1004 on to the memory 104, the second image capture sensor 808 may capture the second image sequence 1002 and pass this second image sequence 1002 on to the memory 104, and the CAN bus 128 may capture or receive the CAN sequence 1006 from one or more of the vehicle systems 828[a plurality of images, each of the plurality of images being associated with a respective sensing frame of a plurality of sensing frames, wherein each of the plurality of sensing frames corresponds to a respective one of a plurality of times]. The first image sequence 1004 may be fed, one frame at a time, through a first CNN 1410, such as by the convolutor 110, the depth CNN unit 112, or the pose CNN unit 114, including a first portion 1412 of layers and a second portion 1414 of layers to produce or generate a first feature vector 1416 associated with image segmentation 1418. The second image sequence 1002 may be fed, one frame at a time, through a second CNN 1420, including a first portion 1422 of layers and a second portion 1424 of layers to produce or generate a second feature vector 1426 associated with driver pose 1428, such as by the convolutor 110, the depth CNN unit 112, or the pose CNN unit 114[and wherein the perception system performs pipeline processing of the plurality of sensing frames using the classifier MLM].”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Geng in view of Dimitrievski with the above teachings of Lakshmi for the same rationale stated at Claim 1.
Regarding claim 9, Geng in view of Dimitrievski and Lakshmi teaches the system of claim 8, wherein individual frames of the plurality of sensing frames are processed using the classifier MLM in an order of acquisition associated with a direction of scanning, by the sensing system of the vehicle, of the environment of the vehicle(Lakshmi, para. 0108, see also fig. 14, “The first image sequence 1004 may be fed, one frame at a time, through a first CNN 1410, such as by the convolutor 110, the depth CNN unit 112, or the pose CNN unit 114, including a first portion 1412 of layers and a second portion 1414 of layers to produce or generate a first feature vector 1416 associated with image segmentation 1418. The second image sequence 1002 may be fed, one frame at a time, through a second CNN 1420, including a first portion 1422 of layers and a second portion 1424 of layers to produce or generate a second feature vector 1426 associated with driver pose 1428, such as by the convolutor 110, the depth CNN unit 112, or the pose CNN unit 114[wherein individual frames of the plurality of sensing frames are processed using the classifier MLM in an order of acquisition associated with a direction of scanning, by the sensing system of the vehicle, of the environment of the vehicle]… [t]he fusion feature vector 1450 may be passed through a LSTM layer 1460, thereby generating a tactical driver behavior recognition/prediction result 1470 for respective frames.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Geng in view of Dimitrievski with the above teachings of Lakshmi for the same rationale stated at Claim 1.
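For illustration only, the frame-by-frame (pipeline) processing mapped for claims 8 and 9 above may be sketched as follows; the function and variable names are hypothetical and are not taken from Lakshmi, and the LSTM stands in for the temporal fusion layer cited above.

import torch

def process_sequence(frames_a, frames_b, cnn_a, cnn_b, lstm):
    # frames_a / frames_b: lists of per-modality image tensors, already ordered by
    # acquisition time (cf. the order-of-acquisition limitation of claim 9).
    fused_per_frame = []
    for fa, fb in zip(frames_a, frames_b):
        va = cnn_a(fa.unsqueeze(0)).flatten(1)   # first-modality feature vector for this frame
        vb = cnn_b(fb.unsqueeze(0)).flatten(1)   # second-modality feature vector for this frame
        fused_per_frame.append(torch.cat([va, vb], dim=1))
    seq = torch.stack(fused_per_frame, dim=1)    # shape (batch=1, frames, features)
    out, _ = lstm(seq)                           # temporal fusion; lstm assumed nn.LSTM(batch_first=True)
    return out[:, -1]                            # feature used for the per-sequence prediction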
Regarding claim 10, Geng in view of Dimitrievski and Lakshmi teaches the system of claim 1, wherein the sensing system is further configured to: obtain a radar sensing data characterizing the environment of the vehicle, and wherein to obtain the classification of the one or more VRUs, the perception system of the vehicle is further configured to process, using the classifier MLM, the radar sensing data (Dimitrievski, pgs., 2625-2626, see also figs. 1 and 2, “In Figure 2 we present a general system diagram of the proposed annotation tool...[t]he tool relies on multi-modal data...[u]sing this methodology, we produce a dataset that includes raw radar data, time-consistent positions and identities in both image plane and 2.5-D ground plane coordinates, targeted specifically at VRU detection and tracking[obtain a radar sensing data characterizing the environment of the vehicle, and wherein to obtain the classification of the one or more VRUs, the perception system of the vehicle is further configured to process, using the classifier MLM, the radar sensing data].”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Geng with the above teachings of Dimitrievski for the same rationale stated at Claim 1.
Referring to independent claim 14, it is rejected on the same basis as independent claim 1 since they are analogous claims.
Referring to dependent claims 15-16, 18 and 20 they are rejected on the same basis as dependent claims 2-3, 6, and 8 since they are analogous claims.
Furthermore, referring to dependent claim 17, it is rejected on the same basis as dependent claims 4-5 since dependent claims 4-5 encompass dependent claim 17 and hence are analogous claims.
Claims 11 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Geng, et al. "Low-observable targets detection for autonomous vehicles based on dual-modal sensor fusion with deep learning approach." Proceedings of the Institution of Mechanical Engineers, Part D: Journal of automobile engineering 233.9 (2019)(“Geng”) in view of Ning et al., US 9,760,806 B1 (“Ning”) and in view of Lakshmi, US 2020/0086879 A1(“Lakshmi”).
Regarding claim 11, Geng teaches a system comprising:
the sensing system configured to: obtain a first image [of an environment of the AV], wherein the first image comprises at least one of a lidar image, a radar image, or an optical range camera image(Geng, pgs., 2273, “The color image camera Canon EOS M6 has a resolution of 24 million pixels….” & Geng, pg., 2277, see also fig. 10, “To acquire the color…images, the construction of an image capturing system is completed so that the color image camera…is able to acquire the image data simultaneously, as shown in Figure 10.”);8,9
and obtain a second image [characterizing the environment of the AV], wherein the second image comprises an infrared camera sensing image(Geng, pgs., 2273, “[T]he infrared image camera Flir A615 has a resolution of 300,000 pixel….” & Geng, pg., 2277, see also fig. 10, “To acquire the…infrared images, the construction of an image capturing system is completed so that…the infrared image camera is able to acquire the image data simultaneously, as shown in Figure 10.”);10
the perception system comprising:
[a detector machine-learning model (MLM) configured to identify,] based on at least one of the first image or the second image, a first candidate object and a second candidate object [within the environment of the AV]( Geng, pgs., 2275-2278, see also fig. 6 and row 1 of Table 2, “We built a dual-modal deep neural network based on Faster R-CNN and used the VGG16 model the convolutional layer… [t]he color and infrared sub-networks are integrated after the fourth VGG-16 convolutional block via fusion block, including similar feature map concatenation and a 1x1 convolutional layer… [t]hen the merged feature maps are processed by the same operation as in single-modal detection network Faster R-CNN. The dual-modal information is passed layer by layer until to the final classification and bounding box regression… and analyze the target recognition performance on the all testing dataset in Table 2, which shows the targets detection rate of people and vehicles in the whole testing dataset[based on at least one of the first image or the second image, a first candidate object and a second candidate object]….”);11,12
and a classifier MLM configured to process at least (i) a part of the first image and (ii) a part of the second image to determine that the first candidate object is a vulnerable road user (VRU) [in the environment of the AV] and that the second candidate object is a non-VRU(Geng, pgs., 2275-2278, see also fig. 6 and row 1 of Table 2, “We built a dual-modal deep neural network based on Faster R-CNN and used the VGG16 model the convolutional layer… [t]he color and infrared sub-networks are integrated after the fourth VGG-16 convolutional block via fusion block, including similar feature map concatenation and a 1x1 convolutional layer… [t]hen the merged feature maps are processed by the same operation as in single-modal detection network Faster R-CNN. The dual-modal information is passed layer by layer until to the final classification and bounding box regression… and analyze the target recognition performance on the all testing dataset in Table 2, which shows the targets detection rate of people[the first candidate object a vulnerable road user (VRU)] and vehicles[the second candidate object is a non-VRU] in the whole testing dataset….”)13
wherein the part of the first image and the part of the second image are selected based on locations of the first candidate object and the second candidate object identified by the detector MLM(Geng, pgs., 2275-2278, see also fig. 6 and row 1 of Table 2, “We built a dual-modal deep neural network based on Faster R-CNN[identified by the detector MLM] and used the VGG16 model the convolutional layer… [t]he color and infrared sub-networks are integrated after the fourth VGG-16 convolutional block via fusion block, including similar feature map concatenation and a 1x1 convolutional layer… [t]hen the merged feature maps are processed by the same operation as in single-modal detection network Faster R-CNN. The dual-modal information is passed layer by layer until to the final classification and bounding box regression… and analyze the target recognition performance on the all testing dataset in Table 2, which shows the targets detection rate of people[wherein the part of the first image and the part of the second image are selected based on the first candidate object] and vehicles[the second candidate object] in the whole testing dataset….”).
Even though Geng teaches a first image, a second image, a first candidate object, and a second candidate object, Geng does not teach: a detector machine-learning model (MLM) configured to identify.
However, Ning teaches:
a detector machine-learning model (MLM) configured to identify(Ning, col. 6, see also figs. 1, 3 and 4, “In system 200 for vision-centric deep-learning-based road situation analysis, the generic object recognition and tracking is mainly realized by the deep neural network ROLO engine 220, whose structure is shown in FIG. 3, and the process of which is illustrated in FIG. 4. Based on the visual cues from the one or more cameras 211, a generic object detection[a detector machine-learning model (MLM)]... [t]he disclosed system 200 uses a tracking module in the deep neural network ROLO engine 220, together with a Kalman filter to follow detected pedestrians/vehicles over time[configured to identify].”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Geng with the teachings of Ning; the motivation to do so would be to implement an object detection system to improve an advanced driver assistance system (ADAS)(Ning, col. 4, lines 22-28, “The disclosed system for vision-centric deep-learning based road situation analysis, also called a vision-centric mobile advanced driver assistance systems (ADAS) system, can provide enhanced navigation via automatic environment understanding for assisting a driver to have a better road/vehicle situational awareness in complex traffic scenarios”).
Even though Geng in view of Ning teaches a first image, a second image, a first candidate object, a second candidate object, and the first candidate object is a vulnerable road user (VRU), Geng in view of Ning does not teach: a sensing system of an autonomous vehicle (AV); of an environment of the AV; characterizing the environment of the AV; a perception system of the AV; within the environment of the AV; in the environment of the AV.
However, Lakshmi teaches:
a sensing system of an autonomous vehicle (AV)(Lakshmi, para. 0153, “[T]he scene classifier 1520 determines the scene prediction to be a construction zone, the controller of the vehicle (e.g., implemented via the processor 1504) may warn or provide notifications and/or disable autonomous driving based on the scene prediction being the construction zone because autonomous vehicles may utilize pre-built, high-definition maps of a roadway[a sensing system of an autonomous vehicle (AV)]. If the scene classifier 1520 determines that it is foggy or rainy out, the processor 1504 may disable the LIDAR from one or more of the vehicle systems 1590 to mitigate ghosting effects.”);
a perception system of the AV(Lakshmi, para. 0153, “[T]he scene classifier 1520 determines the scene prediction to be a construction zone, the controller of the vehicle (e.g., implemented via the processor 1504) may warn or provide notifications and/or disable autonomous driving based on the scene prediction being the construction zone because autonomous vehicles may utilize pre-built, high-definition maps of a roadway. If the scene classifier 1520 determines that it is foggy or rainy out, the processor 1504 may disable the LIDAR from one or more of the vehicle systems 1590 to mitigate ghosting effects[a perception system of the AV].”);
of an environment of the AV(Lakshmi, para. 0153, “[T]he scene classifier 1520 determines the scene prediction to be a construction zone, the controller of the vehicle (e.g., implemented via the processor 1504) may warn or provide notifications and/or disable autonomous driving based on the scene prediction being the construction zone because autonomous vehicles may utilize pre-built, high-definition maps of a roadway. If the scene classifier 1520 determines that it is foggy or rainy out[of an environment of the AV], the processor 1504 may disable the LIDAR from one or more of the vehicle systems 1590 to mitigate ghosting effects.”);
characterizing the environment of the AV(Lakshmi, para. 0153, “[T]he scene classifier 1520 determines the scene prediction to be a construction zone, the controller of the vehicle (e.g., implemented via the processor 1504) may warn or provide notifications and/or disable autonomous driving based on the scene prediction being the construction zone because autonomous vehicles may utilize pre-built, high-definition maps of a roadway. If the scene classifier 1520 determines that it is foggy or rainy out[characterizing the environment of the AV], the processor 1504 may disable the LIDAR from one or more of the vehicle systems 1590 to mitigate ghosting effects.”);
within the environment of the AV(Lakshmi, para. 0153, “[T]he scene classifier 1520 determines the scene prediction to be a construction zone, the controller of the vehicle (e.g., implemented via the processor 1504) may warn or provide notifications and/or disable autonomous driving based on the scene prediction being the construction zone because autonomous vehicles may utilize pre-built, high-definition maps of a roadway. If the scene classifier 1520 determines that it is foggy or rainy out[within the environment of the AV], the processor 1504 may disable the LIDAR from one or more of the vehicle systems 1590 to mitigate ghosting effects.”);
in the environment of the AV(Lakshmi, para. 0153, “[T]he scene classifier 1520 determines the scene prediction to be a construction zone, the controller of the vehicle (e.g., implemented via the processor 1504) may warn or provide notifications and/or disable autonomous driving based on the scene prediction being the construction zone because autonomous vehicles may utilize pre-built, high-definition maps of a roadway. If the scene classifier 1520 determines that it is foggy or rainy out[in the environment of the AV], the processor 1504 may disable the LIDAR from one or more of the vehicle systems 1590 to mitigate ghosting effects.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Geng in view of Ning with the teachings of Lakshmi; the motivation to do so would be to determine the driving behavior of other vehicles using less supervised data(Lakshmi, paras. 0044-0045, “A unified representation framework is proposed to enable the application of learning driving behavior or driver behavior recognition. This learning or behavior recognition may be based on three-dimensional (3D) semantic scene representations and multimodal data fusion of data from vehicle sensors, such as cameras or other sensors connected to a controller area network (CAN) bus of the vehicle, to detect tactical driver behaviors… [o]ne of the advantages or benefits provided by this unified representation framework or the techniques and systems for driver behavior recognition described herein is that the issues of data scarcity for supervised learning algorithms may be alleviated or mitigated.”).
Regarding claim 13, Geng in view of Ning and Lakshmi teaches the system of claim 11, wherein the classifier MLM comprises:
a first neural network (NN) configured to process the first image and to output a first feature vector characteristic of the first image; a second NN configured to process the second image and to output a second feature vector characteristic of the second image(Ning, cols., 9-10, see also figs. 3 and 8, “The traditional CNN has 20 convolutional layers followed by 2 fully connected layers... [t]he convolutional neural network takes the video frame as its visual input 310 and produce a feature map of the whole image... the output of the first fully connected layer is a feature vector of size 4096, a dense representation of the mid-level visual features[a first neural network (NN) configured to process the first image and to output a first feature vector characteristic of the first image]...[o]nce the pre-trained convolutional weights being able to generate visual features, the YOLO architecture can be adopted... [t]he second fully connected layer of YOLO, by contrast, translates the mid-level feature to tensor representations. These predictions are encoded as an SxSx(Bx5+C) tensor. It denotes that the image is divided into SxS splits[a second NN configured to process the second image and to output a second feature vector characteristic of the second image].”);
and a third NN configured to process an aggregated feature vector and to determine a probability of the first candidate object belonging to a VRU class, wherein the aggregated feature vector comprises the first feature vector and the second feature vector, wherein to determine that the first candidate object is a VRU, the classifier MLM is to determine that the determined probability is above a threshold probability (Geng, pgs., 2275-2278, As fig 7 details:
[Geng, fig. 7 reproduced]
The two feature maps from the infrared sub-network and the color sub-network are concatenated together (blue block), then are input into a 1x1 convolutional layer (yellow block) and another convolutional layer with ReLU activation (grey block) to become a fused block/layer. Then, as fig. 6 details below, this fused block/layer is input into a ROI pooling layer and a region proposal network, and then into two fully connected layers that are later fed to a final classification head (i.e., SoftMax classification)[vector and to determine a probability of the first candidate object belonging to a VRU class] and a bounding box regression head (i.e., Bbox regression) to predict the one or more VRUs[the classifier MLM is to determine that the determined probability is above a threshold probability].
[Geng, fig. 6 reproduced]
).14
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Geng with the above teachings of Ning for the same rationale stated at Claim 11.
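For illustration only, the non-maximum suppression procedure referenced in footnote 14 (an IoU threshold fixed at 0.7 on the scored proposal regions) may be sketched as follows; the box format and function names are assumed for illustration and are not drawn from Geng or Ren.

def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.7):
    # Keep the highest-scoring proposal first; discard any later proposal whose IoU with an
    # already-kept proposal meets or exceeds the threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

# Example: two heavily overlapping proposals collapse to the higher-scoring one.
print(nms([(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 60, 60)], [0.9, 0.8, 0.7]))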
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Geng, et al. "Low-observable targets detection for autonomous vehicles based on dual-modal sensor fusion with deep learning approach." Proceedings of the Institution of Mechanical Engineers, Part D: Journal of automobile engineering 233.9 (2019)(“Geng”) in view of Ning et al., US 9,760,806 B1 (“Ning”) and in view of Lakshmi, US 2020/0086879 A1(“Lakshmi”) and further in view of Wang, Meng, et al. "Beyond object proposals: Random crop pooling for multi-label image recognition." IEEE Transactions on Image Processing 25.12 (2016)(“Wang”).
Regarding claim 12, Geng in view of Ning and Lakshmi teaches the system of claim 11, but does not teach that: the part of the first image comprises a first cropped portion of the first image and a second cropped portion of the first image, wherein the first cropped portion comprises a depiction of the first candidate object and the second cropped portion comprises a depiction of the second candidate object.
However, Wang teaches:
the part of the first image comprises a first cropped portion of the first image and a second cropped portion of the first image, wherein the first cropped portion comprises a depiction of the first candidate object and the second cropped portion comprises a depiction of the second candidate object(Wang, pg., 5681, see also fig. 2, “As shown in Figure 2, for each training image I0 we first rescale its shorter side to be 512 pixels while preserving its aspect ratio. Denote the rescaled image as I[the first image]…[t]hen we rescale I according to the k ratios and generate images I1, I2, … Ik correspondingly. Next we stochastically crop a 224 × 224 region from each of I1, I2, … Ik[a first cropped portion of the first image and a second cropped portion of the first image, wherein the first cropped portion comprises a depiction of the first candidate object and the second cropped portion comprises a depiction of the second candidate object]”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Geng in view of Ning and Lakshmi with the teachings of Wang; the motivation to do so would be to create variations of one image for better generalization performance for multi-label object detection tasks(Wang, pg., 5681, “Because the rescaling ratios and crop locations are continuously changing during training, the network is able to [‘]see[’] numerous variations in one object, and our approach is thus more robust than the proposal-based methods.”).
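For illustration only, Wang's rescale-and-stochastic-crop step quoted above may be sketched as follows; the rescaling ratios, function name, and library usage are assumed for illustration, and Wang's full random crop pooling method includes further pooling steps not shown here.

import random
from PIL import Image

def random_crops(img, ratios=(1.0, 0.75, 0.5), size=224):
    # Rescale the shorter side to 512 pixels while preserving the aspect ratio.
    w, h = img.size
    scale = 512 / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    crops = []
    for r in ratios:
        # Rescale according to each ratio, then stochastically crop one size x size region.
        rw, rh = max(size, round(img.width * r)), max(size, round(img.height * r))
        scaled = img.resize((rw, rh))
        x = random.randint(0, rw - size)
        y = random.randint(0, rh - size)
        crops.append(scaled.crop((x, y, x + size, y + size)))
    return crops

# Example usage with a synthetic image:
patches = random_crops(Image.new("RGB", (640, 480)))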
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20200207339 A1(recites techniques to label sensor data while an autonomous vehicle is continuously collecting real-time data and driving for path planning purposes)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ADAM C STANDKE whose telephone number is (571)270-1806. The examiner can normally be reached M-F, 9AM-9PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Adam C Standke/
Primary Examiner
Art Unit 2129
1 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
2 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
3 Examiner Remarks: The claim limitations that are not in bold and contained within square brackets, i.e., [ ], are claim limitations that are not taught by Geng.
4 Examiner Remarks: The claim limitations that are not in bold and contained within square brackets, i.e., [ ], are claim limitations that are taught by Geng.
5 Examiner Remarks: The claim limitations that are not bolded and contained in brackets are claim limitations taught by the prior art of Geng
6 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
7 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
8 Examiner Remarks: The claim limitations that are not bolded and contained in brackets are claim limitations not taught by the prior art of Geng
9 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
10 Examiner Remarks: The claim limitations that are not bolded and contained in brackets are claim limitations not taught by the prior art of Geng
11 Examiner Remarks: The claim limitations that are not bolded and contained in brackets are claim limitations not taught by the prior art of Geng
12 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
13 Examiner Remarks: The claim limitations that are not bolded and contained in brackets are claim limitations not taught by the prior art of Geng
14 Examiner Remarks: As detailed by the paper Faster r-cnn: Towards real-time object detection with region proposal networks (which has been included with this Office Action), the authors of the model Faster R-CNN adopted the following thresholding procedure: “[t]o reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2k proposal regions per image.” Ren, pg. 5. Accordingly, since Geng details on pg. 2276 that “the merged feature maps are processed by the same operation as in single-modal detection network Faster R-CNN,” one of ordinary skill in the art would understand that Geng teaches that the determined probability is above a threshold probability.