Prosecution Insights
Last updated: April 19, 2026
Application No. 18/466,460

KERNELIZED BIRD’S EYE VIEW SEGMENTATION FOR MULTI-SENSOR PERCEPTION

Status: Non-Final Office Action (§103)
Filed: Sep 13, 2023
Examiner: SALEH, ZAID MUHAMMAD
Art Unit: 2668
Tech Center: 2600 (Communications)
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)

Grant Probability: 65% (Favorable)
Predicted OA Rounds: 1-2
Predicted Time to Grant: 3y 1m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 65% (above average; 28 granted / 43 resolved; +3.1% vs TC avg)
Interview Lift: +48.4% among resolved cases with an interview
Typical Timeline: 3y 1m average prosecution
Currently Pending: 30 applications
Career History: 73 total applications across all art units

Statute-Specific Performance

§101: 5.7% (-34.3% vs TC avg)
§103: 58.5% (+18.5% vs TC avg)
§102: 28.0% (-12.0% vs TC avg)
§112: 4.4% (-35.6% vs TC avg)

Tech Center averages are estimates. Based on career data from 43 resolved cases.

Office Action

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on March 06, 2024, July 17, 2024, and August 27, 2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements have been considered by the examiner.

Claim Rejections - 35 U.S.C. § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 12-18, 26, 27, and 28 are rejected under 35 U.S.C. § 103 as being unpatentable over Kaiqi, “BEVTransFusion: LiDAR-Camera Fusion Under Bird’s-Eye-View for 3D Object Detection with Transformers” (hereinafter Kaiqi), in view of Lin, US Patent Application Publication No. US-20250076070-A1 (hereinafter Lin).
Regarding claim 1, Kaiqi discloses an apparatus for processing image data and position data, the apparatus comprising (Kaiqi in Fig. 1 discloses, “Fig. 1. An overview of BEVTransFusion framework. Point clouds and multi-view images are used to extract BEV features through their respective streams”); wherein the set of 3D point cloud frames comprises a second 3D representation of the 3D environment (Kaiqi in [Section – 3, Paragraph – 2] discloses, “Following VoxelNet [6] and SECOND [5], we transform the input point clouds P ∈ RN×4 into X × Y BEV grid through voxelization and further use the 3D sparse”); generate, based on the first 3D representation and the second 3D representation, a set of bird’s eye view (BEV) feature kernels in a continuous space (Kaiqi in Fig. 2 discloses fusing both 3D representations to generate a BEV feature: “Fig. 2. Query Initialization Module. We propose a simple fusion module to fuse LiDAR and image BEV features, then use a clustering algorithm to select top-N positions on the heatmap, and finally extract the corresponding positions on the LiDAR BEV feature map as the initial object queries”); and generate, based on the set of BEV feature kernels, an output (Kaiqi in [Page – 24, left-side Paragraph – 5] discloses, “We use the feed-forward networks (FFN) to make predictions for both the 3D bounding boxes b and the object classes c,” wherein the output is the predicted bounding boxes and object classes).

Kaiqi does not disclose the following limitations as further recited in the claim.

Lin discloses a memory for storing the image data (Lin in [0181] discloses, “The GPU(s) 1008 may include graphics memory, such as display memory, for storing pixel data or any other suitable data”), and processing circuitry in communication with the memory, wherein the processing circuitry is configured to: convert the set of 2D camera images into a first 3D representation of a 3D environment corresponding to the image data and the position data (Lin in [0003] discloses, “images present a two-dimensional (2D) representation of the environment which then needs to be converted to three-dimensional (3D) space for localization and navigation”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the technique of Lin into the system of Kaiqi because it would allow the system to properly align visual data with LiDAR point clouds.

Summary of Citations (Lin)
Paragraph [0003]: “images present a two-dimensional (2D) representation of the environment which then needs to be converted to three-dimensional (3D) space for localization and navigation”.
Paragraph [0181]: “The GPU(s) 1008 may include graphics memory, such as display memory, for storing pixel data or any other suitable data, such as GPGPU data. The display memory may be included as part of the memory 1004”.

Summary of Citations (Kaiqi)
[Section – 3, Paragraph – 2]: “Following VoxelNet [6] and SECOND [5], we transform the input point clouds P ∈ RN×4 into X × Y BEV grid through voxelization and further use the 3D sparse”.
[Page – 24, left-side Paragraph – 5]: “We use the feed-forward networks (FFN) to make predictions for both the 3D bounding boxes b and the object classes c”.
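For orientation, the LiDAR-to-BEV voxelization that the Kaiqi citation describes (scattering a point cloud P ∈ RN×4 into an X × Y BEV grid) can be sketched in Python. This is an illustrative simplification, not the reference's method: the grid extents, cell size, and the hand-built per-cell statistics (count, max height, mean intensity) are all assumptions standing in for Kaiqi's learned sparse-convolution features.

```python
import numpy as np

def pointcloud_to_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), cell=0.5):
    """Scatter an (N, 4) point cloud [x, y, z, intensity] into an X x Y BEV grid.

    Each cell stores (point count, max height, mean intensity) -- a simple
    hand-crafted stand-in for learned BEV features.
    """
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((nx, ny, 3), dtype=np.float32)

    # Map metric coordinates to integer cell indices, dropping out-of-range points.
    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    ix, iy, pts = ix[keep], iy[keep], points[keep]

    for i, j, (_, _, z, inten) in zip(ix, iy, pts):
        bev[i, j, 0] += 1.0                  # occupancy count
        bev[i, j, 1] = max(bev[i, j, 1], z)  # max height seen in the cell
        bev[i, j, 2] += inten                # intensity sum (normalized below)

    occupied = bev[:, :, 0] > 0
    bev[occupied, 2] /= bev[occupied, 0]     # mean intensity per occupied cell
    return bev
```

With the defaults this yields a 200 × 200 × 3 grid; in the reference the analogous grid is instead filled by a 3D sparse-convolution backbone.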
Regarding claim 2, Kaiqi in the combination discloses the apparatus of claim 1, wherein the processing circuitry is configured to generate the set of BEV feature kernels without relying on a BEV feature grid that includes a set of BEV feature cells each having a fixed dimension (Kaiqi in [Section – 3 (B), Paragraph – 6] discloses, “For the object query located at (i, j) in the X×Y grid of the LiDAR BEV feature map, it is important to note that object cannot exist solely in a grid. Therefore, it is not sufficient to only fuse the image features at the corresponding grid position”).

Summary of Citations (Kaiqi)
[Section – 3 (B), Paragraph – 6]: “For the object query located at (i, j) in the X×Y grid of the LiDAR BEV feature map, it is important to note that object cannot exist solely in a grid. Therefore, it is not sufficient to only fuse the image features at the corresponding grid position”.

Regarding claim 3, Lin in the combination discloses the apparatus of claim 1, wherein to convert the set of 2D camera images into the first 3D representation of the 3D environment (Lin in [0003] discloses, “images present a two-dimensional (2D) representation of the environment which then needs to be converted to three-dimensional (3D) space for localization and navigation”). Kaiqi in the combination discloses that the processing circuitry is configured to apply a depth estimation unit to generate, based on the image data, a set of perspective view depth maps comprising the first 3D representation of the 3D environment corresponding to the image data (Kaiqi in [Section – 3, Paragraph – 2] discloses, “Following LSS [19] and BEVDet [20], we first use a 2D backbone to extract the basic multi-view image features and feed them into a standard Feature Pyramid Network (FPN) as well as a simple feature Adaptive Module (ADP) mentioned in [10]. Next, we explicitly forecast the discrete depth distribution for every pixel, then project them into the ego-car coordinate. This generates a pseudo voxel V ∈ RX×Y ×Z×CImage , where X, Y, CImage represent the size of the BEV grid and dimensions of 2D image features”).

Summary of Citations (Lin)
Paragraph [0003]: “images present a two-dimensional (2D) representation of the environment which then needs to be converted to three-dimensional (3D) space for localization and navigation”.

Summary of Citations (Kaiqi)
[Section – 3, Paragraph – 2]: “Following LSS [19] and BEVDet [20], we first use a 2D backbone to extract the basic multi-view image features and feed them into a standard Feature Pyramid Network (FPN) as well as a simple feature Adaptive Module (ADP) mentioned in [10]. Next, we explicitly forecast the discrete depth distribution for every pixel, then project them into the ego-car coordinate. This generates a pseudo voxel V ∈ RX×Y ×Z×CImage , where X, Y, CImage represent the size of the BEV grid and dimensions of 2D image features”.

Regarding claim 4, Kaiqi in the combination discloses the apparatus of claim 3, wherein the depth estimation unit comprises one or more encoders and one or more decoders (Kaiqi in Fig. 1 discloses an encoder and decoder for depth estimation), and wherein to apply the depth estimation unit to generate the set of perspective view depth maps, the processing circuitry is configured to apply the one or more encoders and one or more decoders to generate the set of perspective view depth maps to indicate a location of one or more objects indicated by the image data within the 3D environment (Kaiqi in [Section – 3, Paragraph – 2] discloses, “we explicitly forecast the discrete depth distribution for every pixel, then project them into the ego-car coordinate,” wherein per-pixel prediction in camera image space equates to a perspective view, and determining the ‘ego-car coordinate’ implies indicating a location. Additionally, in [Page – 24, left-side Paragraph – 5] Kaiqi discloses, “We use the feed-forward networks (FFN) to make predictions for both the 3D bounding boxes b and the object classes c”).

Summary of Citations (Kaiqi)
[Section – 3, Paragraph – 2]: “Following LSS [19] and BEVDet [20], we first use a 2D backbone to extract the basic multi-view image features and feed them into a standard Feature Pyramid Network (FPN) as well as a simple feature Adaptive Module (ADP) mentioned in [10]. Next, we explicitly forecast the discrete depth distribution for every pixel, then project them into the ego-car coordinate”.
[Page – 24, left-side Paragraph – 5]: “We use the feed-forward networks (FFN) to make predictions for both the 3D bounding boxes b and the object classes c”.

Regarding claim 12, Lin in the combination discloses the apparatus of claim 1, wherein the processing circuitry and the memory are part of an advanced driver assistance system (ADAS) (Lin in [0020] discloses, “the systems and methods described herein may be used by, without limitation ... machines (e.g., in one or more adaptive driver assistance systems (ADAS))”).

Summary of Citations (Lin)
Paragraph [0020]: “the systems and methods described herein may be used by, without limitation, non-autonomous vehicles or machines, semi-autonomous vehicles or machines (e.g., in one or more adaptive driver assistance systems (ADAS))”.

Regarding claim 13, Kaiqi in the combination discloses the apparatus of claim 1, wherein the processing circuitry is configured to use the output to control a vehicle (Kaiqi in [Introduction, Paragraph – 1] discloses, “3D object detection [1], [2] is a key task for the perception of autonomous vehicles, as it plays a critical role in ensuring their safety and improving their understanding of the environment”).
Summary of Citations (Kaiqi)
[Introduction, Paragraph – 1]: “3D object detection [1], [2] is a key task for the perception of autonomous vehicles, as it plays a critical role in ensuring their safety and improving their understanding of the environment”.

Regarding claim 14, Kaiqi in the combination discloses the apparatus of claim 1, wherein the apparatus further comprises: one or more cameras configured to capture the set of 2D camera images (Kaiqi in [Section – 4, Paragraph – 1] discloses, “The nuScenes dataset is a comprehensive large-scale multi-modal dataset that comprises data from 6 cameras”); and a Light Detection and Ranging (LiDAR) system configured to capture the set of 3D point cloud frames (Kaiqi in [Section – 3, Paragraph – 2] discloses, “LiDAR-based 3D object detection is a technique that focuses on predicting 3D bounding boxes using point clouds”).

Summary of Citations (Kaiqi)
[Section – 3, Paragraph – 2]: “LiDAR-based 3D object detection is a technique that focuses on predicting 3D bounding boxes using point clouds”.
[Section – 4, Paragraph – 1]: “The nuScenes dataset is a comprehensive large-scale multi-modal dataset that comprises data from 6 cameras”.

Regarding claim 15, method claim 15 corresponds to apparatus claim 1. Therefore, the rejection analysis and motivation to combine of claim 1 is applicable to claim 15.

Regarding claim 16, method claim 16 corresponds to apparatus claim 2. Therefore, the rejection analysis and motivation to combine of claim 2 is applicable to claim 16.

Regarding claim 17, method claim 17 corresponds to apparatus claim 3. Therefore, the rejection analysis and motivation to combine of claim 3 is applicable to claim 17.

Regarding claim 18, method claim 18 corresponds to apparatus claim 4. Therefore, the rejection analysis and motivation to combine of claim 4 is applicable to claim 18.

Regarding claim 26, method claim 26 corresponds to apparatus claim 13. Therefore, the rejection analysis and motivation to combine of claim 13 is applicable to claim 26.

Regarding claim 27, method claim 27 corresponds to apparatus claim 14. Therefore, the rejection analysis and motivation to combine of claim 14 is applicable to claim 27.

Regarding claim 28, claim 28 is a non-transitory computer-readable storage medium claim that corresponds to apparatus claim 1. Therefore, the rejection analysis and motivation to combine of claim 1 is applicable to claim 28.

Claims 5, 7, 19, and 21 are rejected under 35 U.S.C. § 103 as being unpatentable over Kaiqi in view of Lin, and further in view of Sharma, US Patent Application Publication No. US-20240005627-A1 (hereinafter Sharma).

Regarding claim 5, Lin in the combination discloses the apparatus of claim 1, wherein the first 3D representation of the 3D environment comprises a set of perspective view depth maps generated based on the image data (Lin in [0003] discloses, “images present a two-dimensional (2D) representation of the environment which then needs to be converted to three-dimensional (3D) space for localization and navigation”. Furthermore, Lin in [0081] discloses, “one or more of stereo camera(s) 968 may include an integrated control unit comprising a scalable processing unit ... may be used to generate a 3D map of the vehicle's environment, including a distance estimate for all the points in the image,” wherein a distance estimate implies a depth map).
Kaiqi in the combination discloses that, to generate the set of BEV feature kernels, the processing circuitry is configured to: apply a second feature extractor to extract, from the set of 3D point cloud frames, a second set of 3D features (Kaiqi in [Section – 3, Paragraph – 3] discloses, “Following VoxelNet [6] and SECOND [5], we transform the input point clouds P ∈ RN×4 into X × Y BEV grid through voxelization and further use the 3D sparse convolution to generate the feature FLiDAR ∈ RX×Y ×CLiDAR, where X, Y, CLiDAR represent the size of the BEV grid and dimensions of LiDAR features”); and generate, based on the first set of 3D features and the second set of 3D features, the set of BEV feature kernels (Kaiqi in [Section – 3, Paragraph – 1] discloses, “To extract Bird’s-Eye-View (BEV) features, we use separate backbones to process multi-view images and LiDAR points independently ... we extract LiDAR information from the specified position for self-attention ... then perform cross-attention with image BEV features to obtain the final result”. Kaiqi in Fig. 2 also discloses a simple fusion module to fuse LiDAR and image BEV features).

Kaiqi and Lin in combination do not disclose the following limitation as further recited in the claim.

Sharma discloses applying a first feature extractor to extract, from the set of perspective view depth maps, a first set of 3D features (Sharma in [0061] discloses, “ego vehicle 400 may extract height and/or depth features from the first output stream”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the technique of Sharma into the system of Kaiqi in view of Lin because it would allow the system to more accurately identify object context and layout from the perspective of the camera.

Summary of Citations (Kaiqi)
[Section – 3, Paragraph – 3]: “Following VoxelNet [6] and SECOND [5], we transform the input point clouds P ∈ RN×4 into X × Y BEV grid through voxelization and further use the 3D sparse convolution to generate the feature FLiDAR ∈ RX×Y ×CLiDAR, where X, Y, CLiDAR represent the size of the BEV grid and dimensions of LiDAR features”.
[Section – 3, Paragraph – 1]: “To extract Bird’s-Eye-View (BEV) features, we use separate backbones to process multi-view images and LiDAR points independently ... we extract LiDAR information from the specified position for self-attention ... then perform cross-attention with image BEV features to obtain the final result”.

Summary of Citations (Sharma)
Paragraph [0061]: “ego vehicle 400 may extract height and/or depth features from the first output stream”.

Summary of Citations (Lin)
Paragraph [0003]: “images present a two-dimensional (2D) representation of the environment which then needs to be converted to three-dimensional (3D) space for localization and navigation”.
Paragraph [0081]: “one or more of stereo camera(s) 968 may include an integrated control unit comprising a scalable processing unit ... may be used to generate a 3D map of the vehicle's environment, including a distance estimate for all the points in the image”.

Regarding claim 7, Kaiqi in the combination discloses the apparatus of claim 1, configured to generate, based on the set of 2D camera images, a 3D feature volume, and populate the 3D feature volume with the set of perspective view features to create a populated 3D feature volume (Kaiqi in [Section – 3, Paragraph – 2] discloses, “we explicitly forecast the discrete depth distribution for every pixel, then project them into the ego-car coordinate. This generates a pseudo voxel V ∈ RX×Y ×Z×CImage , where X, Y, CImage represent the size of the BEV grid and dimensions of 2D image features”).

Sharma in the combination discloses that, to convert the set of 2D camera images into the first 3D representation of the 3D environment, the processing circuitry is configured to: apply a feature extractor to extract, from the set of camera images, a set of perspective view features (Sharma in [0057] discloses, “The ego perception module 310 is configured to observing unlabeled multi-view videos at training time for learning to map ... the feature tensor formation model 312 is configured to extract, using a convolutional neural network (CNN), features from an image captured by the first sensor 304 and/or the second sensor 306”).

Summary of Citations (Kaiqi)
[Section – 3, Paragraph – 2]: “Next, we explicitly forecast the discrete depth distribution for every pixel, then project them into the ego-car coordinate. This generates a pseudo voxel V ∈ RX×Y ×Z×CImage , where X, Y, CImage represent the size of the BEV grid and dimensions of 2D image features”.

Summary of Citations (Sharma)
Paragraph [0057]: “The ego perception module 310 is configured to observing unlabeled multi-view videos at training time for learning to map ... the feature tensor formation model 312 is configured to extract, using a convolutional neural network (CNN), features from an image captured by the first sensor 304 and/or the second sensor 306”.

Regarding claim 19, method claim 19 corresponds to apparatus claim 5. Therefore, the rejection analysis and motivation to combine of claim 5 is applicable to claim 19.

Regarding claim 21, method claim 21 corresponds to apparatus claim 7. Therefore, the rejection analysis and motivation to combine of claim 7 is applicable to claim 21.

Claims 6 and 20 are rejected under 35 U.S.C. § 103 as being unpatentable over Kaiqi in view of Lin and Sharma, and further in view of Urtasun, US Patent Application Publication No. US-20200160559-A1 (hereinafter Urtasun).
Regarding claim 6, Kaiqi in the combination discloses the apparatus of claim 5, wherein to generate the set of BEV feature kernels, the processing circuitry is configured to: fuse the first set of 3D features and the second set of 3D features to generate a fused set of 3D features (Kaiqi in [Section – 3(B), Paragraph – 2] discloses, “To use the BEV features from both camera (FCamera ∈ RX×Y ×CCamera ) and LiDAR (FLiDAR ∈ RX×Y ×CLiDAR ) sensors, we propose a simple fusion module”).

Kaiqi, Lin, and Sharma in combination do not disclose the following limitations as further recited in the claim.

Urtasun discloses applying, to the fused set of 3D features, a continuous convolution decoder to generate a processed fused set of 3D features (Urtasun in [0109] discloses, “fusing at 712 can include executing one or more continuous convolutions to fuse image features from a first data stream with LIDAR features from a second data stream”); and compressing the processed fused set of 3D features to generate the set of BEV feature kernels (Urtasun in [0071] discloses, “The resulting 3D volume can be considered as a BEV representation by treating the height slices as feature channels. This allows reasoning within 2D BEV space, which brings significant efficiency gain with no performance drop,” wherein projecting a 3D volume to 2D BEV channels implies compression).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the technique of Urtasun into the system of Kaiqi in view of Lin and Sharma because it would improve object localization in 3D space.

Summary of Citations (Kaiqi)
[Section – 3(B), Paragraph – 2]: “To use the BEV features from both camera (FCamera ∈ RX×Y ×CCamera ) and LiDAR (FLiDAR ∈ RX×Y ×CLiDAR ) sensors, we propose a simple fusion module”.

Summary of Citations (Urtasun)
Paragraph [0109]: “fusing at 712 can include executing one or more continuous convolutions to fuse image features from a first data stream with LIDAR features from a second data stream”.
Paragraph [0071]: “The resulting 3D volume can be considered as a BEV representation by treating the height slices as feature channels. This allows reasoning within 2D BEV space, which brings significant efficiency gain with no performance drop”.

Regarding claim 20, method claim 20 corresponds to apparatus claim 6. Therefore, the rejection analysis and motivation to combine of claim 6 is applicable to claim 20.

Claims 8 and 22 are rejected under 35 U.S.C. § 103 as being unpatentable over Kaiqi in view of Lin and Sharma, and further in view of Athni, German Patent Application Publication No. DE-102020108898-A1 (hereinafter Athni).

Regarding claim 8, Kaiqi in the combination discloses the apparatus of claim 7, configured to generate the 3D feature volume based on the 3D space and the depth distribution of each ray corresponding to the set of 2D camera images (Kaiqi in [Section – 3, Paragraph – 2] discloses, “Next, we explicitly forecast the discrete depth distribution for every pixel, then project them into the ego-car coordinate. This generates a pseudo voxel V ∈ RX×Y ×Z×CImage , where X, Y, CImage represent the size of the BEV grid and dimensions of 2D image features”).

Kaiqi, Lin, and Sharma in combination do not disclose the following limitations as further recited in the claim.

Athni discloses that, to generate the 3D feature volume, the processing circuitry is configured to: create, based on each camera image pixel of the set of 2D camera images, a ray through a 3D space (Athni in [0120] discloses, “two-dimensional image coordinates u, v be given in the first raw image.
the ray X(u, v), which is represented by the pixel at the two-dimensional coordinates u, v—these are coordinates in the distorted two-dimensional coordinate system of the first raw image—is then known from equation (2)”. And in [0117] Athni discloses a 3D space, “Let X:= (x, y, z) a ray in R3”); identify, for the ray corresponding to each camera image pixel of the set of 2D camera images, one or more points within the 3D space (Athni in [0120] discloses, “the three-dimensional Cartesian coordinate system of the first fish-eye camera, Zein is a scaled position along the beam X, i.e., Z(u, v) = A(u, v) X(u, v)”); and create, for the one or more points of each ray corresponding to the set of 2D camera images, a depth distribution (Athni in [0088] discloses, “The depth map () has a corresponding depth estimated value λ and a corresponding uncertainty measure Σ at least for each of a plurality of pixels in the image region of interest, or, preferably, for all pixels in the image region of interest,” wherein “depth estimated value λ and a corresponding uncertainty measure Σ” equates to a depth distribution).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the technique of Athni into the system of Kaiqi in view of Lin and Sharma because it would enhance depth-aware representation of the scene, allowing the system to create a better model of object structures and spatial relationships.

Summary of Citations (Athni)
Paragraph [0120]: “two-dimensional image coordinates u, v be given in the first raw image. the ray X(u, v), which is represented by the pixel at the two-dimensional coordinates u, v—these are coordinates in the distorted two-dimensional coordinate system of the first raw image—is then known from equation (2). However, the three-dimensional point Z, which represents a position of a feature in the scene which is represented by the pixel at u, v, is known only up to an unknown depth, or a distance, λ. That is, in the three-dimensional Cartesian coordinate system of the first fish-eye camera, Zein is a scaled position along the beam X, i.e., Z(u, v) = A(u, v) X(u, v)”.
Paragraph [0117]: “Let X:= (x, y, z) a ray in R3 Which is incident from the scene along a surface normal direction on a surface of the fish-eye lens of the first fish-eye camera 3. A projection function that maps the beam X to the two-dimensional image coordinates (u, v)”.
Paragraph [0088]: “The depth map () has a corresponding depth estimated value λ and a corresponding uncertainty measure Σ at least for each of a plurality of pixels in the image region of interest, or, preferably, for all pixels in the image region of interest”.

Summary of Citations (Kaiqi)
[Section – 3, Paragraph – 2]: “Next, we explicitly forecast the discrete depth distribution for every pixel, then project them into the ego-car coordinate. This generates a pseudo voxel V ∈ RX×Y ×Z×CImage , where X, Y, CImage represent the size of the BEV grid and dimensions of 2D image features”.

Regarding claim 22, method claim 22 corresponds to apparatus claim 8. Therefore, the rejection analysis and motivation to combine of claim 8 is applicable to claim 22.

Claims 9 and 23 are rejected under 35 U.S.C. § 103 as being unpatentable over Kaiqi in view of Lin and Sharma, and further in view of Guozeng, “Location-Guided LiDAR-Based Panoptic Segmentation for Autonomous Driving” (hereinafter Guozeng).
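As context for the claim 8 citations above (Kaiqi's per-pixel depth distribution and Athni's per-pixel rays Z(u, v) = λ(u, v) X(u, v)), the "lift" operation they describe — smearing each pixel's feature along its camera ray, weighted by a depth distribution, to build a pseudo-voxel set — can be sketched in Python. The function name, array shapes, and intrinsics here are illustrative assumptions, not taken from any cited reference.

```python
import numpy as np

def lift_depth_distribution(feat, depth_prob, depths, K):
    """Lift per-pixel 2D features into depth-weighted 3D points.

    feat:       (H, W, C) image features
    depth_prob: (H, W, D) per-pixel distribution over D candidate depths
    depths:     (D,)      candidate depths in meters along each pixel's ray
    K:          (3, 3)    camera intrinsic matrix

    Returns (H*W*D, 3) camera-frame points and (H*W*D, C) weighted features:
    each pixel's feature placed at every candidate depth on its ray, scaled
    by the probability of that depth.
    """
    H, W, C = feat.shape
    Kinv = np.linalg.inv(K)

    # One ray per pixel: back-project homogeneous pixel coords through K^-1.
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # both (H, W)
    rays = np.stack([u, v, np.ones_like(u)], axis=-1) @ Kinv.T  # (H, W, 3)

    # Points along each ray at each candidate depth.
    pts = rays[:, :, None, :] * depths[None, None, :, None]   # (H, W, D, 3)
    # Feature at each point, weighted by its depth probability.
    wfeat = feat[:, :, None, :] * depth_prob[..., None]       # (H, W, D, C)
    return pts.reshape(-1, 3), wfeat.reshape(-1, C)
```

Pooling these weighted points into an X × Y grid would then give the pseudo-voxel volume V that the Kaiqi quotation describes.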
Regarding claim 9, Kaiqi in the combination discloses the apparatus of claim 7, wherein the feature extractor is a first feature extractor, wherein the populated 3D feature volume is a first 3D feature volume, and wherein to generate the set of BEV feature kernels, the processing circuitry is configured to: apply a second feature extractor to extract, from the set of 3D point cloud frames, a second 3D feature volume (Kaiqi in [Section – 3, Paragraph – 2] discloses, “Following VoxelNet [6] and SECOND [5], we transform the input point clouds P ∈ RN×4 into X × Y BEV grid through voxelization and further use the 3D sparse convolution to generate the feature FLiDAR ∈ RX×Y ×CLiDAR,” wherein VoxelNet equates to the second feature extractor and FLiDAR ∈ RX×Y ×CLiDAR is the second 3D feature volume).

Kaiqi, Lin, and Sharma in combination do not disclose the following limitations as further recited in the claim.

Guozeng discloses: compress the first 3D feature volume to generate a set of image data BEV feature kernels; compress the second 3D feature volume to generate a set of position data BEV feature kernels; and fuse the set of image data BEV feature kernels and the set of position data BEV feature kernels to generate the set of BEV feature kernels (Guozeng in Fig. 3 [Section – 2(C)] discloses, “After projecting the raw point cloud into a polar BEV presentation, we use global feature module (GFM) to obtain global features concatenated to the input of GFM. Next, LGFM will divide the feature map into n small regions and process them by input dependent feature extractor (IDFE). ith region with shape [H0,W0, C1] learns its unique convolution kernel by the router and the kernel bank, and the features of ith region are extracted using the generated convolution kernel ... After concatenating Fin with FLGFM_out, we finally fed Fout to the panoptic backbone,” wherein Fout equates to the generated set of BEV feature kernels).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the technique of Guozeng into the system of Kaiqi in view of Lin and Sharma because it would improve perception tasks such as object detection and tracking.

Summary of Citations (Guozeng)
[Section – 2(C)]: “After projecting the raw point cloud into a polar BEV presentation, we use global feature module (GFM) to obtain global features concatenated to the input of GFM. Next, LGFM will divide the feature map into n small regions and process them by input dependent feature extractor (IDFE). ith region with shape [H0,W0, C1] learns its unique convolution kernel by the router and the kernel bank, and the features of ith region are extracted using the generated convolution kernel ... After concatenating Fin with FLGFM_out, we finally fed Fout to the panoptic backbone”.

Summary of Citations (Kaiqi)
[Section – 3, Paragraph – 2]: “Following VoxelNet [6] and SECOND [5], we transform the input point clouds P ∈ RN×4 into X × Y BEV grid through voxelization and further use the 3D sparse convolution to generate the feature FLiDAR ∈ RX×Y ×CLiDAR, where X, Y, CLiDAR represent the size of the BEV grid and dimensions of LiDAR features”.

Regarding claim 23, method claim 23 corresponds to apparatus claim 9. Therefore, the rejection analysis and motivation to combine of claim 9 is applicable to claim 23.

Claims 10, 11, 24, and 25 are rejected under 35 U.S.C. § 103 as being unpatentable over Kaiqi in view of Lin, and further in view of Guozeng, “Location-Guided LiDAR-Based Panoptic Segmentation for Autonomous Driving” (hereinafter Guozeng).

Regarding claim 10, Kaiqi in the combination discloses the apparatus of claim 1. Kaiqi and Lin in combination do not disclose the following limitations as further recited in the claim.
Guozeng discloses about to generate the output, the processing circuitry is configured to: generate, based on the set of BEV feature kernels, the output to include a BEV representation of one or more objects within the 3D environment (Guozeng in [Abstract] discloses, “we use the bird’s eye view (BEV) representation to project the 3D point clouds into a 2D image and divide the BEV projection into eight areas. For each area, we apply input dependent convolution kernels to extract the local feature. These local features are concatenated to the panoptic backbone for panoptic segmentation”. Furthermore, in [Introduction] Guozeng discloses about recognizing objects, “Among them, LiDAR-based panoptic segmentation is the most challenging task. It requires not only recognizing both stuff (e.g. road, sky, vegetation) and things (e.g. car, person, bicycle) but also identifying each things according to the LiDAR point clouds”) and wherein the processing circuitry is further configured to use the output to control a device within the 3D environment based on the one or more objects within the 3D environment (Guozeng in [Introduction] also discloses, “LiDAR-based panoptic segmentation has gradually increased in the last two years due to its momentous function for autonomous vehicles equipped with LiDAR”). It would have been obvious to one of ordinary skill in art before the effective filling date of the claimed invention to integrate the technique of Guozeng into the system of Kaiqi in view of Lin because it allows the apparatus to make informed decision based on the detected object and their position in the environment. Summary of Citations (Guozeng) [Abstract]; “we use the bird’s eye view (BEV) representation to project the 3D point clouds into a 2D image and divide the BEV projection into eight areas. For each area, we apply input dependent convolution kernels to extract the local feature. 
These local features are concatenated to the panoptic backbone for panoptic segmentation”. [Introduction]; “Among them, LiDAR-based panoptic segmentation is the most challenging task. It requires not only recognizing both stuff (e.g. road, sky, vegetation) and things (e.g. car, person, bicycle) but also identifying each things according to the LiDAR point clouds ... LiDAR-based panoptic segmentation has gradually increased in the last two years due to its momentous function for autonomous vehicles equipped with LiDAR”. Regarding claim 11, Guozeng in the combination further discloses that the processing circuitry is configured to: apply an encoder-decoder architecture to generate the set of BEV feature kernels (Guozeng in [Section – 3, Paragraph – 2] discloses, “the feature map is encoded and decoded using the same panoptic backbone of the literature [14] for semantic segmentation and instance segmentation”. Furthermore, Guozeng in [Section – 3(E), Paragraph – 1] discloses, “we use U-Net to predict semantic labels on a BEV image of C0 channels”), generate, based on the position data, kernelized ground truth corresponding to the output (Guozeng in [Section – 4, Paragraph – 1] discloses, “Each point of the dataset has a semantic label and an instance ID”); compare the kernelized ground truth with the output; and train the encoder-decoder architecture automatically based on comparing the kernelized ground truth with the output (Guozeng in [Section – 4(A), Paragraph – 1] discloses, “generate a BEV projection with [480, 360] resolution. We then divide the BEV representation into eight regions shown in Fig. 4 ... We use a combination of cross-entropy loss and Lovasz softmax loss [59] for the semantic segmentation head and use a combination of MSE loss and L1 loss for the instance segmentation head. We train our network using an NVIDIA Tesla V100 GPU”). 
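The BEV pipeline quoted from Guozeng above — projecting the point cloud onto a [480, 360] bird's-eye-view grid and dividing it into eight regions, each of which receives its own input-dependent convolution kernel — can be sketched in a few lines. This is an illustrative NumPy sketch, not code from any cited reference: the grid extent, the occupancy-only features, and the equal column-band split standing in for Guozeng's eight regions are all assumptions made here for clarity.

```python
import numpy as np

def bev_project(points, x_bins=480, y_bins=360,
                x_range=(-50.0, 50.0), y_range=(-50.0, 50.0)):
    """Project an (N, 4) point cloud [x, y, z, intensity] onto a 2D
    bird's-eye-view grid, as in the quoted Guozeng/Kaiqi passages."""
    bev = np.zeros((x_bins, y_bins), dtype=np.float32)
    xs, ys = points[:, 0], points[:, 1]
    # Keep only points that fall inside the BEV extent.
    keep = ((xs >= x_range[0]) & (xs < x_range[1]) &
            (ys >= y_range[0]) & (ys < y_range[1]))
    xi = ((xs[keep] - x_range[0]) / (x_range[1] - x_range[0]) * x_bins).astype(int)
    yi = ((ys[keep] - y_range[0]) / (y_range[1] - y_range[0]) * y_bins).astype(int)
    bev[xi, yi] = 1.0  # occupancy only; a real pipeline accumulates richer features
    return bev

def split_regions(bev, n=8):
    """Divide the BEV map into n regions (equal column bands here); in
    Guozeng the IDFE/router would generate a distinct kernel per region."""
    return np.array_split(bev, n, axis=1)
```

Each region returned by `split_regions` would then be convolved with a kernel produced by the router and kernel bank described in the quoted passages; that dynamic-kernel machinery, and the cross-entropy/Lovász and MSE/L1 training losses, are omitted from this sketch.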
Summary of Citations (Guozeng) [Section – 3, Paragraph – 2]; “the feature map is encoded and decoded using the same panoptic backbone of the literature [14] for semantic segmentation and instance segmentation”. [Section – 3(E), Paragraph – 1]; “we use U-Net to predict semantic labels on a BEV image of C0 channels”. [Section – 4, Paragraph – 1]; “Each point of the dataset has a semantic label and an instance ID”. [Section – 4(A), Paragraph – 1]; “generate a BEV projection with [480, 360] resolution. We then divide the BEV representation into eight regions shown in Fig. 4 ... We use a combination of cross-entropy loss and Lovasz softmax loss [59] for the semantic segmentation head and use a combination of MSE loss and L1 loss for the instance segmentation head. We train our network using an NVIDIA Tesla V100 GPU”. Regarding claim 24, method claim 24 corresponds to apparatus claim 10. Therefore, the rejection analysis and motivation to combine of claim 10 are applicable to claim 24. Regarding claim 25, method claim 25 corresponds to apparatus claim 11. Therefore, the rejection analysis and motivation to combine of claim 11 are applicable to claim 25. Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to ZAID MUHAMMAD SALEH whose telephone number is (703)756-1684. The examiner can normally be reached M-F 8 am - 5 pm ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached at (571)272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. 
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /ZAID MUHAMMAD SALEH/ Examiner, Art Unit 2668 01/07/2025 /VU LE/ Supervisory Patent Examiner, Art Unit 2668

Prosecution Timeline

Sep 13, 2023
Application Filed
Jan 10, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602944
AUTHENTICATION OF DENDRITIC STRUCTURES
2y 5m to grant Granted Apr 14, 2026
Patent 12586501
DISPLAY DEVICE, DISPLAY METHOD, AND STORAGE MEDIUM
2y 5m to grant Granted Mar 24, 2026
Patent 12586396
INFORMATION PROCESSING APPARATUS AND SYSTEM
2y 5m to grant Granted Mar 24, 2026
Patent 12562535
METHOD FOR DETECTING UNDESIRED CONNECTION ON PRINTED CIRCUIT BOARD
2y 5m to grant Granted Feb 24, 2026
Patent 12555344
METHOD AND APPARATUS FOR IMPROVING VIDEO TARGET DETECTION PERFORMANCE IN SURVEILLANCE EDGE COMPUTING
2y 5m to grant Granted Feb 17, 2026
Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
65%
Grant Probability
99%
With Interview (+48.4%)
3y 1m
Median Time to Grant
Low
PTA Risk
Based on 43 resolved cases by this examiner. Grant probability derived from career allow rate.
