Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: the collection module, extraction module, fusion module, generation module, and segmentation module recited in claim 8, each of which is modified by corresponding functional language and is not modified by structural language.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 16-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter as follows.
Claim 16 recites “A computer-readable storage medium with a computer program stored thereon, wherein, when being executed by a processor, the computer program is capable of causing the processor to implement each step of the lidar point cloud segmentation method in claim 1”. Computer programs, per se, are not in one of the statutory categories of invention because a computer program is merely a set of instructions capable of being executed by a computer; the computer program itself is not a process. The broadest reasonable interpretation of a computer-readable storage medium also encompasses transitory forms of signal transmission (signals per se), such as a propagating electrical or electromagnetic signal or carrier wave, which is not one of the statutory categories of invention. See MPEP § 2106. The Examiner suggests amending claims 16-20 to recite "A non-transitory computer-readable storage medium with a computer program stored thereon…" to overcome this rejection.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1-4, 8-12, and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Jaritz et al., NPL “xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation,” in view of Wang et al., NPL “Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object Tracking,” and further in view of Kuen et al. (US 20230153943 A1).
Regarding claim 1, Jaritz et al. teaches a lidar point cloud segmentation method (see page 3, section 3; “we investigate xMUDA using point cloud (3D modality) and image (2D modality) on the task of 3D semantic segmentation”, see also page 9, section A.2; “The point cloud comes from three 16-layer front LiDARs (left, center, right) where the left and right front LiDARs are inclined”), wherein the method comprises: obtaining a three-dimensional point cloud and a two-dimensional image of a target scene (see Fig. 1, page 3, Fig. 2; “The architecture comprises a 2D stream which takes an image as input and …. and a 3D stream which takes the point cloud as input”), randomly selecting one image block from the multiple image blocks (see page 6, 3rd para; “we crop a fixed size rectangle randomly on the horizontal image axis to reduce memory during training”); and obtaining a three-dimensional point cloud of a scene to be segmented (see Fig. 2, which discloses “Figure 2: Overview of our xMUDA framework for 3D semantic segmentation….a 3D stream which takes the point cloud as input”), inputting the three-dimensional point cloud into the single-modal semantic segmentation model for semantic discrimination to obtain a semantic segmentation label (see Abstract; “we assume the presence of 2D images and 3D point clouds for 3D semantic segmentation”, Fig. 2, which discloses “Figure 2: Overview of our xMUDA framework for 3D semantic segmentation…. a 3D stream which takes the point cloud as input and uses a U-Net-Style 3D SparseConvNet …The 4 segmentation outputs consist of the main predictions P3D”, and page 6, section 4.3; “The uni-modal UDA baselines [17,21,29] are applied separately on each modality”; Note: “uni-modal” indicates that the segmentation models may operate on a single modality, such as a point cloud or an image) and segmenting the target scene based on the semantic segmentation label (see page 5, section 4.1; “the annotations are 3D bounding boxes and we obtain the point-wise labels for 3D semantic segmentation by assigning the corresponding object label if a point lies inside a 3D box…. A2D2 and SemanticKITTI provide segmentation labels”). Jaritz et al. additionally discloses a two-dimensional feature extraction network and three-dimensional feature extraction (see page 3, Fig. 2; “The architecture comprises a 2D stream which takes an image as input and uses a U-Net-style 2D ConvNet [24], and a 3D stream which takes the point cloud as input and uses a U-Net-Style 3D SparseConvNet [8]. Feature outputs of both streams have same length N, equal to the number of 3D points”, and Fig. 4; “Figure 4: Architectures for fusion. (a) In Vanilla Fusion the 2D and 3D features are concatenated, fed into a linear layer with ReLU to mix the features”) but does not specifically disclose generating multi-scale two-dimensional features and generating multi-scale three-dimensional features. Jaritz et al.
also does not teach performing block processing on the two-dimensional image to obtain multiple image blocks; outputting the selected image block to a preset two-dimensional feature extraction network to generate multi-scale two-dimensional features; performing feature extraction using a preset three-dimensional feature extraction network based on the two-dimensional point cloud to generate multi-scale three-dimensional features; fusing the multi-scale two-dimensional features and the multi-scale three-dimensional features to obtain fused features; and distilling the fused features with unidirectional modal preservation to obtain a single-modal semantic segmentation model.
In the same field of endeavor, Wang et al. teaches performing block processing on the two-dimensional image to obtain multiple image blocks (see page 3, section C; “The 2D detection results are represented by a 2D bounding box. In our method, each image patch within a bounding box is cropped from the full image, then resized to the shape of 224×224, taken as input to the VGG-Net backbone”); outputting the selected image block to a preset two-dimensional feature extraction network to generate multi-scale two-dimensional features, and performing feature extraction using a preset three-dimensional feature extraction network based on the two-dimensional point cloud to generate multi-scale three-dimensional features (see page 3, section B; “Next, the detection results of 2D image and 3D point cloud are passed to the second stage to perform feature extraction, respectively. During this stage of feature extraction, the feature interaction module will realize the interactive multi-scale feature fusion between different modalities, image and point cloud, after each layer of the feature extraction network obtaining the deeper features. For the second stage, we use VGG-16 [7] and PointNet++ [10] as the backbones for feature extraction of the image and point cloud, respectively”); and fusing the multi-scale two-dimensional features and the multi-scale three-dimensional features to obtain fused features (see page 3, section B; “the multi-scale features from two modalities are fed to a final fusion module, which will merge two sets of features into one and achieve final feature fusion”). Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the cross-modal unsupervised domain adaptation for 3D semantic segmentation of Jaritz et al. in view of the interactive multi-scale fusion of 2D and 3D features for multi-object tracking of Wang et al. in order to achieve good performance on the KITTI benchmark and outperform other approaches (see page 3, section C).
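For purposes of illustration only, the following Python sketch shows the kind of concatenation-based fusion of per-point 2D and 3D features described in the passages cited above (cf. Jaritz et al., Fig. 4, “Vanilla Fusion”); the module name, tensor shapes, and layer sizes are illustrative assumptions of the Examiner and are not taken from any cited reference or from the claims.

import torch
import torch.nn as nn

class VanillaFusion(nn.Module):
    """Illustrative sketch: concatenate per-point 2D and 3D features and mix them
    with a linear layer followed by ReLU (dimensions assumed for exposition)."""
    def __init__(self, dim_2d: int, dim_3d: int, dim_out: int):
        super().__init__()
        self.mix = nn.Sequential(nn.Linear(dim_2d + dim_3d, dim_out), nn.ReLU())

    def forward(self, feats_2d: torch.Tensor, feats_3d: torch.Tensor) -> torch.Tensor:
        # feats_2d: (N, dim_2d) image features sampled at the N projected 3D points
        # feats_3d: (N, dim_3d) point-wise features from the 3D stream
        return self.mix(torch.cat([feats_2d, feats_3d], dim=1))

# Example usage with N = 1000 points and assumed feature dimensions.
fused = VanillaFusion(64, 16, 64)(torch.randn(1000, 64), torch.randn(1000, 16))
print(fused.shape)  # torch.Size([1000, 64])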
However, the combination of Jaritz et al. and Wang et al., taken as a whole, does not teach distilling the fused features with unidirectional modal preservation to obtain a single-modal semantic segmentation model.
In the same field of endeavor, Kuen et al. teaches distilling the fused features with unidirectional modal preservation to obtain a single-modal semantic segmentation model (see para [0135]; “knowledge of a multiscale fusion teacher is appropriately distilled to a low-resolution student by arranging the feature maps of teacher and student networks”, see also para [0123]; “student network decoder 920 decodes the feature map to obtain prediction information …. the prediction information includes … semantic segmentation information, and instance-level semantic segmentation information”; Note: the student learning from the teacher without altering the teacher's knowledge corresponds to unidirectional knowledge distillation). Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the cross-modal unsupervised domain adaptation for 3D semantic segmentation of Jaritz et al. in view of the interactive multi-scale fusion of 2D and 3D features for multi-object tracking of Wang et al. and the multi-scale distillation for low-resolution detection of Kuen et al. in order to obtain prediction information for the low-resolution image (para [0135]).
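For exposition only, the following Python sketch illustrates a one-way (teacher-to-student) feature distillation loss of the general kind discussed in Kuen et al., para [0135]; detaching the teacher features reflects the point noted above that the teacher's knowledge is not altered. The function name, the choice of MSE loss, and the tensor shapes are assumptions of the Examiner, not code from the reference.

import torch
import torch.nn.functional as F

def feature_distillation_loss(student_feat: torch.Tensor,
                              teacher_feat: torch.Tensor) -> torch.Tensor:
    """Illustrative one-way distillation loss: only the student receives gradients."""
    teacher_feat = teacher_feat.detach()  # teacher knowledge is kept fixed (unidirectional)
    # Align spatial sizes if the student operates at a lower resolution (assumed scenario).
    if student_feat.shape[-2:] != teacher_feat.shape[-2:]:
        student_feat = F.interpolate(student_feat, size=teacher_feat.shape[-2:],
                                     mode="bilinear", align_corners=False)
    return F.mse_loss(student_feat, teacher_feat)

# Example: teacher map (2, 128, 64, 64) and a lower-resolution student map (2, 128, 32, 32).
loss = feature_distillation_loss(torch.randn(2, 128, 32, 32), torch.randn(2, 128, 64, 64))
print(float(loss))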
Regarding claim 2, the rejection of claim 1 is incorporated herein.
Jaritz et al. in the combination further teach wherein the preset two-dimensional feature extraction network comprises at least a two-dimensional convolution encoder (see page 5, section 4.2; “2D Network. We use a modified version of U-Net [24] with a ResNet34 [10] encoder where we add dropout after the 3rd and 4th layer”);
Wang et al. in the combination further teach the randomly selecting one image block from the multiple image blocks and outputting the selected image block to a preset two-dimensional feature extraction network to generate multi-scale two-dimensional features comprises: determining a target image block from the multiple image blocks using a random algorithm (see page 6, 3rd para; “each image patch within a bounding box is cropped from the full image, then resized to the shape of 224×224, taken as input to the VGG-Net backbone”), and constructing a two-dimensional feature map based on the target image block (see page 3, section B; “For the second stage, we use VGG-16 [7] and PointNet++ [10] as the backbones for feature extraction of the image”); and performing a two-dimensional convolution operation on the two-dimensional feature map through the two-dimensional convolution encoder based on different scales to obtain the multi-scale two-dimensional features (see page 3, section B; “After obtaining the deep representation of image and point cloud through the two aforementioned backbones, the multi-scale features from two modalities are fed to a final fusion module”, see also page 6, left col., 1st para; “Next, the features from different modalities are added up, generating g_f^k ∈ R^(K×D), and then concatenated with position information v_i^k encoded by a fully connected layer (FC)”).
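For illustration only, the following Python sketch shows random selection of one image patch from several candidate blocks followed by resizing to 224×224 for a 2D backbone, consistent with the cropping and resizing described in the cited passages; the function name, box format, and sizes are assumptions of the Examiner and are not taken from the reference.

import random
import torch
import torch.nn.functional as F

def pick_and_resize_patch(image: torch.Tensor, boxes, size: int = 224) -> torch.Tensor:
    """Illustrative sketch: randomly choose one (x0, y0, x1, y1) box, crop it from the
    (C, H, W) image, and resize the patch for a 2D feature extraction backbone."""
    x0, y0, x1, y1 = random.choice(boxes)            # random selection of one image block
    patch = image[:, y0:y1, x0:x1].unsqueeze(0)      # (1, C, h, w) crop
    return F.interpolate(patch, size=(size, size), mode="bilinear", align_corners=False)

# Example usage with an assumed 640x480 image and two candidate blocks.
patch = pick_and_resize_patch(torch.rand(3, 480, 640),
                              [(10, 20, 210, 220), (300, 100, 500, 300)])
print(patch.shape)  # torch.Size([1, 3, 224, 224])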
Regarding claim 3, the rejection of claim 2 is incorporated herein.
Jaritz et al. in the combination further teach wherein the preset two-dimensional feature extraction network further comprises a full convolution decoder (see page 5, section 4.2; “In the decoder, each layer consists of a transposed convolution, concatenation with encoder features of same resolution (skip connection) and another convolution to mix the features”), sampling the two-dimensional features of the last convolution layer step by step using an up-sampling strategy through the full convolution decoder to obtain a decoding feature map, and performing a convolution operation on the decoding feature map using the last convolution layer in the two-dimensional convolution encoder to obtain a new multi-scale two-dimensional feature (see page 5, section 4.2; “2D Network. We use a modified version of U-Net [24] with a ResNet34 [10] encoder where we add dropout after the 3rd and 4th layer and initialize with ImageNet pre-trained weights provided by PyTorch. In the decoder, each layer consists of a transposed convolution, concatenation with encoder features of same resolution (skip connection) and another convolution to mix the features. The network takes an image x2D as input and produces an output feature map with equal spatial dimensions (H, W, F2D), where F2D is the number of feature channels. In order to lift the 2D features to 3D, we sample them at sparse pixel locations where the 3D points project into the feature map, and obtain the final two-dimensional feature matrix (N, F2D)”; Note: a U-Net-style ConvNet has an encoder-decoder structure by design, and its decoder uses step-by-step up-sampling to produce decoded feature maps, which are further processed by convolution layers to generate multi-scale features).
Wang et al. in the combination further teach after the performing a two-dimensional convolution operation on the two-dimensional feature map through the two-dimensional convolution encoder based on different scales to obtain the multi-scale two-dimensional features, the method further comprises: extracting the two-dimensional features belonging to the last convolution layer in the two-dimensional convolution encoder from the multi-scale two-dimensional features (see page 3, section B; “we use VGG-16 [7] and PointNet++ [10] as the backbones for feature extraction of the image…. The updated features are then passed to the next layer of backbone for further feature extraction and following interactive feature fusion. After obtaining the deep representation of image and point cloud through the two aforementioned backbones, the multi-scale features from two modalities are fed to a final fusion module”; Note: the backbones produce hierarchical features, including those of the deepest convolution layer).
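For exposition only, the following Python sketch illustrates the operation noted above of lifting dense 2D decoder features to the 3D points by sampling the feature map at the pixel locations where the points project; the function name and tensor layout are assumptions of the Examiner and do not reproduce any cited implementation.

import torch

def lift_2d_features_to_points(feat_map: torch.Tensor, pixel_uv: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: sample a dense (F2D, H, W) feature map at the (u, v) pixel
    locations of the N projected 3D points, yielding an (N, F2D) feature matrix."""
    h, w = feat_map.shape[1], feat_map.shape[2]
    u = pixel_uv[:, 0].long().clamp(0, w - 1)   # column (x) index of each projected point
    v = pixel_uv[:, 1].long().clamp(0, h - 1)   # row (y) index of each projected point
    return feat_map[:, v, u].t()                # (N, F2D)

# Example usage with an assumed 64-channel feature map and 500 projected points.
feats = lift_2d_features_to_points(torch.randn(64, 120, 160),
                                   torch.randint(0, 120, (500, 2)).float())
print(feats.shape)  # torch.Size([500, 64])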
Regarding claim 4, the rejection of claim 1 is incorporated herein.
Jaritz et al. in the combination further teach wherein the preset three-dimensional feature extraction network comprises at least a three-dimensional convolution encoder with sparse convolution construction (see page 3, Fig. 2; “a 3D stream which takes the point cloud as input and uses a U-Net-Style 3D SparseConvNet [8]”); the performing feature extraction using a preset three-dimensional feature extraction network based on the two-dimensional point cloud to generate multi-scale three-dimensional features comprises: extracting non-hollow bodies from the three-dimensional point cloud through the three-dimensional convolution encoder (see page 2, right col., last para; “Voxels are very similar to pixels, but very memory intense as most of them are empty…and similar implementation [4] address this problem by using hash tables to convolve only on active voxels… In this work, we select SparseConvNet [8] as 3D network which is the state of-the-art on the ScanNet benchmark”; Note: sparse convolution only computes at the locations where data exists (non-empty regions or active voxels), which corresponds to extracting non-hollow (non-empty) point cloud structures), and performing a convolution operation on the non-hollow bodies to obtain the three-dimensional convolution features (see page 3, Fig. 2; “a 3D stream which takes the point cloud as input and uses a U-Net-Style 3D SparseConvNet [8]. Feature outputs of both streams have same length N, equal to the number of 3D points”); up-sampling on the three-dimensional convolution features using an up-sampling strategy to obtain decoding features (see page 6, left col., 1st para; “In order to lift the 2D features to 3D, we sample them at sparse pixel locations where the 3D points project into the feature map, and obtain the final two-dimensional feature matrix (N, F2D)”); and when the size of the sampled feature is the same as that of the original feature, stitching the three-dimensional convolution features and the decoding features to obtain the multi-scale three-dimensional features (see page 5, section 4.2; “2D Network. We use a modified version of U-Net [24] with a ResNet34 [10] encoder where we add dropout after the 3rd and 4th layer and initialize with ImageNet pre-trained weights provided by PyTorch. In the decoder, each layer consists of a transposed convolution, concatenation with encoder features of same resolution (skip connection) and another convolution to mix the features”; Note: the U-Net-style ConvNet further includes decoder stages that progressively up-sample feature maps and combine encoder and decoder features using skip connections to generate multi-scale feature representations).
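For illustration only, the following Python sketch shows the notion of keeping only the non-empty (“active”) voxels when voxelizing a point cloud, which is where a sparse 3D convolution would actually compute; the voxel size, function name, and return values are assumptions of the Examiner, and the sketch does not reproduce the SparseConvNet implementation cited above.

import torch

def voxelize_nonempty(points: torch.Tensor, voxel_size: float = 0.05):
    """Illustrative sketch: map each 3D point to a voxel coordinate and keep only the
    non-empty (active) voxels; also return the per-point point-to-voxel index."""
    voxel_coords = torch.floor(points / voxel_size).long()  # per-point voxel coordinate
    active_voxels, point_to_voxel = torch.unique(voxel_coords, dim=0, return_inverse=True)
    return active_voxels, point_to_voxel                    # shapes (M, 3) and (N,)

# Example usage with 1000 random points in an assumed 10 m cube.
voxels, mapping = voxelize_nonempty(torch.rand(1000, 3) * 10.0)
print(voxels.shape, mapping.shape)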
Regarding claim 8, the scope of claim 8 is fully encompassed by the scope of claim 1; accordingly, the rejection of claim 1 is fully applicable here.
Regarding claim 9, the rejection of claim 1 is incorporated herein.
Kuen et al. in the combination further teach an electronic apparatus having a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein, when being executed by the processor, the computer program is capable of implementing each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Regarding claim 10, the rejection of claim 2 is incorporated herein.
Kuen et al. in the combination further teach an electronic apparatus having a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein, when being executed by the processor, the computer program is capable of implementing each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Regarding claim 11, the rejection of claim 3 is incorporated herein.
Kuen et al. in the combination further teach an electronic apparatus having a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein, when being executed by the processor, the computer program is capable of implementing each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Regarding claim 12, the rejection of claim 4 is incorporated herein.
Kuen et al. in the combination further teach an electronic apparatus having a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein, when being executed by the processor, the computer program is capable of implementing each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Regarding claim 16, the rejection of claim 1 is incorporated herein.
Kuen et al. in the combination further teach a computer-readable storage medium with a computer program stored thereon, wherein, when being executed by a processor, the computer program is capable of causing the processor to implement each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Regarding claim 17, the rejection of claim 2 is incorporated herein.
Kuen et al. in the combination further teach a computer-readable storage medium with a computer program stored thereon, wherein, when being executed by a processor, the computer program is capable of causing the processor to implement each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Regarding claim 18, the rejection of claim 3 is incorporated herein.
Kuen et al. in the combination further teach a computer-readable storage medium with a computer program stored thereon, wherein, when being executed by a processor, the computer program is capable of causing the processor to implement each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Regarding claim 19, the rejection of claim 4 is incorporated herein.
Kuen et al. in the combination further teach a computer-readable storage medium with a computer program stored thereon, wherein, when being executed by a processor, the computer program is capable of causing the processor to implement each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Claims 5, 13 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Jaritz et al. and Wang et al. in view of Kuen et al. and further in view of Tchapmi et al., NPL “SEGCloud: Semantic Segmentation of 3D Point Clouds.”
Regarding claim 5, the rejection of claim 1 is incorporated herein.
Jaritz et al. in the combination further teach wherein after the performing feature extraction using a preset three-dimensional feature extraction network based on the two-dimensional point cloud to generate multi-scale three-dimensional features, and before the fusing the multi-scale two-dimensional features and the multi-scale three-dimensional features to obtain fused features, the method further comprises: adjusting resolutions of the multi-scale two-dimensional features to a resolution of the two-dimensional image using a deconvolution operation (see page 5, section 4.2; “In the decoder, each layer consists of a transposed convolution, concatenation with encoder features of same resolution (skip connection) and another convolution to mix the features”; Note: within a U-Net-like architecture, to increase the resolution, a transposed convolution (deconvolution) is applied first, high-frequency detail from the encoder is then added back via concatenation (skip connection), and the combined features are finally refined with another convolution); based on the adjusted multi-scale two-dimensional features, calculating a mapping relationship between the adjusted multi-scale two-dimensional features and the corresponding point cloud through a perspective projection method, and generating a point-to-pixel mapping relationship (see page 3, Fig. 2; “we project the 3D points into the image and sample the 2D features at the corresponding pixel locations”; Note: this implies a projection-based point-to-pixel mapping); determining a corresponding two-dimensional truth value label based on the point-to-pixel mapping relationship (see page 9, section A.2; “The semantic labeling was carried out in the 2D image for 38 classes and we compute the 3D labels by projection of the point cloud into the labeled image”). However, the combination of Jaritz et al., Wang et al., and Kuen et al. as a whole does not teach constructing a point-to-voxel mapping relationship of each point cloud in the three-dimensional point cloud using a preset voxel function and, according to the point-to-voxel mapping relationship, interpolating the multi-scale three-dimensional features by a random linear interpolation to obtain the three-dimensional features of each point cloud.
In the same field of endeavor, Tchapmi et al. teaches constructing a point-to-voxel mapping relationship of each point cloud in the three-dimensional point cloud using a preset voxel function; and according to the point-to-voxel mapping relationship, interpolating the multi-scale three-dimensional features by a random linear interpolation to obtain the three-dimensional features of each point cloud (see Fig. 1; “A 3D point cloud is voxelized and fed through a 3D fully convolutional neural network to produce coarse downsampled voxel labels. A trilinear interpolation layer transfers this coarse output from voxels back to the original 3D points representation. The obtained 3D point scores are used for inference in the 3D fully connected CRF to produce the final results. Our framework is trained end-to-end”, see also page 538, left col., 4th para; “We train the voxel-level 3D-FCNN and point-level CRF jointly and end-to-end by connecting them via trilinear interpolation enabling segmentation in the original 3D points space”). Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the cross-modal unsupervised domain adaptation for 3D semantic segmentation of Jaritz et al. in view of the interactive multi-scale fusion of 2D and 3D features for multi-object tracking of Wang et al., the multi-scale distillation for low-resolution detection of Kuen et al., and the semantic segmentation of 3D point clouds of Tchapmi et al. in order to improve the initial coarse 3D-FCNN predictions (see Fig. 1).
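For exposition only, the following Python sketch illustrates transferring coarse voxel-level features back to the original points by trilinear interpolation, in the general spirit of the interpolation layer of Tchapmi et al. cited above; the use of grid_sample, the [-1, 1] normalization convention, and all shapes are assumptions of the Examiner.

import torch
import torch.nn.functional as F

def interpolate_voxel_features_to_points(voxel_feat: torch.Tensor,
                                         points_norm: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: trilinearly interpolate a (C, D, H, W) voxel feature grid at
    N query points whose coordinates are normalized to [-1, 1], returning (N, C)."""
    grid = points_norm.view(1, -1, 1, 1, 3)                        # (1, N, 1, 1, 3)
    sampled = F.grid_sample(voxel_feat.unsqueeze(0), grid,
                            mode="bilinear", align_corners=True)   # trilinear for 5-D input
    return sampled.view(voxel_feat.shape[0], -1).t()               # (N, C)

# Example usage with an assumed 32-channel 16x16x16 voxel grid and 200 query points.
feat = interpolate_voxel_features_to_points(torch.randn(32, 16, 16, 16),
                                            torch.rand(200, 3) * 2 - 1)
print(feat.shape)  # torch.Size([200, 32])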
Regarding claim 13, the rejection of claim 5 is incorporated herein.
Kuen et al. in the combination further teach an electronic apparatus having a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein, when being executed by the processor, the computer program is capable of implementing each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Claims 6, 7 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Jaritz et al. and Wang et al. in view of Kuen et al. and Tchapmi et al. as applied to claims 1 and 5 above, and further in view of Chiori et al. (US 20210247201 A1).
Regarding claim 6, the rejection of claim 5 is incorporated herein.
The combination of Jaritz et al., Wang et al., Kuen et al. and Tchapmi et al. does not teach wherein the fusing the multi-scale two-dimensional features and the multi-scale three-dimensional features to obtain fused features comprises: converting the three-dimensional features of the point cloud into the two-dimensional features using a GRU-inspired fusion; perceiving the three-dimensional features obtained by other convolution layers corresponding to the two-dimensional features using a multi-layer perception mechanism, calculating a difference between the two-dimensional feature and the three-dimensional feature and stitching the two-dimensional feature with the corresponding two-dimensional feature in the decoding feature map; and obtaining fused features based on the difference and a result of the stitching operation.
In the same field of endeavor, Chiori et al. teaches wherein the fusing the multi-scale two-dimensional features and the multi-scale three-dimensional features to obtain fused features comprises: converting the three-dimensional features of the point cloud into the two-dimensional features using a GRU-inspired fusion (see para [0069]; “The multimodal fusion method may employ an encoder based on a bidirectional long short-term memory (BLSTM) or Gated Recurrent Units (GRU) to further convert the feature vector sequence so that each vector contains its contextual information”); perceiving the three-dimensional features obtained by other convolution layers corresponding to the two-dimensional features using a multi-layer perception mechanism (see para [0069]; “CNN-based features may be used directly, or one more feed-forward layer may be added to reduce the dimensionality”), calculating a difference between the two-dimensional feature and the three-dimensional feature and stitching the two-dimensional feature with the corresponding two-dimensional feature in the decoding feature map; and obtaining fused features based on the difference and a result of the stitching operation (see para [0085]; “For multiple objects, the feature vectors can be considered different vectors from different feature extractors. The multimodal fusion method can be used to fuse the feature vectors from the multiple objects. In this case, a sequence of feature vectors for each object is constructed by assigning a feature vector of an object detected in the current image to the most overlapping object detected in the previous image. The degree of the overlapping between two objects may be computed using the Intersection-over-Union (IoU) measure”). Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the cross-modal unsupervised domain adaptation for 3D semantic segmentation of Jaritz et al. and the interactive multi-scale fusion of 2D and 3D features for multi-object tracking of Wang et al. and the multi-scale distillation for low-resolution detection of Kuen et al. in view of the semantic segmentation of 3D point clouds of Tchapmi et al. and further in view of the navigation system of Chiori et al., which provides driving instructions to a driver of a moving vehicle based on a real-time description of objects in a scene, in order to help the driver safely follow an intended route (see para [0068]).
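For illustration only, the following Python sketch shows a GRU-based fusion of a feature vector sequence so that each output vector carries contextual information, in the general manner described in the passage of Chiori et al. quoted above; the bidirectional choice, dimensions, and projection layer are assumptions of the Examiner and are not taken from the reference.

import torch
import torch.nn as nn

class GRUFusion(nn.Module):
    """Illustrative sketch: encode a sequence of modality feature vectors with a
    bidirectional GRU and project back, so each vector contains contextual information."""
    def __init__(self, dim_in: int, dim_hidden: int):
        super().__init__()
        self.gru = nn.GRU(dim_in, dim_hidden, batch_first=True, bidirectional=True)
        self.project = nn.Linear(2 * dim_hidden, dim_hidden)

    def forward(self, feature_sequence: torch.Tensor) -> torch.Tensor:
        # feature_sequence: (B, T, dim_in), e.g., per-object or per-frame feature vectors
        context, _ = self.gru(feature_sequence)    # (B, T, 2 * dim_hidden)
        return self.project(context)               # (B, T, dim_hidden)

# Example usage with an assumed batch of 4 sequences of 10 vectors of dimension 256.
fused = GRUFusion(256, 128)(torch.randn(4, 10, 256))
print(fused.shape)  # torch.Size([4, 10, 128])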
Regarding claim 7, the rejection of claim 6 is incorporated herein.
Kuen et al. in the combination further teach wherein the distilling the fused features with unidirectional modal preservation to obtain a single-modal semantic segmentation model comprises: inputting the fused features and the converted two-dimensional features into a full connection layer of the dimensional feature extraction network in turn to obtain a corresponding semantic score (see para [0053]; “the high-resolution feature map 400 and the low-resolution feature map 405 can be input to a machine learning model. First, the features maps are concatenated, and then global pooling is performed to obtain a single feature map. Then the single feature map is fed to a fully connected (FC) layer, then a rectified linear unit (ReLU), another fully connected layer, and then a softmax function. Finally, the output can be combined with a weighted sum of the high-resolution feature map 400 and the low-resolution feature map 405 to obtain the fused feature map 410”); determining a distillation loss based on the semantic score (see para [0007]; “computing a knowledge distillation (KD) loss based on a comparison of the third feature map from the student network and the fused feature map from the teacher network; and updating parameters of the student network based on the KD loss”); and according to the distillation loss, distilling the fused features with unidirectional modal preservation to obtain the single-modal semantic segmentation model (see para [0135]; “knowledge of a multiscale fusion teacher is appropriately distilled to a low-resolution student by arranging the feature maps of teacher and student networks”, see also para [0123]; “student network decoder 920 decodes the feature map to obtain prediction information …. the prediction information includes … semantic segmentation information, and instance-level semantic segmentation information”).
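For exposition only, the following Python sketch illustrates the concatenate, globally pool, FC-ReLU-FC-softmax weighting and weighted-sum fusion described in the passage of Kuen et al. quoted above (para [0053]); the class name, channel counts, and the assumption that the two maps share spatial dimensions are the Examiner's illustrative choices and do not reproduce the reference's implementation.

import torch
import torch.nn as nn

class WeightedFusionHead(nn.Module):
    """Illustrative sketch: concatenate two feature maps, globally pool, run FC-ReLU-FC-
    softmax to obtain two weights, and fuse the maps as a weighted sum."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(2 * channels, channels), nn.ReLU(),
                                nn.Linear(channels, 2), nn.Softmax(dim=1))

    def forward(self, map_a: torch.Tensor, map_b: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([map_a, map_b], dim=1).mean(dim=(2, 3))  # global average pooling
        w = self.fc(pooled)                                          # (B, 2) fusion weights
        w_a = w[:, 0].view(-1, 1, 1, 1)
        w_b = w[:, 1].view(-1, 1, 1, 1)
        return w_a * map_a + w_b * map_b                             # fused feature map

# Example usage assuming both maps are (2, 64, 32, 32).
fused = WeightedFusionHead(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(fused.shape)  # torch.Size([2, 64, 32, 32])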
Regarding claim 14, the rejection of claim 6 is incorporated herein.
Kuen et al. in the combination further teach an electronic apparatus having a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein, when being executed by the processor, the computer program is capable of implementing each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Regarding claim 15, the rejection of claim 7 is incorporated herein.
Kuen et al. in the combination further teach an electronic apparatus having a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein, when being executed by the processor, the computer program is capable of implementing each step of the lidar point cloud segmentation method (see para [0050]; “Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of code or data. A non-transitory storage medium may be any available medium that can be accessed by a computer”).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WINTA GEBRESLASSIE whose telephone number is (571)272-3475. The examiner can normally be reached Monday-Friday, 9:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Bee can be reached at 571-270-5180. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/WINTA GEBRESLASSIE/Examiner, Art Unit 2677