DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement filed on June 21, 2024 has been considered by the examiner.
Response to Amendment
Applicant's amendment filed on June 21, 2024 has been fully considered.
Claim Objections
Claims 12 and 17 are objected to because of the following informalities:
Claim 12, line 2, introduces “a point cloud feature extraction component”. Claim 12 then recites the limitation “the feature extraction component” in line 4. There is insufficient antecedent basis for this limitation in the claim.
Claim 17 re-introduces the limitation “a three-dimensional detection border”, which is already introduced in claim 16. It is unclear whether the “three-dimensional detection border” in claim 17 refers to the “three-dimensional detection border” in claim 16 or a second “three-dimensional detection border”.
Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-9, 18, and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Qi et al. (US 20190147245 A1, hereinafter “Qi”) in view of Lee et al. (US 20210101286 A1, hereinafter “Lee”).
Regarding claim 20,
Qi teaches:
An electronic device, comprising: a memory, configured to store a computer program; and a processor, configured to execute the computer program (Qi: ¶38, “FIG. 25 . . . a device with one or more CPUs, a memory, . . .”; ¶96-97, “The device 2501 . . . a central processing unit (CPU, also “processor” and “computer processor” herein) 2505 . . . includes memory or memory location 2510 . . . the CPU 2505 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 2510 . . .”),
the computer program, when executed by the processor, causes the processor to:
process an original image of a target scene (Qi: Fig. 2; NOTE: The RGB image input is the processed original image of a target scene)
by using a first target detection network to obtain first feature information of a target object (Qi: Fig. 2; ¶127, “. . . the systems, methods, and devices herein leverage mature 2D object detector to propose 2D object regions in RGB images as well as to classify objects . . .”; ¶130, “Given a 2D image region (and its corresponding 3D frustum), several methods might be used to obtain 3D location of the object: One straightforward solution is to directly regress 3D object location (e.g., by 3D bounding box) from depth map using 2D CNNs. . .”; ¶133, “. . . the semantics from 2D detector are also leveraged for better instance segmentation. For example, if the object of interest is a pedestrian, then the segmentation network can use this prior to find geometries that look like a person. Specifically, the semantic category is encoded as a one-hot class vector . . .”; NOTE: In Fig. 2, the CNN is the first target detection network, and the obtained first feature information of a target object is the 2D region proposals and one-hot class vector as illustrated in the Frustum Proposal block of Fig. 2.),
and process a three-dimensional point cloud of the target scene (Qi: ¶128, “. . . a 2D bounding box can be lifted to a frustum (with near and far plane specified by depth sensor range) that defines a 3D search space for the object. All points within the frustum are then collected to form a frustum point cloud. . .”; ¶130, “. . . Given a 2D image region (and its corresponding 3D frustum). . . to obtain 3D location of the object”; NOTE: As shown in Fig. 2, the 3D point cloud of the target scene is the point cloud in frustum of Qi to be processed in the 3D Instance Segmentation block.)
by using a second target detection network to obtain second feature information of the target object (Qi: ¶130, “. . . segmentation in 3D point cloud is much more natural and easier than that in images where pixels from distant objects can be near-by to each other. Having observed this fact, segment instances in 3D point cloud occurs instead of in 2D image or depth map. Similar to Mask-RCNN that achieves instance segmentation by binary classification of pixels in image regions, the systems, methods, and devices herein implement 3D instance segmentation using PointNet-based network on point clouds in frustums. . .”; ¶132, “. . . Therefore, the segmentation PointNet is learning the occlusion and clutter patterns as well as recognizing the geometry for the object of a certain category. . .”; NOTE: The second target detection network is the PointNet-based network (which is similar to Mask-RCNN) which obtains second feature information. The second feature information is the learned occlusion, geometry for the object.);
predict a first coarse point cloud of the target object on the basis of the first feature information (Qi: Fig. 2, point cloud in frustum; NOTE: The first coarse point cloud of the target object is the point cloud in frustum as shown in Fig. 2. It is predicted on the basis of the 2D region proposals and one-hot class vector, which are the first feature information. Also see paragraph 128.),
and predict a three-dimensional detection result of the target object on the basis of the first feature information and the second feature information (Qi: Fig. 2, masking; ¶131, “3D mask coordinates”; ¶133, “. . . the semantic category is encoded as a one-hot class vector (k dimensional for the pre-defined k categories) and concatenate the one-hot vector to the intermediate point cloud features. . .”; NOTE: The predicted three-dimensional detection result is the 3D mask generated in the 3D Instance Segmentation block on the basis of the first feature information (one-hot class vector) concatenated with the second feature information (intermediate point cloud features described as the occlusion and geometry of the target object). Fig. 2 illustrates the one-hot class vector being used as an input to the PointNet.);
obtain a second coarse point cloud of the target object on the basis of the three-dimensional detection result (Qi: Fig. 2; ¶134, “After 3D instance segmentation, points that are classified as the object of interest are extracted. Having obtained these segmented object points. . .”; NOTE: The second coarse point cloud of the target object is Qi’s segmented object points, obtained on the basis of the 3D detection result (masking during the 3D instance segmentation PointNet).);
and obtaining an initial point cloud of the target object on the basis of the first coarse point cloud and the second coarse point cloud (Qi: ¶134, “After 3D instance segmentation, points that are classified as the object of interest are extracted. Having obtained these segmented object points, the system then normalizes its coordinates . . . the point cloud is transformed . . .”; NOTE: The initial point cloud obtained is the transformed and normalized segmented object points, based on the second coarse point cloud (Fig. 2, segmented object points). The segmented object points are based on the first coarse point cloud (Fig. 2, point cloud in frustum). In Fig. 2: the Frustum Proposal block obtains the first coarse point cloud, which is the point cloud in frustum >> 3D Instance Segmentation obtains the second coarse point cloud, which is the segmented object points >> the system normalizes the segmented object points to obtain an initial point cloud, which is the normalized segmented object points. Therefore, the initial point cloud is obtained on the basis of the first coarse point cloud and the second coarse point cloud.)
Although Qi teaches object localization, boundary estimation, object classification, and segmentation of individual shapes or fused complete point cloud shapes (Qi: Abstract, ¶8) and an Amodal 3D Box Estimation PointNet which “predicts amodal bounding boxes (for entire object even if part of it is unseen) for objects given object point cloud in 3D object coordinate”, Qi fails to teach processing the initial point cloud by using a preset shape generation network to obtain a semantic instance reconstruction result of the target object.
The analogous art Lee teaches:
processing a 2.5D observation of an object using a preset shape generation network to obtain a semantic instance reconstruction result of the target object (Lee: ¶3-5, “. . . training a point cloud prediction model . . . to generate a domain-invariant three-dimensional (3D) representation of the object (e.g., a 3D point cloud of the object). . . The domain-invariant 3D representations are generated based on processing, using the trained shape prediction network (e.g., a point cloud prediction network or other 3D shape prediction network). . . the domain-invariant 3D representation can be compact (data-wise), while being semantically interpretable and directly applicable for object manipulation.”; NOTE: The preset shape generation network is Lee’s trained shape prediction network. The obtained semantic instance reconstruction result of the target object is the semantically interpretable domain-invariant 3D representation of an object. Lee’s pipeline: a 2.5D observation including spatial information such as depth channels and RGB channels, as described in paragraph 4 >> processed using a shape generation network (shape prediction network) >> obtains the semantic instance reconstruction of the target object (domain-invariant 3D representation of the object).)
It would have been obvious to a person having ordinary skill in the art (PHOSITA) before the effective filing date of the claimed invention to combine Qi and Lee to include: processing the initial point cloud (generated by Qi) by using a preset shape generation network to obtain a semantic instance reconstruction result of the target object, by substituting Lee’s 2.5D observation with Qi’s superior 3D point cloud as the initial point cloud input to Lee’s shape prediction network to obtain a semantic instance reconstruction result of the target object.
The reason for doing so is to generate a domain-invariant 3D representation of an object because “the domain-invariant 3D representation can be compact (data-wise), while being semantically interpretable and directly applicable for object manipulation” (Lee, ¶5).
Regarding CRM claim 21,
CRM claim 21 is drawn to the CRM corresponding to the computer program as claimed in the apparatus of claim 20. Therefore, CRM claim 21 corresponds to the computer program in the apparatus of claim 20 and is rejected for the same reasons of obviousness as set forth above.
Regarding method claim 1,
Method claim 1 is drawn to the method corresponding to the computer program as claimed in the apparatus of claim 20. Therefore, method claim 1 corresponds to the computer program in the apparatus of claim 20 and is rejected for the same reasons of obviousness as set forth above.
Regarding claim 2, depending on 1,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 1,
Qi further teaches:
wherein the original image is a Red Green Blue (RGB) image (Qi: ¶125, “Given RGB-D data as input. . .”; NOTE: Also see figure 2 illustrating an RGB image as input).
Regarding claim 3, depending on 2,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 2,
Qi further teaches:
wherein processing the original image of the target scene by using the first target detection network to obtain the first feature information of the target object, comprises: processing the original image of the target scene by using a Faster Region-Convolutional Neural Network (Faster R-CNN), to obtain two-dimensional feature information of the target object (Qi: ¶57-58, “. . . the attention region comprises a two-dimensional boundary overlaid onto the two-dimensional image data which describes a horizontal and vertical region of the object of interest . . . the regioning process is based on a Single Shot MultiBox Detector (SSD) architecture, whose efficiency is optimized by cutting the number of channels in convolution layers by half. In some embodiments, the regioning process is trained on ImageNet, Faster-RCNN, FPN, or YOLO and fine-tuned on images from 3D object detection datasets.”).
Regarding claim 4, depending on 3,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 3,
Qi further teaches
wherein processing the original image of the target scene by using the Faster R-CNN, to obtain the two-dimensional feature information of the target object, comprises:
performing feature extraction on the original image of the target scene by using a convolutional layer of the Faster R-CNN (NOTE: See rejection of claim 3),
and outputting a first preset number of pieces of two-dimensional feature information comprising location information and semantic category information of the target object (Qi: ¶128-133, “. . . for 2D region proposal, an SSD based model is adopted. . . Given a 2D image region . . . used to obtain 3D location of the object: . . . a 2D bounding box can be lifted to a frustum (with near and far plane specified by depth sensor range) that defines a 3D search space for the object). . . the semantic category is encoded as a one-hot class vector (k dimensional for the pre-defined k categories). . .”; NOTE: As illustrated in Fig. 2, the 2D region proposal is shown with the 2D bounding box and one-hot class vector. The bounding box defines a 3D search space for the object, which is location information. The first preset number of pieces of two-dimensional feature information is the number of semantic categories encoded and mapped to the number of pre-defined k categories.)
by using an activation function (Qi: ¶140, “. . . For box size and heading angle, a hybrid of classification and regression formulations are used. . . . Softmax is used for all classification tasks . . .”; NOTE: Qi’s activation function is Softmax, which is the same function disclosed by the applicant on page 11 of the specification.).
Regarding claim 5, depending on 4,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 4,
Qi further teaches:
wherein the Faster R-CNN comprises a picture feature extraction component and a candidate generation component (Qi: ¶129 - 130, “. . . for 2D region proposal, an SSD based model is adopted. . . Given a 2D image region . . .used to obtain 3D location of the object: . . . using 2D CNNs.”; ¶133, “. . . the semantics from 2D detector are also leveraged for better instance segmentation. For example, if the object of interest is a pedestrian, then the segmentation network can use this prior to find geometries that look like a person. Specifically, the semantic category is encoded as a one-hot class vector (k dimensional for the pre-defined k categories). . .”; NOTE: The CNN block in Fig. 2 is the feature extraction component extracting the 3D location of the object using CNN. The 2D detector is the candidate generation component to detect semantic category);
wherein the picture feature extraction component is configured to perform feature extraction on the original image of the target scene by using a plurality of convolutional layers (NOTE: As discussed above in reference to Qi: ¶129-130 and Fig. 2, an SSD based model uses a plurality of 2D CNNs to obtain the 3D location of the object. Also see the rejection of claim 4.);
and the candidate generation component is configured to output a first preset number of pieces of two-dimensional feature information comprising location information and semantic category information of the target object by using an activation function (NOTE: The candidate generation component is the 2D detector described in paragraph 133, configured to output a first preset number of pieces of two-dimensional feature information comprising location information and semantic category information of the target object by using an activation function; also see the rejection of claim 4.).
Regarding claim 6, depending on 4,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 4,
Qi further teaches:
wherein predicting the first coarse point cloud of the target object on the basis of the first feature information, comprises:
predicting the first coarse point cloud of the target object by using a point generation network on the basis of the location information and the semantic category information (Qi: ¶128, “With known camera projection matrix, a 2D bounding box can be lifted to a frustum (with near and far plane specified by depth sensor range) that defines a 3D search space for the object. All points within the frustum are then collected to form a frustum point cloud. . . The entire procedure for extracting frustum point clouds from RGB-D data is termed frustum proposal generation”; NOTE: The point generation network is the frustum proposal generation; also see Fig. 2, illustrating the Frustum Proposal block that predicts the first coarse point cloud (point cloud in frustum). The location information and the semantic category information, as discussed in claims 4 and 5, are processed within the Frustum Proposal block. Therefore, the predicted first coarse point cloud is based on the location information and the semantic category information.).
Regarding claim 7, depending on 6,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 6,
Qi further teaches:
wherein predicting the first coarse point cloud of the target object by using the point generation network on the basis of the location information and the semantic category information, comprises: predicting the first coarse point cloud of the target object according to the location information and the semantic category information of the target object and by using the point generation network (NOTE: See rejection of claim 6)
and a multilayer perceptron (Qi: ¶67, “. . . the boundary process is based on the PointNet architecture, wherein each three-dimensional depth data point within the frustum is individually projected into a higher embedding space by a shared Multilayer Perceptron (MLP), and aggregated by a symmetric function, such as max pooling. Some embodiments further comprise, implementing another MLP for final binary classification. In some embodiments, the semantic class is encoded as a one-hot vector and is used as additional point features along with XYZ, intensity. . .”).
Regarding claim 8, depending on 6,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 6,
Although Qi teaches a semantic instance reconstruction network wherein the method further comprises: constructing a semantic instance reconstruction network comprising the first target detection network (NOTE: the CNN block in Fig. 2), the second target detection network (NOTE: the PointNet network for instance segmentation in Fig. 2), the point generation network (NOTE: frustum proposal generation as described in paragraph 128), and a three-dimensional target detection network (NOTE: the PointNet network that results in masking of the object, see Fig. 2, “masking”), Qi fails to teach a preset shape generation network and a three-dimensional object reconstruction network.
As discussed in the rejection of claim 1, the analogous art Lee teaches:
a preset shape generation network (Lee: ¶3-5, “. . . training a point cloud prediction model . . . to generate a domain-invariant three-dimensional (3D) representation of the object (e.g., a 3D point cloud of the object). . . The domain-invariant 3D representations are generated based on processing, using the trained shape prediction network (e.g., a point cloud prediction network or other 3D shape prediction network). . . the domain-invariant 3D representation can be compact (data-wise), while being semantically interpretable and directly applicable for object manipulation.”; NOTE: The preset shape generation network is Lee’s trained shape prediction network. The obtained semantic instance reconstruction result of the target object is the semantically interpretable domain-invariant 3D representation of an object. Lee’s pipeline: a 2.5D observation including spatial information such as depth channels and RGB channels, as described in paragraph 4 >> processed using a shape generation network (shape prediction network) >> obtains the semantic instance reconstruction of the target object (domain-invariant 3D representation of the object).)
It would have been obvious to a person having ordinary skill in the art (PHOSITA) before the effective filing date of the claimed invention to combine Qi and Lee to include: processing the initial point cloud (generated by Qi) by using a preset shape generation network to obtain a semantic instance reconstruction result of the target object, by substituting Lee’s 2.5D observation information with Qi’s superior 3D point cloud information as input to Lee’s shape prediction network to obtain a semantic instance reconstruction result of the target object, wherein the method further comprises: constructing a semantic instance reconstruction network comprising the first target detection network, the second target detection network, the point generation network, and the preset shape generation network on the basis of a three-dimensional target detection network and a three-dimensional object reconstruction network.
The reason for doing so is to generate a domain-invariant 3D representation of an object because “the domain-invariant 3D representation can be compact (data-wise), while being semantically interpretable and directly applicable for object manipulation” (Lee, ¶5).
Regarding claim 9, depending on 8,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 8,
Although Qi teaches a loss function, it is used for optimizing segmentation and bounding box regression as described in paragraph 140; Qi does not teach a reconstruction network, which Lee teaches as the shape prediction network/point cloud prediction network described in Lee, paragraph 4.
Lee further teaches:
wherein the method further comprises:
constructing a total loss function, and training the semantic instance reconstruction network by using the total loss function until a preset number of iterations is satisfied, so as to obtain a trained semantic instance reconstruction network (Lee: ¶24, “The point cloud prediction network can be used to generate a predicted point cloud . . . The loss function for training the point cloud prediction network. . .”; ¶37, “. . . the loss engine 149 can generate a batch loss that is based on multiple instances of training data in a batch. The loss engine 149 then updates one or more weights of the point cloud prediction network 170 based on the generated loss. . .”; NOTE: The semantic instance reconstruction network is Lee’s point cloud prediction network/shape prediction network. The loss function iterates based on the preset number of the multiple instances of training data in a batch. The iteration is satisfied upon processing the last data in the batch. After updating one or more weights, a trained semantic instance reconstruction network (Lee’s point cloud prediction network) is obtained.).
It would have been obvious to a person having ordinary skill in the art (PHOSITA) before the effective filing date of the claimed invention to combine Qi and Lee to include: wherein the method further comprises: constructing a total loss function, and training the semantic instance reconstruction network by using the total loss function until a preset number of iterations is satisfied, so as to obtain a trained semantic instance reconstruction network.
The reason for doing so is to generate a domain-invariant 3D representation of an object because “the domain-invariant 3D representation can be compact (data-wise), while being semantically interpretable and directly applicable for object manipulation” (Lee, ¶5).
Regarding claim 18, depending on 1,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 1,
Qi further teaches:
wherein obtaining the initial point cloud of the target object on the basis of the first coarse point cloud and the second coarse point cloud, comprises:
fusing the first coarse point cloud and the second coarse point cloud, to obtain the initial point cloud of the target object (Qi: Fig. 2; ¶134, “. . . After 3D instance segmentation, points that are classified as the object of interest are extracted. Having obtained these segmented object points, the system then normalizes its coordinates . . .”; NOTE: The obtained initial point cloud is the normalized segmented object points. In reference to Fig. 2, the first coarse point cloud, which is the point cloud in frustum, is generated in the frustum proposal >> the second coarse point cloud, which is the segmented object points, is generated taking the first coarse point cloud as input >> the initial point cloud is obtained once the segmented object points are normalized. Since the 3D instance segmentation block uses the point cloud in frustum as input to generate the segmented object points, and then the normalized segmented object points, the first coarse point cloud and the second coarse point cloud are therefore fused.)
Claims 10-12 are rejected under 35 U.S.C. 103 as being unpatentable over Qi in view of Lee, further in view of Hampali et al. (US 20220156426 A1, hereinafter “Hampali”) and Qi, Charles R., et al., “Deep Hough Voting for 3D Object Detection in Point Clouds,” Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9276-85, https://doi.org/10.1109/ICCV.2019.00937 (hereinafter “Qi NPL”).
Regarding claim 10, depending on 9,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 9, wherein processing the three-dimensional point cloud of the target scene by using the second target detection network to obtain the second feature information of the target object:
Although Qi uses a PointNet network in processing the three-dimensional point cloud of the target scene to obtain three-dimensional feature information of the target object (NOTE: See the rejection of claim 1 regarding the learned occlusion and geometry of the target object.), Qi and Lee do not disclose using VoteNet to obtain the second feature information of the target object.
The analogous art Hampali incorporating Qi NPL teaches:
processing the three-dimensional point cloud of the target scene by using a VoteNet to obtain three-dimensional feature information of the target object (Hampali: ¶83, “a 3D object detector can be implemented using a Votenet as described in Qi, Charles R and Litany, Or and He, Kaiming and Guibas, Leonidas J, “Deep Hough Voting for 3D Object Detection in Point Clouds”, which is incorporated herein by reference. . .”; ¶101, “. . . VoteNet can be used to detect 3D objects. . .”).
Further, Hampali incorporates by reference Qi, Charles R., Litany, Or, He, Kaiming, and Guibas, Leonidas J., “Deep Hough Voting for 3D Object Detection in Point Clouds” (hereinafter “Qi NPL”; note that Qi NPL is by the same inventor as the Qi reference used in the rejection of independent claims 20, 21, and 1).
(NOTE: The incorporated Qi NPL teaches that the VoteNet predicts and obtains 3D bounding boxes and semantic classes of objects, which are feature information of the target object, as described in Qi NPL, page 1, Fig. 1, Introduction.)
It would have been obvious to a person having ordinary skill in the art (PHOSITA) before the effective filing date of the claimed invention to combine Qi, Lee, and Hampali (incorporating Qi NPL) and include: processing the three-dimensional point cloud of the target scene by using a VoteNet to obtain three-dimensional feature information of the target object, as taught by Hampali.
The reason for doing so is to “provide improved layout and object estimation, including 3D layout estimation, from one or more images” (Hampali ¶5) because “layout and object estimation for scenes can be difficult in some scenarios. For instance, occlusions by objects in a room or other space (e.g., occlusions caused by furniture in a room) can cause problems when performing layout and object estimation. In one example, important image features of an environment, such as corners or edges, might be unobservable or only partially observable due to the occlusions. In other examples, occlusions can cause illumination effects (e.g., shadows, reflections, among others), can cause lack of textures in wall and ceiling regions, can lead to limited annotated data (e.g., for machine learning approaches), among others” (Hampali: ¶4).
Regarding claim 11, depending on 10,
The combination of Qi, Lee, and Hampali teaches:
The method for reconstructing semantic instance as claimed in claim 10, wherein processing the three-dimensional point cloud of the target scene by using the VoteNet to obtain the three-dimensional feature information of the target object, comprises:
The incorporated Qi NPL teaches:
performing feature extraction on the three-dimensional point cloud of the target scene by using a PointNet of the VoteNet to obtain three-dimensional point cloud features (Qi NPL: pages 3-4, Fig. 2, Section 4.1, “. . . adopt PointNet++ [36] as our backbone due to its simplicity and demonstrated success on tasks ranging from normal estimation [10], semantic segmentation [21] to 3D object localization [34]). . .”; NOTE: In the Qi NPL, Fig. 2 illustrates the VoteNet architecture, which performs feature extraction on the 3D point cloud (Fig. 2: input point cloud >> point cloud feature learning backbone). Section 4.1 discloses that the backbone is PointNet.);
obtaining central point coordinates of the target object by a multilayer perceptron network on the basis of the three-dimensional point cloud features and three-dimensional point cloud coordinates (Qi NPL: section 4.1, “. . . Given a set of seed points . . . a shared voting module generates votes from each seed independently. Specifically, the voting module is realized with a multi-layer perceptron (MLP) network . . . ground truth displacement . . . to the bounding box center of the object it belongs to.”);
and outputting a second preset number of pieces of three-dimensional feature information comprising object category information of the target object by means of the multilayer perceptron network on the basis of the central point coordinates and the three-dimensional point cloud features (Qi NPL: section 4.2, “. . . In our implementation, we use a shared PointNet for vote aggregation and proposal in clusters. Given a vote cluster . . . To enable usage of local vote geometry, we transform vote locations to a local normalized coordinate system . . . Then an object proposal for this cluster p(C) is generated . . . where votes from each cluster are independently processed by a MLP1 before being max-pooled (channel-wise) to a single feature vector and passed to MLP2 where information from different votes are further combined. We represent the proposal p as a multidimensional vector with an objectness score, bounding box parameters (center, heading and scale parameterized as in [34]) and semantic classification scores. . .”; NOTE: The second preset number of pieces of three-dimensional feature information is the generated object proposals, which comprise semantic classification scores. The object category information is the semantic classification scores.);
It would have been obvious to a person having ordinary skill in the art (PHOSITA) before the effective filing date of the claimed invention to combine Qi, Lee, and Hampali (incorporating Qi NPL) and include: performing feature extraction on the three-dimensional point cloud of the target scene by using a PointNet of the VoteNet to obtain three-dimensional point cloud features; obtaining central point coordinates of the target object by a multilayer perceptron network on the basis of the three-dimensional point cloud features and three-dimensional point cloud coordinates; and outputting a second preset number of pieces of three-dimensional feature information comprising object category information of the target object by means of the multilayer perceptron network on the basis of the central point coordinates and the three-dimensional point cloud features.
The reason for doing so is to allow the system “to propose amodal boundaries even from partial observations, as well as predicting other parameters like orientation, class, etc.” (Qi NPL: Section 4.2).
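For illustration only (not part of the record of the cited references), the voting mechanism the examiner maps from Qi NPL Section 4.1 can be sketched as follows. All array sizes and the random linear layer are hypothetical stand-ins for the PointNet++ backbone output and the shared voting MLP:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for Qi NPL Fig. 2: M seed points, each with 3D
# coordinates (XYZ) and C-dimensional features from the backbone network.
M, C = 8, 16
seed_xyz = rng.normal(size=(M, 3))    # three-dimensional point cloud coordinates
seed_feat = rng.normal(size=(M, C))   # three-dimensional point cloud features

# Voting module (Qi NPL Section 4.1): a shared MLP, reduced here to a single
# random linear layer, regresses from each seed's feature a displacement to
# the bounding-box center of the object the seed belongs to.
w, b = rng.normal(size=(C, 3)) * 0.1, np.zeros(3)
offsets = seed_feat @ w + b           # predicted displacements per seed
vote_xyz = seed_xyz + offsets         # central point coordinates of the object
```

Each seed thus votes independently for an object center, which is the mapping relied on for the claimed "central point coordinates" limitation.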
Regarding claim 12, depending on claim 11,
The combination of Qi, Lee, and Hampali teaches:
The method for reconstructing semantic instance as claimed in claim 11,
Hampali incorporating Qi NPL further teaches:
wherein the VoteNet comprises a point cloud feature extraction component (Qi NPL: Fig. 2, point cloud feature learning backbone),
a vote generation component (Qi NPL: Fig. 2, voting in point clouds component)
and a candidate generation component (Qi NPL: Fig. 2, object proposal and classification from votes component);
the feature extraction component is configured to perform feature extraction on the three-dimensional point cloud of the target scene by using a PointNet to obtain three-dimensional point cloud features (NOTE: as discussed in the rejection of claim 11, in reference to Qi NPL Section 4.1 incorporated by Hampali, the feature extraction component, which is the Qi NPL’s point cloud feature learning backbone in Fig. 2, uses PointNet (PointNet++) to obtain features.);
the vote generation component is configured to fuse the three-dimensional point cloud features and the three-dimensional point cloud coordinates, and generate votes by the multilayer perceptron network, which represents central point coordinates of an object (Qi NPL: Fig. 2; NOTE: in Fig. 2, the Votes (XYZ + feature, represented by M x (3 + C) inside the red box) are the fusion of the 3D features extracted by the PointNet learning backbone and the 3D point cloud coordinates, which are the Seeds (XYZ + feature) as shown in Fig. 2. Section 4.1 teaches using a multilayer perceptron to generate the votes, wherein “∆x∗i is the ground truth displacement from the seed position xi to the bounding box center of the object it belongs to.”);
and the candidate generation component is configured to fuse the central point coordinates and nearby three-dimensional point cloud features, generate object candidates by using the multilayer perceptron, and predict object category information (Qi NPL: Fig. 2, Section 4.2, “. . . passed to MLP2 where information from different votes are further combined . . .”; NOTE: The candidate generation component (Qi NPL’s object proposal and classification component as shown in Fig. 2) takes the Votes (XYZ + feature), which include the central point coordinates and 3D point cloud features processed in the Voting in Point Clouds component, and fuses them into local clusters. The object candidates, which are Qi NPL’s object proposals for the clusters, are then generated using the multilayer perceptrons MLP1 and MLP2, predicting object category information (semantic classification scores).).
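For illustration only, the candidate generation step the examiner maps from Qi NPL Section 4.2 (local normalization of a vote cluster, per-vote MLP1, channel-wise max-pooling, and MLP2 emitting a proposal vector) can be sketched as below. The cluster size, feature widths, and random weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical vote cluster (Qi NPL Section 4.2): K votes, each XYZ + C features.
K, C, num_classes = 5, 16, 10
vote_xyz = rng.normal(size=(K, 3))
vote_feat = rng.normal(size=(K, C))

# Transform vote locations to a local normalized coordinate system
# (relative to the cluster center), as Section 4.2 describes.
center = vote_xyz.mean(axis=0)
local_xyz = vote_xyz - center
x = np.concatenate([local_xyz, vote_feat], axis=1)   # K x (3 + C)

# MLP1 processes each vote independently; a channel-wise max-pool fuses the
# cluster into a single feature vector; MLP2 emits the proposal vector.
w1 = rng.normal(size=(3 + C, 32)) * 0.1
h = np.maximum(x @ w1, 0.0)                          # shared MLP1 with ReLU
pooled = h.max(axis=0)                               # single feature vector
proposal_dim = 1 + 6 + num_classes                   # objectness, box params, class scores
w2 = rng.normal(size=(32, proposal_dim)) * 0.1
proposal = pooled @ w2

objectness = proposal[0]
box_params = proposal[1:7]
class_scores = proposal[7:]   # the claimed "object category information"
```

The semantic classification scores at the end of the proposal vector correspond to the object category information recited in the claim.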
Claims 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Qi in view of Lee further in view of Hampali and Qi NPL further in view of Kim et al. (US 20210012124 A1, hereinafter “Kim”).
Regarding claim 16, depending on claim 1,
The combination of Qi and Lee teaches:
The method for reconstructing semantic instance as claimed in claim 1,
Although Qi teaches wherein predicting the three-dimensional detection result of the target object on the basis of the first feature information and the second feature information, so as to obtain the second coarse point cloud of the target object on the basis of the three-dimensional detection result (see the rejection of claim 20 and, in reference to Qi Fig. 2, the segmented object points being the second coarse point cloud), Qi fails to teach predicting a three-dimensional detection border of the target object on the basis of the first feature information and the second feature information and by using a bounding box regression network; and extracting point cloud information of the target object from the three-dimensional point cloud on the basis of the three-dimensional detection border so as to obtain the second coarse point cloud.
The analogous art Hampali incorporating Qi NPL teaches:
predicting a three-dimensional detection border of the target object on the basis of the first feature information and the second feature information and by using a bounding box regression network (Qi NPL: Fig. 2, Output: 3D bounding boxes; Section 4.2, “. . . Then an object proposal for this cluster p(C) is generated by passing the set input through a PointNet-like module . . . where votes from each cluster . . . processed by a MLP1 . . . and passed to MLP2 . . . represent the proposal p as a multidimensional vector with an objectness score, bounding box parameters . . . For positive proposals we further supervise the bounding box estimation . . . decouples the box loss to center regression, heading angle estimation and box size estimation”; NOTE: Qi NPL’s MLP-based proposal component predicts a 3D detection border, which is the 3D bounding boxes illustrated in Fig. 2. The bounding box regression network is equivalent to Qi NPL’s MLP-based bounding box generation. In reference to Qi NPL Fig. 2, the first feature information is the features extracted by the point cloud feature learning backbone in the Voting in Point Clouds component, which are the Seeds, and the second feature information is the classification from the object proposal component.);
It would have been obvious to a person having ordinary skill in the art (PHOSITA) before the effective filing date of the claimed invention to combine Qi, Lee, and Hampali (incorporating Qi NPL) and include: predicting a three-dimensional detection border of the target object on the basis of the first feature information and the second feature information and by using a bounding box regression network.
The reason for doing so is to allow “to propose amodal boundaries even from partial observations, as well as predicting other parameters like orientation, class, etc.” (Qi NPL: section 4.2).
However, the combination of Qi, Lee, and Hampali incorporating Qi NPL still fails to teach extracting point cloud information of the target object from the three-dimensional point cloud on the basis of the three-dimensional detection border so as to obtain the second coarse point cloud.
The analogous art Kim teaches:
extracting point cloud information of the target object from the three-dimensional point cloud on the basis of the three-dimensional detection border so as to obtain the second coarse point cloud (Kim: ¶24, “. . . the object mapping step may include: a point cloud data extraction step of extracting point cloud data in an inner area of a bounding box of each object from the point cloud data . . .”; ¶70, “. . . point cloud data extraction step (S310), point cloud data in the inner area of the bounding box of each object are extracted among the point cloud data as shown in FIG. 8 . . .”; NOTE: extracting point cloud information of the target is step S310 of Kim; the extracted point cloud data is the obtained second coarse point cloud; the 3D detection border is the bounding box.).
It would have been obvious to a person having ordinary skill in the art (PHOSITA) before the effective filing date of the claimed invention to combine Qi, Lee, Hampali, and Kim and include: extracting point cloud information of the target object from the three-dimensional point cloud on the basis of the three-dimensional detection border so as to obtain the second coarse point cloud.
The reason for doing so is to “collect correct road marking information in real-time by extracting objects and attributes from successive images on the basis of deep learning and map the road marking information to a high definition map, and remarkably improve accuracy of recognizing the road marking information” (Kim: Abstract).
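For illustration only, the extraction step the examiner maps from Kim ¶24 and ¶70 (retaining the points of the scene cloud that fall in the inner area of an object's bounding box) can be sketched as a simple point-in-box filter. The axis-aligned box, function name, and sample points are hypothetical simplifications:

```python
import numpy as np

def crop_to_box(points, box_min, box_max):
    # Keep only the points lying inside the 3D detection border
    # (axis-aligned here for simplicity); the retained points form
    # the "second coarse point cloud" in the Kim mapping.
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return points[inside]

points = np.array([[0.1, 0.2, 0.3],
                   [2.0, 2.0, 2.0],
                   [0.5, 0.5, 0.5]])
coarse = crop_to_box(points, np.zeros(3), np.ones(3))
# coarse retains the two points inside the unit box
```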
Regarding claim 17, depending on claim 16,
The combination of Qi, Lee, Hampali, and Kim teaches:
The method for reconstructing semantic instance as claimed in claim 16,
Hampali incorporating Qi NPL further teaches:
wherein predicting the three-dimensional detection border of the target object on the basis of the first feature information and the second feature information and by using the bounding box regression network, comprises (NOTE: see rejection of claim 16):
Qi further teaches:
fusing two-dimensional feature information and three-dimensional feature information to obtain a feature representation of the target object (Qi: ¶133, “. . . the semantics from 2D detector are also leveraged for better instance segmentation. For example, if the object of interest is a pedestrian, then the segmentation network can use this prior to find geometries that look like a person. Specifically, the semantic category is encoded as a one-hot class vector (k dimensional for the pre-defined k categories) and concatenate the one-hot vector to the intermediate point cloud features . . .”; NOTE: the 2D one-hot vector is obtained during frustum proposal as illustrated in Qi: Fig. 2. The 3D feature information is the intermediate point cloud features, such as the geometry of the object, as described in Qi: ¶132-133. The 2D one-hot vector and the 3D intermediate point cloud features are concatenated, fusing the 2D and 3D feature information. In reference to Fig. 2, a mask is then generated. The obtained feature representation is the concatenated result);
and predicting, by a three-dimensional bounding box regression network, a three-dimensional detection border of the target object by using the multilayer perceptron on the basis of the feature representation (Qi: Fig. 2, ¶118, “. . . amodal 3D bounding box regression are performed, by two variants of PointNet, consecutively. The segmentation network predicts the 3D mask of the object of interest (a.k.a. instance segmentation); and the regression network estimates the amodal 3D bounding box (covering the entire object even if only part of it is visible) . . .”; ¶67, “. . . the boundary process is based on the PointNet architecture, wherein each three-dimensional depth data point within the frustum is individually projected into a higher embedding space by a shared Multilayer Perceptron (MLP) . . .”; NOTE: In reference to Qi: Fig. 2, RGB image input >> 2D one-hot vector feature extraction in frustum proposal >> concatenate one-hot vector and 3D intermediate point cloud features to obtain the feature representation >> MLP-based bounding box regression to predict a 3D detection border of the target object, which is the 3D bounding box covering the entire object).
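For illustration only, the fusion-then-regression chain the examiner maps from Qi ¶133, ¶118, and ¶67 (concatenating a k-dimensional one-hot class vector from the 2D detector onto per-point 3D features, then regressing box parameters with a shared MLP) can be sketched as below. All dimensions, weights, and the 7-parameter box encoding are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: N frustum points, C-dimensional intermediate 3D
# features, k pre-defined semantic categories from the 2D detector.
N, C, k = 6, 8, 4
point_feat = rng.normal(size=(N, C))      # 3D intermediate point cloud features
one_hot = np.zeros(k)
one_hot[2] = 1.0                          # class predicted by the 2D detector

# Fuse 2D and 3D feature information by concatenating the one-hot vector
# to every point's features (Qi ¶133); this is the feature representation.
fused = np.concatenate([point_feat, np.tile(one_hot, (N, 1))], axis=1)  # N x (C + k)

# A shared MLP embeds each fused point (Qi ¶67); pooling yields one vector
# from which a small regression head predicts 3D bounding-box parameters.
w1 = rng.normal(size=(C + k, 32)) * 0.1
pooled = np.maximum(fused @ w1, 0.0).max(axis=0)
w2 = rng.normal(size=(32, 7)) * 0.1       # center (3), size (3), heading (1)
box = pooled @ w2                         # predicted 3D detection border
```

The one-hot prior lets the regression head exploit the 2D detector's semantic category when estimating the amodal box, which is the rationale quoted from Qi ¶133.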
Allowable Subject Matter
Claims 13-15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PATRICK GALERA whose telephone number is (571)272-5070. The examiner can normally be reached Mon-Fri 0800-1700 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Poon can be reached at 571-270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PATRICK P GALERA/Examiner, Art Unit 2617 /KING Y POON/Supervisory Patent Examiner, Art Unit 2617