DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). Certified copies of priority Application Nos. CN-2023100842210 and KR-10-20230158018 were filed in the parent application on 02/13/2024.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/18/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Status
Claim(s) 1-2, 5-6, 11-17 and 20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yuan Haobo et al. (“Polyphonicformer: Unified query learning for depth-aware video panoptic segmentation”; hereinafter “Yuan”).
Claim(s) 3, 7, 9 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yuan Haobo et al. (“Polyphonicformer: Unified query learning for depth-aware video panoptic segmentation”; hereinafter “Yuan”), in view of Shi, Wenjun, et al. (“Multilevel Cross-Aware RGBD Indoor Semantic Segmentation for Bionic Binocular Robot”; hereinafter “Shi”).
Claims 4, 8, 10 and 19 is/are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claim(s) 1-2, 5-6, 11-17 and 20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yuan Haobo et al. (“Polyphonicformer: Unified query learning for depth-aware video panoptic segmentation”; hereinafter “Yuan”).
Regarding claim 1, Yuan discloses a processor-implemented method, the method (Introduction: “As shown in Figure 1(b), the Polyphonic Former builds a unified query learning framework to enable the interaction between panoptic context and depth information and make them benefit from each other iteratively.”; Acknowledgments: “The numerical calculations in this work had been supported by the supercomputing system in the Supercomputing Center of Wuhan University.”) comprising:
generating a depth-aware feature of an image dependent on image features extracted from image data of the image; (Figs. 1-2; Fig.2: “Illustration of our proposed PolyphonicFormer (left). Our method contains three parts: (1) Feature Extractor to obtain two parallel features for depth estimation and panoptic segmentation. The extracted panoptic (red) and depth (green) features are directly used to predict the final output. (2) Polyphonic Head containing the panoptic path and depth path to refine all the queries at the same time (middle). We use the instance masks to group (orange) the panoptic and depth features for query-based learning.”; 3.1 Feature Extractor: “a shared backbone network (Residual Network [20] or Swin Transformer [35]) along with Feature Pyramid Network as the feature extractor. We adopt the semantic FPN [23] design to simply fuse the multiscale information since both the depth and panoptic predictions need the high-resolution feature to find the fine structures as well as high-level semantic information. In particular, leveraging a shared backbone, the neck generates different feature maps for panoptic and depth predictions in parallel. These feature maps output from FPN are denoted as X(pFPN) and X(dFPN), respectively.”) and
generating image data, representing information corresponding to one or more segmentations of the image, based on the depth-aware feature and a depth-aware representation, (Figs. 1(b) and 2; 3. Method: “a unified query learning sub-network, which takes three different types of queries and backbone features as inputs and outputs both panoptic segmentation and depth estimation predictions. It contains two paths: depth path and panoptic path, which are interacted with each other iteratively”; 3.2 Unified Query Learning - Query Reasoning: “With the updated queries, we perform QR to predict the depth and mask results.”; 3.5 Inference) wherein the depth-aware representation comprises depth-related information and visual-related information for the image. (3.2 Unified Query Learning: “The unified query learning sub-network is composed of two parallel paths for panoptic and depth predictions, respectively. … Query Update. The masked instance-level features are then delivered to update panoptic and depth queries. … The query linking process will make the depth query get the instance-level information from both the panoptic path and the depth path.”)
Regarding claim 2, Yuan discloses wherein the generating of the depth-aware feature of the image comprises: generating a visual feature and a depth feature of the image based on the image features; (Fig.2: Panoptic feature; Depth feature) and generating the depth-aware feature by fusing the visual feature and the depth feature. (Feature Extractor: “a shared backbone network (Residual Network [20] or Swin Transformer [35]) along with Feature Pyramid Network as the feature extractor. … the semantic FPN [23] design to simply fuse the multiscale information since both the depth and panoptic predictions need the high-resolution feature to find the fine structures as well as high-level semantic information.”; Query Update: “we use a gated fusion design to capture the information from queries and features … The query linking process will make the depth query get the instance-level information from both the panoptic path and the depth path.”)
Regarding claim 5, Yuan discloses the generating of the image data comprises: generating a refined depth-aware representation by refining the depth-aware representation; (3.2 Unified Query Learning: “The instance level features (N × C) are used to update the queries better focusing on each instance. This process is done in both panoptic path and depth path, with the same former stage mask ms−1, as shown in Figure 2 with orange line. With the help of previous stage masks, both the depth and panoptic predictions can be refined with instance-level features.”) generating depth prediction information of the segmentations based on the refined depth-aware representation and the depth-aware feature; (Fig.2: Depth Prediction) generating an enhanced depth-aware feature by enhancing the depth-aware feature (Fig.2: Mask Grouping); and generating, as the image data, mask prediction information and category prediction information respectively dependent on the refined depth-aware representation and the enhanced depth-aware feature. (Fig.2: Mask Prediction and Query Reasoning: “With the updated queries, we perform QR to predict the depth and mask results. … Note that both query updating and reasoning perform individually (Panoptic Path and Depth Path in Figure 2). The effect of each component can be found in the experiment part. In our framework, the queries are refined in each stage to be applied to the features from FPN for panoptic and depth paths to get the predictions. The predictions are conducted with the bipartite match for calculating loss during training and merged into dense predictions when inference. All predictions contribute to the final loss, but only the final stage predictions are used when inference.”)
Regarding claim 6, Yuan discloses the generating of the refined depth-aware representation by refining the depth-aware representation comprises: generating a first depth-aware representation by processing the depth-aware representation through a first attention network; (Fig.2: QR in stage 1; Query Reasoning: “With the updated queries, we perform QR to predict the depth and mask results. We simply adopt Multi-Head Self Attention [53] (MSA), Feed-Forward Neural Network [53] (FFN), followed by FC-LN-ReLU layers on queries, … The query linking process will make the depth query get the instance-level information from both the panoptic path and the depth path, and further bolster the mutual benefit between depth and panoptic predictions”) generating a second depth-aware representation by fusing the depth-aware representation and the first depth-aware representation and normalizing a feature-fused representation obtained by the fusing; (Fig.2: QU in stage 1; Query Update: “The masked instance-level features are then delivered to update panoptic and depth queries. … The query update process forces the model to learn the instance aware feature to speed up the training procedure. Specifically, we use a gated fusion design to capture the information from queries and features.”) generating a third depth-aware representation by processing the depth-aware feature and the second depth-aware representation through a second attention network; (Fig.2: QR in stage 2; Query Reasoning: “With the updated queries, we perform QR to predict the depth and mask results. We simply adopt Multi-Head Self Attention [53] (MSA), Feed-Forward Neural Network [53] (FFN), followed by FC-LN-ReLU layers on queries, … The query linking process will make the depth query get the instance-level information from both the panoptic path and the depth path, and further bolster the mutual benefit between depth and panoptic predictions”) generating a fourth depth-aware representation by fusing the second depth-aware representation and the third depth-aware representation and normalizing a feature-fused representation obtained by the fusing; (Fig.2: QU in stage 2; Query Update: “The masked instance-level features are then delivered to update panoptic and depth queries. … The query update process forces the model to learn the instance aware feature to speed up the training procedure. Specifically, we use a gated fusion design to capture the information from queries and features.”) and generating the refined depth-aware representation based on the fourth depth-aware representation using a feedforward network. (“In our framework, the queries are refined in each stage to be applied to the features from FPN for panoptic and depth paths to get the predictions. The predictions are conducted with the bipartite match for calculating loss during training and merged into dense predictions when inference.”)
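For clarity of the record, the query reasoning step quoted above (Yuan's equation (6), ms = FFN(MSA(q)) · XFPN) can be sketched as follows. This is an illustrative NumPy sketch only, not the reference's code: the MSA term is omitted, the FFN is reduced to a single stand-in ReLU-projected linear layer, and all array shapes and weights are assumptions.

```python
import numpy as np

# Illustrative sketch (not Yuan's code): refined queries are projected and
# dotted with the FPN feature map to yield one logit map per query, as in
# eq. (6). Shapes follow Yuan's stated N = 100 queries, C = 256 channels.
rng = np.random.default_rng(0)
N, C, H, W = 100, 256, 32, 64

q = rng.standard_normal((N, C))         # refined panoptic (or depth) queries
x_fpn = rng.standard_normal((C, H, W))  # high-resolution FPN feature map

def ffn(t, w):                          # stand-in for the FFN in eq. (6)
    return np.maximum(t @ w, 0.0)       # single ReLU-projected linear layer

w = rng.standard_normal((C, C)) / np.sqrt(C)
q_reasoned = ffn(q, w)                  # (N, C); MSA omitted for brevity

# ms = FFN(MSA(q)) · X_FPN  ->  one (H, W) logit map per query
mask_logits = np.einsum('nc,chw->nhw', q_reasoned, x_fpn)
print(mask_logits.shape)                # (100, 32, 64)
```

The same reasoning applies per-path: the depth path would dot the reasoned depth queries with the depth FPN feature X(dFPN) to obtain per-instance depth maps.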
Regarding claim 11, Yuan discloses the generating of the mask prediction information and the category prediction information of the segmentations comprises: generating the category prediction information based on the refined depth-aware representation using a first linear layer; (Fig.1(c): “Instance-level predictions with Thing-Stuff Mask Prediction”; Query Update: “The masked instance-level features are then delivered to update panoptic and depth queries. We describe the panoptic path first. Inspired by K-Net [76], we adopt an adaptive query update process. The query update process avoids the heavy pixel-wised affinity calculation that is used in DETR-like models [9,3]. The query update process forces the model to learn the instance aware feature to speed up the training procedure. Specifically, we use a gated fusion design to capture the information from queries and features. We first calculate the gate features FG as follows: FpG = φ(Xps) · φ(qps−1), (2) where φ is linear transformations, and all of the φs in this section are independent. The FG has the information from both the queries and the instance-level features. With FG, we can calculate gates for features and queries with the following formulation: Gpq = σ(ψ(FpG)), GpX = σ(ψ(FpG)), (3) where ψs are independent fully connected layers and σ are sigmoid functions. With the gates, for the panoptic queries, the updated queries can be calculated as follows: qps = GpX · ψ(Xps) + Gpq · ψ(qps−1), (4) where ψs are also independent fully connected layers. The updated queries in this stage take advantages of both the queries”) and generating a 16th feature associated with a mask and generating the mask prediction information by fusing the 16th feature and the enhanced depth-aware feature (Mask Grouping), based on the refined depth-aware representation, using a second linear layer.
(3.2 Unified Query Learning: “Each path takes the panoptic or depth feature as input and uses mask grouping and query update to iteratively update the queries. The updated queries are adopted to refine the corresponding mask and depth predictions by the query reasoning module. … use a gated fusion design to capture the information from queries and features. We first calculate the gate features FG as follows: FpG = φ(Xps) · φ(qps−1), (2) where φ is linear transformations, and all of the φs in this section are independent. The FG has the information from both the queries and the instance-level features. …”; Query Reasoning: “With the updated queries, we perform QR to predict the depth and mask results. We simply adopt Multi-Head Self Attention [53] (MSA), Feed-Forward Neural Network [53] (FFN), followed by FC-LN-ReLU layers on queries, … The query linking process will make the depth query get the instance-level information from both the panoptic path and the depth path, and further bolster the mutual benefit between depth and panoptic predictions. The depth and mask predictions are then calculated as ms and ds (equation 6)”; Table 3(f), Ablation Study on Iterative Rounds: “study the effectiveness of iterative design. The results can be found in Table 3f. With more iterative rounds, as we expected, the instance-level information from the features could be more used to refine the query to predict more accurate mask and depth results.”)
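The gated fusion of equations (2)-(4) quoted above admits the following minimal NumPy sketch. The shapes (N = 100 queries, C = 256 channels) follow Yuan's stated settings; the random weight matrices standing in for the independent φ and ψ transforms, and the reduction of each transform to a plain matrix product, are illustrative assumptions and not the reference's implementation.

```python
import numpy as np

# Illustrative sketch (not Yuan's code) of the gated query update,
# eqs. (2)-(4): instance-level features X_s (N x C) and previous-stage
# queries q_prev (N x C) are mixed through a gated fusion.
rng = np.random.default_rng(1)
N, C = 100, 256

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

X = rng.standard_normal((N, C))        # masked instance-level features X_s
q_prev = rng.standard_normal((N, C))   # queries from the previous stage

# six independent stand-ins for the phi / psi linear transforms
Ws = [rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(6)]

F_G = (X @ Ws[0]) * (q_prev @ Ws[1])           # eq. (2): gate features
G_q = sigmoid(F_G @ Ws[2])                     # eq. (3): query gate
G_X = sigmoid(F_G @ Ws[3])                     # eq. (3): feature gate
q_s = G_X * (X @ Ws[4]) + G_q * (q_prev @ Ws[5])  # eq. (4): updated queries
print(q_s.shape)                               # (100, 256)
```

The updated queries q_s thus carry information from both the previous-stage queries and the masked instance-level features, which is the "take advantages of both" property the quoted passage relies on.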
Regarding claim 12, Yuan discloses the image is a current frame image of a video to be processed, and wherein the method further comprises: generating a refined depth-aware representation of a previous frame image of the current frame image; (Fig.2: Tracking Head to learn the feature embedding among the frames (bottom) and Tracking with Thing Query: “For the tracking part, we adopt previous work design [71,39] and add a tracking head, where we directly learn appearance embeddings among the different frames.”) and performing similarity matching between a refined depth-aware representation of the current frame image and the refined depth-aware representation of the previous frame image, such that a same instance of the current frame image and the previous frame image have a unified indicator. (Fig.2 and Tracking with Thing Query: “During the inference, we use the thing masks, which are generated from the thing queries to obtain final tracking embeddings for tracking, as shown in the bottom part of Fig. 2. … The bi-directional softmax is calculated to associate the instances between two frames (annotated as n and m)”; Inference: “the instance-level panoptic and depth predictions from the corresponding queries. … For tracking, after getting the embeddings of each instance, we calculate the similarities between the embeddings stored before with the bidirectional softmax similarity. If we detect a new instance embedding that does not appear before, we will store the embedding for future matching.”)
Regarding claim 13, Yuan discloses the image is a current frame image of a video to be processed, and wherein the method further comprises: generating a refined depth-aware representation of a previous frame image of the current frame image; (Fig.2: Tracking Head to learn the feature embedding among the frames (bottom) and Tracking with Thing Query: “For the tracking part, we adopt previous work design [71,39] and add a tracking head, where we directly learn appearance embeddings among the different frames.”) and processing, through a third attention network, a refined depth-aware representation of the current frame image and the refined depth-aware representation of the previous frame image; (Query Reasoning: “With the updated queries, we perform QR to predict the depth and mask results. We simply adopt Multi-Head Self Attention [53] (MSA), Feed-Forward Neural Network [53] (FFN), followed by FC-LN-ReLU layers on queries … The predictions are conducted with the bipartite match for calculating loss during training and merged into dense predictions when inference. All predictions contribute to the final loss, but only the final stage predictions are used when inference.”) generating a time-domain refined depth-aware representation of a time-domain context; and determining the time-domain refined depth-aware representation as the refined depth-aware representation of the current frame image. (Tracking with Thing Query: “During the inference, we use the thing masks, which are generated from the thing queries to obtain final tracking embeddings for tracking, as shown in the bottom part of Fig. 2. The bi-directional softmax is calculated to associate the instances between two frames (annotated as n and m)”)
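The bi-directional softmax association cited in the claim 12 and 13 mappings can be illustrated as follows. This is a generic sketch of the technique Yuan adopts from prior tracking work, not the reference's code; the embedding dimension, instance counts, and the averaging of the two softmax directions are assumptions for illustration.

```python
import numpy as np

# Illustrative sketch (assumption, not Yuan's code): instance embeddings
# from two frames are compared in both directions; the softmax over stored
# embeddings and the softmax over current embeddings are averaged, and each
# current instance is matched to its highest-scoring stored track.
rng = np.random.default_rng(2)
D = 64

def softmax(s, axis):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

emb_prev = rng.standard_normal((3, D))   # stored embeddings (frame n)
emb_cur = rng.standard_normal((4, D))    # current embeddings (frame m)

sim = emb_cur @ emb_prev.T               # (4, 3) raw similarities
bi = 0.5 * (softmax(sim, axis=1) + softmax(sim, axis=0))

matches = bi.argmax(axis=1)              # best stored track per current instance
print(bi.shape, matches.shape)           # (4, 3) (4,)
```

In practice (per the quoted Inference passage), a current embedding whose best bi-directional score falls below a threshold would be treated as a new instance and its embedding stored for future matching; the threshold is omitted here.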
Regarding claim 14, Yuan discloses one vector of the depth-aware representation represents one object (Fig.1: Thing Mask) in the image. (Fig.2 and Unified Query Learning: “In PolyphonicFormer, we set C = 256 and N = 100, respectively. For each mask, we generate a corresponding feature to represent the instance-level information. The instance level features (N × C) are used to update the queries better focusing on each instance. … Query Update. The masked instance-level features are then delivered to update panoptic and depth queries.”; Tracking with Thing Query: “During the inference, we use the thing masks, which are generated from the thing queries to obtain final tracking embeddings for tracking, as shown in the bottom part of Fig. 2.”)
Regarding claim 15, Yuan discloses a non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the image processing method of claim 1. (Acknowledgments: “The numerical calculations in this work had been supported by the supercomputing system in the Supercomputing Center of Wuhan University.”)
Regarding claim 16, Yuan discloses an electronic device, comprising: processors configured to execute instructions; and a memory storing the instructions, wherein execution of the instructions configures the processors (Introduction: “As shown in Figure 1(b), the Polyphonic Former builds a unified query learning framework to enable the interaction between panoptic context and depth information and make them benefit from each other iteratively.”; Acknowledgments: “The numerical calculations in this work had been supported by the supercomputing system in the Supercomputing Center of Wuhan University.”) to:
generate a depth-aware feature of an image dependent on image features extracted from image data of the image; (Figs. 1-2; Fig.2: “Illustration of our proposed PolyphonicFormer (left). Our method contains three parts: (1) Feature Extractor to obtain two parallel features for depth estimation and panoptic segmentation. The extracted panoptic (red) and depth (green) features are directly used to predict the final output. (2) Polyphonic Head containing the panoptic path and depth path to refine all the queries at the same time (middle). We use the instance masks to group (orange) the panoptic and depth features for query-based learning.”; 3.1 Feature Extractor: “a shared backbone network (Residual Network [20] or Swin Transformer [35]) along with Feature Pyramid Network as the feature extractor. We adopt the semantic FPN [23] design to simply fuse the multiscale information since both the depth and panoptic predictions need the high-resolution feature to find the fine structures as well as high-level semantic information. In particular, leveraging a shared backbone, the neck generates different feature maps for panoptic and depth predictions in parallel. These feature maps output from FPN are denoted as X(pFPN) and X(dFPN), respectively.”) and
generate image data, representing information corresponding to one or more segmentations of the image, based on the depth-aware feature and a depth-aware representation, (Figs. 1(b) and 2; 3. Method: “a unified query learning sub-network, which takes three different types of queries and backbone features as inputs and outputs both panoptic segmentation and depth estimation predictions. It contains two paths: depth path and panoptic path, which are interacted with each other iteratively”; 3.2 Unified Query Learning - Query Reasoning: “With the updated queries, we perform QR to predict the depth and mask results.”; 3.5 Inference) wherein the depth-aware representation comprises depth-related information and visual-related information for the image. (3.2 Unified Query Learning: “The unified query learning sub-network is composed of two parallel paths for panoptic and depth predictions, respectively. … Query Update. The masked instance-level features are then delivered to update panoptic and depth queries. … The query linking process will make the depth query get the instance-level information from both the panoptic path and the depth path.”)
Regarding claim 17, Yuan discloses the processors are further configured to, when generating the depth-aware feature, generate a visual feature and a depth feature of the image based on the image features; (Fig.2: Panoptic feature; Depth feature) and generate the depth-aware feature by fusing the visual feature and the depth feature. (Feature Extractor: “a shared backbone network (Residual Network [20] or Swin Transformer [35]) along with Feature Pyramid Network as the feature extractor. … the semantic FPN [23] design to simply fuse the multiscale information since both the depth and panoptic predictions need the high-resolution feature to find the fine structures as well as high-level semantic information.”; Query Update: “we use a gated fusion design to capture the information from queries and features … The query linking process will make the depth query get the instance-level information from both the panoptic path and the depth path.”)
Regarding claim 20, Yuan discloses the processors are further configured to, when generating the segmentations, generate a refined depth-aware representation by refining the depth-aware representation; (3.2 Unified Query Learning: “The instance level features (N × C) are used to update the queries better focusing on each instance. This process is done in both panoptic path and depth path, with the same former stage mask ms−1, as shown in Figure 2 with orange line. With the help of previous stage masks, both the depth and panoptic predictions can be refined with instance-level features.”) generate depth prediction information of the segmentations based on the refined depth-aware representation and the depth-aware feature; (Fig.2: Depth Prediction ds) generate an enhanced depth-aware feature by enhancing the depth-aware feature (Fig.2: Mask Grouping); and generate mask prediction information and category prediction information of the segmentations based on the refined depth-aware representation and the enhanced depth-aware feature. (Fig.2: Mask Prediction and Query Reasoning: “With the updated queries, we perform QR to predict the depth and mask results. … Note that both query updating and reasoning perform individually (Panoptic Path and Depth Path in Figure 2). The effect of each component can be found in the experiment part. In our framework, the queries are refined in each stage to be applied to the features from FPN for panoptic and depth paths to get the predictions. The predictions are conducted with the bipartite match for calculating loss during training and merged into dense predictions when inference. All predictions contribute to the final loss, but only the final stage predictions are used when inference.”)
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 3, 7, 9 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yuan Haobo et al. (“Polyphonicformer: Unified query learning for depth-aware video panoptic segmentation”; hereinafter “Yuan”), in view of Shi, Wenjun, et al. (“Multilevel Cross-Aware RGBD Indoor Semantic Segmentation for Bionic Binocular Robot”; hereinafter “Shi”).
Regarding claim 3, Yuan discloses the generating of the depth-aware feature by fusing the visual feature and the depth feature comprises: generating a first visual feature and a first depth feature by performing a convolution operation on the visual feature and the depth feature, respectively; generating a first feature by fusing the first visual feature and the first depth feature; (Fig.2: “Feature Extractor to obtain two parallel features for depth estimation and panoptic segmentation. The extracted panoptic (red) and depth (green) features are directly used to predict the final output.”; Feature Extractor: “a shared backbone network (Residual Network [20] or Swin Transformer [35]) along with Feature Pyramid Network as the feature extractor. We adopt the semantic FPN [23] design to simply fuse the multiscale information since both the depth and panoptic predictions need the high-resolution feature to find the fine structures as well as high-level semantic information.”)
However, Yuan does not disclose generating a second feature by fusing the first depth feature and the first feature; and generating the depth-aware feature by sequentially performing feature concatenation and feature transformation on the second feature and the visual feature.
Shi discloses generating a first visual feature and a first depth feature by performing a convolution operation on the visual feature and the depth feature, respectively; generating a first feature by fusing the first visual feature and the first depth feature; (Fig.1 and B. Multilevel Cross-Aware Fusion: “Two SE blocks in F1 are responsible for learning new activations of RGB and depth features respectively. In the end the learned channel-activations are used to reweight modal-specific input across channels achieving the cross-complementary attention selection of channel-wise features for different modalities. This module can be developed as: F1out = Conv(WI · block1out,I + WD · block1out,D) (2) where WI = SEI(block1out,I ; block1out,D) and WD = SED(block1out,I ; block1out,D). SEI and SED represent the channel-wise reweighting operation respectively. The input of them are the concatenated features of RGB and depth features block1out,I , block1out,D.”)
generating a second feature by fusing the first depth feature and the first feature; and generating the depth-aware feature by sequentially performing feature concatenation and feature transformation on the second feature and the visual feature. (Fig.1 and B. Multilevel Cross-Aware Fusion: “The structure of Fi is based on F1. A simple branch is added to resize the last fused feature from Fi − 1 by using a convolution layer and concatenate the resized feature with the current cross-aware fused feature. Then a convolutional layer is implemented to adapt the concatenated features to the next feature extraction blocks. The entire fusion operation can be described as: F1out = F1(block1out,I , block1out,D); F2out = F2(block2out,I , block2out,D, F1out) ; F3out = F3(block3out,I , block3out,D, F2out) ; F4out = F4(block4out,I , block4out,D, F3out) (3) where Fi(·) denotes the different cross-aware fusion operations. Fiout represents the output fused features of Fi(·). The blockiout,I and blockiout,D mean the different output features of RGB and depth streams after blocki respectively. The feature fusions are progressive, and the high-level fusion depends on output fused features of low-level fusion. Therefore, it can integrate the effective information of RGB and depth images through multi-level fusion.”)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Yuan by including the multilevel cross-aware fusion taught by Shi, to arrive at a multilevel cross-aware network (MCA-Net) for RGBD semantic segmentation; thus, one of ordinary skill in the art would have been motivated to combine the references since this would improve the quality of semantic labeling as well as enhance the performance of 3D reconstruction and semantic mapping (Shi: Abstract).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
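Shi's cross-aware fusion stage (equation (2) quoted above, F1out = Conv(WI · block1out,I + WD · block1out,D)) reduces, under a simplified reading, to SE-style channel reweighting of each modality before fusion. The following NumPy sketch is illustrative only and not Shi's code: the SE blocks are reduced to global average pooling followed by a linear-plus-sigmoid excitation, and the fused convolution is reduced to a weighted sum of the two streams.

```python
import numpy as np

# Illustrative sketch (assumption, not Shi's code) of one cross-aware fusion
# stage: SE-style channel weights are learned from the concatenated RGB and
# depth features and used to reweight each modality before fusion.
rng = np.random.default_rng(3)
C, H, W = 64, 16, 16

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def se_weights(concat, w):               # squeeze (GAP) then excite (sigmoid)
    squeezed = concat.mean(axis=(1, 2))  # (2C,) global average pool
    return sigmoid(w @ squeezed)         # (C,) channel-wise weights

rgb = rng.standard_normal((C, H, W))     # block_out,I (RGB stream)
depth = rng.standard_normal((C, H, W))   # block_out,D (depth stream)
concat = np.concatenate([rgb, depth], axis=0)   # (2C, H, W) SE input

w_I = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)
w_D = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)

W_I = se_weights(concat, w_I)            # channel weights for the RGB stream
W_D = se_weights(concat, w_D)            # channel weights for the depth stream

# F1out = Conv(W_I * rgb + W_D * depth); the conv is reduced to a sum here
fused = W_I[:, None, None] * rgb + W_D[:, None, None] * depth
print(fused.shape)                       # (64, 16, 16)
```

In the progressive scheme of Shi's equation (3), each later stage Fi would additionally take the previous fused output Fi−1 (resized by a convolution and concatenated) as a third input; that branch is omitted from this single-stage sketch.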
Regarding claim 7, Yuan discloses the generating of the depth prediction information of the segmentations comprises: generating a fifth feature by performing a linear operation on the refined depth-aware representation and obtaining a sixth feature by performing a convolution operation on the depth-aware feature; (Fig.2 and 3.2 Unified Query Learning: “Each path takes the panoptic or depth feature as input and uses mask grouping and query update to iteratively update the queries. The updated queries are adopted to refine the corresponding mask and depth predictions by the query reasoning module. … we use a gated fusion design to capture the information from queries and features. We first calculate the gate features FG as follows: FpG = φ(Xps) · φ(qps−1), (2) where φ is linear transformations, and all of the φs in this section are independent. With FG, we can calculate gates for features and queries with the following formulation: Gpq = σ(ψ(FpG)), GpX = σ(ψ(FpG)), (3) where ψs are independent fully connected layers and σ are sigmoid functions. With the gates, for the panoptic queries, the updated queries can be calculated as follows: qps = GpX · ψ(Xps) + Gpq · ψ(qps−1), (4) … For the depth path, we adopt a similar process as Equation 2 and Equation 3 to calculate depth features FdG and gates Gdq, GdX. The features and queries from both the panoptic path and depth path will contribute to the current stage depth query:”)
generating the depth prediction information based on the ninth feature using a depth estimation network. (Query Reasoning: “we perform QR to predict the depth and mask results. Multi-Head Self Attention [53] (MSA), Feed-Forward Neural Network [53] (FFN), followed by FC-LN-ReLU layers on queries, which has been proved to be effective in [76]. The depth and mask predictions are then calculated by: m_s = FFN(MSA(q^m_s)) · X^m_FPN ; d_s = FFN(MSA(q^d_s)) · X^d_FPN (6). The m_s and d_s are instance-level predictions of stage s. Note that both query updating and reasoning perform individually (Panoptic Path and Depth Path in Figure 2).”)
However, Yuan does not disclose generating a seventh feature by fusing the fifth feature and the sixth feature; generating an eighth feature by fusing the seventh feature and the fifth feature; generating a ninth feature by fusing the eighth feature and the sixth feature;
Shi discloses generating a seventh feature by fusing the fifth feature and the sixth feature; (Fig.1 and B. Multilevel Cross-Aware Fusion: “Two SE blocks in F1 are responsible for learning new activations of RGB and depth features respectively. In the end the learned channel-activations are used to reweight modal-specific input across channels achieving the cross-complementary attention selection of channel-wise features for different modalities. This module can be developed as: F1out = Conv(WI · block1out,I + WD · block1out,D) (2) where WI = SEI(block1out,I ; block1out,D) and WD = SED(block1out,I ; block1out,D). SEI and SED represent the channel-wise reweighting operation respectively. The input of them are the concatenated features of RGB and depth features block1out,I , block1out,D.”)
generating an eighth feature by fusing the seventh feature and the fifth feature; generating a ninth feature by fusing the eighth feature and the sixth feature; (Fig.1 and B. Multilevel Cross-Aware Fusion: “The structure of Fi is based on F1. A simple branch is added to resize the last fused feature from Fi − 1 by using a convolution layer and concatenate the resized feature with the current cross-aware fused feature. Then a convolutional layer is implemented to adapt the concatenated features to the next feature extraction blocks. The entire fusion operation can be described as: F1out = F1(block1out,I , block1out,D); F2out = F2(block2out,I , block2out,D, F1out) ; F3out = F3(block3out,I , block3out,D, F2out) ; F4out = F4(block4out,I , block4out,D, F3out) (3) where Fi(·) denotes the different cross-aware fusion operations. Fiout represents the output fused features of Fi(·). The blockiout,I and blockiout,D mean the different output features of RGB and depth streams after blocki respectively. The feature fusions are progressive, and the high-level fusion depends on output fused features of low-level fusion. Therefore, it can integrate the effective information of RGB and depth images through multi-level fusion.”) and generating the depth prediction information based on the ninth feature using a depth estimation network. (A. Network Architecture: “I represents the input RGB images and D represents the input depth maps. The output is denoted by L which means the pixel-wise labeling corresponding to RGB images. … the low-level fused features and high-level features exported from ASPP module are fed into the decoder module. Through above-mentioned steps, the results of indoor semantic segmentation L are significantly improved.”)
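Again for illustration only, Shi's cross-aware fusion and progressive multilevel fusion, Equations (2)–(3) as quoted, can be sketched as follows. The SE reweighting is reduced to a global-average-pool plus a sigmoid-activated linear layer, and the convolutional layers are modeled as channel-preserving sums; these are simplifying assumptions rather than Shi's actual layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)

def se_weights(feat_I, feat_D, W):
    """Squeeze-and-Excitation-style channel weights from concatenated inputs."""
    # Squeeze: global average pool each modality over its spatial dims.
    squeezed = np.concatenate([feat_I.mean(axis=(1, 2)), feat_D.mean(axis=(1, 2))])
    # Excite: one linear layer + sigmoid yields per-channel weights in (0, 1).
    return sigmoid(W @ squeezed)

def cross_aware_fuse(feat_I, feat_D, prev=None):
    """Eq. (2): F_out = Conv(W_I * block_I + W_D * block_D), with the conv
    modeled as identity; for Fi (i > 1) the previous fused feature is folded in."""
    C = feat_I.shape[0]
    W_I = rng.standard_normal((C, 2 * C))
    W_D = rng.standard_normal((C, 2 * C))
    w_I = se_weights(feat_I, feat_D, W_I)[:, None, None]
    w_D = se_weights(feat_I, feat_D, W_D)[:, None, None]
    fused = w_I * feat_I + w_D * feat_D
    if prev is not None:
        # Shi's resize-and-concatenate of the previous fused feature is
        # modeled here as adding its per-channel means back in.
        fused = fused + prev.mean(axis=(1, 2))[:, None, None]
    return fused

# Eq. (3): progressive fusion over four blocks of the RGB and depth streams.
C, H, W = 8, 16, 16
fused = None
for _ in range(4):
    block_I = rng.standard_normal((C, H, W))  # RGB stream output of block i
    block_D = rng.standard_normal((C, H, W))  # depth stream output of block i
    fused = cross_aware_fuse(block_I, block_D, fused)
print(fused.shape)  # (8, 16, 16)
```

The loop makes the progressive dependency explicit: each Fi consumes the output of Fi−1, so high-level fusion is conditioned on the low-level fused features, as the quotation states.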
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Yuan to include the Multilevel Cross-Aware Fusion taught by Shi, thereby arriving at a multilevel cross-aware network (MCA-Net) for RGBD semantic segmentation. One of ordinary skill in the art would have been motivated to combine the references because doing so would improve the quality of semantic labeling as well as enhance the performance of 3D reconstruction and semantic mapping (Shi: Abstract).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
Regarding claim 9, Yuan discloses the generating of the depth prediction information of the segmentations comprises: generating the depth prediction information and enhanced depth-related information of the segmentations based on the refined depth-aware representation and the depth-aware feature, (Mask Grouping: “For each mask, we generate a corresponding feature to represent the instance-level information. The instance level features (N × C) are used to update the queries better focusing on each instance. This process is done in both panoptic path and depth path, with the same former stage mask m_{s−1}, as shown in Figure 2 with orange line. With the help of previous stage masks, both the depth and panoptic predictions can be refined with instance-level features.”) wherein the generating of the enhanced depth-aware feature comprises: generating a tenth feature by performing a convolution operation on the depth-aware feature and obtaining an 11th feature by performing a convolution operation on the enhanced depth-related information; (Fig.2: “Feature Extractor to obtain two parallel features for depth estimation and panoptic segmentation. The extracted panoptic (red) and depth (green) features are directly used to predict the final output. (2) Polyphonic Head containing the panoptic path and depth path to refine all the queries at the same time (middle).”; Query Reasoning: “we perform QR to predict the depth and mask results. Multi-Head Self Attention [53] (MSA), Feed-Forward Neural Network [53] (FFN), followed by FC-LN-ReLU layers on queries, which has been proved to be effective in [76]. The depth and mask predictions are then calculated by: m_s = FFN(MSA(q^m_s)) · X^m_FPN ; d_s = FFN(MSA(q^d_s)) · X^d_FPN (6). The m_s and d_s are instance-level predictions of stage s. Note that both query updating and reasoning perform individually (Panoptic Path and Depth Path in Figure 2).”)
generating a 12th feature by fusing the tenth feature and the 11th feature; (Query Update: “The query update process forces the model to learn the instance aware feature to speed up the training procedure. Specifically, we use a gated fusion design to capture the information from queries and features.”)
However, Yuan does not disclose generating a 13th feature by fusing the 11th feature and the 12th feature; and generating the enhanced depth-aware feature by sequentially performing feature concatenation and feature transformation on the 13th feature and the depth-aware feature.
Shi discloses generating a 12th feature by fusing the tenth feature and the 11th feature; (Fig.1 and B. Multilevel Cross-Aware Fusion: “Two SE blocks in F1 are responsible for learning new activations of RGB and depth features respectively. In the end the learned channel-activations are used to reweight modal-specific input across channels achieving the cross-complementary attention selection of channel-wise features for different modalities. This module can be developed as: F1out = Conv(WI · block1out,I + WD · block1out,D) (2) where WI = SEI(block1out,I ; block1out,D) and WD = SED(block1out,I ; block1out,D). SEI and SED represent the channel-wise reweighting operation respectively. The input of them are the concatenated features of RGB and depth features block1out,I , block1out,D.”) generating a 13th feature by fusing the 11th feature and the 12th feature; and generating the enhanced depth-aware feature by sequentially performing feature concatenation and feature transformation on the 13th feature and the depth-aware feature. (Fig.1 and B. Multilevel Cross-Aware Fusion: “The structure of Fi is based on F1. A simple branch is added to resize the last fused feature from Fi − 1 by using a convolution layer and concatenate the resized feature with the current cross-aware fused feature. Then a convolutional layer is implemented to adapt the concatenated features to the next feature extraction blocks. The entire fusion operation can be described as: F1out = F1(block1out,I , block1out,D); F2out = F2(block2out,I , block2out,D, F1out) ; F3out = F3(block3out,I , block3out,D, F2out) ; F4out = F4(block4out,I , block4out,D, F3out) (3) where Fi(·) denotes the different cross-aware fusion operations. Fiout represents the output fused features of Fi(·). The blockiout,I and blockiout,D mean the different output features of RGB and depth streams after blocki respectively. 
The feature fusions are progressive, and the high-level fusion depends on output fused features of low-level fusion. Therefore, it can integrate the effective information of RGB and depth images through multi-level fusion.”)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Yuan to include the Multilevel Cross-Aware Fusion taught by Shi, thereby arriving at a multilevel cross-aware network (MCA-Net) for RGBD semantic segmentation. One of ordinary skill in the art would have been motivated to combine the references because doing so would improve the quality of semantic labeling as well as enhance the performance of 3D reconstruction and semantic mapping (Shi: Abstract).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
Regarding claim 18, Yuan discloses wherein the processors are configured to, when obtaining the depth-aware feature by fusing the visual feature and the depth feature: generate a first visual feature and a first depth feature by performing a convolution operation on the visual feature and the depth feature, respectively; (Fig.2: Feature Extractor to obtain two parallel features for depth estimation and panoptic segmentation. The extracted panoptic (red) and depth (green) features are directly used to predict the final output.”; Feature Extractor: “a shared backbone network (Residual Network [20] or Swin Transformer [35]) along with Feature Pyramid Network as the feature extractor. We adopt the semantic FPN [23] design to simply fuse the multiscale information since both the depth and panoptic predictions need the high-resolution feature to find the fine structures as well as high-level semantic information.”)
However, Yuan does not disclose: generate a first feature by fusing the first visual feature and the first depth feature; generate a second feature by fusing the first depth feature and the first feature; and generate the depth-aware feature by sequentially performing feature concatenation and feature transformation on the second feature and the visual feature.
Shi discloses generating a first visual feature and a first depth feature by performing a convolution operation on the visual feature and the depth feature, respectively; generating a first feature by fusing the first visual feature and the first depth feature; (Fig.1 and B. Multilevel Cross-Aware Fusion: “Two SE blocks in F1 are responsible for learning new activations of RGB and depth features respectively. In the end the learned channel-activations are used to reweight modal-specific input across channels achieving the cross-complementary attention selection of channel-wise features for different modalities. This module can be developed as: F1out = Conv(WI · block1out,I + WD · block1out,D) (2) where WI = SEI(block1out,I ; block1out,D) and WD = SED(block1out,I ; block1out,D). SEI and SED represent the channel-wise reweighting operation respectively. The input of them are the concatenated features of RGB and depth features block1out,I , block1out,D.”)
generating a second feature by fusing the first depth feature and the first feature; and generating the depth-aware feature by sequentially performing feature concatenation and feature transformation on the second feature and the visual feature. (Fig.1 and B. Multilevel Cross-Aware Fusion: “The structure of Fi is based on F1. A simple branch is added to resize the last fused feature from Fi − 1 by using a convolution layer and concatenate the resized feature with the current cross-aware fused feature. Then a convolutional layer is implemented to adapt the concatenated features to the next feature extraction blocks. The entire fusion operation can be described as: F1out = F1(block1out,I , block1out,D); F2out = F2(block2out,I , block2out,D, F1out) ; F3out = F3(block3out,I , block3out,D, F2out) ; F4out = F4(block4out,I , block4out,D, F3out) (3) where Fi(·) denotes the different cross-aware fusion operations. Fiout represents the output fused features of Fi(·). The blockiout,I and blockiout,D mean the different output features of RGB and depth streams after blocki respectively. The feature fusions are progressive, and the high-level fusion depends on output fused features of low-level fusion. Therefore, it can integrate the effective information of RGB and depth images through multi-level fusion.”)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Yuan to include the Multilevel Cross-Aware Fusion taught by Shi, thereby arriving at a multilevel cross-aware network (MCA-Net) for RGBD semantic segmentation. One of ordinary skill in the art would have been motivated to combine the references because doing so would improve the quality of semantic labeling as well as enhance the performance of 3D reconstruction and semantic mapping (Shi: Abstract).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
Allowable Subject Matter
Claims 4, 8, 10 and 19 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Regarding claim 4, Yuan, as modified by Shi, does not disclose the generating of the depth-aware feature by sequentially performing the feature concatenation and the feature transformation comprises: generating a third feature by sequentially performing feature concatenation and feature transformation on the second feature and the visual feature; generating a fourth feature by reshaping a dimension of the depth feature; generating a first depth position feature by fusing the fourth feature and a depth-related position embedding; and generating the depth-aware feature by fusing the third feature and the first depth position feature.
Claim 4 contains allowable subject matter by virtue of this limitation.
Regarding claim 8, Yuan, as modified by Shi, does not disclose the generating of the depth prediction information based on the ninth feature comprises: generating a feature weight corresponding to the ninth feature by performing pooling on the ninth feature and performing a linear operation on a pooled ninth feature obtained by the pooling; and generating the depth prediction information by performing a linear operation on the ninth feature using the feature weight.
Claim 8 contains allowable subject matter by virtue of this limitation.
Regarding claim 10, Yuan, as modified by Shi, does not disclose the generating of the enhanced depth-aware feature by sequentially performing the feature concatenation and the feature transformation on the 13th feature and the depth-aware feature comprises: generating a 14th feature by sequentially performing feature concatenation and feature transformation on the 13th feature and the depth-aware feature; generating a 15th feature by reshaping a dimension of the enhanced depth-related information; generating a second depth position feature by fusing the 15th feature and a depth-related position embedding; and generating the enhanced depth-aware feature by fusing the 14th feature and the second depth position feature.
Claim 10 contains allowable subject matter by virtue of this limitation.
Regarding claim 19, Yuan, as modified by Shi, does not disclose the processors are configured to, when generating the depth-aware feature: generate a third feature by sequentially performing feature concatenation and feature transformation on the second feature and the visual feature; generate a fourth feature by reshaping a dimension of the depth feature; generate a first depth position feature by fusing the fourth feature and a depth-related position embedding; and generate the depth-aware feature by fusing the third feature and the first depth position feature.
Claim 19 contains allowable subject matter by virtue of this limitation.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Rhodes et al (U.S. 20200160528 A1), “High Fidelity Interactive Segmentation for Video Data with Deep Convolutional Tessellations and Context Aware Skip Connection”, teaches applying tessellation to a video frame to generate feature frames corresponding to the video frame, and applying a segmentation network implementing context-aware skip connections to an input volume, including the feature frames and a context feature volume corresponding to the video frame, to generate a segmentation for the video frame.
Chiu et al (U.S. 20200357143 A1), “Semantically-Aware Image Based Visual Localization”, teaches a method, apparatus, and system for visual localization that includes extracting appearance features of an image; extracting semantic features of the image; fusing the extracted appearance and semantic features; pooling and projecting the fused features into a semantic embedding space trained using fused appearance and semantic features of images having known locations; computing a similarity measure between the projected fused features and the embedded fused features; and predicting a location of the image associated with the projected fused features.
Goel et al (U.S. 20210181757 A1), “Multi-task Learning for Real-Time Semantic and/or Depth Aware Instance Segmentation and/or Three-Dimensional Object Bounding”, teaches a machine-learning (ML) architecture for determining three or more outputs, such as a two- and/or three-dimensional region of interest, semantic segmentation, direction logits, depth data, and/or instance segmentation associated with an object in an image. It further teaches that the ML architecture may determine an object detection comprising a two-dimensional region of interest (ROI), a classification, a semantic segmentation, direction logits, depth data (e.g., a depth bin and/or a depth residual), and/or an instance segmentation associated with an object.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Duy A Tran whose telephone number is (571)272-4887. The examiner can normally be reached Monday-Friday 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ONEAL R MISTRY can be reached at (313)-446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DUY TRAN/ Examiner, Art Unit 2674
/ONEAL R MISTRY/ Supervisory Patent Examiner, Art Unit 2674