Prosecution Insights
Last updated: April 19, 2026
Application No. 17/542,497

OBJECT RECOGNITION METHOD AND APPARATUS

Status: Final Rejection (§103)

Filed: Dec 06, 2021
Examiner: ALFONSO, DENISE G
Art Unit: 2662
Tech Center: 2600 — Communications
Assignee: Huawei Technologies Co., Ltd.
OA Round: 4 (Final)

Grant Probability: 74% (Favorable)
Expected OA Rounds: 5-6
Median Time to Grant: 3y 1m
Grant Probability with Interview: 94%

Examiner Intelligence

Career Allow Rate: 74% (76 granted / 103 resolved; +11.8% vs TC avg; above average)
Interview Lift: +19.8% (allowance rate among resolved cases with an interview vs. without; a strong lift)
Avg Prosecution: 3y 1m (typical timeline), with 31 applications currently pending
Total Applications: 134 across all art units (career history)
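
The headline numbers above are simple ratios over the examiner's career counts. A minimal sketch of the presumed arithmetic in Python (the formulas and the with/without-interview split are our assumptions; only the counts and displayed percentages come from the cards above):

    # Presumed derivation of the examiner-intelligence figures (illustrative only).
    granted, resolved = 76, 103                # career counts from the card
    allow_rate = granted / resolved            # 0.7379... -> displayed as 74%

    # "+11.8% vs TC avg" implies a Tech Center average near 62% (not stated):
    tc_avg = allow_rate - 0.118

    # The interview-lift card compares resolved cases with vs. without an
    # interview; the displayed 94% and +19.8% imply a ~74% no-interview baseline:
    with_interview = 0.94
    without_interview = with_interview - 0.198

    print(f"allow rate {allow_rate:.1%}, implied TC avg {tc_avg:.1%}")

Note that the rounded card values do not reconcile exactly (94% minus 74% is 20 points, not 19.8), so the tool presumably computes the lift from unrounded rates.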

Statute-Specific Performance

§101: 8.3% (-31.7% vs TC avg)
§103: 59.8% (+19.8% vs TC avg)
§102: 19.4% (-20.6% vs TC avg)
§112: 8.1% (-31.9% vs TC avg)

Tech Center averages are estimates. Based on career data from 103 resolved cases.

Office Action

DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 06/26/2025 has been entered.

Response to Amendment

The amendment filed 10/30/2025 has been entered. Claims 1-4, 6, 9-12, 14, and new claims 15-19 remain pending in the application. Claims 5, 7-8, and 13 are cancelled.

Response to Arguments

Applicant's arguments with respect to claims 1, 9, and 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The limitations from original dependent claim 5 were added to independent claim 1, but this changes the scope of the independent claim because the original limitation "predicting, based on the feature of the region in which the 2D box is located, 3D information, mask information, or keypoint information of the task object of the task" was changed to "predicting, based on the feature of the region in which the 2D box is located, 3D information of the task object of the at least one task".

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

    A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-3, 6, 9-11, 14-17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over El-Khamy et al. (US 2019/0057507 A1, previously cited on an IDS filed by the applicant), hereinafter El-Khamy, in view of Kim et al., "Parallel Feature Pyramid Network for Object Detection" (2018), hereinafter Kim, in further view of Mousavian et al., "3D Bounding Box Estimation Using Deep Learning and Geometry" (2017), hereinafter Mousavian.

Claim 1 (Currently Amended)

El-Khamy discloses an object detection method (El-Khamy, Figs. 1A and 1B) comprising: receiving an input image (El-Khamy, Fig. 1B, input image); performing convolution processing on the input image (El-Khamy, Fig. 1A), and outputting a plurality of feature maps (El-Khamy, Fig. 1A, feature maps 450, 430, and 410) corresponding to the image, wherein feature maps in the plurality of feature maps have different resolutions (El-Khamy, Fig. 1A, feature maps 450, 430, and 410 have different resolutions and scales) and for at least one task (El-Khamy, [0045], “The training process may be end-to-end by multi-task learning, where the additional information provided by additional deep neural networks directs the instance masks to be of complete, stand-alone objects, which improves performance in case of clutter and occlusion”), detecting a task object in said each task based on the one or more plurality of feature maps (El-Khamy, [0044], “semantic segmentation of an image of a street scene may label all of the pixels associated with each car in the scene with the label of “car,” all of the pixels associated with a person on a bicycle with the label “bicycle,” and may label all of the pixels associated with people walking in the scene with the label of “pedestrian.” Furthermore, a semantic segmentation system may generate, for each separate instance of an object in the image (e.g., each instance of a car in the scene), a separate instance mask identifying the pixels of the image that correspond to the separate instance of the object. For example, if the semantic segmentation system detects three cars and two pedestrians in the image, five separate instance masks are output: one for each of the cars and one for each of the pedestrians”; [0058], “The segmentation masks 402 are then supplied to a pyramid segmentation network 500 to generate, at 2500, a segmentation mask 502 for a particular object, as generated from the separate masks generated at different resolutions (e.g., at the different resolutions of the multi-resolution feature maps 410, 430, and 450) by the segmentation mask prediction network 400.” Cars, bicycles, and pedestrians are all different task objects that are independently detected), and outputting a 2D box of a region (El-Khamy, Fig. 7 shows a 2D box bounding separate chairs; [0093], “As shown in 7100, the baseline system detects only one chair in the image, while FPN P3, FPN P2, and FPN P1 all detect two separate chairs, as shown in 7103, 7102, and 7101.”) in which the task object is located (El-Khamy, [0044] and [0058], quoted above; [0028], “computing an average bounding box from the belonging bounding boxes for the object detected in the image”; [0055], “the RPN 300 generates a plurality of detection boxes or bounding boxes (RPN BBoxes) 302 corresponding to the locations of individual features. Each of the detection boxes is defined by a plurality of box coordinates that identify a region of interest that corresponds to one of the objects in the image (e.g., the RPN generates a detection box for each object it detects in the image)”) and confidence corresponding to the 2D box (El-Khamy, [0046], “the embodiments of the present disclosure make use of information about the “completeness” of objects, such as by favoring the detection of the entirety of an object (e.g., by increasing the confidence score of the detection of entire objects), while discouraging only part of an object (e.g., by decreasing the confidence score of the detection of parts of objects), or multiple objects, or the union of parts belonging to different objects, to be considered one entity. Embodiments of the present disclosure further improve performance on small objects (e.g., objects that make up a small portion of the entire input image) by efficiently extracting information at different scales, and aggregating such information. Some embodiments of the present disclosure include an instance segmentation module that detects objects in an image and produces a corresponding category, belonging pixels, and confidence score”), wherein said task object is an object to be detected in the task (El-Khamy, [0044], quoted above), and a higher value of the confidence indicates a higher probability that the task object corresponding to the at least one task exists in the 2D box corresponding to the confidence (El-Khamy, [0046], quoted above; if the object is deemed completely detected, or is inside the bounding box, which is analogous to it existing, the confidence is higher; if the object is not completely detected, or only part of it is inside the bounding box, the confidence score is lower; [0071], “the RPN-based score refinement module 700 averages the predicted BBBoxes of all pixels that are classified to be inside the segmentation mask 502 to produce an average predicted BBBox. At 2730, the RPN-based score refinement module 700 computes an intersection over union (IoU) metric (area of intersection between a plurality of predicted BBBoxes divided by the area of their union) between the average predicted BBBox and the RPN BBox associated with this mask. At 2750, the RPN-based score refinement module 700 scales the scores of the masks in proportion to their corresponding IoU metric to generate a refined score for the instance mask”); extracting, based on a 2D box of the task object of the task, a feature of a region in which the 2D box is located from the one or more feature maps on the backbone (El-Khamy, [0014], “calculating a four feature vector representing the belonging bounding box of the pixel, the four feature vector including: a topmost pixel; a bottommost pixel; a leftmost pixel; and a rightmost pixel”).

El-Khamy does not explicitly disclose executing multiple different tasks in parallel based on the set of feature maps, including: for each task of the multiple different tasks: detecting a task object in said each based on the set of feature maps. However, Kim teaches outputting a set of feature maps corresponding to the image (Kim, Fig. 3, there are three feature maps that are processed in parallel; Abstract, “we adopt spatial pyramid pooling and some additional feature transformations to generate a pool of feature maps with different sizes”), wherein feature maps in the plurality of feature maps have different resolutions (Kim, Fig. 3; Abstract, quoted above; Section 1, “As shown in Fig. 1(d), we first employ the spatial pyramid pooling (SPP) to generate a wide FP pool with the feature maps of different sizes”) and executing multiple different tasks in parallel (Kim, Fig. 3; Section 1, “we apply additional feature abstraction to the feature maps of the FP pool in parallel, which makes all of them have similar levels of semantic abstraction. The multi-scale context aggregation (MSCA) modules then resize these feature maps to a uniform size and aggregate their contextual information to produce each level of the final FP.”), including: for each task of the multiple different tasks (Kim, Fig. 3): detecting a task object in said task based on the set of feature maps (Kim, Fig. 3, each feature map that is processed in parallel outputs a prediction subnet for each object, the three objects being bus, car, and person; Section 2, “We use 3 × 3 Conv layers to predict the locations of objects and their class labels. For box regression sub-network (Subnet), a 3 × 3 Conv layer with 4A filters is applied to each level of the FP to calculate the relative offset between the anchor and the predicted bounding box, where A is the number of anchors per location of the feature map. For classification, another 3 × 3 Conv layer with (K + 1)A filters followed by softmax is applied to predict the probability of an object being present at each spatial position for each of the A anchors and K object classes”), and outputting a 2D box of a region in which the task object is located (Kim, Fig. 3, e. Detection results), wherein the task object is to be detected in said each task (Kim, Fig. 3, each detected object has a 2D box, and the boxes are combined in the output).

El-Khamy and Kim are both considered to be analogous to the claimed invention because they are in the same field of object detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method as taught by El-Khamy to incorporate the teachings of Kim of executing multiple different tasks in parallel, including: for each task of the multiple different tasks: detecting a task object in said each based on the set of feature maps. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been that it can provide more accurate object information for large objects (Kim, Abstract).
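[Editor's note] For context on the Kim subnet quoted above, a minimal sketch of the per-level prediction head it describes: one 3 × 3 convolution with 4A filters regresses anchor offsets, and another with (K + 1)A filters (followed by softmax) classifies each anchor. This is our PyTorch illustration of the quoted description; the module and parameter names are ours, not Kim's code.

    import torch
    import torch.nn as nn

    class PredictionSubnet(nn.Module):
        """Kim-style head for one feature-pyramid level: 4A box-regression
        filters and (K + 1)A classification filters, both 3x3 convs."""
        def __init__(self, channels: int, A: int, K: int):
            super().__init__()
            self.box = nn.Conv2d(channels, 4 * A, kernel_size=3, padding=1)
            self.cls = nn.Conv2d(channels, (K + 1) * A, kernel_size=3, padding=1)

        def forward(self, feat):
            # Softmax over the K + 1 classes is applied downstream, per the quote.
            return self.box(feat), self.cls(feat)

    # The same head runs over each pyramid level in parallel.
    head = PredictionSubnet(channels=256, A=9, K=3)
    levels = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
    outputs = [head(f) for f in levels]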
The combination of El-Khamy in view of Kim does not explicitly disclose predicting, based on the feature of the region in which the 2D box is located, 3D information of the task object of the at least one task. However, Mousavian teaches predicting, based on the feature of the region in which the 2D box is located (Mousavian, Section 3, “Similar equations can be derived for the remaining 2D box side parameters xmax, ymin, ymax. In total the sides of the 2D bounding box provide four constraints on the 3D bounding box.”; Section 3.2, “Each side of the 2D detection box can correspond to any of the eight corners of the 3D box which results in 8^4 = 4096 configurations.”), 3D information (Mousavian, Fig. 1) of the task object of the at least one task (Mousavian, Abstract, “We present a method for 3D object detection and pose estimation from a single image” and “our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box”; Section 3, “In order to leverage the success of existing work on 2D object detection for 3D bounding box estimation, we use the fact that the perspective projection of a 3D bounding box should fit tightly within its 2D detection window. We assume that the 2D object detector has been trained to produce boxes that correspond to the bounding box of the projected 3D box.”).

El-Khamy, Kim, and Mousavian are all considered to be analogous to the claimed invention because they are in the same field of object detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method as taught by El-Khamy and Kim to incorporate the teachings of Mousavian of predicting, based on the feature of the region in which the 2D box is located, 3D information of the task object of the at least one task. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to estimate stable and accurate posed 3D bounding boxes without additional 3D shape models, or sampling strategies with complex pre-processing pipelines (Mousavian, Section 6).
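[Editor's note] The Mousavian passage quoted above treats 3D box estimation as a constrained fit: each of the four sides of the 2D detection box must touch the projection of one of the eight 3D box corners. A tiny sketch of the configuration count the quote mentions (illustrative only; the paper then prunes these candidates using regressed orientation and dimensions):

    from itertools import product

    # One of 8 corners can touch each of the 4 sides (xmin, ymin, xmax, ymax),
    # giving 8**4 = 4096 candidate correspondences before pruning.
    configs = list(product(range(8), repeat=4))
    assert len(configs) == 8 ** 4 == 4096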
Claim 2 (Currently Amended)

The combination of El-Khamy in view of Kim in further view of Mousavian discloses the object detection method according to claim 1 (El-Khamy, Figs. 1A and 1B), wherein the steps of detecting the task object in said each task (El-Khamy, [0044] and [0058], quoted above; cars, bicycles, and pedestrians are all different task objects that are independently detected) and outputting the 2D box and confidence corresponding to the 2D box (El-Khamy, [0028], [0055], and [0046], quoted above) comprise: predicting, on one or more feature maps, the region in which the task object is located, and outputting a candidate 2D box matching the region (El-Khamy, [0018], “a segmentation mask prediction network configured to calculate a plurality of segmentation masks for each detection box of the detection boxes at the multiscale resolutions of the feature maps”); extracting, based on the region in which the task object is located, a feature of a region in which the candidate 2D box is located from a feature map (El-Khamy, [0014], quoted above); performing convolution processing on the feature of the region in which the candidate 2D box is located, to obtain confidence that the candidate 2D box belongs to each object category (El-Khamy, Fig. 1A; [0046] and [0071], quoted above; if the object is deemed completely detected, or is inside the bounding box, the confidence is higher; if only part of it is inside the bounding box, the confidence score is lower), wherein the object category is an object category in a task (El-Khamy, [0044] and [0058], quoted above); adjusting coordinates of the candidate 2D box of the region through a neural network to obtain an adjusted 2D candidate box that matches a shape of an actual object better than the candidate 2D box does, and selecting the adjusted 2D candidate box, when confidence of the adjusted 2D candidate box is greater than a preset threshold, as a 2D box of the region (El-Khamy, [0070], “The bounding boxes (BBBoxes) 602 computed by the BBBox prediction network are supplied to a region proposal network (RPN) based score refinement module 700, which, at 2700, adjusts the confidence in the segmentation mask 502 generated by the pyramid segmentation network 500 based on the level of agreement between the segmentation mask 502 and the bounding boxes 602 to generate an adjusted segmentation mask 702.”; [0078], “At 2900, after both the FCIS detections 702 and density prediction 802 are obtained, the density based filtering module 900 according to one embodiment thresholds the detection confidence with a detection confidence threshold value to produce the final segmentation map and visualize its results. In more detail, the density based filtering module 900 may filter the adjusted mask instances 702 based on the calculated density metrics 802 in order to reduce or minimize the discrepancy between those calculations”).

Claim 3 (Previously Presented)

The combination of El-Khamy in view of Kim discloses the object detection method according to claim 2 (El-Khamy, Figs. 1A and 1B), wherein the 2D box is a rectangular box (El-Khamy, [0063], the BBBox prediction network 600 predicts the belonging instance positions for each pixel directly; at each pixel in the image, a vector defining the bounding box position of the instance it belongs to is predicted; in some embodiments, the vector includes the coordinates of the top left and bottom right corners, which gives the bounding boxes of these embodiments a rectangular shape; as such, the BBBox prediction network 600 computes a bounding box 602 for each pixel, represented as a 4-channel map with the resolution of the original image).

Claim 6 (Previously Presented)

The combination of El-Khamy in view of Kim in further view of Mousavian discloses the object detection method according to claim 1 (El-Khamy, Figs. 1A and 1B), wherein the step of independently detecting a task object in each task based on the feature maps (El-Khamy, [0044] and [0058], quoted above) comprises: detecting the region in which the task object is located on a low-resolution feature map when the object is a large object (El-Khamy, [0057], “earlier stage feature maps (e.g., from level 210) can be more robust to noise because they have lower resolutions than higher level feature maps, and may be better suited for recognizing large objects”) and on a high-resolution feature map when the object is a small object (El-Khamy, [0057], “As such the later stage feature maps may be better suited to detect certain kinds of objects, such as very small objects or objects which need a larger global view to understand their semantics”).

Claims 9-11 and 14 are rejected for similar reasons as those described in claims 1-3 and 5-6, respectively. The additional elements of Claims 9-11 and 14 that El-Khamy, Kim, and Mousavian disclose include: an object detection apparatus (El-Khamy, Fig. 1A; [0097], “electronic or electric devices and/or any other relevant devices or components”) comprising: a memory storing executable instructions (El-Khamy, [0097], “computer program instructions are stored in a memory which may be implemented in a computing device using a standard memory device, such as, for example, a random access memory (RAM)”) and a processor configured to execute the executable instructions to perform operations (El-Khamy, [0097], “the various components of these devices may be a process or thread, running on one or more processors”). The proposed combination, as well as the motivation for combining the El-Khamy, Kim, and Mousavian references, presented in the rejection of Claims 1-3 and 5-6 applies to Claims 9-11 and 14 and is incorporated herein by reference. Thus, the method recited in Claims 9-11 and 14 is met by El-Khamy, Kim, and Mousavian.

Claims 16-17 and 19 are rejected for similar reasons as those described in claims 1-3 and 6, respectively. The additional elements of Claims 16-17 and 19 that El-Khamy, Kim, and Mousavian disclose include: a system (El-Khamy, Fig. 1A) comprising: a backbone (El-Khamy, Fig. 1A; [0050], “a fully convolutional instance semantic segmentation (FCIS) core network 100, which, at 2100, processes the initial image to extract core neural network features 102 from an input image 20 (e.g., a bitmap image of a scene containing one or more objects, such as a photograph of a street)”), and a plurality of parallel headers coupled to the backbone and at least one serial header (El-Khamy, Fig. 1A shows multiple parallel headers and serial headers), wherein the backbone (El-Khamy, Fig. 1A) is configured to. The proposed combination, as well as the motivation for combining the El-Khamy, Kim, and Mousavian references, presented in the rejection of Claims 1-3 and 6 applies to Claims 16-17 and 19 and is incorporated herein by reference. Thus, the system recited in Claims 16-17 and 19 is met by El-Khamy, Kim, and Mousavian.

Claims 4, 12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over El-Khamy in view of Kim in further view of Mousavian, further in view of Chen et al., "Multi-task learning for dangerous object detection in autonomous driving" (2018), hereinafter Chen.

Claim 4 (Previously Presented)

The combination of El-Khamy in view of Kim in further view of Mousavian discloses the object detection method according to claim 2 (El-Khamy, Figs. 1A and 1B), wherein the steps of predicting the region in which the task object is located and outputting the candidate 2D box matching the region (El-Khamy, [0018], quoted above) comprise: predicting, based on an anchor of an object corresponding to a task, a region in which the task object exists on the one or more feature maps provided by the backbone, to obtain a proposal (El-Khamy, [0012], “The detection boxes may be calculated by supplying the core instance features to a region proposal network”; [0056], “The segmentation mask head is a fully convolutional deep neural network that is trained to predict a segmentation mask for each box proposal 302 from the RPN 300, and for each object class. The segmentation mask prediction network 400 (or segmentation mask head) is configured to predict a segmentation mask from a cropped feature map corresponding to an RPN bounding box (e.g., a portion of a feature map, as cropped by an RPN bounding box), either by a one shot prediction or for each grid cell after pooling the feature map crop corresponding to the RPN into a fixed-size grid of cells.”), and outputting the candidate 2D box matching the proposal category (El-Khamy, Fig. 1A; [0046] and [0071], quoted above), wherein the anchor is obtained based on a statistical feature of the task object to which the anchor belongs, and the statistical feature comprises a size of the object (El-Khamy, [0092], “Another is to selectively use specific layer based on the object size or detection size. In still another embodiment, the heads on all pyramid layers are applied (see, e.g., FIG. 4(c), where P2 and P3 are both used). The heads share the same weights, take each feature pyramid layer as input and produce different-sized score maps as output. The pooling layers in either Mask-RCNN or FCIS produce fixed-size score maps or feature maps (e.g., 21×21), and may be applied to all the differently-sized output maps to get one fixed size map for each region-of-interest, and each scale.” It is based on the size of the object).

The combination of El-Khamy in view of Kim in further view of Mousavian does not explicitly disclose that the statistical feature comprises a shape and a size of the object. However, Chen teaches that the statistical feature comprises a shape of the object (Chen, Section 4.1, “Considering the different shapes of objects, there are a set of default bounding boxes with different aspect ratios in each position of a feature map. For a feature map with the size of w × h, if there are k default bounding boxes in each position, the feature map has w × h × k default bounding boxes in all. A default bounding box with a specific aspect ratio can be responsive to the area with the same specific aspect ratio of the input image. All the detections for w × h × k default bounding boxes can cover various shape objects of input images”).

El-Khamy, Kim, Mousavian, and Chen are all considered to be analogous to the claimed invention because they are in the same field of object detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method as taught by El-Khamy and Kim to incorporate the teachings of Chen that the statistical feature comprises a shape and a size of the object. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to achieve better object detection by allowing objects of various shapes to be covered by the bounding boxes.

Claim 12 is rejected for similar reasons as those described in claim 4. The additional elements of Claim 12 that El-Khamy, Kim, Mousavian, and Chen disclose include: an object detection apparatus (El-Khamy, Fig. 1A; [0097], quoted above) comprising: a memory storing executable instructions and a processor configured to execute the executable instructions to perform operations (El-Khamy, [0097], quoted above). The proposed combination, as well as the motivation for combining the El-Khamy, Kim, Mousavian, and Chen references, presented in the rejection of Claim 4 applies to Claim 12 and is incorporated herein by reference. Thus, the apparatus recited in Claim 12 is met by El-Khamy, Kim, Mousavian, and Chen.

Claim 18 is rejected for similar reasons as those described in claim 4. The additional elements of Claim 18 that El-Khamy, Kim, Mousavian, and Chen disclose include: a system (El-Khamy, Fig. 1A) comprising: a backbone (El-Khamy, Fig. 1A; [0050], quoted above), and a plurality of parallel headers coupled to the backbone and at least one serial header (El-Khamy, Fig. 1A shows multiple parallel headers and serial headers), wherein the backbone (El-Khamy, Fig. 1A) is configured to. The proposed combination, as well as the motivation for combining the El-Khamy, Kim, Mousavian, and Chen references, presented in the rejection of Claim 4 applies to Claim 18 and is incorporated herein by reference. Thus, the system recited in Claim 18 is met by El-Khamy, Kim, Mousavian, and Chen.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DENISE G ALFONSO, whose telephone number is (571) 272-1360. The examiner can normally be reached Monday-Friday, 7:30-5:30. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Amandeep Saini, can be reached at (571) 272-3382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DENISE G ALFONSO/
Examiner, Art Unit 2662

/AMANDEEP SAINI/
Supervisory Patent Examiner, Art Unit 2662
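
[Editor's note] Of the passages the rejection quotes, the El-Khamy [0071] score-refinement step is the most mechanical, so it is worth seeing concretely. A minimal reconstruction of that quoted description (our sketch, not the reference's code; boxes are assumed to be axis-aligned (x1, y1, x2, y2) tuples, and the helper names are ours):

    from typing import List, Tuple

    Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned

    def iou(a: Box, b: Box) -> float:
        """Intersection-over-union: intersection area divided by union area."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def refine_mask_score(pixel_boxes: List[Box], rpn_box: Box, score: float) -> float:
        """Average the per-pixel predicted boxes for pixels inside the mask,
        then scale the mask's score by the IoU between that average box and
        the RPN box, as described in El-Khamy [0071]."""
        if not pixel_boxes:
            return 0.0
        n = len(pixel_boxes)
        avg = tuple(sum(b[i] for b in pixel_boxes) / n for i in range(4))
        return score * iou(avg, rpn_box)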

Prosecution Timeline

Dec 06, 2021: Application Filed
Aug 24, 2024: Non-Final Rejection (§103)
Nov 26, 2024: Response Filed
Feb 13, 2025: Final Rejection (§103)
May 19, 2025: Response after Non-Final Action
Jun 26, 2025: Request for Continued Examination
Jun 27, 2025: Response after Non-Final Action
Jul 26, 2025: Non-Final Rejection (§103)
Oct 30, 2025: Response Filed
Feb 12, 2026: Final Rejection (§103), current

Precedent Cases

Applications with similar technology granted by the same examiner

Patent 12586352: IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD AND STORAGE MEDIUM (2y 5m to grant; granted Mar 24, 2026)
Patent 12579693: ELECTRONIC SHELF LABEL MANAGING SERVER, DISPLAY DEVICE AND CONTROLLING METHOD THEREOF (2y 5m to grant; granted Mar 17, 2026)
Patent 12555371: VISION TRANSFORMER FOR MOBILENET SIZE AND SPEED (2y 5m to grant; granted Feb 17, 2026)
Patent 12541980: METHOD FOR DETERMINING OBJECT INFORMATION RELATING TO AN OBJECT IN A VEHICLE ENVIRONMENT, CONTROL UNIT AND VEHICLE (2y 5m to grant; granted Feb 03, 2026)
Patent 12541941: A Method for Testing an Embedded System of a Device, a Method for Identifying a State of the Device and a System for These Methods (2y 5m to grant; granted Feb 03, 2026)
Study what changed in these cases to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 74%
With Interview: 94% (+19.8%)
Median Time to Grant: 3y 1m
PTA Risk: High
Based on 103 resolved cases by this examiner. Grant probability derived from career allow rate.
