Prosecution Insights
Last updated: April 19, 2026
Application No. 18/211,522

SYSTEM AND METHOD OF GENERATING BOUNDING POLYGONS

Status: Non-Final OA (§103)
Filed: Jun 19, 2023
Examiner: TRAN, TAN H
Art Unit: 2141
Tech Center: 2100 — Computer Architecture & Software
Assignee: Plainsight Technologies Inc.
OA Round: 3 (Non-Final)

Grant Probability: 60% (Moderate)
Expected OA Rounds: 3-4
Expected Time to Grant: 3y 6m
Grant Probability with Interview: 92%

Examiner Intelligence

Career Allow Rate: 60% (184 granted / 307 resolved; +4.9% vs TC avg)
Interview Lift: +31.8% (strong; allowance rate of resolved cases with an interview vs. without)
Typical Timeline: 3y 6m avg prosecution
Career History: 367 total applications across all art units; 60 currently pending
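The headline figures above follow from simple arithmetic on the underlying counts. A minimal sketch (the counts are taken from this report; the helper name is ours):

```python
# Recompute the examiner statistics shown above from the raw counts
# reported here: 184 granted of 307 resolved, 367 total applications.

def allow_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved

rate = allow_rate(184, 307)   # 59.9% once rounded, displayed as 60%
pending = 367 - 307           # total applications minus resolved cases
print(round(rate, 1), pending)  # 59.9 60
```

The 60-case gap between total applications and resolved cases matches the "60 currently pending" figure, which suggests the dashboard treats every non-resolved application as pending.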

Statute-Specific Performance

§101: 14.4% (-25.6% vs TC avg)
§103: 55.3% (+15.3% vs TC avg)
§102: 19.2% (-20.8% vs TC avg)
§112: 6.1% (-33.9% vs TC avg)

Tech Center averages are estimates (shown as the black line in the original chart). Based on career data from 307 resolved cases.
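Subtracting each "vs TC avg" delta from the corresponding statute rate recovers the Tech Center baseline the chart plots. A hedged check (values read off the table above; the interpretation that delta = rate − baseline is our assumption):

```python
# Recover the Tech Center average estimate for each statute from the
# reported rate and its "vs TC avg" delta. If delta = rate - baseline,
# then baseline = rate - delta.

rates  = {"§101": 14.4, "§103": 55.3, "§102": 19.2, "§112": 6.1}
deltas = {"§101": -25.6, "§103": +15.3, "§102": -20.8, "§112": -33.9}

tc_avg = {s: round(rates[s] - deltas[s], 1) for s in rates}
print(tc_avg)  # every statute implies the same 40.0 baseline
```

All four statutes back out to the same 40.0% estimate, which suggests the report benchmarks against a single TC-wide figure rather than per-statute averages.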

Office Action

§103
Notice of Pre-AIA or AIA Status

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Continued Examination Under 37 CFR 1.114

2. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 5/13/2025 has been entered. Claims 2, 6, and 15 have been amended. Claim 22 has been added. Claims 2-22 remain pending in the application.

Information Disclosure Statement

3. The information disclosure statement (IDS) submitted on 5/13/2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments

4. Applicant’s arguments with respect to the claims have been considered but are moot in view of the new ground of rejection. See the rejections below for details.

Claim Rejections – 35 USC § 103

5. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

6. Claims 2, 6-7, 10, 15-17, and 19 are rejected under 35 U.S.C.
103 as being unpatentable over Cao et al. (U.S. Patent Application Pub. No. US 20210365717 A1) in view of Chen et al. (U.S. Patent Application Pub. No. US 20180253622 A1), and further in view of Farooqi et al. (U.S. Patent Application Pub. No. US 20180150713 A1).

Claim 2: Cao teaches a system comprising: at least one processor (i.e. a non-transitory computer-readable storage medium storing a plurality of processor executable instructions; para. [0009]); a first memory with instructions stored thereon that, when executed by the at least one processor, cause the at least one processor to (i.e. The apparatus may include a memory operable to store computer-readable instructions and a processor operable to read the computer-readable instructions; para. [0007]): extract, from a first image, a first portion that is representative of content within a bounding shape that is arranged around a depiction of a first object of a particular type (i.e. FIG. 1, an example in which an apparatus for segmenting a medical image is integrated in an electronic device 100 is used. The electronic device 100 may obtain a slice pair 102 (the slice pair including two slices 103 and 104 by sampling a to-be-segmented medical image 101), perform feature extraction 105 on each slice in the slice pair by using different receptive fields, to obtain high-level feature information 107 and 108 and low-level feature information 106 and 109 of the each slice. In one aspect, the electronic device 100 then segments, for each slice in the slice pair, a target object in the slice according to the low-level feature information 106 and 109 and the high-level feature information 107 and 108 of the slice, to obtain an initial segmentation result 111 and 113 of the slice; para. [0036]); establish one or more high-level features and one or more low-level features of the first portion (i.e. FIG. 1, an example in which an apparatus for segmenting a medical image is integrated in an electronic device 100 is used. The electronic device 100 may obtain a slice pair 102 (the slice pair including two slices 103 and 104 by sampling a to-be-segmented medical image 101), perform feature extraction 105 on each slice in the slice pair by using different receptive fields, to obtain high-level feature information 107 and 108 and low-level feature information 106 and 109 of the each slice. In one aspect, the electronic device 100 then segments, for each slice in the slice pair, a target object in the slice according to the low-level feature information 106 and 109 and the high-level feature information 107 and 108 of the slice, to obtain an initial segmentation result 111 and 113 of the slice; para. [0036]); apply Atrous Spatial Pyramid Pooling (ASPP) to the one or more high-level features of the first portion to aggregate the one or more high-level features as aggregate features (i.e. The high-level feature information corresponding to the first slice sample and the high-level feature information corresponding to the second slice sample may be further processed by using ASPP, to obtain high-level feature information in more different dimensions, referring to FIG. 8; para. [0141]); up-sample the aggregate features (i.e. upsamples the high-level feature information; para. [0173]); apply a convolution to the one or more low-level features (i.e. The electronic device performs convolution with a convolution kernel of “1×1” on the low-level feature information 806 and the high-level feature information 808 of the first slice 801 by using the first segmentation network branch 803, upsamples the high-level feature information obtained after convolution to have the same size as the low-level feature information obtained after convolution, concatenates the upsampled high-level feature information and low-level feature information obtained after convolution, to obtain the concatenated feature information of the first slice 801; para.
[0173]); concatenate the aggregate features after upsampling with the one or more low-level features after convolution to form combined features (i.e. concatenates the upsampled high-level feature information and low-level feature information obtained after convolution, to obtain the concatenated feature information of the first slice 801; para. [0173]); and segment the combined features to generate a first polygonal shape outline along first outer boundaries of the first object in the first portion (i.e. figs. 8-10, performs convolution with a convolution kernel of “3×3” on the concatenated feature information, and then upsamples the concatenated feature information obtained after convolution to obtain a size of the first slice, so that the initial segmentation result 814 of the first slice 801 can be obtained; para. [0173]); and a second memory with instructions stored thereon that, when executed by the at least one processor, cause the at least one processor to (i.e. The apparatus may include a memory operable to store computer-readable instructions and a processor operable to read the computer-readable instructions; para. [0007]): receive a second image that includes a depiction of a second object of the particular type (i.e. FIG. 1, an example in which an apparatus for segmenting a medical image is integrated in an electronic device 100 is used. The electronic device 100 may obtain a slice pair 102 (the slice pair including two slices 103 and 104 by sampling a to-be-segmented medical image 101), perform feature extraction 105 on each slice in the slice pair by using different receptive fields, to obtain high-level feature information 107 and 108 and low-level feature information 106 and 109 of the each slice. 
In one aspect, the electronic device 100 then segments, for each slice in the slice pair, a target object in the slice according to the low-level feature information 106 and 109 and the high-level feature information 107 and 108 of the slice, to obtain an initial segmentation result 111 and 113 of the slice; para. [0036]); apply at least a first convolutional neural network to the second image to generate one or more feature maps (i.e. The electronic device performs convolution with a convolution kernel of “1×1” on the low-level feature information 806 and the high-level feature information 808 of the first slice 801 by using the first segmentation network branch 803, upsamples the high-level feature information obtained after convolution to have the same size as the low-level feature information obtained after convolution, concatenates the upsampled high-level feature information and low-level feature information obtained after convolution, to obtain the concatenated feature information of the first slice 801; para. [0173]); generate a plurality of regions of interest, each of which is representative of a non-rectangular, polygonal shape (i.e. The image segmentation is a technology and a process of segmenting an image into several particular regions having special properties, and specifying a target of interest. This embodiment of this disclosure is mainly to segment a three-dimensional medical image and find a required target object. For example, a 3D medical image is divided in a z-axis direction into a plurality of single-frame slices (referred to as slices for short). A liver region or the like is then segmented from the slices. After segmentation results of all slices in the 3D medical image are obtained, these segmentation results are combined in the z-axis direction, so that a 3D segmentation result corresponding to a 3D medical image may be obtained. That is, the target object is, for example, a 3D form of the liver region. 
The segmented target object may be subsequently analyzed by a medical care person or another medical expert for further operation; para. [0035]); predict segmentation masks on at least a subset of the plurality of regions of interest in a pixel-to-pixel manner (i.e. A pixel belonging to a target object in the first slice sample is then selected according to the concatenated feature information, to obtain the first predicted segmentation value of the slice sample. For example, convolution with a convolution kernel of “3×3” may be specifically performed on the concatenated feature information, and upsampling is performed to obtain a size of the first slice sample, so that the first predicted segmentation value of the slice sample can be obtained; para. [0116]). Cao does not explicitly teach slide a first window across the one or more feature maps to obtain a plurality of anchor shapes using a region proposal network; determine whether each anchor shape of the plurality of anchor shapes contains an object to generate a plurality of regions of interest, each of which is representative of a non-rectangular, polygonal shape; produce classifications for objects, if any, in each region of interest using a second convolutional neural network, trained in part using the first polygonal shape outline of the first object; and identify individual objects of the second image based on the classifications and the segmentation masks; wherein the first portion is a cropped portion of the first image; wherein the first polygonal shape outline is defined by lines and/or curves that collectively surround the first object. However, Chen teaches slide a first window across the one or more feature maps to obtain a plurality of anchor shapes using a region proposal network (i.e. 
in the instance-level semantic segmentation sub-network, technologies such as R-FCN, MNC (e.g., or ROI-warping from MNC, as described herein), and/or the like, can be used, as described, for region proposals, to generate the instance masks, etc. For example, region proposals (e.g., regions of interest (RoI) 232) for the layer 202 can be generated from RPN 230, and a customized RoI classifier 210 is provided to classify the region proposals. In an example, a last layer 240 of the convolutional blocks (e.g., conv5, which may include 2048 channels in an example) can be convolved with a 1×1 convolutional layer to generate a feature map (e.g., a 1024-channel feature map). Then, k.sup.2 (C+1) channels feature maps, also referred to as detection position-sensitive score maps 250, can be generated, where the +1 can be for the background class and a total of C categories. The k.sup.2 can correspond to a k×k spatial grid, where the cell in the grid encodes the relative positions (e.g., top-left and bottom-right). In one example, k can be set to 7. In an example, the detection position-sensitive score maps can be generated for each RoI 232 in the image provided as output from the RPN 230. A pooling operation (e.g., position sensitive pooling 242) can be applied to the detection position-sensitive score maps to obtain a C+1-dimensional vector for each RoI 232; para. [0037]); determine whether each anchor shape of the plurality of anchor shapes contains an object to generate a plurality of regions of interest, each of which is representative of a non-rectangular, polygonal shape (i.e. in the instance-level semantic segmentation sub-network, technologies such as R-FCN, MNC (e.g., or ROI-warping from MNC, as described herein), and/or the like, can be used, as described, for region proposals, to generate the instance masks, etc. 
For example, region proposals (e.g., regions of interest (RoI) 232) for the layer 202 can be generated from RPN 230, and a customized RoI classifier 210 is provided to classify the region proposals. In an example, a last layer 240 of the convolutional blocks (e.g., conv5, which may include 2048 channels in an example) can be convolved with a 1×1 convolutional layer to generate a feature map (e.g., a 1024-channel feature map). Then, k.sup.2 (C+1) channels feature maps, also referred to as detection position-sensitive score maps 250, can be generated, where the +1 can be for the background class and a total of C categories. The k.sup.2 can correspond to a k×k spatial grid, where the cell in the grid encodes the relative positions (e.g., top-left and bottom-right). In one example, k can be set to 7. In an example, the detection position-sensitive score maps can be generated for each RoI 232 in the image provided as output from the RPN 230. A pooling operation (e.g., position sensitive pooling 242) can be applied to the detection position-sensitive score maps to obtain a C+1-dimensional vector for each RoI 232; para. [0037]); produce classifications for objects, if any, in each region of interest using a second convolutional neural network, trained in part using the first polygonal shape outline of the first object (i.e. method 400 may optionally include, at block 414, training the convolutional network based on the segmentations and/or the feedback. In an aspect, segmentation component 306, e.g., in conjunction with processor 302, memory 304, etc., can train the convolutional network based on the segmentations and/or the feedback. As described, segmentation component 306 can incorporate the feature maps 212, instance masks 214, etc. into the fully convolutional network to provide additional comparisons for determining categories and/or identifiable regions of input images; para. 
[0057]); predict segmentation masks on at least a subset of the plurality of regions of interest in a pixel-to-pixel manner (i.e. generate masks of instances of objects in the image, etc. As described above, and further herein, category-level semantic segmentation can relate to a process for analyzing pixels in an image and assigning a label for each pixel, where the label may be indicative of an object type or category; para. [0017]); and identify individual objects of the second image based on the classifications and the segmentation masks (i.e. the classification of each instance mask is determined by the RoI classifier 210. To further boost the performance of the ROI classifier 210, the feature maps 212 can be stacked into the layers of RoI classifier 210, which may include stacking the feature maps 212 using a pooling operation (e.g., a position sensitive pooling (PSP) 238, a compact bilinear pooling 244 or other pooling or fusion operation, etc.); para. [0038]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filling date of the claimed invention to modify the invention of Cao to include the feature of Chen. One would have been motivated to make this modification because it improves detection and classification accuracy. However, Farooqi teaches wherein the first portion is a cropped portion of the first image (i.e. FIG. 2 is a process flow diagram 200 that illustrates a variation of FIG. 1 in which the cropped RGB image 160 is subject to further processing. It will be appreciated that with the example of FIG. 2, similar processes can be applied to the RGB-D image 170 and the example uses only RGB image data 160 to simplify the explanation. Similar to with FIG. 1, RGB-D data is received 110 and then bifurcated into an RGB image 150 and a depth channel image 120 so that depth segmentation 130 can be applied to the depth channel image 120. 
This depth segmentation 130 is used to define bounding polygons 140 are then subsequently applied to the RGB image 150 so that the RGB image 150 can be made into a cropped RGB image 160; para. [0032]); wherein the first polygonal shape outline (i.e. more than two object localization techniques can be used. Further, in some variations, the object localization techniques can be performed in sequence and/or partially in parallel. The first and second set of proposed bounding polygons (in some cases only one bounding polygon is identified by one of the localization techniques) are then analyzed to determine an intersection of union or other overlap across the first and second sets of proposed bounding polygons 230. Based on this determination, at least one optimal bounding polygon 240 is determined. This optimal bounding polygon 240 can then be used for subsequent image processing including classification of any encapsulated objects within the optimal bounding polygon 240 as applied to the cropped RGB image 160; para. [0032-0034]) is defined by lines and/or curves that collectively surround the first object (i.e. fig. 3, As is illustrated in image 340, a bounding polygon 342 can then be generated that encapsulates the foreground object. The image data encapsulated by the various edges of the bounding polygon 342 can then be subjected to further image processing including, without limitation, classification of the objects; para. [0035]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Cao and Chen to include the feature of Farooqi. One would have been motivated to make this modification because the outline captures the true geometry of the object more accurately, so the CNN learns from precise, noise-reduced labels.
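The decoder steps the examiner maps onto Cao for claim 2 (up-sample the aggregated high-level features, convolve the low-level features, concatenate, then segment into an outline) can be illustrated with a toy sketch. This is not the applicant's or Cao's code: ASPP and the learned 1×1/3×3 convolutions are replaced by hand-made feature maps and a fixed threshold, purely to show the data flow on plain 2D lists.

```python
# Toy sketch of the claim 2 decoder data flow: coarse high-level features
# are upsampled to the low-level resolution, combined pixel-wise with the
# low-level features, and thresholded into a binary mask whose boundary
# stands in for the polygonal outline. All values are illustrative.

def upsample2x(fm):
    """Nearest-neighbor 2x upsampling of a 2D feature map."""
    out = []
    for row in fm:
        stretched = [v for v in row for _ in (0, 1)]
        out.append(stretched)
        out.append(list(stretched))
    return out

def concat(a, b):
    """Pixel-wise concatenation of two equally sized maps (two 'channels')."""
    return [[(a[i][j], b[i][j]) for j in range(len(a[0]))]
            for i in range(len(a))]

def segment(fm, thresh=1.0):
    """Mark a pixel as foreground when its summed channels exceed thresh."""
    return [[1 if sum(px) > thresh else 0 for px in row] for row in fm]

high = [[0.9, 0.1],            # coarse "aggregate" high-level features
        [0.1, 0.1]]
low = [[0.8, 0.7, 0.0, 0.0],   # fine low-level features (post-convolution)
       [0.6, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0]]

mask = segment(concat(upsample2x(high), low))
print(mask)  # [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

The foreground region lands only where high-level evidence (object presence) and low-level evidence (edge detail) agree, which is the rationale for the concatenation step recited in the claim.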
Claim 6: Cao teaches a computer-implemented method comprising: acquiring one or more high-level features and one or more low-level features that are established for an image that depicts an object of a given type (i.e. FIG. 1, an example in which an apparatus for segmenting a medical image is integrated in an electronic device 100 is used. The electronic device 100 may obtain a slice pair 102 (the slice pair including two slices 103 and 104 by sampling a to-be-segmented medical image 101), perform feature extraction 105 on each slice in the slice pair by using different receptive fields, to obtain high-level feature information 107 and 108 and low-level feature information 106 and 109 of the each slice. In one aspect, the electronic device 100 then segments, for each slice in the slice pair, a target object in the slice according to the low-level feature information 106 and 109 and the high-level feature information 107 and 108 of the slice, to obtain an initial segmentation result 111 and 113 of the slice; para. [0036]); applying Atrous Spatial Pyramid Pooling (ASPP) to the one or more high-level features of the image to aggregate the one or more high-level features as aggregate features (i.e. The high-level feature information corresponding to the first slice sample and the high-level feature information corresponding to the second slice sample may be further processed by using ASPP, to obtain high-level feature information in more different dimensions, referring to FIG. 8; para. [0141]); concatenating the aggregate features with the one or more low-level features to form combined features (i.e. 
The electronic device performs convolution with a convolution kernel of “1×1” on the low-level feature information 806 and the high-level feature information 808 of the first slice 801 by using the first segmentation network branch 803, upsamples the high-level feature information obtained after convolution to have the same size as the low-level feature information obtained after convolution, concatenates the upsampled high-level feature information and low-level feature information obtained after convolution, to obtain the concatenated feature information of the first slice 801; para. [0173]); segmenting the combined features to generate a polygonal shape outline along outer boundaries of the object in the image (i.e. figs. 8-10, performs convolution with a convolution kernel of “3×3” on the concatenated feature information, and then upsamples the concatenated feature information obtained after convolution to obtain a size of the first slice, so that the initial segmentation result 814 of the first slice 801 can be obtained; para. [0173]), and incorporating the polygonal shape outline into a dataset that is used to train a convolutional neural network that is used to classify objects of the given type upon being applied to images (i.e. the preset image segmentation model may be converged by using the true values annotated in the slice sample pair, the predicted segmentation values of the slice samples in the slice sample pair, and the predicted correlation information, to obtain the trained image segmentation model; para. [0144]). Cao does not explicitly teach incorporating the outline into a dataset that is used to train a convolutional neural network that is used to classify objects; wherein the polygonal shape outline is defined by lines and/or curves that collectively surround the object in the image. However, Chen teaches incorporating the polygonal shape outline into a dataset that is used to train a convolutional neural network that is used to classify objects of the given type upon being applied to images (i.e.
method 400 may optionally include, at block 414, training the convolutional network based on the segmentations and/or the feedback. In an aspect, segmentation component 306, e.g., in conjunction with processor 302, memory 304, etc., can train the convolutional network based on the segmentations and/or the feedback. As described, segmentation component 306 can incorporate the feature maps 212, instance masks 214, etc. into the fully convolutional network to provide additional comparisons for determining categories and/or identifiable regions of input images; para. [0057]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Cao to include the feature of Chen. One would have been motivated to make this modification because it improves detection and classification accuracy. However, Farooqi teaches wherein the polygonal shape outline (i.e. more than two object localization techniques can be used. Further, in some variations, the object localization techniques can be performed in sequence and/or partially in parallel. The first and second set of proposed bounding polygons (in some cases only one bounding polygon is identified by one of the localization techniques) are then analyzed to determine an intersection of union or other overlap across the first and second sets of proposed bounding polygons 230. Based on this determination, at least one optimal bounding polygon 240 is determined. This optimal bounding polygon 240 can then be used for subsequent image processing including classification of any encapsulated objects within the optimal bounding polygon 240 as applied to the cropped RGB image 160; para. [0032-0034]) is defined by lines and/or curves that collectively surround the object in the image (i.e. fig. 3, As is illustrated in image 340, a bounding polygon 342 can then be generated that encapsulates the foreground object. The image data encapsulated by the various edges of the bounding polygon 342 can then be subjected to further image processing including, without limitation, classification of the objects; para. [0035]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Cao and Chen to include the feature of Farooqi. One would have been motivated to make this modification because the outline captures the true geometry of the object more accurately, so the CNN learns from precise, noise-reduced labels.

Claim 7: Cao, Chen, and Farooqi teach the computer-implemented method of claim 6. Cao further teaches comprising: applying a convolution to the one or more low-level features (i.e. The electronic device performs convolution with a convolution kernel of “1×1” on the low-level feature information 806 and the high-level feature information 808 of the first slice 801 by using the first segmentation network branch 803, upsamples the high-level feature information obtained after convolution to have the same size as the low-level feature information obtained after convolution, concatenates the upsampled high-level feature information and low-level feature information obtained after convolution, to obtain the concatenated feature information of the first slice 801; para. [0173]); and concatenating the one or more low-level features after convolution with the aggregate features to form combined features (i.e. concatenates the upsampled high-level feature information and low-level feature information obtained after convolution, to obtain the concatenated feature information of the first slice 801; para. [0173]).

Claim 10: Cao, Chen, and Farooqi teach the computer-implemented method of claim 6. Cao further teaches up-sampling the polygonal shape outline (i.e. the upsampled high-level feature information and low-level feature information obtained after convolution are concatenated, to obtain concatenated feature information of the first slice sample; para. [0116]).

Claim 15: Cao teaches a non-transitory, computer-readable storage medium with instructions stored thereon that, when executed by at least one data processor of a system (i.e. a non-transitory computer-readable storage medium storing a plurality of processor executable instructions; para. [0009]), cause the system to: receive a first image that includes a depiction of a first object of a particular type (i.e. the to-be-segmented medical image may be provided to the apparatus for segmenting a medical image after each medical image acquisition device performs image acquisition on biological tissue (for example, heart or liver). The medical image acquisition device may include an electronic device such as a magnetic resonance imaging (MRI) scanner, a CT scanner, a colposcope, or an endoscope; para. [0043]); generate a set of regions of interest, wherein each region of interest is representative of a non-rectangular, polygonal shape (i.e. The image segmentation is a technology and a process of segmenting an image into several particular regions having special properties, and specifying a target of interest. This embodiment of this disclosure is mainly to segment a three-dimensional medical image and find a required target object. For example, a 3D medical image is divided in a z-axis direction into a plurality of single-frame slices (referred to as slices for short). A liver region or the like is then segmented from the slices. After segmentation results of all slices in the 3D medical image are obtained, these segmentation results are combined in the z-axis direction, so that a 3D segmentation result corresponding to a 3D medical image may be obtained. That is, the target object is, for example, a 3D form of the liver region. The segmented target object may be subsequently analyzed by a medical care person or another medical expert for further operation; para. [0035]); the polygonal shape outline having been generated by: establishing one or more high-level features and one or more low-level features of the second image (i.e. FIG. 1, an example in which an apparatus for segmenting a medical image is integrated in an electronic device 100 is used. The electronic device 100 may obtain a slice pair 102 (the slice pair including two slices 103 and 104 by sampling a to-be-segmented medical image 101), perform feature extraction 105 on each slice in the slice pair by using different receptive fields, to obtain high-level feature information 107 and 108 and low-level feature information 106 and 109 of the each slice. In one aspect, the electronic device 100 then segments, for each slice in the slice pair, a target object in the slice according to the low-level feature information 106 and 109 and the high-level feature information 107 and 108 of the slice, to obtain an initial segmentation result 111 and 113 of the slice; para. [0036]), applying Atrous Spatial Pyramid Pooling (ASPP) to the one or more high-level features of the second image to aggregate the one or more high-level features as aggregate features (i.e. The high-level feature information corresponding to the first slice sample and the high-level feature information corresponding to the second slice sample may be further processed by using ASPP, to obtain high-level feature information in more different dimensions, referring to FIG. 8; para. [0141]), concatenating the aggregate features with the one or more low-level features to form combined features (i.e.
The electronic device performs convolution with a convolution kernel of “1×1” on the low-level feature information 806 and the high-level feature information 808 of the first slice 801 by using the first segmentation network branch 803, upsamples the high-level feature information obtained after convolution to have the same size as the low-level feature information obtained after convolution, concatenates the upsampled high-level feature information and low-level feature information obtained after convolution, to obtain the concatenated feature information of the first slice 801; para. [0173]), and segmenting the combined features to generate the polygonal shape outline along outer boundaries of the second object in the second image (i.e. figs. 8-10, performs convolution with a convolution kernel of “3×3” on the concatenated feature information, and then upsamples the concatenated feature information obtained after convolution to obtain a size of the first slice, so that the initial segmentation result 814 of the first slice 801 can be obtained; para. [0173]), predict segmentation masks on at least a subset of the set of regions of interest in a pixel- to-pixel manner (i.e. A pixel belonging to a target object in the first slice sample is then selected according to the concatenated feature information, to obtain the first predicted segmentation value of the slice sample. For example, convolution with a convolution kernel of “3×3” may be specifically performed on the concatenated feature information, and upsampling is performed to obtain a size of the first slice sample, so that the first predicted segmentation value of the slice sample can be obtained; para. [0116]). 
Cao does not explicitly teach generate a set of regions of interest; produce classifications for objects, if any, in each region of interest using a convolutional neural network trained in part using a polygonal shape outline of a second object depicted in a second image; identify objects of the first image based on classifications; wherein the first portion is a cropped portion of the first image; wherein the polygonal shape outline is defined by lines and/or curves that collectively surround the second object. However, Chen teaches generate a set of regions of interest, wherein each region of interest is representative of a non-rectangular, polygonal shape (i.e. in the instance-level semantic segmentation sub-network, technologies such as R-FCN, MNC (e.g., or ROI-warping from MNC, as described herein), and/or the like, can be used, as described, for region proposals, to generate the instance masks, etc. For example, region proposals (e.g., regions of interest (RoI) 232) for the layer 202 can be generated from RPN 230, and a customized RoI classifier 210 is provided to classify the region proposals; para. [0037]); produce classifications for objects, if any, in each region of interest using a convolutional neural network trained in part using a polygonal shape outline of a second object depicted in a second image (i.e. in the instance-level semantic segmentation sub-network, technologies such as R-FCN, MNC (e.g., or ROI-warping from MNC, as described herein), and/or the like, can be used, as described, for region proposals, to generate the instance masks, etc. For example, region proposals (e.g., regions of interest (RoI) 232) for the layer 202 can be generated from RPN 230, and a customized RoI classifier 210 is provided to classify the region proposals; para. [0037-0040]), predict segmentation masks on at least a subset of the set of regions of interest in a pixel-to-pixel manner (i.e. generate masks of instances of objects in the image, etc. 
As described above, and further herein, category-level semantic segmentation can relate to a process for analyzing pixels in an image and assigning a label for each pixel, where the label may be indicative of an object type or category; para. [0017]), identify objects of the first image based on classifications and the segmentation masks (i.e. the classification of each instance mask is determined by the RoI classifier 210. To further boost the performance of the ROI classifier 210, the feature maps 212 can be stacked into the layers of RoI classifier 210, which may include stacking the feature maps 212 using a pooling operation (e.g., a position sensitive pooling (PSP) 238, a compact bilinear pooling 244 or other pooling or fusion operation, etc.); para. [0038]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Cao to include the feature of Chen. One would have been motivated to make this modification because it improves detection and classification accuracy. However, Farooqi teaches wherein the polygonal shape outline (i.e. more than two object localization techniques can be used. Further, in some variations, the object localization techniques can be performed in sequence and/or partially in parallel. The first and second set of proposed bounding polygons (in some cases only one bounding polygon is identified by one of the localization techniques) are then analyzed to determine an intersection over union or other overlap across the first and second sets of proposed bounding polygons 230. Based on this determination, at least one optimal bounding polygon 240 is determined. This optimal bounding polygon 240 can then be used for subsequent image processing including classification of any encapsulated objects within the optimal bounding polygon 240 as applied to the cropped RGB image 160; para. 
[0032-0034]) is defined by lines and/or curves that collectively surround the second object (i.e. fig. 3, As is illustrated in image 340, a bounding polygon 342 can then be generated that encapsulates the foreground object. The image data encapsulated by the various edges of the bounding polygon 342 can then be subjected to further image processing including, without limitation, classification of the objects; para. [0035]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Cao and Chen to include the feature of Farooqi. One would have been motivated to make this modification because the outline captures the true geometry of the object more accurately, so the CNN learns from precise, noise-reduced labels.

Claim 16: Cao, Chen, and Farooqi teach the non-transitory, computer-readable storage medium of claim 15. Cao further teaches generating feature maps from the first image by applying at least a second convolutional neural network to at least a portion of the first image (i.e. the receptive field determines a region size of an input layer corresponding to an element in an output result of a layer. That is, the receptive field is a size, mapped on an input image, of an element point of an output result of a layer in the convolutional neural network (that is, a feature map, also referred to as feature information). For example, for details, refer to FIG. 3. Generally, a receptive field size of an output feature map pixel of a first convolutional layer (for example, C1) is equal to a convolution kernel size (a filter size), while a receptive field size of a high convolutional layer (for example, C4) is related to convolution kernel sizes and step sizes of all layers before the high convolutional layer. Therefore, different levels of information may be captured based on different receptive fields, to extract the feature information of different scales. 
That is, after feature extraction is performed on a slice by using different receptive fields, high-layer feature information of different scales and low-layer feature information of different scales of the slice may be obtained; para. [0045]). Cao does not explicitly teach sliding a window across the feature maps to obtain a plurality of anchor shapes using a region proposal network; and determining if each anchor shape of the plurality of anchor shapes contains an object to generate a set of regions of interest. However, Chen further teaches where generating a set of regions of interest comprises: generating feature maps from the first image by applying at least a second convolutional neural network to at least a portion of the first image; sliding a window across the feature maps to obtain a plurality of anchor shapes using a region proposal network; and determining if each anchor shape of the plurality of anchor shapes contains an object to generate a set of regions of interest (i.e. in the instance-level semantic segmentation sub-network, technologies such as R-FCN, MNC (e.g., or ROI-warping from MNC, as described herein), and/or the like, can be used, as described, for region proposals, to generate the instance masks, etc. For example, region proposals (e.g., regions of interest (RoI) 232) for the layer 202 can be generated from RPN 230, and a customized RoI classifier 210 is provided to classify the region proposals. In an example, a last layer 240 of the convolutional blocks (e.g., conv5, which may include 2048 channels in an example) can be convolved with a 1×1 convolutional layer to generate a feature map (e.g., a 1024-channel feature map). Then, k²(C+1) channels feature maps, also referred to as detection position-sensitive score maps 250, can be generated, where the +1 can be for the background class and a total of C categories. 
The k² can correspond to a k×k spatial grid, where the cell in the grid encodes the relative positions (e.g., top-left and bottom-right). In one example, k can be set to 7. In an example, the detection position-sensitive score maps can be generated for each RoI 232 in the image provided as output from the RPN 230. A pooling operation (e.g., position sensitive pooling 242) can be applied to the detection position-sensitive score maps to obtain a C+1-dimensional vector for each RoI 232; para. [0037]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Cao and Farooqi to include the feature of Chen. One would have been motivated to make this modification because it improves detection and classification accuracy.

Claim 17: Cao, Chen, and Farooqi teach the non-transitory, computer-readable storage medium of claim 16. Cao does not explicitly teach wherein each anchor shape is a non-rectangular, polygonal shape. However, Chen further teaches wherein each anchor shape is a non-rectangular, polygonal shape (i.e. FIG. 1 illustrates examples of images and semantic segmentations according to one aspect of the disclosure. Given an image in (a), a ground truth of category-level semantic segmentation is shown in (b), where each pixel is labeled with its corresponding category, which is represented by reference numerals 110 for sidewalk pixels, 112 for pedestrian pixels, 114 for automobile pixels, etc. in the representation of the image shown in (b). In FIG. 1, an example of instance-level semantic segmentation ground truth is shown in (c), where each object in the image is localized based on one or more masks, and are shown as represented using reference numerals 120 and 122 for different instances of pedestrians, 124 for an instance of an automobile, etc., to denote the segmentation of the objects (or instances). In FIG. 
1, the expected output of joint category-level and instance-level semantic segmentation, as described herein, is shown in (d). In (d), instances of traffic participants (e.g., cars, pedestrians and riders) are localized using masks, and categorized using category-level semantic segmentation, which can be denoted using different colors in the segmentation for categories with each instance being separately outlined, but are shown here in black and white; para. [0025]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Cao and Farooqi to include the feature of Chen. One would have been motivated to make this modification because it improves detection and classification accuracy.

Claim 19: Cao, Chen, and Farooqi teach the non-transitory, computer-readable storage medium of claim 15. Cao does not explicitly teach wherein each segmentation mask encodes an associated object's spatial layout. However, Chen further teaches wherein each segmentation mask encodes an associated object's spatial layout (i.e. in the instance-level semantic segmentation sub-network, technologies such as R-FCN, MNC (e.g., or ROI-warping from MNC, as described herein), and/or the like, can be used, as described, for region proposals, to generate the instance masks, etc. For example, region proposals (e.g., regions of interest (RoI) 232) for the layer 202 can be generated from RPN 230, and a customized RoI classifier 210 is provided to classify the region proposals. In an example, a last layer 240 of the convolutional blocks (e.g., conv5, which may include 2048 channels in an example) can be convolved with a 1×1 convolutional layer to generate a feature map (e.g., a 1024-channel feature map). Then, k²(C+1) channels feature maps, also referred to as detection position-sensitive score maps 250, can be generated, where the +1 can be for the background class and a total of C categories. 
The k² can correspond to a k×k spatial grid, where the cell in the grid encodes the relative positions (e.g., top-left and bottom-right). In one example, k can be set to 7. In an example, the detection position-sensitive score maps can be generated for each RoI 232 in the image provided as output from the RPN 230. A pooling operation (e.g., position sensitive pooling 242) can be applied to the detection position-sensitive score maps to obtain a C+1-dimensional vector for each RoI 232; para. [0037]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Cao and Farooqi to include the feature of Chen. One would have been motivated to make this modification because it enhances object representation.

7. Claims 3-4, 8-9, 11-13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Cao, Chen, Farooqi, and further in view of Agarwal et al. (U.S. Patent Pub. No. US 11468675 B1).

Claim 3: Cao, Chen, and Farooqi teach the system of claim 2. Cao does not explicitly teach provide a user interface displaying the first image including the bounding shape. However, Agarwal teaches a user interface displaying the first image including the bounding shape (i.e. fig. 5, the user may utilize an input device (e.g., a mouse, a finger for touch input, etc.) to draw a bounding box 514 (or otherwise provide user input such as selecting an object, for example, by drawing bounding box 516 around an object (e.g., a dress)). Location data (e.g., dimensions of the bounding box, coordinates, etc.) for the bounding box (or selection) within the video frame may be identified; col. 11, lines 41-50). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Cao, Chen, and Farooqi to include the feature of Agarwal. 
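The position-sensitive score maps quoted from Chen — k²(C+1) channel maps pooled down to a (C+1)-dimensional vector per RoI — can be sketched as follows. This is an illustrative NumPy toy under stated assumptions: the score maps are random rather than learned, mean pooling stands in for the position sensitive pooling 242, and `ps_roi_pool` is a name coined here for illustration, not from the reference.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    """Position-sensitive RoI pooling over detection score maps (simplified).

    score_maps: (k*k, C+1, H, W) -- one group of C+1 channels per cell of
    the k x k grid, so each bin of an RoI reads only "its own" maps.
    roi: (y0, x0, y1, x1) in pixel coordinates.
    Returns a (C+1,) score vector for the RoI (average vote over the bins).
    """
    y0, x0, y1, x1 = roi
    bin_h = (y1 - y0) / k
    bin_w = (x1 - x0) / k
    votes = []
    for i in range(k):          # grid row
        for j in range(k):      # grid column
            ys0, ys1 = int(y0 + i * bin_h), int(np.ceil(y0 + (i + 1) * bin_h))
            xs0, xs1 = int(x0 + j * bin_w), int(np.ceil(x0 + (j + 1) * bin_w))
            # Bin (i, j) reads only its own group of score maps.
            cell = score_maps[i * k + j, :, ys0:ys1, xs0:xs1]
            votes.append(cell.mean(axis=(-2, -1)))  # pool within the bin
    return np.mean(votes, axis=0)

C, k, H, W = 4, 3, 24, 24                  # C categories + 1 background class
rng = np.random.default_rng(0)
score_maps = rng.random((k * k, C + 1, H, W))
scores = ps_roi_pool(score_maps, roi=(3, 3, 18, 18), k=k)
print(scores.shape)                        # (5,)
```

The position sensitivity comes from the indexing `score_maps[i * k + j]`: the top-left bin of every RoI only ever reads the maps trained to respond to top-left object parts, which is what lets a fully convolutional detector encode an object's relative spatial layout.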
One would have been motivated to make this modification because it allows for more precise and focused training data preparation.

Claim 4: Cao, Chen, Farooqi, and Agarwal teach the system of claim 3. Cao does not explicitly teach receive input that is indicative of the bounding shape being placed by a user through the user interface. However, Agarwal teaches receive input that is indicative of the bounding shape being placed by a user through the user interface (i.e. fig. 5, the user may utilize an input device (e.g., a mouse, a finger for touch input, etc.) to draw a bounding box 514 (or otherwise provide user input such as selecting an object, for example, by drawing bounding box 516 around an object (e.g., a dress)). Location data (e.g., dimensions of the bounding box, coordinates, etc.) for the bounding box (or selection) within the video frame may be identified; col. 11, lines 41-50). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Cao, Chen, and Farooqi to include the feature of Agarwal. One would have been motivated to make this modification because it allows for more precise and focused training data preparation.

Claim 8: Cao, Chen, and Farooqi teach the computer-implemented method of claim 6. Cao does not explicitly teach wherein the image is representative of a rectangular portion of a second image, and wherein the rectangular portion is defined by a user. However, Agarwal teaches wherein the image is representative of a rectangular portion of a second image, and wherein the rectangular portion is defined by a user (i.e. The user may utilize an input device (e.g., a mouse, a finger for touch input, etc.) to draw a bounding box 514 (or otherwise provide user input such as selecting an object, for example, by drawing bounding box 516 around an object (e.g., a dress)). Location data (e.g., dimensions of the bounding box, coordinates, etc.) 
for the bounding box (or selection) within the video frame may be identified; col. 11, lines 41-50). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Cao, Chen, and Farooqi to include the feature of Agarwal. One would have been motivated to make this modification because it allows for more precise and focused training data preparation.

Claim 9: Cao, Chen, Farooqi, and Agarwal teach the computer-implemented method of claim 8. Cao does not explicitly teach wherein the second image is presented on a user interface, and wherein a bounding shape is placed by the user onto the second image to define the image. However, Agarwal further teaches wherein the second image is presented on a user interface, and wherein a bounding shape is placed by the user onto the second image to define the image (i.e. The user may utilize an input device (e.g., a mouse, a finger for touch input, etc.) to draw a bounding box 514 (or otherwise provide user input such as selecting an object, for example, by drawing bounding box 516 around an object (e.g., a dress)). Location data (e.g., dimensions of the bounding box, coordinates, etc.) for the bounding box (or selection) within the video frame may be identified; col. 11, lines 41-50). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Cao, Chen, and Farooqi to include the feature of Agarwal. One would have been motivated to make this modification because it allows for more precise and focused training data preparation.

Claim 11: Cao, Chen, and Farooqi teach the computer-implemented method of claim 6. Cao does not explicitly teach presenting, on a user interface, at least a portion of the image and the polygonal shape outline. 
However, Agarwal teaches presenting, on a user interface, at least a portion of the image and the polygonal shape outline (i.e. fig. 5, the user may utilize an input device (e.g., a mouse, a finger for touch input, etc.) to draw a bounding box 514 (or otherwise provide user input such as selecting an object, for example, by drawing bounding box 516 around an object (e.g., a dress)). Location data (e.g., dimensions of the bounding box, coordinates, etc.) for the bounding box (or selection) within the video frame may be identified; col. 11, lines 41-50). Therefore, it would have been obvious to one of ordinary skill in the art

Prosecution Timeline

Jun 19, 2023
Application Filed
May 04, 2024
Non-Final Rejection — §103
Nov 08, 2024
Response Filed
Feb 09, 2025
Final Rejection — §103
Apr 22, 2025
Interview Requested
Apr 30, 2025
Applicant Interview (Telephonic)
May 01, 2025
Examiner Interview Summary
May 13, 2025
Request for Continued Examination
May 18, 2025
Response after Non-Final Action
Oct 20, 2025
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12594668
BRAIN-LIKE DECISION-MAKING AND MOTION CONTROL SYSTEM
2y 5m to grant Granted Apr 07, 2026
Patent 12579420
Analog Hardware Realization of Trained Neural Networks
2y 5m to grant Granted Mar 17, 2026
Patent 12579421
Analog Hardware Realization of Trained Neural Networks
2y 5m to grant Granted Mar 17, 2026
Patent 12572850
METHOD FOR IMPLEMENTING MODEL UPDATE AND DEVICE THEREOF
2y 5m to grant Granted Mar 10, 2026
Patent 12572326
DIGITAL ASSISTANT FOR MOVING AND COPYING GRAPHICAL ELEMENTS
2y 5m to grant Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
60%
Grant Probability
92%
With Interview (+31.8%)
3y 6m
Median Time to Grant
High
PTA Risk
Based on 307 resolved cases by this examiner. Grant probability derived from career allow rate.
