DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1 and 3-4 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Dai (US 20160358337).
Regarding claim 1:
Dai discloses: a computer-implemented method (FIG. 7) comprising:
generating, using a segment classification neural network, an image embedding for a digital image portraying a plurality of image segments (¶ [0005] “The feature maps [image embeddings] are obtained by convoluting an input image using a plurality of layers of convolution filters. The feature maps record semantic information for respective regions on the image”; ¶ [0050] “In step 710, a sequence of convolution filtering is applied on an image to obtain feature maps, the feature maps including a plurality of activations. Each of the activations represents semantic information for a region on the image”);
determining, using the segment classification neural network, masked segment embeddings for the plurality of image segments of the digital image based on the image embedding and a plurality of masks corresponding to the plurality of image segments (¶ [0005] “…the binary masks [masks] are obtained from a set of candidate segments of the image”; ¶ [0051] “…in step 720, the low-resolution binary masks may be generated based on the binary masks and the feature maps. The low-resolution binary masks are applied onto the feature maps to generate the segment features [masked segment embeddings]”; ¶ [0075] “…masking the feature maps with binary masks to generate segment features of the image, each of the binary masks representing a candidate segment of the image”); and
determining, using the segment classification neural network, segment labels for the plurality of image segments based on the masked segment embeddings (¶ [0005] “…The semantic segmentation of the image is done by determining a semantic category for each pixel in the image at least in part based on the resulting segment features”; ¶ [0052] “…in step 730, the segment features may be pooled and the pooled segment features may be fully connected. In addition, the regional features on the feature maps may be pooled, where each regional feature is represented by a bounding box. The pooled regional features may be fully connected as well. Accordingly, the semantic category for each pixel in the image may be determined based on a concatenation of the connected segment features and the connected regional features”; ¶ [0075] “…and determining a semantic category for each pixel in the image at least in part based on the segment features”).
Regarding claim 3:
Dai discloses the limitations of claim 1 as applied above.
Dai further discloses: wherein determining the masked segment embeddings for the plurality of image segments of the digital image based on the image embedding and the plurality of masks comprises:
determining a masked segment embedding for an image segment of the digital image by applying a mask for the image segment to the image embedding (¶ [0034] “…the CFM layer 220 uses the binary masks representing the candidate segments to mask the convolutional feature maps produced by the last convolutional layer 210… The resulting N masked convolutional features are referred to as “segment features”; ¶ [0032] “…The CFM layer 220 is configured to mask the feature maps. That is, the masking is performed on the convolutional features rather than the raw image. To this end, binary masks are obtained from the segment proposals in the image”; ¶ [0033] “…Within the mask 300, the values of pixels located inside the candidate segment 320 are set to one (shown in white), while the values of pixels in the other part are set to zero (shown in black).”);
to prevent incorporating features represented in the image embedding that are unassociated with the image segment (¶ [0039] “…each low-resolution binary mask may be multiplied with the feature map of each channel. In this way, if the binary value in the low-resolution binary mask is one, the strength of activation at the corresponding position is maintained. Otherwise, if the binary value in the low-resolution binary masks is zero, the strength of activation at the corresponding position is set to zero”).
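Purely as an illustrative sketch (not part of the prior-art record), the masking operation quoted above from Dai ¶ [0039] — multiplying each channel of the feature maps by a binary mask so that activations outside the candidate segment are zeroed — can be expressed as follows. Array shapes and the function name are hypothetical.

```python
import numpy as np

def mask_feature_maps(feature_maps, binary_mask):
    """feature_maps: (C, H, W) activations; binary_mask: (H, W) of 0/1.

    Where the mask value is one, the strength of activation is
    maintained; where it is zero, the activation is set to zero,
    yielding the masked segment features.
    """
    return feature_maps * binary_mask[np.newaxis, :, :]

# Example: 2-channel 3x3 feature maps; mask covers the top-left 2x2 region.
fmaps = np.arange(18, dtype=float).reshape(2, 3, 3)
mask = np.zeros((3, 3))
mask[:2, :2] = 1.0
segment_features = mask_feature_maps(fmaps, mask)
# Activations outside the masked region are zero; those inside are unchanged.
```

This mirrors the cited mechanism by which features unassociated with the image segment are prevented from entering the segment embedding.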
Regarding claim 4:
Dai discloses the limitations of claim 1 as applied above.
Dai further discloses: wherein determining the masked segment embeddings for the plurality of image segments of the digital image based on the image embedding and the plurality of masks comprises:
determining the masked segment embeddings for the plurality of image segments by executing one or more mask-based pooling operations (¶ [0045] “…the segmentation module 230 includes a first pooling layer 520. The first pooling layer 520 receives and pools the segment features generated by the CFM layer 220.”; ¶ [0079] “…Determining the semantic category for each pixel in the image comprises: pooling the segment features; and connecting the pooled segment features”);
using the image embedding and the plurality of masks (¶ [0084] “…mask the feature maps with binary masks to generate segment features of the image, each of the binary masks representing a candidate segment of the image”; ¶ [0039] “The CFM layer 220 applies each low-resolution binary mask onto the feature maps to mask the convolutional feature maps”).
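As an illustration only (names and shapes hypothetical, not drawn from the record), a mask-based pooling operation consistent with the portions of Dai quoted above — masking the feature maps with each binary mask and pooling the resulting segment features into one fixed-length vector per candidate segment — can be sketched as:

```python
import numpy as np

def masked_pool(feature_maps, masks):
    """feature_maps: (C, H, W); masks: (N, H, W) binary, one per segment.

    For each mask, the feature maps are masked elementwise and then
    max-pooled over the spatial dimensions, producing an (N, C) array:
    one pooled segment embedding per image segment.
    """
    return np.stack([
        (feature_maps * m[np.newaxis]).max(axis=(1, 2))  # pool masked activations
        for m in masks
    ])

fmaps = np.arange(18, dtype=float).reshape(2, 3, 3)
masks = np.zeros((2, 3, 3))
masks[0, :2, :2] = 1.0   # segment 1: top-left region
masks[1, 2, :] = 1.0     # segment 2: bottom row
embeddings = masked_pool(fmaps, masks)  # one vector per segment
```

Max pooling is used here only as one representative pooling choice; the quoted passages do not limit the pooling type.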
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 2 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Dai (US 20160358337) in view of Zhao (US 20220230321).
Regarding claim 2:
Dai discloses the limitations of claim 1 as applied above.
Dai does not specifically teach: generating the plurality of masks for the digital image using a segmentation neural network.
However, in a related field, Zhao teaches: generating the plurality of masks for the digital image using a segmentation neural network (¶ [0085] “…the object segmentation system 106 provides the input image 328 to the class-agnostic segmentation network 330… The class-agnostic segmentation network 330 then utilizes the decoder 336 to decode the feature vector 334 and generate the unclassified object masks 338 (i.e., unlabeled or class-agnostic object masks).”; ¶ [0086] “…the unclassified object masks 338 includes an unclassified object mask for each object located in the input image 328”);
Dai further teaches: providing the plurality of masks with the digital image as input to the segment classification neural network (¶ [0031] “…The feature maps produced by the last convolutional layer of the convolutional layers 210 are provided to a convolutional feature masking (CFM) layer 220 included in the image processing system 122, as shown in FIG. 2.”; ¶ [0034] “…the CFM layer 220 uses the binary masks representing the candidate segments to mask the convolutional feature maps produced by the last convolutional layer 210”. Also see FIG. 4).
Therefore, it would have been obvious to a person of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Dai to incorporate the teachings of Zhao by including: generating the plurality of masks for the digital image using a segmentation neural network in order to improve mask generation without changing Dai’s downstream masking and labeling operation, which yields a predictable result.
Regarding claim 9:
Dai discloses: a system comprising: one or more memory devices; and one or more processors (FIG. 1) configured to cause the system to:
receive a digital image portraying a plurality of image segments (¶ [0005] “The feature maps are obtained by convoluting an input image using a plurality of layers of convolution filters.”; ¶ [0020] “…The input device(s) 140 may include a camera, a scanner and/or any other device that can be used to input images”; FIG. 1, image 173 portrays image segments such as sofa, person, wall);
generate a plurality of masks for the digital image (¶ [0005] “…the binary masks [masks] are obtained from a set of candidate segments of the image”; ¶ [0051] “…in step 720, the low-resolution binary masks may be generated based on the binary masks and the feature maps. The low-resolution binary masks are applied onto the feature maps to generate the segment features”; ¶ [0075] “…masking the feature maps with binary masks to generate segment features of the image, each of the binary masks representing a candidate segment of the image”);
determine segment labels for the plurality of image segments of the digital image by using a segment classification neural network to generate an image embedding for the digital image (¶ [0050] “In step 710, a sequence of convolution filtering is applied on an image to obtain feature maps, the feature maps including a plurality of activations. Each of the activations represents semantic information for a region on the image”; ¶ [0051] “…in step 720, the low-resolution binary masks may be generated based on the binary masks and the feature maps. The low-resolution binary masks are applied onto the feature maps to generate the segment features [masked segment embeddings]”);
generate a masked segment embedding for each image segment of the digital image based on the image embedding and a corresponding mask from the plurality of masks (¶ [0051] “…in step 720, the low-resolution binary masks may be generated based on the binary masks and the feature maps. The low-resolution binary masks are applied onto the feature maps to generate the segment features [masked segment embeddings]”; ¶ [0075] “…masking the feature maps with binary masks to generate segment features of the image, each of the binary masks representing a candidate segment of the image”); and
determine a segment label for each image segment of the digital image based on a corresponding masked segment embedding (¶ [0005] “…The semantic segmentation of the image is done by determining a semantic category for each pixel in the image at least in part based on the resulting segment features”; ¶ [0052] “…in step 730, the segment features may be pooled and the pooled segment features may be fully connected. In addition, the regional features on the feature maps may be pooled, where each regional feature is represented by a bounding box. The pooled regional features may be fully connected as well. Accordingly, the semantic category for each pixel in the image may be determined based on a concatenation of the connected segment features and the connected regional features”; ¶ [0075] “…and determining a semantic category for each pixel in the image at least in part based on the segment features”).
Dai does not specifically teach that the masks are generated by a class-agnostic segmentation neural network producing a mask for each segment.
However, Zhao teaches: that the masks are generated by a class-agnostic segmentation neural network producing a mask for each segment (¶ [0066] “…the class-agnostic segmentation network 310 includes higher neural network layers that form a decoder 316. In one or more implementations, the higher neural network layers include fully connected layers, segmentation, and/or classification (e.g., SoftMax) layers. In various implementations, the decoder 316 processes the feature vectors 314 to generate pixel segmentations for each detected object in an input image. For example, the decoder 316 generates the predicted unclassified object masks 318 from the feature vector 314 (e.g., using a SoftMax classifier) and/or generates an object segmentation for each object in an input image, from which the predicted unclassified object masks 318 (e.g., predicted unlabeled or unclassified class-agnostic object masks) are created.”; ¶ [0086] “…the unclassified object masks 338 includes an unclassified object mask for each object located in the input image 328”).
Therefore, it would have been obvious to a person of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Dai to incorporate the teachings of Zhao in order to improve mask generation without changing Dai’s downstream masking and labeling operation, which yields a predictable result.
Claim(s) 7-8, 16-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Dai (US 20160358337) in view of Xu (US 20240153093).
Regarding claim 7:
Dai discloses the limitations of claim 1 as applied above.
Dai does not specifically teach: determining the segment labels for the plurality of image segments from an open-vocabulary label set.
However, in a related field, Xu teaches: wherein determining the segment labels for the plurality of image segments comprises determining the segment labels from an open-vocabulary label set (¶ [0019] “…The dense and rich diffusion features provided by the internal representation may be provided to other classification models, for classifying images into open-vocabulary labels, and to perform open-vocabulary panoptic and semantic segmentation tasks.”; ¶ [0025] “…each mask may be categorized into one of many open-vocabulary categories by associating each predicted mask's diffusion features with text embeddings of several category names”; ¶ [0033] “The panoptic label unit 230 may be trained to predict the category label from an open vocabulary that is assigned to each predicted mask using either category label supervision or image caption supervision.”).
Therefore, it would have been obvious to a person of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Dai to incorporate the teachings of Xu by including: determining the segment labels from an open-vocabulary label set in order to address the problem of open-vocabulary (open-world) segmentation, where the task is to use annotated segmentation masks of known object categories to learn to segment unknown object categories at the time of testing.
Regarding claim 8:
Dai discloses the limitations of claim 1 as applied above.
Dai does not specifically teach: determining the masked segment embeddings using the segment classification neural network having parameters determined using weak-alignment data that includes training images and training image captions.
However, Xu teaches: wherein determining the masked segment embeddings using the segment classification neural network comprises:
determining the masked segment embeddings using the segment classification neural network having parameters determined using weak-alignment data that includes training images and training image captions (¶ [0026] “…training of the mask generator 120 may be supervised with available object category labels or by weaker labels in the form of image-level textual captions available for each input image”; ¶ [0036] “For the case of training with ground truth captions, no category labels are available for the annotated masks. Instead, a natural language caption is provided for each image, and the panoptic label unit 230 learns to classify the predicted mask embedding features using the image caption alone.”),
and Dai further discloses: and strong-alignment data that includes additional training images and training segment labels (¶ [0024] “…The training data 180 includes a plurality of training images 182. The semantic category of each pixel in each training image 182 is determined in advance. For example, the semantic category of pixels in the training images 182 can be obtained by user labeling. The parameters and/or coefficients of the modules/logic in the image processing system 122 are then modified according to the training images 182 in order to train the CNN.” Note that the per-pixel semantic categories are segment labels for training images which correspond to strong alignment).
Regarding claim 16: The limitations of claim 16 are similar to those of claims 1 and 8 and are therefore rejected in the same manner as applied above. Dai further teaches the CRM (computer-readable medium) in ¶ [0020].
Regarding claim 17:
Dai in view of Xu discloses the limitations of claim 16 as applied above.
Dai further teaches: wherein determining the masked segment embedding for each image segment comprises determining, for each image segment, the masked segment embedding having features of the digital image that represent a context surrounding the image segment within the digital image (¶ [0043] “It has been found that sometimes the segment features alone may be not enough to achieve the semantic segmentation of images. To this end, in some implementations, the segment features generated by the CFM layer may be combined with the regional features from bounding boxes.”; ¶ [0046] “…The second pooling layer 540 may pool the regional features of the image. Each regional feature is represented by a bounding box on the feature maps”).
Regarding claim 20:
Dai in view of Xu discloses the limitations of claim 16 as applied above.
Dai further teaches: wherein determining the segment label for each image segment comprises determining a plurality of segment labels for a plurality of objects portrayed in the digital image (¶ [0002] “An image may contain multiple things, including objects and stuff… the term “objects” refer to the things that have consistent shape and each instance is countable… the semantic segmentation assigns a category label to each pixel to indicate an object or a stuff to which the pixel belongs”;
each object of the plurality of objects corresponding to a mask generated for the digital image (¶ [0032] “…binary masks are obtained from the segment proposals in the image… a segment proposal refers to a candidate segment to be classified for semantic segmentation”; ¶ [0033] “…Each candidate segment may be presented by a binary mask. The binary mask may [be] the foreground mask and enclosing bounding box. FIG. 3 shows an example of a binary mask 300 corresponding to a candidate segment in the image”).
Allowable Subject Matter
Claims 5-6, 10-15, and 18-19 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WASSIM MAHROUKA whose telephone number is (571)272-2945. The examiner can normally be reached Monday-Thursday 8:00-5:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen Koziol can be reached at (408) 918-7630. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/WASSIM MAHROUKA/Primary Examiner, Art Unit 2665