Last updated: May 29, 2026
Application No. 18/431,912
PLURALISTIC SALIENT OBJECT DETECTION

Final Rejection §103
Filed: Feb 02, 2024
Examiner: ABDI, AMARA
Art Unit: 2668
Tech Center: 2600 (Communications)
Assignee: Microsoft Technology Licensing, LLC
OA Round: 2 (Final)
Interview Optional

Interview lift: -7.0%. An interview has already been conducted in this application's prosecution history. This examiner has an 83% grant rate with a -7.0% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 826 resolved cases, 2023–2026
Examiner Intelligence

ABDI, AMARA
Career Allowance Rate: 83% (above average)
684 granted / 826 resolved, +20.8% vs TC avg
Interview Lift: -7.0% (minimal), from resolved cases with interview
Avg Prosecution (typical timeline): 2y 6m
17 currently pending
Total Applications (career history, across all art units): 853
Statute-Specific Performance

§101: 2.9% (-37.1% vs TC avg)
§103: 89.6% (+49.6% vs TC avg)
§102: 2.2% (-37.8% vs TC avg)
§112: 2.5% (-37.5% vs TC avg)
The Tech Center average is an estimate. Based on career data from 826 resolved cases.
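As a sanity check on the figures above, the short sketch below recomputes the career allowance rate from the granted and resolved counts and backs out the Tech Center baseline implied by each statute-specific delta. It assumes, without confirmation from the source, that "vs TC avg" is a difference in percentage points; the variable names are illustrative only.

```python
# Figures copied from the examiner profile above.
granted, resolved = 684, 826

# Career allowance rate as a percentage of resolved cases.
allowance_rate = 100 * granted / resolved
print(f"allowance rate: {allowance_rate:.1f}%")

# Assumption: each "vs TC avg" delta is in percentage points, so
# implied baseline = examiner share - reported delta.
statute_stats = {
    "101": (2.9, -37.1),
    "103": (89.6, 49.6),
    "102": (2.2, -37.8),
    "112": (2.5, -37.5),
}
implied_baseline = {
    statute: round(share - delta, 1)
    for statute, (share, delta) in statute_stats.items()
}
for statute, baseline in implied_baseline.items():
    print(f"§{statute}: implied TC baseline {baseline}%")
```

Under that assumption, every statute row implies the same baseline, which suggests the Tech Center average shown is a single flat estimate rather than a per-statute figure.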
Office Action

§103
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Applicant's response to the last Office action, filed February 26, 2026, has been entered and made of record. Claims 1-3, 5, 7-10, 12, 14-16, 18, and 20 are amended. By this amendment, claims 1-20 are pending for examination.

Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.


The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 8-9, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Kirillov et al. ("Segment Anything", 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3992-4003) in view of Piergiovanni et al. (US-PGPUB 20250005924), and further in view of Taruttis et al. (US-PGPUB 20230298184).

In regards to claim 1, Kirillov et al discloses a system comprising:
a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor, (Page 3996, right-hand-column, first paragraph, “CPU”, implicitly includes a processor and memory), to:
receive a first image, the first image including a first object and a second object, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image. Further Figures 2-4 show an image including multiple objects, “first object and second object of the first image”);
 based on at least receiving the first image and a first token, generate, with a pluralistic object detector, a first segmentation mask, wherein the first segmentation mask corresponds to the first object, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image, “i.e., based on at least receiving the first image”; and Figures 3-4, show an image divided into plurality of patches, where each patch implicitly represents a token, “implicitly receiving a first token among plurality patches”. Further, Page 3996, left-hand-column, section 3, Segment Anything Model (SAM), “i.e., pluralistic object detector”, has three components, illustrated in Fig. 4: an image encoder, a flexible prompt encoder, and a fast mask decoder, where the mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask, “i.e., implicitly generating plurality of segmentation masks by SAM (see Fig. 3), comprising at least a first segmentation mask”, [i.e., generate, with a pluralistic object detector, “SAM”, a first segmentation mask, “implicit by generating plurality of segmentation masks by SAM (see Fig. 3)”, based on at least receiving the first image and a first token, “implicit by image input to image decoder shown in Fig. 4, and a plurality of patches (tokens) generated in Figs. 2-3”, wherein the first segmentation mask corresponds to the first object, implicit by generating plurality of patches, (tokens), corresponding to different objects, as shown in Figs. 2-3”]);
based on at least receiving the first image and a second token, generate, with the pluralistic object detector, a second segmentation mask, wherein the second segmentation mask corresponds to the second object, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image, “i.e., based on at least receiving the first image”; and Figures 3-4, show an image divided into plurality of patches, where each patch implicitly represents a token, “implicitly receiving another token, (at least a second token), among the plurality of patches”. Further, Page 3996, left-hand-column, section 3, the mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask, “i.e., implicitly generating plurality of segmentation masks comprising at least a first segmentation mask, and second segmentation token, among plurality of patches, corresponding to different objects, as shown in Figs. 2-3”, [i.e., based on at least receiving the first image and a second token, “implicit by image input to image decoder shown in Fig. 4, and a plurality of patches (tokens) generated in Figs. 2-3”, generate, with the pluralistic object detector, “using (SAM)”, a second segmentation mask, “implicit by generating plurality of segmentation masks comprising at least second segmentation mask”, wherein the second segmentation mask corresponds to at least the second object of the first image, “implicit by generating plurality of patches, (tokens), corresponding to different objects, as shown in Figs. 2-3”]);
Kirillov does not expressly disclose that the first token comprising a first learned token embedding configured to represent a first set of salient objects of the first image; and the second token comprising a second learned token embedding configured to represent a second set of salient objects of the first image; and persisting the first segmentation mask and the second segmentation mask.
However, Piergiovanni discloses that the first token comprising a first learned token embedding configured to represent a first set of salient objects of the first image; and the second token comprising a second learned token embedding configured to represent a second set of salient objects of the first image, (see at least: Figs. 2-3, Par. 0037-0047, the machine-learned model 52 can include one or more image kernels (e.g., image kernel 68) configured to be applied to an individual image frame (e.g., frame 70) of the set of video data to generate a plurality of image tokens 58, “a first learned token embedding”; and one or more image kernels (e.g., image kernel 68) configured to be applied to an individual image frame (e.g., frame 70) of the set of video data to generate a plurality of image token 74, “a second learned token embedding”; where the machine-learned model 52  can include a single visual transformer 64, configured to jointly process both the plurality of video tokens (e.g., video tokens 58) and the plurality of image tokens (e.g., image tokens 74) to generate the model output 66 including object detection; and from Fig. 3, Par. 0092, image processing task may be object detection, where the image processing  output identifies one or more regions, in the one or more images and, for each region, a likelihood that regions depicts an object of interest, “first and second set of salient objects”, [i.e., the first token, “tube patches”, comprising a first learned token embedding, “video tokens 58”, configured to represent a first set of salient objects of the first image, “one or more regions depicting an object of interest”; and the second token, “2D patches”, comprising a second learned token embedding, “image tokens 74”, configured to represent a second set of salient objects of the first image, “one or more regions depicting an object of interest”]).
Kirillov and Piergiovanni are combinable because they are both concerned with object detection. Therefore, it would have been obvious to a person of ordinary skill in the art to modify Kirillov to use the machine-learned model 52, as taught by Piergiovanni, in order to learn a plurality of image tokens for generating a model output (Piergiovanni, Par. 0037, 0043), for identifying one or more regions in the one or more images, and for determining, for each region, a likelihood that the region depicts an object of interest (Piergiovanni, Par. 0092).


The combined teaching of Kirillov and Piergiovanni as a whole does not expressly disclose persisting the first segmentation mask and the second segmentation mask.
Taruttis discloses persisting the first segmentation mask and the second segmentation mask, (Par. 0054, the segmenter 436 writes a segmentation masks repository 439, which stores a set of segmentation masks corresponding to the fluorescence images, [i.e., persisting the first segmentation mask and the second segmentation mask, “implicit by storing set of segmentation masks”]).
Kirillov, Piergiovanni, and Taruttis are combinable because they are all concerned with object detection. Therefore, it would have been obvious to a person of ordinary skill in the art to modify the combined teaching of Kirillov and Piergiovanni to use the repository, as taught by Taruttis, in order to store the set of segmentation masks (Taruttis, Par. 0054).

The following prior art, made of record and not relied upon, Morard et al. (US-PGPUB 20240249414), also discloses persisting the plurality of segmentation masks (see at least: Par. 0074, resulting segmentation masks may be saved in memory).
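For readers unfamiliar with the claim 1 flow discussed above (one image, two token embeddings, two masks, both masks persisted), the following is a minimal illustrative sketch. It is not taken from the application, from SAM, or from any cited reference: `toy_detector`, the random "features", and the random "token embeddings" are hypothetical stand-ins, and a real system would learn the embeddings during training.

```python
import json
import os
import random
import tempfile

random.seed(0)
EMBED_DIM = 4
HEIGHT, WIDTH = 3, 3


def random_vector(n):
    """Return n pseudo-random floats in [-1, 1]."""
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


# Hypothetical per-pixel features standing in for an image encoder's output.
features = [[random_vector(EMBED_DIM) for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Hypothetical token embeddings (random here; learned in a real system).
tokens = {"first": random_vector(EMBED_DIM), "second": random_vector(EMBED_DIM)}


def toy_detector(feature_map, token):
    """Mark a pixel 1 when its feature vector has a positive dot product with the token."""
    return [
        [int(sum(f * t for f, t in zip(pixel, token)) > 0) for pixel in row]
        for row in feature_map
    ]


# Same image, different token, one mask per token.
masks = {name: toy_detector(features, token) for name, token in tokens.items()}

# "Persisting": write both masks to storage, then read them back.
path = os.path.join(tempfile.mkdtemp(), "masks.json")
with open(path, "w") as fh:
    json.dump(masks, fh)
with open(path) as fh:
    restored = json.load(fh)
```

The sketch only shows the data flow that the rejection maps onto the references; whether image patches or decoder output tokens correspond to the claimed learned token embeddings is the contested question and is not resolved by the code.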

In regards to claim 2, the combined teaching of Kirillov, Piergiovanni, and Taruttis as a whole discloses the limitations of claim 1.


Kirillov further discloses wherein the first image further includes a third object, (see at least: Figs. 2-3, where plurality of objects are shown in an image, “which implicit that the first image further shows a third object of the first image”), and wherein the instructions are further operative to: 
based on at least receiving the first image and a third token, generate, with the pluralistic object detector, a third segmentation mask, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image, “i.e., based on at least receiving the first image”; and Figures 3-4, show an image divided into plurality of patches, where each patch implicitly represents a token, “i.e., another token, (at least a third token), among the plurality of patches”, wherein the first segmentation mask does not correspond to the second object or the third object of the first image, the second segmentation mask does not correspond to the third object of the first image, (see at least: Fig. 3, where in first row of the image, the first segmentation mask in first left patch does not correspond to the second or third object, shown in the third and fourth patches of first row (from left to right), and the second segmentation mask of the second patch does not correspond the third object of the third patch (from left to right), of the image), and
the third segmentation mask corresponds to the third object, (as shown in Fig. 3, the third segmentation mask of third patch in first row (from left to right) corresponds to the object shown in the third patch of the image); and 
On the other hand, Taruttis discloses persisting the third segmentation mask, (Taruttis, Par. 0054, implicit by storing set of segmentation masks).

Regarding claim 8, claim 8 recites substantially similar limitations as set forth in claim 1. As such, claim 8 is rejected for at least similar rationale.
The Examiner further acknowledges the following additional limitation(s): “computer-implemented method”. However, Taruttis discloses the “computer-implemented method”, (see at least: Par. 0011, “method”).

Regarding claim 9, claim 9 recites substantially similar limitations as set forth in claim 2. As such, claim 9 is rejected for at least similar rationale.

In regards to claim 15, Kirillov et al discloses a computer storage device having computer-executable instructions stored thereon, which, on execution by a computer, cause the computer to perform operations, (see at least: 3995, implicit by the pre-training algorithm), comprising: 
receiving a first image, the first image including a first object, a second object, and a third object, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image. Further Figures 2-4 show an image including multiple objects, “implicit first object, second object, and a third object”);
 based on at least receiving the first image and a first token, generating, with a pluralistic object detector, a first segmentation mask, wherein the first segmentation mask corresponds to the first object, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image, “i.e., based on at least receiving the first image”; and Figures 3-4, show an image divided into plurality of patches, where each patch implicitly represents a token, “implicitly receiving a first token among plurality patches”. Further, Page 3996, left-hand-column, section 3, Segment Anything Model (SAM), “i.e., pluralistic object detector”, has three components, illustrated in Fig. 4: an image encoder, a flexible prompt encoder, and a fast mask decoder, where the mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask, “i.e., implicitly generating plurality of segmentation masks by SAM (see Fig. 3), comprising at least a first segmentation mask”. Further, the generating plurality of patches, (tokens), corresponding to different objects, as shown in Figs. 2-3, implicit the first segmentation mask corresponding to the first object of the first image, [i.e., generate, with a pluralistic object detector, “SAM”, a first segmentation mask, “implicit by generating plurality of segmentation masks by SAM (see Fig. 3)”, based on at least receiving the first image and a first token, “implicit by image input to image decoder shown in Fig. 4, and a plurality of patches (tokens) generated in Figs. 2-3”, wherein the first segmentation mask corresponding to the first object, implicit by generating plurality of patches, (tokens), corresponding to different objects, as shown in Figs. 2-3”]);
based on at least receiving the first image and a second token, generating, with the pluralistic object detector, a second segmentation mask, wherein the second segmentation mask corresponds to the second object, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image, “i.e., based on at least receiving the first image”; and Figures 3-4, show an image divided into plurality of patches, where each patch implicitly represents a token, “implicitly receiving another token, (at least a second token), among the plurality of patches”. Further, Page 3996, left-hand-column, section 3, the mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask, “i.e., implicitly generating plurality of segmentation masks comprising at least a first segmentation mask, and second segmentation token, among plurality of patches, corresponding to different objects, as shown in Figs. 2-3”, [i.e., based on at least receiving the first image and a second token, “implicit by image input to image decoder shown in Fig. 4, and a plurality of patches (tokens) generated in Figs. 2-3”, generate, with the pluralistic object detector, “using (SAM)”, a second segmentation mask, “implicit by generating plurality of segmentation masks comprising at least second segmentation mask”, wherein the  second segmentation mask corresponds to the second object, “implicit by generating plurality of patches, (tokens), corresponding to different objects, as shown in Figs. 2-3”]);
 based on at least receiving the first image and a third token, generate, with the pluralistic object detector, a third segmentation mask, wherein the third segmentation mask corresponds to the third object, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image, “i.e., based on at least receiving the first image”; and Figures 3-4, show an image divided into plurality of patches, where each patch implicitly represents a token, “i.e., another token, (at least a third token), among the plurality of patches”, wherein the third segmentation mask corresponds to the third object of the first image, (as shown in Fig. 3, the third segmentation mask of third patch in first row (from left to right) corresponds to the object shown in the third patch of the image)
wherein the first segmentation mask does not correspond to the second object of the first image or the third object of the first image, the second segmentation mask does not correspond to the third object of the first image, (see at least: Fig. 3, where in first row of the image, the first segmentation mask in first left patch does not correspond to the second or third object, shown in the third and fourth patches of first row (from left to right), and the second segmentation mask of the second patch does not correspond the third object of the third patch (from left to right), of the image) 
Kirillov does not expressly disclose wherein the first token comprising a first learned token embedding configured to represent a first set of salient objects of the first image; wherein the second token comprising a second learned token embedding configured to represent a second set of salient objects of the first image; wherein the third token comprising a third learned token embedding configured to represent a third set of salient objects of the first image; and persisting the first segmentation mask and the second segmentation mask.
However, Piergiovanni discloses that the first token comprising a first learned token embedding configured to represent a first set of salient objects of the first image; and the second token comprising a second learned token embedding configured to represent a second set of salient objects of the first image, (see at least: Figs. 2-3, Par. 0037-0047, the machine-learned model 52 can include one or more image kernels (e.g., image kernel 68) configured to be applied to an individual image frame (e.g., frame 70) of the set of video data to generate a plurality of image tokens 58, “a first learned token embedding”; and one or more image kernels (e.g., image kernel 68) configured to be applied to an individual image frame (e.g., frame 70) of the set of video data to generate a plurality of image token 74, “a second learned token embedding”; where the machine-learned model 52  can include a single visual transformer 64, configured to jointly process both the plurality of video tokens (e.g., video tokens 58) and the plurality of image tokens (e.g., image tokens 74) to generate the model output 66 including object detection; and from Fig. 3, Par. 0092, image processing task may be object detection, where the image processing  output identifies one or more regions, in the one or more images and, for each region, a likelihood that regions depicts an object of interest, “first and second set of salient objects”, [i.e., the first token, “tube patches”, comprising a first learned token embedding, “video tokens 58”, configured to represent a first set of salient objects of the first image, “one or more regions depicting an object of interest”; and the second token, “2D patches”, comprising a second learned token embedding, “image tokens 74”, configured to represent a second set of salient objects of the first image, “one or more regions depicting an object of interest”]). Further, from Fig. 1, Par. 
0021, due to the flexibility of the transformer model in accepting tokens of various types and/or lengths, the input to the transformer can also include tokens generated from other modalities of data beyond the image and/or video tokens, [i.e., wherein the third token comprising a third learned token embedding configured to represent a third set of salient objects of the first image, “implicit by accepting tokens of various types and/or lengths from other modalities of data beyond the image and/or video tokens”]).
Kirillov and Piergiovanni are combinable because they are both concerned with object detection. Therefore, it would have been obvious to a person of ordinary skill in the art to modify Kirillov to use the machine-learned model 52, as taught by Piergiovanni, in order to learn a plurality of image tokens for generating a model output (Piergiovanni, Par. 0037, 0043), for identifying one or more regions in the one or more images, and for determining, for each region, a likelihood that the region depicts an object of interest (Piergiovanni, Par. 0092).

The combined teaching of Kirillov and Piergiovanni as a whole does not expressly disclose persisting the first segmentation mask and the second segmentation mask.
Taruttis discloses persisting the first segmentation mask and the second segmentation mask, (Par. 0054, the segmenter 436 writes a segmentation masks repository 439, which stores a set of segmentation masks corresponding to the fluorescence images, [i.e., persisting the first segmentation mask and the second segmentation mask, “implicit by storing set of segmentation masks”]).
Kirillov, Piergiovanni, and Taruttis are combinable because they are all concerned with object detection. Therefore, it would have been obvious to a person of ordinary skill in the art to modify the combined teaching of Kirillov and Piergiovanni to use the repository, as taught by Taruttis, in order to store the set of segmentation masks (Taruttis, Par. 0054).

Claims 4, 11, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Kirillov, Piergiovanni, and Taruttis, as applied to claims 1, 8, and 15 above; and further in view of Sanchez et al. (US-PGPUB 20240303984); and further in view of Lin et al. (US-PGPUB 20240168617).

In regards to claim 4, the combined teaching of Kirillov, Piergiovanni, and Taruttis as a whole discloses the limitations of claim 1.

The combined teaching of Kirillov, Piergiovanni, and Taruttis as a whole does not expressly disclose wherein generating the first segmentation mask comprises: performing an encoding process to extract a plurality of multi-scale features from the first image; aggregating the plurality of multi-scale features with a feature pyramid network; and modulating the aggregated plurality of multi-scale features with a mask decoder using the first token to select the first segmentation mask from a plurality of output segmentation masks.
Sanchez et al discloses performing an encoding process to extract a plurality of multi-scale features from the first image; aggregating the plurality of multi-scale features with a feature pyramid network, (see at least: Fig. 13, and Par. 0150, the encoder 1302 can be a shared backbone that acts as a joint representation learning module with an aim to learn multi-level feature representations, “i.e., extract a plurality of multi-scale features from the first image”, ..., layers of the multi-scale feature fusion 1304 can fuse the features across scales while maintaining their number and resolution, “i.e., aggregating the plurality of multi-scale features with a feature pyramid network, using the multi-scale feature fusion 1304”).
Kirillov, Piergiovanni, Taruttis, and Sanchez are combinable because they are all concerned with processing images. Therefore, it would have been obvious to a person of ordinary skill in the art to modify the combined teaching of Kirillov, Piergiovanni, and Taruttis to use the encoder 1302 and the multi-scale feature fusion 1304, as taught by Sanchez, in order to fuse the features across scales (Sanchez, Par. 0150).

The combined teaching of Kirillov, Piergiovanni, Taruttis, and Sanchez as a whole does not expressly disclose modulating the aggregated plurality of multi-scale features with a mask decoder using the first token to select the first segmentation mask from a plurality of output segmentation masks.
Lin et al discloses modulating the aggregated plurality of multi-scale features with a mask decoder using the first token to select the first segmentation mask from a plurality of output segmentation masks, (Par. 0131-0133, the scene-based image editing system 106 identifies the replacement region 404 by generating an object mask via a segmentation neural network, where the scene-based image editing system 106 utilizes the cascaded modulation inpainting neural network 420 to generate replacement pixels for the replacement region 404, where an object mask defines a replacement region using a segmentation or a mask indicating, overlaying, covering, or outlining pixels to be removed or replaced within a digital image; and from Par. 0143, the cascaded modulation inpainting neural network 502 is an encoder-decoder network, [i.e., modulating the aggregated plurality of multi-scale features with a mask decoder using the first token, “implicit by the cascaded modulation inpainting neural network, which is an encoder-decoder network”, to select the first segmentation mask from a plurality of output segmentation masks, “implicit by generating replacement pixels for the replacement region”]).
Kirillov, Piergiovanni, Taruttis, Sanchez, and Lin et al are combinable because they are all concerned with processing images. Therefore, it would have been obvious to a person of ordinary skill in the art to modify the combined teaching of Kirillov, Piergiovanni, Taruttis, and Sanchez to use the cascaded modulation inpainting neural network 420, as taught by Lin et al, in order to generate an object mask using a segmentation (Lin et al, Par. 0133).
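To make the "aggregating the plurality of multi-scale features with a feature pyramid network" limitation of claim 4 more concrete, the sketch below shows the generic top-down merge used by feature pyramid networks: each coarser map is upsampled and added to the next finer map. This is an assumption-laden illustration, not the applicant's implementation and not Sanchez's multi-scale feature fusion 1304. It uses single-channel maps, omits the lateral 1x1 convolutions a real FPN would apply, and all function names are hypothetical. The token-based mask selection step is not shown.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list of numbers."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down_merge(levels):
    """Merge feature maps ordered fine to coarse, each half the size of the previous.

    Starts from the coarsest map and repeatedly adds its upsampled version
    to the next finer map. Returns the merged maps in the same fine-to-coarse order.
    """
    merged = [levels[-1]]
    for fmap in reversed(levels[:-1]):
        merged.append(add_maps(fmap, upsample2x(merged[-1])))
    return merged[::-1]


# Toy multi-scale features: 4x4, 2x2, and 1x1 maps.
c1 = [[1] * 4 for _ in range(4)]
c2 = [[10, 20], [30, 40]]
c3 = [[100]]

p1, p2, p3 = top_down_merge([c1, c2, c3])
```

After the merge, the finest map `p1` carries contributions from every scale, which is the sense in which the features are aggregated.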
 
Regarding claim 11, claim 11 recites substantially similar limitations as set forth in claim 4. As such, claim 11 is rejected for at least similar rationale.

Regarding claim 17, claim 17 recites substantially similar limitations as set forth in claim 4. As such, claim 17 is rejected for at least similar rationale.

Allowable Subject Matter
Claims 3, 5-7, 10, 12-14, 16, and 18-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

With respect to claim 3, the prior art of record, alone or in reasonable combination, does not teach or suggest, the following underlined limitation(s), (in consideration of the claim as a whole):
“train the pluralistic object detector to learn the first token, the second token, and the third token, such that: when receiving the first token and a second image including three or more objects, the pluralistic object detector generates a first output segmentation mask corresponding to the first object of the second image but not to the second object of the second image or the third object of the second image; when receiving the second token and the second image, the pluralistic object detector generates a second output segmentation mask corresponding to at least the second object of the second image but not to the third object of the second image; and when receiving the third token and the second image, the pluralistic object detector generates an third output segmentation mask corresponding to at least the third object of the second image”.

The relevant prior art of record, Kirillov et al. ("Segment Anything", 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3992-4003), discloses a system comprising:
a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor, (Page 3996, right-hand-column, first paragraph, “CPU”, implicitly includes a processor and memory), to:
receive a first image including a first object of the first image and a second object of the first image, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image. Further Figures 2-4 show an image including multiple objects, “first object and second object of the first image”);
 based on at least receiving the first image and a first token, generate, with a pluralistic object detector, a first segmentation mask, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image, “i.e., based on at least receiving the first image”; and Figures 3-4, show an image divided into plurality of patches, where each patch implicitly represents a token, “implicitly receiving a first token among plurality patches”. Further, Page 3996, left-hand-column, section 3, Segment Anything Model (SAM), “i.e., pluralistic object detector”, has three components, illustrated in Fig. 4: an image encoder, a flexible prompt encoder, and a fast mask decoder, where the mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask, “i.e., implicitly generating plurality of segmentation masks by SAM (see Fig. 3), comprising at least a first segmentation mask”, [i.e., generate, with a pluralistic object detector, “SAM”, a first segmentation mask, “implicit by generating plurality of segmentation masks by SAM (see Fig. 3)”, based on at least receiving the first image and a first token, “implicit by image input to image decoder shown in Fig. 4, and a plurality of patches (tokens) generated in Figs. 2-3”]);
based on at least receiving the first image and a second token, generate, with the pluralistic object detector, a second segmentation mask, wherein the first segmentation mask corresponds to the first object of the first image, and the second segmentation mask corresponds to at least the second object of the first image, (see at least: Fig. 4, the input image to image encoder corresponds to the receiving a first image, “i.e., based on at least receiving the first image”; and Figures 3-4, show an image divided into plurality of patches, where each patch implicitly represents a token, “implicitly receiving another token, (at least a second token), among the plurality of patches”. Further, Page 3996, left-hand-column, section 3, the mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask, “i.e., implicitly generating plurality of segmentation masks comprising at least a first segmentation mask, and second segmentation token, among plurality of patches, corresponding to different objects, as shown in Figs. 2-3”, [i.e., based on at least receiving the first image and a second token, “implicit by image input to image decoder shown in Fig. 4, and a plurality of patches (tokens) generated in Figs. 2-3”, generate, with the pluralistic object detector, “using (SAM)”, a second segmentation mask, “implicit by generating plurality of segmentation masks comprising at least second segmentation mask”, wherein the first segmentation mask corresponds to the first object of the first image, and the second segmentation mask corresponds to at least the second object of the first image, “implicit by generating plurality of patches, (tokens), corresponding to different objects, as shown in Figs. 2-3”]). 
Kirillov further discloses training the pluralistic object detector to learn the first token, the second token, and the third token, (see at least: Page 3996, right-hand-column, first paragraph, under “losses and training”, which implicitly discloses learning a plurality of tokens); but fails to teach or suggest, either alone or in combination with the other cited references, the above limitations (as combined with the other claimed limitations).

A further prior art reference of record, Yu et al. (US-PGPUB 20240112088), discloses training the pluralistic object detector to learn the first token, the second token, and the third token, (see at least: Par. 0033, processing, by the computing system, a plurality of image patches from the training image with a machine-learned image encoder to generate a plurality of image tokens in a latent space, wherein the plurality of image tokens correspond to the plurality of image patches, [i.e., implicitly learn the first token, the second token, and the third token, based on the training image with a machine-learned image encoder]); but fails to teach or suggest, either alone or in combination with the other cited references, the above limitations (as combined with the other claimed limitations).

With respect to claim 5, the prior art of record, alone or in reasonable combination, does not teach or suggest, the following underlined limitation(s), (in consideration of the claim as a whole):
“based on at least receiving the first image and the first segmentation mask, assign, by a quality predictor, a first quality score to the first segmentation mask without using ground truth for the first image; persist the first quality score associated with the first image and associated with the first segmentation mask; based on at least receiving the first image and the second segmentation mask, assign, by the quality predictor, a second quality score to the second segmentation mask without using the ground truth for the first image; and persist the second quality score associated with the first image and associated with the second segmentation mask”

The discussion of the prior art of record, Kirillov et al. ("Segmenting anything", 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3992-4003), stated above with respect to claim 3, applies also to claim 5. Further, Kirillov discloses assigning a quality score to the first segmentation mask, (see at least: section 6.1, supplementing the standard mIoU metric (i.e., the mean of all IoUs between predicted and ground truth masks) with a human study in which annotators rate mask quality from 1 (nonsense) to 10 (pixel-perfect)). However, while disclosing assigning a quality score to the first segmentation mask, Kirillov fails to teach or suggest, either alone or in combination with the other cited references, assigning, by a quality predictor, a first quality score to the first segmentation mask, and a second quality score to the second segmentation mask, without using ground truth for the first image.
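The distinction drawn above can be illustrated with a short sketch of the mIoU metric as the Office action defines it (the mean of all IoUs between predicted and ground truth masks). This is a minimal illustration, not Kirillov's implementation and not the claimed quality predictor: the function names `iou` and `mean_iou`, the nested-list mask format, and the convention of returning 1.0 for two empty masks are all assumptions made for the example. The point it shows is that every IoU call needs a ground truth mask as input, whereas the claimed quality score is assigned without one.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks (nested lists of 0/1).

    Hypothetical helper for illustration only.
    """
    pairs = [(p, t) for row_p, row_t in zip(pred, truth)
             for p, t in zip(row_p, row_t)]
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(preds, truths):
    """Mean of per-mask IoUs; requires one ground truth mask per prediction."""
    scores = [iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


pred_mask = [[1, 1, 0],
             [0, 1, 0]]
truth_mask = [[1, 0, 0],
              [0, 1, 1]]

print(iou(pred_mask, truth_mask))                              # 2 shared pixels / 4 in union
print(mean_iou([pred_mask, truth_mask], [truth_mask, truth_mask]))
```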
Regarding claims 6-7, claims 6-7 are in condition for allowance, based at least on their dependency from claim 5.

Regarding claim 10, claim 10 recites substantially similar limitations as set forth in claim 3. As such, claim 10 is in condition for allowance, for at least similar reasons, as stated above.

Regarding claim 12, claim 12 recites substantially similar limitations as set forth in claim 5. As such, claim 12 is in condition for allowance, for at least similar reasons, as stated above.

Regarding claims 13-14, claims 13-14 are in condition for allowance, based at least on their dependency from claim 12.

Regarding claim 16, claim 16 recites substantially similar limitations as set forth in claim 3. As such, claim 16 is in condition for allowance, for at least similar reasons, as stated above.

Regarding claim 18, claim 18 recites substantially similar limitations as set forth in claim 5. As such, claim 18 is in condition for allowance, for at least similar reasons, as stated above.

Regarding claims 19-20, claims 19-20 are in condition for allowance, based at least on their dependency from claim 18.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
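The reply-period arithmetic in the preceding paragraph can be sketched as a small date calculation. This is a rough sketch under stated assumptions, not a docketing tool: the helper names `add_months` and `reply_deadlines` are hypothetical, the May 1, 2026 mailing date is taken from the prosecution timeline on this page, and the code ignores weekend and holiday rollover as well as the advisory-action exception described above.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Same day-of-month `months` later, clamped to the last day of that month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def reply_deadlines(mailing_date: date) -> dict:
    """Key dates counted from the mailing date of a final action (hypothetical helper)."""
    return {
        # A first reply filed by this date triggers the advisory-action rule.
        "two_month_first_reply": add_months(mailing_date, 2),
        # End of the THREE-MONTH shortened statutory period.
        "shortened_statutory_period": add_months(mailing_date, 3),
        # The SIX-MONTH limit that the period can never exceed.
        "absolute_maximum": add_months(mailing_date, 6),
    }


# Assumed mailing date, taken from the timeline entry for the final rejection.
deadlines = reply_deadlines(date(2026, 5, 1))
for label, day in deadlines.items():
    print(f"{label}: {day.isoformat()}")
```

The mailing date is passed in as a parameter because the signature block and the timeline show different dates; the date on the mailed action itself controls.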

Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMARA ABDI whose telephone number is (571)272-0273. The examiner can normally be reached 9:00am-5:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le can be reached at (571) 272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/AMARA ABDI/
Primary Examiner, Art Unit 2668
04/29/2026
Prosecution Timeline

Feb 02, 2024: Application Filed
Jan 14, 2026: Non-Final Rejection mailed (§103)
Feb 12, 2026: Applicant Interview (Telephonic)
Feb 12, 2026: Examiner Interview Summary
Feb 26, 2026: Response Filed
May 01, 2026: Final Rejection mailed (§103)
May 15, 2026: Examiner Interview Summary
May 15, 2026: Applicant Interview (Telephonic)

Precedent Cases

Applications granted by this same examiner with similar technology

18/727,828 (Patent 12629954): OPTICALLY VARIABLE SURFACE PATTERN AND METHOD FOR PRODUCING SAME. 1y 10m to grant; granted May 19, 2026.
18/327,153 (Patent 12614270): MACHINE LEARNING SYSTEM FOR NATURAL GAS LEAK DETECTION. 2y 11m to grant; granted Apr 28, 2026.
18/565,652 (Patent 12608831): IMAGE PROCESSING METHOD AND APPARATUS, AND ELECTRONIC DEVICE AND STORAGE MEDIUM. 2y 4m to grant; granted Apr 21, 2026.
18/569,692 (Patent 12602822): METHOD DEVICE AND STORAGE MEDIUM FOR BACK-END OPTIMIZATION OF SIMULTANEOUS LOCALIZATION AND MAPPING. 2y 4m to grant; granted Apr 14, 2026.
18/962,814 (Patent 12597252): METHOD OF TRACKING OBJECTS. 1y 4m to grant; granted Apr 07, 2026.
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 83%
With Interview (-7.0%): 76%
Median Time to Grant: 2y 6m (~3m remaining)
PTA Risk: Moderate
Based on 826 resolved cases by this examiner. Grant probability derived from career allowance rate.
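
The projection figures can be reproduced with simple arithmetic, shown here as a minimal sketch. It uses the 684 granted / 826 resolved counts reported in the examiner profile on this page and assumes the interview lift is applied additively in percentage points to the rounded allowance rate; the page does not state how the 76% figure is computed, so that step is an assumption that merely matches the displayed numbers. The variable names are illustrative.

```python
# Counts reported in the examiner profile on this page.
granted = 684
resolved = 826

# Career allowance rate, rounded to a whole percent as displayed.
grant_probability_pct = round(100 * granted / resolved)

# Assumption: the -7.0 interview lift is added in percentage points.
interview_lift_pct = -7.0
with_interview_pct = grant_probability_pct + interview_lift_pct

print(f"Grant probability: {grant_probability_pct}%")
print(f"With interview:    {with_interview_pct:.0f}%")
```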