DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 01/07/2025 and 02/02/2024 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: the “means for” limitations in claims 28-30.
The corresponding structure for the “means for” limitations appears to be disclosed in ¶[0148] of the specification as a computer processor.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 8, 10-11, 17, 19-20, 26 and 28-29 are rejected under 35 U.S.C. 103 as being unpatentable over Kirillov et al. ("Segment Anything") in view of Schulter et al. (US PG-Pub US 20230281999 A1).
Regarding Claim 1, Kirillov teaches a processing system (Abstract, “We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation”) comprising: one or more memories comprising processor-executable instructions and one or more processors configured to execute the processor-executable instructions and cause the processing system (Page 5, 3. Segment Anything Model, Efficiency, Paragraph 1, “The overall model design is largely motivated by efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU”; this section of the prior art discloses that the Segment Anything Model is run on a CPU, which is a processor, and the model would inherently have to be stored in memory to execute its functions) to: access an input image (Figure 4 shows an image being input into an image encoder); process the input image using an image encoder to generate an image embedding tensor (Figure 4 shows a heavyweight image encoder that outputs an image embedding. Page 16, Image Encoder, Paragraph 1, “In general, the image encoder can be any network that outputs a C×H×W image embedding…. The image encoder’s output is a 16× downscaled embedding of the input image”; this passage discloses that the output of the image encoder is in vector/tensor form); process the image embedding tensor using a mask decoder machine learning model to generate a set of mask embedding tensors (Page 5, 3. Segment Anything Model, Mask decoder, Paragraph 4, “The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask… Our modified decoder block uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice-versa) to update all embeddings. After running two blocks, we upsample the image embedding and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.”; as disclosed in this section of the prior art, a mask decoder is used to process the image embedding in order to generate mask foreground probabilities at each location of the image. Figure 4 also shows that the SAM model can output multiple valid masks and associated confidence scores); process a textual input using a text encoder to generate a text embedding tensor (Page 11, 7.5. Zero-Shot Text-to-Mask, Paragraph 1, “we prompt SAM with the extracted CLIP image embeddings as its first interaction. The key observation here is that because CLIP’s image embeddings are trained to align with its text embeddings, we can train with image embeddings, but use text embeddings for inference. That is, at inference time we run text through CLIP’s text encoder and then give the resulting text embedding as a prompt to SAM”; as disclosed in this section of the prior art, a text encoder is used to generate a text embedding that serves as a prompt input to the SAM model).
Kirillov does not explicitly teach generate a set of augmented masks based on aggregating the text embedding tensor with the set of mask embedding tensors.
Schulter teaches generate a set of augmented masks based on aggregating the text embedding tensor with the set of mask embedding tensors. ([0049] In the text branch, a transformer neural network 410 processes the input textual labels. Block 412 embeds the output of the transformer into the same latent space as the images. Block 414 then compares the image embeddings and the textual embeddings for a given image, for example using a cosine similarity between the respective vectors. Block 416 can then determine probabilities for each prediction and each input text.
[0050] Referring now to FIG. 5, detail on performing segmentation using the trained model 314 is shown. During testing, the trained model is used to identify labels for objects in a new image. Given an image, the model predicts masks and, for each mask, an embedding vector. ¶[0049]-¶[0050] disclose comparing image embeddings and textual embeddings in an image and using the collected data to generate mask embeddings used to label an object in the image.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov with Schulter in order to generate a set of masks by comparing the text embedding and mask embedding. One skilled in the art would have been motivated to modify Kirillov in this manner in order to perform image analysis and, more particularly, to perform panoptic segmentation of images. (Schulter, ¶[0002])
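For illustration only, the following is a minimal sketch of the pipeline mapped in the rejection above (image encoder, mask decoder, text encoder, and text/mask aggregation). All function names, shapes, and the random "encoders" are hypothetical placeholders; this is not the claimed invention or the actual implementation of Kirillov or Schulter.

```python
# Hedged sketch: placeholder encoders stand in for SAM/CLIP-style models.
import numpy as np

rng = np.random.default_rng(0)

def image_encoder(image):
    # Placeholder for a ViT-style image encoder producing a C x H x W
    # image embedding (16x downscaled, per Kirillov's description).
    c, h, w = 256, image.shape[0] // 16, image.shape[1] // 16
    return rng.standard_normal((c, h, w))

def mask_decoder(image_embedding, num_masks=3):
    # Placeholder mask decoder: maps the image embedding to one
    # embedding vector per candidate mask.
    c = image_embedding.shape[0]
    return rng.standard_normal((num_masks, c))

def text_encoder(text):
    # Placeholder for a CLIP-style text encoder.
    return rng.standard_normal(256)

image = np.zeros((512, 512, 3))
img_emb = image_encoder(image)       # image embedding tensor
mask_embs = mask_decoder(img_emb)    # set of mask embedding tensors
txt_emb = text_encoder("a dog")      # text embedding tensor

# Aggregation in the style of Schulter's comparison step: score each
# mask embedding against the text embedding.
scores = mask_embs @ txt_emb         # one score per candidate mask
```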
Regarding Claim 2, the combination of Kirillov and Schulter teaches the processing system of claim 1, where Schulter further teaches wherein, to generate the set of augmented masks, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to: generate a set of masks based on the set of mask embedding tensors and the image embedding tensor (¶[0050], “Given an image, the model predicts masks and, for each mask, an embedding vector. Thus, block 502 embeds a new input image using the trained model. A query may be provided with one or more textual query terms in block 506. To estimate a semantic category for each mask, the image's embedding vectors may be compared to a text embedding vector of the query in block 506. The text embedding vectors are the output of the text encoder, and the input to the text encoder are the class names of the query.”; ¶[0050] discloses generating masks by comparing image embeddings and text embeddings to estimate a semantic category for the mask); generate a set of predictions based on aggregating the text embedding tensor with the set of mask embedding tensors and associate the set of predictions with the set of masks (¶[0051], “the image embeddings may be compared to these new text embeddings to generate probabilities in block 508, for example showing the likelihood that each mask matches a bus or a taxi.”; ¶[0051] discloses generating a probability/likelihood value that the generated mask matches the text embedding of the query term.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov with Schulter in order to generate a set of masks by comparing the text embedding and mask embedding. One skilled in the art would have been motivated to modify Kirillov in this manner in order to perform image analysis and, more particularly, to perform panoptic segmentation of images. (Schulter, ¶[0002])
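Continuing the illustration above, Schulter's ¶[0050]-¶[0051] comparison of mask embeddings to query text embeddings to produce per-mask probabilities might be sketched as follows; the cosine/softmax formulation and the temperature value are assumptions, not Schulter's disclosed implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def predict_categories(mask_embs, text_embs, temperature=0.07):
    # Compare each mask embedding against each query text embedding and
    # convert the similarities into per-mask probabilities (e.g., the
    # likelihood that each mask matches "bus" vs. "taxi").
    sims = np.array([[cosine_similarity(m, t) for t in text_embs]
                     for m in mask_embs])
    logits = sims / temperature
    return np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
```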
Regarding Claim 8, the combination of Kirillov and Schulter teaches the processing system of claim 1, and Schulter further teaches wherein the mask decoder was trained to perform panoptic segmentation (¶[0002], “The present invention relates to image analysis and, more particularly, to panoptic segmentation of images.”) based at least in part on: a training image (¶[0019], “To obtain a robust model for panoptic image segmentation, the model may be trained using multiple datasets with different forms of annotation.”), a set of training mask embedding tensors for the training image, and a set of category text embedding tensors (¶[0035], “When training the segmentation model, an image I is sampled from one of K datasets D_k, where k∈{1, . . . , K}, which also defines the labelspace ℒ_k. Text embeddings e_c^T are computed for c∈ℒ_k—the embeddings may be predetermined if prompts are not learned. The predefined embedding space of the vision-and-language model handles the different label spaces, where different categories having different names corresponding to respective locations in the embedding space. Different names of the same semantic category, such as “sofa” and “couch,” will be located close to one another due to semantic training on large-scale natural image-text pairs.” ¶[0036], “Image augmentation may be performed, after which the model makes N predictions, each with a mask m_i and corresponding object embedding . . . Together with the ground truth and the computed matching . . .”; ¶[0035] discloses that the dataset used when training the segmentation model contains semantic embeddings of categories to which an object could belong, and ¶[0036] discloses determining a mask and corresponding object embedding in the training image.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov with Schulter in order to perform panoptic segmentation using a training image and text and mask embeddings. One skilled in the art would have been motivated to modify Kirillov in this manner in order to generate labels for each pixel of the input image to output a segmented image. (Schulter, ¶[0020])
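For illustration, training a mask decoder against a dataset's label space of category text embeddings (as in Schulter's ¶[0035]-¶[0036]) could be sketched as a cross-entropy step over mask-to-text similarity scores. The shapes and the plain cross-entropy loss are assumptions; Schulter's actual losses and matching procedure are not reproduced here.

```python
import numpy as np

def training_step(mask_embs, category_text_embs, gt_labels):
    # mask_embs:          (N, C) predicted mask embeddings for a training image
    # category_text_embs: (K, C) text embeddings e_c^T of the label space L_k
    # gt_labels:          (N,)   ground-truth category index per matched mask
    logits = mask_embs @ category_text_embs.T        # (N, K) class scores
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(gt_labels)), gt_labels].mean()
```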
Regarding Claim 10, claim 10 is considered a method claim substantially corresponding to claim 1. Please see the discussion of claim 1 above for a discussion of similar limitations. Furthermore, Kirillov teaches a processor-implemented method (Page 5, 3. Segment Anything Model, Efficiency, Paragraph 1, “The overall model design is largely motivated by efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU”; this section of the prior art discloses that the Segment Anything Model is run on a CPU, which is a processor, and the model would inherently have to be stored in memory to execute its functions).
Regarding Claim 11, the claim recites features similar to those of claim 2 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 17, the claim recites features similar to those of claim 8 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 19, claim 19 is considered a computer-readable medium claim substantially corresponding to claim 1. Please see the discussion of claim 1 above for a discussion of similar limitations. Furthermore, Kirillov teaches one or more non-transitory computer-readable media comprising processor-executable instructions that, when executed by one or more processors of a processing system, cause the processing system (Page 5, 3. Segment Anything Model, Efficiency, Paragraph 1, “The overall model design is largely motivated by efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU”; this section of the prior art discloses that the Segment Anything Model is run on a CPU, which is a processor, and the model would inherently have to be stored in memory to execute its functions).
Regarding Claim 20, the claim recites features similar to those of claim 2 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 26, the claim recites features similar to those of claim 8 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 28, claim 28 is considered a system claim substantially corresponding to claim 1. Please see the discussion of claim 1 above for a discussion of similar limitations. Furthermore, the claim is being interpreted under 35 U.S.C. 112(f), and the corresponding structure is a machine learning algorithm system described in ¶[0077] of the specification. Kirillov teaches the processing system (Page 5, 3. Segment Anything Model, Efficiency, Paragraph 1, “The overall model design is largely motivated by efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU”; this section of the prior art discloses that the Segment Anything Model is run on a CPU, which is a processor, and the model would inherently have to be stored in memory to execute its functions). The corresponding structure performing the “means for” function is seen in Figure 4, which shows the Segment Anything Model, a machine learning model that processes the input image with a machine learning algorithm.
Regarding Claim 29, the claim recites features similar to those of claim 2 and is rejected in the same manner, with the same art and reasoning applying.
Claims 3-4, 7, 12-13, 16, 21-22 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Kirillov et al. ("Segment Anything") in view of Schulter et al. (US PG-Pub US 20230281999 A1), and further in view of Xu et al. (US PG-Pub US 20240153093 A1).
Regarding Claim 3, while the combination of Kirillov and Schulter teaches the processing system of claim 2, it does not explicitly teach wherein the one or more processors are configured to further execute the processor-executable instructions and cause the processing system to generate, based on a first augmented mask of the set of augmented masks, a classification indicating that a region of the image corresponding to the first augmented mask depicts an entity corresponding to the textual input.
Xu teaches wherein the one or more processors are configured to further execute the processor-executable instructions and cause the processing system to generate, based on a first augmented mask of the set of augmented masks, a classification indicating that a region of the image corresponding to the first augmented mask depicts an entity corresponding to the textual input (¶[0033], “The panoptic label unit 230 may be trained to predict the category label from an open vocabulary that is assigned to each predicted mask using either category label supervision or image caption supervision. The panoptic label unit 230 classifies each mask region into N potential object classes (if the class labels are known during training) or into a binary foreground/background classification label if the object class is not known a priori (e.g., when training with image caption labels).”; ¶[0033] discloses that a category label is predicted for each mask region in the image based on whether the label was known during training or the object class was not known.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov and Schulter with Xu in order to make a classification indicating that the mask region corresponds to a text input. One skilled in the art would have been motivated to modify Kirillov and Schulter in this manner in order to combine text representations of category labels with the object masks and their semantic visual representations to produce panoptic segmentation data. (Xu, Abstract)
Regarding Claim 4, the combination of Kirillov, Schulter and Xu teaches the processing system of claim 3, where Schulter further teaches wherein: the mask decoder was trained based at least in part on a set of category text embedding tensors (¶[0020], “To this end, public segmentation datasets may be employed. These datasets may include annotations that indicate a semantic category for each pixel in their constituent images as well as instance IDs for categories that are countable. Thus, a given pixel may be in the category of “car,” and may further have an instance ID identifying which of the cars in the image it relates to”; ¶[0020] discloses that the training dataset includes annotations for a semantic category to which the object could belong), and the set of category text embedding tensors does not include the text embedding (¶[0039], “Training the model with missing annotations can bias the predicted probabilities p_i toward the “no-object” category, particularly for unseen categories, because they may appear in the training images without annotation and may therefore be assigned the “no-object” class”; ¶[0039] discloses that the model can be trained with images that do not have a label or an unseen object class.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov and Xu with Schulter in order to train the system using known and unknown text embeddings. One skilled in the art would have been motivated to modify Kirillov and Xu in this manner in order to use larger datasets to improve segmentation accuracy and improve robustness and generalization. (Schulter, ¶[0019])
Regarding Claim 7, while the combination of Kirillov and Schulter teaches the processing system of claim 1, it does not explicitly teach wherein, to aggregate the text embedding tensor with the set of mask embedding tensors, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to compute a dot product between the text embedding tensor and the set of mask embedding tensors.
Xu teaches wherein, to aggregate the text embedding tensor with the set of mask embedding tensors, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to compute a dot product between the text embedding tensor and the set of mask embedding tensors. (Xu, [0037] “FIG. 2B illustrates a conceptual diagram of a panoptic label unit 230 shown in FIG. 2A suitable for use in implementing some embodiments of the present disclosure. In an embodiment, the panoptic label unit 230 performs a dot product between the mask embeddings (included in the segmentation data) and the text embeddings to categorize the mask embeddings and compute the generated panoptic segmentation.”, ¶[0037] discloses computing a dot product between masked embedding and text embeddings to compute a generated panoptic segmentation. )
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov and Schulter with Xu in order to compute a dot product between the text embedding and mask embedding. One skilled in the art would have been motivated to modify Kirillov and Schulter in this manner in order to categorize the mask embeddings and compute the generated panoptic segmentation. (Xu, ¶[0037])
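By way of illustration, the dot product recited in Xu's ¶[0037] between the mask embeddings and the text embeddings can be sketched as a single matrix product; the shapes and names below are assumptions, not Xu's actual implementation.

```python
import numpy as np

mask_embs = np.random.randn(5, 256)   # (num_masks, C) mask embedding tensors
text_embs = np.random.randn(8, 256)   # (num_categories, C) text embedding tensors

# Dot product between every mask embedding and every text embedding;
# the resulting score matrix categorizes each mask against each category.
scores = np.einsum("mc,kc->mk", mask_embs, text_embs)
categories = scores.argmax(axis=1)    # most likely category per mask
```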
Regarding Claim 12, the claim recites features similar to those of claim 3 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 13, the claim recites features similar to those of claim 4 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 16, the claim recites features similar to those of claim 7 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 21, the claim recites features similar to those of claim 3 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 22, the claim recites features similar to those of claim 4 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 25, the claim recites features similar to those of claim 7 and is rejected in the same manner, with the same art and reasoning applying.
Claims 5-6, 14-15, 23-24 and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Kirillov et al. ("Segment Anything") in view of Schulter et al. (US PG-Pub US 20230281999 A1), and further in view of Lan et al. (US PG-Pub US 20240169545 A1).
Regarding Claim 5, the combination of Kirillov and Schulter teaches the processing system of claim 1, where Schulter teaches wherein the one or more processors are configured to further execute the processor-executable instructions and cause the processing system to: sample a first set of points within a first mask corresponding to a first augmented mask of the set of augmented masks (¶[0031], “Given a predicted mask, a subset of points is first sampled via importance sampling, based on the prediction uncertainty. The ground truth for the same points is gathered and two losses are determined”);
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov with Schulter in order to sample points in an image. One skilled in the art would have been motivated to modify Kirillov in this manner in order to generate labels for each pixel of the input image to output a segmented image. (Schulter, ¶[0020])
However, Kirillov and Schulter do not explicitly teach generate a first updated mask based on processing the image embedding tensor and the first set of points using a second decoder.
Lan teaches generate a first updated mask based on processing the image embedding tensor and the first set of points using a second decoder. (¶[0037], “the neural network comprises a vision transformer. In an embodiment, the neural network comprises an image encoder configured to process the expanded image region according to a first portion of the parameters to compute an encoded image region and an attention-based decoder that processes the encoded image region according to a second portion of the parameters to produce the binary mask.”, ¶[0037] discloses using a decoder to process the encoded image region to generate a first mask pertaining to the object.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov and Schulter with Lan in order to use a second decoder to generate a mask of the image. One skilled in the art would have been motivated to modify Kirillov and Schulter in this manner in order to use the object masks to train an instance segmentation model to localize and segment objects with pixel-level accuracy to produce labeled segmentation masks. (Lan, ¶[0003])
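For illustration only, the combination mapped above (importance sampling of points by prediction uncertainty per Schulter's ¶[0031], followed by refinement through a second decoder per Lan) might be sketched as follows; the uncertainty weighting and the decoder interface are hypothetical placeholders.

```python
import numpy as np

def sample_uncertain_points(mask_probs, k=16):
    # Importance-sample k pixel coordinates from a predicted mask,
    # weighting by prediction uncertainty (probability near 0.5).
    uncertainty = 1.0 - np.abs(mask_probs - 0.5) * 2.0  # high near the boundary
    weights = uncertainty.ravel() + 1e-8
    weights /= weights.sum()
    idx = np.random.choice(weights.size, size=k, replace=False, p=weights)
    return np.stack(np.unravel_index(idx, mask_probs.shape), axis=1)  # (k, 2)

def refine_mask(image_embedding, points, second_decoder):
    # Feed the image embedding and the sampled point prompts to a second
    # decoder to produce an updated mask (placeholder decoder interface).
    return second_decoder(image_embedding, points)
```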
Regarding Claim 6, the combination of Kirillov, Schulter and Lan teaches the processing system of claim 5, where Schulter teaches wherein the one or more processors are configured to further execute the processor-executable instructions and cause the processing system to: sample a second set of points within the first updated mask (¶[0031], “Given a predicted mask, a subset of points is first sampled via importance sampling, based on the prediction uncertainty. The ground truth for the same points is gathered and two losses are determined: the binary cross-entropy loss and the dice loss.”; ¶[0031] discloses sampling a first set of points in a predicted mask and gathering the ground truth for the same points, a second sampling process, to determine a loss when training the network.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov and Lan with Schulter in order to sample points in an image. One skilled in the art would have been motivated to modify Kirillov and Lan in this manner in order to generate labels for each pixel of the input image to output a segmented image. (Schulter, ¶[0020])
Lan further teaches generate a second updated mask based on processing the image embedding tensor and the second set of points using the second decoder. (¶[0038], “At step 230, parameters are applied to the expanded image region by a neural network to predict the binary mask corresponding to the object. In an embodiment, during training of the neural network, the parameters are updated based on a multiple instance learning loss the neural network further comprises a second image encoder that processes the expanded image region to compute a second encoded image region and a second attention-based decoder that processes the second encoded image region to produce a second binary mask for the expanded image region.”; ¶[0038] discloses using a second decoder to generate a second mask based on the first mask data of the object.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov and Schulter with Lan in order to use a second decoder to generate a mask of the image. One skilled in the art would have been motivated to modify Kirillov and Schulter in this manner in order to use the object masks to train an instance segmentation model to localize and segment objects with pixel-level accuracy to produce labeled segmentation masks. (Lan, ¶[0003])
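Continuing the sketch above, a second refinement round (claim 6's second set of points and second updated mask) would simply repeat the sample-and-decode step on the first updated mask; the dummy decoder and all shapes below are stand-ins only.

```python
import numpy as np

def second_decoder(image_embedding, points):
    # Dummy stand-in: a real decoder would attend over the embedding
    # and the point prompts; here we just return a mask-shaped map.
    return np.random.rand(64, 64)

img_emb = np.random.randn(256, 64, 64)      # image embedding tensor
mask_1 = np.random.rand(64, 64)             # first updated mask (probabilities)

points_2 = sample_uncertain_points(mask_1)  # defined in the sketch above
mask_2 = second_decoder(img_emb, points_2)  # second updated mask
```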
Regarding Claim 14, the claim recites features similar to those of claim 5 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 15, the claim recites features similar to those of claim 6 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 23, the claim recites features similar to those of claim 5 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 24, the claim recites features similar to those of claim 6 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 30, the claim recites features similar to those of claim 5 and is rejected in the same manner, with the same art and reasoning applying.
Claims 9, 18 and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Kirillov et al. ("Segment Anything") in view of Schulter et al. (US PG-Pub US 20230281999 A1), and further in view of Ding et al. ("ZegFormer: Decoupling Zero-Shot Semantic Segmentation").
Regarding Claim 9, while the combination of Kirillov and Schulter teaches the processing system of claim 1, it does not explicitly teach wherein the image encoder comprises: a component having parameters that were not trained based on semantic meaning of input images; and one or more auxiliary parameters that were trained based on embeddings generated by a semantic image encoder.
Ding teaches wherein the image encoder comprises: a component having parameters that were not trained based on semantic meaning of input images; and one or more auxiliary parameters that were trained based on embeddings generated by a semantic image encoder (Figure 2: “The text embeddings are generated by putting the class names into a prompt template and then feeding them to a text encoder of a vision-language model. During training, only the seen classes are used to train the segment-level classification head. During inference, both the text embeddings of seen and unseen classes are used for segment-level classification. We can obtain two segment-level classification scores with semantic segment embeddings and image embeddings. Finally, we fuse these two classification scores as our final class prediction of segments.”; as disclosed in the caption of Figure 2 and shown in Figure 2, the image encoder is trained using the text embeddings of only the seen classes, and during inference the text embeddings of both seen and unseen classes are used to perform segment-level classification.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Kirillov and Schulter with Ding in order to train the image encoder with classes that were known and unknown during training. One skilled in the art would have been motivated to modify Kirillov and Schulter in this manner in order to incorporate zero-shot semantic segmentation (ZS3) to segment novel categories that have not been seen during training. (Ding, Page 1, Abstract)
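For illustration, the score fusion described in Ding's Figure 2 caption (fusing the classification scores from semantic segment embeddings with those from image embeddings) might be sketched as a weighted geometric mean with different weights for seen and unseen classes. The weighting scheme and the lam value are assumptions, not Ding's exact formulation.

```python
import numpy as np

def fuse_scores(seg_scores, img_scores, seen_mask, lam=0.6):
    # seg_scores, img_scores: (N, K) per-segment class probabilities from the
    # segment-embedding head and the image-embedding (vision-language) head.
    # seen_mask: (K,) boolean, True for classes seen during training.
    fused = np.where(
        seen_mask,
        seg_scores ** lam * img_scores ** (1 - lam),  # lean on the trained head
        seg_scores ** (1 - lam) * img_scores ** lam,  # lean on the VL head
    )
    return fused / fused.sum(axis=1, keepdims=True)   # renormalize per segment
```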
Regarding Claim 18, the claim recites features similar to those of claim 9 and is rejected in the same manner, with the same art and reasoning applying.
Regarding Claim 27, the claim recites features similar to those of claim 9 and is rejected in the same manner, with the same art and reasoning applying.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HAN D HOANG whose telephone number is (571)272-4344. The examiner can normally be reached Monday-Friday 8-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, JOHN M VILLECCO can be reached at 571-272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/HAN HOANG/Examiner, Art Unit 2661